PlanetSquires Forums

Support Forums => PlanetSquires Software => Topic started by: Bumblebee on April 07, 2021, 12:10:21 PM

Title: Listbox - unicode items
Post by: Bumblebee on April 07, 2021, 12:10:21 PM
I may not have noticed this before, but does the listbox support unicode characters?
Using the latest version 2.2.0

e.g. Élégance is displayed as Élégance
Title: Re: Listbox - unicode items
Post by: Paul Squires on April 07, 2021, 12:37:55 PM
Hi, yes the Listbox code does support unicode. The following correctly displays your code:


   for i as long = 0 to 5
      frmMain.List1.Items.Add( "Élégance" & i )
   next

Title: Re: Listbox - unicode items
Post by: Bumblebee on April 07, 2021, 12:49:13 PM
Might be a problem with the string variables I'm using.
Do I need to use CWSTR to preserve unicode characters?
Title: Re: Listbox - unicode items
Post by: Paul Squires on April 07, 2021, 12:59:40 PM
CWSTR will work. WSTRING will work as well.
I doubt that STRING will work reliably.
Title: Re: Listbox - unicode items
Post by: Bumblebee on April 07, 2021, 01:37:50 PM
I use regular string variables to write the file that contains unicode characters. It seems to work with no issues.
Nor did I specify utf-x encoding when writing the file.

Replacing string with cwstr causes an invalid data type in the input statement.
Title: Re: Listbox - unicode items
Post by: José Roca on April 07, 2021, 02:38:21 PM
What do you understand by unicode characters? Accented characters like á, é, í, ó, ú aren't unicode.

FB ansi variables can't hold unicode characters like Серге́й Серге́евич Проко́фьев.
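
The distinction can be made concrete with a small sketch. It is written in Python rather than FreeBASIC, purely as an illustration, and it assumes the ANSI code page is Windows-1252, which the thread does not state. It shows that é has a single-byte ANSI form while a Cyrillic letter has none.

```python
# Assumption: the "ANSI" code page is Windows-1252 (Western European).
ANSI = "cp1252"

# An accented Latin letter has a single-byte form in this code page.
e_acute = "é".encode(ANSI)
print(e_acute)

# A Cyrillic letter has no Windows-1252 byte, so encoding fails.
try:
    "П".encode(ANSI)
    cyrillic_fits = True
except UnicodeEncodeError:
    cyrillic_fits = False
print(cyrillic_fits)
```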
Title: Re: Listbox - unicode items
Post by: Bumblebee on April 07, 2021, 02:45:20 PM
I'm having an issue reading accented characters, as per the example.
According to Notepad, the file I'm parsing is in UTF-8 Unix (LF)

When I change variable type to CWSTR, an error occurs with the Line Input# statement.
It wants a string variable.
Title: Re: Listbox - unicode items
Post by: José Roca on April 07, 2021, 02:58:02 PM
If the file is utf-8, you have to read it using ansi strings and then convert it to ansi or unicode, since a listbox (or any other Windows control) doesn't understand utf-8.
Title: Re: Listbox - unicode items
Post by: Bumblebee on April 07, 2021, 04:25:56 PM
WStr() function will not convert strings read with Line Input#
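It does work with literal strings.

#include "Afx\CWStr.inc"
dim a as string
dim b as CWSTR
a = "Élégance"        ' literal held in an ANSI string
b = "Élégance2"       ' literal held in a CWSTR
print a
print b
print wstr(a)
b = wstr(a)
print b
print "- read utf8 file -"
a = ""
open "test.txt" for input as #1
line input #1, a      ' a now holds the raw UTF-8 bytes of the first line
print a
b = a                 ' converted as ANSI, not decoded as UTF-8
?b
?wstr(a)
?wstr(b)
close
sleep
end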
Title: Re: Listbox - unicode items
Post by: José Roca on April 07, 2021, 05:27:37 PM
> WStr() function will not convert strings read with Line Input#

Of course not. WSTR will convert ASCII to UNICODE, not UTF-8 to UNICODE.

You can either use the Windows API function MultiByteToWideChar or...

DIM cws AS CWSTR = CWSTR(<UTF-8 string>, CP_UTF8)

e.g.:

DIM s AS STRING = "José Roca"   ' My name in UTF-8
DIM cws AS CWSTR = CWSTR(s, CP_UTF8)
print cws
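
The difference between the two conversions can also be sketched outside FreeBASIC. The Python snippet below is only an illustration: it assumes the ANSI code page is Windows-1252, and it uses `bytes.decode` as a stand-in for what an ANSI-to-unicode conversion and a UTF-8-to-unicode conversion each do with the same bytes.

```python
# Raw bytes, as they would arrive from a UTF-8 file read into a byte string.
raw = "Élégance".encode("utf-8")

# Treating the bytes as ANSI (assumed Windows-1252) widens each byte
# separately, so every two-byte UTF-8 sequence becomes two characters.
as_ansi = raw.decode("cp1252")

# Decoding the same bytes as UTF-8 recovers the original text.
as_utf8 = raw.decode("utf-8")

print(as_ansi)
print(as_utf8)
```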
Title: Re: Listbox - unicode items
Post by: Bumblebee on April 07, 2021, 11:05:33 PM
I don't understand any of this, but it works. Thanks!
Title: Re: Listbox - unicode items
Post by: José Roca on April 08, 2021, 05:26:52 AM
You should learn the differences between ASCII, ANSI, UTF-8 and UNICODE.
Title: Re: Listbox - unicode items
Post by: Bumblebee on April 08, 2021, 11:17:24 PM

Dim z as String
~
Line Input #1, z
z = CWSTR(z,CP_UTF8)


You said that accented characters are not unicode, so this works.
When z is written to a text file, the file is ANSI.

What could I do if there were Cyrillic characters in the UTF-8 source file?
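
What happens when converted text is stored back into a byte string can be sketched as follows. This is Python, not FreeBASIC, and it rests on two assumptions: the ANSI code page is Windows-1252, and unrepresentable characters are replaced by a question mark (mimicked here with `errors="replace"`). The helper name `utf8_to_ansi` is hypothetical.

```python
ANSI = "cp1252"  # assumed ANSI code page

def utf8_to_ansi(raw: bytes) -> bytes:
    """Decode UTF-8 bytes, then narrow the text to the ANSI code page (lossy)."""
    return raw.decode("utf-8").encode(ANSI, errors="replace")

latin = utf8_to_ansi("Élégance".encode("utf-8"))
cyrillic = utf8_to_ansi("Прокофьев".encode("utf-8"))

print(latin)     # accented letters survive as single bytes
print(cyrillic)  # every Cyrillic letter is replaced
```

Under those assumptions, the accented text survives the trip and the Cyrillic text does not, which suggests keeping such text in a wide string type rather than assigning it back to a String.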
Title: Re: Listbox - unicode items
Post by: José Roca on April 09, 2021, 07:17:00 AM
An UTF-8 file can't contain cyrillic characters, it has to be UTF-16.

If you need to read files with unicode content, you can't use Line Input. As I said, FB support for unicode is weak. You can use my class CTextStream: https://github.com/JoseRoca/WinFBX/blob/master/docs/File%20Management/CTextStream%20Class.md
Title: Re: Listbox - unicode items
Post by: Bumblebee on April 09, 2021, 11:00:29 AM
I took the Cyrillic characters you posted and saved them in Notepad. It says the file is UTF-8.
Is this dependent on my language settings?

#include "Afx\Cwstr.inc"
dim a as string
dim b as CWSTR
'cyrillic characters encoded as utf-8
open "test.txt" for input as #1
line input #1,a
close
print a
b = a
print b
print cwstr(a,cp_utf8)
print cwstr(b,cp_utf8)
sleep
end


Output in terminal window:

Серге́й Серге́евич В'роко́фьев
СеÃ'â,¬ÃÂ³ÃÂµÃŒÂÃÂ¹ СеÃ'â,¬ÃÂ³ÃÂµÃŒÂÃÂµÃÂ²ÃÂ¸Ã'‡ ПÃ'â,¬ÃÂ¾ÃÂºÃÂ¾ÃŒÂÃ'„Ã'Å'ев
Серге́й Серге́евич Проко́фьев
Серге́й Серге́евич Проко́фьев

So maybe it would work.
When I was working on my file backup program, CWSTR was able to handle every filename, including those that had korean characters. However, I wasn't writing/reading those names from text files. Everything was done within CWSTR arrays.
Title: Re: Listbox - unicode items
Post by: José Roca on April 09, 2021, 11:22:55 AM
I mean that it doesn't contain cyrillic characters, but an ansi encoding of them:
СеÃ'â,¬ÃÂ³ÃÂµÃŒÂÃÂ¹ СеÃ'â,¬ÃÂ³ÃÂµÃŒÂÃÂµÃÂ²ÃÂ¸Ã'‡ ПÃ'â,¬ÃÂ¾ÃÂºÃÂ¾ÃŒÂÃ'„Ã'Å'ев

You won't see Серге́й Серге́евич Проко́фьев in an ansi or UTF-8 file, but you can see them in an unicode file.

> When I was working on my file backup program, CWSTR was able to handle every filename, including those that had korean characters.

Yes, CWSTR can handle them, but the FB intrinsic file functions, such as OPEN, can't use unicode for the file paths. Try to open a file using OPEN with a path containing Korean characters...

My CTextStream and CFileStream classes can use unicode characters in the path.

Windows doesn't use UTF-8 natively. Therefore, if you want to use UTF-8 in files, you will have to convert the read UTF-8 text to unicode and convert from unicode to UTF-8 before writing to the file.
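
That read, convert, write cycle might look like the sketch below. It is Python and purely illustrative; the helper names `load_utf8` and `save_utf8` are hypothetical, and a temporary file stands in for the real one.

```python
import os
import tempfile

def load_utf8(path: str) -> str:
    """Read raw bytes and decode them from UTF-8 into text."""
    with open(path, "rb") as f:
        return f.read().decode("utf-8")

def save_utf8(path: str, text: str) -> None:
    """Encode text to UTF-8 and write the raw bytes."""
    with open(path, "wb") as f:
        f.write(text.encode("utf-8"))

path = os.path.join(tempfile.mkdtemp(), "test.txt")
save_utf8(path, "Прокофьев\n")
text = load_utf8(path)          # work with decoded text in memory
save_utf8(path, text.upper())   # encode again before writing
result = load_utf8(path)
print(result)
```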
Title: Re: Listbox - unicode items
Post by: philbar on April 09, 2021, 11:41:16 AM
> An UTF-8 file can't contain cyrillic characters, it has to be UTF-16.

Not exactly. UTF-8 and UTF-16 both encode the entire Unicode character set, which as I remember it, consists of about 150,000 characters. UTF-8 does it with a stream of bytes, and UTF-16 does it with a stream of 16-bit words. Neither bytes nor words can represent 150,000 things in a single unit, so they both have to resort to variable length characters. Obviously, a 16-bit word can represent more things in one unit (something less than 65536 because there are forbidden codes), so Cyrillic, Greek and more will fit into a single UTF-16 "letter." A single UTF-8 byte can only hold the ASCII characters, barely adequate for English, so the rest of the Unicode requires 2, 3, or 4 bytes - a pain if you need to know how many letters are in a given string.
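
The unit counts described above can be checked with a short sketch. It is in Python for convenience, and the helper name `code_units` is hypothetical.

```python
def code_units(ch):
    """Return (UTF-8 bytes, UTF-16 16-bit words) needed for one character."""
    utf8_bytes = len(ch.encode("utf-8"))
    utf16_words = len(ch.encode("utf-16-le")) // 2
    return utf8_bytes, utf16_words

samples = {
    "A": "ASCII",
    "é": "accented Latin",
    "П": "Cyrillic",
    "걸": "Korean",
    "\U0001F600": "emoji outside the 16-bit range",
}
for ch, label in samples.items():
    print(label, code_units(ch))
```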

What is true is that your Windows computer may not display a UTF-8 string correctly, because Windows is historically set up to display only a certain code page: an interpretation of bytes as characters. A Western code page can't display Cyrillic, and a Cyrillic code page can't display Greek. Windows 10 now has a code page called UTF-8. I switched my computer to the UTF-8 code page, and now I can display Latin, Cyrillic, Greek, even Arabic on the same screen (It's iffy which direction the Arabic will go). That even applies to command prompt screens.

A FreeBasic program can read a UTF-8 file into a string, but it won't know the difference between characters and bytes, so you're on your own there. To display the string correctly, you'll have to change your computer to the UTF-8 code page (there are side effects). Windows is beginning to accept UTF-8 as a real Unicode encoding, but it has a way to go yet.

end rant.

José, you changed your post while I was still writing. Sorry.

The part about the UTF-8 code page still stands.
Title: Re: Listbox - unicode items
Post by: José Roca on April 09, 2021, 02:07:59 PM
> Windows is beginning to accept UTF-8 as a real Unicode encoding, but it has a way to go yet.

Well, it is doing some tweaking, like letting you set the active code page to UTF-8 in a manifest file. It probably also made some changes to the UCRT (Universal C Runtime), but I don't have precise information.

Most of the Windows API "A" functions can be used with UTF-8 strings if you set the active code page to UTF-8, because what they do is convert ansi strings to unicode using MultiByteToWideChar with CP_ACP (the system default Windows ANSI code page) as the first parameter and then call the "W" function.

But you will have to change all the FB functions that deal with strings, and also automatic string conversions. Doable, but a lot of work.

String manipulation will be slower than using ansi or unicode. Knowing that you have to deal with bytes or words instead of a variable number of bytes is a blessing.
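
A small sketch of why the variable width matters, written in Python as an illustration only: byte counts and character counts diverge in UTF-8, and cutting a byte string at an arbitrary position can split a character.

```python
text = "Élégance"
raw = text.encode("utf-8")

# Character count and byte count differ once non-ASCII letters appear.
print(len(text), len(raw))

# Cutting the UTF-8 byte string after one byte splits the two-byte "É".
first_byte = raw[:1]
try:
    first_byte.decode("utf-8")
    split_ok = True
except UnicodeDecodeError:
    split_ok = False

# The equivalent cut in UTF-16 lands on a character boundary here,
# because every letter in this word fits in one 16-bit word.
wide = text.encode("utf-16-le")
first_wide = wide[:2].decode("utf-16-le")
print(split_ok, first_wide)
```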

Title: Re: Listbox - unicode items
Post by: Bumblebee on April 09, 2021, 08:22:51 PM
In this application, I don't have to worry about unicode characters in the file path.

On one level, the data is there, just not in human readable format.
This won't be a problem, as long as the conversions can be done by code.

I attached the test file I'm using. It has a Cyrillic entry, and a Korean/English entry typically found on YouTube.
Notepad says it's UTF-8 and it displays properly.
If I force Notepad to load it as ANSI, then strange characters show up.

Terminal window dump:
Серге́й Серге́евич В'роко́фьев
СеÃ'â,¬ÃÂ³ÃÂµÃŒÂÃÂ¹ СеÃ'â,¬ÃÂ³ÃÂµÃŒÂÃÂµÃÂ²ÃÂ¸Ã'‡ ПÃ'â,¬ÃÂ¾ÃÂºÃÂ¾ÃŒÂÃ'„Ã'Å'ев
Серге́й Серге́евич Проко́фьев
Серге́й Серге́евич Проко́фьев
Treasure - Ω▒╕Ω╖╕δú╣ ∞èñ∞£ä∞╣ÿδ▓á리(Switchberry) ∞Â¥╕∞▓£ Ω│╡∞ù░ chulwoo H ∞ºü∞║á(Fancam)
Treasure - 걸그룹 스ìÅ"„치베리(Switchberry) 인ì²Å" ê³µìâ€"° chulwoo H 직캠(Fancam)
Treasure - 걸그룹 스위치베리(Switchberry) 인천 공연 chulwoo H 직캠(Fancam)
Treasure - 걸그룹 스위치베리(Switchberry) 인천 공연 chulwoo H 직캠(Fancam)
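
The first Korean line of the dump looks like UTF-8 bytes shown through the console's OEM code page. The Python sketch below assumes that code page is 437, which is a guess and not stated in the post, and reproduces the first two syllables.

```python
# Assumption: the console uses OEM code page 437.
korean = "걸그"
raw = korean.encode("utf-8")

# Viewing the UTF-8 bytes through code page 437 gives box-drawing debris.
as_console = raw.decode("cp437")
print(as_console)

# Decoding the same bytes as UTF-8 restores the text.
restored = raw.decode("utf-8")
print(restored)
```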
Title: Re: Listbox - unicode items
Post by: Paul Squires on April 10, 2021, 12:18:20 AM
How about using José's CTextStream class to read the file and CWSTR's built-in UTF-8 conversion:


#define UNICODE
#include once "Afx\AfxFile.inc"
using Afx

dim pStream AS CTextStream
dim as CWSTR wst

if pStream.Open( "test.txt" ) = S_OK then

   do until pStream.EOS
      wst.utf8 = pStream.ReadLine
      AfxMsg( wst )
   loop
   pStream.Close

end if



Title: Re: Listbox - unicode items
Post by: Bumblebee on April 10, 2021, 06:22:25 AM
I could use it, although it appears the original FB IO operations are sufficient.
When it comes to Print# and Write#, CWSTR variables can be written.

What does #define unicode do?

#include once "Afx\AfxFile.inc"
#include "Afx\Cwstr.inc"
dim a as string
open "test.txt" for input as #1
open "output.txt" for output Encoding "utf-16" as #2
do until eof(1)
  line input #1,a
  AfxMsg (cwstr(a,cp_utf8))
  print #2, cwstr(a,cp_utf8) 'write line
  write #2, cwstr(a,cp_utf8) 'write as variable
loop
close
sleep
end
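
For comparison, the same read-as-UTF-8, write-as-UTF-16 loop can be sketched in Python. This is only an illustration: the file names mirror the ones in the post, but they live in a temporary directory and the input content is made up.

```python
import os
import tempfile

workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "test.txt")
dst = os.path.join(workdir, "output.txt")

# Create a small UTF-8 input file with Unix line endings (stand-in for test.txt).
with open(src, "w", encoding="utf-8", newline="\n") as f:
    f.write("Прокофьев\n걸그룹\n")

# Read each line as UTF-8 and write it back out as UTF-16.
with open(src, "r", encoding="utf-8") as fin, \
     open(dst, "w", encoding="utf-16", newline="\n") as fout:
    for line in fin:
        fout.write(line)

with open(dst, "rb") as f:
    out_bytes = f.read()
print(out_bytes[:2])  # byte order mark written by the utf-16 codec
```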
Title: Re: Listbox - unicode items
Post by: Paul Squires on April 10, 2021, 12:01:51 PM
Just as an aside, I changed all of my file I/O some time ago to completely avoid FB's built-in intrinsic functions for the very reason that José explained earlier: they do not play well with unicode. Rather than try to juggle in my brain all the different scenarios where I could intersperse FB native code and WinFBX code, I decided to go 100% with WinFBX. Once you use José's classes a few times for binary and text streams, it becomes extremely easy. The classes also work perfectly with his CWSTR and CBSTR string types. It is easy to use the classes to write the variables as you have shown in your example above (i.e. Print/Write). This post is not meant to persuade you to abandon your approach; it is just my experience that in the longer run, using the WinFBX approach has been easier and more consistent for me.
Title: Re: Listbox - unicode items
Post by: Bumblebee on April 11, 2021, 05:28:55 AM
I couldn't have written a backup program without CWSTR variables and arrays. My previous backup program written in VB6 did not manage files with non-latin characters properly. It was always on my "to be resolved" list.

I don't do enough programming to encounter enough problems to learn to replace old ways of doing things, with the new.