Listbox - unicode items

Started by Bumblebee, April 07, 2021, 12:10:21 PM

Previous topic - Next topic

José Roca

#15
I mean that it doesn't contain Cyrillic characters, but their UTF-8 bytes misread as ANSI:
СеÃ'â,¬ÃÂ³ÃÂµÃŒÂÃÂ¹ СеÃ'â,¬ÃÂ³ÃÂµÃŒÂÃÂµÃÂ²ÃÂ¸Ã'‡ ПÃ'â,¬ÃÂ¾ÃÂºÃÂ¾ÃŒÂÃ'„Ã'Å'ев

You won't see Серге́й Серге́евич Проко́фьев in an ANSI or UTF-8 file, but you can see them in a Unicode file.
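The garbage above is exactly what UTF-8 bytes look like when they are decoded with a Windows ANSI code page. A minimal Python sketch of the effect (illustrative only; the thread's code is FreeBasic):

```python
# UTF-8 bytes misread as ANSI (Windows code page 1252) produce mojibake:
# every Cyrillic letter becomes two Latin-looking characters.
name = "Сергей"
utf8_bytes = name.encode("utf-8")        # 12 bytes, 2 per Cyrillic letter
mojibake = utf8_bytes.decode("cp1252")   # "Ð¡ÐµÑ€Ð³ÐµÐ¹"
# The damage is reversible as long as no byte was lost along the way:
assert mojibake.encode("cp1252").decode("utf-8") == name
```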

> When I was working on my file backup program, CWSTR was able to handle every filename, including those that had korean characters.

Yes, CWSTR can handle them, but the FB intrinsic file functions, such as OPEN, can't use Unicode for the file paths. Try to open a file using OPEN with a path containing Korean characters...

My CTextStream and CFileStream classes can use Unicode characters in the path.

Windows doesn't use UTF-8 natively. Therefore, if you want to use UTF-8 in files, you will have to convert the text from UTF-8 to Unicode after reading it, and from Unicode back to UTF-8 before writing it.
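That read-convert / convert-write round trip, sketched in Python for illustration (in FB you would do the equivalent through CWSTR):

```python
import os
import tempfile

# Write Unicode text to a file as UTF-8, then read the raw bytes back
# and convert them to Unicode again - the round trip described above.
text = "Серге́й Проко́фьев"
path = os.path.join(tempfile.mkdtemp(), "test.txt")

with open(path, "wb") as f:
    f.write(text.encode("utf-8"))        # Unicode -> UTF-8 before writing

with open(path, "rb") as f:
    restored = f.read().decode("utf-8")  # UTF-8 -> Unicode after reading

assert restored == text
```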

philbar

#16
> An UTF-8 file can't contain cyrillic characters, it has to be UTF-16.

Not exactly. UTF-8 and UTF-16 both encode the entire Unicode character set, which, as I remember it, consists of about 150,000 characters. UTF-8 does it with a stream of bytes, and UTF-16 does it with a stream of 16-bit words. Neither bytes nor words can represent 150,000 things in a single unit, so they both have to resort to variable-length characters. Obviously, a 16-bit word can represent more things in one unit (something less than 65536, because there are forbidden codes), so Cyrillic, Greek and more will fit into a single UTF-16 "letter." A single UTF-8 byte can only hold the ASCII characters, barely adequate for English, so the rest of Unicode requires 2, 3, or 4 bytes - a pain if you need to know how many letters are in a given string.
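The unit counts are easy to check directly; a small Python table (illustrative, not the thread's FB code):

```python
# How many units the same character takes in each encoding:
# UTF-8 counts bytes; UTF-16 counts 16-bit words (a surrogate pair = 2).
for ch in "AЯ中😀":
    utf8_units = len(ch.encode("utf-8"))
    utf16_units = len(ch.encode("utf-16-le")) // 2
    print(f"{ch}: {utf8_units} UTF-8 byte(s), {utf16_units} UTF-16 word(s)")
```

ASCII takes 1 UTF-8 byte, Cyrillic 2, CJK 3, and emoji 4 (and two UTF-16 words, since emoji sit outside the Basic Multilingual Plane).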

What is true is that your Windows computer may not display a UTF-8 string correctly, because Windows is historically set up to display only a certain code page: an interpretation of bytes as characters. A Western code page can't display Cyrillic, and a Cyrillic code page can't display Greek. Windows 10 now has a code page called UTF-8 (code page 65001). I switched my computer to the UTF-8 code page, and now I can display Latin, Cyrillic, Greek, even Arabic on the same screen (it's iffy which direction the Arabic will go). That even applies to command prompt screens.

A FreeBasic program can read a UTF-8 file into a string, but it won't know the difference between characters and bytes, so you're on your own there. To display the string correctly, you'll have to change your computer to the UTF-8 code page (there are side effects). Windows is beginning to accept UTF-8 as a real Unicode encoding, but it has a way to go yet.

end rant.

José, you changed your post while I was still writing. Sorry.

The part about the UTF-8 code page still stands.

José Roca

#17
> Windows is beginning to accept UTF-8 as a real Unicode encoding, but it has a way to go yet.

Well, it is doing some tweaking, like letting you set the active code page to UTF-8 in a manifest file. It probably also made some tweaks to the UCRT (the Universal C Runtime), but I don't have precise information.

Most of the Windows API "A" functions can be used with UTF-8 strings if you set the active code page to UTF-8, because what they do is convert the ANSI string to Unicode using MultiByteToWideChar with CP_ACP (the system default Windows ANSI code page) as the first parameter and then call the "W" function.

But you would have to change all the FB functions that deal with strings, and also the automatic string conversions. Doable, but a lot of work.

String manipulation will be slower than with ANSI or UTF-16. Knowing that each character is a fixed-size byte or word, instead of a variable number of bytes, is a blessing.
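For instance, just counting the characters in a UTF-8 string means scanning every byte; a Python sketch of the idea (a hypothetical helper, not an Afx routine):

```python
def utf8_len(data: bytes) -> int:
    """Count code points in UTF-8 data by skipping continuation
    bytes, which always have the bit pattern 10xxxxxx."""
    return sum(1 for b in data if (b & 0xC0) != 0x80)

raw = "Сергей".encode("utf-8")
assert len(raw) == 12      # 12 bytes on disk...
assert utf8_len(raw) == 6  # ...but only 6 characters
```

With ANSI or UTF-16 the length is just the buffer size (in bytes or words); with UTF-8 you pay a full pass over the data.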


Bumblebee

#18
In this application, I don't have to worry about unicode characters in the file path.

On one level, the data is there, just not in a human-readable format.
This won't be a problem, as long as the conversions can be done in code.

I attached the test file I'm using. It has a Cyrillic entry, and a Korean/English entry typically found on YouTube.
Notepad says it's UTF-8 and it displays properly.
If I force Notepad to load it as ANSI, then strange characters show up.

Terminal window dump:
Серге́й Серге́евич В'роко́фьев
СеÃ'â,¬ÃÂ³ÃÂµÃŒÂÃÂ¹ СеÃ'â,¬ÃÂ³ÃÂµÃŒÂÃÂµÃÂ²ÃÂ¸Ã'‡ ПÃ'â,¬ÃÂ¾ÃÂºÃÂ¾ÃŒÂÃ'„Ã'Å'ев
Серге́й Серге́евич Проко́фьев
Серге́й Серге́евич Проко́фьев
Treasure - Ω▒╕Ω╖╕δú╣ ∞èñ∞£ä∞╣ÿδ▓á리(Switchberry) ∞Â¥╕∞▓£ Ω│╡∞ù░ chulwoo H ∞ºü∞║á(Fancam)
Treasure - 걸그룹 스ìÅ"„치베리(Switchberry) 인ì²Å" ê³µìâ€"° chulwoo H 직캠(Fancam)
Treasure - 걸그룹 스위치베리(Switchberry) 인천 공연 chulwoo H 직캠(Fancam)
Treasure - 걸그룹 스위치베리(Switchberry) 인천 공연 chulwoo H 직캠(Fancam)
Failed pollinator.

Paul Squires

How about using Jose's CTextStream class to read the file and use CWSTR's built-in UTF-8 conversion:


#define UNICODE
#include once "Afx\AfxFile.inc"
using Afx

dim pStream as CTextStream
dim wst as CWSTR

if pStream.Open("test.txt") = S_OK then
   do until pStream.EOS
      wst.Utf8 = pStream.ReadLine   ' convert the UTF-8 line to Unicode
      AfxMsg(wst)
   loop
   pStream.Close
end if



Paul Squires
PlanetSquires Software

Bumblebee

#20
I could use it, although it appears the native FB I/O operations are sufficient.
CWSTR variables can be written with both Print# and Write#.

What does #define unicode do?

#include once "Afx\AfxFile.inc"
#include once "Afx\Cwstr.inc"
dim a as string
open "test.txt" for input as #1
open "output.txt" for output encoding "utf-16" as #2
do until eof(1)
  line input #1, a
  AfxMsg(cwstr(a, cp_utf8))
  print #2, cwstr(a, cp_utf8) 'write line
  write #2, cwstr(a, cp_utf8) 'write as variable
loop
close
sleep
end

Paul Squires

Just as an aside, I changed all of my file I/O some time ago to completely avoid PB's built-in intrinsic functions for the very reason that Jose explained earlier: they do not play well with Unicode. Rather than try to juggle in my brain all the different scenarios where I could intersperse FB native code and WinFBX code, I decided to go 100% with WinFBX. Once you use Jose's classes a few times for binary and text streams, it becomes extremely easy. The classes also work perfectly with his CWSTR and CBSTR string types, and it is easy to use them to write variables as you have shown in your example above (i.e. Print/Write).

This post is not meant to persuade you to abandon your approach; it is just my experience that, in the longer run, the WinFBX approach has been easier and more consistent for me.

Bumblebee

I couldn't have written a backup program without CWSTR variables and arrays. My previous backup program, written in VB6, did not manage files with non-Latin characters properly. It was always on my "to be resolved" list.

I don't do enough programming to encounter enough problems to learn to replace old ways of doing things with the new.