I may not have noticed this before, but does the listbox support unicode characters?
Using the latest version 2.2.0
e.g. Élégance is displayed as Ã‰lÃ©gance
Hi, yes the Listbox code does support unicode. The following correctly displays your string:
for i as long = 0 to 5
frmMain.List1.Items.Add( "Élégance" & i )
next
Might be a problem with the string variables I'm using.
Do I need to use CWSTR to preserve unicode characters?
CWSTR will work. WSTRING will work as well.
I doubt that STRING will work reliably.
I use regular string variables to write the file that contains unicode characters. It seems to work with no issues.
Nor did I specify utf-x encoding when writing the file.
Replacing string with cwstr causes an invalid data type in the input statement.
What do you understand by unicode characters? Accented characters like á, é, í, ó, ú aren't unicode.
FB ansi variables can't hold unicode characters like Серге́й Серге́евич Проко́фьев.
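The point can be sketched in a few lines of Python (not FB, just to show the byte-level behavior): a single-byte ansi code page such as Windows-1252 has slots for accented Latin letters, but none for Cyrillic.

```python
# Sketch (Python, not FB): why an ansi (single-byte) string can hold
# accented Latin characters but not Cyrillic.
accented = "Élégance"
cyrillic = "Сергей"

# Windows-1252 (a typical Western ansi code page) covers accented Latin:
assert accented.encode("cp1252") == b"\xc9l\xe9gance"

# ...but it has no slots for Cyrillic, so the encode fails:
try:
    cyrillic.encode("cp1252")
except UnicodeEncodeError:
    print("Cyrillic does not fit in a Western ansi code page")
```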
I'm having an issue reading accented characters, as per the example.
According to Notepad, the file I'm parsing is in UTF-8 Unix (LF)
When I change variable type to CWSTR, an error occurs with the Line Input# statement.
It wants a string variable.
If the file is utf-8, you have to read it using ansi strings and then convert it from utf-8 to ansi or unicode, since a listbox (or any other Windows control) doesn't understand utf-8.
WStr() function will not convert strings read with Line Input#
It does work with literal strings.
#include "Afx\CWStr.inc"
dim a as string
dim b as CWSTR
a = "Élégance"
b = "Élégance2"
print a
print b
print wstr(a)
b = wstr(a)
print b
print "- read utf8 file -"
a=""
open "test.txt" for input as #1
line input #1,a
print a
b = a
?b
?wstr(a)
?wstr(b)
close
sleep
end
> WStr() function will not convert strings read with Line Input#
Of course not. WSTR will convert ASCII to UNICODE, not UTF-8 to UNICODE.
You can either use the Windows API function MultiByteToWideChar or...
DIM cws AS CWSTR = CWSTR(<UTF-8 string>, CP_UTF8)
e.g.:
DIM s AS STRING = "José Roca" ' My name in UTF-8
DIM cws AS CWSTR = CWSTR(s, CP_UTF8)
print cws
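Conceptually, the CWSTR constructor with CP_UTF8 is just MultiByteToWideChar(CP_UTF8, ...): reinterpret the raw bytes as UTF-8 and widen them to UTF-16. A Python sketch (not FB) of the two interpretations of the same bytes:

```python
# Sketch: what CWSTR(s, CP_UTF8) does conceptually with the raw bytes.
raw = b"Jos\xc3\xa9 Roca"          # "José Roca" as UTF-8 bytes

wide = raw.decode("utf-8")         # the CP_UTF8 conversion: correct
assert wide == "José Roca"

# Treating the same bytes as Western ansi (a plain ascii/ansi widening)
# turns the two-byte é into two separate characters:
wrong = raw.decode("cp1252")
assert wrong == "JosÃ© Roca"
```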
I don't understand any of this, but it works. Thanks!
You should learn the differences between ASCII, ANSI, UTF-8 and UNICODE.
Dim z as String
~
Line Input #1, z
z = CWSTR(z,CP_UTF8)
You said that accented characters are not unicode, so this works.
When z is written to a text file, the file is ANSI.
What could I do if there were Cyrillic characters in the UTF-8 source file?
An UTF-8 file can't contain cyrillic characters, it has to be UTF-16.
If you need to read files with unicode content, you can't use Line Input. As I said, FB support for unicode is weak. You can use my class CTextStream: https://github.com/JoseRoca/WinFBX/blob/master/docs/File%20Management/CTextStream%20Class.md
I took the Cyrillic characters you posted and saved them in Notepad. It says the file is UTF-8.
Is this dependent on my language settings?
#include "Afx\Cwstr.inc"
dim a as string
dim b as CWSTR
'cyrillic characters encoded as utf-8
open "test.txt" for input as #1
line input #1,a
close
print a
b = a
print b
print cwstr(a,cp_utf8)
print cwstr(b,cp_utf8)
sleep
end
Output in terminal window:
╨Ã╨╡╤Ç╨│╨╡╠ü╨╣ ╨Ã╨╡╤Ç╨│╨╡╠ü╨╡╨▓╨╕╤ç ╨Æ'╤Ç╨╛╨║╨╛╠ü╤ä╤î╨╡╨▓
áõÃ'â,¬Ã³ÃµÌÂù áõÃ'â,¬Ã³ÃµÌÂõòøÃ'‡ ßÃ'â,¬Ã¾ÃºÃ¾ÌÂÃ'„Ã'Å'õò
Серге́й Серге́евич Проко́фьев
Серге́й Серге́евич Проко́фьев
So maybe it would work.
When I was working on my file backup program, CWSTR was able to handle every filename, including those that had korean characters. However, I wasn't writing/reading those names from text files. Everything was done within CWSTR arrays.
I mean that it doesn't contain cyrillic characters, but an ansi encoding of them:
áõÃ'â,¬Ã³ÃµÌÂù áõÃ'â,¬Ã³ÃµÌÂõòøÃ'‡ ßÃ'â,¬Ã¾ÃºÃ¾ÌÂÃ'„Ã'Å'õò
You won't see Серге́й Серге́евич Проко́фьев in an ansi or UTF-8 file, but you can see them in an unicode file.
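The garbage in the dumps above can be reproduced mechanically: take the UTF-8 bytes of the Cyrillic text and push each byte through a single-byte code page. A Python sketch (cp437, the classic OEM console code page, is an assumption; the exact junk depends on which code page the console uses):

```python
# Sketch (Python, not FB): where the box-drawing garbage comes from.
# Each UTF-8 byte of the Cyrillic text, shown through a single-byte OEM
# code page, becomes one junk character.
text = "Сергей"
utf8_bytes = text.encode("utf-8")       # 2 bytes per Cyrillic letter

garbled = utf8_bytes.decode("cp437")    # how an OEM console shows them
assert garbled != text
assert len(garbled) == 2 * len(text)    # one junk character per byte

# Decoding the same bytes as UTF-8 recovers the original:
assert utf8_bytes.decode("utf-8") == text
```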
> When I was working on my file backup program, CWSTR was able to handle every filename, including those that had korean characters.
Yes, CWSTR can handle them, but the FB intrinsic file functions, such as OPEN, can't use unicode for the file paths. Try to open a file using OPEN with a path containing Korean characters...
My CTextSTream and CFileStream classes can use unicode characters in the path.
Windows doesn't use UTF-8 natively. Therefore, if you want to use UTF-8 in files, you will have to convert the read UTF-8 text to unicode and convert from unicode to UTF-8 before writing to the file.
> An UTF-8 file can't contain cyrillic characters, it has to be UTF-16.
Not exactly. UTF-8 and UTF-16 both encode the entire Unicode character set, which as I remember it, consists of about 150,000 characters. UTF-8 does it with a stream of bytes, and UTF-16 does it with a stream of 16-bit words. Neither bytes nor words can represent 150,000 things in a single unit, so they both have to resort to variable length characters. Obviously, a 16-bit word can represent more things in one unit (something less than 65536 because there are forbidden codes), so Cyrillic, Greek and more will fit into a single UTF-16 "letter." A single UTF-8 byte can only hold the ASCII characters, barely adequate for English, so the rest of the Unicode requires 2, 3, or 4 bytes - a pain if you need to know how many letters are in a given string.
What is true is that your Windows computer may not display a UTF-8 string correctly, because Windows is historically set up to display only a certain code page: an interpretation of bytes as characters. A Western code page can't display Cyrillic, and a Cyrillic code page can't display Greek. Windows 10 now has a code page called UTF-8. I switched my computer to the UTF-8 code page, and now I can display Latin, Cyrillic, Greek, even Arabic on the same screen (It's iffy which direction the Arabic will go). That even applies to command prompt screens.
A FreeBasic program can read a UTF-8 file into a string, but it won't know the difference between characters and bytes, so you're on your own there. To display the string correctly, you'll have to change your computer to the UTF-8 code page (there are side effects). Windows is beginning to accept UTF-8 as a real Unicode encoding, but it has a way to go yet.
end rant.
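The variable-length point above can be checked directly with a few lines of Python:

```python
# Sketch: both UTF-8 and UTF-16 cover all of Unicode; they just spend
# different numbers of bytes per character.
samples = {
    "A":  (1, 2),   # ASCII: 1 UTF-8 byte, one 2-byte UTF-16 unit
    "é":  (2, 2),   # accented Latin
    "С":  (2, 2),   # Cyrillic: 2 UTF-8 bytes, still one UTF-16 unit
    "한": (3, 2),   # Korean
    "😀": (4, 4),   # outside the BMP: 4 bytes in both encodings
}
for ch, (u8, u16) in samples.items():
    assert len(ch.encode("utf-8")) == u8
    assert len(ch.encode("utf-16-le")) == u16

# And the bytes-vs-characters mismatch: a UTF-8 byte string read into a
# plain ansi variable reports its byte count, not its character count.
line = "Прокофьев".encode("utf-8")
assert len(line) == 18                  # bytes
assert len(line.decode("utf-8")) == 9   # characters
```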
José, you changed your post while I was still writing. Sorry.
The part about the UTF-8 code page still stands.
> Windows is beginning to accept UTF-8 as a real Unicode encoding, but it has a way to go yet.
Well, it is doing some tweaking, like letting you set the active code page to UTF-8 in a manifest file. It probably also did some tweaks to the UCRT (Universal C Runtime), but I don't have precise information.
Most of the Windows API "A" functions can be used with UTF-8 strings if you set the active code page to UTF-8, because what they do is convert ansi strings to unicode using MultiByteToWideChar with CP_ACP (the system default Windows ANSI code page) as the first parameter and then call the "W" function.
But you will have to change all the FB functions that deal with strings, and also automatic string conversions. Doable, but a lot of work.
String manipulation will be slower than using ansi or unicode. Knowing that you are dealing with fixed-size bytes or words, instead of a variable number of bytes, is a blessing.
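That "A"-wrapper behavior can be sketched in Python (set_window_text_a/w are hypothetical stand-ins, not real API names):

```python
# Sketch: what an "A" function does with the active code page before
# calling the "W" version.  Not the actual Win32 implementation.
ACTIVE_CODE_PAGE = "cp1252"        # CP_ACP on a Western system (assumption)

def set_window_text_w(wide: str) -> str:
    return wide                    # stand-in for the real "W" call

def set_window_text_a(raw: bytes) -> str:
    """Mimics an 'A' wrapper: widen the bytes, then call the 'W' function."""
    wide = raw.decode(ACTIVE_CODE_PAGE)   # MultiByteToWideChar(CP_ACP, ...)
    return set_window_text_w(wide)

utf8 = "José".encode("utf-8")
# With a Western ansi code page the UTF-8 bytes are mangled:
assert set_window_text_a(utf8) == "JosÃ©"

# If the active code page were UTF-8, the same wrapper would be correct:
ACTIVE_CODE_PAGE = "utf-8"
assert set_window_text_a(utf8) == "José"
```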
In this application, I don't have to worry about unicode characters in the file path.
On one level, the data is there, just not in human readable format.
This won't be a problem, as long as the conversions can be done by code.
I attached the test file I'm using. It has a Cyrillic entry, and a Korean/English entry typically found on YouTube.
Notepad says it's UTF-8 and it displays properly.
If I force Notepad to load it as ANSI, then strange characters show up.
Terminal window dump:
╨Ã╨╡╤Ç╨│╨╡╠ü╨╣ ╨Ã╨╡╤Ç╨│╨╡╠ü╨╡╨▓╨╕╤ç ╨Æ'╤Ç╨╛╨║╨╛╠ü╤ä╤î╨╡╨▓
áõÃ'â,¬Ã³ÃµÌÂù áõÃ'â,¬Ã³ÃµÌÂõòøÃ'‡ ßÃ'â,¬Ã¾ÃºÃ¾ÌÂÃ'„Ã'Å'õò
Серге́й Серге́евич Проко́фьев
Серге́й Серге́евич Проко́фьев
Treasure - Ω▒╕Ω╖╕δú╣ ∞èñ∞£ä∞╣ÿδ▓á리(Switchberry) ∞Â¥╕∞▓£ Ω│╡∞ù░ chulwoo H ∞ºü∞║á(Fancam)
Treasure - 걸그룹 스ìÅ"„치ë² 리(Switchberry) ì¸ì²Å" ê³µìâ€"° chulwoo H ì§Âìº (Fancam)
Treasure - 걸그룹 스위치베리(Switchberry) 인천 공연 chulwoo H 직캠(Fancam)
Treasure - 걸그룹 스위치베리(Switchberry) 인천 공연 chulwoo H 직캠(Fancam)
How about using Jose's CTextStream class to read the file and utilize CWSTR's built-in utf8 conversion:
#define UNICODE
#include once "Afx\AfxFile.inc"
using Afx
dim pStream AS CTextStream
dim as CWSTR wst
if pStream.Open( "test.txt" ) = S_OK then
do until pStream.EOS
wst.utf8 = pStream.ReadLine
AfxMsg( wst )
loop
pStream.Close
end if
I could use it, although it appears the original FB IO operations are sufficient.
When it comes to Print# and Write#, CWSTR variables can be written.
What does #define unicode do?
#include once "Afx\AfxFile.inc"
#include "Afx\Cwstr.inc"
dim a as string
open "test.txt" for input as #1
open "output.txt" for output Encoding "utf16" as #2
do until eof(1)
line input #1,a
AfxMsg (cwstr(a,cp_utf8))
print #2, cwstr(a,cp_utf8) 'write line
write #2, cwstr(a,cp_utf8) 'write as variable
loop
close
sleep
end
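For comparison, the same read-UTF-8 / write-UTF-16 loop sketched in Python (not FB), with the two conversions made explicit:

```python
# Sketch: read UTF-8 lines, convert, write a UTF-16 file.
import io

utf8_in = io.BytesIO("Сергей Прокофьев\n직캠 Fancam\n".encode("utf-8"))
utf16_out = io.BytesIO()

for raw in utf8_in:                            # line input #1, a
    text = raw.decode("utf-8")                 # cwstr(a, cp_utf8)
    utf16_out.write(text.encode("utf-16-le"))  # print #2 (utf16 file)

# Round trip: the UTF-16 file holds the same text.
assert utf16_out.getvalue().decode("utf-16-le") == "Сергей Прокофьев\n직캠 Fancam\n"
```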
Just as an aside, I changed all of my file I/O some time ago to completely avoid FB's built-in intrinsic functions for the very reason that Jose explained earlier: they do not play well with unicode. Rather than try to juggle in my brain all the different scenarios where I could intersperse FB native code and WinFBX code, I decided to go 100% with WinFBX. Once you use Jose's classes a few times for binary and text streams, it becomes extremely easy. The classes also work perfectly with his CWSTR and CBSTR string types, and it is easy to use them to write the variables as you have shown in your example above (i.e. Print/Write). This post is not meant to persuade you to abandon your approach; it is just my experience that in the longer run, using the WinFBX approach has been easier and more consistent for me.
I couldn't have written a backup program without CWSTR variables and arrays. My previous backup program written in VB6 did not manage files with non-latin characters properly. It was always on my "to be resolved" list.
I don't do enough programming to encounter enough problems to learn to replace old ways of doing things, with the new.