Unicode usage precautions vs codepage

Marc Pons · July 18, 2016, 07:25:05 AM

Hi,
thanks to Jose, Paul...
the dynamic unicode string type is born!

Great job.

What do you think about advice/precautions like the following:

Code Select

' ========================================================================================
' Warning:
'	If you distribute an executable that includes Unicode chars, or share your source code
'	with others (who may use a different codepage than yours), you could face some difficulties!
'
'	Some directions to avoid these Unicode problems.
'
'	In the source code:
'	it is better not to use direct keyboard input for chars coded > 127 (outside the ASCII definition),
'	because those codes are codepage dependent and can produce strange behaviour depending
'	on the user's codepage.
'
'	So it is advisable to use:
'	the escape sequence of the needed char instead (notice the ! that enables escape sequences),
'			e.g. wstr(!"\u20AC") for the euro symbol (even if you have it available on your keyboard),
'	or use the wchr function for individual chars,
'			e.g. wchr(&h20AC) or wchr(8364): hex or decimal values for the euro symbol.
'
'	These two methods work well, but they are not very readable/easy...
'
'	If you prefer the direct keyboard input method,
'	but want to ensure your executable will run correctly, or to be able to share your source code:
'
'	Just type your code normally using direct keyboard input (codepage dependent),
'	compile, make your modifications and, when your executable runs as you want on your PC,
'	convert that code to UTF-8, which is not codepage dependent.
'	The converted UTF-8 source code will be compiled as is by the FreeBASIC compiler
'	to produce the executable you distribute, and you can also share that converted source code
'	with users, who can compile it on their side without any problem.
' ========================================================================================

I'm sure the codepage risk is significant.

marc
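
For reference, a minimal sketch of the two codepage-independent notations mentioned in the comment block above. It uses plain fixed-length WSTRING variables rather than CWSTR (an assumption, so that it compiles without the Afx includes); the variable names are only illustrative.

```freebasic
' Two codepage-independent ways to write the euro sign (U+20AC) in source code.
' The source file itself stays pure ASCII, so the editor's codepage does not matter.
dim as wstring * 8 viaEscape = wstr(!"\u20AC")   ' escape sequence; the ! enables escapes
dim as wstring * 8 viaWchr   = wchr(&h20AC)      ' numeric code point (8364 in decimal)

' Show the first wide character of each string as a hex value, and the lengths.
print hex(viaEscape[0]), hex(viaWchr[0])
print len(viaEscape), len(viaWchr)
```

Both forms are harder to read than a typed character, which is the trade-off the post describes.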

Marc Pons · July 18, 2016, 11:36:22 AM

A function to convert Unicode code points to CWSTR, extending the wchr function, which does not convert codes > &hFFFF.

Code Select

''::::: converts unicode codepoint (also > FFFF,  makes surrogate pair) to CWSTR
private function uchr(U1 as Ulong) as CWSTR
   dim hi                as Ulong
   dim lo                as Ulong
   if (U1 >= &h10000 and U1 <= &h10FFFF) then
      hi = ((U1 - &h10000) \ &h400) + &hD800   ' integer division; / is floating-point and rounds
      lo = ((U1 - &h10000) mod &h400) + &hDC00
      return wchr(hi , lo)
   elseif U1 < &h10000 then
      return wchr(U1)
   end if
   return ""
END FUNCTION

José Roca · July 18, 2016, 11:43:35 AM

Any example of use?

Marc Pons · July 18, 2016, 12:07:01 PM

updated, sorry

Code Select

#define unicode
#INCLUDE ONCE "windows.bi"

#INCLUDE ONCE "AFX/CWStr.inc"
using Afx

''::::: converts unicode codepoint (also > FFFF,  makes surrogate pair) to CWSTR
private function uchr(U1 as Ulong) as CWSTR
   dim hi                as Ulong
   dim lo                as Ulong
   if (U1 >= &h10000 and U1 <= &h10FFFF) then
      hi = ((U1 - &h10000) \ &h400) + &hD800   ' integer division; / is floating-point and rounds
      lo = ((U1 - &h10000) mod &h400) + &hDC00
      return wchr(hi , lo)
   elseif U1 < &h10000 then
      return wchr(U1)
   end if
   return ""
END FUNCTION

'supplementary-plane code point &h1D11E and the equivalent surrogate pair (&hD834, &hDD1E)
dim as CWSTR u11 = uchr(&h1D11E) & wchr(&hD834 , &hDD1E)
print "str(u11)= >" & str(u11) & "<"
messagebox(0 , "str(u11)= >" & str(u11) & "<" , "string view" , MB_OK)
'very few fonts can show the extended codes; if the glyph cannot be shown, an empty square represents the extended char
'at least you should see 2 squares in the messagebox, one for each of the 2 input forms,
'and 4 characters in the console, depending on your console codepage

for x as long = 0 to len(u11) - 1
   print " u11[" & x & "] = " & u11[x]
NEXT
print "Press any key..."
sleep

Marc Pons · July 18, 2016, 12:17:59 PM

just corrected previous post

marc

José Roca · July 18, 2016, 05:32:45 PM

There is one thing that I don't understand. Each time that I call it, the hex value changes.

MessageBox 0, HEX(uchr(&h1D11E)), "", MB_OK

It displays a hex number with 6 digits. The last four are always the same, 71D0, but the first two change.

Marc Pons · July 19, 2016, 04:11:55 AM

Jose,

I don't know why you want to show HEX(uchr(&h1D11E)).

Here is an extract of the Hex declarations:

Code Select

Declare Function Hex ( ByVal number As Const Any Ptr ) As String

The only result you can get from your code is the hex value of the memory address
where the uchr function result (a CWSTR) is stored,
because the Hex function sees it as a pointer, via the implicit conversion

Code Select

DECLARE OPERATOR CAST () AS ANY PTR   ' declared by the CWSTR class

I don't know why the value is changing, because for me, if I do
MessageBox 0, HEX(uchr(&h1D11E)), "", MB_OK ' shows 337DC0
' and a second time
MessageBox 0, HEX(uchr(&h1D11E)), "", MB_OK ' shows 337DC0, the same
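
Since HEX applied to the object only yields an address, one way to inspect the actual contents is to print each code unit separately. A minimal sketch, using a plain fixed-length WSTRING instead of CWSTR (an assumption, so that it compiles without the Afx includes):

```freebasic
' Build the surrogate pair for U+1D11E directly and dump its code units.
dim as wstring * 4 s = wchr(&hD834, &hDD1E)

' Index the string to get each wide character as a number, then format it as hex.
for i as long = 0 to len(s) - 1
   print "s[" & i & "] = &h" & hex(s[i], 4)
next
```

This is the same idea as the u11[x] loop in the earlier example, only with hex output instead of decimal.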

José Roca · July 19, 2016, 08:35:11 AM

Guess I got confused trying to understand it well enough to write an explanation. I have incorporated it into AfxWin.inc as follows:

Code Select


' ========================================================================================
' Converts unicode codepoint. Code points from the other planes (called Supplementary Planes)
' are encoded as two 16-bit code units called surrogate pairs, by the following scheme:
' &h010000 is subtracted from the code point, leaving a 20-bit number in the range 0..0x0FFFFF.
' The top ten bits (a number in the range 0..0x03FF) are added to 0xD800 to give the first
' 16-bit code unit or high surrogate, which will be in the range 0xD800..0xDBFF.
' The low ten bits (also in the range 0..0x03FF) are added to 0xDC00 to give the second 16-bit
' code unit or low surrogate, which will be in the range 0xDC00..0xDFFF.
' Example: DIM uch AS CWSTR = AfxUChr(&h1D11E) & WCHR(&hD834, &hDD1E)
' Converts unicode codepoint &h1D11E and makes surrogate pairs (WCHR(&hD834, &hDD1E)).
' ========================================================================================
PRIVATE FUNCTION AfxUChr(BYVAL uch AS ULONG) AS CWSTR
   DIM hi AS ULONG, lo AS ULONG
   IF (uch >= &h10000 AND uch <= &h10FFFF) THEN
      hi = ((uch - &h10000) \ &h400) + &hD800   ' integer division; / is floating-point and rounds
      lo = ((uch - &h10000) MOD &h400) + &hDC00
      RETURN WCHR(hi, lo)
   ELSEIF uch < &h10000 THEN
      RETURN WCHR(uch)
   END IF
   RETURN ""
END FUNCTION
' ========================================================================================

I don't think that anybody will ever use it, except maybe you.
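
The scheme described in the comment block can be checked with a few lines of integer arithmetic. The sketch below is only an illustration: HighSurrogate, LowSurrogate and FromSurrogates are hypothetical helper names, not Afx functions, and the shifts and masks are one possible way to take the top and low ten bits.

```freebasic
' Split a supplementary-plane code point (&h10000..&h10FFFF) into UTF-16 surrogates.
function HighSurrogate(byval cp as ulong) as ulong
   return ((cp - &h10000) shr 10) + &hD800      ' top ten bits of the 20-bit value
end function

function LowSurrogate(byval cp as ulong) as ulong
   return ((cp - &h10000) and &h3FF) + &hDC00   ' low ten bits of the 20-bit value
end function

' Inverse operation: rebuild the code point from a surrogate pair.
function FromSurrogates(byval hi as ulong, byval lo as ulong) as ulong
   return ((hi - &hD800) shl 10) + (lo - &hDC00) + &h10000
end function

print hex(HighSurrogate(&h1D11E)), hex(LowSurrogate(&h1D11E))   ' D834  DD1E
print hex(FromSurrogates(&hD834, &hDD1E))                        ' 1D11E
```

Using a shift (or integer division) matters here: the high surrogate must be the truncated quotient, never a rounded one.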

José Roca · July 20, 2016, 01:42:23 AM

Quote from: Marc Pons on July 18, 2016, 07:25:05 AM
Hi,
thanks to Jose, Paul...
the dynamic unicode string type is born!

Great job.

what do you think about advices/precautions like the following : [...]

I'm sure, the risk is important on the codepage

marc

I think that you're thinking with the mentality of a Linux user, with all that utf8 stuff. With the new WinFBE editor you will be able to choose the charset, and the string literals will be stored as the ansi representation of them. Using a code page with the CBSTR/CWSTR classes will convert these ansi codes to the correct representation of unicode characters.

Marc Pons · July 22, 2016, 04:46:29 PM

Code Select

With the new WinFBE editor you will be able to choose the charset, 
and the string literals will be stored as the ansi representation of them. 
Using a code page with the CBSTR/CWSTR classes will convert these ansi codes to the correct representation of unicode characters.

Jose, no problem if WinFBE helps, but it will only be able to use one charset at a time, which is sometimes not enough for international programs.
And for me it is not very useful, because I'm using XP, and WinFBE will not work under XP.
In fact that proposal (charset) only solves the case of executable distribution.

Second possibility, using a code page with the CBSTR/CWSTR classes: this proposal is more complete, it also solves sharing source code, but it is still very difficult to read, as your sample for AfxUcode usage shows

Code Select

' Usage example (Russian ANSI string to BSTR):
'   DIM bs AS AFX_BSTR
'   bs = AfxUcode(CHR(209, 229, 236, 229, 237), 1251)

I think Unicode, when used for international development, is never easy.

Developers (if any) need different solutions; they will take the direction that fits them best,
and it is better to have more than one choice when dealing with such a difficult subject.

last point,

Code Select

...you're thinking with the mentality of a Linux user...
That comment feels quite slighting to me, doesn't it?
I hope not; I do not think Linux people (which I'm not) are stupid either.
And in some cases UTF-8 support is missing in the Windows world, especially on the console (try to show a Unicode char), but not in Linux.

José Roca · July 22, 2016, 06:24:53 PM

I don't remember having called anybody stupid.

Frankly, I'm never going to use things like wstr(!"\u20AC"), not even CHR(something), or UTF-8 to code string literals (and probably nobody else will).

Windows doesn't speak UTF-8, but UTF-16, and this is what I'm using.

José Roca · July 22, 2016, 06:47:10 PM

Quote
Second possibility, using a code page with the CBSTR/CWSTR classes: this proposal is more complete, it also solves sharing source code, but it is still very difficult to read, as your sample for AfxUcode usage shows

Code: [Select]

' Usage example (Russian ANSI string to BSTR):
' DIM bs AS AFX_BSTR
' bs = AfxUcode(CHR(209, 229, 236, 229, 237), 1251)

Using CSED, if you choose the Russian charset, you can do:

Code Select


DIM cbs AS CBSTR = AfxUcode("Закрыть", 1251)
Button_SetText(hButton, cbs)

What the FB developers have to do is to add an optional code page parameter to functions like WSTR.

José Roca · July 22, 2016, 07:44:01 PM

I have removed a constructor from the CWSTR class that was being called instead of the one that accepts a code page.

After removing it, we can do:

Code Select


DIM cws AS CWSTR = 1251   ' Russian code page
cws = "Закрыть"
Button_SetText(hButton, cws)

--or--

Code Select


DIM cws AS CWSTR = CWSTR("Закрыть", 1251)
Button_SetText(hButton, cws)

This means that you can use different code pages in the same application.

José Roca · July 22, 2016, 08:39:08 PM

Ok. I have added support for UTF-8 to both CBSTR and CWSTR.

Now you can do:

Code Select


DIM cws AS CWSTR = CP_UTF8
cws = "Ãâ€ÃÂ¼ÃÂ¸ÃŒÂÃ'â€šÃ'â,¬ÃÂ¸ÃÂ¹ Ãâ€ÃÂ¼ÃÂ¸ÃŒÂÃ'â€šÃ'â,¬ÃÂ¸ÃÂµÃÂ²ÃÂ¸Ã'â€¡"
SetWindowText(hwnd, cws)

--or--

Code Select


DIM cws AS CWSTR = CWSTR("Ãâ€ÃÂ¼ÃÂ¸ÃŒÂÃ'â€šÃ'â,¬ÃÂ¸ÃÂ¹ Ãâ€ÃÂ¼ÃÂ¸ÃŒÂÃ'â€šÃ'â,¬ÃÂ¸ÃÂµÃÂ²ÃÂ¸Ã'â€¡", CP_UTF8)
SetWindowText(hwnd, cws)

Does this make you happy?

I will need a UTF-8 converter. It looks stranger to me than Russian.
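
As a side note on why UTF-8 text looks so odd in an ANSI editor: every Cyrillic letter becomes two bytes, and each byte is then displayed as a separate ANSI character. The sketch below encodes a single code point by hand; Utf8Bytes is a hypothetical helper name, not part of the CWSTR/CBSTR classes, and it only covers code points up to &hFFFF.

```freebasic
' Encode one code point (up to &hFFFF in this sketch) as a string of UTF-8 bytes.
function Utf8Bytes(byval cp as ulong) as string
   if cp < &h80 then
      return chr(cp)                                    ' 1 byte: plain ASCII
   elseif cp < &h800 then
      return chr(&hC0 or (cp shr 6), _
                 &h80 or (cp and &h3F))                 ' 2 bytes
   else
      return chr(&hE0 or (cp shr 12), _
                 &h80 or ((cp shr 6) and &h3F), _
                 &h80 or (cp and &h3F))                 ' 3 bytes
   end if
end function

dim as string d = Utf8Bytes(&h414)      ' Cyrillic capital letter De
print hex(d[0]), hex(d[1])              ' D0  94

dim as string e = Utf8Bytes(&h20AC)     ' euro sign
print hex(e[0]), hex(e[1]), hex(e[2])   ' E2  82  AC
```

Viewed with a Western ANSI charset, the two bytes D0 94 show up as two unrelated characters, which is the effect seen in the string literals above.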

José Roca · July 22, 2016, 09:22:15 PM

Of course, if using a code page, we must pass variables to the functions with string parameters, e.g.

Code Select


DIM cws AS CWSTR = CWSTR("Закрыть", 1251)   ' 1251, Russian code page
SetWindowText hwnd, cws

and not

Code Select


SetWindowText hwnd, "Закрыть"

But, hey, now you can use CP_UTF8 as the code page and an UTF8 encoded string.

And we can also do things like:

Code Select


DIM cws AS CWSTR = "Josй "
DIM cws2 AS CWSTR = CWSTR("Закрыть", 1251)
cws = cws & cws2 & " Roca"
SetWindowText hwnd, cws

that using the default charset looks like:

Code Select


DIM cws AS CWSTR = "Jose "
DIM cws2 AS CWSTR = CWSTR("Çàêðûòü", 1251)
cws = cws & cws2 & " Roca"
SetWindowText hwnd, cws

mixing two strings that use different code pages.
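
The two listings differ only in how the editor displays the same bytes. A rough sketch of the idea, assuming nothing about how CWSTR performs the conversion internally: DecodeByte is a hypothetical helper that covers only ASCII, the Cyrillic letters А..я of code page 1251 and the &hA0..&hFF range of code page 1252; a real conversion would use complete tables.

```freebasic
' Decode a single ANSI byte to a Unicode code point under a given code page.
function DecodeByte(byval b as ubyte, byval codepage as long) as ulong
   if b < &h80 then
      return b                          ' ASCII is the same in both code pages
   elseif codepage = 1251 andalso b >= &hC0 then
      return b - &hC0 + &h410           ' &hC0..&hFF map to U+0410..U+044F
   elseif codepage = 1252 andalso b >= &hA0 then
      return b                          ' &hA0..&hFF match U+00A0..U+00FF
   end if
   return &hFFFD                        ' not covered by this sketch
end function

print hex(DecodeByte(&hE9, 1251))   ' 439 (Cyrillic short i)
print hex(DecodeByte(&hE9, 1252))   ' E9  (e with acute accent)
```

The same byte &hE9 yields a different character depending on the code page, which is why each literal has to be paired with the code page it was typed in.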