Unicode usage precautions vs codepage

Started by Marc Pons, July 18, 2016, 07:25:05 AM


Marc Pons

Hi,
thanks to Jose, Paul...
the dynamic unicode string type is born!

Great job.

What do you think about advice/precautions like the following:

' ========================================================================================
' Warning:
' If you want to distribute an executable that includes Unicode characters, or share your
' source code with others (who may use a different code page than yours), you could face
' some difficulties!
'
' Some directions to avoid these Unicode problems.
'
' In the source code:
' It is better not to type characters with codes > 127 (outside the ASCII range) directly
' from the keyboard, because those codes are code page dependent and can produce strange
' behaviour depending on the user's code page.
'
' So it is advisable to use:
' the escape sequence for the needed character instead (notice the ! that enables escape sequences),
' e.g. wstr(!"\u20AC") for the euro symbol (even if it is available on your keyboard),
' or the wchr function for individual characters,
' e.g. wchr(&h20AC) or wchr(8364): hex or decimal value of the euro symbol.
'
' Both methods work well, but they are not very readable or easy to type...
'
' If you prefer direct keyboard input, but still want to ensure that your executable runs
' correctly and that your source code can be shared:
'
' Just type your code normally using direct keyboard input (code page dependent),
' compile, make your modifications and, when your executable runs as you want on your PC,
' convert the source to UTF-8, which is not code page dependent.
' The converted UTF-8 source will be compiled as is by the FreeBASIC compiler
' to produce the executable you distribute, and you can also share the converted source
' with users, who can compile it on their side without any problem.
' ========================================================================================


I'm sure the code page risk is significant.

marc
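
As a quick check of the two code page independent forms described in the warning above, the sketch below compares the code unit produced by the escape sequence with the one produced by wchr. It assumes only built-in FreeBASIC functions (wstr, wchr, asc, hex); the variable names are illustrative and not from the original post.

```freebasic
' Two code page independent ways of writing the euro symbol (U+20AC).
' Variable names are illustrative only.
dim as wstring * 4 fromEscape = wstr(!"\u20AC")   ' escape sequence, note the !
dim as wstring * 4 fromHex    = wchr(&h20AC)      ' wchr with a hex value
dim as wstring * 4 fromDec    = wchr(8364)        ' wchr with a decimal value

' Print the first code unit of each string in hex; all three should match.
print hex(asc(fromEscape))
print hex(asc(fromHex))
print hex(asc(fromDec))
```

Because none of the three lines contains a character above 127, the source file itself stays plain ASCII and should compile the same way whatever code page the editor uses.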

Marc Pons

Here is a function that converts Unicode code points to CWSTR, extending the wchr function, which does not convert codes > &hFFFF.

''::::: converts unicode codepoint (also > FFFF,  makes surrogate pair) to CWSTR
private function uchr(U1 as Ulong) as CWSTR
   dim hi                as Ulong
   dim lo                as Ulong
   if (U1 >= &h10000 and U1 <= &h10FFFF) then
      ' integer division (\): "/" is floating-point and would be rounded on assignment
      hi = ((U1 - &h10000) \ &h400) + &hD800
      lo = ((U1 - &h10000) mod &h400) + &hDC00
      return wchr(hi , lo)
   elseif U1 < &h10000 then
      return wchr(U1)
   end if
   return ""
END FUNCTION
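
To make the arithmetic easy to check without the AFX include, here is a standalone sketch of the same split. It is not the code from the post: the name SurrogatePair and the BYREF outputs are hypothetical. It uses SHR and AND so that no floating-point division is involved (in FreeBASIC the / operator is floating-point division, and the result is rounded when stored in an integer variable).

```freebasic
' Hypothetical helper: split a supplementary-plane code point
' (&h10000..&h10FFFF) into its two UTF-16 surrogates.
sub SurrogatePair(byval cp as ulong, byref hi as ulong, byref lo as ulong)
   dim v as ulong = cp - &h10000          ' 20-bit value
   hi = (v shr 10) + &hD800               ' top ten bits  -> high surrogate
   lo = (v and &h3FF) + &hDC00            ' low ten bits  -> low surrogate
end sub

dim as ulong hi, lo

SurrogatePair(&h1D11E, hi, lo)
print hex(hi); " "; hex(lo)               ' D834 DD1E

SurrogatePair(&h1F600, hi, lo)
print hex(hi); " "; hex(lo)               ' D83D DE00
```

The second call is the interesting one: the low ten bits of &h1F600 - &h10000 are &h200, so a version based on rounded floating-point division would give &hD83E for the high surrogate instead of &hD83D.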

Marc Pons

updated, sorry

#define unicode
#INCLUDE ONCE "windows.bi"

#INCLUDE ONCE "AFX/CWStr.inc"
using Afx

''::::: converts unicode codepoint (also > FFFF,  makes surrogate pair) to CWSTR
private function uchr(U1 as Ulong) as CWSTR
   dim hi                as Ulong
   dim lo                as Ulong
   if (U1 >= &h10000 and U1 <= &h10FFFF) then
      ' integer division (\): "/" is floating-point and would be rounded on assignment
      hi = ((U1 - &h10000) \ &h400) + &hD800
      lo = ((U1 - &h10000) mod &h400) + &hDC00
      return wchr(hi , lo)
   elseif U1 < &h10000 then
      return wchr(U1)
   end if
   return ""
END FUNCTION

' supplementary-plane code point &h1D11E and the equivalent surrogate pair (&hD834 , &hDD1E)
dim as CWSTR u11 = uchr(&h1D11E) & wchr(&hD834 , &hDD1E)
print "str(u11)= >" & str(u11) & "<"
messagebox(0 , "str(u11)= >" & str(u11) & "<" , "string view" , MB_OK)
' very few fonts can display supplementary-plane characters; when the font lacks the glyph,
' an empty square is shown in place of the character
' at least you should see 2 squares in the message box, one for each input form,
' and 4 characters in the console, depending on your console code page

for x as long = 0 to len(u11) - 1
   print " u11[" & x & "] = " & u11[x]
NEXT
print "Press any key..."
sleep

José Roca

There is one thing that I don't understand. Each time that I call it, the hex value changes.

MessageBox 0, HEX(uchr(&h1D11E)), "", MB_OK

It displays a hex number with 6 digits. The last four are always the same, 71D0, but the first two change.

Marc Pons

Jose,

I don't know why you want to show HEX(uchr(&h1D11E)).

Here is the relevant Hex declaration:
Declare Function Hex ( ByVal number As Const Any Ptr ) As String

The only result you can get from your code is the hex value of the memory address
where the result of the uchr function (a CWSTR) is stored.
The Hex function sees it as a pointer, via the implicit conversion
DECLARE OPERATOR CAST () AS ANY PTR provided by the CWSTR class.


I don't know why the value changes for you,
because for me:
MessageBox 0, HEX(uchr(&h1D11E)), "", MB_OK    ' shows 337DC0
' and on the second call
MessageBox 0, HEX(uchr(&h1D11E)), "", MB_OK    ' shows 337DC0, the same
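
If the goal is to see the content of the string and not the address of its buffer, one option is to print each code unit in hex. The sketch below uses a plain fixed-length WSTRING, not CWSTR, so it does not depend on the AFX classes; the variable name is illustrative.

```freebasic
' Inspect the UTF-16 code units of a string instead of its address.
' A plain WSTRING is used here so that no AFX include is needed.
dim as wstring * 4 s = wchr(&hD834, &hDD1E)   ' surrogate pair for U+1D11E

for i as long = 0 to len(s) - 1
   print hex(s[i]); " ";                      ' one code unit per iteration
next
print
```

Indexing the string yields numeric values, so Hex receives an integer and formats the code unit itself.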

José Roca

I guess I got confused trying to understand it in order to write an explanation. I have incorporated it into AfxWin.inc as follows:


' ========================================================================================
' Converts a Unicode code point to a CWSTR. Code points above &hFFFF (the Supplementary Planes)
' are encoded as two 16-bit code units called surrogate pairs, by the following scheme:
' &h010000 is subtracted from the code point, leaving a 20-bit number in the range 0..0x0FFFFF.
' The top ten bits (a number in the range 0..0x03FF) are added to 0xD800 to give the first
' 16-bit code unit or high surrogate, which will be in the range 0xD800..0xDBFF.
' The low ten bits (also in the range 0..0x03FF) are added to 0xDC00 to give the second 16-bit
' code unit or low surrogate, which will be in the range 0xDC00..0xDFFF.
' Example: DIM uch AS CWSTR = AfxUChr(&h1D11E) & WCHR(&hD834, &hDD1E)
' Converts unicode codepoint &h1D11E and makes surrogate pairs (WCHR(&hD834, &hDD1E)).
' ========================================================================================
PRIVATE FUNCTION AfxUChr(BYVAL uch AS ULONG) AS CWSTR
   DIM hi AS ULONG, lo AS ULONG
   IF (uch >= &h10000 AND uch <= &h10FFFF) THEN
      ' Integer division (\): "/" is floating-point and would be rounded on assignment
      hi = ((uch - &h10000) \ &h400) + &hD800
      lo = ((uch - &h10000) MOD &h400) + &hDC00
      RETURN WCHR(hi, lo)
   ELSEIF uch < &h10000 THEN
      RETURN WCHR(uch)
   END IF
   RETURN ""
END FUNCTION
' ========================================================================================
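
The scheme described in the comment can be checked in the other direction as well. The following is a small sketch, not part of AfxWin.inc, that recombines a surrogate pair into the original code point; the function name is hypothetical.

```freebasic
' Hypothetical helper: rebuild a code point from a UTF-16 surrogate pair.
' hi must be in &hD800..&hDBFF and lo in &hDC00..&hDFFF (not validated here).
function CodePointFromSurrogates(byval hi as ulong, byval lo as ulong) as ulong
   ' undo the offsets, reassemble the 20-bit value, add &h10000 back
   return ((hi - &hD800) shl 10) + (lo - &hDC00) + &h10000
end function

print hex(CodePointFromSurrogates(&hD834, &hDD1E))   ' 1D11E
```

Feeding it the pair from the example comment gives back &h1D11E, which is a convenient round-trip test for any encoder of this kind.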


I don't think anybody will ever use it, except maybe you.

José Roca

Quote from: Marc Pons on July 18, 2016, 07:25:05 AM

I think that you're thinking with the mentality of a Linux user, with all that utf8 stuff. With the new WinFBE editor you will be able to choose the charset, and the string literals will be stored as the ansi representation of them. Using a code page with the CBSTR/CWSTR classes will convert these ansi codes to the correct representation of unicode characters.

Marc Pons

With the new WinFBE editor you will be able to choose the charset,
and the string literals will be stored as the ansi representation of them.
Using a code page with the CBSTR/CWSTR classes will convert these ansi codes to the correct representation of unicode characters.


Jose, no problem if WinFBE helps, but it will only be able to use one charset at a time, which is sometimes not enough for an international program.
And for me it is not very useful, because I'm using XP and WinFBE will not work under XP.
In fact that proposal (choosing the charset) only solves the case of executable distribution.

The second possibility, using a code page with the CBSTR/CWSTR classes, is more complete: it also solves sharing the source code, but it is still very difficult to read, as your sample of AfxUcode usage shows:

' Usage example (Russian ANSI string to BSTR):
'   DIM bs AS AFX_BSTR
'   bs = AfxUcode(CHR(209, 229, 236, 229, 237), 1251)


I think Unicode, when used for international development, is never easy.

Developers (if any) need different solutions; they will take the direction that fits them best,
and it is better to have more than one choice when dealing with such a difficult topic.

Last point: "...you're thinking with the mentality of a Linux user..."
To my mind that comment is rather slighting, isn't it?
I hope not; I do not think Linux people (which I'm not) are stupid either.
And in some cases UTF-8 support is missing in the Windows world, especially in the console (try to display a Unicode character), but not in Linux.

José Roca

I don't remember having called anybody stupid.

Frankly, I'm never going to use things like wstr(!"\u20AC"), or even CHR(something), or UTF-8 to code string literals (and probably nobody else will).

Windows doesn't speak UTF-8 but UTF-16, and this is what I'm using.

José Roca

Quote
The second possibility, using a code page with the CBSTR/CWSTR classes, is more complete: it also solves sharing the source code, but it is still very difficult to read, as your sample of AfxUcode usage shows:

' Usage example (Russian ANSI string to BSTR):
'   DIM bs AS AFX_BSTR
'   bs = AfxUcode(CHR(209, 229, 236, 229, 237), 1251)

Using CSED, if you choose the Russian charset, you can do:


DIM cbs AS CBSTR = AfxUcode("Закрыть", 1251)
Button_SetText(hButton, cbs)


What the FB developers have to do is to add an optional code page parameter to functions like WSTR.

José Roca

I have removed a constructor from the CWSTR class that was being called instead of the one that accepts a code page.

After removing it, we can do:


DIM cws AS CWSTR = 1251   ' Russian code page
cws = "Закрыть"
Button_SetText(hButton, cws)


--or--


DIM cws AS CWSTR = CWSTR("Закрыть", 1251)
Button_SetText(hButton, cws)


This means that you can use different code pages in the same application.
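
For readers wondering what "using a code page" amounts to underneath, here is a hedged sketch that converts the Russian byte values from the AfxUcode usage sample quoted earlier in the thread with the Win32 function MultiByteToWideChar. Whether CBSTR/CWSTR or AfxUcode are implemented this way is an assumption; the sketch is Windows-only and its variable names are illustrative.

```freebasic
#include once "windows.bi"

' ANSI bytes in Windows-1251, written with CHR so the source stays ASCII.
dim as string ansi = chr(209, 229, 236, 229, 237)
dim as wstring * 16 wide                  ' zero-initialised output buffer

' Convert using code page 1251; returns the number of UTF-16 code units written.
dim as long n = MultiByteToWideChar(1251, 0, strptr(ansi), len(ansi), @wide, 15)

for i as long = 0 to n - 1
   print hex(wide[i]); " ";               ' 421 435 43C 435 43D
next
print
```

The same bytes converted with a different code page number would give different code units, which is why the code page has to travel with the ANSI string.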

José Roca

Ok. I have added support for UTF8 to both CBSTR and CWSTR.

Now you can do:


DIM cws AS CWSTR = CP_UTF8
' UTF-8 encoded literal (it shows up as unreadable ANSI characters in a non-UTF-8 editor)
cws = "Дми́трий Дми́триевич"
SetWindowText(hwnd, cws)


--or--


' UTF-8 encoded literal (it shows up as unreadable ANSI characters in a non-UTF-8 editor)
DIM cws AS CWSTR = CWSTR("Дми́трий Дми́триевич", CP_UTF8)
SetWindowText(hwnd, cws)


Does this make you happy? :)

I will need a UTF8 converter. It looks stranger to me than Russian :)
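
As an illustration of what a UTF-8 conversion does, the sketch below decodes the three UTF-8 bytes of the euro symbol with the Win32 function MultiByteToWideChar and CP_UTF8. It is a minimal Windows-only example under the assumption that the classes do something comparable internally; the bytes are given with CHR so the result does not depend on how the source file is saved.

```freebasic
#include once "windows.bi"

' UTF-8 encoding of the euro symbol U+20AC: E2 82 AC
dim as string utf8 = chr(&hE2, &h82, &hAC)
dim as wstring * 4 wide                   ' zero-initialised output buffer

' Three UTF-8 bytes collapse into a single UTF-16 code unit.
dim as long n = MultiByteToWideChar(CP_UTF8, 0, strptr(utf8), len(utf8), @wide, 3)

print "code units: "; n
print "first unit: "; hex(wide[0])        ' 20AC
```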

José Roca

Of course, if using a code page, we must pass variables to the functions with string parameters, e.g.


DIM cws AS CWSTR = CWSTR("Закрыть", 1251)   ' 1251, Russian code page
SetWindowText hwnd, cws


and not


SetWindowText hwnd, "Закрыть"


But, hey, now you can use CP_UTF8 as the code page and a UTF8 encoded string.

And we can also do things like:


DIM cws AS CWSTR = "Josй "
DIM cws2 AS CWSTR = CWSTR("Закрыть", 1251)
cws = cws & cws2 & " Roca"
SetWindowText hwnd, cws


which, viewed with the default charset, looks like:


DIM cws AS CWSTR = "José "
DIM cws2 AS CWSTR = CWSTR("Çàêðûòü", 1251)
cws = cws & cws2 & " Roca"
SetWindowText hwnd, cws


mixing two strings that use different code pages.