PlanetSquires Forums

Support Forums => General Board => Topic started by: Marc Pons on July 18, 2016, 07:25:05 AM

Title: Unicode usage precautions vs codepage
Post by: Marc Pons on July 18, 2016, 07:25:05 AM
Hi,
thanks to Jose, Paul...
the dynamic unicode string type is born!

Great job.

What do you think about advice/precautions like the following:

' ========================================================================================
' Warning:
' If you want to distribute an executable that includes unicode chars, or share your source code
' with others (who may use a different codepage than yours), you could face some difficulties!
'
' Some directions to avoid these unicode problems.
'
' In the source code:
' it is better not to use direct keyboard input for chars coded > 127 (outside the ASCII definition),
' because those codes are codepage dependent and can produce strange behaviour depending
' on the user's codepage.
'
' So it is advisable to use:
' the escape sequence of the needed char instead (notice the ! that enables escape sequences),
' eg : wstr(!"\u20AC")  for the euro symbol (even if you have it available on your keyboard),
' or use the wchr function for individual chars,
' eg : wchr(&h20AC) or wchr(8364) : hex or decimal values for the euro symbol
'
' These 2 methods work well, but they are not very readable/easy...
'
' If you prefer the direct keyboard input method,
' but want to ensure your executable runs correctly, or to be able to share your source code:
'
' Just type your code normally using direct keyboard input (codepage dependent),
' compile, make your modifications, and when your executable runs as you want on your PC,
' convert that source to utf8, which is not codepage dependent.
' The converted utf8 source code will be compiled as is by the freebasic compiler
' to produce the executable you distribute, and you can also share that converted source code
' with users, who can compile it on their side without any problem.
' ========================================================================================
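A minimal sketch of the two codepage-independent spellings described in the comment block. The variable names are illustrative only, and it assumes the compiler accepts the \u escape the way the comment block says it does.

```freebasic
' Two codepage-independent ways to write the euro sign (U+20AC).
' Variable names are illustrative, not from the original post.
dim as wstring * 4 euroEsc = wstr(!"\u20AC")   ' escape sequence; the ! prefix enables it
dim as wstring * 4 euroChr = wchr(&h20AC)      ' same code point through wchr

' Both variables should hold the same single code unit,
' whatever codepage the source file was typed in.
print "escape: &h" & hex(asc(euroEsc), 4)
print "wchr  : &h" & hex(asc(euroChr), 4)
```

Neither line contains a character above 127, so the source file reads the same under any ANSI codepage.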


I'm sure the codepage risk is significant.

marc
Title: Re: Unicode usage precautions vs codepage
Post by: Marc Pons on July 18, 2016, 11:36:22 AM
Here is a function that converts unicode code points to CWSTR. It extends the wchr function, which does not convert codes > &hFFFF.

''::::: converts unicode codepoint (also > FFFF,  makes surrogate pair) to CWSTR
private function uchr(U1 as Ulong) as CWSTR
   dim hi                as Ulong
   dim lo                as Ulong
   if (U1 >= &h10000 and U1 <= &h10FFFF) then
      hi = ((U1 - &h10000) \ &h400) + &hD800   ' \ is integer division; / would round the quotient
      lo = ((U1 - &h10000) mod &h400) + &hDC00
      return wchr(hi , lo)
   elseif U1 < &h10000 then
      return wchr(U1)
   end if
   return ""
END FUNCTION
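One detail worth checking in the high-surrogate arithmetic: in FreeBASIC the / operator is floating-point division, and assigning its result to an integer variable rounds instead of truncating, while \ is integer division. The sketch below compares both for &h1F680, a code point picked here only for illustration.

```freebasic
' Compare floating-point and integer division when extracting the top ten bits.
dim as ulong v = &h1F680 - &h10000     ' &hF680
dim as ulong byFloat = v / &h400       ' quotient is 61.625, rounded on assignment
dim as ulong byInt   = v \ &h400       ' integer division truncates
dim as ulong byShift = v shr 10        ' equivalent bit shift

print "v / &h400 -> &h" & hex(byFloat + &hD800)
print "v \ &h400 -> &h" & hex(byInt + &hD800)
print "v shr 10  -> &h" & hex(byShift + &hD800)
```

The expected high surrogate for this code point is &hD83D, which is what truncation gives. Code points whose low ten bits are below &h200, such as &h1D11E, are not affected by the rounding.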
Title: Re: Unicode usage precautions vs codepage
Post by: José Roca on July 18, 2016, 11:43:35 AM
Any example of use?
Title: Re: Unicode usage precautions vs codepage
Post by: Marc Pons on July 18, 2016, 12:07:01 PM
updated, sorry

#define unicode
#INCLUDE ONCE "windows.bi"

#INCLUDE ONCE "AFX/CWStr.inc"
using Afx

''::::: converts unicode codepoint (also > FFFF,  makes surrogate pair) to CWSTR
private function uchr(U1 as Ulong) as CWSTR
   dim hi                as Ulong
   dim lo                as Ulong
   if (U1 >= &h10000 and U1 <= &h10FFFF) then
      hi = ((U1 - &h10000) \ &h400) + &hD800   ' \ is integer division; / would round the quotient
      lo = ((U1 - &h10000) mod &h400) + &hDC00
      return wchr(hi , lo)
   elseif U1 < &h10000 then
      return wchr(U1)
   end if
   return ""
END FUNCTION

'supplementary code point &h1D11E and the equivalent surrogate pair (&hD834 , &hDD1E)
dim as CWSTR u11 = uchr(&h1D11E) & wchr(&hD834 , &hDD1E)
print "str(u11)= >" & str(u11) & "<"
messagebox(0 , "str(u11)= >" & str(u11) & "<" , "string view" , MB_OK)
'very few fonts can show supplementary-plane chars; when the font lacks the glyph, an empty square is shown instead
'at least you should see 2 squares in the messagebox, one for each of the 2 input forms
'and, in the console, 4 characters that depend on your console codepage

for x as long = 0 to len(u11) - 1
   print " u11[" & x & "] = " & u11[x]
NEXT
print "Press any key..."
sleep
Title: Re: Unicode usage precautions vs codepage
Post by: Marc Pons on July 18, 2016, 12:17:59 PM
just corrected previous post

marc
Title: Re: Unicode usage precautions vs codepage
Post by: José Roca on July 18, 2016, 05:32:45 PM
There is one thing that I don't understand. Each time that I call it, the hex value changes.

MessageBox 0, HEX(uchr(&h1D11E)), "", MB_OK

It displays a hex number with 6 digits. The last four are always the same, 71D0, but the first two change.
Title: Re: Unicode usage precautions vs codepage
Post by: Marc Pons on July 19, 2016, 04:11:55 AM
Jose,

I don't know why you want to show HEX(uchr(&h1D11E)).

Here is the relevant Hex declaration:
Declare Function Hex ( ByVal number As Const Any Ptr ) As String

The only result you can get from your code is the hex value of the memory address
where the result of the uchr function (a CWSTR) is stored.
The Hex function sees it as a pointer, via the implicit conversion
DECLARE OPERATOR CAST () AS ANY PTR, done by the CWSTR class.


I don't know why the value is changing,
because for me, if I do
MessageBox 0, HEX(uchr(&h1D11E)), "", MB_OK    ' shows 337DC0
' and a second time
MessageBox 0, HEX(uchr(&h1D11E)), "", MB_OK    ' shows 337DC0, the same
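To look at the string contents rather than the address of the buffer, one option is to index the individual code units. A small sketch using a plain WSTRING, so that it needs no AFX include; it assumes Windows, where a WSTRING code unit is 16 bits.

```freebasic
' Print each UTF-16 code unit of the surrogate pair for &h1D11E.
dim as wstring * 4 w = wchr(&hD834, &hDD1E)

for i as long = 0 to len(w) - 1
   print "w[" & i & "] = &h" & hex(w[i], 4)
next
```

The same loop shape is what the earlier example uses on the CWSTR with u11[x].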
Title: Re: Unicode usage precautions vs codepage
Post by: José Roca on July 19, 2016, 08:35:11 AM
Guess I got confused trying to understand it to write an explanation. I have incorporated it into AfxWin.inc as follows:


' ========================================================================================
' Converts a unicode code point to a CWSTR. Code points outside the Basic Multilingual Plane
' (the Supplementary Planes) are encoded as two 16-bit code units called a surrogate pair, by the following scheme:
' &h010000 is subtracted from the code point, leaving a 20-bit number in the range 0..0x0FFFFF.
' The top ten bits (a number in the range 0..0x03FF) are added to 0xD800 to give the first
' 16-bit code unit or high surrogate, which will be in the range 0xD800..0xDBFF.
' The low ten bits (also in the range 0..0x03FF) are added to 0xDC00 to give the second 16-bit
' code unit or low surrogate, which will be in the range 0xDC00..0xDFFF.
' Example: DIM uch AS CWSTR = AfxUChr(&h1D11E) & WCHR(&hD834, &hDD1E)
' Converts unicode codepoint &h1D11E and makes surrogate pairs (WCHR(&hD834, &hDD1E)).
' ========================================================================================
PRIVATE FUNCTION AfxUChr(BYVAL uch AS ULONG) AS CWSTR
   DIM hi AS ULONG, lo AS ULONG
   IF (uch >= &h10000 AND uch <= &h10FFFF) THEN
      hi = ((uch - &h10000) \ &h400) + &hD800   ' \ is integer division; / would round the quotient
      lo = ((uch - &h10000) MOD &h400) + &hDC00
      RETURN WCHR(hi, lo)
   ELSEIF uch < &h10000 THEN
      RETURN WCHR(uch)
   END IF
   RETURN ""
END FUNCTION
' ========================================================================================
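The scheme in the header comment can be followed by hand for &h1D11E. This is only a worked sketch with made-up variable names; it uses shifts and masks instead of division to make the "top ten bits" and "low ten bits" explicit.

```freebasic
' Worked example of the surrogate-pair scheme for code point &h1D11E.
dim as ulong cp = &h1D11E
dim as ulong v  = cp - &h10000             ' 20-bit value
dim as ulong hi = (v shr 10) + &hD800      ' top ten bits -> high surrogate
dim as ulong lo = (v and &h3FF) + &hDC00   ' low ten bits -> low surrogate
print "hi = &h" & hex(hi) & ", lo = &h" & hex(lo)

' Decoding reverses the steps and recovers the original code point.
dim as ulong back = ((hi - &hD800) shl 10) + (lo - &hDC00) + &h10000
print "back = &h" & hex(back)
```

This gives the pair (&hD834, &hDD1E) quoted in the example above, and decoding returns &h1D11E.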


I don't think anybody will ever use it, except maybe you.
Title: Re: Unicode usage precautions vs codepage
Post by: José Roca on July 20, 2016, 01:42:23 AM
Quote from: Marc Pons on July 18, 2016, 07:25:05 AM

I think that you're thinking with the mentality of a Linux user, with all that utf8 stuff. With the new WinFBE editor you will be able to choose the charset, and the string literals will be stored as the ansi representation of them. Using a code page with the CBSTR/CWSTR classes will convert these ansi codes to the correct representation of unicode characters.
Title: Re: Unicode usage precautions vs codepage
Post by: Marc Pons on July 22, 2016, 04:46:29 PM
With the new WinFBE editor you will be able to choose the charset,
and the string literals will be stored as the ansi representation of them.
Using a code page with the CBSTR/CWSTR classes will convert these ansi codes to the correct representation of unicode characters.


Jose, no problem if WinFBE helps, but it will only be able to use 1 charset at a time, which is sometimes not enough for an international program.
And for me it is not very useful, because I'm using XP, and WinFBE will not work under XP.
In fact that proposal (charset) only solves the case of executable distribution.

Second possibility, using a code page with the CBSTR/CWSTR classes: this proposition is more complete, it also solves sharing source code, but it is still very difficult to read, as your sample of AfxUcode usage shows:

' Usage example (Russian ANSI string to BSTR):
'   DIM bs AS AFX_BSTR
'   bs = AfxUcode(CHR(209, 229, 236, 229, 237), 1251)
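
For readers wondering what such a conversion involves, here is a rough sketch built on the Windows API function MultiByteToWideChar rather than on AfxUcode itself, whose internals are not shown in this thread. The helper name AnsiToWide and the fixed 64-char buffer are assumptions made for the example.

```freebasic
#include once "windows.bi"

' Hypothetical helper (not part of AFX): converts the bytes of an ANSI string,
' interpreted in the given code page, to UTF-16.
' Returns the number of code units written (0 on failure).
private function AnsiToWide(byref ansi as string, byval codePage as ulong, _
                            byval pwsz as wstring ptr, byval cchMax as long) as long
   if len(ansi) = 0 orelse cchMax < 1 then return 0
   dim as long n = MultiByteToWideChar(codePage, 0, strptr(ansi), len(ansi), pwsz, cchMax - 1)
   pwsz[n] = 0          ' add the terminating null
   return n
end function

dim as wstring * 64 wide
dim as string ansi = chr(209, 229, 236, 229, 237)   ' the five bytes from the sample above
dim as long n = AnsiToWide(ansi, 1251, @wide, 64)

' Show the resulting UTF-16 code units in hex.
for i as long = 0 to n - 1
   print hex(wide[i], 4) & " ";
next
print
```

With code page 1251 the five bytes map to Cyrillic letters in the &h0400 block; with another code page the same bytes give different characters, which is the readability problem being discussed.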


I think unicode, if used for international development, is never easy.

And developers (if any) need different solutions; they will use the direction that fits them best,
and it is better to have more than 1 choice when dealing with this difficult subject.

Last point, "...you're thinking with the mentality of a Linux user...":
that comment feels quite slighting to me, doesn't it?
I hope not; I do not think Linux people (which I'm not) are stupid either.
And in some cases utf8 support is missing in the Windows world, especially on the console (try to show a unicode char), but not in Linux.
Title: Re: Unicode usage precautions vs codepage
Post by: José Roca on July 22, 2016, 06:24:53 PM
I don't remember having called anybody stupid.

Frankly, I'm never going to use things like wstr(!"\u20AC"), not even CHR(something), or utf-8 to code string literals (and probably nobody else will).

Windows doesn't speak utf-8, but utf-16, and this is what I'm using.
Title: Re: Unicode usage precautions vs codepage
Post by: José Roca on July 22, 2016, 06:47:10 PM
Quote
Second possibility, using a code page with the CBSTR/CWSTR classes: this proposition is more complete, it also solves sharing source code, but it is still very difficult to read, as your sample of AfxUcode usage shows:

' Usage example (Russian ANSI string to BSTR):
'   DIM bs AS AFX_BSTR
'   bs = AfxUcode(CHR(209, 229, 236, 229, 237), 1251)

Using CSED, if you choose the Russian charset, you can do:


DIM cbs AS CBSTR = AfxUcode("Закрыть", 1251)
Button_SetText(hButton, cbs)


What the FB developers have to do is to add an optional code page parameter to functions like WSTR.
Title: Re: Unicode usage precautions vs codepage
Post by: José Roca on July 22, 2016, 07:44:01 PM
I have removed a constructor from the CWSTR class that was being called instead of the one that accepts a code page.

After removing it, we can do:


DIM cws AS CWSTR = 1251   ' Russian code page
cws = "Закрыть"
Button_SetText(hButton, cws)


--or--


DIM cws AS CWSTR = CWSTR("Закрыть", 1251)
Button_SetText(hButton, cws)


This means that you can use different code pages in the same application.
Title: Re: Unicode usage precautions vs codepage
Post by: José Roca on July 22, 2016, 08:39:08 PM
Ok. I have added support for UTF8 to both CBSTR and CWSTR.

Now you can do:


DIM cws AS CWSTR = CP_UTF8
cws = "Дми́трий Дми́триевич"
SetWindowText(hwnd, cws)


--or--


DIM cws AS CWSTR = CWSTR("Дми́трий Дми́триевич", CP_UTF8)
SetWindowText(hwnd, cws)


Does this make you happy? :)

I will need an UTF8 converter. Looks more strange to me than Russian :)
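
Since UTF-8 text is hard to judge by eye in an ANSI editor, it can help to write the bytes out explicitly. The sketch below assumes Windows and calls MultiByteToWideChar directly rather than the CWSTR constructor. It converts the UTF-8 bytes of two Cyrillic letters (U+0414 and U+043C) to UTF-16.

```freebasic
#include once "windows.bi"

' UTF-8 bytes for U+0414 (D0 94) and U+043C (D0 BC), written as explicit byte values
' so the example does not depend on how the source file is saved.
dim as string utf8 = chr(&hD0, &h94, &hD0, &hBC)
dim as wstring * 8 wide

dim as long n = MultiByteToWideChar(CP_UTF8, 0, strptr(utf8), len(utf8), @wide, 7)
wide[n] = 0   ' add the terminating null

print "bytes in : " & len(utf8)
print "units out: " & n
for i as long = 0 to n - 1
   print "&h" & hex(wide[i], 4)
next
```

Each of these letters takes two bytes in UTF-8 but a single UTF-16 code unit, so four bytes go in and two code units come out.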
Title: Re: Unicode usage precautions vs codepage
Post by: José Roca on July 22, 2016, 09:22:15 PM
Of course, if using a code page, we must pass variables to the functions with string parameters, e.g.


DIM cws AS CWSTR = CWSTR("Закрыть", 1251)   ' 1251, Russian code page
SetWindowText hwnd, cws


and not


SetWindowText hwnd, "Закрыть"


But, hey, now you can use CP_UTF8 as the code page and an UTF8 encoded string.

And we can also do things like:


DIM cws AS CWSTR = "Josй "
DIM cws2 AS CWSTR = CWSTR("Закрыть", 1251)
cws = cws & cws2 & " Roca"
SetWindowText hwnd, cws


which, when the source is viewed with the default charset, looks like:


DIM cws AS CWSTR = "José "
DIM cws2 AS CWSTR = CWSTR("Çàêðûòü", 1251)
cws = cws & cws2 & " Roca"
SetWindowText hwnd, cws


mixing two strings that use different code pages.
Title: Re: Unicode usage precautions vs codepage
Post by: José Roca on July 22, 2016, 10:11:32 PM
This is what I'm going to put in the help file:

CWSTR and CBSTR are classes that implement dynamic unicode data types. Free Basic has a dynamic string data type (STRING) and a fixed length unicode data type (WSTRING). What it lacks are dynamic unicode strings. CBSTR uses Windows BSTRs and is slower than CWSTR, which uses a dynamic buffer. Therefore, its use should be reserved for COM programming and for cases that need unicode strings with embedded nulls.

CBSTR and CWSTR behave almost as if they were native data types, working directly with most intrinsic Free Basic string functions and operators. A few functions, such as LEFT, RIGHT and VAL, are exceptions: they need a double indirection, i.e. LEFT(**cws, 10), to pass a pointer to the string data. The reason these functions don't work with, e.g., LEFT(cws, 10) is that they don't generate temporary strings, so the operators of the CBSTR and CWSTR classes aren't called.
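
A short sketch of the double indirection, assuming the AFX include path and the "#define unicode" line used in the example earlier in this thread; the sample text is made up.

```freebasic
#define unicode
#include once "windows.bi"
#include once "AFX/CWStr.inc"
using Afx

dim cws as CWSTR = "Unicode usage precautions"

' LEFT and RIGHT receive a pointer to the string data through the double indirection.
print left(**cws, 7)
print right(**cws, 11)
```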

They work transparently with Free Basic native strings and literals, e.g.


DIM cws AS CWSTR = "One"
DIM s AS STRING = "Three"
cws = cws & " Two " & s
PRINT cws


They can be used like native strings to call Windows API functions, e.g.


PRIVATE FUNCTION AfxGetWindowText (BYVAL hwnd AS HWND) AS CWSTR
   DIM nLen AS LONG = SendMessageW(hwnd, WM_GETTEXTLENGTH, 0, 0)
   DIM wszText AS CWSTR = SPACE(nLen + 1)
   SendMessageW(hwnd, WM_GETTEXT, nLen + 1, cast(LPARAM, *wszText))
   RETURN wszText
END FUNCTION


For using them with languages that don't use the Latin alphabet, you can specify the code page (CP_UTF8 is also supported):


DIM cws AS CWSTR = CWSTR("Закрыть", 1251)   ' 1251, Russian code page
SetWindowText hwnd, cws


Important remark: When returning a CBSTR or CWSTR as the result of a function, always use RETURN <variable name> and not FUNCTION = <variable name>. This is because RETURN and FUNCTION behave differently when returning temporary types with constructors.

When using RETURN <variable name>, the compiler correctly calls the constructor of the temporary type, allowing the class to copy the data of the string to be returned, and then calls the destructor of the copied CBSTR or CWSTR when the variable goes out of scope.

When using FUNCTION = <variable name>, the compiler first calls the destructor of the string to be copied and then the constructor of the new temporary type, making it impossible for the class to copy the data. Although it generally works with CBSTR strings, because by default Windows caches BSTRs that have been freed with SysFreeString, it will certainly crash when returning a CWSTR.
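
A minimal sketch of the recommended pattern, assuming the same AFX include as in the earlier example. MakeLabel is a hypothetical function written only for this illustration.

```freebasic
#define unicode
#include once "windows.bi"
#include once "AFX/CWStr.inc"
using Afx

' Hypothetical function, only to illustrate the recommended way to return a CWSTR.
private function MakeLabel(byval n as long) as CWSTR
   dim cws as CWSTR = "Item " & n
   return cws            ' recommended: lets the class copy the data before cws is destroyed
   ' function = cws      ' avoid: per the remark above, the data can no longer be copied
end function

print MakeLabel(3)
```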