Unicode strings in Windows are UTF-16, so they can include surrogate pairs

Started by Marc Pons, July 18, 2016, 07:56:12 AM

Marc Pons

Hi,
Trying to simplify as much as possible (I hope everything is correct).

Unicode strings in Windows are encoded as UTF-16LE.
That means 2 bytes (one 16-bit code unit) for code points <= &hFFFF (65535),
but 4 bytes (two code units) for code points >= &h10000 (65536); such a pair of code units is known as a surrogate pair.
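
To make the 2-byte / 4-byte split concrete, here is a small sketch in Python (not the BASIC dialect discussed in this thread). The helper name utf16_units is made up for the illustration, and the three sample code points are arbitrary. It applies the standard UTF-16 arithmetic and compares the result with Python's own "utf-16-le" codec.

```python
def utf16_units(cp):
    """Return the UTF-16 code unit(s) for one code point (hypothetical helper).

    Assumes cp is a valid scalar value, i.e. not itself in D800..DFFF.
    """
    if cp <= 0xFFFF:
        return [cp]                      # one code unit, 2 bytes
    v = cp - 0x10000                     # 20-bit value to split in two halves
    high = 0xD800 + (v >> 10)            # high (lead) surrogate: top 10 bits
    low = 0xDC00 + (v & 0x3FF)           # low (trail) surrogate: bottom 10 bits
    return [high, low]                   # two code units, 4 bytes

for cp in (0x41, 0x20AC, 0x1F600):
    units = utf16_units(cp)
    raw = chr(cp).encode("utf-16-le")    # what the codec actually produces
    print(hex(cp), [hex(u) for u in units], len(raw), "bytes")
```

The first two code points come out as one unit (2 bytes); &h1F600 comes out as the pair &hD83D, &hDE00 (4 bytes).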

Even though these extended code points are not very frequent in normal usage, they can exist,
and in that case some functions that work on Unicode strings
have to be adapted to handle them (otherwise the risk is getting a bad character).

Functions that do not work "correctly" with these extended code points:

standard string functions (the list is probably not exhaustive):
len ; mid ; left ; right ; asc ; wchr

but also more sophisticated functions like
reverse, parse, parsecount, split ...

In fact, any operation that counts characters or uses a position in a string may be affected by surrogate pairs.
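
As a rough illustration of how len, left and reverse go wrong, the Python sketch below imitates functions that work on 16-bit code units. The helpers to_units and from_units and the sample string are assumptions made for this example only; they are not the BASIC functions listed above.

```python
def to_units(s):
    """Split a string into its UTF-16 code units (as integers)."""
    raw = s.encode("utf-16-le")
    return [int.from_bytes(raw[i:i + 2], "little") for i in range(0, len(raw), 2)]

def from_units(units):
    """Rebuild a string from code units; undecodable units are replaced."""
    raw = b"".join(u.to_bytes(2, "little") for u in units)
    return raw.decode("utf-16-le", errors="replace")

text = "a\U0001F600b"                     # 3 code points, one above &hFFFF
units = to_units(text)

unit_len = len(units)                     # what a code-unit based len reports
char_len = len(text)                      # number of code points
print(unit_len, char_len)                 # 4 3

naive_left = from_units(units[:2])        # "left 2" cuts the pair in half
naive_reverse = from_units(units[::-1])   # reverse swaps high and low surrogate
safe_left = text[:2]                      # code-point based: "a" + the full char

print(repr(naive_left), repr(naive_reverse), repr(safe_left))
```

Both naive results lose the character above &hFFFF, because a lone or swapped surrogate is not valid UTF-16. A surrogate-aware version has to treat the pair as one item.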

So, when working with Unicode through the normal functions, be sure
you are not using the extended code points (that is, you stay within UCS-2).
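
One possible way to verify this before calling the normal functions is to test the string first. This is only a sketch in Python, with hypothetical helper names: is_bmp_only checks code points, has_surrogates does the same check on raw 16-bit code units.

```python
def is_bmp_only(s):
    """True if every code point fits in one UTF-16 code unit (UCS-2 range)."""
    return all(ord(ch) <= 0xFFFF for ch in s)

def has_surrogates(units):
    """Same check on raw 16-bit code units: any value in D800..DFFF?"""
    return any(0xD800 <= u <= 0xDFFF for u in units)

print(is_bmp_only("h\u00e9llo \u20ac"))       # True
print(is_bmp_only("a\U0001F600"))             # False
print(has_surrogates([0x61, 0xD83D, 0xDE00])) # True
```

If the check fails, the string needs surrogate-aware handling instead of the plain functions.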

marc


Marc Pons

Hi,

To continue on the subject: UTF-16 and surrogate pairs.

Here is a link to a free Unicode font able to display the extended code points, those above the Basic Multilingual Plane (encoded as surrogate pairs):

http://unifoundry.com/unifont.html
You can use this one:
Glyphs above the Unicode Basic Multilingual Plane: unifont_upper-9.0.01.ttf (1 Mbyte)
or
Glyphs above the Unicode Basic Multilingual Plane with CSUR PUA Glyphs: unifont_upper_csur-9.0.01.ttf (1 Mbyte)

On that web page you can also browse the full set of Unicode characters:
GNU Unifont Glyphs Unicode Basic Multilingual Plane
or
GNU Unifont Glyphs Unicode Supplemental Multilingual Plane

Marc