Unicode strings in Windows are UTF-16, so they can include surrogate pairs

Marc Pons · July 18, 2016, 07:56:12 AM

Hi,
I will try to simplify as much as possible (I hope everything is correct).

Unicode strings in Windows are encoded as UTF-16LE,
which means 2 bytes (one 16-bit code unit) for code points <= &hFFFF (65535),
but 4 bytes (two 16-bit code units) for code points >= &h10000 (65536); that two-unit form is known as a surrogate pair.
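
To make the 2-byte / 4-byte split concrete, here is a minimal sketch in Python (used only for illustration, it is not tied to any Windows API). The function name to_utf16_units is hypothetical; the arithmetic is the standard UTF-16 rule: subtract &h10000, put the top 10 bits in a high surrogate (&hD800..&hDBFF) and the low 10 bits in a low surrogate (&hDC00..&hDFFF).

```python
def to_utf16_units(code_point):
    """Return the list of 16-bit UTF-16 code units for one code point.

    Hypothetical helper, written only to illustrate the encoding rule.
    """
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside the Unicode range")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate values are not valid code points on their own")

    # Basic Multilingual Plane: one code unit (2 bytes)
    if code_point <= 0xFFFF:
        return [code_point]

    # Above the BMP: two code units (4 bytes), i.e. a surrogate pair
    offset = code_point - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return [high, low]


print([hex(u) for u in to_utf16_units(0x20AC)])   # ['0x20ac']
print([hex(u) for u in to_utf16_units(0x1F600)])  # ['0xd83d', '0xde00']
```

So U+20AC fits in one code unit, while U+1F600 needs the pair &hD83D, &hDE00.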

Even though these extended code points are not very frequent in normal usage, they can exist,
and in that case some functions that work with unicode strings
have to be adapted to accept that possibility (if not, the risk is to get a bad char, for example half of a surrogate pair).

Functions that do not work "correctly" with these extended code points:

the standard string functions (the list is probably not exhaustive):
len ; mid ; left ; right ; asc ; wchr

but also more sophisticated functions like
reverse, parse, parsecount, split ...

In fact, any operation that counts chars or uses a position in the string may be affected by surrogate pairs, because such functions usually count 16-bit code units and not code points.
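
A small Python sketch of the counting problem (again only an illustration; the helper names utf16_units, count_code_points and left_code_points are hypothetical, and this is not how any particular len or left function is implemented). It compares a naive count of code units with a count of code points, and shows a "left" that never cuts a surrogate pair in two.

```python
def utf16_units(text):
    """Return the UTF-16LE code units of a Python string as a list of ints."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]


def is_high(unit):
    return 0xD800 <= unit <= 0xDBFF


def is_low(unit):
    return 0xDC00 <= unit <= 0xDFFF


def count_code_points(units):
    """Count code points: every unit counts except the low half of a pair."""
    return sum(1 for u in units if not is_low(u))


def left_code_points(units, n):
    """Return the units of the first n code points without splitting a pair."""
    out = []
    i = 0
    taken = 0
    while i < len(units) and taken < n:
        if is_high(units[i]) and i + 1 < len(units) and is_low(units[i + 1]):
            out.extend(units[i:i + 2])   # keep both halves together
            i += 2
        else:
            out.append(units[i])
            i += 1
        taken += 1
    return out


sample = utf16_units("a\U0001F600b")      # 'a', U+1F600, 'b'
print(len(sample))                        # 4 code units
print(count_code_points(sample))          # 3 code points
print([hex(u) for u in sample[:2]])       # naive left 2: ['0x61', '0xd83d'] (broken pair)
print([hex(u) for u in left_code_points(sample, 2)])  # ['0x61', '0xd83d', '0xde00']
```

The naive slice returns a lone high surrogate, which is exactly the "bad char" case; the pair-aware version returns both halves.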

So, when you work with unicode using the normal functions, be sure
you are not using code points above &hFFFF (in other words, that you stay within UCS-2).

marc

Marc Pons · July 22, 2016, 04:07:41 PM

Hi,

To continue on the subject: UTF-16 and surrogate pairs.

Here is a link to a free unicode font able to display the extended code points, those above the Basic Multilingual Plane (encoded as surrogate pairs):

http://unifoundry.com/unifont.html
You can use this one:
Glyphs above the Unicode Basic Multilingual Plane: unifont_upper-9.0.01.ttf (1 Mbyte)
or
Glyphs above the Unicode Basic Multilingual Plane with CSUR PUA Glyphs: unifont_upper_csur-9.0.01.ttf (1 Mbyte)

On that web page you will also be able to see the full charts of unicode chars:
GNU Unifont Glyphs Unicode Basic Multilingual Plane
or
GNU Unifont Glyphs Unicode Supplementary Multilingual Plane

Marc

PlanetSquires Forums
