AfxStr - Unicode String Functions

Started by José Roca, July 07, 2016, 01:35:30 AM


José Roca

Because there is some misleading information on the web about BSTRs, this is the description from Microsoft:

https://msdn.microsoft.com/en-us/library/windows/desktop/ms221069(v=vs.85).aspx

A BSTR (Basic string or binary string) is a string data type that is used by COM, Automation, and Interop functions. Use the BSTR data type in all interfaces that will be accessed from script.

C++


typedef WCHAR OLECHAR;
typedef OLECHAR* BSTR;
typedef BSTR* LPBSTR;


Remarks

A BSTR is a composite data type that consists of a length prefix, a data string, and a terminator. The following table describes these components.


Item            Description
Length prefix   A four-byte integer that contains the number of bytes in the following data string.
                It appears immediately before the first character of the data string.
                This value does not include the terminating null character.
Data string     A string of Unicode characters. May contain multiple embedded null characters.
Terminator      Two null characters.


A BSTR is a pointer. The pointer points to the first character of the data string, not to the length prefix.

BSTRs are allocated using COM memory allocation functions, so they can be returned from methods without concern for memory allocation.

The following code is incorrect:


BSTR MyBstr = L"I am a happy BSTR";


This code builds (compiles and links) correctly, but it will not function properly because the string does not have a length prefix. If you use a debugger to examine the memory location of this variable, you will not see a four-byte length prefix preceding the data string.

Instead, use the following code:


BSTR MyBstr = SysAllocString(L"I am a happy BSTR");


A debugger that examines the memory location of this variable will now reveal a length prefix containing the value 34. This is the expected value for a 17-byte single-character string that is converted to a wide-character string through the inclusion of the "L" string modifier. The debugger will also show a two-byte terminating null character (0x0000) that appears after the data string.
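The layout described above can be reproduced in portable C++ to check the arithmetic. This is only a sketch: make_bstr_like and bstr_like_byte_len are hypothetical helpers invented for illustration, the block is an ordinary byte vector rather than COM-allocated memory, and real code should keep using SysAllocString and SysStringByteLen.

```cpp
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical helper (not a Windows API): builds a memory block laid out as
// [4-byte byte count][UTF-16 data][2-byte null terminator].
std::vector<unsigned char> make_bstr_like(const std::u16string& text) {
    const std::uint32_t byte_len =
        static_cast<std::uint32_t>(text.size() * sizeof(char16_t));
    std::vector<unsigned char> block(4 + byte_len + 2, 0);  // last 2 bytes stay zero
    std::memcpy(block.data(), &byte_len, 4);                // length prefix
    std::memcpy(block.data() + 4, text.data(), byte_len);   // data string
    return block;
}

// Hypothetical helper: reads the prefix stored in the four bytes that sit
// immediately before the data pointer, which is where a BSTR pointer points.
std::uint32_t bstr_like_byte_len(const unsigned char* data) {
    std::uint32_t n = 0;
    std::memcpy(&n, data - 4, 4);
    return n;
}
```

For the 17-character string "I am a happy BSTR", passing block.data() + 4 to bstr_like_byte_len gives 34, and the whole block is 4 + 34 + 2 = 40 bytes.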

If you pass a simple Unicode string as an argument to a COM function that is expecting a BSTR, the COM function will fail.

José Roca

The problem is the multiple concatenations. Therefore, the way to improve the speed is, obviously, to reduce or eliminate them. We need to allocate a big enough buffer and replace its contents instead of concatenating strings. If the buffer is bigger than the final content, a fast way to shrink it is to call SysReallocStringLen which, because it does not need to allocate new memory in this case, will just change the length prefix of the BSTR. You did mention a string builder class...
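As a rough illustration of the "allocate once, overwrite, then shrink" idea, here is a sketch that uses std::u16string as a stand-in for the BSTR buffer. The function repeat_into_buffer and its parameters are hypothetical, and the final resize only plays the role that SysReallocStringLen would play for a real BSTR.

```cpp
#include <cstddef>
#include <string>

// Hypothetical example: writes `count` copies of `piece` into one
// preallocated buffer, then trims the buffer to the length actually used.
std::u16string repeat_into_buffer(const std::u16string& piece,
                                  std::size_t count,
                                  std::size_t buffer_chars) {
    std::u16string buffer(buffer_chars, u'\0');  // single allocation up front
    std::size_t used = 0;
    for (std::size_t i = 0; i < count; ++i) {
        if (used + piece.size() > buffer.size()) break;  // sketch: stop when full
        buffer.replace(used, piece.size(), piece);       // same-length overwrite in place
        used += piece.size();
    }
    buffer.resize(used);  // shrink to the real content (stand-in for SysReallocStringLen)
    return buffer;
}
```

Calling repeat_into_buffer(u"ab", 3, 100) yields "ababab". A real implementation would grow the buffer instead of stopping when it is full.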


Paul Squires

Quote from: Jose Roca on July 08, 2016, 03:18:23 AM
The problem is the multiple concatenations. Therefore, the way to improve the speed is, obviously, to reduce or eliminate them. We need to allocate a big enough buffer and replace its contents instead of concatenating strings. If the buffer is bigger than the final content, a fast way to shrink it is to call SysReallocStringLen which, because it does not need to allocate new memory in this case, will just change the length prefix of the BSTR. You did mention a string builder class...

I am working on the string builder class right now.

I do like the idea of allocating a larger buffer for CBSTR strings rather than setting them to exact lengths when they are created. FB strings have a built-in buffer (I'd have to look at the FBC code to see exactly how big) and they perform extremely well. The small amount of extra memory allocated to these buffers would be immaterial in the grand scheme of things.
Paul Squires
PlanetSquires Software

James Fuller


Is the focus here to use BSTRs for all Unicode needs in the form of a CBStr wrapper?

I did try Marc's uStringW and I would like to see it included in any benchmarks, along with native FB strings, for a speed comparison.

James
AfxClipLeft -> uswClipLeft

#define unicode
#include Once "windows.bi"
#define __VERBOSE_MODE__
#Include Once "Dyn_Wstring.bi"
Function uswClipLeft(uswMain As uStringW,Byval nCount As Long) As uStringW
    Dim As uStringW uswOut = uswMain
    If nCount <= 0 Then
        Return uswOut
    EndIf
    Dim As Long nLen = Len(uswMain)
    nCount = IIF(nLen < nCount,nLen,nCount)
    uswOut = Mid(uswMain,nCount + 1)
    Return uswOut
End Function
Dim As uStringW uswOne = uswClipLeft("abcdefghijk",4)
? uswOne

sleep


Marc Pons

Hi James,

To verify speed, you should use it without __VERBOSE_MODE__,
because printing will slow things down drastically. That verbose define is meant for debugging and for seeing behind the curtain...

You don't even need the UNICODE define, or windows.bi:


'#define unicode
'#include Once "windows.bi"
'#define __VERBOSE_MODE__
#Include Once "Dyn_Wstring.bi"
Function uswClipLeft(uswMain As uStringW,Byval nCount As Long) As uStringW
    Dim As uStringW uswOut = uswMain
    If nCount <= 0 Then
        Return uswOut
    EndIf
    Dim As Long nLen = Len(uswMain)
    nCount = IIF(nLen < nCount,nLen,nCount)
    uswOut = Mid(uswMain,nCount + 1)
    Return uswOut
End Function
Dim As uStringW uswOne = uswClipLeft("abcdefghijk",4)
? uswOne

sleep


And you could use the u_mid function directly.

You are only testing ASCII codes; it would be more fun with real Unicode values.
That is why I use the euro symbol in my tests, to be more representative...    \u20AC

Last remark: if speed is very important, you can also comment out the __U_CLEAN_MEM__ option in Dyn_Wstring.bi:
#ifndef __U_CLEAN_MEM__                       ' to "free" the remaining allocated memory when program ends
      '#define __U_CLEAN_MEM__                    ' if no used you can reduce around 2048 bytes on your executable
   #endif                                        ' to not compile, simply comment the define

The pseudo linked list will then not be active, so speed is better.
The type destructor is normally sufficient for automatic freeing (if some allocations remain, no problem, they will be cleaned up when the program exits).

I left that option in for "cleaner/nicer" coding only, and because I was not completely sure at the beginning how to do it.

José Roca

> Is the focus here to use BSTR's for all unicode needs in the form of a CBStr wrapper?

The class is lightweight and without complexities, and thanks to the use of temporary types, we can use it almost as if it was a native type.

It is not the class that causes speed problems, but string concatenations. And this problem happens whether you use a class or a native type. This is why Bob implemented a string builder class in PB.

In my first versions of the TypeLib Browser, I took the easy route of multiple string concatenations, and it was slow when parsing big type libraries such as Excel's. I wrote a procedure that used a global string of 1 MB (as it was only going to be used by this application, there was no need for further complications) and it suddenly became the fastest COM browser available.

In all languages, the MID statement or its equivalent is fast because it just replaces the contents, without memory allocations or reallocations (only using assembler or pointers to avoid the overhead of a function call can improve on it), whereas the LEFT, MID and RIGHT functions are slow because they create new temporary strings. So MID(s, 5, 4) = "Paul" is fast, but s = LEFT(s, 4) & "Paul" & MID(s, 9) is slow because it creates temporary strings and also has to allocate a new string to store their concatenation and deallocate the old one. If you do it repeatedly, or the strings are very big, it can become painfully slow.
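The same contrast can be sketched in C++, where std::string::replace with an equal-length argument is the nearest equivalent of the MID statement. The two function names are hypothetical, and both assume the string has at least 8 characters.

```cpp
#include <string>

// In-place overwrite, like the BASIC statement MID(s, 5, 4) = "Paul".
// BASIC position 5 is offset 4 here; the string keeps its length and buffer.
void overwrite_in_place(std::string& s) {
    s.replace(4, 4, "Paul");
}

// Rebuild by concatenation, like s = LEFT(s, 4) & "Paul" & MID(s, 9).
// Each substr call and each + creates a temporary string.
std::string rebuild_by_concat(const std::string& s) {
    return s.substr(0, 4) + "Paul" + s.substr(8);
}
```

Both turn "Hi, Josh here" into "Hi, Paul here"; only the second one allocates new strings along the way.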

Marc Pons

Jose, Paul ...

Excuse me in advance if my remarks are not relevant, or if I'm interfering too much.

As James is asking, but in a more brutal form: is it necessary to go through the whole BSTR story for Unicode?
(To me, BSTR does not seem such a fast or easy way.)

All the points you made about concatenation are true: it is time consuming. But in practice, the normal FB variable-length string type already does the job quite well, simply by not allocating/reallocating for each byte; it does it in steps (of 32, as I remember).
I have already dug a lot into the string manipulation functions to make my own lib for replace/split ... and compared it at that time with PowerBasic (comparable results for FB when using pointers),
and it is the same model I am experimenting with (in steps of 16 in the post) in my uStringW. It could be changed to 32 or 48 or more (just to balance speed vs size); the real allocated size and the real length are included in the type definition to ease that purpose.
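The stepped-allocation model can be sketched as follows. ChunkedWideBuffer is a hypothetical type written for this illustration, not the actual uStringW definition; it only shows the bookkeeping idea of storing both the real length and the allocated size, and growing in multiples of a fixed step.

```cpp
#include <cstddef>
#include <cstring>
#include <memory>
#include <utility>

// Hypothetical type: grows its storage in steps of kStep characters and
// keeps the real length and the allocated size side by side.
struct ChunkedWideBuffer {
    static constexpr std::size_t kStep = 16;  // could be 32, 48, ...
    std::unique_ptr<char16_t[]> data;
    std::size_t length = 0;         // characters actually in use
    std::size_t allocated = 0;      // characters allocated
    std::size_t reallocations = 0;  // only here to make the effect visible

    void append(const char16_t* text, std::size_t count) {
        if (count == 0) return;
        if (length + count > allocated) {
            // Round the needed size up to the next multiple of kStep.
            const std::size_t needed = length + count;
            const std::size_t rounded = ((needed + kStep - 1) / kStep) * kStep;
            std::unique_ptr<char16_t[]> bigger(new char16_t[rounded]);
            if (length > 0) {
                std::memcpy(bigger.get(), data.get(), length * sizeof(char16_t));
            }
            data = std::move(bigger);
            allocated = rounded;
            ++reallocations;
        }
        std::memcpy(data.get() + length, text, count * sizeof(char16_t));
        length += count;
    }
};
```

Appending one character 20 times reallocates twice (to 16, then to 32 characters) instead of 20 times. A larger step trades memory for fewer reallocations.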

And a last point: if the goal is to work with real Unicode characters (not ANSI stored in 2 bytes), the complexity is not at that level; it is at the surrogate pair stage. That's why I added the information on how many surrogates are in the uStringW, to bypass that bottleneck when it is not needed.
Waiting for your answer.
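To make the surrogate point concrete, here is a small sketch of counting surrogate pairs in a UTF-16 string. The function names are hypothetical and this is not how uStringW stores its count; it only shows why a string with zero pairs can take the fast path, where one 16-bit unit is one character.

```cpp
#include <cstddef>
#include <string>

// Counts well-formed surrogate pairs: a high unit (0xD800-0xDBFF)
// immediately followed by a low unit (0xDC00-0xDFFF).
std::size_t count_surrogate_pairs(const std::u16string& s) {
    std::size_t pairs = 0;
    for (std::size_t i = 0; i + 1 < s.size(); ++i) {
        const char16_t hi = s[i];
        const char16_t lo = s[i + 1];
        if (hi >= 0xD800 && hi <= 0xDBFF && lo >= 0xDC00 && lo <= 0xDFFF) {
            ++pairs;
            ++i;  // skip the low half of the pair just counted
        }
    }
    return pairs;
}

// Each pair uses two 16-bit units for one code point.
std::size_t code_point_count(const std::u16string& s) {
    return s.size() - count_surrogate_pairs(s);
}
```

The euro symbol \u20AC is a single 16-bit unit, so it has no surrogate pair; a character outside the Basic Multilingual Plane, such as U+1F600, is stored as one pair.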

Note: I am sure BSTRs are needed for COM; I am just not sure they are relevant as a Unicode variable type.

José Roca

My main purpose is to use it with COM, so I need to work with real BSTRs. There is no problem passing Unicode content to APIs that simply expect a Unicode string, but COM needs BSTRs.

Paul Squires

Quote from: TechSupport on July 08, 2016, 09:20:48 AM

I am working on the string builder class right now.

I didn't get as much time today to work on the code as I had hoped, but I have just been able to finish the "add" portion of the string builder class. It uses overloaded functions to add CBSTR, Ansi/STRING, or WSTRING strings. It returns a CBSTR.

A simple test of 40,000 string concatenations of "Paul Squires" shows this dramatic difference:

CBSTR: 24.8125 seconds
FB Strings: 0 seconds
StringBuilder Class: 0.0117 seconds

Not bad so far.



Paul Squires

...and this is very preliminary code for the stringbuilder class. Most are just stubs/placeholders at this point. The Add is working.

Jose, take a look at the very last function. Did you say in another post that you wouldn't need to use <type> if assigning directly to a CBSTR as the return value of a function?

Edit: Final code can now be found in this thread: http://www.planetsquires.com/protect/forum/index.php?topic=3892.0

José Roca

> Did you say in another post that you wouldn't need to use <type> if assigning directly to a CBSTR as the return value of a function?

Yes, I did. As the returned type isn't a plain structure, but a class, both the constructor of the class and the LET operator (if the class has an overloaded LET operator) are called, so the constructor creates an empty BSTR and the LET operator deletes it and creates a new BSTR with the contents pointed to by the returned handle.

José Roca

I didn't know that returning a TYPE would cause a call to the constructor and LET operator of that class. Now that I know it, it seems logical, but I was not used to it, because returning a TYPE (structure) in PB just returns an array of bytes. Maybe if we were using the word CLASS instead of TYPE I would have figured it out sooner.

I wonder why some of the most useful features of this compiler aren't well explained or not explained at all.

José Roca

> A simple test of 40,000 string concatenations of "Paul Squires" shows this dramatic difference:

The bottlenecks are always the string concatenations.

Paul Squires

Jose, do these functions (AfxStr.inc) need to be modified again now that you have modified the CBSTR class?

José Roca

Oh, yes. It was mainly a test. The ones that are concatenation intensive will benefit from using the new string builder class.

Much of the code that I post is intended to exchange ideas and, very often, after I post it, other ideas come to mind. The "official" code is the one that I upload in the CWindow package.