Icetips Article

Windows API: Returning Unicode data in C6 using BSTRINGs
2007-04-12 -- Vadim Berman
While the new C7 has full Unicode support, it is also possible to enable it
in prior versions using a bit of Windows API and direct memory access. 

This is not about a Unicode-enabled GUI (that is also possible, but the
window / control has to be rebuilt for it); it is simply about returning
BSTRING data in Unicode. This is particularly relevant when Clarion-created
COM objects are used in ASP pages or by other Unicode applications.

A little background on BSTRING and Unicode conversions. SoftVelocity has
courteously made the conversion of 1-byte strings (STRINGs, CSTRINGs,
PSTRINGs) to BSTRING seamless, so a simple assignment such as this works:

my1ByteStr    CSTRING(31)
myBStr        BSTRING
...
    CODE
...
    my1ByteStr = 'This is a test'
    myBStr = my1ByteStr

This assignment allocates enough space for the new string (length * 2 + 2),
copies the entire string byte by byte, initializing the other byte of each
2-byte character, and appends 2 NULLs to the end of the new BSTRING (unlike
regular zero-terminated strings, two NULLs are required to terminate it,
because some 2-byte characters contain 0 in either the high or the low
byte). You have to appreciate what SoftVelocity has done, because it usually
takes about 4-6 more error-prone lines in C/C++.
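
To give an idea of what that widening copy involves, here is a rough portable C sketch. It is not SoftVelocity's implementation, the name widen_bytes is made up for illustration, and it produces a plain wide string rather than a real BSTRING. It only mirrors the steps described above: allocate length * 2 + 2 bytes, copy each byte into a 2-byte character, and terminate with two NULLs.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Widen a 1-byte string to 2-byte characters with no code page conversion.
   Returns a malloc'ed buffer of len * 2 + 2 bytes; the caller frees it.
   widen_bytes is a hypothetical name used only for this sketch. */
uint16_t *widen_bytes(const char *src)
{
    assert(src != NULL);
    size_t len = strlen(src);
    uint16_t *dst = malloc(len * 2 + 2);   /* length * 2 + 2 bytes */
    if (dst == NULL)
        return NULL;
    for (size_t i = 0; i < len; i++)
        dst[i] = (unsigned char)src[i];    /* byte value kept, other byte is 0 */
    dst[len] = 0;                          /* one 2-byte zero = the two NULLs */
    return dst;
}
```

Note that every character code is taken over unchanged, which is fine for plain ASCII text such as 'This is a test'.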

What it doesn't do, however, is convert your characters to Unicode. The
assignment is a straight byte-for-byte copy. As a result, the codes of
national characters (Greek, Cyrillic, Arabic, etc.) end up on unrelated
Unicode characters, and the new BSTRING will look funny.
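
A small portable C sketch may make the difference visible. It assumes Windows-1251 (Cyrillic) input and covers only the letter range 0xC0..0xFF, which maps to U+0410..U+044F; the function names are made up, and a real conversion would use the full code page table.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Straight widening: every byte value is reused as the character code. */
void widen_plain(const unsigned char *src, size_t len, uint16_t *dst)
{
    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < len; i++)
        dst[i] = src[i];
}

/* Widening with conversion, for a slice of Windows-1251 only:
   ASCII stays as is, bytes 0xC0..0xFF become U+0410..U+044F.
   The remaining bytes (0x80..0xBF) are not covered by this sketch
   and are replaced with U+FFFD. */
void widen_cp1251(const unsigned char *src, size_t len, uint16_t *dst)
{
    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < len; i++) {
        unsigned char b = src[i];
        if (b < 0x80)
            dst[i] = b;
        else if (b >= 0xC0)
            dst[i] = (uint16_t)(0x0410 + (b - 0xC0));
        else
            dst[i] = 0xFFFD;
    }
}
```

For the byte 0xCF (the Cyrillic capital letter Pe in Windows-1251), the plain copy yields U+00CF, a Latin letter with a diaeresis, while the converting copy yields U+041F, the intended letter.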

Microsoft provides a function called MultiByteToWideChar that does roughly
the same as that assignment, plus conversion from the specified code page to
Unicode. (I haven't tested it with Asian languages, but I think it should
also work.) The only downside is that it can't handle Clarion BSTRINGs
directly, because it writes to a plain string of 2-byte characters rather
than a classic BSTRING. A classic BSTRING, on the other hand, is:

1. A 4-byte pointer to the actual set of 2-byte characters. When using
ADDRESS() on a BSTRING variable, this pointer is what it points to.

2. A 4-byte segment holding the length of the string, stored just before
the location the pointer refers to.

3. The actual set of 2-byte characters.
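
The layout can be sketched in portable C as follows. This is only a model: it assumes the standard COM BSTR convention (the prefix holds the length in bytes, excluding the terminator), it does not call the real allocation functions, and the bstr_like_* names are invented for this example.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Allocate a BSTR-like block for `chars` 2-byte characters:
   [4-byte length in bytes][characters][2-byte terminator].
   The returned pointer addresses the first character, not the prefix.
   Assumption: the prefix counts bytes, as in a standard COM BSTR. */
uint16_t *bstr_like_alloc(uint32_t chars)
{
    uint32_t bytes = chars * 2;
    unsigned char *block = calloc(1, 4 + (size_t)bytes + 2);
    if (block == NULL)
        return NULL;
    memcpy(block, &bytes, 4);            /* length prefix */
    return (uint16_t *)(block + 4);      /* pointer stored in the variable */
}

/* Read the length prefix stored just before the character data. */
uint32_t bstr_like_byte_len(const uint16_t *s)
{
    uint32_t bytes;
    assert(s != NULL);
    memcpy(&bytes, (const unsigned char *)s - 4, 4);
    return bytes;
}

/* Release the whole block, starting at the prefix. */
void bstr_like_free(uint16_t *s)
{
    if (s != NULL)
        free((unsigned char *)s - 4);
}
```

In this model, a variable of type uint16_t * plays the role of the BSTRING variable: its own address corresponds to what ADDRESS() returns, and the value read from that address is the pointer to the characters.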

Therefore, in order to use MultiByteToWideChar, we need to:

1. Allocate space for the newly created array.

2. Pass the pointer to the array, which we'll read from the 4-byte pointer.
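
The two steps follow the usual "ask for the size, then convert" pattern. Here is a portable C sketch of that pattern. The function to_wide is a hypothetical stand-in that only handles ASCII; it is not the real API. It imitates one documented convention of MultiByteToWideChar: when the output size is 0, nothing is written and the required number of wide characters is returned.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for MultiByteToWideChar, ASCII input only.
   With dst_chars == 0 it only reports how many wide characters are needed;
   otherwise it converts and returns the number written (0 on failure). */
int to_wide(const char *src, int src_len, uint16_t *dst, int dst_chars)
{
    assert(src != NULL && src_len >= 0);
    if (dst_chars == 0)
        return src_len;                  /* size query, nothing written */
    if (dst == NULL || dst_chars < src_len)
        return 0;                        /* buffer missing or too small */
    for (int i = 0; i < src_len; i++)
        dst[i] = (unsigned char)src[i];
    return src_len;
}

/* Step 1: ask for the length and allocate the array.
   Step 2: pass the pointer to the array for the actual conversion. */
uint16_t *convert(const char *src, int src_len, int *out_chars)
{
    assert(out_chars != NULL);
    int need = to_wide(src, src_len, NULL, 0);
    uint16_t *buf = calloc((size_t)need + 1, sizeof *buf); /* +1: terminator */
    if (buf == NULL)
        return NULL;
    *out_chars = to_wide(src, src_len, buf, need);
    return buf;
}
```

With a BSTRING, the allocation in step 1 has to go through the BSTRING variable itself, so that the length prefix and terminator are set up along with the array.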



Another minor problem is that the Clarion runtime does not hold information
about code pages; we only have PROP:Charset. But this is easy to solve by
either using TranslateCharsetInfo (which is a bit unreliable) or writing a
homebrew conversion procedure (not a big effort, so this is what I did). The
result is listed below:



!=============================================
ANSIToUnicode PROCEDURE(STRING pSrc,SHORT pCharset)
l:LenA        SIGNED            ! length of the source, in bytes
l:LenW        SIGNED            ! length of the result, in wide characters
l:CodePage    UNSIGNED(CP_ACP)
l:RVAddress   ULONG             ! address of the wide char data
l:RV          BSTRING
  CODE
  IF pSrc = ''
    RETURN pSrc
  END
  l:CodePage = CharsetToCodepage(pCharset)
  l:LenA = LEN(CLIP(pSrc))
  ! first call: no output buffer, only ask for the required length
  l:LenW = MultiByteToWideChar(l:CodePage, 0, ADDRESS(pSrc), l:LenA, 0, 0)
  IF l:LenW > 0
    l:RV = ALL(' ',l:LenW)  ! resize the BSTRING, there's no other way
    PEEK(ADDRESS(l:RV),l:RVAddress) ! get the address of the actual wide char string
    ! second call: convert straight into the BSTRING's character data
    MultiByteToWideChar(l:CodePage, 0, ADDRESS(pSrc), l:LenA, l:RVAddress, l:LenW)
  ELSE
    l:RV = pSrc
  END

  RETURN l:RV


!=============================================
! reference: http://www.treodesktop.com/codepages.htm
!            http://blogs.msdn.com/shawnste/archive/2006/09/29/list-of-ansi-code-pages-used-by-windows.aspx
CharsetToCodepage PROCEDURE(SHORT pCharset)
  CODE
  CASE pCharset
  OF CHARSET:SHIFTJIS
    RETURN 932  ! Japanese
  OF CHARSET:HANGEUL
    RETURN 949  ! Korean (Hangeul is a set of precombined Korean characters;
                ! about 4,260 unique Hanja characters exist)
  OF CHARSET:JOHAB
    RETURN 1361 ! Korean (Johab is a set of combinations between Hangul
                ! characters totaling to about 12,000)
                ! WARNING: not always supported
  OF CHARSET:GB2312
    RETURN 936  ! simplified Chinese - used in mainland China and Singapore
  OF CHARSET:CHINESEBIG5
    RETURN 950  ! traditional Chinese - used in Hong Kong and Taiwan
  OF CHARSET:GREEK
    RETURN 1253 ! Greek
  OF CHARSET:TURKISH
    RETURN 1254 ! Turkish
  OF CHARSET:HEBREW
    RETURN 1255 ! Hebrew, also suitable for Yiddish
  OF CHARSET:ARABIC
    RETURN 1256 ! Arabic and languages using Arabic scripts - Urdu
                ! (Pakistan), Persian (Iran)
  OF CHARSET:BALTIC
    RETURN 1257 ! Estonian, Latvian, Lithuanian
  OF CHARSET:CYRILLIC
    RETURN 1251 ! languages using Cyrillic scripts: Azerbaijani (sometimes
                ! can be Latin), Belarusian, Bulgarian, Macedonian, Kazakh,
                ! Kyrgyz, Mongolian, Russian, Serbian (Cyrillic), Ukrainian,
                ! Uzbek (sometimes can be Latin)
  OF CHARSET:THAI
    RETURN 874  ! Thai
  OF CHARSET:EASTEUROPE
    RETURN 1250 ! Central / Eastern European languages: Albanian, Croatian,
                ! Czech, Hungarian, Polish, Romanian, Serbian (Latin),
                ! Slovak, Slovenian
  OF CHARSET:DEFAULT
    RETURN CP_ACP ! system default charset / codepage
  END
  RETURN 1252   ! West European

!=============================================



The API function is prototyped like this:

    MultiByteToWideChar(long CodePage, long dwFlags, long lpMultiByteStr, long cbMultiByte, |
                        long lpWideCharStr, long cchWideCharStr),long,pascal