Reading the Characters of a Wide String

Zen · June 02, 2016, 09:53:58 AM

Hi, MASMers,
I'm writing a program in which I must read the characters in a wide (Unicode) string that has been returned from a COM interface method.
The string has a format like this: "v4.0.30319". It represents the version of a .NET Framework Runtime installed on my computer. The rest of the program runs OK,...no problem. I assumed that this would be simple, so, I'm just trying to write a reliable, but, simple procedure. I've already completely screwed it up and I need help. :icon_eek:
Initially, I thought, I could just load the address of the string and read it two bytes at a time, using CMP with an immediate value and a jump based on the flag value. The way I did it didn't work correctly, so I switched to reading the string, byte by byte. And, this works. I did something like this:

Code:

     mov ebx, 0    
     mov esi, AlphaWideStr    ;    AlphaWideStr is the address of the wide string.    
     mov bl, BYTE PTR [esi]    ;    copy one byte of data from the beginning of wide string.    
     .IF bl==76h    ;    The character "v" is equivalent to 76h.    
     JMP CheckNext    
     .ENDIF    
     mov eax, 0    ;    return error code, if routine fails.    
     RET    
     
CheckNext: 
     mov ebx, 0    
     INC esi    
     mov bl, BYTE PTR [esi]    
     .IF bl==0
     JMP testOK 
     .ENDIF    
     mov eax, 0    ;     return error code, if routine fails.         
     RET  

testOK:
     mov eax, 40h    ;    Indicates success detecting the initial "v" character.      
     RET

...And, then I just check the return code for either zero (failure), or 40h (success). This works just fine, but, what I'd like to do is read the string two bytes at a time, so, initially I tried code like this:

Code:

     mov ebx, 0    
     mov esi, AlphaWideStr    
     mov ebx, [esi]     ;     Copy the first 4 bytes of the wide string into the register.    
     SHR ebx, 16    ;    Shift to right by 16 bits, leaving the first 2 bytes of the string characters in the register.    
     CMP bx, 7600h    ;    76h is "v". In unicode, the two bytes should look like this: 76 00    
     JZ testOK 
     mov eax, 0    ;    return error code, if routine fails.    
     RET    
testOK:
     mov eax, 40h    ;    64 in decimal. Indicates success detecting the initial "v" character.      
     RET

I tried several variants of the above code and I get a failure code returned each time. For instance, I shifted the initial 4 bytes by 24 bits and then compared just what should have been just the "v" character (76h), then using a JZ instruction, to jump to the success code. I'm obviously making an INCREDIBLY STUPID MISTAKE in my thinking. And, this is so simple. What am I doing wrong ???

jj2007 · June 02, 2016, 10:27:58 AM

It works, but why so complicated?

Code:

include \masm32\include\masm32rt.inc
__UNICODE__=1
.code
start:
  mov esi, chr$("v4.0.30319")	; AlphaWideStr A
  .if dword ptr [esi]==340076h
	print esi, " is version 4", 13, 10
  .elseif word ptr [esi]==76h
	print esi, " is another version", 13, 10
  .else
	print esi, " is not version 4", 13, 10
  .endif

  mov esi, chr$("v5.0.30319")	; AlphaWideStr B
  .if dword ptr [esi]==340076h
	print esi, " is version 4", 13, 10
  .elseif word ptr [esi]==76h
	print esi, " is another version", 13, 10
  .else
	print esi, " is not version 4", 13, 10
  .endif

  mov esi, chr$("X4.0.30319")	; AlphaWideStr C
  .if dword ptr [esi]==340076h
	inkey esi, " is version 4"
  .elseif word ptr [esi]==76h
	inkey esi, " is another version"
  .else
	inkey esi, " is not version 4"
  .endif

  exit

end start

Code:

     mov ebx, [esi]     ;     Copy the first 4 bytes of the wide string into the register.   
     SHR ebx, 16    ;    Shift to right by 16 bits, leaving the first 2 bytes of the string characters in the register.   
     CMP bx, 7600h    ;    76h is "v". In unicode, the two bytes should look like this: 76 00

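That shr is wrong: x86 is little-endian, so after mov ebx, [esi] the first character (0076h) is already in bx, and the shr moves the second character ("4", 0034h) there instead. Drop the shr and compare against 76h rather than 7600h: cmp bx, 76h should work.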
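To see the byte order at work, here is a small C sketch (C rather than MASM so it can be checked anywhere). It assembles the four bytes of "v4" the way a little-endian dword load would, then looks at both halves. The helper names load_dword_le, low_word and high_word are made up for this illustration and are not part of any library.

```c
#include <stdint.h>

/* Hypothetical helper: mimics what `mov ebx, [esi]` does on x86 by
   assembling four consecutive bytes into a 32-bit value, with the byte
   at the lowest address landing in the lowest 8 bits (little-endian). */
uint32_t load_dword_le(const unsigned char *p)
{
    return (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}

/* Low 16 bits of the loaded value: what BX holds right after the load. */
uint16_t low_word(uint32_t v)
{
    return (uint16_t)(v & 0xFFFFu);
}

/* What BX holds after `shr ebx, 16`: the second 16-bit unit, not the first. */
uint16_t high_word(uint32_t v)
{
    return (uint16_t)(v >> 16);
}

/* "v4" as it sits in memory as a wide string: 76 00 34 00 */
const unsigned char v4_bytes[4] = { 0x76, 0x00, 0x34, 0x00 };
```

With these bytes, load_dword_le(v4_bytes) gives 00340076h, which is the constant used in the .if dword ptr [esi]==340076h test above. The low word is 0076h ("v") and the high word is 0034h ("4"), so the original shr was comparing the second character against 7600h.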

Zen · June 02, 2016, 10:31:12 AM

JOCHEN,
The function will evolve. Eventually, I must compare two or three (or more) of these version strings to determine which wide string represents the most recent .NET Framework version installed on the user's computer (and, it must be version four, or greater). The strings were returned from ICLRMetaHost.EnumerateInstalledRuntimes, then, IEnumUnknown.Next, and, finally, ICLRRuntimeInfo.GetVersionString.

Quote from: JOCHEN
That shr is wrong: x86 is little-endian, i.e. the two bytes are already in bx: cmp bx, 76h should work.

AH,...HAH,...yes, that's the answer, THANKS. That little-endian/big endian stuff always drove me insane,...(they must have invented it just to destroy our brains.)

mabdelouahab · June 02, 2016, 03:01:31 PM

Quote from: Zen on June 02, 2016, 10:31:12 AM
... I must compare two or three (or more) of these version strings to determine which wide string represents the most recent .NET Framework version installed on the user's computer (and, it must be version four, or greater). The strings were returned from ICLRMetaHost.EnumerateInstalledRuntimes, then, IEnumUnknown.Next, and, finally, ICLRRuntimeInfo.GetVersionString.

crt_wcsncmp

Code:

	invoke crt_wcsncmp ,chr$("v4.0.30319"),chr$("v1.0.3705") ,10
	.IF sdword ptr eax > 0
		invoke crt_wprintf,cfm$("\n is: v4.0.30319")
	.ELSE
		invoke crt_wprintf,cfm$("\n is: v1.0.3705")
	.ENDIF
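One caveat with wcsncmp: it compares character by character, so a field with more digits can sort the wrong way (for example "v10" sorts before "v4" because "1" is less than "4"). A possible alternative is to compare the fields as numbers. The C sketch below is only an illustration under that assumption: compare_versions is a hypothetical helper, not a CRT or MASM32 function, and it expects well-formed strings of the form "v4.0.30319".

```c
#include <wchar.h>

/* Hypothetical helper, not part of the CRT or the MASM32 library.
   Compares two version strings such as L"v4.0.30319" field by field as
   numbers. Returns a negative value, 0, or a positive value, like wcsncmp.
   Assumes an optional leading 'v' followed by decimal fields separated
   by '.'; a missing trailing field counts as 0. */
int compare_versions(const wchar_t *a, const wchar_t *b)
{
    if (*a == L'v' || *a == L'V') a++;
    if (*b == L'v' || *b == L'V') b++;

    while (*a != L'\0' || *b != L'\0') {
        wchar_t *end_a, *end_b;
        /* wcstol stops at the first non-digit; an empty field reads as 0 */
        long fa = wcstol(a, &end_a, 10);
        long fb = wcstol(b, &end_b, 10);

        if (fa != fb)
            return (fa < fb) ? -1 : 1;

        /* No progress on either side means malformed input: stop. */
        if (end_a == a && end_b == b)
            break;

        /* Step over the '.' separator, if there is one. */
        a = (*end_a == L'.') ? end_a + 1 : end_a;
        b = (*end_b == L'.') ? end_b + 1 : end_b;
    }
    return 0;
}
```

For the two strings in the snippet above, both approaches agree that v4.0.30319 is the newer one. They differ only once a field grows to more digits than its counterpart.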

mineiro · June 02, 2016, 11:31:14 PM

I'm not sure, but I believe that in UTF-16 a symbol can take more than 2 bytes (4 bytes, when it is stored as a surrogate pair).

Code:

include \masm32\include\masm32rt.inc

.data
widechar db "%ws",00h
AlphaWideStr db 76h,00h,34h,00h,2Eh,00h,30h,00h,2Eh,00h,33h,00h,30h,00h,33h,00h,31h,00h,39h,00h,00h,00h
buffer db 120 dup (0)

.data?
szansi dd ?
houtput dd ?
temp dd ?

.code
start:

	invoke GetStdHandle,STD_OUTPUT_HANDLE
	mov houtput,eax

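	; First call: ask for the required buffer size in bytes (includes the terminating zero).
	invoke WideCharToMultiByte,CP_UTF8,0,addr AlphaWideStr,-1,0,0,0,0
	mov szansi,eax
	invoke WideCharToMultiByte,CP_UTF8,0,addr AlphaWideStr,-1,addr buffer,eax,0,0

	; WriteFile expects a pointer to the bytes-written variable, hence addr temp.
	invoke WriteFile,houtput,addr buffer,szansi,addr temp,0

	invoke wsprintf,addr buffer,addr widechar,addr AlphaWideStr
	invoke WriteFile,houtput,addr buffer,eax,addr temp,0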

    inkey
    exit

end start
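On the size question: in UTF-16 a character occupies either one 16-bit unit (2 bytes) or two units (4 bytes, a surrogate pair). A version string like "v4.0.30319" is plain ASCII, so every character is a single unit. The C sketch below shows how the two cases could be told apart; utf16_char_units and utf16_count_chars are hypothetical names chosen for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: number of 16-bit units (1 or 2) used by the
   character that starts at p. A high surrogate (D800h..DBFFh) followed
   by a low surrogate (DC00h..DFFFh) forms one 4-byte character. */
size_t utf16_char_units(const uint16_t *p)
{
    if (p[0] >= 0xD800 && p[0] <= 0xDBFF &&
        p[1] >= 0xDC00 && p[1] <= 0xDFFF)
        return 2;
    return 1;
}

/* Counts characters (code points) in a zero-terminated UTF-16 string.
   An unpaired surrogate is counted as one character. */
size_t utf16_count_chars(const uint16_t *s)
{
    size_t n = 0;
    while (*s != 0) {
        s += utf16_char_units(s);
        n++;
    }
    return n;
}
```

For the .NET version strings in this thread the count of characters and the count of 16-bit units are the same, so reading one WORD per character is safe there.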

hutch-- · June 03, 2016, 04:01:23 AM

Zen,

Have a look at the unicode library modules in the masm32 library. They all start with "uc" in the file list. Unicode is not difficult to work with, you just read a WORD at a time rather than a BYTE. Instead of incrementing the position 1 byte at a time, add 2 instead.
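The "read a WORD, then add 2" idea can be sketched in C as follows, with the matching assembly step noted in the comments. This is only an illustration: wide_len and starts_with_v_digit are made-up names, not routines from the masm32 library, and the esi register in the comments is an assumption about how the loop would be written in MASM.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: length of a zero-terminated wide string,
   counted in 16-bit units. */
size_t wide_len(const uint16_t *s)
{
    const uint16_t *p = s;
    while (*p != 0)     /* like: cmp word ptr [esi], 0 */
        p++;            /* like: add esi, 2 (one WORD forward) */
    return (size_t)(p - s);
}

/* Hypothetical helper: returns 1 if the string starts with 'v' (0076h)
   followed by an ASCII digit (0030h..0039h), otherwise 0. */
int starts_with_v_digit(const uint16_t *s)
{
    return s[0] == 0x0076 && s[1] >= 0x0030 && s[1] <= 0x0039;
}
```

Incrementing a uint16_t pointer advances the address by two bytes, which is the same step as adding 2 to the register that holds the string position.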

Zen · June 03, 2016, 05:15:31 AM

MABDELOUAHAB and HUTCH,
Excellent suggestions from both of you. THANKS.

For those of you MASM Forum members that are novices (or, like me, oblivious to reality),...I've found a number of useful webpages with explanations of the terms: big endian and little endian. It's really very simple,...I just wasn't thinking when I posted my original question.
This is the best, lengthiest, and clearest explanation: Understanding Big and Little Endian Byte Order.
Here is the official Microsoft explanation: Explanation of Big Endian and Little Endian Architecture, Microsoft Support
Here is the exhaustive Wikipedia page: Endianness
...And, here is the explanation from an Assembly language tutorial: Big Endian and Little Endian
