Unicode characters

aw27 · July 16, 2017, 02:01:19 AM

"lengthof" gives the number of elements in an array.
Why do Unicode characters that are not representations of ASCII characters count twice for lengthof?

option casemap:none

includelib \masm32\lib64\msvcrt.lib
printf proto :vararg
includelib \masm32\lib64\kernel32.lib
ExitProcess   proto  :dword 

.data
one dw "Прощай"
two dw "abcdef"
three dw 'П','р','о','щ','а','й'
four dw 'a','b','c','d','e','f'

format db "%d",13,10,0
	
.code

main proc
	invoke printf, addr format, lengthof one
	invoke printf, addr format, lengthof two
	invoke printf, addr format, lengthof three
	invoke printf, addr format, lengthof four
	invoke ExitProcess, 0
main endp	
	
end

Results:
12
6
12
6

jj2007 · July 16, 2017, 05:05:23 AM

Uasm, I suppose? Others don't accept the syntax. Which editor?

Your setup seems a bit non-standard - it doesn't assemble. Here is a compatible version:

include \masm32\include\masm32rt.inc

; includelib \masm32\lib64\msvcrt.lib
; printf proto :vararg
; includelib \masm32\lib64\kernel32.lib
; ExitProcess   proto  :dword

.data
one dw "Прощай"
two dw "abcdef"
three dw 'П','р','о','щ','а','й'
four dw 'a','b','c','d','e','f'

format db "%d",13,10,0
	
.code

main proc
  mov eax, offset one
  invoke crt_printf, addr format, lengthof one
  invoke crt_printf, addr format, lengthof two
  invoke crt_printf, addr format, lengthof three
  invoke crt_printf, addr format, lengthof four
  invoke ExitProcess, 0
main endp	
	
end main

One odd thing is that with ascii mode in RichMasm, I get 6 6 6 6. Only with Utf8 enabled, I get your results.

aw27 · July 16, 2017, 02:51:13 PM

@Jochen
It is for 64-bit! ;)

Quote
One odd thing is that with ascii mode in RichMasm, I get 6 6 6 6.

It is expected, the characters you see will be converted to your local code page.

jj2007 · July 16, 2017, 06:03:11 PM

Quote from: aw27 on July 16, 2017, 02:51:13 PM
It is expected, the characters you see will be converted to your local code page.

one dw "??????"
two dw "abcdef"
three dw '?','?','?','?','?','?'
four dw 'a','b','c','d','e','f'

one dw "ÐŸÑ€Ð¾Ñ‰Ð°Ð¹"
two dw "abcdef"
three dw 'ÐŸ','Ñ€','Ð¾','Ñ‰','Ð°','Ð¹'
four dw 'a','b','c','d','e','f'

Right. Which editor are you using?
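
The two listings above can be reproduced outside the assembler. This is only a sketch: it assumes the ANSI code page involved is Windows-1252, which the thread does not state, and it uses Python's codecs rather than anything RichMasm or UASM actually does.

```python
# Assumption: the local ANSI code page is Windows-1252 (not stated in the thread).
text = "Прощай"

# Saved in ANSI mode: Cyrillic has no cp1252 mapping, so every letter becomes "?".
ansi_bytes = text.encode("cp1252", errors="replace")
print(ansi_bytes.decode("cp1252"), len(ansi_bytes))

# Saved as UTF-8 and then viewed as cp1252: two glyphs appear per letter.
utf8_bytes = text.encode("utf-8")
mojibake = utf8_bytes.decode("cp1252")
print(mojibake, len(utf8_bytes))
```

With six bytes in the first case and twelve in the second, the 6 versus 12 difference is already present in the file before the assembler reads it.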

aw27 · July 16, 2017, 07:47:19 PM

Quote from: jj2007 on July 16, 2017, 06:03:11 PM
Right. Which editor are you using?

Notepad++ :P

jj2007 · July 16, 2017, 08:37:32 PM

OK. Re your initial question: I would consider it a bug. If the syntax dw "a string", 0 is defined as "create a Unicode string in the .data section", then it should be Unicode also for plain Ascii strings. Btw ML doesn't digest this syntax.

aw27 · July 16, 2017, 09:39:00 PM

Quote from: jj2007 on July 16, 2017, 08:37:32 PM
OK. Re your initial question: I would consider it a bug. If the syntax dw "a string", 0 is defined as "create a Unicode string in the .data section", then it should be Unicode also for plain Ascii strings. Btw ML doesn't digest this syntax.

Agreed.

aw27 · July 17, 2017, 12:16:10 AM

Quote

4.9) String Literal Support

Wide character literal data can now be declared with:

awideStr dw "wide caption ",0

It is indeed a bug.

habran · July 17, 2017, 09:58:19 AM

Consider this:

one dw "Прощай",0                    ;"ÐŸÑ€Ð.Ñ.Ð°Ð."                 = { d09fd180d0bed189d0b0d0b9 }
two dw "abcdef",0                    ;"abcdef"                       = { 616263646566 }
three dw 'П','р','о','щ','а','й'     ;'ÐŸ', 'Ñ€','Ð.','Ñ.','Ð°','Ð.' = {d09f, d180, d0be, d189, d0b0, d0b9}
four dw 'a','b','c','d','e','f'      ;'a','b','c','d','e','f'        = { 61,62,63,64,65,66 }

As you can see, the text editor has to use 2 bytes to represent each character in "Прощай".
UASM sees and counts each byte for the length.
Any suggestions on how to persuade UASM to speak Russian?
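
The byte counts in the comments above are easy to check. A minimal sketch, assuming the source file is saved as UTF-8 (as the hex dump suggests):

```python
text = "Прощай"
utf8 = text.encode("utf-8")

print(len(text))   # number of characters: 6
print(len(utf8))   # number of UTF-8 bytes: 12
print(utf8.hex())  # d09fd180d0bed189d0b0d0b9

# Plain ASCII text needs one byte per character, so both counts agree.
ascii_text = "abcdef"
print(len(ascii_text), len(ascii_text.encode("utf-8")))
```

An assembler that counts bytes therefore reports 12 for the Cyrillic string and 6 for the ASCII one, which matches the results in the first post.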

habran · July 17, 2017, 10:22:53 AM

I can think of one way:
The highest Western character is 'z', which is 7Ah, so if characters are bigger than that we can presume the text is not Western and divide the length by 2. However, we then have to leave the bytes as they are, without the added '0', otherwise the string becomes junk, because UASM currently produces this data:

d0 00 9f 00 d1 00 80 00 d0 00 be 00 d1 00 89 00 d0 00 b0 00 d0 00 b9 00 0
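
To see why the dump above is junk, it can be compared with a real UTF-16 encoding of the same word. This is an illustration only; the widening step imitates the dump and is not taken from the UASM source.

```python
text = "Прощай"
utf8 = text.encode("utf-8")

# Imitation of the dump: every UTF-8 byte zero-extended to a 16-bit word.
widened = b"".join(bytes([b, 0]) for b in utf8)
print(widened.hex(" "))

# What a wide string actually needs: one UTF-16LE word per character here.
utf16 = text.encode("utf-16-le")
print(utf16.hex(" "))

# Reading the widened words back as UTF-16 gives 12 Latin-1 characters, not 6 Cyrillic ones.
print(len(widened.decode("utf-16-le")), len(utf16.decode("utf-16-le")))
```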

habran · July 17, 2017, 11:50:16 AM

I have succeeded in teaching UASM Russian ;)
Coming in the next release 8)

jj2007 · July 17, 2017, 04:39:46 PM

Hi Habran,

In the second case, elements separated by quotes, this logic could be used:

txUnicode1 dw "xx", "yy"  ; some word >255d
txUnicode2 dw "x", "y"  ; some word <=255d

In both cases... a WORD, not just a byte.

But I have no clear idea for the first case, one dw "Прощай",0 - this is always ambiguous :(

Besides, the editor will give you UTF-8, not UTF-16 (what Windows calls Unicode), so even the number of bytes will be ambiguous.
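
The ambiguity is easy to demonstrate: a UTF-8 character takes anywhere from one to four bytes, so a "divide by 2" rule only works for text made up entirely of two-byte characters. A small sketch with sample characters chosen for illustration (they are not from the thread, apart from the Cyrillic):

```python
samples = ["a", "П", "€", "😀"]

# (character, UTF-8 byte count, UTF-16 code unit count)
widths = [(ch, len(ch.encode("utf-8")), len(ch.encode("utf-16-le")) // 2)
          for ch in samples]
for row in widths:
    print(row)

# A mixed string breaks the heuristic: bytes // 2 is not the character count.
mixed = "Прощай, world"
guess = len(mixed.encode("utf-8")) // 2
print(guess, len(mixed))
```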

aw27 · July 17, 2017, 04:51:35 PM

Quote from: habran on July 17, 2017, 10:22:53 AM
I can think of one way:
The highest Western character is 'z', which is 7Ah, so if characters are bigger than that we can presume the text is not Western and divide the length by 2. However, we then have to leave the bytes as they are, without the added '0', otherwise the string becomes junk, because UASM currently produces this data:
d0 00 9f 00 d1 00 80 00 d0 00 be 00 d1 00 89 00 d0 00 b0 00 d0 00 b9 00 0

Since you are working in C, you can assume the string is UTF-8 and call MultiByteToWideChar when parsing that stuff in the data section.
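
The effect of that conversion can be sketched as follows. Python's codecs stand in here for MultiByteToWideChar with a UTF-8 code page, and `utf8_to_words` is a hypothetical helper name, not anything from UASM.

```python
import struct

def utf8_to_words(raw):
    """Hypothetical helper: decode UTF-8 source bytes into 16-bit UTF-16 code units."""
    utf16 = raw.decode("utf-8").encode("utf-16-le")
    return list(struct.unpack(f"<{len(utf16) // 2}H", utf16))

# The bytes the assembler would read from a UTF-8 source file.
source_bytes = "Прощай".encode("utf-8")
words = utf8_to_words(source_bytes)

print(len(words))                    # element count a dw array would then have
print([f"{w:04x}" for w in words])   # one word per Cyrillic letter
```

With this approach the element count for the Cyrillic string is 6, the same as for "abcdef". Characters outside the Basic Multilingual Plane would still take two words each.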

nidud · July 18, 2017, 01:08:49 AM

deleted

jj2007 · July 18, 2017, 05:01:33 AM

Quote from: nidud on July 18, 2017, 01:08:49 AM
To persuade in this case means to install a Russian version of Windows. Then everything becomes Russian, Uasm included.

Sounds familiar :P
http://masm32.com/board/index.php?topic=6221.msg66429#msg66429
http://masm32.com/board/index.php?topic=6252.msg66791#msg66791
http://masm32.com/board/index.php?topic=6275.msg67256#msg67256
http://masm32.com/board/index.php?topic=6206.msg66112#msg66112

The MASM Forum
