Unicode characters

Started by aw27, July 16, 2017, 02:01:19 AM


aw27

"lengthof" returns the number of elements in an array.
Why do Unicode characters that have no ASCII representation count twice for lengthof?


option casemap:none

includelib \masm32\lib64\msvcrt.lib
printf proto :vararg
includelib \masm32\lib64\kernel32.lib
ExitProcess   proto  :dword

.data
one dw "Прощай"
two dw "abcdef"
three dw 'П','р','о','щ','а','й'
four dw 'a','b','c','d','e','f'

format db "%d",13,10,0

.code

main proc
invoke printf, addr format, lengthof one
invoke printf, addr format, lengthof two
invoke printf, addr format, lengthof three
invoke printf, addr format, lengthof four
invoke ExitProcess, 0
main endp

end


Results:
12
6
12
6

jj2007

Uasm, I suppose? Others don't accept the syntax. Which editor?

Your setup seems a bit non-standard - it doesn't assemble. Here is a compatible version:
include \masm32\include\masm32rt.inc

; includelib \masm32\lib64\msvcrt.lib
; printf proto :vararg
; includelib \masm32\lib64\kernel32.lib
; ExitProcess   proto  :dword

.data
one dw "Прощай"
two dw "abcdef"
three dw 'П','р','о','щ','а','й'
four dw 'a','b','c','d','e','f'

format db "%d",13,10,0

.code

main proc
  mov eax, offset one
  invoke crt_printf, addr format, lengthof one
  invoke crt_printf, addr format, lengthof two
  invoke crt_printf, addr format, lengthof three
  invoke crt_printf, addr format, lengthof four
  invoke ExitProcess, 0
main endp

end main


One odd thing is that with ascii mode in RichMasm, I get 6 6 6 6. Only with Utf8 enabled, I get your results.

aw27

@Jochen
It is for 64-bit  :icon_exclaim:  ;)

Quote
One odd thing is that with ascii mode in RichMasm, I get 6 6 6 6.
It is expected, the characters you see will be converted to your local code page.
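
A minimal sketch of the code page effect, in Python rather than assembler. It assumes the editor re-encodes the source in a single-byte code page when it is not in UTF-8 mode; cp1251 and cp1252 are picked only as examples of a Cyrillic and a Western code page, and the variable names are made up for illustration.

```python
# "Прощай" written with escapes so the source encoding of this sketch does not matter.
word = "\u041f\u0440\u043e\u0449\u0430\u0439"

# A Cyrillic single-byte code page stores one byte per letter.
as_cp1251 = word.encode("cp1251")

# A Western code page has no Cyrillic letters, so each one is replaced by '?'.
as_cp1252 = word.encode("cp1252", errors="replace")

# UTF-8 needs two bytes for each of these letters.
as_utf8 = word.encode("utf-8")

print(len(as_cp1251), as_cp1252, len(as_utf8))
```

Under that assumption a single-byte code page would give 6 elements per string, while the UTF-8 source gives 12.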

jj2007

Quote from: aw27 on July 16, 2017, 02:51:13 PM
It is expected, the characters you see will be converted to your local code page.

one dw "??????"
two dw "abcdef"
three dw '?','?','?','?','?','?'
four dw 'a','b','c','d','e','f'


one dw "Прощай"
two dw "abcdef"
three dw 'П','р','о','щ','а','й'
four dw 'a','b','c','d','e','f'


Right. Which editor are you using?

aw27

Quote from: jj2007 on July 16, 2017, 06:03:11 PM
Quote from: aw27 on July 16, 2017, 02:51:13 PM
It is expected, the characters you see will be converted to your local code page.

one dw "??????"
two dw "abcdef"
three dw '?','?','?','?','?','?'
four dw 'a','b','c','d','e','f'


one dw "Прощай"
two dw "abcdef"
three dw 'П','р','о','щ','а','й'
four dw 'a','b','c','d','e','f'


Right. Which editor are you using?

Notepad++  :P

jj2007

OK. Re your initial question: I would consider it a bug. If the syntax dw "a string", 0 is defined as "create a Unicode string in the .data section", then it should be Unicode also for plain Ascii strings. Btw ML doesn't digest this syntax.

aw27

Quote from: jj2007 on July 16, 2017, 08:37:32 PM
OK. Re your initial question: I would consider it a bug. If the syntax dw "a string", 0 is defined as "create a Unicode string in the .data section", then it should be Unicode also for plain Ascii strings. Btw ML doesn't digest this syntax.
Agreed.  :t

aw27

Quote


4.9) String Literal Support

Wide character literal data can now be declared with:

awideStr dw "wide caption ",0




It is indeed a bug.



habran

Consider this:

one dw "Прощай",0                    ;"ПрÐ.Ñ.аÐ."                 = { d09fd180d0bed189d0b0d0b9 }
two dw "abcdef",0                    ;"abcdef"                       = { 616263646566 }
three dw 'П','р','о','щ','а','й'     ;'П', 'Ñ€','Ð.','Ñ.','а','Ð.' = {d09f, d180, d0be, d189, d0b0, d0b9}
four dw 'a','b','c','d','e','f'      ;'a','b','c','d','e','f'        = { 61,62,63,64,65,66 }

As you can see, the text editor has to use 2 bytes to represent each character in "Прощай".
UASM sees and counts each byte for the length.
Any suggestions on how to persuade UASM to speak Russian?
Cod-Father
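
The byte counts above can be checked outside the assembler. This is only an illustrative Python sketch, assuming the source file is saved as UTF-8; the names are hypothetical.

```python
# "Прощай" written with escapes so the source encoding of this sketch does not matter.
word = "\u041f\u0440\u043e\u0449\u0430\u0439"

utf8_bytes = word.encode("utf-8")                 # what a UTF-8 editor writes to disk
utf16_units = len(word.encode("utf-16-le")) // 2  # number of 16-bit words in UTF-16

print(len(word))         # characters
print(len(utf8_bytes))   # bytes seen by a tool that counts bytes
print(utf16_units)       # 16-bit elements a wide string would need
print(utf8_bytes.hex())  # same byte sequence as in the comment on "one"
```

A tool that counts bytes sees 12, while a wide-character string needs only 6 words.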

habran

I can think of one way:
because the highest character in Western text is 'z', which is 7Ah, if characters are bigger than that we can
presume the text is not Western and divide the length by 2. However, we have to leave the data as it is, because it doesn't need the added '0'; otherwise it becomes junk, because UASM currently produces this data:
d0 00 9f 00 d1 00 80 00 d0 00 be 00 d1 00 89 00 d0 00 b0 00 d0 00 b9 00 0

Cod-Father
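
For comparison, a small Python sketch that reproduces the dump above by zero-extending each UTF-8 byte to a word, next to what a real UTF-16LE encoding of the same text looks like. The helper name widen_bytes is made up, and this only models the output shown in the post, not UASM's actual code.

```python
# "Прощай" written with escapes so the source encoding of this sketch does not matter.
word = "\u041f\u0440\u043e\u0449\u0430\u0439"

def widen_bytes(text):
    """Zero-extend every UTF-8 byte to a 16-bit little-endian word."""
    out = bytearray()
    for b in text.encode("utf-8"):
        out += bytes((b, 0))
    return bytes(out)

widened = widen_bytes(word)       # 12 words, not valid UTF-16 text
proper = word.encode("utf-16-le")  # 6 words, one per character

print(widened.hex(" "))
print(proper.hex(" "))
```

The first line matches the dump (without the terminator); the second is what a wide string is expected to contain.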

habran

I have succeeded in teaching UASM Russian ;)
Coming in next release 8)
Cod-Father

jj2007

Hi Habran,

In the second case, elements separated by quotes, this logic could be used:
txUnicode1 dw "xx", "yy"  ; some word >255d
txUnicode2 dw "x", "y"  ; some word <=255d


In both cases... a WORD, not just a byte.

But I have no clear idea for the first case, one dw "Прощай",0 - this is always ambiguous :(

Besides, the editor will give you UTF-8, not Unicode, so even the number of bytes will be ambiguous.
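
A rough Python sketch of why the byte count is ambiguous. The function halved_length is only my reading of the "bigger than 7Ah, divide by 2" idea from earlier in the thread, not anything UASM implements, and the sample characters are arbitrary.

```python
def utf8_len(ch):
    """Number of bytes one character takes in UTF-8."""
    return len(ch.encode("utf-8"))

# ASCII letter, Cyrillic letter, euro sign, an emoji
samples = ["a", "\u041f", "\u20ac", "\U0001f600"]
sizes = [utf8_len(c) for c in samples]

def halved_length(text):
    """Hypothetical heuristic: halve the byte count if any byte is above 7Ah."""
    raw = text.encode("utf-8")
    return len(raw) // 2 if any(b > 0x7A for b in raw) else len(raw)

mixed = "ab\u041f"  # two ASCII letters plus one Cyrillic letter
print(sizes)
print(len(mixed), halved_length(mixed))
```

UTF-8 uses 1 to 4 bytes per character, so halving only works for strings made entirely of 2-byte characters; the mixed string has 3 characters but the heuristic reports 2.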

aw27

Quote from: habran on July 17, 2017, 10:22:53 AM
I can think about one way:
because the highest character in western is 'z' which is 7Ah if characters are bigger than that we can
presume  it is not western and divide by 2 for the length, however, we have to leave it as it is because it doesn't need added '0', otherwise it'll become a junk, because now UASM produces this data:
d0 00 9f 00 d1 00 80 00 d0 00 be 00 d1 00 89 00 d0 00 b0 00 d0 00 b9 00 0


Since you are working in C, you assume the string is UTF8 and do a MultiByteToWideChar when parsing that stuff in the data section.
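
The idea can be sketched in Python, with the built-in codecs standing in for the Win32 MultiByteToWideChar call (which UASM would call from C with the UTF-8 code page). The function name utf8_to_words is hypothetical, and the sketch assumes the quoted string in the source really is UTF-8.

```python
import struct

def utf8_to_words(raw):
    """Decode UTF-8 source bytes and return the UTF-16 code units as integers."""
    text = raw.decode("utf-8")      # fails loudly if the bytes are not UTF-8
    data = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))

# The 12 bytes of "Прощай" as they appear in a UTF-8 source file.
source_bytes = bytes.fromhex("d09fd180d0bed189d0b0d0b9")
words = utf8_to_words(source_bytes)

print(len(words))
print([hex(w) for w in words])
```

With this conversion done while parsing the data section, lengthof would count 6 words for the string, and plain ASCII text would pass through unchanged.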

nidud

#13
deleted

jj2007

Quote from: nidud on July 18, 2017, 01:08:49 AM
To persuade in this case means to install a Russian version of Windows. Then everything becomes Russian, Uasm included.

Sounds familiar :P
http://masm32.com/board/index.php?topic=6221.msg66429#msg66429
http://masm32.com/board/index.php?topic=6252.msg66791#msg66791
http://masm32.com/board/index.php?topic=6275.msg67256#msg67256
http://masm32.com/board/index.php?topic=6206.msg66112#msg66112