"lengthof" tells the number of elements in an array.
Why do Unicode characters that are not representations of ASCII characters count twice for lengthof?
option casemap:none
includelib \masm32\lib64\msvcrt.lib
printf proto :vararg
includelib \masm32\lib64\kernel32.lib
ExitProcess proto :dword
.data
one dw "Прощай"
two dw "abcdef"
three dw 'П','р','о','щ','а','й'
four dw 'a','b','c','d','e','f'
format db "%d",13,10,0
.code
main proc
invoke printf, addr format, lengthof one
invoke printf, addr format, lengthof two
invoke printf, addr format, lengthof three
invoke printf, addr format, lengthof four
invoke ExitProcess, 0
main endp
end
Results:
12
6
12
6
Uasm, I suppose? Others don't accept the syntax. Which editor?
Your setup seems a bit non-standard - it doesn't assemble. Here is a compatible version:
include \masm32\include\masm32rt.inc
; includelib \masm32\lib64\msvcrt.lib
; printf proto :vararg
; includelib \masm32\lib64\kernel32.lib
; ExitProcess proto :dword
.data
one dw "Прощай"
two dw "abcdef"
three dw 'П','р','о','щ','а','й'
four dw 'a','b','c','d','e','f'
format db "%d",13,10,0
.code
main proc
mov eax, offset one
invoke crt_printf, addr format, lengthof one
invoke crt_printf, addr format, lengthof two
invoke crt_printf, addr format, lengthof three
invoke crt_printf, addr format, lengthof four
invoke ExitProcess, 0
main endp
end main
One odd thing is that with ascii mode in RichMasm, I get 6 6 6 6. Only with Utf8 enabled, I get your results.
@Jochen
It is for 64-bit :icon_exclaim: ;)
Quote
One odd thing is that with ascii mode in RichMasm, I get 6 6 6 6.
It is expected, the characters you see will be converted to your local code page.
Quote from: aw27 on July 16, 2017, 02:51:13 PM
It is expected, the characters you see will be converted to your local code page.
one dw "??????"
two dw "abcdef"
three dw '?','?','?','?','?','?'
four dw 'a','b','c','d','e','f'
one dw "Прощай"
two dw "abcdef"
three dw 'П','р','о','щ','а','й'
four dw 'a','b','c','d','e','f'
Right. Which editor are you using?
Notepad++ :P
OK. Re your initial question: I would consider it a bug. If the syntax dw "a string", 0 is defined as "create a Unicode string in the .data section", then it should be Unicode also for plain Ascii strings. Btw ML doesn't digest this syntax.
Agreed. :t
Quote
4.9) String Literal Support
Wide character literal data can now be declared with:
awideStr dw "wide caption ",0
It is indeed a bug.
Consider this:
one dw "Прощай",0 ;"ÐŸÑ€Ð¾Ñ‰Ð°Ð¹" = { d09f d180 d0be d189 d0b0 d0b9 }
two dw "abcdef",0 ;"abcdef" = { 61 62 63 64 65 66 }
three dw 'П','р','о','щ','а','й' ;'ÐŸ','Ñ€','Ð¾','Ñ‰','Ð°','Ð¹' = { d09f, d180, d0be, d189, d0b0, d0b9 }
four dw 'a','b','c','d','e','f' ;'a','b','c','d','e','f' = { 61, 62, 63, 64, 65, 66 }
As you can see, the text editor has to use 2 bytes to represent each character in "Прощай".
UASM sees and counts each byte for the length.
Any suggestion how to persuade UASM to speak Russian?
I can think about one way:
Because the highest character in Western text is 'z', which is 7Ah, we can presume that any character above that is not Western and divide by 2 for the length.
However, we have to leave the bytes as they are, because they don't need the added '0'; otherwise the string becomes junk. Right now UASM produces this data:
d0 00 9f 00 d1 00 80 00 d0 00 be 00 d1 00 89 00 d0 00 b0 00 d0 00 b9 00 0
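To make the dump above concrete, here is a minimal C sketch of what seems to be happening, assuming the source file is UTF-8 and every byte of the literal gets widened to its own word. The helper name widen_bytes is hypothetical and is not taken from the UASM sources.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not UASM code): widen every source byte to one
   16-bit word, which is what the dump above suggests happens to dw "...". */
size_t widen_bytes(const unsigned char *src, size_t n, uint16_t *out)
{
    assert(src != NULL && out != NULL);
    for (size_t i = 0; i < n; i++)
        out[i] = src[i];   /* d0 becomes 00d0, stored little-endian as d0 00 */
    return n;              /* one element per byte, not per character */
}
```

Fed the 12 UTF-8 bytes of "Прощай", this returns 12 elements, which matches the lengthof result, while "abcdef" is 6 bytes and so gives 6.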
I have succeeded to teach UASM Russian ;)
Coming in next release 8)
Hi Habran,
In the second case, elements separated by quotes, this logic could be used:
txUnicode1 dw "xx", "yy" ; some word >255d
txUnicode2 dw "x", "y" ; some word <=255d
In both cases... a WORD, not just a byte.
But I have no clear idea for the first case, one dw "Прощай",0 - this is always ambiguous :(
Besides, the editor will give you UTF-8, not UTF-16 ("Unicode" in Windows terms), so even the number of bytes will be ambiguous.
Quote from: habran on July 17, 2017, 10:22:53 AM
I can think about one way:
Because the highest character in Western text is 'z', which is 7Ah, we can presume that any character above that is not Western and divide by 2 for the length.
However, we have to leave the bytes as they are, because they don't need the added '0'; otherwise the string becomes junk. Right now UASM produces this data:
d0 00 9f 00 d1 00 80 00 d0 00 be 00 d1 00 89 00 d0 00 b0 00 d0 00 b9 00 0
Since you are working in C, you could assume the string is UTF-8 and do a MultiByteToWideChar when parsing that stuff in the data section.
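For illustration, a portable C sketch of that conversion step. On Windows, MultiByteToWideChar with CP_UTF8 does this job; the hand-written decoder below is only a stand-in so the example builds anywhere. The name utf8_to_utf16 is hypothetical, and checks for overlong forms and encoded surrogates are left out for brevity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not UASM code): decode UTF-8 into UTF-16 code units.
   Returns the number of units written, or (size_t)-1 on malformed input
   or when the output buffer is too small. */
size_t utf8_to_utf16(const unsigned char *src, size_t n,
                     uint16_t *out, size_t cap)
{
    size_t i = 0, o = 0;
    assert(src != NULL && out != NULL);
    while (i < n) {
        unsigned char b = src[i];
        uint32_t cp;
        size_t extra;                      /* continuation bytes expected */
        if (b < 0x80)                    { cp = b;        extra = 0; }
        else if (b >= 0xC2 && b <= 0xDF) { cp = b & 0x1F; extra = 1; }
        else if (b >= 0xE0 && b <= 0xEF) { cp = b & 0x0F; extra = 2; }
        else if (b >= 0xF0 && b <= 0xF4) { cp = b & 0x07; extra = 3; }
        else return (size_t)-1;            /* stray continuation or invalid */
        if (extra > n - i - 1) return (size_t)-1;   /* truncated sequence */
        for (size_t k = 1; k <= extra; k++) {
            unsigned char c = src[i + k];
            if ((c & 0xC0) != 0x80) return (size_t)-1;
            cp = (cp << 6) | (uint32_t)(c & 0x3F);
        }
        i += extra + 1;
        if (cp > 0x10FFFF) return (size_t)-1;
        if (cp >= 0x10000) {               /* needs a surrogate pair */
            if (cap - o < 2) return (size_t)-1;
            cp -= 0x10000;
            out[o++] = (uint16_t)(0xD800 | (cp >> 10));
            out[o++] = (uint16_t)(0xDC00 | (cp & 0x3FF));
        } else {
            if (o >= cap) return (size_t)-1;
            out[o++] = (uint16_t)cp;
        }
    }
    return o;
}
```

With the bytes d0 9f d1 80 d0 be d1 89 d0 b0 d0 b9 as input, this writes six words, 041F 0440 043E 0449 0430 0439, so a lengthof based on the converted data would be 6 for both the Russian and the ASCII string.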
Quote from: nidud on July 18, 2017, 01:08:49 AM
To persuade in this case means to install a Russian version of Windows. Then everything becomes Russian, Uasm included.
Sounds familiar :P
http://masm32.com/board/index.php?topic=6221.msg66429#msg66429
http://masm32.com/board/index.php?topic=6252.msg66791#msg66791
http://masm32.com/board/index.php?topic=6275.msg67256#msg67256
http://masm32.com/board/index.php?topic=6206.msg66112#msg66112
Hi Nidud,
I think we have solved that problem in the next release.
If you use Russian Windows then I suppose Western characters will be 2 bytes, so that problem would remain unsolved.
This way both can be used at the same time.
We can cover the first 3 UTF-8 cases:
1.) 00 to 7F hex (0 to 127): first and only byte of a sequence.
2.) 80 to BF hex (128 to 191): continuing byte in a multi-byte sequence.
3.) C2 to DF hex (194 to 223): first byte of a two-byte sequence.
4.) E0 to EF hex (224 to 239): first byte of a three-byte sequence.
5.) F0 to F4 hex (240 to 244): first byte of a four-byte sequence (F5 to FF never occur in valid UTF-8).
We hope it'll work as expected, we'll see. :biggrin:
The addition we've made doesn't affect normal cases, only bytes above 7Fh.
Hopefully, Johnsa will upload it today, if he gets some spare time 8)
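The byte ranges listed above can be sketched in C roughly like this. It is only an illustration of the classification, not the code that went into UASM; the names utf8_seq_len and utf8_count_chars are hypothetical. Bytes C0, C1 and F5 to FF are treated as invalid, since they never occur in well-formed UTF-8.

```c
#include <assert.h>
#include <stddef.h>

/* Length of the sequence introduced by lead byte b,
   or 0 if b cannot start a sequence. */
int utf8_seq_len(unsigned char b)
{
    if (b <= 0x7F)              return 1;  /* case 1: single byte         */
    if (b >= 0xC2 && b <= 0xDF) return 2;  /* case 3: two-byte sequence   */
    if (b >= 0xE0 && b <= 0xEF) return 3;  /* case 4: three-byte sequence */
    if (b >= 0xF0 && b <= 0xF4) return 4;  /* case 5: four-byte sequence  */
    return 0;                              /* case 2 (80..BF) or invalid  */
}

/* Count characters by counting lead bytes only; continuation
   bytes and invalid bytes are skipped rather than counted. */
size_t utf8_count_chars(const unsigned char *s, size_t n)
{
    size_t count = 0;
    assert(s != NULL);
    for (size_t i = 0; i < n; i++)
        if (utf8_seq_len(s[i]) != 0)
            count++;
    return count;
}
```

Counting this way gives 6 for the 12 bytes of "Прощай" and 6 for "abcdef", and mixed strings work too because ASCII bytes below 80h are left alone.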
Quote from: habran on July 18, 2017, 10:29:45 AM
If you use Russian windows than I suppose western characters will be 2 bytes, than that problem would remain unsolved
Which problem? "everything becomes Russian", no need for Western characters :bgrin:
I meant it is nice to be able to use mixed cases on either side 8)
Hopefully it'll work as I expected :greenclp: