
Title: Unicode characters
Post by: aw27 on July 16, 2017, 02:01:19 AM

"lengthof" tells the number of elements in an array.
Why do Unicode characters that are not representations of ASCII characters count twice for lengthof?

Code Select


option casemap:none

includelib \masm32\lib64\msvcrt.lib
printf proto :vararg
includelib \masm32\lib64\kernel32.lib
ExitProcess   proto  :dword 

.data
one dw "Прощай"
two dw "abcdef"
three dw 'П','р','о','щ','а','й'
four dw 'a','b','c','d','e','f'

format db "%d",13,10,0
	
.code

main proc
	invoke printf, addr format, lengthof one
	invoke printf, addr format, lengthof two
	invoke printf, addr format, lengthof three
	invoke printf, addr format, lengthof four
	invoke ExitProcess, 0
main endp	
	
end

Results:
12
6
12
6

Title: Re: Unicode characters
Post by: jj2007 on July 16, 2017, 05:05:23 AM

Uasm, I suppose? Others don't accept the syntax. Which editor?

Your setup seems a bit non-standard - it doesn't assemble. Here is a compatible version:

Code Select

include \masm32\include\masm32rt.inc

; includelib \masm32\lib64\msvcrt.lib
; printf proto :vararg
; includelib \masm32\lib64\kernel32.lib
; ExitProcess   proto  :dword

.data
one dw "Прощай"
two dw "abcdef"
three dw 'П','р','о','щ','а','й'
four dw 'a','b','c','d','e','f'

format db "%d",13,10,0
	
.code

main proc
  mov eax, offset one
  invoke crt_printf, addr format, lengthof one
  invoke crt_printf, addr format, lengthof two
  invoke crt_printf, addr format, lengthof three
  invoke crt_printf, addr format, lengthof four
  invoke ExitProcess, 0
main endp	
	
end main

One odd thing is that with ascii mode in RichMasm, I get 6 6 6 6. Only with Utf8 enabled, I get your results.

Title: Re: Unicode characters
Post by: aw27 on July 16, 2017, 02:51:13 PM

@Jochen
It is for 64-bit :icon_exclaim: ;)

Quote
One odd thing is that with ascii mode in RichMasm, I get 6 6 6 6.

It is expected, the characters you see will be converted to your local code page.

Title: Re: Unicode characters
Post by: jj2007 on July 16, 2017, 06:03:11 PM

Quote from: aw27 on July 16, 2017, 02:51:13 PM
It is expected, the characters you see will be converted to your local code page.

Code Select

one dw "??????"
two dw "abcdef"
three dw '?','?','?','?','?','?'
four dw 'a','b','c','d','e','f'

Code Select

one dw "ÐŸÑ€Ð¾Ñ‰Ð°Ð¹"
two dw "abcdef"
three dw 'ÐŸ','Ñ€','Ð¾','Ñ‰','Ð°','Ð¹'
four dw 'a','b','c','d','e','f'

Right. Which editor are you using?
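
Editorial sketch of the two listings above, in Python rather than assembly. It assumes the "local code page" is Windows-1252 (the thread does not say which one it was) and that the editor replaces unmappable characters with '?'. Under those assumptions, saving the text as ANSI yields six question marks, and viewing the UTF-8 bytes as Windows-1252 yields twelve unrelated characters.

```python
# Assumption: the local (ANSI) code page is Windows-1252.
text = "Прощай"

# Saving Cyrillic text in a code page without Cyrillic letters:
# every unmappable character is replaced by '?'.
as_ansi = text.encode("cp1252", errors="replace").decode("cp1252")

# Saving as UTF-8 but reading the bytes back as Windows-1252:
# each 2-byte UTF-8 sequence shows up as two unrelated characters.
as_mojibake = text.encode("utf-8").decode("cp1252")

print(as_ansi, len(as_ansi))
print(as_mojibake, len(as_mojibake))
```

The lengths (6 and 12) line up with the "6 6 6 6" result in ASCII mode and the "12" results with UTF-8 enabled.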

Title: Re: Unicode characters
Post by: aw27 on July 16, 2017, 07:47:19 PM

Quote from: jj2007 on July 16, 2017, 06:03:11 PM
Quote from: aw27 on July 16, 2017, 02:51:13 PM
It is expected, the characters you see will be converted to your local code page.

Code Select

one dw "??????"
two dw "abcdef"
three dw '?','?','?','?','?','?'
four dw 'a','b','c','d','e','f'

Code Select

one dw "ÐŸÑ€Ð¾Ñ‰Ð°Ð¹"
two dw "abcdef"
three dw 'ÐŸ','Ñ€','Ð¾','Ñ‰','Ð°','Ð¹'
four dw 'a','b','c','d','e','f'

Right. Which editor are you using?

Notepad++ :P

Title: Re: Unicode characters
Post by: jj2007 on July 16, 2017, 08:37:32 PM

OK. Re your initial question: I would consider it a bug. If the syntax dw "a string", 0 is defined as "create a Unicode string in the .data section", then it should be Unicode also for plain Ascii strings. Btw ML doesn't digest this syntax.

Title: Re: Unicode characters
Post by: aw27 on July 16, 2017, 09:39:00 PM

Quote from: jj2007 on July 16, 2017, 08:37:32 PM
OK. Re your initial question: I would consider it a bug. If the syntax dw "a string", 0 is defined as "create a Unicode string in the .data section", then it should be Unicode also for plain Ascii strings. Btw ML doesn't digest this syntax.

Agreed. :t

Title: Re: Unicode characters
Post by: aw27 on July 17, 2017, 12:16:10 AM

Quote

4.9) String Literal Support

Wide character literal data can now be declared with:

awideStr dw "wide caption ",0

It is indeed a bug.

Title: Re: Unicode characters
Post by: habran on July 17, 2017, 09:58:19 AM

Consider this:

Code Select


one dw "Прощай",0                    ;"ÐŸÑ€Ð.Ñ.Ð°Ð."                 = { d09fd180d0bed189d0b0d0b9 }
two dw "abcdef",0                    ;"abcdef"                       = { 616263646566 }
three dw 'П','р','о','щ','а','й'     ;'ÐŸ', 'Ñ€','Ð.','Ñ.','Ð°','Ð.' = {d09f, d180, d0be, d189, d0b0, d0b9}
four dw 'a','b','c','d','e','f'      ;'a','b','c','d','e','f'        = { 61,62,63,64,65,66 }

As you can see, the text editor has to use 2 bytes to represent each character in "Прощай".
UASM sees and counts each byte for the length.
Any suggestions on how to persuade UASM to speak Russian?
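
Editorial sketch in Python (not UASM code) of the counting described above, assuming the source file is saved as UTF-8. The helper name `utf8_byte_count` is made up for this illustration.

```python
def utf8_byte_count(s: str) -> int:
    """Number of bytes the string occupies when saved as UTF-8."""
    return len(s.encode("utf-8"))

for name, s in (("one", "Прощай"), ("two", "abcdef")):
    # len(s) counts characters; utf8_byte_count(s) counts bytes in the file.
    print(name, len(s), utf8_byte_count(s))
```

Each Cyrillic letter here takes two bytes in UTF-8, so a byte-wise count gives 12 for "Прощай" and 6 for "abcdef", matching the reported results.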

Title: Re: Unicode characters
Post by: habran on July 17, 2017, 10:22:53 AM

I can think of one way:
because the highest Western character is 'z', which is 7Ah, if characters are bigger than that we can
presume the text is not Western and divide by 2 for the length. However, we have to leave the bytes as they are, because they don't need the added '0'; otherwise it becomes junk, because UASM now produces this data:

Code Select

d0 00 9f 00 d1 00 80 00 d0 00 be 00 d1 00 89 00 d0 00 b0 00 d0 00 b9 00 0
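
Editorial sketch in Python of how a dump like the one above can arise: each UTF-8 byte is zero-extended to a 16-bit little-endian word. The helper `widen_bytes` is hypothetical and only mimics the observed output; it is not taken from the UASM source. The UTF-16LE encoding of the same text is printed for comparison.

```python
def widen_bytes(data: bytes) -> bytes:
    """Zero-extend every byte to a 16-bit little-endian word."""
    out = bytearray()
    for b in data:
        out += b.to_bytes(2, "little")
    return bytes(out)

utf8 = "Прощай".encode("utf-8")
widened = widen_bytes(utf8)

print(widened.hex(" "))                        # byte-wise widening
print("Прощай".encode("utf-16-le").hex(" "))   # actual UTF-16LE code units
```

The first line has 12 words and the second has 6, which is the factor of two seen in lengthof.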

Title: Re: Unicode characters
Post by: habran on July 17, 2017, 11:50:16 AM

I have succeeded in teaching UASM Russian ;)
Coming in next release 8)

Title: Re: Unicode characters
Post by: jj2007 on July 17, 2017, 04:39:46 PM

Hi Habran,

In the second case, elements separated by quotes, this logic could be used:

Code Select

txUnicode1 dw "xx", "yy"  ; some word >255d
txUnicode2 dw "x", "y"  ; some word <=255d

In both cases... a WORD, not just a byte.

But I have no clear idea for the first case, one dw "Прощай",0 - this is always ambiguous :(

Besides, the editor will give you UTF-8, not Unicode, so even the number of bytes will be ambiguous.

Title: Re: Unicode characters
Post by: aw27 on July 17, 2017, 04:51:35 PM

Quote from: habran on July 17, 2017, 10:22:53 AM
I can think of one way:
because the highest Western character is 'z', which is 7Ah, if characters are bigger than that we can
presume the text is not Western and divide by 2 for the length. However, we have to leave the bytes as they are, because they don't need the added '0'; otherwise it becomes junk, because UASM now produces this data:
Code Select

d0 00 9f 00 d1 00 80 00 d0 00 be 00 d1 00 89 00 d0 00 b0 00 d0 00 b9 00 0

Since you are working in C, you assume the string is UTF8 and do a MultiByteToWideChar when parsing that stuff in the data section.
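
Editorial sketch of that suggestion, written in Python for brevity. MultiByteToWideChar is the Win32 call named above; here `bytes.decode("utf-8")` followed by `encode("utf-16-le")` stands in for it. The function name `to_wide_words` is hypothetical, and this is not how UASM is known to implement it.

```python
import struct
from typing import List

def to_wide_words(literal: bytes) -> List[int]:
    """Treat the literal's bytes as UTF-8 and return UTF-16 code units."""
    utf16 = literal.decode("utf-8").encode("utf-16-le")
    # Unpack the little-endian byte string into 16-bit words.
    return list(struct.unpack("<%dH" % (len(utf16) // 2), utf16))

words = to_wide_words("Прощай".encode("utf-8"))
print([hex(w) for w in words], len(words))
```

With this conversion the Cyrillic literal yields 6 words, the same count as "abcdef".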

Title: Re: Unicode characters
Post by: nidud on July 18, 2017, 01:08:49 AM

deleted

Title: Re: Unicode characters
Post by: jj2007 on July 18, 2017, 05:01:33 AM

Quote from: nidud on July 18, 2017, 01:08:49 AM
To persuade in this case means to install a Russian version of Windows. Then everything becomes Russian, Uasm included.

Sounds familiar :P
http://masm32.com/board/index.php?topic=6221.msg66429#msg66429
http://masm32.com/board/index.php?topic=6252.msg66791#msg66791
http://masm32.com/board/index.php?topic=6275.msg67256#msg67256
http://masm32.com/board/index.php?topic=6206.msg66112#msg66112

Title: Re: Unicode characters
Post by: nidud on July 18, 2017, 05:39:31 AM

deleted

Title: Re: Unicode characters
Post by: habran on July 18, 2017, 10:29:45 AM

Hi Nidud,
I think we have solved that problem in the next release.
If you use Russian Windows, then I suppose Western characters will be 2 bytes, and that problem would remain unsolved.
This way both can be used at the same time.
We can cover the first 3 UTF-8 cases:
1.) 00 to 7F hex (0 to 127): first and only byte of a sequence.
2.) 80 to BF hex (128 to 191): continuing byte in a multi-byte sequence.
3.) C2 to DF hex (194 to 223): first byte of a two-byte sequence.
4.) ~~E0 to EF hex (224 to 239): first byte of a three-byte sequence.~~
5.) ~~F0 to F4 hex (240 to 244): first byte of a four-byte sequence.~~

We hope it'll work as expected; we'll see. :biggrin:
The addition doesn't affect normal cases, only bytes above 7Fh.
Hopefully, Johnsa will upload it today, if he gets some spare time 8)
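
Editorial sketch in Python of counting characters with only the first three byte classes listed above (single byte, continuation byte, two-byte lead). The names `classify` and `count_chars` are invented for this illustration; the actual UASM change is not shown in the thread. Three- and four-byte sequences are rejected here rather than guessed at.

```python
def classify(byte: int) -> str:
    """Classify one byte using only cases 1 to 3 from the list above."""
    if byte <= 0x7F:
        return "single"          # case 1: 00..7F
    if 0x80 <= byte <= 0xBF:
        return "continuation"    # case 2: 80..BF
    if 0xC2 <= byte <= 0xDF:
        return "lead2"           # case 3: C2..DF
    return "unsupported"         # cases 4 and 5 are not handled

def count_chars(data: bytes) -> int:
    """Count characters in UTF-8 data limited to 1- and 2-byte sequences."""
    count = 0
    i = 0
    while i < len(data):
        kind = classify(data[i])
        if kind == "single":
            i += 1
        elif (kind == "lead2" and i + 1 < len(data)
              and classify(data[i + 1]) == "continuation"):
            i += 2
        else:
            raise ValueError("unsupported or malformed byte at offset %d" % i)
        count += 1
    return count

print(count_chars("Прощай".encode("utf-8")))   # Cyrillic: 2 bytes per character
print(count_chars(b"abcdef"))                  # ASCII: 1 byte per character
print(count_chars("aПb".encode("utf-8")))      # mixed text
```

Because bytes up to 7Fh take the single-byte path unchanged, plain ASCII literals are counted exactly as before, and mixed text works in the same literal.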

Title: Re: Unicode characters
Post by: jj2007 on July 18, 2017, 05:01:25 PM

Quote from: habran on July 18, 2017, 10:29:45 AM
If you use Russian Windows, then I suppose Western characters will be 2 bytes, and that problem would remain unsolved.

Which problem? "everything becomes Russian", no need for Western characters :bgrin:

Title: Re: Unicode characters
Post by: habran on July 18, 2017, 06:10:56 PM

I meant it is nice to be able to use mixed cases on either side 8)
Hopefully it'll work as I expected :greenclp:

Title: Re: Unicode characters
Post by: nidud on July 19, 2017, 12:10:00 AM

deleted

The MASM Forum

64 bit assembler => UASM Assembler Development => Topic started by: aw27 on July 16, 2017, 02:01:19 AM