Author Topic: Unicode characters  (Read 9878 times)

aw27

  • Guest
Unicode characters
« on: July 16, 2017, 02:01:19 AM »
"lengthof" gives the number of elements in an array.
Why do Unicode characters that have no ASCII representation count twice in lengthof?

Code: [Select]
option casemap:none

includelib \masm32\lib64\msvcrt.lib
printf proto :vararg
includelib \masm32\lib64\kernel32.lib
ExitProcess   proto  :dword

.data
one dw "Прощай"
two dw "abcdef"
three dw 'П','р','о','щ','а','й'
four dw 'a','b','c','d','e','f'

format db "%d",13,10,0

.code

main proc
invoke printf, addr format, lengthof one
invoke printf, addr format, lengthof two
invoke printf, addr format, lengthof three
invoke printf, addr format, lengthof four
invoke ExitProcess, 0
main endp

end

Results:
12
6
12
6

jj2007

  • Member
  • *****
  • Posts: 13957
  • Assembly is fun ;-)
    • MasmBasic
Re: Unicode characters
« Reply #1 on: July 16, 2017, 05:05:23 AM »
Uasm, I suppose? Others don't accept the syntax. Which editor?

Your setup seems a bit non-standard - it doesn't assemble. Here is a compatible version:
Code: [Select]
include \masm32\include\masm32rt.inc

; includelib \masm32\lib64\msvcrt.lib
; printf proto :vararg
; includelib \masm32\lib64\kernel32.lib
; ExitProcess   proto  :dword

.data
one dw "Прощай"
two dw "abcdef"
three dw 'П','р','о','щ','а','й'
four dw 'a','b','c','d','e','f'

format db "%d",13,10,0

.code

main proc
  mov eax, offset one
  invoke crt_printf, addr format, lengthof one
  invoke crt_printf, addr format, lengthof two
  invoke crt_printf, addr format, lengthof three
  invoke crt_printf, addr format, lengthof four
  invoke ExitProcess, 0
main endp

end main

One odd thing is that with ascii mode in RichMasm, I get 6 6 6 6. Only with Utf8 enabled, I get your results.

aw27

Re: Unicode characters
« Reply #2 on: July 16, 2017, 02:51:13 PM »
@Jochen
It is for 64-bit  :icon_exclaim:  ;)

Quote
One odd thing is that with ascii mode in RichMasm, I get 6 6 6 6.
It is expected, the characters you see will be converted to your local code page.
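The code page effect can be sketched outside the assembler. The snippet below is only an illustration in Python, and the two code pages (cp1252 and cp1251) are assumptions; the thread does not say which code page RichMasm used. Either way, each Cyrillic letter ends up as a single byte, which would explain 6 6 6 6.

```python
# Sketch: what an editor in a single-byte "ASCII" mode might hand to the assembler.
# The code pages below are assumptions chosen for illustration.
text = "Прощай"

# A Western code page has no Cyrillic letters, so each one becomes '?'.
western = text.encode("cp1252", errors="replace")

# A Cyrillic code page maps each letter to exactly one byte.
cyrillic = text.encode("cp1251")

print(western, len(western))
print(cyrillic.hex(" "), len(cyrillic))
```

In both cases the string is six bytes long, so a byte-counting lengthof reports 6.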

jj2007

Re: Unicode characters
« Reply #3 on: July 16, 2017, 06:03:11 PM »
Quote
It is expected, the characters you see will be converted to your local code page.

Code: [Select]
one dw "??????"
two dw "abcdef"
three dw '?','?','?','?','?','?'
four dw 'a','b','c','d','e','f'

Code: [Select]
one dw "Прощай"
two dw "abcdef"
three dw 'П','р','о','щ','а','й'
four dw 'a','b','c','d','e','f'

Right. Which editor are you using?

aw27

Re: Unicode characters
« Reply #4 on: July 16, 2017, 07:47:19 PM »
Notepad++  :P

jj2007

Re: Unicode characters
« Reply #5 on: July 16, 2017, 08:37:32 PM »
OK. Re your initial question: I would consider it a bug. If the syntax dw "a string", 0 is defined as "create a Unicode string in the .data section", then it should be Unicode also for plain Ascii strings. Btw ML doesn't digest this syntax.

aw27

Re: Unicode characters
« Reply #6 on: July 16, 2017, 09:39:00 PM »
Agreed.  :t

aw27

Re: Unicode characters
« Reply #7 on: July 17, 2017, 12:16:10 AM »
Quote

4.9) String Literal Support

Wide character literal data can now be declared with:

awideStr dw "wide caption ",0



It is indeed a bug.



habran

  • Member
  • *****
  • Posts: 1228
    • uasm
Re: Unicode characters
« Reply #8 on: July 17, 2017, 09:58:19 AM »
Consider this:
Code: [Select]
one dw "Прощай",0                    ;"ÐŸÑ€Ð¾Ñ‰Ð°Ð¹"                 = { d09fd180d0bed189d0b0d0b9 }
two dw "abcdef",0                    ;"abcdef"                       = { 616263646566 }
three dw 'П','р','о','щ','а','й'     ;'ÐŸ','Ñ€','Ð¾','Ñ‰','Ð°','Ð¹' = { d09f, d180, d0be, d189, d0b0, d0b9 }
four dw 'a','b','c','d','e','f'      ;'a','b','c','d','e','f'        = { 61, 62, 63, 64, 65, 66 }
As you can see, the text editor uses 2 bytes to represent each character in "Прощай".
UASM sees and counts each byte for the length.
Any suggestions on how to persuade UASM to speak Russian?
Cod-Father
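The byte counts above can be checked independently of any assembler. This is a small Python sketch, assuming the source file is saved as UTF-8 (which the hex values in the post suggest); `utf8_info` is a hypothetical helper name.

```python
def utf8_info(text: str):
    """Return (character count, UTF-8 byte count, hex dump) for a string."""
    data = text.encode("utf-8")
    return len(text), len(data), data.hex()

# The two strings from the thread.
for text in ("Прощай", "abcdef"):
    chars, nbytes, dump = utf8_info(text)
    print(text, chars, nbytes, dump)
```

Both strings have 6 characters, but the Cyrillic one occupies 12 bytes in UTF-8, matching the 12 that lengthof reports.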

habran

Re: Unicode characters
« Reply #9 on: July 17, 2017, 10:22:53 AM »
I can think of one way:
since the highest character in Western text is 'z', which is 7Ah, if the characters are greater than that we can
presume the text is not Western and divide the length by 2. However, we then have to emit the bytes as they are, without the added '0' after each byte, otherwise the result is junk, because UASM currently produces this data:
Code: [Select]
d0 00 9f 00 d1 00 80 00 d0 00 be 00 d1 00 89 00 d0 00 b0 00 d0 00 b9 00 0
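To see why that dump is junk, here is a Python sketch (not UASM code) that zero-extends each UTF-8 byte to a 16-bit word, and compares it with what a UTF-16LE encoding of the same text would look like. That the target should be UTF-16LE is my assumption, based on "wide character" strings on Windows.

```python
utf8 = "Прощай".encode("utf-8")

# Zero-extend every UTF-8 byte to a 16-bit little-endian word.
# This reproduces the dump quoted above (minus the terminator).
widened = b"".join(bytes([b, 0]) for b in utf8)

# What a UTF-16LE string for the same text would contain instead.
utf16 = "Прощай".encode("utf-16-le")

print(widened.hex(" "))
print(utf16.hex(" "))
```

The first line has 12 words that do not correspond to any Cyrillic letters; the second has 6 words, one per character.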

habran

Re: Unicode characters
« Reply #10 on: July 17, 2017, 11:50:16 AM »
I have succeeded in teaching UASM Russian ;)
Coming in next release 8)

jj2007

Re: Unicode characters
« Reply #11 on: July 17, 2017, 04:39:46 PM »
Hi Habran,

In the second case, elements separated by quotes, this logic could be used:
Code: [Select]
txUnicode1 dw "xx", "yy"  ; some word >255d
txUnicode2 dw "x", "y"  ; some word <=255d

In both cases... a WORD, not just a byte.

But I have no clear idea for the first case, one dw "Прощай",0 - this is always ambiguous :(

Besides, the editor will give you UTF-8, not UTF-16 (wide "Unicode"), so even the number of bytes per character will be ambiguous.
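The ambiguity can be illustrated with a rough Python sketch of the "halve the length if anything is above 7Ah" idea suggested earlier in the thread. `naive_length` is a hypothetical helper written for this example only; it is not what UASM does.

```python
def naive_length(utf8: bytes) -> int:
    """Hypothetical heuristic: halve the byte count if any byte is above 7Ah."""
    count = len(utf8)
    return count // 2 if any(b > 0x7A for b in utf8) else count

# Pure ASCII, pure Cyrillic, mixed text, and a 3-byte character.
for text in ("abcdef", "Прощай", "Привет, world", "€100"):
    data = text.encode("utf-8")
    print(repr(text), len(text), len(data), naive_length(data))
```

The heuristic happens to work for pure ASCII and pure Cyrillic, but it miscounts mixed text and characters that need three or four UTF-8 bytes.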

aw27

Re: Unicode characters
« Reply #12 on: July 17, 2017, 04:51:35 PM »
Since you are working in C, you can assume the string is UTF-8 and call MultiByteToWideChar when parsing that stuff in the data section.
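The same idea, sketched in Python rather than C: decode the source bytes as UTF-8, then re-encode them as 16-bit words. This is only an analogue of a MultiByteToWideChar-style conversion, not the Win32 call itself, and `to_wide` is a hypothetical name.

```python
def to_wide(utf8: bytes) -> bytes:
    """Decode UTF-8 source bytes and re-encode as UTF-16LE plus a zero terminator."""
    return utf8.decode("utf-8").encode("utf-16-le") + b"\x00\x00"

# What the assembler would read from a UTF-8 source file.
source_bytes = "Прощай".encode("utf-8")

wide = to_wide(source_bytes)
elements = len(wide) // 2 - 1   # 16-bit elements, excluding the terminator

print(elements)
print(wide.hex(" "))
```

With this conversion the string from the first post yields 6 elements, the same as "abcdef".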

nidud

  • Member
  • *****
  • Posts: 2388
    • https://github.com/nidud/asmc
Re: Unicode characters
« Reply #13 on: July 18, 2017, 01:08:49 AM »
deleted
« Last Edit: February 26, 2022, 01:06:45 AM by nidud »