The MASM Forum

General => The Campus => Topic started by: JK on March 30, 2020, 02:42:00 AM

Title: Unicode literals
Post by: JK on March 30, 2020, 02:42:00 AM
Defining "real" wide string constants and using wide string literals in code seems not so easy. Of course there are macros for wide strings that take an ASCII literal like "this is a wide string" and convert it to its wide string representation. But what about "фывап" (Russian characters) or Chinese? Masm et al. don't like UTF16 encoded files, at least it keeps failing for me (with and without BOM). There is no ASCII representation of "фывап", you need a UTF16 or UTF8 encoded source file.

So I could use UTF8 in my editor, which is indeed accepted (without BOM). In this case everything outside quotes remains ASCII (no problem for the assembler) and UTF8 encoding is present only inside quoted strings. The problem with UTF8 is that macros for wide strings don't process literals like "фывап" properly, because the necessary conversion isn't ASCII -> UTF16, but UTF8 -> UTF16.

Are there already macros for this task, one for defining constants and one for use in expressions, or would I have to try it myself (MultiByteToWideChar...)?


JK
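
To make the problem concrete, here is a small Python sketch (not taken from any macro package; the helper names naive_widen and utf16_units are made up for illustration). It compares what an ASCII -> wide macro effectively does with a UTF8 encoded literal against a real UTF8 -> UTF16 conversion:

```python
def naive_widen(utf8_bytes):
    """Zero-extend every source byte to one 16-bit unit (ASCII -> wide behaviour)."""
    return [b for b in utf8_bytes]

def utf16_units(text):
    """Proper conversion: the list of UTF-16 code units for the text."""
    raw = text.encode("utf-16-le")
    return [int.from_bytes(raw[i:i + 2], "little") for i in range(0, len(raw), 2)]

literal = "фывап"
source_bytes = literal.encode("utf-8")   # what sits between the quotes in a UTF8 file

wrong = naive_widen(source_bytes)        # one unit per UTF-8 byte
right = utf16_units(literal)             # one unit per character here

print(len(wrong), [f"{u:04X}" for u in wrong])
print(len(right), [f"{u:04X}" for u in right])
```

The naive path yields ten units (two per Cyrillic letter, none of them the right code point), while the proper conversion yields five.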
Title: Re: Unicode literals
Post by: nidud on March 30, 2020, 05:27:10 AM
deleted
Title: Re: Unicode literals
Post by: jj2007 on March 30, 2020, 11:27:21 AM
include \masm32\MasmBasic\MasmBasic.inc         ; download (http://masm32.com/board/index.php?topic=94.0)
  Init
  PrintLine "Добро пожаловать"
  uMsgBox 0, "歡迎", "That looks Chinese:", MB_OK
EndOfCode
Title: Re: Unicode literals
Post by: JK on March 31, 2020, 04:04:11 AM
Thanks for your replies.

@nidud,

my post wasn't clear enough; of course there is an ANSI code page with Russian characters. But dealing with code pages is what I wanted to avoid; jj2007's example demonstrates much better what I meant.

@jj2007,

how is the source file encoded, UTF16 or UTF8, with or without BOM, and what assembler do you use?


JK
Title: Re: Unicode literals
Post by: nidud on March 31, 2020, 05:39:41 AM
deleted
Title: Re: Unicode literals
Post by: vitsoft on March 31, 2020, 08:38:41 AM
Quote from: JK on March 31, 2020, 04:04:11 AM
how is the source file encoded, UTF16 or UTF8, with or without BOM, and what assembler do you use?

I faced a similar challenge: the source code should be written in UTF-8 without BOM, and the strings will be emitted in wide (UTF-16) encoding if the assembler is told which source encoding is used. This is done in MASM with option codepage:CP_UTF8 and in €ASM with option EUROASM CodePage=UTF-8; see an example at https://euroassembler.eu/prowin64/cpmix64.htm
Title: Re: Unicode literals
Post by: JK on March 31, 2020, 08:47:14 AM
thanks nidud,

I see now: the code file for asmc can be UTF8, and by telling asmc that it is UTF8 encoded, all literals are converted to UTF16 in the generated code. So there is no need for extra macros or functions for UTF8 to UTF16 conversion, it's built into asmc.


JK
Title: Re: Unicode literals
Post by: JK on March 31, 2020, 09:01:13 AM
Quote from: vitsoft on March 31, 2020, 08:38:41 AM
This is done in MASM with option codepage:CP_UTF8

???, I cannot find an option "codepage" on Microsoft's MASM pages (https://docs.microsoft.com/en-us/cpp/assembler/masm/option-masm?view=vs-2019)


JK

Title: Re: Unicode literals
Post by: vitsoft on March 31, 2020, 09:06:20 AM
Quote from: nidud on March 31, 2020, 05:39:41 AM
The information the assembler needs is the code page of the string (UTF8 in this case):

    asmc64 -ws65001 test.asm
...
include winnls.inc
    option codepage:CP_UTF8

I thought nidud was talking about MASM, sorry.

Title: Re: Unicode literals
Post by: Mikl__ on March 31, 2020, 11:11:00 AM
Hi, JK!
My first post on the masm32 forum was "Ansii and unicode strings" (http://masm32.com/board/index.php?topic=717.msg10211#msg10211). I use a macro "du", which emits Latin and Cyrillic letters as Unicode characters:

du macro string
    local bslash
    bslash = 0                      ; set to 1 after the escape character /
    irpc c,<string>
        if bslash eq 0
            if '&c' eq "/"
                bslash = 1
            elseif '&c' gt 127
                db ('&c'- 0B0h),4   ; byte above 127: low byte, then high byte 04h
            else
                dw '&c'
            endif
        else
            bslash = 0
            if '&c' eq "n"
                dw 0Dh,0Ah          ; /n = CR LF
            elseif '&c' eq "/"
                dw '/'              ; // = literal slash
            elseif '&c' eq "r"
                dw 0Dh              ; /r = CR
            elseif '&c' eq "l"
                dw 0Ah              ; /l = LF
            elseif '&c' eq "s"
                dw 20h              ; /s = space
            elseif '&c' eq "c"
                dw 3Bh              ; /c = semicolon
            elseif '&c' eq "t"
                dw 9                ; /t = tab
            endif
        endif
    endm
    dw 0                            ; terminating zero
endm

for example

wHello: du <Hello, world>
wRusstring: du <фывап>
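
The db ('&c'- 0B0h),4 step only works if the source file is saved in an 8-bit Cyrillic code page. The post does not name one, so the Python sketch below assumes Windows-1251 and simply mirrors the arithmetic (du_unit is a made-up helper name):

```python
def du_unit(byte):
    """Mirror of the macro: bytes above 127 become (byte - 0B0h) as the low byte
    with 04h as the high byte; all other bytes are widened unchanged."""
    if byte > 127:
        return ((byte - 0xB0) & 0xFF) | (0x04 << 8)
    return byte

for ch in "фывап":
    byte = ch.encode("cp1251")[0]          # assumption: source saved as CP 1251
    unit = du_unit(byte)
    print(ch, f"{byte:02X}h -> {unit:04X}h", unit == ord(ch))
```

Under that assumption the formula matches for А..я (bytes C0h..FFh map to U+0410..U+044F), but not for Ё and ё, which sit at A8h and B8h in CP 1251.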
Title: Re: Unicode literals
Post by: jj2007 on March 31, 2020, 11:23:55 AM
Quote from: JK on March 31, 2020, 04:04:11 AM
@jj2007,

how is the source file encoded, UTF16 or UTF8, with or without BOM, and what assembler do you use?

UTF8, no BOM, Masm (6.15 and higher) or UAsm (recommended) or AsmC
Title: Re: Unicode literals
Post by: Adamanteus on March 31, 2020, 07:16:24 PM
Quote from: JK on March 30, 2020, 02:42:00 AM
Are there already macros for this task, one for defining constants and one for use in expressions, or would i have to try it myself (MultiByteToWideChar...)?

There is already a macro for CP 1251, but you still have to insert it into your own macros by hand:

CYROFFSET_A2WDAT MACRO argz:REQ
  LOCAL retval
  ifidni argz, <Ё>
    EXITM <+0359h>
  elseifidni argz, <ё>
    EXITM <+0399h>
  else
    retval EQU <>
    forc char, <АБВГДЕЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯабвгдежзийклмнопрстуфхцчшщъыьэюя>
      ifidni argz, <char>
        retval EQU <+0350h>
        exitm
      endif
    endm
  endif
  EXITM retval
ENDM
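
The three offsets can be checked against code page 1251 with a few lines of Python. This is only a sketch: it assumes the text returned by the macro is added to the CP 1251 byte value of the character, which the post implies but does not show, and cyr_offset and matches are made-up names.

```python
def cyr_offset(ch):
    """Offset the macro returns for one Cyrillic letter."""
    if ch == "Ё":
        return 0x0359
    if ch == "ё":
        return 0x0399
    return 0x0350

def matches(ch):
    """True if CP 1251 byte + offset equals the Unicode code point."""
    return ch.encode("cp1251")[0] + cyr_offset(ch) == ord(ch)

# Ё, ё and the 64 letters А..я listed in the macro
letters = "Ёё" + "".join(chr(c) for c in range(0x410, 0x450))
print(all(matches(ch) for ch in letters))
```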
Title: Re: Unicode literals
Post by: JK on April 01, 2020, 08:35:54 PM
Thanks for your help!

I'm not specifically interested in Russian characters; my goal is a generic solution for handling all kinds of Unicode text literals (e.g. Russian, Chinese, Greek, whatever).

So to summarize what I have learned:
- I cannot have UTF16 encoded files, it must be ASCII or UTF8.
- In ASMC I can set options (see above), which convert all literals (ASCII/UTF8 encoded) to UTF16 in generated code.
- In MASM/UASM there is no such option; I can pass UTF8 encoded files without BOM, but I have to add my own code for the conversion (UTF8 -> UTF16) I want.

Is this correct ?


JK
Title: Re: Unicode literals
Post by: jj2007 on April 01, 2020, 09:30:23 PM
Correct :thumbsup:
Title: Re: Unicode literals
Post by: JK on April 01, 2020, 10:34:15 PM
Thanks for the confirmation jj2007!


Playing with UASM, I found that this works (file encoding UTF8, no BOM):

...

option LITERALS:ON


.DATA
MsgBoxCaptionW  DW "Asian: 捉敺楬",0


.code


start proc

  invoke MessageBoxW, 0, WSTR("Russian: фывап"), ADDR MsgBoxCaptionW, 0
  invoke ExitProcess, 0

start endp


end start



JK
Title: Re: Unicode literals
Post by: hutch-- on April 01, 2020, 11:32:10 PM
If I need to produce true unicode, not just converted ascii, I use a unicode editor then turn it into db sequences. You then just feed the db data into a Unicode API (with the trailing "W") and it all works fine.
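
One way to get such db data without a special editor is a tiny script. The Python sketch below (the function name to_db_lines and the label are made up) prints the UTF-16LE bytes of a string as db lines, followed by a terminating zero word:

```python
def to_db_lines(label, text, per_line=8):
    """Return MASM-style db lines holding text as UTF-16LE plus a zero terminator."""
    data = text.encode("utf-16-le") + b"\x00\x00"
    lines = []
    for i in range(0, len(data), per_line):
        # leading 0 so every hex literal starts with a digit
        chunk = ", ".join(f"0{b:02X}h" for b in data[i:i + per_line])
        prefix = label if i == 0 else " " * len(label)
        lines.append(f"{prefix} db {chunk}")
    return "\n".join(lines)

print(to_db_lines("wRus", "фывап"))
```

The printed lines can be pasted into the .data section and the label passed to a W API.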
Title: Re: Unicode literals
Post by: nidud on April 02, 2020, 01:21:25 AM
deleted
Title: Re: Unicode literals
Post by: jj2007 on April 02, 2020, 01:43:37 AM
Print "Добро пожаловать" works with MASM, UAsm and AsmC, but of course there are solutions that look much more professional and challenging ;-)
Title: Re: Unicode literals
Post by: JK on April 02, 2020, 02:08:15 AM
Thanks for all your input.

My (self-written) IDE uses Scintilla as its edit control, as Notepad++ does too. So by switching the keyboard setting I can enter Russian or Greek letters directly and my editor window shows them correctly. Moreover, I can save the text/code in UTF8 encoding (with or without BOM) and pass it to an assembler and linker.

As my example (see my last post) shows, UASM can process (UTF8 encoded) Unicode literals, if "option LITERALS:ON" is set.

Please note, I don't want to promote UASM or any other assembler, I just wanted to know their possibilities and limitations - thanks.


JK
Title: Re: Unicode literals
Post by: daydreamer on April 20, 2020, 07:24:40 PM
I have a whitespace problem with UTF16 characters from 04000h upward.
I tried the wider space 03000h, but it is still not wide enough.
The problem is that I have ASCII
data "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",10,13
data "X                                                                X",10,13
data "XXXXXXXXXXXXX         XXXXXX           XXXXXX",10,13
as the source for a random generator that creates a random WCHAR of 4000h+ wherever there is an X and places 03000h wherever there is a " ".
Is there a space wide enough for the 4000h+ Chinese characters, or do I need to combine several different spaces to make it look right?