Unicode literals

Started by JK, March 30, 2020, 02:42:00 AM


JK

Defining "real" wide string constants and using wide string literals in code seems not so easy. Of course there are macros for wide strings like "this is a wide string", which convert the ASCII sequence to its wide string representation. But what about "фывап" (Russian characters) or Chinese? Masm et al. don't like UTF16 encoded files, at least assembling keeps failing for me (with and without BOM). There is no ASCII representation of "фывап", you need a UTF16 or UTF8 encoded source file.

So i could use UTF8 in my editor, which indeed is accepted (without BOM). In this case everything outside quotes remains ASCII (no problem for the assembler) and UTF8 encoding is present only inside quoted strings. The problem with UTF8 is that macros for wide strings don´t process literals like "фывап" properly, because the necessary conversion isn´t ASCII -> UTF16, but UTF8 -> UTF16.
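
To illustrate the mismatch described above, here is a minimal sketch in C (not from this thread, and C rather than MASM only to keep it short). It assumes a wide string macro simply zero-extends every source byte to 16 bits; `widen_bytes` is a hypothetical name. Under that assumption the two UTF8 bytes of "ф" (D1h 84h) come out as the two units 00D1h 0084h instead of the single unit 0444h.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Zero-extend each byte of src to one 16-bit unit in dst.
 * This is an assumed model of a byte-wise "ASCII -> wide" macro;
 * it is only correct for 7-bit ASCII input.
 * cap is the capacity of dst in 16-bit units, terminator included.
 * Returns the number of units written, not counting the terminator. */
size_t widen_bytes(const char *src, uint16_t *dst, size_t cap)
{
    size_t n = 0;
    assert(src != NULL && dst != NULL);
    if (cap == 0)
        return 0;
    while (src[n] != '\0' && n + 1 < cap) {
        dst[n] = (uint16_t)(unsigned char)src[n];
        n++;
    }
    dst[n] = 0; /* wide NUL terminator */
    return n;
}
```

Plain ASCII survives this unchanged ("A" becomes 0041h), which is why such a macro looks fine until a multi-byte UTF8 sequence appears inside the quotes.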

Are there already macros for this task, one for defining constants and one for use in expressions, or would i have to try it myself (MultiByteToWideChar...)?


JK

nidud

#1
deleted

jj2007

include \masm32\MasmBasic\MasmBasic.inc         ; download
  Init
  PrintLine "Добро пожаловать"
  uMsgBox 0, "歡迎", "That looks Chinese:", MB_OK
EndOfCode

JK

Thanks for your replies.

@nidud,

my post wasn´t clear enough, of course there is an ASCII codepage with Russian characters. But dealing with codepages is what i wanted to avoid, jj2007´s example demonstrates much better what i meant.

@jj2007,

how is the source file encoded, UTF16 or UTF8, with or without BOM, and what assembler do you use?


JK

nidud

#4
deleted

vitsoft

Quote from: JK on March 31, 2020, 04:04:11 AM
how is the source file encoded, UTF16 or UTF8, with or without BOM, and what assembler do you use?

I faced a similar challenge: the source code should be written in UTF-8 without BOM, and it will be emitted in wide (UTF-16) encoding if the assembler is told which source encoding is used. This is done in MASM with  option codepage:CP_UTF8 and in €ASM with option EUROASM CodePage=UTF-8, see an example at https://euroassembler.eu/prowin64/cpmix64.htm

JK

thanks nidud,

i see now, the code file for asmc can be UTF8 and by telling asmc that it is UTF8 encoded, all literals are converted to UTF16 in the generated code. So no need for extra macros or functions for UTF8 to UTF16 conversion, it's built into asmc.


JK

JK

QuoteThis is done in MASM with  option codepage:CP_UTF8

???, i cannot find an option "codepage" at Microsoft´s MASM pages (https://docs.microsoft.com/en-us/cpp/assembler/masm/option-masm?view=vs-2019)


JK


vitsoft

Quote from: nidud on March 31, 2020, 05:39:41 AM
The information the assembler needs is the code page of the string (UTF8 in this case):

    asmc64 -ws65001 test.asm
...
include winnls.inc
    option codepage:CP_UTF8

I thought that nidud was talking about MASM, sorry.


Mikl__

Hi, JK!
It was my first post in the masm32 forum, "Ansii and unicode strings". I use the macro "du", which displays Latin and Cyrillic letters as Unicode characters:

du macro string
    local bslash
    bslash = 0
    irpc c,<string>
        if bslash eq 0
            if '&c' eq "/"
                bslash = 1              ; "/" starts an escape sequence
            elseif '&c' gt 127
                ; assumes a CP1251 source file: byte C0h..FFh -> U+0410..U+044F
                db ('&c'- 0B0h),4       ; low byte, then high byte 04h
            else
                dw '&c'                 ; plain ASCII, zero-extended
            endif
        else
            bslash = 0
            if '&c' eq "n"
                dw 0Dh,0Ah              ; /n = CR LF
            elseif '&c' eq "/"
                dw '/'                  ; // = literal slash
            elseif '&c' eq "r"
                dw 0Dh                  ; /r = CR
            elseif '&c' eq "l"
                dw 0Ah                  ; /l = LF
            elseif '&c' eq "s"
                dw 20h                  ; /s = space
            elseif '&c' eq "c"
                dw 3Bh                  ; /c = semicolon
            elseif '&c' eq "t"
                dw 9                    ; /t = tab
            endif
        endif
    endm
    dw 0                                ; wide NUL terminator
endm

for example

wHello: du <Hello, world>
wRusstring: du <фывап>

jj2007

Quote from: JK on March 31, 2020, 04:04:11 AM
@jj2007,

how is the source file encoded, UTF16 or UTF8, with or without BOM, and what assembler do you use?

UTF8, no BOM, Masm (6.15 and higher) or UAsm (recommended) or AsmC

Adamanteus

Quote from: JK on March 30, 2020, 02:42:00 AM
Are there already macros for this task, one for defining constants and one for use in expressions, or would i have to try it myself (MultiByteToWideChar...)?

There is already a macro for CP 1251, but you still have to insert it into your own macros by hand:


CYROFFSET_A2WDAT MACRO argz:REQ
  LOCAL retval
ifidni argz, <Ё>
EXITM <+0359h>
elseifidni argz, <ё>
EXITM <+0399h>
else
  retval EQU <>
  forc char, <АБВГДЕЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯабвгдежзийклмнопрстуфхцчшщъыьэюя>
ifidni argz, <char>
retval EQU <+0350h>
exitm
endif
  endm
endif
EXITM retval
ENDM

JK

Thanks for your help!

I'm not specifically interested in Russian characters, my goal is a generic solution for handling all kinds of Unicode text literals (e.g. Russian, Chinese, Greek, whatever).

So to summarize what i have learned:
- I cannot have UTF16 encoded files, it must be ASCII or UTF8.
- In ASMC i can set options (see above), which convert all literals (ASCII/UTF8 encoded) to UTF16 in generated code.
- In MASM/UASM there is no such option, i can pass UTF8 encoded files without BOM, but i have to add my own code for the conversion (UTF8 -> UTF16) i want.

Is this correct ?
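
For reference, the UTF8 -> UTF16 conversion mentioned in the last point can be sketched as follows. This is a minimal, portable C illustration and not code from this thread; the names `utf8_decode` and `utf8_to_utf16` are hypothetical. On Windows the same job is normally left to MultiByteToWideChar with CP_UTF8, so this only shows what such a conversion has to do: decode each UTF8 sequence to a code point, then emit one 16-bit unit or a surrogate pair.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode one UTF8 sequence at s into *cp.
 * Returns the number of bytes consumed, or 0 if the sequence is malformed
 * (bad continuation byte, overlong form, surrogate, or above U+10FFFF). */
static size_t utf8_decode(const unsigned char *s, uint32_t *cp)
{
    if (s[0] < 0x80) {
        *cp = s[0];
        return 1;
    }
    if ((s[0] & 0xE0) == 0xC0 && (s[1] & 0xC0) == 0x80) {
        *cp = ((uint32_t)(s[0] & 0x1F) << 6) | (uint32_t)(s[1] & 0x3F);
        return *cp >= 0x80 ? 2 : 0;
    }
    if ((s[0] & 0xF0) == 0xE0 && (s[1] & 0xC0) == 0x80 &&
        (s[2] & 0xC0) == 0x80) {
        *cp = ((uint32_t)(s[0] & 0x0F) << 12) |
              ((uint32_t)(s[1] & 0x3F) << 6) | (uint32_t)(s[2] & 0x3F);
        if (*cp < 0x800 || (*cp >= 0xD800 && *cp <= 0xDFFF))
            return 0;
        return 3;
    }
    if ((s[0] & 0xF8) == 0xF0 && (s[1] & 0xC0) == 0x80 &&
        (s[2] & 0xC0) == 0x80 && (s[3] & 0xC0) == 0x80) {
        *cp = ((uint32_t)(s[0] & 0x07) << 18) |
              ((uint32_t)(s[1] & 0x3F) << 12) |
              ((uint32_t)(s[2] & 0x3F) << 6) | (uint32_t)(s[3] & 0x3F);
        if (*cp < 0x10000 || *cp > 0x10FFFF)
            return 0;
        return 4;
    }
    return 0;
}

/* Convert NUL-terminated UTF8 text to NUL-terminated UTF16 in dst.
 * cap is the capacity of dst in 16-bit units, terminator included.
 * Returns the number of units written (without the terminator),
 * or -1 on malformed input or if dst is too small. */
long utf8_to_utf16(const char *src, uint16_t *dst, size_t cap)
{
    const unsigned char *s = (const unsigned char *)src;
    size_t n = 0;
    assert(src != NULL && dst != NULL);
    while (*s != 0) {
        uint32_t cp;
        size_t used = utf8_decode(s, &cp);
        if (used == 0)
            return -1;
        s += used;
        if (cp < 0x10000) {
            if (n + 1 >= cap)
                return -1;
            dst[n++] = (uint16_t)cp;
        } else {
            /* Outside the BMP: split into a surrogate pair. */
            if (n + 2 >= cap)
                return -1;
            cp -= 0x10000;
            dst[n++] = (uint16_t)(0xD800 | (cp >> 10));
            dst[n++] = (uint16_t)(0xDC00 | (cp & 0x3FF));
        }
    }
    if (n >= cap)
        return -1;
    dst[n] = 0;
    return (long)n;
}
```

Called on the UTF8 bytes of "фывап" (D1 84 D1 8B D0 B2 D0 B0 D0 BF), this yields the five units 0444h, 044Bh, 0432h, 0430h, 043Fh followed by a terminating zero, which is the layout a DW-defined wide string needs.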


JK

jj2007


JK

Thanks for the confirmation jj2007!


Playing with UASM i found this works (file encoding UTF8, no BOM):

...

option LITERALS:ON


.DATA
MsgBoxCaptionW  DW "Asian: 捉敺楬",0


.code


start proc

  invoke MessageBoxW, 0, WSTR("Russian: фывап"), ADDR MsgBoxCaptionW, 0
  invoke ExitProcess, 0

start endp


end start



JK