Defining "real" wide string constants and using wide string literals in code seems not so easy. Of course there are macros for wide strings like "this is a wide string", which convert this ASCII sequence to its wide string representation. But what about "фывап" (Russian characters) or Chinese? Masm et al. don't like UTF-16 encoded files; at least it keeps failing for me (with and without BOM). There is no ASCII representation of "фывап", so you need a UTF-16 or UTF-8 encoded source file.
So I could use UTF-8 in my editor, which is indeed accepted (without BOM). In this case everything outside quotes remains ASCII (no problem for the assembler) and UTF-8 encoding is present only inside quoted strings. The problem with UTF-8 is that macros for wide strings don't process literals like "фывап" properly, because the necessary conversion isn't ASCII -> UTF-16, but UTF-8 -> UTF-16.
Are there already macros for this task, one for defining constants and one for use in expressions, or would I have to try it myself (MultiByteToWideChar...)?
JK
deleted
include \masm32\MasmBasic\MasmBasic.inc ; download (http://masm32.com/board/index.php?topic=94.0)
Init
PrintLine "Добро пожаловать"
uMsgBox 0, "歡迎", "That looks Chinese:", MB_OK
EndOfCode
Thanks for your replies.
@nidud,
my post wasn't clear enough. Of course there is an ASCII codepage with Russian characters, but dealing with codepages is what I wanted to avoid. jj2007's example demonstrates much better what I meant.
@jj2007,
how is the source file encoded, UTF16 or UTF8, with or without BOM, and what assembler do you use?
JK
deleted
Quote from: JK on March 31, 2020, 04:04:11 AM
how is the source file encoded, UTF16 or UTF8, with or without BOM, and what assembler do you use?
I faced a similar challenge. The source code should be written in UTF-8 without BOM, and string literals will be emitted in wide (UTF-16) encoding if the assembler is told which source encoding is used. This is done in MASM with
option codepage:CP_UTF8 and in €ASM with option
EUROASM CodePage=UTF-8, see an example at https://euroassembler.eu/prowin64/cpmix64.htm
thanks nidud,
I see now: the code file for asmc can be UTF-8, and by telling asmc that it is UTF-8 encoded, all literals are converted to UTF-16 in the generated code. So there is no need for extra macros or functions for UTF-8 to UTF-16 conversion; it's built into asmc.
JK
Quote: This is done in MASM with option codepage:CP_UTF8
??? I cannot find an option "codepage" on Microsoft's MASM pages (https://docs.microsoft.com/en-us/cpp/assembler/masm/option-masm?view=vs-2019)
JK
Quote from: nidud on March 31, 2020, 05:39:41 AM
The information the assembler needs is the code page of the string (UTF8 in this case):
asmc64 -ws65001 test.asm
...
include winnls.inc
option codepage:CP_UTF8
I thought that nidud was talking about MASM, sorry.
Hi, JK!
My first post on the masm32 forum was "Ansii and unicode strings" (http://masm32.com/board/index.php?topic=717.msg10211#msg10211). I use a macro "du", which emits Latin and Cyrillic letters as Unicode characters:
du macro string
    local bslash
    bslash = 0                          ; 1 while the previous character was the "/" escape
    irpc c,<string>
        if bslash eq 0
            if '&c' eq "/"
                bslash = 1              ; start of an escape sequence
            elseif '&c' gt 127
                db ('&c'- 0B0h),4       ; Cyrillic byte -> U+04xx (this offset fits a CP 1251 source file)
            else
                dw '&c'                 ; plain ASCII, zero-extended to 16 bits
            endif
        else
            bslash = 0
            if '&c' eq "n"
                dw 0Dh,0Ah              ; /n = CR LF
            elseif '&c' eq "/"
                dw '/'                  ; // = literal slash
            elseif '&c' eq "r"
                dw 0Dh                  ; /r = CR
            elseif '&c' eq "l"
                dw 0Ah                  ; /l = LF
            elseif '&c' eq "s"
                dw 20h                  ; /s = space
            elseif '&c' eq "c"
                dw 3Bh                  ; /c = semicolon
            elseif '&c' eq "t"
                dw 9                    ; /t = tab
            endif
        endif
    endm
    dw 0                                ; terminating wide zero
endm
For example:
wHello: du <Hello, world>
wRusstring: du <фывап>
Quote from: JK on March 31, 2020, 04:04:11 AM
@jj2007,
how is the source file encoded, UTF16 or UTF8, with or without BOM, and what assembler do you use?
UTF8, no BOM, Masm (6.15 and higher) or UAsm (recommended) or AsmC
Quote from: JK on March 30, 2020, 02:42:00 AM
Are there already macros for this task, one for defining constants and one for use in expressions, or would i have to try it myself (MultiByteToWideChar...)?
There is already a macro for CP 1251, but you still have to insert it into your own macros by hand:
CYROFFSET_A2WDAT MACRO argz:REQ
    LOCAL retval
    ifidni argz, <Ё>
        EXITM <+0359h>
    elseifidni argz, <ё>
        EXITM <+0399h>
    else
        retval EQU <>
        forc char, <АБВГДЕЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯабвгдежзийклмнопрстуфхцчшщъыьэюя>
            ifidni argz, <char>
                retval EQU <+0350h>
                exitm
            endif
        endm
    endif
    EXITM retval
ENDM
Thanks for your help!
I'm not specifically interested in Russian characters; my goal is a generic solution for handling all kinds of Unicode text literals (e.g. Russian, Chinese, Greek, whatever).
So to summarize what I have learned:
- I cannot have UTF-16 encoded files; they must be ASCII or UTF-8.
- In ASMC I can set options (see above) which convert all literals (ASCII/UTF-8 encoded) to UTF-16 in the generated code.
- In MASM/UASM there is no such option. I can pass UTF-8 encoded files without BOM, but I have to add my own code for the conversion (UTF-8 -> UTF-16) I want.
Is this correct?
JK
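The "own code" route from the last point could be sketched as a run-time conversion with MultiByteToWideChar. This is only a minimal sketch under assumptions not stated in the thread: it presumes the MASM32 SDK (masm32rt.inc), a source file saved as UTF-8 without BOM, and the hypothetical labels utf8Text and wideText.

```masm
include \masm32\include\masm32rt.inc    ; assumption: MASM32 SDK is installed

.data
utf8Text db "фывап", 0                  ; stored byte for byte as UTF-8 from the source file

.data?
wideText dw 64 dup(?)                   ; hypothetical buffer, room for 64 UTF-16 code units

.code
start:
    ; cbMultiByte = -1: convert up to and including the terminating zero
    invoke MultiByteToWideChar, CP_UTF8, 0, addr utf8Text, -1, addr wideText, 64
    ; the converted text is used for both message and caption
    invoke MessageBoxW, 0, addr wideText, addr wideText, MB_OK
    invoke ExitProcess, 0
end start
```

The literal stays readable in the source, at the cost of one API call and a buffer per string at run time.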
Correct :thumbsup:
Thanks for the confirmation jj2007!
Playing with UASM I found that this works (file encoding UTF-8, no BOM):
...
option LITERALS:ON
.DATA
MsgBoxCaptionW DW "Asian: 捉敺楬",0
.code
start proc
invoke MessageBoxW, 0, WSTR("Russian: фывап"), ADDR MsgBoxCaptionW, 0
invoke ExitProcess, 0
start endp
end start
JK
If I need to produce true unicode, not just converted ascii, I use a unicode editor then turn it into db sequences. You then just feed the db data into a Unicode API (with the trailing "W") and it all works fine.
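As an illustration of that approach (not the poster's actual data), the string "фывап" from earlier in the thread could be written out like this. The labels wRusBytes and wRusWords are made up for the example; the values are the UTF-16 code units U+0444, U+044B, U+0432, U+0430, U+043F.

```masm
.data
; "фывап" as UTF-16LE bytes (low byte first), with a two-byte terminator
wRusBytes db 44h,04h, 4Bh,04h, 32h,04h, 30h,04h, 3Fh,04h, 0,0
; the same string as 16-bit words
wRusWords dw 0444h, 044Bh, 0432h, 0430h, 043Fh, 0
```

Either label can be passed with ADDR to an API with the trailing "W", and the source file itself stays pure ASCII.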
deleted
Print "Добро пожаловать" works with MASM, UAsm and AsmC, but of course there are solutions that look much more professional and challenging ;-)
Thanks for all your input.
My (self-written) IDE uses Scintilla as its edit control, which is what Notepad++ does too. So by switching the keyboard setting I can enter Russian or Greek letters directly and my editor window shows them correctly. What's more, I can save the text/code in UTF-8 encoding (with or without BOM) and pass it to an assembler and linker.
As my example (see my last post) shows, UASM can process (UTF-8 encoded) Unicode literals if "option LITERALS:ON" is set.
Please note, I don't want to promote UASM or any other assembler; I just wanted to know their possibilities and limitations. Thanks.
JK
I have a whitespace problem with UTF-16 characters at 04000h and above.
I tried 03000h, a wider space, but it is still not wide enough.
???
The problem is that I have ASCII
data "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",10,13
data "X X",10,13
data "XXXXXXXXXXXXX XXXXXX XXXXXX",10,13
as the source for a random generator, which creates a random WCHAR of 4000h or above wherever there is an X and places 03000h wherever there is a " ".
Is there a space that is wide enough to match the Chinese characters at 4000h and above, or do I need to put several different spaces together to make it look right?