Unicode and UTF-8: Using non-Latin charsets in Assembler

jj2007 · May 10, 2017, 09:36:07 AM

Attached a RichMasm beta for testing - use at your own risk.

Main new features:
- open & save files with non-Latin names (an example of an Arabic file is included below)
- search for non-Latin text, e.g. текст
- pass Unicode arguments for testing via OPT_Arg1, as shown below

This requires a recent MasmBasic installation. Extract the attached archives to \Masm32\MasmBasic, then run RichMasmBeta.exe

GuiParas equ "Unicode rocks!!!!" ; in RichMasm, hit F6 to build this application
GuiMenu equ <@File, Open> ; requires MasmBasic
include \masm32\MasmBasic\Res\MbGui.asm
MakeFont hFont, Height:24
GuiControl MyEdit, "RichEdit", font hFont
wSetWin$ hMyEdit="The commandline passed was"+CrLf$+wCL$()

Event Menu
.if MenuID==0
.if wFileOpen$("Rich source=*.asc|Poor sauce=*.asm|Resource=*.rc")
wSetWin$ hMyEdit=wFileRead$(wFileOpen$())
.endif
.endif
GuiEnd

OPT_Icon Globe ; v v v "Enter text here" in Russian, Chinese and Arabic
OPT_Arg1 Введите текст здесь / 在此输入文字 / أدخل النص هنا

EDIT: Beta removed, the current version is more up-to-date.

hutch-- · May 10, 2017, 01:49:58 PM

Something you could do is write code in a UNICODE editor and run a process that only reads the code up to the comments. This would allow comments in any language to make the code readable, but remove them and convert the code to ASCII for assembly/compiling. An assembler de-commenter is simple enough to write, so you run the UNICODE through the API to convert it to ASCII, strip the comments and then feed it to the assembler.
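A minimal sketch of such a de-commenter, in Python rather than assembler to keep it short. It assumes that comments start with `;`, that string literals use single or double quotes, and it ignores MASM `COMMENT` blocks and `<...>` text literals. The names `strip_comment` and `decomment` are made up for illustration. It strips comments first and converts afterwards, so that non-Latin comments never reach the ASCII conversion.

```python
def strip_comment(line: str) -> str:
    """Return the line without its trailing ';' comment, keeping ';' inside quotes."""
    quote = None  # the quote character we are currently inside, if any
    for i, ch in enumerate(line):
        if quote:
            if ch == quote:
                quote = None          # closing quote found
        elif ch in "\"'":
            quote = ch                # opening quote found
        elif ch == ";":
            return line[:i].rstrip()  # comment starts here, drop the rest
    return line.rstrip()


def decomment(source: str) -> bytes:
    """Strip comments, then encode as ASCII.

    Raises UnicodeEncodeError if non-ASCII text is left in the code itself.
    """
    stripped = "\n".join(strip_comment(line) for line in source.splitlines())
    return stripped.encode("ascii")
```

Calling `decomment()` on a source whose only non-Latin text sits in comments returns plain ASCII bytes that could be written to a temporary file for the assembler. A non-Latin string literal would raise an error, which is the limitation of this approach.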

TWell · May 10, 2017, 04:36:26 PM

nidud's asmc shows how to do it:
http://masm32.com/board/index.php?topic=6221.msg66308#msg66308
http://masm32.com/board/index.php?topic=5942.msg66207#msg66207

If only a UTF-8 BOM could set /ws=65001 as the default when processing a UTF-8 file, that would be one problem fewer?

jj2007 · May 10, 2017, 07:41:25 PM

Hutch & Tim,

Thanks for your suggestions, I appreciate your interest, really :t

There is one minor problem, though: It works already. Study the examples above, they work absolutely fine and assemble in ML 6.15... no need for more acrobatics ;-)

Under the hood: RichMasm's RichEd20.dll control has always (since 200x?) used Unicode by default. All I had to do is find a way to export UTF-8 text to plain text, and start the build. Actually, the process is a little bit more complicated, but the principle is that simple. And those who believe in purest assembler without any macros can use even ML version 6.14 to process their szText "歡迎", 0 8)
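To illustrate why an old assembler can cope with such strings: in UTF-8, non-Latin text is just a run of bytes above 7Fh, so a quoted string and an explicit list of byte values describe the same data. The Python sketch below prints that byte list. The helper name `to_db_line` is hypothetical, and this is not a description of what RichMasm's export actually does.

```python
def to_db_line(label: str, text: str) -> str:
    """Format text as a MASM-style db line of UTF-8 byte values plus a zero terminator."""
    # A leading 0 keeps each hex literal starting with a digit, as MASM requires
    values = ", ".join(f"0{b:02X}h" for b in text.encode("utf-8"))
    return f"{label} db {values}, 0"


print(to_db_line("szText", "歡迎"))
```

Each of the two Chinese characters takes three bytes, so the line carries six data bytes before the terminator.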

HSE · May 10, 2017, 11:56:19 PM

Quote from: hutch-- on May 10, 2017, 01:49:58 PM
An assembler de-commenter is simple enough to write so you run the UNICODE through the API to convert it to ASCII, strip the comments and then feed it to the assembler.

I think you need a script writer (even more complex than RichMasm), because you need to prevent the introduction of Unicode characters into code other than strings. Copying and pasting sometimes introduces Unicode characters (especially invisible characters and characters that look like ANSI ones), and it takes time to find them.
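A rough Python sketch of such a check: it reports every non-ASCII character that sits outside quoted strings and outside `;` comments, together with its Unicode name, so that invisible or look-alike characters become easy to spot. The function name `find_stray_chars` is hypothetical, and the quote and comment rules are simplified assumptions (no `COMMENT` blocks, no `<...>` literals).

```python
import unicodedata


def find_stray_chars(source: str):
    """Yield (line_no, column, char, name) for non-ASCII characters in code."""
    for line_no, line in enumerate(source.splitlines(), start=1):
        quote = None  # the quote character we are currently inside, if any
        for col, ch in enumerate(line, start=1):
            if quote:
                if ch == quote:
                    quote = None
            elif ch in "\"'":
                quote = ch
            elif ch == ";":
                break  # the rest of the line is a comment
            elif ord(ch) > 127:
                yield line_no, col, ch, unicodedata.name(ch, "UNKNOWN")


# Line 1 contains a no-break space, line 3 a Cyrillic "о" inside "mov"
sample = 'mov eax,\u00a01 ; ок\ndb "текст", 0\nm\u043ev ebx, 2'
for hit in find_stray_chars(sample):
    print(hit)
```

The string in line 2 and the comment in line 1 are not reported, only the two characters that would confuse the assembler.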

jj2007 · December 09, 2017, 05:44:46 AM

The MasmBasic version of 8 December 2017 has three new macros for handling UTF-8 strings, uLeft$, uMid$, uRight$:

include \masm32\MasmBasic\MasmBasic.inc
SetGlobals r$="Введите текст здесь" ; "Enter text here" in Russian
SetGlobals c$="在這裡輸入文字" ; "Enter text here" in Chinese
Init
PrintLine "[", r$, "] (original string)"
PrintLine "[", uRight$(r$, 5), "_", uMid$(r$, 9, 5), "_", uLeft$(r$, 7), "] (right_mid_left, fixed)"
PrintLine "[", uRight$(r$, 5), "_", Mid$(r$, Instr_(r$, "текст"), 2*5), "_", uLeft$(r$, 7), "] (right_mid_left, Instr)"
wMsgBox 0, wRec$("["+uLeft$(c$, 5)+"]"), "Chinese, uLeft$(5):", MB_OK
wMsgBox 0, wRec$("["+uRight$(c$, 3)+"]"), "Chinese, uRight$(3):", MB_OK
EndOfCode

Remarks:
- use uLeft$(src, chars) if you know the number of UTF-8 characters needed
- use the "normal" Left$() etc. if you got the number of chars from Instr_(); but note that you then need to calculate in bytes, see 2*5 above
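The arithmetic behind the 2*5 above, sketched in Python rather than MasmBasic: each Cyrillic letter takes two bytes in UTF-8 and a space takes one, so character counts and byte counts differ. The helper `utf8_offset` is a hypothetical name; it only illustrates how a character count can be turned into a byte offset and does not claim to show how the u...$ macros are implemented. Note that `find()` returns a 0-based byte offset.

```python
def utf8_offset(data: bytes, chars: int) -> int:
    """Return the byte offset at which the first `chars` UTF-8 characters end."""
    pos = 0
    for _ in range(chars):
        if pos >= len(data):
            break
        pos += 1  # lead byte of the character
        # continuation bytes have the bit pattern 10xxxxxx
        while pos < len(data) and (data[pos] & 0xC0) == 0x80:
            pos += 1
    return pos


r = "Введите текст здесь".encode("utf-8")   # 19 characters, 36 bytes
start = r.find("текст".encode("utf-8"))     # byte offset of the match (0-based)

print(len(r), start)                        # byte length and byte offset
print(r[start:start + 2 * 5].decode())      # 5 characters of 2 bytes each
print(r[:utf8_offset(r, 7)].decode())       # first 7 characters, by count
```

For Chinese text the factor would be 3 instead of 2, and for mixed text there is no fixed factor at all, which is why counting characters needs a loop like the one above.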

Output:

[Введите текст здесь] (original string)
[здесь_текст_Введите] (right_mid_left, fixed)
[здесь_текст_Введите] (right_mid_left, Instr)

The three macros should work exactly like their ANSI and wide versions (if not, please let me know).

I attach a somewhat bigger project including an example how to use lower$() and UPPER$() with Unicode text.

The MASM Forum
