Unicode and UTF-8: Using non-Latin charsets in Assembler

Started by jj2007, May 10, 2017, 09:36:07 AM

jj2007

Attached a RichMasm beta for testing - use at your own risk.

Main new features:
- open & save files with non-Latin names (an example of an Arabic file is included below)
- search for non-Latin text, e.g. текст
- pass Unicode arguments for testing via OPT_Arg1, as shown below

This requires a recent MasmBasic installation. Extract the attached archives to \Masm32\MasmBasic, then run RichMasmBeta.exe

GuiParas equ "Unicode rocks!!!!"      ; in RichMasm, hit F6 to build this application
GuiMenu equ <@File, Open>             ; requires MasmBasic
include \masm32\MasmBasic\Res\MbGui.asm
  MakeFont hFont, Height:24
  GuiControl MyEdit, "RichEdit", font hFont
  wSetWin$ hMyEdit="The commandline passed was"+CrLf$+wCL$()

Event Menu
  .if MenuID==0
      .if wFileOpen$("Rich source=*.asc|Poor sauce=*.asm|Resource=*.rc")
            wSetWin$ hMyEdit=wFileRead$(wFileOpen$())
      .endif
  .endif
GuiEnd

OPT_Icon      Globe      ; v v v "Enter text here" in Russian, Chinese and Arabic
OPT_Arg1      Введите текст здесь / 在此输入文字 / أدخل النص هنا


EDIT: Beta removed, the current version is more up-to-date.

hutch--

Something you could do is write code in a UNICODE editor and run a process that only reads the code up to the comments. This would allow comments in any language to make the code readable, but would remove them and convert the code to ASCII for assembly/compiling. An assembler de-commenter is simple enough to write so you run the UNICODE through the API to convert it to ASCII, strip the comments and then feed it to the assembler.
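
A minimal sketch of such a de-commenter, written in Python rather than assembler for brevity. The function names are hypothetical, and the sketch assumes that comments start with ";" and that strings use single or double quotes; it does not handle MASM's COMMENT directive or angle-bracket literals.

```python
def strip_asm_comment(line: str) -> str:
    """Remove a trailing ';' comment, ignoring ';' inside quoted strings."""
    quote = None  # the quote character that is currently open, if any
    for i, ch in enumerate(line):
        if quote:
            if ch == quote:
                quote = None
        elif ch in ('"', "'"):
            quote = ch
        elif ch == ';':
            return line[:i].rstrip()
    return line.rstrip()


def decomment(source: str) -> str:
    """Strip comments from every line; fail if non-ASCII text remains."""
    out = []
    for number, line in enumerate(source.splitlines(), start=1):
        code = strip_asm_comment(line)
        if not code.isascii():
            raise ValueError(f"line {number}: non-ASCII text outside a comment")
        out.append(code)
    return "\n".join(out)


sample = 'mov eax, 1   ; счётчик\nmsg db "a;b", 0  ; 文字'
print(decomment(sample))
```

Raising an error for leftover non-ASCII text is a design choice of this sketch: it refuses to guess how such text should be converted.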

TWell

nidud's asmc shows how to do it:
http://masm32.com/board/index.php?topic=6221.msg66308#msg66308
http://masm32.com/board/index.php?topic=5942.msg66207#msg66207

If only a UTF-8 BOM could make /ws=65001 the default when processing a UTF-8 file, that would be one problem less?
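
The BOM check itself is simple. The following Python sketch only illustrates the idea and says nothing about how asmc actually handles its options; pick_codepage and the fallback code page 1252 are assumptions made for the example.

```python
import codecs


def pick_codepage(raw: bytes, default: int = 1252) -> int:
    """Return 65001 (UTF-8) if the data starts with a UTF-8 BOM, else the default."""
    return 65001 if raw.startswith(codecs.BOM_UTF8) else default


print(pick_codepage(codecs.BOM_UTF8 + b"mov eax, 1"))
print(pick_codepage(b"mov eax, 1"))
```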

jj2007

Hutch & Tim,

Thanks for your suggestions, I appreciate your interest, really :t

There is one minor problem, though: It works already. Study the examples above, they work absolutely fine and assemble in ML 6.15... no need for more acrobatics ;-)

Under the hood: RichMasm's RichEd20.dll control has always (since 200x?) used Unicode by default. All I had to do was find a way to export UTF-8 text to plain text, and start the build. Actually, the process is a little bit more complicated, but the principle is that simple. And those who believe in purest assembler without any macros can use even ML version 6.14 to process their szText "歡迎", 0  8)
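
One plausible reason why an old assembler copes with such a string, assuming the source file reaches it as UTF-8: every byte of a multi-byte UTF-8 sequence is 80h or above, so none of them can be mistaken for a quote, a zero terminator or any other ASCII character. A short Python sketch to check this for the string above (utf8_bytes is a hypothetical helper):

```python
def utf8_bytes(text: str) -> bytes:
    """Encode text the way a UTF-8 source file would store it."""
    return text.encode("utf-8")


data = utf8_bytes("歡迎")
print(len("歡迎"), "characters,", len(data), "bytes")
print(" ".join(f"{b:02X}h" for b in data))

# True if no byte falls in the ASCII range 00h..7Fh
ascii_free = all(b >= 0x80 for b in data)
print(ascii_free)
```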

HSE

Quote from: hutch-- on May 10, 2017, 01:49:58 PM
An assembler de-commenter is simple enough to write so you run the UNICODE through the API to convert it to ASCII, strip the comments and then feed it to the assembler.

I think you need a script-writer (even more complex than RichMasm) because you need to prevent the introduction of Unicode characters in code other than strings. Sometimes copying and pasting introduces Unicode characters (especially invisible characters and characters that look like ANSI), and it takes time to find them.
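
A rough Python sketch of a checker for this problem. find_stray_chars is a hypothetical name, and the sketch assumes ";" comments and quoted strings only. It reports every non-ASCII character that sits outside strings and comments, with its position and Unicode name.

```python
import unicodedata


def find_stray_chars(source: str):
    """List (line, column, codepoint, name) for non-ASCII characters in code."""
    hits = []
    for line_no, line in enumerate(source.splitlines(), start=1):
        quote = None  # the quote character that is currently open, if any
        for col, ch in enumerate(line, start=1):
            if quote:
                if ch == quote:
                    quote = None
            elif ch in ('"', "'"):
                quote = ch
            elif ch == ';':
                break  # the rest of the line is a comment
            elif ord(ch) > 127:
                name = unicodedata.name(ch, "UNKNOWN")
                hits.append((line_no, col, f"U+{ord(ch):04X}", name))
    return hits


# Line 1 contains a Cyrillic "o", line 3 an invisible zero width space
sample = 'm\u043ev eax, 1\nmsg db "текст", 0 ; ok\nadd eax,\u200b2'
for hit in find_stray_chars(sample):
    print(hit)
```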

jj2007

The MasmBasic version of 8 December 2017 has three new macros for handling UTF-8 strings, uLeft$, uMid$, uRight$:

include \masm32\MasmBasic\MasmBasic.inc
SetGlobals r$="Введите текст здесь"       ; "Enter text here" in Russian
SetGlobals c$="在這裡輸入文字"              ; "Enter text here" in Chinese
Init
  PrintLine "[", r$, "] (original string)"
  PrintLine "[", uRight$(r$, 5), "_", uMid$(r$, 9, 5), "_", uLeft$(r$, 7), "] (right_mid_left, fixed)"
  PrintLine "[", uRight$(r$, 5), "_", Mid$(r$, Instr_(r$, "текст"), 2*5), "_", uLeft$(r$, 7), "] (right_mid_left, Instr)"
  wMsgBox 0, wRec$("["+uLeft$(c$, 5)+"]"), "Chinese, uLeft$(5):", MB_OK
  wMsgBox 0, wRec$("["+uRight$(c$, 3)+"]"), "Chinese, uRight$(3):", MB_OK
EndOfCode


Remarks:
    - use uLeft$(src, chars) if you know the number of UTF-8 characters needed
    - use "normal" Left$() etc. if you got the position from Instr_(); but note the need to calculate bytes, see 2*5 above (five Cyrillic characters, two UTF-8 bytes each)

Output:
[Введите текст здесь] (original string)
[здесь_текст_Введите] (right_mid_left, fixed)
[здесь_текст_Введите] (right_mid_left, Instr)
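
The difference between counting characters and counting bytes can be sketched in Python as follows. This is not how the MasmBasic macros are implemented; u_mid is a hypothetical helper, and unlike the positions in the example above, bytes.find() returns a 0-based offset.

```python
r = "Введите текст здесь"
raw = r.encode("utf-8")
print(len(r), "characters,", len(raw), "bytes")


def u_mid(data: bytes, start: int, count: int) -> bytes:
    """Character-based slice of UTF-8 data, with a 1-based start position."""
    text = data.decode("utf-8")
    return text[start - 1:start - 1 + count].encode("utf-8")


# Character-based: the 9th character onwards, 5 characters
by_chars = u_mid(raw, 9, 5)

# Byte-based: search for the bytes, then take 5 characters * 2 bytes each
pos = raw.find("текст".encode("utf-8"))
by_bytes = raw[pos:pos + 2 * 5]

print(by_chars.decode("utf-8"), by_bytes.decode("utf-8"))
```

The factor 2 only holds because every character in the slice is Cyrillic; the Chinese characters in c$ take three bytes each in UTF-8.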


The three macros should work exactly like their Ansi and wide versions (if not, please let me know).

I attach a somewhat bigger project including an example of how to use lower$() and UPPER$() with Unicode text.