Author Topic: Unicode and UTF-8: Using non-Latin charsets in Assembler

jj2007

  • Moderator
  • Member
  • *****
  • Posts: 7559
  • Assembler is fun ;-)
    • MasmBasic
Unicode and UTF-8: Using non-Latin charsets in Assembler
« on: May 10, 2017, 09:36:07 AM »
Attached a RichMasm beta for testing - use at your own risk.

Main new features:
- open & save files with non-Latin names (an example of an Arabic file is included below)
- search for non-Latin text, e.g. текст
- pass Unicode arguments for testing via OPT_Arg1, as shown below

This requires a recent MasmBasic installation. Extract the attached archives to \Masm32\MasmBasic, then run RichMasmBeta.exe.

GuiParas equ "Unicode rocks!!!!"      ; in RichMasm, hit F6 to build this application
GuiMenu equ <@File, Open>             ; requires MasmBasic
include \masm32\MasmBasic\Res\MbGui.asm
  MakeFont hFont, Height:24
  GuiControl MyEdit, "RichEdit", font hFont
  wSetWin$ hMyEdit="The commandline passed was"+CrLf$+wCL$()

Event Menu
  .if MenuID==0
    .if wFileOpen$("Rich source=*.asc|Poor sauce=*.asm|Resource=*.rc")
      wSetWin$ hMyEdit=wFileRead$(wFileOpen$())
    .endif
  .endif
GuiEnd

OPT_Icon      Globe      ; v v v "Enter text here" in Russian, Chinese and Arabic
OPT_Arg1      Введите текст здесь / 在此输入文字 / أدخل النص هنا

hutch--

  • Administrator
  • Member
  • ******
  • Posts: 4815
  • Mnemonic Driven API Grinder
    • The MASM32 SDK
Re: Unicode and UTF-8: Using non-Latin charsets in Assembler
« Reply #1 on: May 10, 2017, 01:49:58 PM »
Something you could do is write code in a UNICODE editor and run a process that only reads the code up to the comments. This would allow comments in any language to make the code readable, but would remove them and convert the code to ASCII for assembly/compiling. An assembler de-commenter is simple enough to write, so you run the UNICODE through the API to convert it to ASCII, strip the comments and then feed it to the assembler.
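A minimal sketch of such a de-commenter, in Python rather than assembler to keep it short. It assumes that a comment starts at the first ';' outside a single- or double-quoted string, and it ignores COMMENT blocks and <...> text literals. The function names are made up for illustration. It strips the comments first and converts afterwards, so that non-ASCII text which survives outside comments is reported as an error instead of being silently mangled.

```python
def strip_comment(line: str) -> str:
    """Return the line without its trailing ';' comment, honouring quoted strings."""
    quote = None  # the quote character that is currently open, if any
    for i, ch in enumerate(line):
        if quote:
            if ch == quote:
                quote = None
        elif ch in "'\"":
            quote = ch
        elif ch == ";":
            return line[:i].rstrip()
    return line.rstrip()


def decomment_to_ascii(source: str) -> bytes:
    """Strip the comments from every line, then encode what is left as ASCII."""
    stripped = "\n".join(strip_comment(line) for line in source.splitlines())
    # Raises UnicodeEncodeError if non-ASCII text survives outside comments.
    return stripped.encode("ascii")


sample = 'mov eax, 1 ; счётчик\nszMsg db "a;b", 0 ; 文字\n'
print(decomment_to_ascii(sample))
```

Note that a non-Latin string literal would make the ASCII step fail, so this only covers the "comments in any language" case.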
hutch at movsd dot com
http://www.masm32.com    :biggrin:  :biggrin:

TWell

  • Member
  • ****
  • Posts: 748
Re: Unicode and UTF-8: Using non-Latin charsets in Assembler
« Reply #2 on: May 10, 2017, 04:36:26 PM »
nidud's asmc shows how to do it:
http://masm32.com/board/index.php?topic=6221.msg66308#msg66308
http://masm32.com/board/index.php?topic=5942.msg66207#msg66207

If only a UTF-8 BOM could make /ws=65001 the default when processing a UTF-8 file, wouldn't that be one problem less?
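The BOM check itself is tiny. A sketch in Python of what a build wrapper could do before picking the code page: look for the three-byte UTF-8 BOM (EF BB BF) and return 65001 if it is there. The function name and the fallback of 1252 are assumptions for illustration only, not anything asmc does by itself.

```python
import codecs


def split_bom(data: bytes, default_codepage: int = 1252):
    """Return (codepage, payload); the payload has any UTF-8 BOM removed."""
    if data.startswith(codecs.BOM_UTF8):
        return 65001, data[len(codecs.BOM_UTF8):]
    # No BOM: fall back to a default (1252 is only an assumed example).
    return default_codepage, data


codepage, payload = split_bom(b"\xef\xbb\xbfszText db 0")
print(codepage, payload)
```

A wrapper could then pass the returned code page to the assembler's switch instead of requiring it on every command line.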

jj2007

Re: Unicode and UTF-8: Using non-Latin charsets in Assembler
« Reply #3 on: May 10, 2017, 07:41:25 PM »
Hutch & Tim,

Thanks for your suggestions, I appreciate your interest, really :t

There is one minor problem, though: it already works. Study the examples above; they work absolutely fine and assemble in ML 6.15... no need for more acrobatics ;-)

Under the hood: RichMasm's RichEd20.dll control has always (since 200x?) used Unicode by default. All I had to do was find a way to export UTF-8 text to plain text and start the build. Actually, the process is a little more complicated, but the principle is that simple. And those who believe in purest assembler without any macros can even use ML version 6.14 to process their szText "歡迎", 0  8)

HSE

  • Member
  • ****
  • Posts: 533
  • <AMD>< 7-32>
Re: Unicode and UTF-8: Using non-Latin charsets in Assembler
« Reply #4 on: May 10, 2017, 11:56:19 PM »
Quote from hutch--: "An assembler de-commenter is simple enough to write so you run the UNICODE through the API to convert it to ASCII, strip the comments and then feed it to the assembler."

I think you need a script writer (even more complex than RichMasm), because you need to prevent the introduction of Unicode characters into code other than strings. Sometimes copying and pasting introduces Unicode characters (especially invisible characters and characters that look like ANSI ones), and it takes time to find them.
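Finding them can be automated. A rough sketch in Python of such a check: it walks each line, skips quoted strings and everything after a ';', and reports any character above 127 that is left in the code itself. The function name and the simple quoting rule (single or double quotes, no COMMENT blocks) are assumptions for illustration.

```python
import unicodedata


def find_stray_unicode(source: str):
    """Return (line, column, code point, name) for non-ASCII characters in code."""
    hits = []
    for line_no, line in enumerate(source.splitlines(), start=1):
        quote = None  # the quote character that is currently open, if any
        for col, ch in enumerate(line, start=1):
            if quote:
                if ch == quote:
                    quote = None
                continue  # anything inside a string literal is allowed
            if ch in "'\"":
                quote = ch
            elif ch == ";":
                break  # the rest of the line is a comment
            elif ord(ch) > 127:
                name = unicodedata.name(ch, "UNNAMED")
                hits.append((line_no, col, f"U+{ord(ch):04X}", name))
    return hits


# Line 1 hides a no-break space, line 3 a Cyrillic "o" inside "mov".
sample = 'mov\u00a0eax, 1\nszText db "歡迎", 0 ; 文字\nm\u043ev ebx, 2'
for hit in find_stray_unicode(sample):
    print(hit)
```

The string on line 2 and its comment are left alone; only the two look-alike characters in the code are reported, with their position and Unicode name.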