How important is unicode

Started by sinsi, January 15, 2019, 06:48:57 PM

aw27

Sinsi's question makes sense only because it is a pain to deal with Unicode in Assembly Language. On the other hand, it is almost transparent in all modern high-level languages, where it is not only the default selection but in most cases the only selection.
It took many years for high-level languages to reach that level of perfection, and there is a whole lot more to it than JJ even conceives. So it makes sense in many cases, particularly in ASM, to use Unicode strings in a resource file.
"Sure, you can put it into resources, but why perform such acrobatics if a simple Print "Привет, Мир" does the job?" is a ridiculous response, but acceptable if JJ was finishing his usual bottle of wine after lunch while listening to Louis Armstrong.


hutch--

Basically you write a few accessories and UNICODE becomes a lot easier. 32-bit MASM32 has macros that convert the ASCII range to UNICODE, and for the full range there is an accessory that produces DB sequences to be placed in the .DATA section. You can get UNICODE strings from a resource file written in UNICODE, but at least at one stage the API was buggy, whereas the embedded data was reliable.

hutch--

> if JJ was finishing his usual bottle of wine after lunch while listening to Louis Armstrong.

That sounds perfectly reasonable to me. My choice is a bottle of pure malt while listening to the many brilliantly good female musicians on Youtube.

sinsi

My question was for a few reasons.
- I've always shied away from using resources, but nowadays with manifests/version blocks/icons you can't really avoid them, so adding a string table is no big deal.
- The Windows API is apparently all Unicode, so any A function first converts its strings to Unicode and converts the results back to ANSI. How much overhead is that?
- As hutch said, unicode uses more memory. A resource section can be discarded but memory used in .data is always in use.

Correct me if I'm wrong, but the masm32 unicode macros are really only for converting English to Unicode English, yes?

jj2007

Quote from: AW on January 16, 2019, 05:52:15 PM
why perform such acrobatics if a simple Print "Привет, Мир" does the job? is a ridiculous response

Dear José, if your assembly toolchain does not allow a simple print "Привет, Мир" it is your problem, not mine.

aw27

Quote from: jj2007 on January 16, 2019, 02:13:17 PM
Quote from: hutch-- on January 16, 2019, 10:56:23 AM
the obvious trap is that it is twice the size

For UTF16, yes. That's why the UTF8 representation of Unicode is so popular.

Actually, that is wrong. UTF-16 is variable-length: each code point is encoded with one or two 16-bit code units. Characters like emojis take two UTF-16 code units (a surrogate pair).
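
To make the point concrete, here is a small Python sketch that counts the 16-bit code units UTF-16 needs for a given string. Python is used only as a neutral calculator here, and the helper name utf16_units is made up for illustration; nothing in it comes from the MASM32 SDK.

```python
def utf16_units(text):
    """Return the number of 16-bit code units needed to encode text as UTF-16."""
    # "utf-16-le" emits no BOM, so every 2 bytes is exactly one code unit
    return len(text.encode("utf-16-le")) // 2


# A Latin letter, a Cyrillic letter, and an emoji outside the Basic Multilingual Plane
for ch in ("A", "П", "😀"):
    print(f"U+{ord(ch):04X} needs {utf16_units(ch)} UTF-16 code unit(s)")
```

The Latin and Cyrillic letters each fit in one code unit; the emoji needs two, which is why a UTF-16 string's length in code units is not always its length in characters.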

aw27

Quote from: jj2007 on January 16, 2019, 08:50:32 PM
Quote from: AW on January 16, 2019, 05:52:15 PM
why perform such acrobatics if a simple Print "Привет, Мир" does the job? is a ridiculous response

Dear José, if your assembly toolchain does not allow a simple print "Привет, Мир" it is your problem, not mine.

You are not yet sober.   :shock: When you are, we will talk again.

jj2007

Quote from: sinsi on January 16, 2019, 08:00:15 PM
My question was for a few reasons.
- I've always shied away from using resources but nowadays with manifests/version blocks/icons you can't really avoid them, so adding a string table is no big deal.

Right, but it's clumsy.

Quote: The Windows API is apparently all unicode, so any A functions are first converted to unicode then back to ansi, how much overhead?

Correct, but it would matter only in a tight loop, a very unlikely case.

Quote: unicode uses more memory. A resource section can be discarded but memory used in .data is always in use.

Doesn't make a difference for the exe size, though.

Quote: Correct me if I'm wrong, but the masm32 unicode macros are really only for converting english to unicode english yes?

It seems so, but I am not the right person to answer this. My macros work with UTF-8 strings (i.e. true Unicode) in the .DATA section that get converted on the fly to UTF-16. That costs a few bytes of overhead (push offset utf8string, call makeutf16), but it allows assembling them directly as shown above (Print "Привет, Мир") without using resources, and for strings longer than 10 chars or so it bloats less.

However, "true" Unicode can work even with plain Masm32:
include \masm32\include\masm32rt.inc
.code
start:
  invoke SetConsoleOutputCP, 65001 ; force UTF8
  print "Привет, Мир"   ; note the lowercase "p" - this is Masm32, not MasmBasic
  exit
end start


This works fine, see above, but the problem is that you can't build it with some IDEs (I use RichMasm, of course).
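
As a rough check on the size argument in this post, the following Python sketch compares the stored size of the same text in UTF-8 and UTF-16, ignoring any BOM or terminator. The helper name encoded_sizes is hypothetical, and the sketch only measures the string data; it says nothing about the size of the conversion routine itself.

```python
def encoded_sizes(text):
    """Return (UTF-8 byte count, UTF-16 byte count) for text, without BOM or terminator."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


for sample in ("Hello, World", "Привет, Мир"):
    u8, u16 = encoded_sizes(sample)
    print(f"{sample!r}: UTF-8 = {u8} bytes, UTF-16 = {u16} bytes")
```

For the ASCII sample UTF-8 is half the size of UTF-16 (12 vs 24 bytes). For the Cyrillic sample the saving is small (20 vs 22 bytes), because each Cyrillic letter takes two bytes in either encoding and only the comma and the space shrink.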

hutch--

sinsi,

The macros will only convert what you can type in ASCII, but a resource script that is written in UNICODE can be compiled into a resource using any character set: Chinese, Southeast Asian, etc. There is a UNICODE editor in MASM32 that will do the job, and there is a tool that converts true UNICODE into DB sequences, so it can be done but takes a bit more work.

If it is a large block of UNICODE text, you can also save it as a binary resource in a resource file.

TimoVJL

WriteConsoleA supports UTF-8 with CP 65001, but some assemblers don't accept a UTF-8 BOM :( so Notepad isn't a good editor for coding.
.386
.model flat,stdcall
option casemap:none

ExitProcess proto stdcall :dword
GetStdHandle proto stdcall :dword
SetConsoleOutputCP proto stdcall :dword
WriteConsoleA proto stdcall :dword,:dword,:dword,:dword,:dword
includelib kernel32

.data
msg db "Привет ASM",10

.code
mainCRTStartup proc C
invoke SetConsoleOutputCP, 65001 ; 65001 = CP_UTF8
invoke GetStdHandle, -11 ; -11 = STD_OUTPUT_HANDLE
invoke WriteConsoleA,eax,addr msg,sizeof msg,0,0 ; sizeof msg = length in bytes of the UTF-8 text
invoke ExitProcess, 0
mainCRTStartup endp
end

EDIT: UAsm had BOM handling in version 2.38? http://masm32.com/board/index.php?topic=6422.msg68795#msg68795
But not in "UASM v2.47, Nov 17 2018, Masm-compatible assembler":
TestMasm32Cyr.asm(1) : Error A2210: Syntax error: ´╗┐include
May the source be with you
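
The three odd characters in front of "include" in that error message look like the UTF-8 BOM (bytes EF BB BF) displayed in an OEM console code page. The Python sketch below reproduces that under the assumption that the console used code page 850, and shows one way to strip the BOM before handing a file to an assembler that rejects it. The helper name strip_utf8_bom and the sample source line are illustrative only.

```python
import codecs


def strip_utf8_bom(data):
    """Return data without a leading UTF-8 BOM (EF BB BF), if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


# A source file as an editor that writes a BOM would save it (sample content)
source = codecs.BOM_UTF8 + b"include \\masm32\\include\\masm32rt.inc\r\n"

# Interpreted as code page 850 (an assumption), the BOM bytes become visible debris
print(source[:10].decode("cp850"))

# After stripping, the file starts with the directive itself
print(strip_utf8_bom(source)[:7].decode("ascii"))
```

A file without a BOM passes through strip_utf8_bom unchanged, so the step is safe to apply unconditionally in a build script.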

daydreamer

Quote from: sinsi on January 15, 2019, 06:48:57 PM
For our non-English speaking members, is EN-US good enough or would you prefer your native language?
I am considering moving to all unicode but it is a bit more effort, is the effort justified?

One problem is getting a good translation, forget google et al, it needs to be a native speaker imho.
I prefer to use English every day so I get exercise, and sometimes I meet a new person on the internet who introduces me to new words. I think it's maybe best to keep comments in code in English; it becomes too hard for a newbie to read both advanced code and comments that nobody can read.
So what does everybody think about commenting code in a native language vs English?
Do you want me to keep trying to write English code comments? (Not sure if "code comments" is correct English.)
my none asm creations
https://masm32.com/board/index.php?topic=6937.msg74303#msg74303
I am an Invoker
"An Invoker is a mage who specializes in the manipulation of raw and elemental energies."
Like SIMD coding

aw27

Assemblers don't need a BOM (if there is one, it is useful if they can skip it; this was the UASM idea, but it may have been forgotten somewhere or removed on purpose). Editors may need a BOM (unless they guess, or wait for you to tell them what the text encoding is).

aw27

@daydreamer,
Even if you comment only in English you may need to use Unicode to insert an emoticon or the Swedish flag. This is trendy, see Instagram, etc.

hutch--

I have never gone the route of adding the BOM in UNICODE as I only work on x86/64 and simply don't need it. With an assembler, in x86/64, multi-port to other hardware is not a consideration so why bother.

daydreamer

Quote from: AW on January 17, 2019, 07:50:33 AM
@daydreamer,
Even if you comment only in English you may need to use Unicode to insert an emoticon or the Swedish flag. This is trendy, see Instagram, etc.
http://www.unicode.org/charts/
I have checked this page earlier and found lots of useful things, especially the game symbols that might be useful if you want to make a Unicode text-based game without needing to make graphics for card games, mahjong, or chess, but I haven't found flags with a cross yet: Swedish, Danish, Finnish.
That would be a good combination in a game: Unicode characters for games plus emoticons that you use when the player wins/loses/ties.

But thanks, a Swedish flag might be a good idea to use.
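
On the missing flags: national flags do not appear in the charts as single characters. Each one is encoded as a pair of regional indicator symbols (U+1F1E6 to U+1F1FF) spelling the two-letter country code, and whether the pair is drawn as a flag depends on the font and platform. A minimal Python sketch, with the helper name flag_emoji made up for illustration:

```python
def flag_emoji(country_code):
    """Build a flag sequence from a two-letter ISO 3166-1 alpha-2 country code."""
    base = 0x1F1E6  # REGIONAL INDICATOR SYMBOL LETTER A
    return "".join(chr(base + ord(c) - ord("A")) for c in country_code.upper())


for code in ("SE", "DK", "FI"):
    flag = flag_emoji(code)
    points = " ".join(f"U+{ord(c):04X}" for c in flag)
    units = len(flag.encode("utf-16-le")) // 2
    print(code, points, "->", units, "UTF-16 code units")
```

Each flag is two code points outside the Basic Multilingual Plane, so it occupies four UTF-16 code units, which matters when sizing buffers for a text-based game.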
