HASM 2.32 Release

Started by johnsa, May 16, 2017, 09:11:28 PM


johnsa

Minor Update just to mark the release of HASM (new name of HJWASM) as 2.32.

Changes:
1) Added initial file BOM check for UTF-8
2) Removed string literal support from invoke (samples and documentation updated accordingly) to use CSTR and WSTR.

Change 2 ensures we keep backwards compatibility; this new type of functionality will instead be added under a totally new procedure calling/invocation scheme.
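The announcement does not show how the BOM check in change 1 works. As a minimal Python sketch of the general idea, assuming the check simply looks for the three-byte UTF-8 signature EF BB BF at the start of the file and skips it (the helper name `strip_utf8_bom` is hypothetical and is not HASM's code):

```python
import codecs

def strip_utf8_bom(data):
    """Return (payload, had_bom) for the raw bytes of a source file.

    Hypothetical helper: it illustrates a UTF-8 BOM check in general,
    not how HASM implements it.
    """
    if data.startswith(codecs.BOM_UTF8):  # b'\xef\xbb\xbf'
        return data[len(codecs.BOM_UTF8):], True
    return data, False

source = codecs.BOM_UTF8 + b".code\r\n"
payload, had_bom = strip_utf8_bom(source)
print(had_bom, payload)
```

A file without the signature comes back unchanged, so plain ASCII sources are not affected by such a check.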

TWell

This modified mineiro example works, but not in the same way as asmc.
.386
.model flat,stdcall

option dllimport:<kernel32.dll>
GetStdHandle proto :dword
WriteFile proto :dword,:dword,:dword,:dword,:dword
ExitProcess proto :dword
SetConsoleOutputCP proto :dword
STD_OUTPUT_HANDLE equ -11
CP_UTF8 equ 65001

.data?
houtput dd ?
nada dd ?

.data
utf8 db "accents: áéíóúçã"

.code
start:
invoke SetConsoleOutputCP,CP_UTF8
invoke GetStdHandle,STD_OUTPUT_HANDLE
mov houtput,eax
invoke WriteFile,houtput,addr utf8,sizeof utf8,addr nada,0
invoke ExitProcess,0
end start
I hope that people in this forum decide what the correct way to do that is in Win32 and Linux.
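One detail worth keeping in mind with this example: `sizeof utf8` counts bytes, not characters, so the length passed to WriteFile depends on the encoding the source file was saved in. A small Python sketch of the arithmetic, assuming the assembler copies the bytes between the quotes unchanged and that the accented letters are stored as single (precomposed) code points:

```python
# "accents: áéíóúçã" written with escapes so the code points are unambiguous
text = "accents: \u00e1\u00e9\u00ed\u00f3\u00fa\u00e7\u00e3"

as_utf8 = text.encode("utf-8")      # source file saved as UTF-8
as_cp1252 = text.encode("cp1252")   # source file saved in a Windows ANSI code page

print(len(text))       # 16 characters
print(len(as_utf8))    # 23 bytes: 9 ASCII bytes + 7 accented letters * 2 bytes
print(len(as_cp1252))  # 16 bytes: one byte per character
```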

jj2007

Quote from: TWell on May 17, 2017, 02:50:51 AM
This modified mineiro example works, but not in the same way as asmc.

For me, it works with both assemblers, source and exes attached. What exactly did not work?

TWell

My bad :(
I rechecked the object files.
Both store the UTF-8 bytes in the same way.

mineiro

It would be nice if you added a command-line switch to select the source file encoding (ascii/unicode/utf8/utf16 ...), with ascii as the default.
This way nobody needs to change their programming style.
If a person likes to write their unicode strings using ascii editors and skills, ok.
Suppose we have:
.if al == "a"
In ascii scope that is a byte and works ok. If the person saved the file as unicode, it will report an error, because "a" will be 0041h, a word, which does not fit in al; ok too, a valid error. If the person saved it as utf8, "a" is 1 byte, the same as in ascii, and it works. If the person saved it as utf16, it is a word and an error is reported.
So if we have this:
.if eax == "㑖"
If a person doesn't like to write .if eax == "㑖" using a unicode/utf8/... text editor, we know the same thing can be done with ".if eax == hexnumberh" in an ascii editor. This way, the talk about wstr and cstr will continue from an ascii point of view only.
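The sizes described above can be checked outside the assembler. A Python sketch, assuming "unicode" in this post means UTF-16 as used by Windows; the ideograph is written as the escape U+3456, which is an assumption about the exact character shown (any character between U+0800 and U+FFFF gives the same sizes):

```python
def encoded_size(ch, encoding):
    """Byte length of one character in the given encoding, or None if it cannot be encoded."""
    try:
        return len(ch.encode(encoding))
    except UnicodeEncodeError:
        return None

for name, ch in (("a", "a"), ("ideograph", "\u3456")):
    for enc in ("ascii", "utf-8", "utf-16-le"):
        print(name, enc, encoded_size(ch, enc))
```

So "a" fits in al only when the file is ascii or utf8, while the ideograph takes 3 bytes in utf8 and 2 bytes in utf16, both of which fit in eax.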

--edited--
This way, you don't need improvements to the hasm syntax, and that's fine, sir johnsa. And because the whole text file is interpreted in one scope, you give us a lot of freedom: we can not only use unicode chars while constructing strings, but also create variable and function names using unicode chars.
☕ dd 10

I'm reading the primary sources below on this subject:
https://tools.ietf.org/html/rfc3629
www.unicode.org/versions/Unicode9.0.0/ch02.pdf
I'd rather be this ambulant metamorphosis than to have that old opinion about everything

johnsa

I think the big problem with full utf-8 / unicode support is that internally the string comparisons, symbol table, tokenizer and parser have all been designed around ascii only. Changing it to make it fully portable would probably require a rather major rewrite. I'm not sure how far Nidud has gotten with utf support in asmc (unicode/utf16/utf8 etc.); I know he has added code page support.

nidud

deleted

mineiro

Quote from: johnsa on May 17, 2017, 06:32:08 AM
I think the big problem with full utf-8 / unicode support is that internally the string comparisons, symbol table, tokenizer and parser have all been designed around ascii only. Changing it to make it fully portable would probably require a rather major rewrite. I'm not sure how far Nidud has gotten with utf support in asmc (unicode/utf16/utf8 etc.); I know he has added code page support.
http://masm32.com/board/index.php?topic=6221.msg66580#msg66580
The link above has a utf8 text walker. The function is_valid_utf8_encode returns the size of a char in bytes: it can return 1, 2, 3 or 4 bytes for one char that the user sees while coding.
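The linked walker is not reproduced here. As a rough Python sketch of the same idea, assuming the usual RFC 3629 rules, a function can read the lead byte to get the sequence length and then validate the whole sequence; `utf8_char_size` and `walk` are hypothetical names, not the linked is_valid_utf8_encode:

```python
def utf8_char_size(buf, pos=0):
    """Return the size in bytes (1-4) of the UTF-8 char at buf[pos], or 0 if invalid."""
    if pos >= len(buf):
        return 0
    lead = buf[pos]
    if lead < 0x80:
        return 1                      # plain ASCII
    if 0xC2 <= lead <= 0xDF:
        size = 2
    elif 0xE0 <= lead <= 0xEF:
        size = 3
    elif 0xF0 <= lead <= 0xF4:
        size = 4
    else:
        return 0                      # continuation byte or invalid lead byte
    try:
        # Python's strict decoder rejects truncated, overlong and surrogate forms.
        bytes(buf[pos:pos + size]).decode("utf-8")
    except UnicodeDecodeError:
        return 0
    return size

def walk(buf):
    """Walk a UTF-8 buffer and return the byte size of each char."""
    sizes, pos = [], 0
    while pos < len(buf):
        n = utf8_char_size(buf, pos)
        if n == 0:
            raise ValueError(f"invalid UTF-8 at offset {pos}")
        sizes.append(n)
        pos += n
    return sizes

print(walk("a\u00e1\u3456\U0001F600".encode("utf-8")))  # [1, 2, 3, 4]
```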

Oops, now I understand: if string comparisons only deal with values <= 7fh, then it can't be done. I had assumed that string comparisons in the hasm source code were done as hex comparisons. Now I get the point. Thank you for answering and also for thinking about this subject.

mineiro

Quote from: nidud on May 17, 2017, 06:58:24 AM
áéíóúçã

If you use a windows editor like Notepad the string is saved as:
E1 E9 ED F3 FA E7 E3

And if you write it in a text mode editor the string is saved as:
A0 82 A1 A2 A3 87 84 - (860)
...
If you use the right tool for the job you don't need to bother with all this Unicode stuff. However, if you use a Windows editor to write a console application you need to translate the text somehow and use the W functions. This adds code to the application.
hello sir nidud
The example that you posted is ambiguous, and to avoid this ambiguity they created unicode (aka Universal). And to keep the basic latin (ascii) chars one byte long, they encoded it as utf8.
The letter á can be E1 or A0; how can we avoid this? Luckily they have different hex representations and they do not overlap in this example.
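The two byte dumps in the quote can be reproduced with Python's codecs, taking Windows-1252 as the Notepad (ANSI) code page and 860 as the text mode code page; cp1252 is an assumption here, since the quote names only 860. Decoding the same byte under the other code page shows why a bare byte value is ambiguous:

```python
text = "\u00e1\u00e9\u00ed\u00f3\u00fa\u00e7\u00e3"  # áéíóúçã

ansi = text.encode("cp1252")
oem = text.encode("cp860")
print(ansi.hex(" ").upper())  # E1 E9 ED F3 FA E7 E3
print(oem.hex(" ").upper())   # A0 82 A1 A2 A3 87 84

# The same byte value means different things depending on the code page.
print(repr(bytes([0xE1]).decode("cp1252")))  # 'á'
print(repr(bytes([0xE1]).decode("cp860")))   # 'ß'
```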
Let's convert these symbols to unicode the easy way (not correct, I suppose), which is to insert a zero to the left of the char.
00e1h != 00a0h
á != á
So, as you can see, that's not valid: the same intended char maps to different codes. In unicode we can print the letter 'a' and afterwards add the accent to that same symbol, as chapter 2 says.
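The last point, writing 'a' first and adding the accent afterwards, refers to Unicode combining characters. A short Python sketch, using U+0301 (COMBINING ACUTE ACCENT) and the standard unicodedata module:

```python
import unicodedata

precomposed = "\u00e1"     # á as a single code point
decomposed = "a\u0301"     # 'a' followed by COMBINING ACUTE ACCENT

print(precomposed == decomposed)                                # False
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
print(decomposed.encode("utf-8").hex(" "))                      # 61 cc 81
```

Both forms display as the same letter, but they are different sequences until they are normalized.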

jj2007

Quote from: johnsa on May 17, 2017, 06:32:08 AM
I think the big problem with full utf-8 / unicode support is that internally the string comparisons, symbol table, tokenizer and parser have all been designed around ascii only. Changing it to make it fully portable would probably require a rather major rewrite.

It is not a big problem, simply because we need to distinguish two entirely different things:
1. Do we need non-Latin chars in symbols, labels and commands?
2. Do we need non-Latin chars in strings?

Version 1 is this: Печать "Hello World"

Version 2 is this: print "Привет Мир"

1. With current assemblers, Печать "Hello World" is impossible. No problem, nobody needs that. Russian programmers do not use Russian commands in their code (if you don't agree, gimme a link to one example at least); they use English, because that is the language of their compiler. Same for Chinese, Arabic, Japanese and North Korean coders.

2. With current assemblers, print "Привет Мир" is possible. Even the old Masm 6.14 that comes along with the Masm32 SDK can flawlessly produce code that displays Привет Мир a) in the editor, and b) in the console.

It is not a problem of the assembler - that is just a dumb tool, and it never had any interest in the stupid things that coders put "inside quotes". They are Chinese anyway for the assembler (no insult intended, I like Chinese characters).

Conclusion: THERE IS NO PROBLEM, except perhaps that very few editors can handle UTF-8 correctly 8)
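A toy Python sketch of the distinction between version 1 and version 2, assuming a scanner that only understands ASCII outside quotes and copies the bytes inside quotes untouched; `split_line` is hypothetical and far simpler than any real assembler's tokenizer:

```python
def split_line(line):
    """Split a source line (bytes) into ('code', bytes) and ('string', bytes) pieces.

    Bytes inside double quotes are passed through unchanged; any non-ASCII
    byte outside quotes is rejected, as an ASCII-only tokenizer would do.
    """
    pieces, pos = [], 0
    while pos < len(line):
        if line[pos:pos + 1] == b'"':
            end = line.find(b'"', pos + 1)
            if end < 0:
                raise ValueError("unterminated string literal")
            pieces.append(("string", line[pos + 1:end]))
            pos = end + 1
        else:
            end = line.find(b'"', pos)
            if end < 0:
                end = len(line)
            code = line[pos:end]
            if any(b > 0x7F for b in code):
                raise ValueError("non-ASCII byte outside a string literal")
            pieces.append(("code", code))
            pos = end
    return pieces

print(split_line('print "Привет Мир"'.encode("utf-8")))   # accepted
try:
    split_line('Печать "Hello World"'.encode("utf-8"))
except ValueError as err:
    print("rejected:", err)
```

Searching for the closing quote byte is safe on UTF-8 input because the bytes of a multi-byte sequence are always 80h or above and can never equal 22h.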

nidud

deleted

jj2007

Yes, nidud, I think everybody has now understood that you are not a fan of UTF-8, and that every serious coder in the world should use only the native codepage of his system. Norwegian, in your case. You are free to do that, I am free to ignore you.

nidud

deleted

nidud

deleted

mineiro

hello nidud;
This talk about utf8 on the windows O.S. doesn't sound very logical, because the default there is unicode (utf16). So we need to transform utf8 to unicode to have more flexibility on windows, while on linux we have this flexibility from the ground up.
I'm talking about utf8 just because xml files are now utf8 encoded. The intention is to create a universal gui with universal chars.
I have insisted with you because of this. Like you, I can survive without utf8. But the future is becoming the present.
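The conversion mentioned above, from utf8 to the UTF-16 that the Windows W functions expect, can be sketched in Python. In a Win32 program the usual route is MultiByteToWideChar with CP_UTF8, which this sketch does not call; `utf8_to_utf16le` is a hypothetical helper name:

```python
def utf8_to_utf16le(data):
    """Re-encode UTF-8 bytes as UTF-16LE code units, without a BOM."""
    return data.decode("utf-8").encode("utf-16-le")

utf8 = "\u00e1".encode("utf-8")   # á in utf8:  c3 a1
wide = utf8_to_utf16le(utf8)      # á in utf16: e1 00 (the word 00e1h, little-endian)
print(utf8.hex(" "), "->", wide.hex(" "))
```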