Re: How to generate an Unicode string under MASM 6.15?

Started by nidud, May 05, 2017, 11:15:13 PM

nidud

#90
deleted

mineiro

hehehe, this is getting funny;

Well, sir nidud, let's go back to the past, to a time when we were not programmers and had no computer skills.
So, as a normal user, my boss tells me to find a program for inventory control. I searched the internet and found none, in any language. But I have a creative mind: I looked at the screens of different programs and did not recognize the letters used in them, but I did recognize the objects, controls like buttons, list views, text views, edit boxes, ... So I can use a program like an appointment book as an inventory control, for example, instead of putting my fingers into the resources or changing them with a program like Resource Hacker.

You are saying that Russians, Chinese and Brazilians don't like to write their source code comments in their mother languages. I'm telling you the contrary: I like to write source code comments in my mother language, because my vocabulary there is much bigger than in my poor English. Also, there are words in my language that don't exist in other languages. So what's wrong with this? Why can't I create labels, function names and variables in my mother language? I need to remove an accent from a variable name to make it acceptable to the assembler, so I end up writing my own language the wrong way.
Computers came to make our life easy, not hard.

The Chinese case is more difficult: more than 3 different languages, plus dialects, are spoken inside China. Their writing system cannot fit in a space of 256 symbols. So, let's exclude the Chinese? Let's change Chinese culture to be like ours? No.

Quote
So if we translate JWasm to Russian how would this be done? Well, I'm not capable of doing that, and assuming you don't understand Russian either we will need a Russian to do the actual translation. JWasm is a console application using ASCII strings, so this Russian fellow have to live in Russia, or at least use a Russian OS to write the ASCII strings.

Do we need to convert any of these strings to Unicode? No. Do we need to see these Russian ASCII strings? No. Do we need to use this version of JWasm? No, this is for Russian consumption only.

If you have searched for assembly source code in the last 10 years, you have met this board, but you have probably met the Russian wasm board too.
We are talking about symbols: the assembler does not need to care whether a label string was written in one form or another; to the assembler it is still an identifier.
The point is that by using codepages I am not able to deal with more than 2 languages in the same source code. So I cannot translate a Bible written in Aramaic to Latin and afterwards translate it to Portuguese.
I recognize my faults, my errors. Just because I am not able to speak different languages does not mean that others are like me.
I'd rather be this ambulant metamorphosis than to have that old opinion about everything

nidud

#92
deleted

mineiro

hello sir nidud
yes, it's me   :biggrin:

Quote
No, I'm saying they do this all the time so that's a lie.
Yes, I agree with you, you did not say that. My fault. Sorry. I did not express myself the way I wanted.

Quote
Not sure what you trying to say here but if you read back your own posts it's clear that you understand most of the technical stuff and the limitations with regards to solve it. I think the problem is that you mixing programming language, which in itself is limited, with other types of software that don't have these limitations.
The limit of a programming language is our mind.
If I open an image editing program I can program in assembly language. I need to know how to calculate addresses in my head and to see the 'color' that matches each hex number. A real example: I open an image editing program in grayscale mode, I put the color 'nop' (90h) in the first position, the color 'int' (0cdh) in the second position, and the color 20h in the third position. I then export that 'image' to disk in raw format, rename it to .com, and it works. "Cognitive parallax".
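The grayscale trick comes down to writing three bytes to a file: 90h is `nop`, and 0CDh 20h is `int 20h`, which ends a DOS .com program. A rough C sketch of the same thing, assuming a DOS box or emulator to run the result; the names `build_tiny_com` and `write_tiny_com` are made up for illustration:

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Fills out[] with the three "pixels" from the post:
   90h = nop, 0CDh 20h = int 20h (terminate a DOS .com program).
   Returns the number of bytes written, or 0 if the buffer is too small.
   Hypothetical helper name. */
size_t build_tiny_com(unsigned char *out, size_t cap)
{
    static const unsigned char image[3] = { 0x90, 0xCD, 0x20 };
    size_t i;

    assert(out != NULL);
    if (cap < sizeof image)
        return 0;
    for (i = 0; i < sizeof image; i++)
        out[i] = image[i];
    return sizeof image;
}

/* Writes the image to disk in raw form; returns 0 on success, -1 on error. */
int write_tiny_com(const char *path)
{
    unsigned char buf[3];
    size_t n = build_tiny_com(buf, sizeof buf);
    FILE *f = fopen(path, "wb");    /* "wb": no newline translation */
    int ok;

    if (f == NULL)
        return -1;
    ok = (fwrite(buf, 1, n, f) == n);
    if (fclose(f) != 0)
        ok = 0;
    return ok ? 0 : -1;
}
```

Calling `write_tiny_com("tiny.com")` should leave a 3-byte file, the same bytes the image editor exports.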

If I talk today about a 'time machine', what do people understand?
They understand that we can travel in time/space.
The real purpose of time machines was to try to predict the next season for planting and growing, and the rainy seasons.

-edited-
https://www.youtube.com/watch?v=7Y_SQBdVHQk

mineiro

#94
hello sir nidud;
this is what I have done
UTF-8 PSEUDOCODE

if first byte is between <00h to 7fh> then it is ascii (<= 7fh)
if first byte is between <0c0h to 0dfh> then one more byte follows
if second byte is between <80h to 0bfh> it is a valid char
if first byte is between <0e0h to 0efh> then two more bytes follow
if second and third bytes are between <80h to 0bfh> it is a valid char
if first byte is between <0f0h to ...h> then three more bytes follow
if second, third and fourth bytes are between <80h to 0bfh> it is a valid char
...
110b = 2, 1110b = 3, 11110b = 4, 111110b = 5 bytes , ...
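The bit patterns in the last line can be turned into a length by counting the leading 1 bits of the first byte. A small C sketch of that idea; the function name is invented here, and 5-byte and longer forms are treated as invalid because current UTF-8 (RFC 3629) stops at 4 bytes:

```c
#include <assert.h>
#include <stddef.h>

/* Returns how many bytes the UTF-8 sequence starting at p claims to use,
   judged only from the leading 1 bits of the first byte:
   0xxxxxxx = 1, 110xxxxx = 2, 1110xxxx = 3, 11110xxx = 4.
   Returns 0 for a continuation byte (10xxxxxx) or for 5+ leading ones.
   The following bytes are not checked here. */
int utf8_claimed_len(const unsigned char *p)
{
    unsigned char b;
    int ones = 0;

    assert(p != NULL);
    b = *p;
    while (b & 0x80) {              /* count leading 1 bits */
        ones++;
        b = (unsigned char)(b << 1);
    }
    if (ones == 0)
        return 1;                   /* plain ascii */
    if (ones == 1 || ones > 4)
        return 0;                   /* not a usable first byte */
    return ones;
}
```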

note:
Using ascii chars, the hex values below cannot appear inside raw text files (.txt, not structured):
00h,01h,02h,03h,04h,05h,06h,07h,08h,   ,   ,0bh,0ch,   ,0eh,0fh ;07h is bell, 08h is backspace, 0bh is vertical tab
10h,11h,12h,13h,14h,15h,16h,17h,18h,19h,1ah,1bh,1ch,1dh,1eh,1fh ;1bh is the escape key (27 decimal, not 27h)
                                                           ,7fh ;7fh is delete
The hex values above are useful to text editors for keeping the text viewer and the text buffer in sync.

09h is tab, 0ah is line feed, 0dh is carriage return ;abstraction: a mechanical typewriter
LF happens when we press the arm of the typewriter to feed the paper, rolling it by the pressure that turns the rotor
CR happens when we push the pressed typewriter arm all the way back
TAB happens when we control how far the arm moves
SPACE happens when we move the arm one step forward
I have excluded the bell sound made when the arm reaches the end of the paper
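The exclusion list above can be written as a byte test: everything below 20h except tab, LF and CR, plus 7fh. A minimal C sketch, with function names that are only illustrative:

```c
#include <assert.h>
#include <stddef.h>

/* 1 if the byte is a control code that plain text should not contain:
   everything below 20h except tab (09h), LF (0ah) and CR (0dh),
   plus 7fh (delete). */
int is_forbidden_control(unsigned char b)
{
    if (b == 0x09 || b == 0x0A || b == 0x0D)
        return 0;
    return b < 0x20 || b == 0x7F;
}

/* 1 if no forbidden control byte occurs in buf[0..len). */
int looks_like_text(const unsigned char *buf, size_t len)
{
    size_t i;

    assert(buf != NULL || len == 0);
    for (i = 0; i < len; i++)
        if (is_forbidden_control(buf[i]))
            return 0;
    return 1;
}
```

A file that fails `looks_like_text` is probably binary, or a 16-bit encoding full of 00h bytes.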

Conclusion:
utf-8 does not use chars in the range (80h to 0bfh) as the first byte
      does not use chars in the ranges (00h to 7fh) and (0c0h to 0ffh) as the second, third or fourth byte

cr, lf, tab and space are valid chars in every case
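Because first bytes and following bytes never share a value, each byte can be given a role on its own. The C sketch below does that and uses it to step to the next char start from the middle of a buffer. The enum and function names are invented, and the extra "never" class (0c0h, 0c1h, 0f5h and up) comes from the UTF-8 definition in RFC 3629, not from the conclusion above:

```c
#include <assert.h>
#include <stddef.h>

enum utf8_role { U8_ASCII, U8_CONT, U8_LEAD, U8_NEVER };

/* Role of a single byte, decided without looking at its neighbours. */
enum utf8_role utf8_role_of(unsigned char b)
{
    if (b <= 0x7F) return U8_ASCII;     /* 00h..7fh: stands alone        */
    if (b <= 0xBF) return U8_CONT;      /* 80h..0bfh: 2nd/3rd/4th byte   */
    if (b == 0xC0 || b == 0xC1 || b >= 0xF5)
        return U8_NEVER;                /* never appears in valid UTF-8  */
    return U8_LEAD;                     /* 0c2h..0f4h: first byte        */
}

/* Skips continuation bytes so that the result points at the start of a
   char (or at end). Works from any position inside the buffer. */
const unsigned char *utf8_next_start(const unsigned char *p,
                                     const unsigned char *end)
{
    assert(p != NULL && end != NULL && p <= end);
    while (p < end && utf8_role_of(*p) == U8_CONT)
        p++;
    return p;
}
```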


So, if every value of the ascii table must stay possible in the input, we can't tell the encodings apart with certainty.
A scanner example: search for space, tab, cr, lf (non-printable chars). Telling ascii from unicode is easy: a space is 0020h or 2000h (little/big endian) in unicode and 20h in ascii. But in utf-8 and in ascii these chars are the same.
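The scanner idea can be sketched in C: look at byte pairs and see on which side of a space, tab, CR or LF the zero byte sits. This is only a heuristic under the assumption that the text contains at least one such delimiter, and the function names are made up:

```c
#include <assert.h>
#include <stddef.h>

/* 1 for the four delimiters named in the post. */
static int is_delim(unsigned char b)
{
    return b == 0x20 || b == 0x09 || b == 0x0D || b == 0x0A;
}

/* Guesses the byte order of 16-bit text. Returns 'L' (little endian,
   20h 00h), 'B' (big endian, 00h 20h) or 0 when neither pattern is
   found, which is what ascii and utf-8 input produce. */
int guess_utf16_order(const unsigned char *buf, size_t len)
{
    size_t i, le = 0, be = 0;

    assert(buf != NULL || len == 0);
    for (i = 0; i + 1 < len; i += 2) {
        if (is_delim(buf[i]) && buf[i + 1] == 0x00)
            le++;
        else if (buf[i] == 0x00 && is_delim(buf[i + 1]))
            be++;
    }
    if (le > be) return 'L';
    if (be > le) return 'B';
    return 0;
}
```

A result of 0 still leaves the ascii versus utf-8 question open, which is the hard part discussed next.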
We can try to predict the encoding of a text by the utf-8 rules; the entropy of the language may or may not fit those rules.
início:      utf8 í == 0c3h 0adh
that is: ascii (in) utf8 (í) ascii (cio)

That word written in (extended) ascii will not fit the utf8 rules (language entropy).
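A C sketch of that prediction, checking structure only (a first byte followed by the right number of 80h..0bfh bytes). It assumes the codepage spelling of "início" uses 0edh for í, as Latin-1 and Windows-1252 do; the enum and function names are illustrative:

```c
#include <assert.h>
#include <stddef.h>

enum text_guess { GUESS_ASCII, GUESS_UTF8, GUESS_CODEPAGE };

/* Structural check only: overlong forms and surrogates are not
   rejected here. Any byte above 7fh that breaks the utf8 rules makes
   the whole text count as codepage (extended ascii) text. */
enum text_guess guess_text(const unsigned char *s, size_t len)
{
    size_t i = 0, k, need;
    int saw_multibyte = 0;

    assert(s != NULL || len == 0);
    while (i < len) {
        unsigned char b = s[i];
        if (b <= 0x7F) { i++; continue; }
        if (b >= 0xC2 && b <= 0xDF)      need = 1;
        else if (b >= 0xE0 && b <= 0xEF) need = 2;
        else if (b >= 0xF0 && b <= 0xF4) need = 3;
        else return GUESS_CODEPAGE;      /* stray 80h..0c1h or 0f5h+ */
        if (len - i <= need)
            return GUESS_CODEPAGE;       /* sequence cut off at the end */
        for (k = 1; k <= need; k++)
            if (s[i + k] < 0x80 || s[i + k] > 0xBF)
                return GUESS_CODEPAGE;
        saw_multibyte = 1;
        i += need + 1;
    }
    return saw_multibyte ? GUESS_UTF8 : GUESS_ASCII;
}
```

With 0c3h 0adh the word is accepted as utf8; with a single 0edh the next byte is 'c', which is not in 80h..0bfh, so the word is classified as codepage text.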
So, to improve the prediction we can also test for the excluded ascii chars (00h, 01h, ...), because they are not possible in text files, even though assemblers accept them.
But this can still fail, both on language entropy and on the excluded ascii chars.

So, to solve this problem you can create a switch on the command line to tell the assembler whether the input source file is unicode/utf8/ascii. Default: ascii.
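A sketch of such a switch in C. The `-enc:<name>` spelling is invented for this example and is NOT an existing asmc, JWasm or MASM option; the enum and function names are hypothetical too:

```c
#include <assert.h>
#include <string.h>

enum src_encoding { SRC_ASCII, SRC_UTF8, SRC_UTF16 };

/* Scans argv for a made-up "-enc:<name>" switch (not a real assembler
   option). The last switch wins; an absent or unknown name falls back
   to ascii, the default proposed in the post. */
enum src_encoding parse_encoding_switch(int argc, char **argv)
{
    enum src_encoding enc = SRC_ASCII;
    int i;

    assert(argv != NULL || argc == 0);
    for (i = 1; i < argc; i++) {
        if (strncmp(argv[i], "-enc:", 5) != 0)
            continue;                   /* some other argument */
        if (strcmp(argv[i] + 5, "utf8") == 0)
            enc = SRC_UTF8;
        else if (strcmp(argv[i] + 5, "unicode") == 0)
            enc = SRC_UTF16;
        else
            enc = SRC_ASCII;
    }
    return enc;
}
```

With an explicit switch the assembler does not have to guess at all; the prediction code is only needed when the user says nothing.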

edit--
minor correction on pseudocode

nidud

#95
deleted

mineiro

hello sir nidud;
I keep trying to make you think about this:
início db "início" ; início is a label, a string (ascii or utf8) and a comment

Quote from: nidud on May 12, 2017, 06:18:56 AM
Given we have no idea whats above 0x7F we just have to accept all chars above as done in the table example.
We do have an idea, no? 7Fh is the delete key on the keyboard. Ascii rules.
OK, above that we can't conclude anything.
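The approach in the quote, accepting every byte above 7Fh inside identifiers, could look like this in C. This is a sketch only: the ascii set (letters, digits, `_ @ $ ?`) is an assumption modeled on MASM-style identifiers and is not taken from any assembler's source, and the function names are made up:

```c
#include <assert.h>
#include <stddef.h>

/* 1 if the byte may appear inside an identifier. Bytes 80h..0ffh are
   all accepted because the scanner cannot tell what they encode. */
int is_ident_byte(unsigned char b)
{
    if (b >= 0x80) return 1;
    if (b >= '0' && b <= '9') return 1;
    if (b >= 'a' && b <= 'z') return 1;
    if (b >= 'A' && b <= 'Z') return 1;
    return b == '_' || b == '@' || b == '$' || b == '?';
}

/* Length in bytes of the identifier at the start of s, or 0 if there
   is none. A leading digit is rejected. */
size_t ident_len(const unsigned char *s, size_t len)
{
    size_t n = 0;

    assert(s != NULL || len == 0);
    if (len > 0 && s[0] >= '0' && s[0] <= '9')
        return 0;
    while (n < len && is_ident_byte(s[n]))
        n++;
    return n;
}
```

With this rule the utf8 label `início:` scans as one 7-byte identifier followed by the colon, and the same label typed in a codepage scans as 6 bytes. The scanner never needs to know which encoding it is looking at.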

Quote
I will assume the main goal here is to be able to use the Portuguese language included chars above 127
Thank you for the personalized version, but I'm not thinking only of Portuguese, I'm thinking about all languages.

Særleg == 6 symbols on screen
ascii (S) utf8 (æ) ascii (rleg)
the entropy of that word in your language under the utf8 rules is:
Særleg is acceptable under the utf8 rules, so it's a valid utf8 text string
but if you type the word "Særleg" on your computer in ascii mode only, it cannot fit the utf8 rules, so it's an ascii string.
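The "6 symbols on screen" can be checked against the bytes: in utf8 the word is 7 bytes long (æ is 0c3h 0a6h), and counting every byte that is not in 80h..0bfh gives the number of chars. A short C sketch, assuming the input is already valid utf8; the function name is illustrative:

```c
#include <assert.h>
#include <stddef.h>

/* Counts chars in UTF-8 text by counting every byte that is not a
   continuation byte (80h..0bfh). Assumes valid UTF-8 input; combining
   marks would be counted as separate symbols. */
size_t utf8_symbol_count(const unsigned char *s, size_t len)
{
    size_t i, n = 0;

    assert(s != NULL || len == 0);
    for (i = 0; i < len; i++)
        if ((s[i] & 0xC0) != 0x80)
            n++;
    return n;
}
```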

nidud

#97
deleted

mineiro

well sir nidud, I hope I'm not disturbing you, ok.

mineiro@assembly:~/.wine/drive_c$ wine asmc /pe test.asm
Asmc Macro Assembler Version 2.24G
Portions Copyright (c) 1992-2002 Sybase, Inc. All Rights Reserved.

Assembling: test.asm
fixme:ntdll:find_reg_tz_info Can't find matching timezone information in the registry for bias 180, std (d/m/y): 19/02/2017, dlt (d/m/y): 15/10/2017
test.asm(12) : error A2008: syntax error : S
test.asm(15) : error A2167: unexpected literal found in expression : ício:
test.asm(16) : error A2206: missing operator in expression
test.asm(16) : error A2033: invalid INVOKE argument : 0
test.asm(19) : error A2008: syntax error : início
test.asm(19) : error A2088: END directive required at end of file
mineiro@assembly:~/.wine/drive_c$

asmc does not understand utf8 text files

nidud

#99
deleted

mineiro

I think you have uploaded the wrong version. Of the 3 downloads, 2 are mine, just to re-check.
Date/time also included: this file was built today.
mineiro@assembly:~/.wine/drive_c$ ls asmc
asmc224G-mineiro.zip  asmc.exe
mineiro@assembly:~/.wine/drive_c$ ls asmc.exe -sal
296 -rw-rw-r-- 1 mineiro mineiro 303104 Mai 11 14:39 asmc.exe
mineiro@assembly:~/.wine/drive_c$ wine cmd.exe
Versão do CMD Wine 5.1.2600 (1.6.2)

C:\>asmc
Asmc Macro Assembler Version 2.24G
Portions Copyright (c) 1992-2002 Sybase, Inc. All Rights Reserved.

USAGE: ASMC [ options ] filelist
Use option /? for more info

C:\>dir asmc.exe
O volume na unidade C não tem rótulo.
Número de Série do Volume é 0000-0000

Directory of C:

11/5/2017     14:39       303,104  asmc.exe
       1 file                   303,104 bytes
       0 directories     95,769,042,944 bytes free


C:\>

mineiro

The pseudocode posted before works, but it can accept invalid unicode chars. After rewriting that code 3 times I arrived at this code, not optimized:

include \masm32\include\masm32rt.inc

.686
.xmm

predict_txt PROTO :dword, :dword
is_valid_utf8_encode PROTO :dword, :dword

.data
pbuffchar dd 2 dup (0)

.data?
houtput dd ?
pfile dd ?
szfile dd ?
temp dd ?

.code
start:
invoke GetStdHandle,STD_OUTPUT_HANDLE
mov houtput,eax

mov pfile,InputFile("utf8.txt")
mov szfile,ecx

invoke predict_txt,pfile,szfile

free pfile
inkey "Done..."
invoke ExitProcess, 0

align 16
predict_txt proc _pfile:DWORD,_szfile:DWORD

mov esi,_pfile
mov ecx,_szfile

next_char:
invoke is_valid_utf8_encode,esi,ecx
;returns 0 if invalid utf8 char
;returns -1 if bound error
;return sizeof utf8 char

.if eax == -1 ;abort
jmp quit
.elseif eax == 0 ;invalid utf8, so this can be a char from an extended ascii table
mov eax,1 ;skip just this byte, otherwise the loop below never advances
.elseif eax == 1 ;valid ascii char
;check identifier delimiters
.else ;valid utf8 char, check BOM

.endif
next:
sub ecx,eax
add esi,eax

test ecx,ecx
jnz next_char

quit:
ret
predict_txt endp

align 16
;this function checks for a valid utf8 char in the text
;returns:
;-1 on bound error (not enough bytes left in the text)
;0 if not valid utf8
;otherwise the size of the utf8 char in bytes
is_valid_utf8_encode proc uses esi ecx _ptext:dword,_sztext:dword

LOCAL szbytes:dword

mov szbytes,-1
mov esi,_ptext
mov ecx,_sztext

test ecx,ecx ;eof?
jz done

mov szbytes,0
movzx eax,byte ptr [esi] ;read one byte
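@@: ;counting the utf8 bytes needed to get a valid char
inc szbytes
rcl al,1 ;0??????? 110????? 1110???? 11110??? 111110?? 1111110? 11111110
jc @B
.if szbytes > 1 ;the loop stops one past the last leading 1 bit (110?????=3, 11110???=5),
dec szbytes ;so adjust; without this, 4 byte chars fail the "> 4" test below
.endif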

.if szbytes > ecx
mov szbytes,-1 ;don't have sufficient bytes to be read
jmp done
.endif
.if szbytes > 4 ;not utf8 valid encode
mov szbytes,0
jmp done
.endif

mov szbytes,0
movzx eax,byte ptr [esi] ;read one byte
.if eax <= 07fh
mov szbytes,1
jmp done
.elseif eax >= 0c2h && eax <= 0dfh
.if byte ptr [esi+1] >= 80h && byte ptr [esi+1] <= 0bfh
mov szbytes,2
jmp done
.endif


.elseif eax == 0e0h
.if byte ptr [esi+1] >= 0a0h && byte ptr [esi+1] <= 0bfh
.if byte ptr [esi+2] >= 80h && byte ptr [esi+2] <= 0bfh
mov szbytes,3
jmp done
.endif
.endif
.elseif eax >= 0e1h && eax <= 0ech
.if byte ptr [esi+1] >= 80h && byte ptr [esi+1] <= 0bfh
.if byte ptr [esi+2] >= 80h && byte ptr [esi+2] <= 0bfh
mov szbytes,3
jmp done
.endif
.endif
.elseif eax == 0edh
.if byte ptr [esi+1] >= 80h && byte ptr [esi+1] <= 09fh
.if byte ptr [esi+2] >= 80h && byte ptr [esi+2] <= 0bfh
mov szbytes,3
jmp done
.endif
.endif
.elseif eax >= 0eeh && eax <= 0efh
.if byte ptr [esi+1] >= 80h && byte ptr [esi+1] <= 0bfh
.if byte ptr [esi+2] >= 80h && byte ptr [esi+2] <= 0bfh
mov szbytes,3
jmp done
.endif
.endif


.elseif eax == 0f0h
.if byte ptr [esi+1] >= 90h && byte ptr [esi+1] <= 0bfh
.if byte ptr [esi+2] >= 80h && byte ptr [esi+2] <= 0bfh
.if byte ptr [esi+3] >= 80h && byte ptr [esi+3] <= 0bfh
mov szbytes,4
jmp done
.endif
.endif
.endif

.elseif eax >= 0f1h && eax <= 0f3h
.if byte ptr [esi+1] >= 80h && byte ptr [esi+1] <= 0bfh
.if byte ptr [esi+2] >= 80h && byte ptr [esi+2] <= 0bfh
.if byte ptr [esi+3] >= 80h && byte ptr [esi+3] <= 0bfh
mov szbytes,4
jmp done
.endif
.endif
.endif

.elseif eax == 0f4h
.if byte ptr [esi+1] >= 80h && byte ptr [esi+1] <= 8fh
.if byte ptr [esi+2] >= 80h && byte ptr [esi+2] <= 0bfh
.if byte ptr [esi+3] >= 80h && byte ptr [esi+3] <= 0bfh
mov szbytes,4
jmp done
.endif
.endif
.endif
.endif

done:
mov eax,szbytes
ret
is_valid_utf8_encode endp

end start
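For comparison, the same range checks in C, with the same return convention (-1, 0, or the char size). This is a sketch and not a translation line by line: the special limits for the second byte are kept in two variables instead of separate branches, and the names are invented:

```c
#include <assert.h>
#include <stddef.h>

/* -1 = not enough bytes left, 0 = not valid UTF-8, else the char size. */
int utf8_char_size(const unsigned char *p, size_t left)
{
    unsigned char b, lo = 0x80, hi = 0xBF;   /* limits for the 2nd byte */
    int size, k;

    assert(p != NULL || left == 0);
    if (left == 0)
        return -1;
    b = p[0];
    if (b <= 0x7F) return 1;
    if (b >= 0xC2 && b <= 0xDF)      size = 2;
    else if (b >= 0xE0 && b <= 0xEF) size = 3;
    else if (b >= 0xF0 && b <= 0xF4) size = 4;
    else return 0;                           /* 80h..0c1h, 0f5h..0ffh */
    if ((size_t)size > left)
        return -1;
    if (b == 0xE0) lo = 0xA0;                /* no overlong 3-byte forms */
    if (b == 0xED) hi = 0x9F;                /* no UTF-16 surrogates     */
    if (b == 0xF0) lo = 0x90;                /* no overlong 4-byte forms */
    if (b == 0xF4) hi = 0x8F;                /* nothing above U+10FFFF   */
    if (p[1] < lo || p[1] > hi)
        return 0;
    for (k = 2; k < size; k++)
        if (p[k] < 0x80 || p[k] > 0xBF)
            return 0;
    return size;
}

/* Walks a buffer and returns the number of bytes that are not part of
   a valid char. A bad byte is stepped over one at a time, so the loop
   always makes progress. */
size_t utf8_count_invalid(const unsigned char *p, size_t len)
{
    size_t bad = 0;

    while (len > 0) {
        int n = utf8_char_size(p, len);
        if (n <= 0) { bad++; n = 1; }
        p += n;
        len -= (size_t)n;
    }
    return bad;
}
```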

edited= inserted -1 error

nidud

#102
deleted

mineiro

hello sir nidud;
I think that doesn't work because it uses spaces, and spaces are identifier delimiters in code scope, while inside a string construction " " they are valid.

; Build: asmc /pe test.asm
.486
.model   flat, c
option   dllimport:<msvcrt.dll>

printf   proto :ptr, :vararg
exit   proto :dword

.data
dd 10

.code
↑:
dec   
jnz ↑
printf("%d\n",)
exit(0)

end   ↑

nidud

#104
deleted