UTF-8 to UTF-16

Started by aw27, January 24, 2019, 09:06:53 PM


guga

It would be better to use MultiByteToWideChar, since the text contains characters not used in Latin/Portuguese.

Do you have an example that uses MultiByteToWideChar and WideCharToMultiByte?

I built one years ago for AnsitoUTF8, but I never did the reverse operation and don't know how to do it.

The AnsitoUTF8 I ported looks like this (RosAsm syntax):


Proc AnsitoUTF8Masm:
    Arguments @pszAscii, @pszUTF8, @BomFlag
    Local @lenASCII, @lenUCS2, @lenUTF8, @pszUCS2, @pUnicode, @LenCharString, @lenUTF8Result
    Uses ecx, edi, edx

    xor eax eax
    On D@pszAscii = 0, ExitP
    On D@pszUTF8 = 0, ExitP

    mov edi D@pszUTF8
    ; length of pszUTF8 must be enough; its maximum is (lenASCII*3 + 1)
    call StrLenProc D@pszAscii
    If eax = 0
        mov B$edi 0 | ExitP
    End_If

    mov ecx eax
    shl eax 1
    add eax ecx
    inc eax
    call 'RosMem.VMemAlloc' edi, eax
    If eax = 0
        mov B$edi 0 | ExitP
    End_If
    mov D@pUnicode eax
    mov D@LenCharString ecx
    inc ecx
    call 'KERNEL32.MultiByteToWideChar' &CP_ACP, &MB_PRECOMPOSED, D@pszAscii, D@LenCharString, D@pUnicode, ecx
    mov D@lenUCS2 eax
    call UTF8Length D@pUnicode, eax
    mov D@lenUTF8 eax

    If D@BomFlag = &TRUE
        add eax 3
    End_If
    mov D@lenUTF8Result eax

    mov edi D@pszUTF8
    call 'RosMem.VMemAlloc' D@pszUTF8, eax
    If eax = 0
        mov B$edi 0
        call 'RosMem.VMemFree' D@pUnicode | ExitP
    End_If

    If D@BomFlag = &TRUE
        mov D$eax 0BFBBEF
        add eax 3
    End_If

    ; length of pszUTF8 must be >= (lenUTF8 + 1)
    ; cchWideChar is the UTF-16 length returned by MultiByteToWideChar (lenUCS2), not the ANSI length
    call 'KERNEL32.WideCharToMultiByte' &CP_UTF8, 0, D@pUnicode, D@lenUCS2, eax, D@lenUTF8, 0, 0

    call 'RosMem.VMemFree' D@pUnicode

    mov eax D@lenUTF8Result
    mov ecx D$edi
    add ecx eax | mov B$ecx 0

EndP
Coding in Assembly requires a mix of:
80% of brain, passion, intuition, creativity
10% of programming skills
10% of alcoholic levels in your blood.

My Code Sites:
http://rosasm.freeforums.org
http://winasm.tripod.com

guga

Ok, guys... I guess I succeeded in porting it. Don't know if it is the proper way, though.

It should handle the BOM (the bytes EF BB BF, written as the dword 0BFBBEF, at the beginning), but I didn't find an example that implements it. So the @BomFlag argument is unused here.


Proc UTF8toAnsi:
    Arguments @pszUTF8, @pszAscii, @BomFlag
    Local @lenUTF8, @pUnicode, @lenUnicode, @pTempUnicode
    Uses ecx, edx, edi, ebx

    xor eax eax
    On D@pszAscii = 0, ExitP
    On D@pszUTF8 = 0, ExitP

    mov edi D@pszAscii
    ; pszAscii must be large enough; for the usual ANSI code pages the result needs at most (lenUTF8 + 1) bytes
    call StrLenProc D@pszUTF8
    If eax = 0
        mov B$edi 0 | ExitP
    End_If
    mov ecx eax
    mov D@lenUTF8 ecx

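    ; first call: cchWideChar = 0, so the API only returns the required length in wide chars (the buffer pointer is ignored)
    call 'KERNEL32.MultiByteToWideChar' &CP_UTF8, 0, D@pszUTF8, D@lenUTF8, 0, 0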
    mov D@lenUnicode eax
    mov ecx eax
    shl eax 1
    lea edi D@pTempUnicode | mov D$edi 0
    call 'RosMem.VMemAlloc' edi, eax
    If eax = 0
        mov B$edi 0 | ExitP
    End_If
    mov D@pUnicode eax

    call 'KERNEL32.MultiByteToWideChar' &CP_UTF8, 0, D@pszUTF8, D@lenUTF8, D@pUnicode, D@lenUnicode

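    ; 256 is the assumed size in bytes of the caller's pszAscii buffer; a longer result makes the call fail and return 0
    call 'KERNEL32.WideCharToMultiByte' &CP_ACP, 0, D@pUnicode, D@lenUnicode, D@pszAscii, 256, 0, 0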
    mov ebx eax

    call 'RosMem.VMemFree' D@pUnicode
    mov edi D@pszAscii
    add edi ebx | mov B$edi 0

    mov eax ebx

EndP


Example of usage:

[GugaBuffer: B$ 0 #260]
call UTF8toAnsi {B$ "A Saída dos Operários da Fábrica Lumière", 0}, GugaBuffer, 0

eax returns the length of the converted string, and "GugaBuffer" holds the converted string.


I adapted the code from here:
http://www.masmforum.com/board/index.php?PHPSESSID=8d46cd4ecb1688be429ab49694ec53e6&topic=6507.0;wap2

But..not sure if it is the proper way :(

jj2007

Quote from: guga on April 14, 2019, 11:56:43 PM
Tks, JJ.

Do you have a source example in masm how can i implement it ?

As Timo wrote, MultiByteToWideChar() using CP_UTF8 and back to ANSI with WideCharToMultiByte()
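For readers who want to see the two-step route outside assembly, here is a minimal Python sketch. It assumes the ANSI code page is Windows-1252 (CP_ACP depends on the system locale, so that choice is an assumption), and `utf8_to_ansi` / `ansi_to_utf8` are hypothetical helper names. Python's `str` plays the role of the intermediate UTF-16 buffer.

```python
def utf8_to_ansi(utf8_bytes: bytes, ansi_codepage: str = "cp1252") -> bytes:
    """Convert UTF-8 bytes to a single-byte ANSI code page via Unicode."""
    # Step 1: UTF-8 -> Unicode (the job of MultiByteToWideChar with CP_UTF8).
    text = utf8_bytes.decode("utf-8")
    # Step 2: Unicode -> ANSI (the job of WideCharToMultiByte with CP_ACP).
    # Characters missing from the code page become "?" instead of raising.
    return text.encode(ansi_codepage, errors="replace")


def ansi_to_utf8(ansi_bytes: bytes, ansi_codepage: str = "cp1252") -> bytes:
    """The reverse direction: ANSI code page -> Unicode -> UTF-8."""
    return ansi_bytes.decode(ansi_codepage).encode("utf-8")


source = "A Saída dos Operários da Fábrica Lumière".encode("utf-8")
ansi = utf8_to_ansi(source)
# Each accented letter takes two bytes in UTF-8 but one byte in cp1252.
print(len(source), len(ansi))
assert ansi_to_utf8(ansi) == source
```

The round trip is only lossless when every character exists in the chosen code page; anything else is replaced in step 2.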

My routine uses wPrint, which uses UTF-16 under the hood. As you can see, there are several ways to do it 8)

A bigger problem is that not all editors save their files as UTF8 or UTF16, and some editors need a BOM to recognise UTF8/UTF16, others don't need one, and others choke if they see a BOM. It's a mess :P
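BOM handling itself is small. The sketch below, in Python, checks for the three common BOMs and strips them; `split_bom` is a hypothetical helper name, and UTF-32 is deliberately ignored (an assumption that keeps the example short). The UTF-8 BOM is the byte sequence EF BB BF, which is what the dword 0BFBBEF stores in little-endian memory.

```python
import codecs


def split_bom(data: bytes):
    """Return (encoding name or None, payload without the BOM)."""
    boms = [
        (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
        (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
        (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
    ]
    for bom, name in boms:
        if data.startswith(bom):
            return name, data[len(bom):]
    # No BOM: the encoding has to be guessed or known from elsewhere.
    return None, data


encoding, payload = split_bom(codecs.BOM_UTF8 + "Fábrica".encode("utf-8"))
print(encoding, payload.decode(encoding))
```

A file without a BOM returns `None`, which is exactly the case where editors disagree.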

guga

Tks JJ

Quote: A bigger problem is that not all editors save their files as UTF8 or UTF16, and some editors need a BOM to recognise UTF8/UTF16, others don't need one, and others choke if they see a BOM. It's a mess :P

Yeah, that's a real mess. I tested Timo's routine and ported the ones from the link to RosAsm. They worked, but it was hell to identify which codepage was actually being used. I'm trying to export huge amounts of text to ANSI so I can translate them to Portuguese later, but the encodings they use are a mess. Some files use UTF8, and for others I have no idea what kind of encoding they use.

I succeeded in converting all of them using Notepad, though. I opened them in Notepad and simply saved them as UTF8. After that, I checked the text with the routine I ported, and so far it was OK. But it was a huge headache porting and analyzing almost 500 MB of plain text. Ouch!
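One common heuristic for files of unknown encoding is to try strict UTF-8 first and fall back to a legacy code page only when that fails, because text in a legacy code page with accented letters is rarely valid UTF-8 by accident. A minimal Python sketch follows; `decode_guess` is a hypothetical name and the cp1252 fallback is an assumption, not something the thread establishes.

```python
def decode_guess(data: bytes, fallback: str = "cp1252"):
    """Return (text, encoding used). Tries UTF-8 first, then the fallback."""
    try:
        # "utf-8-sig" decodes UTF-8 and drops a leading BOM if there is one.
        return data.decode("utf-8-sig"), "utf-8"
    except UnicodeDecodeError:
        # Not valid UTF-8: assume the legacy code page. Undefined bytes
        # become U+FFFD rather than aborting the whole file.
        return data.decode(fallback, errors="replace"), fallback


print(decode_guess("Fábrica".encode("utf-8")))
print(decode_guess("Fábrica".encode("cp1252")))
```

Pure ASCII files decode identically either way, so the heuristic only has to be right for files that contain accented characters.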

Btw... if someone knows a good, free translator from English to Portuguese, please let me know. It can be open source, etc. All it needs to do is open a huge document (in English) and translate it to Portuguese (Brazilian Portuguese).

I'm searching on Google, but all the apps I have found so far use Google Translate rather than offline software, and it will take ages to translate these monster documents if I have to depend on online translations using Google.

jj2007

Post a handful of your docs, maybe I can automate the process. Which codepage are they normally? Mostly Portuguese, sometimes Utf8?

Re Google: try DeepL, it's better than Google Translate. Free trial for the offline version, it seems.

P.S., for the conversion:

include \masm32\MasmBasic\MasmBasic.inc         ; download
  Init
  GfNoRecurse=                  1       ; if you want subfolders, comment out this line
  GetFiles *.asm
  deb 4, "files found", eax
  For_ ebx=0 To Min(9, eax-1)
        Let esi=ConvertCp$(FileRead$(Files$(ebx)), 860, CP_UTF8)
        FileWrite Cat$(Files$(ebx)+".utf8"), esi
  Next
EndOfCode


Source & exe attached. Note the Min(9, eax-1): it limits the run to the first ten files.
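For comparison, the same batch conversion can be sketched in Python. This is not a translation of the MasmBasic internals, only an equivalent under stated assumptions: `convert_folder` is a hypothetical name, the source code page is taken to be 860 as in the snippet above, subfolders are skipped, and at most ten files are processed.

```python
from pathlib import Path


def convert_folder(folder, pattern="*.asm", src_codepage="cp860", limit=10):
    """Re-encode up to `limit` matching files to UTF-8, writing name.utf8 copies."""
    written = []
    # glob() without "**" does not descend into subfolders.
    for path in sorted(Path(folder).glob(pattern))[:limit]:
        text = path.read_bytes().decode(src_codepage)
        out = path.with_name(path.name + ".utf8")
        out.write_bytes(text.encode("utf-8"))
        written.append(out)
    return written


if __name__ == "__main__":
    for out in convert_folder("."):
        print("wrote", out)
```

The originals are left untouched, so a wrong guess about the source code page costs nothing but a rerun.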

guga

Tks a lot, JJ

I'll give it a try (with your app and with DeepL).

I uploaded the file if you want to give it a test

https://we.tl/t-LQjjMocOvR

The format seems to be UTF8 (if Notepad converted it properly).

jj2007

Quote from: guga on April 16, 2019, 05:52:35 AM
The format seems to be UTF8 (if Notepad converted it properly).

The format seems to be plain English, and it is indeed codepage UTF8.

You wrote you had plenty of small files with unknown codepages. This is the opposite, one big file with a known codepage. What do you want to achieve?

guga

Hi JJ.

I need to translate a huge file to Portuguese. I managed to fix that UTF8 problem simply by opening the file in Notepad and saving it again. Now I have a 45 MB text file that I need to translate from English to Portuguese in one go.

The DeepL site has a tiny limit, and after one translation I'm not allowed to do another :(

I could, however, try creating a small app to translate directly with Google Translate (considering the limit of 5000 chars per translation), but it would probably take a long time to translate the text, since it is huge :(
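The splitting step for such an app can be sketched as follows, in Python. `chunk_text` is a hypothetical helper; it cuts on line boundaries where possible and makes no call to any translation service. Assuming mostly one byte per character, a 45 MB file at 5000 characters per request works out to somewhere above nine thousand requests, which is why the run time is a concern.

```python
def chunk_text(text: str, limit: int = 5000):
    """Split text into pieces of at most `limit` characters, on line boundaries where possible."""
    chunks, current = [], ""
    for line in text.splitlines(keepends=True):
        # A single line longer than the limit is cut into hard slices.
        while len(line) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(line[:limit])
            line = line[limit:]
        # Start a new chunk when this line would overflow the current one.
        if len(current) + len(line) > limit:
            chunks.append(current)
            current = ""
        current += line
    if current:
        chunks.append(current)
    return chunks


sample = "first line\n" * 1000 + "x" * 12000 + "\nlast line\n"
pieces = chunk_text(sample)
print(len(pieces), max(len(p) for p in pieces))
```

Joining the pieces gives back the original text, so the translated chunks can be written out in order.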

jj2007

What about the free trial? https://www.deepl.com/pro.html#pricing

guga

Couldn't do it. I tried and it stopped working after translating only 1 MB.

"Data Confidentiality
Your texts are deleted immediately after you've received the translation

Use the Web Translator without limits
Translate as much text as you like with your single-user license

5 document translations per month
Faster document translation via the Web Translator; fully-editable files"

Since the file is about 45 MB (text file), it won't help that much :(

I'm succeeding in making it translate through the Google Translate API in a routine I made, but I'm afraid it will take forever to finish :(