Author Topic: UTF-8 to UTF-16  (Read 539 times)

guga

  • Member
  • *****
  • Posts: 1027
  • Assembly is a state of art.
    • RosAsm
Re: UTF-8 to UTF-16
« Reply #15 on: April 15, 2019, 12:59:52 AM »
It would be better to use MultiByteToWideChar, since the text contains characters not used in Latin/Portuguese.

Do you have an example that uses MultiByteToWideChar and WideCharToMultiByte?

I built one years ago for AnsitoUTF8, but never did the reverse operation, and I don't know how to do it.

The AnsitoUTF8 routine I ported looks like this (RosAsm syntax):

Code: [Select]
Proc AnsitoUTF8Masm:
    Arguments @pszAscii, @pszUTF8, @BomFlag
    Local @lenASCII, @lenUCS2, @lenUTF8, @pszUCS2, @pUnicode, @LenCharString, @lenUTF8Result
    Uses ecx, edi, edx

    xor eax eax
    On D@pszAscii = 0, ExitP
    On D@pszUTF8 = 0, ExitP

    mov edi D@pszUTF8
    ; temporary buffer size: (lenASCII*3 + 1) bytes, always enough for the UTF-16 text (2 bytes per char plus terminator)
    call StrLenProc D@pszAscii
    If eax = 0
        mov B$edi 0 | ExitP
    End_If

    mov ecx eax
    shl eax 1
    add eax ecx
    inc eax
    call 'RosMem.VMemAlloc' edi, eax
    If eax = 0
        mov B$edi 0 | ExitP
    End_If
    mov D@pUnicode eax
    mov D@LenCharString ecx
    inc ecx
    call 'KERNEL32.MultiByteToWideChar' &CP_ACP, &MB_PRECOMPOSED, D@pszAscii, D@LenCharString, D@pUnicode, ecx
    mov D@lenUCS2 eax
    call UTF8Length D@pUnicode, eax
    mov D@lenUTF8 eax

    If D@BomFlag = &TRUE
        add eax 3
    End_If
    mov D@lenUTF8Result eax
    inc eax ; one extra byte for the terminating zero written at the end

    mov edi D@pszUTF8
    call 'RosMem.VMemAlloc' D@pszUTF8, eax
    If eax = 0
        mov B$edi 0
        call 'RosMem.VMemFree' D@pUnicode | ExitP
    End_If

    If D@BomFlag = &TRUE
        mov D$eax 0BFBBEF
        add eax 3
    End_If

    ; eax points past the optional BOM; convert all lenUCS2 UTF-16 units into lenUTF8 bytes
    call 'KERNEL32.WideCharToMultiByte' &CP_UTF8, 0, D@pUnicode, D@lenUCS2, eax, D@lenUTF8, 0, 0

    call 'RosMem.VMemFree' D@pUnicode

    mov eax D@lenUTF8Result
    mov ecx D$edi
    add ecx eax | mov B$ecx 0

EndP
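
For comparison, the same two-step pipeline (ANSI to Unicode, then Unicode to UTF-8, with an optional BOM) can be sketched in Python. This is only an illustration, not a port of the routine above: the function name `ansi_to_utf8`, the `add_bom` parameter, and the choice of `cp1252` as the "ANSI" code page are assumptions (on Windows, CP_ACP depends on the system locale).

```python
import codecs

def ansi_to_utf8(data: bytes, add_bom: bool = False, codepage: str = "cp1252") -> bytes:
    """Convert ANSI bytes to UTF-8, optionally prefixed with the UTF-8 BOM.

    `codepage` is an assumption standing in for CP_ACP.
    """
    # Step 1: ANSI -> Unicode (the role of MultiByteToWideChar with CP_ACP).
    text = data.decode(codepage)
    # Step 2: Unicode -> UTF-8 (the role of WideCharToMultiByte with CP_UTF8).
    utf8 = text.encode("utf-8")
    # The UTF-8 BOM is the three bytes EF BB BF.
    return codecs.BOM_UTF8 + utf8 if add_bom else utf8

ansi = "Fábrica Lumière".encode("cp1252")
utf8 = ansi_to_utf8(ansi, add_bom=True)
# A character from a single-byte code page needs at most 3 UTF-8 bytes,
# which is where the (lenASCII*3 + 1) bound comes from.
assert len(utf8) - len(codecs.BOM_UTF8) <= len(ansi) * 3
```

Python sizes the buffers itself, so the length bookkeeping of the assembly version disappears; only the order of the two conversions carries over.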
Coding in Assembly requires a mix of:
80% of brain, passion, intuition, creativity
10% of programming skills
10% of alcoholic levels in your blood.

My Code Sites:
http://rosasm.freeforums.org
http://winasm.tripod.com

guga

Re: UTF-8 to UTF-16
« Reply #16 on: April 15, 2019, 01:56:30 AM »
OK, guys... I guess I succeeded in porting it. I don't know if it is the proper way, though.

It should handle the BOM (the bytes EF BB BF at the beginning, written as the dword 0BFBBEF in the routine above), but I didn't find an example that implements it. So here the @BomFlag argument is unused.

Code: [Select]
Proc UTF8toAnsi:
    Arguments @pszUTF8, @pszAscii, @BomFlag
    Local @lenUTF8, @pUnicode, @lenUnicode, @pTempUnicode
    Uses ecx, edx, edi, ebx

    xor eax eax
    On D@pszAscii = 0, ExitP
    On D@pszUTF8 = 0, ExitP

    mov edi D@pszAscii
    ; pszAscii must be large enough for the result; the final conversion below caps it at 256 bytes
    call StrLenProc D@pszUTF8
    If eax = 0
        mov B$edi 0 | ExitP
    End_If
    mov ecx eax
    mov D@lenUTF8 ecx

    call 'KERNEL32.MultiByteToWideChar' &CP_UTF8, 0, D@pszUTF8, D@lenUTF8, D@pUnicode, 0
    mov D@lenUnicode eax
    mov ecx eax
    shl eax 1
    lea edi D@pTempUnicode | mov D$edi 0
    call 'RosMem.VMemAlloc' edi, eax
    If eax = 0
        mov edi D@pszAscii ; edi was repointed at the local @pTempUnicode above
        mov B$edi 0 | ExitP
    End_If
    mov D@pUnicode eax

    call 'KERNEL32.MultiByteToWideChar' &CP_UTF8, 0, D@pszUTF8, D@lenUTF8, D@pUnicode, D@lenUnicode

    call 'KERNEL32.WideCharToMultiByte' &CP_ACP, 0, D@pUnicode, D@lenUnicode, D@pszAscii, 256, 0, 0
    mov ebx eax

    call 'RosMem.VMemFree' D@pUnicode
    mov edi D@pszAscii
    add edi ebx | mov B$edi 0

    mov eax ebx

EndP

Example of usage:
Code: [Select]
[GugaBuffer: B$ 0 #260]
call UTF8toAnsi {B$ "A Saída dos Operários da Fábrica Lumière", 0}, GugaBuffer, 0

eax returns the length of the converted string, and "GugaBuffer" holds the converted string.
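
As a rough cross-check of the expected result, here is a Python sketch of the reverse direction that also skips a leading BOM, which is what the unused @BomFlag could be used for. The name `utf8_to_ansi` and the `cp1252` code page are assumptions chosen for illustration; the real CP_ACP depends on the system locale.

```python
import codecs

def utf8_to_ansi(data: bytes, codepage: str = "cp1252") -> bytes:
    """Convert UTF-8 bytes to a single-byte code page, ignoring a leading BOM."""
    # Skip the UTF-8 BOM (EF BB BF) if the input starts with it.
    if data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]
    # Step 1: UTF-8 -> Unicode (the role of MultiByteToWideChar with CP_UTF8).
    text = data.decode("utf-8")
    # Step 2: Unicode -> code page. Characters the code page lacks become "?".
    return text.encode(codepage, errors="replace")

title = "A Saída dos Operários da Fábrica Lumière"
converted = utf8_to_ansi(codecs.BOM_UTF8 + title.encode("utf-8"))
assert converted == title.encode("cp1252")
assert len(converted) == len(title)  # one byte per character in cp1252
```

In the assembly version, the fixed 256-byte limit could probably be avoided the same way the UTF-16 size is obtained: WideCharToMultiByte returns the required size when it is called with an output size of 0.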

I adapted the code from here:
http://www.masmforum.com/board/index.php?PHPSESSID=8d46cd4ecb1688be429ab49694ec53e6&topic=6507.0;wap2

But I'm not sure if it is the proper way :(

jj2007

  • Member
  • *****
  • Posts: 9278
  • Assembler is fun ;-)
    • MasmBasic
Re: UTF-8 to UTF-16
« Reply #17 on: April 15, 2019, 02:30:41 AM »
Quote
Tks, JJ. Do you have a source example in MASM showing how I can implement it?

As Timo wrote, MultiByteToWideChar() using CP_UTF8 and back to ANSI with WideCharToMultiByte()

My routine uses wPrint, which is UTF-16 under the hood. As you can see, there are several ways to do it 8)

A bigger problem is that not all editors save their files as UTF8 or UTF16, and some editors need a BOM to recognise UTF8/UTF16, others don't need one, and others choke if they see a BOM. It's a mess :P
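
A minimal Python sketch of BOM sniffing, for files that do carry one. The helper name `sniff_bom` is hypothetical, and UTF-32 is deliberately ignored (its little-endian BOM begins with the same two bytes as the UTF-16 LE one, so this sketch would misreport it).

```python
import codecs

def sniff_bom(data: bytes):
    """Return (encoding, bom_length) for a recognised BOM, or (None, 0)."""
    if data.startswith(codecs.BOM_UTF8):      # EF BB BF
        return "utf-8", 3
    if data.startswith(codecs.BOM_UTF16_LE):  # FF FE
        return "utf-16-le", 2
    if data.startswith(codecs.BOM_UTF16_BE):  # FE FF
        return "utf-16-be", 2
    return None, 0

raw = b"\xff\xfeO\x00k\x00"
encoding, skip = sniff_bom(raw)
assert encoding == "utf-16-le"
assert raw[skip:].decode(encoding) == "Ok"
```

A file without a BOM returns `(None, 0)`, and the encoding then has to be guessed or known from elsewhere.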

guga

Re: UTF-8 to UTF-16
« Reply #18 on: April 16, 2019, 04:28:16 AM »
Tks JJ

Quote
A bigger problem is that not all editors save their files as UTF8 or UTF16, and some editors need a BOM to recognise UTF8/UTF16, others don't need one, and others choke if they see a BOM. It's a mess :P

Yeah, that's a true mess. I tested Timo's routine and ported the ones from the link to RosAsm. It worked, but it was hell to identify which codepage was actually being used. I'm trying to export huge amounts of text to ANSI so I can translate them to Portuguese later, but the encodings they use are a mess. Sometimes they use UTF8, and other times I have no idea what kind of encoding they use.
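
One common heuristic for files of unknown encoding, sketched in Python: try a strict UTF-8 decode first and fall back to a legacy code page only if that fails. Text in a legacy code page with accented characters is rarely valid UTF-8, so the strict test is usually a good discriminator, but it remains a guess. The name `decode_guess` and the `cp1252` fallback are assumptions.

```python
def decode_guess(data: bytes, fallback: str = "cp1252"):
    """Return (text, encoding_used), preferring strict UTF-8."""
    try:
        # Strict decoding raises on any byte sequence that is not valid UTF-8.
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Assumed legacy code page; undefined bytes become U+FFFD.
        return data.decode(fallback, errors="replace"), fallback

text, used = decode_guess("Saída".encode("cp1252"))
assert (text, used) == ("Saída", "cp1252")
```

Pure ASCII input is reported as UTF-8, which is harmless because ASCII is identical in both encodings.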

I succeeded in converting all of them using Notepad, though. I opened them in Notepad and simply exported them as UTF8. After that, I checked the text with the routine I ported, and so far it was OK. But it was a huge headache porting and analyzing almost 500 MB of plain text. Ouch!

Btw... if someone knows a good free translator from English to Portuguese, please let me know. It can be open source, etc. All that is needed is to open a huge document (in English) and translate it to Portuguese (Brazilian Portuguese).

I'm searching on Google, but all the apps I have found so far use Google Translate rather than offline software, and it will take ages to translate all these monster documents if I have to depend on online translation using Google.

jj2007

Re: UTF-8 to UTF-16
« Reply #19 on: April 16, 2019, 04:41:06 AM »
Post a handful of your docs, maybe I can automate the process. Which codepage are they normally? Mostly Portuguese, sometimes Utf8?

Re Google: try DeepL, it's better than Google Translate. Free trial for the offline version, it seems.

P.S., for the conversion:

include \masm32\MasmBasic\MasmBasic.inc         ; download
  Init
  GfNoRecurse=                  1       ; if you want subfolders, comment out this line
  GetFiles *.asm
  deb 4, "files found", eax
  For_ ebx=0 To Min(9, eax-1)
        Let esi=ConvertCp$(FileRead$(Files$(ebx)), 860, CP_UTF8)
        FileWrite Cat$(Files$(ebx)+".utf8"), esi
  Next
EndOfCode


Source & exe attached. Note the Min(9, eax-1): the loop converts at most the first ten files.
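
For readers without MasmBasic, roughly the same batch job can be sketched in Python. This is not a translation of the snippet above, only an approximation of its behaviour: the function name `convert_files` and its parameters are hypothetical, and `cp860` is Python's codec for code page 860.

```python
from pathlib import Path

def convert_files(folder: str, pattern: str = "*.asm",
                  src_cp: str = "cp860", limit: int = 10) -> list:
    """Convert up to `limit` matching files to UTF-8, writing <name>.utf8 copies."""
    written = []
    # glob() does not descend into subfolders, like the no-recurse setting above.
    for path in sorted(Path(folder).glob(pattern))[:limit]:
        text = path.read_bytes().decode(src_cp, errors="replace")
        out = path.with_name(path.name + ".utf8")
        out.write_bytes(text.encode("utf-8"))
        written.append(out)
    return written
```

The originals are left untouched, so a wrong guess about the source code page can be corrected by simply running the conversion again with a different `src_cp`.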

guga

Re: UTF-8 to UTF-16
« Reply #20 on: April 16, 2019, 05:52:35 AM »
Tks a lot, JJ

I'll give it a try (in your app and on DeepL).

I uploaded the file if you want to give it a test

https://we.tl/t-LQjjMocOvR

The format seems to be UTF8 (if Notepad converted it properly).

jj2007

Re: UTF-8 to UTF-16
« Reply #21 on: April 16, 2019, 11:47:23 AM »
Quote
The format seems to be UTF8 (if Notepad converted it properly).

The format seems to be plain English, and it is indeed codepage UTF8.

You wrote you had plenty of small files with unknown codepages. This is the opposite, one big file with a known codepage. What do you want to achieve?