UTF-8 to UTF-16

aw27 · January 24, 2019, 09:06:53 PM

Since the last topic was about macros...
In my opinion, a really revolutionary internal MACRO that UASM should have would be a UTF-8 to UTF-16 macro.
As we know, it is impossible to write a regular UTF-8 to UTF-16 macro that changes the data section at assemble time. What people have done so far is write macros that convert ASCII characters to words and call them Unicode conversion macros, but they are not.
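To illustrate the difference, here is a minimal C++ sketch. The function names `widen_bytes` and `decode_short_utf8` are made up for this example, and the decoder only handles 1- and 2-byte sequences, so treat it as a demonstration rather than a usable converter.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Zero-extend every byte to a 16-bit unit. This is all an "ASCII to words"
// macro does; it is only correct for bytes below 0x80.
std::vector<std::uint16_t> widen_bytes(const std::string& s)
{
	std::vector<std::uint16_t> out;
	for (unsigned char b : s)
		out.push_back(b);
	return out;
}

// Decode 1- and 2-byte UTF-8 sequences (enough for this demonstration;
// longer sequences and malformed input are not handled here).
std::vector<std::uint16_t> decode_short_utf8(const std::string& s)
{
	std::vector<std::uint16_t> out;
	for (std::size_t i = 0; i < s.size(); )
	{
		unsigned char b = s[i];
		if (b < 0x80 || i + 1 >= s.size())
		{
			out.push_back(b);
			++i;
		}
		else
		{
			// Combine 5 payload bits of the lead byte with 6 of the next byte.
			out.push_back(static_cast<std::uint16_t>(((b & 0x1F) << 6) | (s[i + 1] & 0x3F)));
			i += 2;
		}
	}
	return out;
}
```

For the two bytes C3 A9 (the UTF-8 form of é), `widen_bytes` produces two units, 00C3 and 00A9, while `decode_short_utf8` produces the single unit 00E9. For plain ASCII both give the same result, which is why the widening macros appear to work.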

The general algorithm in C++ is pretty simple but, again, impossible to convert to an assemble-time ASM macro. I copied it from here: https://gist.github.com/rechardchen/3321830

Code Select


wstring UTF8toUnicode(const string& s)
	{
		wstring ws;
		wchar_t wc;
		for (int i = 0; i < s.length(); )
		{
			char c = s[i];
			if ((c & 0x80) == 0)
			{
				wc = c;
				++i;
			}
			else if ((c & 0xE0) == 0xC0)
			{
				wc = (s[i] & 0x1F) << 6;
				wc |= (s[i + 1] & 0x3F);
				i += 2;
			}
			else if ((c & 0xF0) == 0xE0)
			{
				wc = (s[i] & 0xF) << 12;
				wc |= (s[i + 1] & 0x3F) << 6;
				wc |= (s[i + 2] & 0x3F);
				i += 3;
			}
			else if ((c & 0xF8) == 0xF0)
			{
				wc = (s[i] & 0x7) << 18;
				wc |= (s[i + 1] & 0x3F) << 12;
				wc |= (s[i + 2] & 0x3F) << 6;
				wc |= (s[i + 3] & 0x3F);
				i += 4;
			}
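			else if ((c & 0xFC) == 0xF8)
			{
				// Obsolete 5-byte form (not valid UTF-8 since RFC 3629).
				// Each continuation byte must come from its own index.
				wc = (s[i] & 0x3) << 24;
				wc |= (s[i + 1] & 0x3F) << 18;
				wc |= (s[i + 2] & 0x3F) << 12;
				wc |= (s[i + 3] & 0x3F) << 6;
				wc |= (s[i + 4] & 0x3F);
				i += 5;
			}
			else if ((c & 0xFE) == 0xFC)
			{
				// Obsolete 6-byte form, kept for completeness.
				wc = (s[i] & 0x1) << 30;
				wc |= (s[i + 1] & 0x3F) << 24;
				wc |= (s[i + 2] & 0x3F) << 18;
				wc |= (s[i + 3] & 0x3F) << 12;
				wc |= (s[i + 4] & 0x3F) << 6;
				wc |= (s[i + 5] & 0x3F);
				i += 6;
			}
			else
			{
				// Stray continuation byte or invalid lead byte: substitute
				// U+FFFD and advance so the loop cannot spin forever.
				wc = 0xFFFD;
				++i;
			}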
			ws += wc;
		}
		return ws;
	}
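One limitation worth noting: where `wchar_t` is 16 bits wide, as on Windows, the 4-byte branch above cannot hold its result, because code points above U+FFFF need a surrogate pair in UTF-16. The following is a sketch of how that could be handled, not part of the gist. The names `append_utf16` and `utf8_to_utf16` are hypothetical, `std::u16string` is used in place of `wstring` so the behavior does not depend on the platform, and overlong forms and encoded surrogates are not rejected.

```cpp
#include <cstdint>
#include <string>

// Append one code point, splitting values above U+FFFF into a
// high/low surrogate pair.
void append_utf16(std::u16string& out, std::uint32_t cp)
{
	if (cp < 0x10000)
	{
		out.push_back(static_cast<char16_t>(cp));
	}
	else
	{
		cp -= 0x10000;
		out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
		out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
	}
}

// Decode UTF-8 (1 to 4 bytes per sequence) into UTF-16.
// Malformed or truncated sequences become U+FFFD.
std::u16string utf8_to_utf16(const std::string& s)
{
	std::u16string out;
	std::size_t i = 0;
	while (i < s.size())
	{
		unsigned char c = s[i];
		std::uint32_t cp = 0;
		std::size_t extra = 0;	// continuation bytes expected
		bool ok = true;
		if (c < 0x80)                { cp = c;        extra = 0; }
		else if ((c & 0xE0) == 0xC0) { cp = c & 0x1F; extra = 1; }
		else if ((c & 0xF0) == 0xE0) { cp = c & 0x0F; extra = 2; }
		else if ((c & 0xF8) == 0xF0) { cp = c & 0x07; extra = 3; }
		else                         { ok = false; }

		// The whole sequence must fit inside the input.
		if (ok && i + extra >= s.size())
			ok = false;
		for (std::size_t k = 1; ok && k <= extra; ++k)
		{
			unsigned char t = s[i + k];
			if ((t & 0xC0) != 0x80)
				ok = false;
			else
				cp = (cp << 6) | (t & 0x3F);
		}
		if (!ok || cp > 0x10FFFF)
		{
			out.push_back(static_cast<char16_t>(0xFFFD));
			++i;	// resynchronize on the next byte
			continue;
		}
		append_utf16(out, cp);
		i += extra + 1;
	}
	return out;
}
```

With this sketch the four bytes F0 9F 98 80 (U+1F600) come out as the two units D83D DE00, whereas a 16-bit `wc` would keep only the low bits.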

Let's see whether it works:

So, the complete program to produce the above image:

Code Select


#include <Windows.h>
#include <string>
#include <iostream>


	std::string myutf8String = "Russian: советских\nJapanese: 私は学生です\nChinese: 你好\nTamil: ಬಾ ಇಲ್ಲಿ ಸಂಭವಿಸು\nClassical Greek: ὕαλον ϕαγεῖν\nCzech: Mohu jíst sklo\nArabic:أنا قادر على أكل الزجاج و هذا لا يؤلمني.";
	std::wstring myutf16;
	std::wstring UTF8toUnicode(const std::string& s);

	int main()
	{
		myutf16=UTF8toUnicode(myutf8String);
		MessageBoxW(0, myutf16.c_str(), L"UTF-16", 0);

		return 0;
	}

	using namespace std;
	
	wstring UTF8toUnicode(const string& s)
	{
		wstring ws;
		wchar_t wc;
		for (int i = 0; i < s.length(); )
		{
			char c = s[i];
			if ((c & 0x80) == 0)
			{
				wc = c;
				++i;
			}
			else if ((c & 0xE0) == 0xC0)
			{
				wc = (s[i] & 0x1F) << 6;
				wc |= (s[i + 1] & 0x3F);
				i += 2;
			}
			else if ((c & 0xF0) == 0xE0)
			{
				wc = (s[i] & 0xF) << 12;
				wc |= (s[i + 1] & 0x3F) << 6;
				wc |= (s[i + 2] & 0x3F);
				i += 3;
			}
			else if ((c & 0xF8) == 0xF0)
			{
				wc = (s[i] & 0x7) << 18;
				wc |= (s[i + 1] & 0x3F) << 12;
				wc |= (s[i + 2] & 0x3F) << 6;
				wc |= (s[i + 3] & 0x3F);
				i += 4;
			}
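			else if ((c & 0xFC) == 0xF8)
			{
				// Obsolete 5-byte form (not valid UTF-8 since RFC 3629).
				// Each continuation byte must come from its own index.
				wc = (s[i] & 0x3) << 24;
				wc |= (s[i + 1] & 0x3F) << 18;
				wc |= (s[i + 2] & 0x3F) << 12;
				wc |= (s[i + 3] & 0x3F) << 6;
				wc |= (s[i + 4] & 0x3F);
				i += 5;
			}
			else if ((c & 0xFE) == 0xFC)
			{
				// Obsolete 6-byte form, kept for completeness.
				wc = (s[i] & 0x1) << 30;
				wc |= (s[i + 1] & 0x3F) << 24;
				wc |= (s[i + 2] & 0x3F) << 18;
				wc |= (s[i + 3] & 0x3F) << 12;
				wc |= (s[i + 4] & 0x3F) << 6;
				wc |= (s[i + 5] & 0x3F);
				i += 6;
			}
			else
			{
				// Stray continuation byte or invalid lead byte: substitute
				// U+FFFD and advance so the loop cannot spin forever.
				wc = 0xFFFD;
				++i;
			}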
			ws += wc;
		}
		return ws;
	}

habran · January 24, 2019, 09:57:09 PM

It was implemented some time ago; look it up in String.c, from line 99.
It works flawlessly 8)

HSE · January 24, 2019, 10:16:33 PM

Hi Atelier!

What problem do you see with doing it as a macro?

jj2007 · January 24, 2019, 10:38:05 PM

Quote from: habran on January 24, 2019, 09:57:09 PM
It was implemented some time ago; look it up in String.c, from line 99.
It works flawlessly 8)

What's the trick then?

Code Select

include \masm32\include\masm32rt.inc

.data
txTitle	dw "Does it work?", 0		; Error A2055: Initializer value too large
txHelloW	dw "Привет, Мир!", 0

.code
start:
	invoke MessageBox, 0, offset txHelloW, offset txTitle, MB_OK
	exit

end start

Btw we had the discussion already in Summer 2017

habran · January 24, 2019, 11:41:00 PM

Declaring wide string data with dw only works with OPTION LITERALS:ON, and the
command-line switches -Zm or -Zne disable it.

jj2007 · January 25, 2019, 12:17:27 AM

Yep it works, good to know :t

Code Select

include \masm32\include\masm32rt.inc
OPTION LITERALS:ON

.data
txTitle		dw "Does it work?", 0
txHelloW	dw "Привет, Мир!", 0

.code
start:
	invoke MessageBoxW, 0, offset txHelloW, offset txTitle, MB_OK
	exit

end start

aw27 · January 25, 2019, 01:46:17 AM

@habran
I was thinking of it in the form of an internal macro called on demand as needed without the need for the OPTION LITERALS:ON. It would look more familiar to MASM users. But doing it through OPTION LITERALS:ON is already great. :t

@HSE
I don't see any way of making the macro operators turn string characters into bytes (for numerical evaluation and comparison purposes) at assemble time. It may be easy, despite people having struggled with it for decades, but I believe it is impossible (at assemble time only, of course).
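For contrast only, this is what "turning string characters into bytes for numerical evaluation" looks like when a language does allow it at build time. The C++ sketch below uses `constexpr`; the name `first_code_point` is made up for the example, it handles sequences of 1 to 3 bytes without validation, and it says nothing about what MASM or UASM macro operators can do.

```cpp
#include <cstdint>

// Decode the UTF-8 sequence that starts at s[0]. Sequences of 1 to 3 bytes
// only, with no validation (a sketch, not a full decoder).
constexpr std::uint32_t first_code_point(const char* s)
{
	return ((s[0] & 0x80) == 0)    ? static_cast<std::uint32_t>(s[0])
	     : ((s[0] & 0xE0) == 0xC0) ? static_cast<std::uint32_t>(((s[0] & 0x1F) << 6) | (s[1] & 0x3F))
	     : static_cast<std::uint32_t>(((s[0] & 0x0F) << 12) | ((s[1] & 0x3F) << 6) | (s[2] & 0x3F));
}

// These checks are evaluated by the compiler, not when the program runs.
static_assert(first_code_point("A") == 0x41, "ASCII");
static_assert(first_code_point("\xD0\x9F") == 0x041F, "Cyrillic capital Pe");
static_assert(first_code_point("\xE2\x82\xAC") == 0x20AC, "euro sign");
```

The point of the `static_assert` lines is that the masking and shifting happen on the bytes of the literal during compilation, which is the step being discussed for macros.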

TimoVJL · January 25, 2019, 02:16:30 AM

C++ example converted to C

Code Select

#define WIN32_LEAN_AND_MEAN
#include <windows.h>

#pragma comment(lib, "user32.lib")

wchar_t *UTF8toUnicode(char *s, wchar_t *ws)
{
	wchar_t wc;
	char *c = s;
	do
	{
		if ((*c & 0x80) == 0)
		{
			wc = *c++;
		}
		else if ((*c & 0xE0) == 0xC0)
		{
			wc = ( *c++ & 0x1F) << 6;
			wc |= (*c++ & 0x3F);
		}
		else if ((*c & 0xF0) == 0xE0)
		{
			wc = (*c++ & 0xF) << 12;
			wc |= (*c++ & 0x3F) << 6;
			wc |= (*c++ & 0x3F);
		}
		else if ((*c & 0xF8) == 0xF0)
		{
			wc = (*c++ & 0x7) << 18;	// advance past the lead byte
			wc |= (*c++ & 0x3F) << 12;
			wc |= (*c++ & 0x3F) << 6;
			wc |= (*c++ & 0x3F);
		}
		else if ((*c & 0xFC) == 0xF8)
		{
			wc = (*c++ & 0x3) << 24;
			wc |= (*c++ & 0x3F) << 18;
			wc |= (*c++ & 0x3F) << 12;
			wc |= (*c++ & 0x3F) << 6;
			wc |= (*c++  & 0x3F);
		}
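		else if ((*c & 0xFE) == 0xFC)
		{
			wc = (*c++ & 0x1) << 30;
			wc |= (*c++ & 0x3F) << 24;
			wc |= (*c++ & 0x3F) << 18;
			wc |= (*c++ & 0x3F) << 12;
			wc |= (*c++ & 0x3F) << 6;
			wc |= (*c++ & 0x3F);
		}
		else
		{
			// Stray continuation byte or invalid lead byte: substitute
			// U+FFFD and advance so the loop cannot spin forever.
			wc = 0xFFFD;
			c++;
		}
		*ws++ = wc;
	} while (*c);
	*ws = 0;	// terminate the output; the loop stops before copying the NUL
	return ws;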
}

void __cdecl mainCRTStartup(void)
{
	char *myutf8String = u8"Russian: советских\nJapanese: 私は学生です\nChinese: 你好\nTamil: ಬಾ ಇಲ್ಲಿ ಸಂಭವಿಸು\nClassical Greek: ὕαλον ϕαγεῖν\nCzech: Mohu jíst sklo\nArabic:أنا قادر على أكل الزجاج و هذا لا يؤلمني.";
	wchar_t myutf16[200];
	UTF8toUnicode(myutf8String, myutf16);
	MessageBoxW(0, myutf16, L"UTF-16", 0);
}

For MSVC 2010 - 2013, which do not accept the u8 prefix, use this pragma instead:

Code Select

#pragma execution_character_set("utf-8")

Abdel Hamid · January 25, 2019, 05:44:55 AM

The sentence in Arabic: أنا قادر على أكل الزجاج و هذا لا يؤلمني
says: "I am able to eat glass and this doesn't hurt me."
It's a little bit funny.

HSE · January 25, 2019, 09:39:54 PM

Quote from: AW on January 25, 2019, 01:46:17 AM
I don't see any way of making the macro operators turn string characters into bytes (for numerical evaluation and comparison purposes) at assemble time. It may be easy, despite people having struggled with it for decades, but I believe it is impossible (at assemble time only, of course).

Perhaps it's not possible with the elementary macros we usually build, and you are right in that sense. But I don't think it is impossible with advanced macros, just boring.

aw27 · January 25, 2019, 11:23:01 PM

Quote from: HSE on January 25, 2019, 09:39:54 PM
Perhaps it's not possible with the elementary macros we usually build, and you are right in that sense. But I don't think it is impossible with advanced macros, just boring.

It's not a question of being boring; the people who produced some of the macros I have seen in a few places in the masm32 SDK are vaccinated against boredom.

guga · April 14, 2019, 11:15:00 PM

Great work.

Does anyone have a working example of UTF-8 to ANSI conversion, to convert things like:

A SaÃda dos OperÃ¡rios da FÃ¡brica LumiÃ¨re

to

A Saída dos Operários da Fábrica Lumière

?

jj2007 · April 14, 2019, 11:39:43 PM

include \masm32\MasmBasic\MasmBasic.inc ; download
Init
Let esi="A SaÃda dos OperÃ¡rios da FÃ¡brica LumiÃ¨re"
wPrint wRec$(esi)
EndOfCode

Output:

Code Select

A Saída dos Operários da Fábrica Lumière
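As background on why the garbled form appears at all: it is what the UTF-8 bytes look like when they are displayed as ANSI (code page 1252). The sketch below encodes a code point as UTF-8; the name `encode_utf8` is made up for the example and it only covers code points up to U+FFFF.

```cpp
#include <cstdint>
#include <string>

// Encode one code point (up to U+FFFF in this sketch) as UTF-8 bytes.
std::string encode_utf8(std::uint32_t cp)
{
	std::string out;
	if (cp < 0x80)
	{
		out.push_back(static_cast<char>(cp));
	}
	else if (cp < 0x800)
	{
		// 110xxxxx 10xxxxxx
		out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
		out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
	}
	else
	{
		// 1110xxxx 10xxxxxx 10xxxxxx
		out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
		out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
		out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
	}
	return out;
}
```

For example, í (U+00ED) encodes to the bytes C3 AD. In code page 1252, C3 is Ã and AD is the soft hyphen, which is normally invisible, so "Saída" shows up as "SaÃda". Likewise á (U+00E1) becomes C3 A1, shown as "Ã¡", and è (U+00E8) becomes C3 A8, shown as "Ã¨".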

guga · April 14, 2019, 11:56:43 PM

Tks, JJ.

Do you have a source example in MASM showing how I can implement it?

TimoVJL · April 15, 2019, 12:07:47 AM

If you don't care about the code page or the WinAPI string functions:

Code Select

#define WIN32_LEAN_AND_MEAN
#include <windows.h>

#pragma comment(lib, "user32.lib")

char *UTF8toANSI(char *s, char *as)
{
	char ch;
	char *c = s;
	do
	{
		if ((*c & 0x80) == 0)
		{
			ch = *c++;
		}
		else if ((*c & 0xE0) == 0xC0)
		{
			ch = ( *c++ & 0x1F) << 6;
			ch |= (*c++ & 0x3F);
		}
		else if ((*c & 0xF0) == 0xE0)
		{
			ch = (*c++ & 0xF) << 12;
			ch |= (*c++ & 0x3F) << 6;
			ch |= (*c++ & 0x3F);
		}
		else if ((*c & 0xF8) == 0xF0)
		{
			ch = (*c++ & 0x7) << 18;	// advance past the lead byte
			ch |= (*c++ & 0x3F) << 12;
			ch |= (*c++ & 0x3F) << 6;
			ch |= (*c++ & 0x3F);
		}
		else if ((*c & 0xFC) == 0xF8)
		{
			ch = (*c++ & 0x3) << 24;
			ch |= (*c++ & 0x3F) << 18;
			ch |= (*c++ & 0x3F) << 12;
			ch |= (*c++ & 0x3F) << 6;
			ch |= (*c++  & 0x3F);
		}
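		else if ((*c & 0xFE) == 0xFC)
		{
			ch = (*c++ & 0x1) << 30;
			ch |= (*c++ & 0x3F) << 24;
			ch |= (*c++ & 0x3F) << 18;
			ch |= (*c++ & 0x3F) << 12;
			ch |= (*c++ & 0x3F) << 6;
			ch |= (*c++ & 0x3F);
		}
		else
		{
			// Stray continuation byte or invalid lead byte: substitute '?'
			// and advance so the loop cannot spin forever.
			ch = '?';
			c++;
		}
		*as++ = ch;
	} while (*c);
	*as = 0;	// terminate the output; the loop stops before copying the NUL
	return as;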
}

void __cdecl mainCRTStartup(void)
{
	//char *myutf8String = u8"A Saída dos Operários da Fábrica Lumière";
	char *myutf8String = "A SaÃda dos OperÃ¡rios da FÃ¡brica LumiÃ¨re";
	char myANSI[200];
	UTF8toANSI(myutf8String, myANSI);
	MessageBoxA(0, myANSI, "ANSI", 0);
}

Otherwise, convert to UTF-16 with MultiByteToWideChar() using CP_UTF8, and then back to ANSI with WideCharToMultiByte().
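For comparison, here is a portable C++ sketch of the first approach that substitutes '?' for anything it cannot represent instead of truncating it. The name `utf8_to_latin1` is hypothetical, and the target is assumed to be ISO-8859-1 (Latin-1), which matches code page 1252 except in the range 0x80 to 0x9F.

```cpp
#include <string>

// UTF-8 in, ISO-8859-1 (Latin-1) out. Code points above U+00FF, and
// anything malformed, become '?'.
std::string utf8_to_latin1(const std::string& s)
{
	std::string out;
	std::size_t i = 0;
	while (i < s.size())
	{
		unsigned char c = s[i++];
		if (c < 0x80)
		{
			out.push_back(static_cast<char>(c));	// ASCII passes through
		}
		else if ((c == 0xC2 || c == 0xC3) && i < s.size()
		         && (static_cast<unsigned char>(s[i]) & 0xC0) == 0x80)
		{
			// Lead bytes C2 and C3 cover exactly U+0080..U+00FF.
			unsigned char t = s[i++];
			out.push_back(static_cast<char>(((c & 0x03) << 6) | (t & 0x3F)));
		}
		else
		{
			// Not representable: skip the rest of the sequence, emit one '?'.
			while (i < s.size() && (static_cast<unsigned char>(s[i]) & 0xC0) == 0x80)
				++i;
			out.push_back('?');
		}
	}
	return out;
}
```

With this sketch the bytes C3 AD become the single byte ED (í), and a 3-byte sequence such as E2 82 AC (the euro sign) becomes one '?'. The WinAPI route remains the better choice when the actual system code page matters.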

The MASM Forum
