Re: How to generate an Unicode string under MASM 6.15?

nidud · May 09, 2017, 10:37:24 PM

deleted

mineiro · May 10, 2017, 03:11:20 AM

But I can read this topic on a cellphone, on a TV, on internet-of-things devices.
I don't know how to change the codepage on a TV or cellphone, only on PCs.

nidud · May 10, 2017, 03:45:11 AM

deleted

TWell · May 10, 2017, 04:20:59 AM

The problem is that DOS/Windows command.com/cmd.exe don't support UTF-8.
A conversion has to be done at some point. There is also an OEM/ANSI problem with CUI/GUI.
Windows NT is internally Unicode (UTF-16LE).
ml understands only ASCII.
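
Since the assembler only sees ASCII source and Windows wants UTF-16LE, one way to do the conversion ahead of time is to generate the data line with a small script. The following is a minimal sketch, not something posted in this thread: the helper name `masm_utf16_data` and the label `wszText` are made up for illustration, and very long strings would need to be split over several `dw` lines.

```python
def masm_utf16_data(label: str, text: str) -> str:
    """Return one MASM 'dw' line holding text as UTF-16LE code units.

    Hypothetical helper for illustration only.
    """
    raw = text.encode("utf-16-le")
    # Group the bytes into 16-bit little-endian code units.
    units = [int.from_bytes(raw[i:i + 2], "little") for i in range(0, len(raw), 2)]
    units.append(0)  # terminating null word
    # A leading 0 keeps MASM from reading values such as E6h as a symbol name.
    words = ", ".join(f"0{u:04X}h" for u in units)
    return f"{label} dw {words}"


print(masm_utf16_data("wszText", "Særleg"))
# wszText dw 00053h, 000E6h, 00072h, 0006Ch, 00065h, 00067h, 00000h
```

The generated line contains only ASCII characters, so it can be pasted into a source file regardless of the editor's codepage.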

nidud · May 10, 2017, 04:50:03 AM

deleted

jj2007 · May 10, 2017, 06:44:03 AM

Attached two simple plain Masm32 sources, with and without Utf8 BOM.
Open them in Notepad, Wordpad, MS Word and qEditor to see the differences (there are differences, it's a mess).

Notepad++ btw treats both as UTF-8, which is slightly incorrect.

mineiro · May 10, 2017, 07:19:18 AM

I'm not sure, sir nidud;
I have only read about UTF-8; I have not yet started to code a UTF-8 string identifier.
I can start coding tomorrow; well, maybe it will be done 2 weeks later, who knows.

mineiro · May 10, 2017, 08:00:28 AM

Hello sir jj2007;
I remember that you talked about 2 bytes at the start of the file, I think inserted by Notepad and not welcome.
What do those bytes mean? Can I ignore those 2 bytes, or can they be some hint?
What have you found out about that?

nidud · May 10, 2017, 08:11:00 AM

deleted

mineiro · May 10, 2017, 08:31:27 AM

Quote
What is it you can't see, whats missing?

The symbols/fonts do not match each other; they are different symbols.

SÃ¦rleg != Særleg

SÃ¦rleg == 7 symbols on screen
Særleg == 6 symbols on screen
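
The 7-versus-6 difference above is what one would expect if UTF-8 bytes are displayed through a single-byte codepage. A short sketch, assuming the viewer's codepage is Windows-1252 (the thread does not say which codepage was actually in use):

```python
original = "Særleg"

# UTF-8 stores the letter ae as two bytes: C3 A6.
utf8_bytes = original.encode("utf-8")

# A viewer that assumes Windows-1252 shows each byte as its own character.
misread = utf8_bytes.decode("cp1252")

print(misread, len(misread))    # SÃ¦rleg 7
print(original, len(original))  # Særleg 6

# Nothing is lost yet: reversing the wrong step restores the text.
repaired = misread.encode("cp1252").decode("utf-8")
print(repaired == original)     # True
```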

jj2007 · May 10, 2017, 09:00:14 AM

Quote from: mineiro on May 10, 2017, 08:00:28 AM
What means that bytes? Can I ignore that 2 bytes or that can be some hint?

These three bytes are hex EF BB BF, and they are the UTF-8 "BOM", i.e. byte order mark. There is a fairly good description here.

The main point of the BOM is that the software knows that the ANSI chars following are UTF-8 encoded. It is a priori not possible to identify with certainty whether an ANSI text is encoded as UTF-8 or with any other codepage. And only if you know the codepage can you translate the text to "true" Unicode for displaying e.g. MessageBoxes correctly. This is why everybody (well, almost everybody) uses UTF-8.
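
As a rough illustration of that point, here is a minimal sketch of a BOM check with a fallback guess. The function name `sniff_utf8` and its return labels are made up for this example. Note that the fallback only proves the bytes are well-formed UTF-8; it cannot prove that UTF-8 was intended, and plain ASCII passes it too.

```python
import codecs


def sniff_utf8(data: bytes):
    """Classify a byte string; hypothetical helper for illustration."""
    # codecs.BOM_UTF8 is the three bytes EF BB BF.
    if data.startswith(codecs.BOM_UTF8):
        return "bom", data[len(codecs.BOM_UTF8):].decode("utf-8")
    try:
        # No BOM: a strict decode succeeds only for well-formed UTF-8.
        return "valid-utf8", data.decode("utf-8")
    except UnicodeDecodeError:
        # Some other codepage; which one cannot be told from the bytes alone.
        return "not-utf8", None


print(sniff_utf8(b"\xef\xbb\xbfS\xc3\xa6rleg"))  # ('bom', 'Særleg')
print(sniff_utf8(b"S\xc3\xa6rleg"))              # ('valid-utf8', 'Særleg')
print(sniff_utf8(b"S\xe6rleg"))                  # ('not-utf8', None)
```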

nidud · May 10, 2017, 09:02:48 AM

deleted

mineiro · May 10, 2017, 11:00:28 AM

hello sir jj2007;
I checked that. I opened Notepad, did not press any key on the keyboard, and saved the text in UTF-8 format; the file size is exactly those 3 bytes.
Exactly, thanks.

Hello sir nidud;
I understood what you say.
Your computer is configured for your region, your country (codepage). Mine is configured for another region, another country (codepage).
For me to be able to read what you write using your language's symbols (alphabet letters), I need to know the codepage the text was written in, because under my codepage what you wrote appears as garbage on screen. And this is mutual: from your point of view, to understand what I write in my language you need to switch your computer's codepage to the same one, or you will see garbage on screen.
So we start switching through all the codepages we know, in the hope that the garbage strings turn into strings of recognizable symbols.
I don't understand your language, but I can recognize the symbols of your language. So, while switching codepages, it can look as if the bytes fit fine as another language's symbols; this way I can believe that I have a French/Spanish/Russian text in my hands instead of a Norwegian text. We lose information.
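
The codepage-switching experiment described above can be sketched in a few lines. The list of candidate codepages is an arbitrary choice for illustration; every one of them decodes the same bytes without error, which is why the bytes alone cannot say which reading was meant.

```python
# The same six bytes, as a Windows-1252 machine would store "Særleg".
raw = "Særleg".encode("cp1252")

# Candidate codepages to try (arbitrary picks for this sketch).
candidates = ["cp1252", "cp1251", "cp1253", "cp850"]

# Each codepage maps byte E6 to a different letter or sign.
readings = {cp: raw.decode(cp) for cp in candidates}

for cp, text in readings.items():
    print(f"{cp:7} {text}")
```

Only the Windows-1252 reading gives the Norwegian word back; the Cyrillic and Greek codepages produce a string that looks equally "valid" to a program.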

I suppose that on your screen the symbols below look equal instead of different:
SÃ¦rleg != Særleg

edited--
"writed" to "written", sorry, poor English.

nidud · May 10, 2017, 11:46:22 AM

deleted

mineiro · May 10, 2017, 12:39:46 PM

Quote from: nidud on May 10, 2017, 11:46:22 AM
Do you understand Norwegian?

No, but I can recognize Norwegian chars, letters, symbols.

Quote
I first need to learn your language I think ;)

Yes, but you can recognize Latin chars, letters, symbols.

Quote
Doesn't help if I don't understand the language. In this case we have similar letters, but Chinese and Arabic, it doesn't really matter: I can't read it anyway.

Me neither, but I can try to recognize their chars, letters, symbols.

Quote
For which purpose?

To turn garbage data on screen into chars, letters, symbols, so that I can go and use a translator. If the translator returns nonsense words to me, then I know that I'm not dealing with that codepage's language.

Quote
How likely do you think it is that one man could have written the sample below with the same keyboard/OS and understand all these languages?

      "早上好计算机程序员。\n"
      "おはようのコンピュータのプログラマー。\n"
      "Хороший программист утром.\n"
      "Καλή προγραμματιστής ηλεκτρονικών υπολογιστών πρωί.\n"
      "सुप्रभात कंप्यूटर प्रोग्रामर.\n"
      "Chào buổi sáng lập trình máy tính.\n"
      "დილა მშვიდობისა, კომპიუტერული პროგრამისტი.\n"
      "Добро јутро компјутерски програмер.\n"
      "Բարի լույս ծրագրավորող.\n"
      "안녕하세요 컴퓨터 프로그래머.\n",

They used some escape sequence in the browser or text editor, hexadecimal codes, ... to insert those symbols, or a copy-and-paste operation. In the Firefox browser I was able to enter such chars with this sequence:
control + shift + u
so this appears: u
then I typed the hexadecimal sequence 2654:
u2654
then I pressed the Enter key and it turned into a valid symbol: a chess king (char, letter, symbol).
♔

This is what I'd like to test, because I read that UTF-8 chars can have a variable size in bytes, not only 2 bytes, when extended chars are shown on screen.
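
A quick way to run that test is to encode a few characters and count the bytes. This is a small sketch with sample characters picked for illustration: one ASCII letter, the Norwegian letter from earlier in the thread, the chess king U+2654, and one character outside the Basic Multilingual Plane.

```python
samples = ["A", "æ", "♔", "😀"]

# Number of bytes each character needs in UTF-8.
sizes = {ch: len(ch.encode("utf-8")) for ch in samples}

for ch, n in sizes.items():
    print(f"U+{ord(ch):04X} {ch} -> {n} byte(s): {ch.encode('utf-8').hex(' ')}")

# U+0041 A -> 1 byte(s): 41
# U+00E6 æ -> 2 byte(s): c3 a6
# U+2654 ♔ -> 3 byte(s): e2 99 94
# U+1F600 😀 -> 4 byte(s): f0 9f 98 80
```

So the chess king takes 3 bytes in UTF-8, while in UTF-16 it is a single 16-bit unit.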

The MASM Forum
