The MASM Forum

General => The Laboratory => Topic started by: K_F on March 26, 2023, 08:56:49 PM

Title: Looking for a UTF-8 table in DB (DD..etc) format
Post by: K_F on March 26, 2023, 08:56:49 PM
Before I get overexcited and spend the rest of my life typing out this table myself, :wink2:
does anyone know if there is such a table floating around?

I'm making a complete reference table (HTML Tags and Entities) in UTF-8, as this format has approx. 95% compatibility.

It can be in any of these formats, for example:
   DB   "0022",0       ; "&quot"   --- Preferably every code sequence
   DB   "&quot",0      ; 0x0022

:thumbsup:
Title: Re: Looking for a UTF-8 table in DB (DD..etc) format
Post by: NoCforMe on March 27, 2023, 06:23:00 AM
This page (https://www.charset.org/utf-8) "lists the first 100,000 characters on 100 pages". (Hmm, just how many UTF-8 characters are there?)

(Sorry, it's not in DB/DD format. You could write a small program to format it, or perhaps JJ has some magic stuff that'll do that?)
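
A minimal sketch of such a "small program", in Python rather than assembly, purely as an illustration. It assumes the named entities in Python's standard `html.entities.name2codepoint` table (the HTML 4.01 set, a few hundred names) are a good enough starting point; the function name `masm_entity_rows` and the column layout are made up here to mimic the two example lines in the first post.

```python
from html.entities import name2codepoint


def masm_entity_rows(names=None):
    """Return MASM DB lines for HTML named entities.

    For each entity name, two lines are produced, mirroring the two
    layouts from the first post: code point first, then name first.
    'names' is an optional list of entity names (without '&'); by
    default every name in name2codepoint is used, ordered by code point.
    """
    if names is None:
        names = sorted(name2codepoint, key=name2codepoint.get)
    rows = []
    for name in names:
        cp = name2codepoint[name]  # e.g. 'quot' -> 34 (0x22)
        rows.append(f'   DB   "{cp:04X}",0       ; "&{name}"')
        rows.append(f'   DB   "&{name}",0      ; 0x{cp:04X}')
    return rows


if __name__ == "__main__":
    # Print a short sample; drop the argument to dump the whole table.
    for line in masm_entity_rows(["quot", "amp", "nbsp"]):
        print(line)
```

Redirecting the output to a file gives something that can be pulled into a source file with `INCLUDE`. The entity names are plain ASCII, so the generated lines contain nothing the editor or assembler should choke on.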
Title: Re: Looking for a UTF-8 table in DB (DD..etc) format
Post by: K_F on March 27, 2023, 07:37:48 AM
Quote from: NoCforMe on March 27, 2023, 06:23:00 AM
(Hmm, just how many UTF-8 characters are there?)
Thanks, I'll look at that.
I think it's somewhere around 150,000, IIRC.

Looks like this page might be the one to check out:

https://www.w3schools.com/charsets/ref_html_utf8.asp
Title: Re: Looking for a UTF-8 table in DB (DD..etc) format
Post by: NoCforMe on March 27, 2023, 08:19:22 AM
One question, though: your example used the HTML identifier "&quot". But are there such identifiers for the entire UTF-8 charset? I don't think so. On the other hand, you can't use the actual characters either, as they can't be handled by whatever editor you're using or by the assembler. So I'm not sure how you would encode this table.
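
One possible way around the editor/assembler problem, sketched in Python as an assumption rather than anything proposed in the thread: keep the source file pure ASCII and emit each character as its UTF-8 byte sequence in hex. The helper name `masm_utf8_bytes` and the zero terminator are illustrative choices.

```python
def masm_utf8_bytes(code_point):
    """Return a MASM DB line holding the UTF-8 bytes of one code point.

    Every byte is written as a hex literal with a leading 0 (MASM needs
    hex literals to start with a digit), followed by a 0 terminator.
    Surrogate code points (U+D800..U+DFFF) cannot be encoded as UTF-8
    and raise UnicodeEncodeError.
    """
    data = chr(code_point).encode("utf-8")
    hex_bytes = ",".join(f"0{b:02X}h" for b in data)
    return f"   DB   {hex_bytes},0      ; U+{code_point:04X}"


if __name__ == "__main__":
    print(masm_utf8_bytes(0x0022))  # quotation mark, one byte
    print(masm_utf8_bytes(0x00A0))  # no-break space, two bytes
    print(masm_utf8_bytes(0x2708))  # airplane dingbat, three bytes
```

With this layout the table never contains a raw non-ASCII character, so it does not matter what encoding the editor uses to save the file.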
Title: Re: Looking for a UTF-8 table in DB (DD..etc) format
Post by: K_F on March 28, 2023, 04:46:10 AM
Quote from: NoCforMe on March 27, 2023, 08:19:22 AM
One question, though: your example used the HTML identifier "&quot". But are there such identifiers for the entire UTF-8 charset? I don't think so.  So I'm not sure how you would encode this table.
Both the numerical and text formats are found in html files.
Quote<tr><th colspan=9 class='ct'>Trainer &nbsp;&nbsp;A DINGBAT</th></tr>

The entities seem to be in either format (text or numerical, the latter in decimal or hex), so the idea is to create a table of Tags and Entities that one can use for filtering required information. It looks like I'll be making a few tables of the same thing, but in different formats.
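
To show how the text and numerical forms relate, here is a rough Python sketch that reduces either form to the same code point. HTML numeric character references are written in decimal (`&#34;`) or hex (`&#x22;`). The sketch relies on the standard library's `html.unescape`; the wrapper name `entity_code_point` is made up for this example.

```python
import html


def entity_code_point(entity):
    """Map an HTML entity (named, decimal or hex) to its code point.

    Returns None when the text is not a recognised entity, or when the
    entity expands to more than one character (some HTML5 names do).
    """
    text = html.unescape(entity)
    if len(text) != 1 or text == entity:
        return None
    return ord(text)


if __name__ == "__main__":
    for sample in ("&quot;", "&#34;", "&#x22;", "&nbsp;", "&bogus;"):
        print(sample, entity_code_point(sample))
```

Since all three spellings of the quotation mark collapse to 0x22, a single table keyed on code point could be used to generate the separate text, decimal and hex tables rather than maintaining each by hand.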

From what I've seen so far, about 2000 entries per table should suffice.
This is naturally for the English language, but I'll make it flexible so that speakers of other languages can fill in their own language tables, or just change the code.
:thumbsup: