Looking for a UTF-8 table in DB (DD..etc) format

Started by K_F, March 26, 2023, 08:56:49 PM


K_F

Before I get overexcited and spend the rest of my life typing out this table myself,   :wink2:
does anyone know if there is such a table floating around?

I'm making a complete reference table (HTML Tags and Entities) in UTF-8, as this format has approx. 95% compatibility.

It can be in any of the formats, for example:
   DB   "0022",0       ; "&quot"   --- Preferably every code sequence
   DB   "&quot",0      ; 0x0022

:thumbsup:
'Sire, Sire!... the peasants are Revolting !!!'
'Yes, they are.. aren't they....'

NoCforMe

This page "lists the first 100,000 characters on 100 pages". (Hmm, just how many UTF-8 characters are there?)

(Sorry, not in DB/DD format. You could write a small program to format it? Or perhaps JJ has some magic stuff that'll do that?)
Assembly language programming should be fun. That's why I do it.
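
The "small program to format it" idea could look something like the sketch below. It is only an illustration, not anything posted in this thread: it assumes Python 3 is available and uses the standard library's `html.entities.html5` table as the data source instead of scraping a web page. The function name `masm_entity_rows` and the exact column layout are made up here to mimic the second DB format shown in the first post.

```python
from html.entities import html5

def masm_entity_rows():
    """Build MASM DB lines: entity name as a string, code point(s) in the comment."""
    rows = []
    for name, text in sorted(html5.items()):
        if not name.endswith(";"):
            continue  # skip the legacy spellings that lack the trailing semicolon
        # A few named entities expand to two code points, so list every one.
        codes = " ".join(f"0x{ord(ch):04X}" for ch in text)
        rows.append(f'   DB   "&{name}",0      ; {codes}')
    return rows

if __name__ == "__main__":
    for row in masm_entity_rows():
        print(row)
```

Redirecting the output to a file gives an include file that can be pasted into a `.data` section. Swapping the two fields in the f-string would give the other layout (code first, name in the comment).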

K_F

Quote from: NoCforMe on March 27, 2023, 06:23:00 AM
(Hmm, just how many UTF-8 characters are there?)
Thanks, I'll look at that.
Think somewhere around 150,000 IIRC.

Looks like this page might be the one to check out..

https://www.w3schools.com/charsets/ref_html_utf8.asp

NoCforMe

One question, though: your example used the HTML identifier "&quot". But are there such identifiers for the entire UTF-8 charset? I don't think so. On the other hand, you can't use the actual characters, as they may not be handled by whatever editor you're using or by the assembler. So I'm not sure how you would encode this table.
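
One possible way around the editor/assembler problem, sketched here as an assumption rather than as anyone's actual method, is to never type the character at all and emit its UTF-8 bytes as hex values instead. The helper name `utf8_db_line` and the comment layout are hypothetical; the hex literals get a leading zero because MASM requires hex constants to start with a digit.

```python
def utf8_db_line(ch, label=""):
    """Format one character as a zero-terminated MASM DB line of its UTF-8 bytes."""
    data = ch.encode("utf-8")
    # Leading 0 keeps values such as E2h valid as MASM hex constants.
    byte_list = ",".join(f"0{b:02X}h" for b in data)
    return f"   DB   {byte_list},0      ; U+{ord(ch):04X} {label}".rstrip()

if __name__ == "__main__":
    print(utf8_db_line('"', "&quot;"))
    print(utf8_db_line("\u20ac", "&euro;"))
```

The source file then stays pure ASCII, and the comment column carries the code point and, where one exists, the entity name.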

K_F

Quote from: NoCforMe on March 27, 2023, 08:19:22 AM
One question, though: your example used the HTML identifier "&quot". But are there such identifiers for the entire UTF-8 charset? I don't think so.  So I'm not sure how you would encode this table.
Both the numerical and text formats are found in html files.
Quote
<tr><th colspan=9 class='ct'>Trainer &nbsp;&nbsp;A DINGBAT</th></tr>

The entities seem to be in either format (text or numerical (decimal or hex)), so the idea is to create a table of Tags and Entities that one can use for filtering required information. It looks like I'll be making a few tables of the same thing, but in different formats.
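
As a rough illustration of the filtering step, here is a sketch that picks out entities in any of the three forms HTML allows (named such as `&quot;`, decimal such as `&#34;`, hex such as `&#x22;`) and resolves each one to its character. It assumes Python 3; `ENTITY_RE` and `find_entities` are names invented for this example, and the decoding is delegated to the standard library's `html.unescape`.

```python
import re
from html import unescape

# Matches decimal (&#34;), hex (&#x22;) and named (&quot;) references.
ENTITY_RE = re.compile(r"&(#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);")

def find_entities(html_text):
    """Return (entity, decoded text) pairs in the order they appear."""
    return [(m.group(0), unescape(m.group(0)))
            for m in ENTITY_RE.finditer(html_text)]

if __name__ == "__main__":
    line = "<tr><th colspan=9 class='ct'>Trainer &nbsp;&nbsp;A DINGBAT</th></tr>"
    for entity, text in find_entities(line):
        print(entity, " ".join(f"U+{ord(ch):04X}" for ch in text))
```

An unknown name such as `&foo;` comes back undecoded, which is one way to spot entries missing from the table.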

From what I've looked at till now, it looks like about 2000 entries per table will suffice.
This is naturally for the English language, but I'll make it flexible for other languages that native speakers can use with their own language tables filled in, or just change the code.
:thumbsup:

