### Topic: Unicode strings

The switch /ws is added to convert "quoted strings" to Unicode. In addition to this the OPTION WSTRING:[ON|OFF] is added. This will allow declaration of a string array directly in the data segment using DW:
```
    option wstring:on
    dw "Declaring a Unicode string",0
```
Test case:
```
include conio.inc
include ctype.inc

    .code

main proc

    _cputws( "Type 'Y' when finished typing keys: " )
    .repeat
        toupper( _getwch() )
    .until al == 'Y'

    _putwch( eax )  ; 'Y'
    _putwch( 13 )   ; Carriage return
    _putwch( 10 )   ; Line feed
    xor eax,eax
    ret

main endp

    END main
```
Makefile:
```
_getwch.exe:
	asmc -ws -pe -D__PE__ -D_WIN64 $*.asm
	$@
	pause
```
This also works with the @CStr() macro.

Nice idea.

Hi nidud,

Nice work with asmc. Poasm has the same approach: dw is used for words or Unicode strings. Why not use du, which looks more specific than dw?

> Poasm has the same approach: dw is used for words or Unicode strings. Why not use du, which looks more specific than dw?

The plan was to rewrite the string parsing for this implementation, only targeting the existing functionality of the string usage. However, I had to create a whole duplicated set of functions for this to work correctly, so it became rather extensive.

Extending the DW functionality, on the other hand, was rather simple, so I went with the lazy approach: you then only have to flip DB to DW in the end. Given that it's already possible to define little-endian values from a quoted string using DW, this does create some compatibility issues, but at the same time it also adds some extended functionality to the data section.

` dw "ab"`
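To make the compatibility issue concrete, here is a small Python sketch (not asmc code) of the two readings of `dw "ab"`. It assumes the classic MASM-style behaviour, where a short quoted string in a DW is treated as a number with the first character in the high byte, and it models the wide reading as plain UTF-16LE. The function names are made up for illustration.

```python
import struct

def dw_as_number(text):
    """Classic reading (assumed MASM-style): "ab" is the 16-bit value 6162h."""
    value = 0
    for byte in text.encode("ascii"):
        value = (value << 8) | byte
    return struct.pack("<H", value)      # stored little-endian in memory

def dw_as_wide_string(text):
    """Wide reading: one 16-bit code unit per character (UTF-16LE)."""
    return text.encode("utf-16-le")

print(dw_as_number("ab").hex(" "))       # 62 61
print(dw_as_wide_string("ab").hex(" "))  # 61 00 62 00
```

So the same source line yields two bytes in one mode and four in the other, which is where old code that relied on the numeric reading could break.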
In addition to this the string-hash used for detecting duplicated strings still uses the original ASCII string, so switching the option on and off may have unintended consequences:
```
    mov eax,@CStr( "abcdef" )
    option wstring:on
    mov eax,@CStr( "def" )
```
The last one will now use the offset of the first ASCII string + 3.
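A toy model of that effect in Python may help. This is only a sketch under assumptions: the `StringPool` class, its label format and the `key_includes_width` switch are hypothetical and not taken from the asmc source. It just mimics a pool that looks strings up by their source text and reuses the tail of an earlier string.

```python
class StringPool:
    """Toy literal pool; hypothetical, not asmc's real data structure."""

    def __init__(self, key_includes_width):
        self.key_includes_width = key_includes_width
        self.entries = []                # (label, source text, is_wide)

    def intern(self, text, wide):
        """Return (label, byte offset) for a literal, reusing a known tail."""
        for label, known, known_wide in self.entries:
            comparable = known_wide == wide or not self.key_includes_width
            if comparable and known.endswith(text):
                unit = 2 if known_wide else 1
                return label, (len(known) - len(text)) * unit
        label = "DS%04X" % len(self.entries)
        self.entries.append((label, text, wide))
        return label, 0

buggy = StringPool(key_includes_width=False)
buggy.intern("abcdef", wide=False)
print(buggy.intern("def", wide=True))    # ('DS0000', 3): points into the ASCII string

fixed = StringPool(key_includes_width=True)
fixed.intern("abcdef", wide=False)
print(fixed.intern("def", wide=True))    # ('DS0001', 0): gets its own wide copy
```

Making the width part of the lookup key is one possible remedy; the thread does not say whether asmc does this.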

Well, for now the only usage is Asmc -ws, so it won't create any backward compatibility issues, given that it will mainly be used when writing in this syntax. The implementation in the data segment may change later if this becomes an issue, and a DU directive may then be a solution.

> The switch /ws is added to convert "quoted strings" to Unicode. In addition to this the OPTION WSTRING:[ON|OFF] is added. This will allow declaration of a string array directly in the data segment using DW:

Compliments, this has needed to be done for a long time.
Looking at the test case it assumes a start-up module, ending with RET, and then END main. Should probably end with exit(0).

It still works, though.

Some changes added to the string functions:
- added null ( "" ) string
- added ("Multi" "Text" "Lines")
- added (L"Unicode") string macro

Test case:
```
    .386
    .model flat, stdcall
    .code

S   equ <" ">
T   equ <"\t">
N   equ <"\n">
A   equ <"Auto">
U   equ <"Unicode">
W   equ <L"Unicode">

foo proc a1
foo endp

bar proc a1, a2
bar endp

    foo( A )    ; DS0000
    foo( "" )   ; DS0000[4]
    foo( U )    ; DS0001
    foo( W )    ; DS0002

    bar( A, A ) ; 0,0
    bar( A, U ) ; 0,1
    bar( A, W ) ; 0,2
    bar( U, A ) ; 1,0
    bar( U, U ) ; 1,1
    bar( U, W ) ; 1,2
    bar( W, A ) ; 2,0
    bar( W, U ) ; 2,1
    bar( W, W ) ; 2,2

    foo( U T A N ; DS0003
         A T U N )
    foo( W T A N ; DS0004
         A T U N )
    bar( U T A N ; 3
         A T U N,
         W T A N ; 4
         A T U N )

    mov eax,@CStr( "" )         ; 3 -- DS0003[26]
    mov eax,@CStr( "Auto" )     ; 0
    mov eax,@CStr( "Unicode" )  ; 1
    mov eax,@CStr( L"Unicode" ) ; 2
    mov eax,@CStr(
        "Unicode" "\t" "Auto" "\n" ; 3
        "Auto" "\t" "Unicode" "\n" )

    foo( "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\n"
         "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\n"
         "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\n"
         "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\n"
         "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\n"
         "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\n"
         "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\n"
         "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\n"
         "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\n"
         "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\n"
         "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\n"
         "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\n"
         "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\n"
         "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\n"
         "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\n"
         "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\n"
         "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\n"
         "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\n"
         "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\n"
         "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\n"
         "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\n"
         "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\n"
         "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\n"
         "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\n"
         "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\n"
         "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\n"
         "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\n"
         "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\n"
         "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\n" )

    END
```
A special flag is set if <L"> is found inside a proc( L"" ) or inside @CStr( L"" ). This will enable expansion of DW "string" without the -ws switch or wstring option set.

A macro named L will not be affected, but a text equate will strip the L.

The total size of the separated strings is currently limited to maximum line size (2K).

Added OPTION CODEPAGE:<value> for Unicode creation. This is basically the first argument to MultiByteToWideChar(). The default value is 0.

```
;
;  Code Page Default Values.
;
CP_ACP          equ 0       ; default to ANSI code page
CP_OEMCP        equ 1       ; default to OEM  code page
CP_MACCP        equ 2       ; default to MAC  code page
CP_THREAD_ACP   equ 3       ; current thread's ANSI code page
CP_SYMBOL       equ 42      ; SYMBOL translations
CP_UTF7         equ 65000   ; UTF-7 translation
CP_UTF8         equ 65001   ; UTF-8 translation
```
The switch /ws is also extended to /ws[[=]<value>]
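Why the code page matters can be shown with a short Python sketch. It is only an approximation of what MultiByteToWideChar() does: the `CODECS` table and the `to_wide` helper are assumptions made for the example, and the system-dependent values such as CP_ACP (0) are left out because they have no fixed mapping.

```python
# Hypothetical mapping from Windows code page numbers to Python codec names.
CODECS = {865: "cp865", 1252: "cp1252", 65001: "utf-8"}

def to_wide(raw, codepage):
    """Decode source bytes with the given code page and emit UTF-16LE."""
    text = raw.decode(CODECS[codepage], errors="replace")
    return text.encode("utf-16-le")

source = "ø".encode("utf-8")             # bytes C3 B8, as stored in a UTF-8 file

print(to_wide(source, 65001).hex(" "))   # f8 00
print(to_wide(source, 865).hex(" "))     # two unrelated code units instead of one
```

The same two source bytes give one correct wide character under code page 65001 and two wrong ones under 865, so the option has to match the encoding the source file was saved in.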

I made a few changes to the @CStr() macro to enable usage in the data segment. Normally the macro inserts .data at the beginning and .code at the end. This is now skipped if the macro is already used in the data segment.

In addition, the macro normally returns the offset of the created string, but this is now skipped if the macro is the first token of the line.

Example

```
; classic form
usage   db 'Usage:',9,'NOLPT 1',9,9,'disable LPT1',10
        db 9,'...',10
        db 9,'NOLPT 4',9,9,'disable LPT4',10
        db 9,'NOLPT 1U',9,'uninstall from LPT1',10
        db 9,'etc.',10
        db 0

; same text using @CStr()
usage   label byte
@CStr( "Usage:\tNOLPT 1\t\tdisable LPT1\n"
       "\t...\n"
       "\tNOLPT 4\t\tdisable LPT4\n"
       "\tNOLPT 1U\tuninstall from LPT1\n"
       "\tetc.\n" )
```
In addition to allowing C escape characters, this makes it possible to flip from ASCII to Unicode in the data segment using the option or the switch.
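As a sanity check on the two forms, the Python sketch below builds the bytes of the DB list and of the escaped string and compares them. It assumes that @CStr() appends a zero terminator like a C string literal and that the wide form is plain UTF-16LE; neither detail is spelled out in this post.

```python
# The db form: quoted pieces interleaved with numeric character codes.
db_items = ['Usage:', 9, 'NOLPT 1', 9, 9, 'disable LPT1', 10,
            9, '...', 10,
            9, 'NOLPT 4', 9, 9, 'disable LPT4', 10,
            9, 'NOLPT 1U', 9, 'uninstall from LPT1', 10,
            9, 'etc.', 10,
            0]
db_bytes = b"".join(bytes([item]) if isinstance(item, int)
                    else item.encode("ascii")
                    for item in db_items)

# The @CStr() form: adjacent literals with C escapes, plus the terminator.
cstr_text = ("Usage:\tNOLPT 1\t\tdisable LPT1\n"
             "\t...\n"
             "\tNOLPT 4\t\tdisable LPT4\n"
             "\tNOLPT 1U\tuninstall from LPT1\n"
             "\tetc.\n")
ascii_bytes = cstr_text.encode("ascii") + b"\x00"          # assumed terminator
wide_bytes = cstr_text.encode("utf-16-le") + b"\x00\x00"   # after the flip

print(db_bytes == ascii_bytes)                   # True
print(len(wide_bytes) == 2 * len(ascii_bytes))   # True
```

With the DB list, going wide would mean rewriting every numeric code as a word by hand; with the escaped string only the encoding step changes.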

```
string dd @CStr( "string" )
```

```
; (source file saved as UTF-8 with a byte order mark, shown as "ï»¿" in the original post)
; Build: asmc /pe test.asm
    .486
    .model flat, c
    option dllimport:<msvcrt.dll>
printf proto :ptr, :vararg
exit   proto :dword
    .code
start:
    printf("BOM detected\n")
    exit(0)
    end start
```
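The post does not describe how asmc handles the mark, so the following is only a guess at the general idea: look for the three UTF-8 BOM bytes at the start of the file and skip them before tokenizing. The helper name `strip_utf8_bom` is made up.

```python
import codecs

def strip_utf8_bom(raw):
    """Return (had_bom, payload) for the raw bytes of a source file."""
    if raw.startswith(codecs.BOM_UTF8):
        return True, raw[len(codecs.BOM_UTF8):]
    return False, raw

raw = codecs.BOM_UTF8 + b"; Build: asmc /pe test.asm\n"
print(strip_utf8_bom(raw))       # (True, b'; Build: asmc /pe test.asm\n')

# Read as ANSI (cp1252) text, the three bytes EF BB BF show up as "ï»¿".
print(codecs.BOM_UTF8.decode("cp1252") == "ï»¿")   # True
```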

- fixed a bug in RIP-relative offsets in 64-bit

This applies to using strings in combination with the -pe switch in 64-bit. The logic of string creation is to reuse strings already created. perror( "Nothing to do.." ) followed by strcpy( &path, "." ) will reuse the end of the first string as argument: LEA RDX,DS0000[14]. This failed due to an error in calculating the address + 14.

Test case:
```
.x64
.model  flat, fastcall
option  dllimport:<msvcrt>
printf  proto :ptr, :vararg
exit    proto :dword
.data
string  db 16 dup(0)
format  db "%s",10,0
pointer db "%p",10,"%p",10,0
.code
main proc
    mov string[0],'a'
    mov string[1],'b'
    mov string[2],'c'
    mov string[3],'d'
    mov string[4],'e'
    mov string[5],'f'
    invoke printf,addr format, addr string
    lea rdx,string
    lea r8,string[1]
    invoke printf,addr pointer, rdx, r8
    invoke exit,0
main endp
    end main
```
Output

Old version:
```
a
0000000000403000
0000000000403002
...
000000000040100 | C6 05 F9 1F 00 00 61    | mov byte ptr ds:[403000],61
000000000040100 | C6 05 F4 1F 00 00 62    | mov byte ptr ds:[403002],62
000000000040100 | C6 05 EF 1F 00 00 63    | mov byte ptr ds:[403004],63
000000000040101 | C6 05 EA 1F 00 00 64    | mov byte ptr ds:[403006],64
000000000040101 | C6 05 E5 1F 00 00 65    | mov byte ptr ds:[403008],65
000000000040102 | C6 05 E0 1F 00 00 66    | mov byte ptr ds:[40300A],66
```
New version:
```
abcdef
0000000000403000
0000000000403001
```
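The displacement bytes in the old listing can be checked by hand. The Python sketch below does the arithmetic under two assumptions: that the first instruction starts at 0x401000 (the address column in the listing is cut off, so this is inferred), and that each store is 7 bytes long, as the encoded bytes suggest. The helper names are made up.

```python
def rip_target(instr_addr, instr_len, disp32):
    """Address referenced by a RIP-relative operand."""
    if disp32 & 0x80000000:              # sign-extend the 32-bit displacement
        disp32 -= 1 << 32
    return instr_addr + instr_len + disp32

def rip_disp32(instr_addr, instr_len, target):
    """Displacement to encode so that the operand reaches target."""
    return (target - (instr_addr + instr_len)) & 0xFFFFFFFF

# Second store from the old listing: C6 05 F4 1F 00 00 62
print(hex(rip_target(0x401007, 7, 0x1FF4)))    # 0x403002, one byte too far
print(hex(rip_disp32(0x401007, 7, 0x403001)))  # 0x1ff3, what string[1] needs
```

Under those assumptions every old displacement lands on string + 2*n instead of string + n, which matches the "a" printed by the old version: the byte after 'a' was never written.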

More Unicode/ASCII testing.

I've added some macros for declaring a resource directly in the source file for the -pe switch in the winres.inc file. The base for the test is the RichEdit sample tut35.

The resource is declared at the end of the source file and looks similar to the .RC file. However, the top-level definition has to be done manually, so the declaration looks like this:
```
WinStart proc
    mov ebx,GetModuleHandle(0)
    ExitProcess(WinMain(ebx, 0, GetCommandLine(), SW_SHOWDEFAULT))
WinStart endp

RCBEGIN

    RCTYPES 3
    RCENTRY RT_MENU
    RCENTRY RT_DIALOG
    RCENTRY RT_ACCELERATOR
    RCENUMN 1
    RCENUMX IDR_MAINMENU
    RCENUMN 4
    RCENUMX IDD_OPTIONDLG
    RCENUMX IDD_FINDDLG
    RCENUMX IDD_GOTODLG
    RCENUMX IDD_REPLACEDLG
    RCENUMN 1
    RCENUMX 105
    REPEAT 6
    RCLANGX LANGUAGEID
    ENDM

    MENUBEGIN
      MENUNAME IDS_FILE
        MENUITEM 0, IDM_OPEN,   IDS_OPEN
        MENUITEM 0, IDM_CLOSE,  IDS_CLOSE
        MENUITEM 0, IDM_SAVE,   IDS_SAVE
        MENUITEM 0, IDM_SAVEAS, IDS_SAVEAS
        SEPARATOR
        MENUITEM MF_END, IDM_EXIT, IDS_EXIT
      MENUNAME IDS_EDIT
        MENUITEM 0, IDM_UNDO,   IDS_UNDO
        MENUITEM 0, IDM_REDO,   IDS_REDO
        MENUITEM 0, IDM_COPY,   IDS_COPY
        MENUITEM 0, IDM_CUT,    IDS_CUT
        MENUITEM 0, IDM_PASTE,  IDS_PASTE
        SEPARATOR
        MENUITEM 0, IDM_DELETE, IDS_DELETE
        SEPARATOR
        MENUITEM MF_END, IDM_SELECTALL, IDS_SELECTALL
      MENUNAME IDS_SEARCH
        MENUITEM 0, IDM_FIND,     IDS_FIND
        MENUITEM 0, IDM_FINDNEXT, IDS_FINDNEXT
        MENUITEM 0, IDM_FINDPREV, IDS_FINDPREV
        MENUITEM 0, IDM_REPLACE,  IDS_REPLACE
        SEPARATOR
        MENUITEM MF_END, IDM_GOTOLINE, IDS_GOTO
      MENUITEM MF_END, IDM_OPTION, IDS_OPTIONS
    MENUEND

    DLGFLAGS equ DS_MODALFRAME or DS_SETFONT or WS_POPUP or WS_VISIBLE or WS_CAPTION or WS_SYSMENU

    DLGBEGIN DLGFLAGS,7,0,0,196,60
     CAPTION IDS_OPTIONS
     FONT 8, "MS Sans Serif"
      DEFPUSHBUTTON   IDS_OK,IDOK,137,7,50,14;39,14
      PUSHBUTTON      IDS_CANCEL,IDCANCEL,137,25,50,14
      GROUPBOX        0,IDC_STATIC,5,0,124,49
      LTEXT           IDS_BACKGR,IDC_STATIC,20,14,60,8
      LTEXT           0,IDC_BACKCOLORBOX,85,11,28,14,SS_NOTIFY or WS_BORDER
      LTEXT           IDS_FOREGR,IDC_STATIC,20,33,35,8
      LTEXT           0,IDC_TEXTCOLORBOX,85,29,28,14,SS_NOTIFY or WS_BORDER
    DLGEND

    DLGBEGIN DLGFLAGS,9,0,0,186,54
     CAPTION IDS_FIND2
     FONT 8, "MS Sans Serif"
      EDITTEXT        IDC_FINDEDIT,42,3,94,12,ES_AUTOHSCROLL
      CONTROL         IDS_MATCHCASE,IDC_MATCHCASE,RC_BUTTON,BS_AUTOCHECKBOX or WS_TABSTOP,6,24,54,10
      CONTROL         IDS_WHOLEWORD,IDC_WHOLEWORD,RC_BUTTON,BS_AUTOCHECKBOX or WS_TABSTOP,6,37,56,10
      CONTROL         IDS_DOWN,IDC_DOWN,RC_BUTTON,BS_AUTORADIOBUTTON or WS_TABSTOP,83,27,35,10
      CONTROL         IDS_UP,IDC_UP,RC_BUTTON,BS_AUTORADIOBUTTON or WS_TABSTOP,83,38,25,10
      DEFPUSHBUTTON   IDS_OK,IDOK,141,3,39,12
      PUSHBUTTON      IDS_CANCEL,IDCANCEL,141,18,39,12
      LTEXT           IDS_FINDWHAT,IDC_STATIC,5,4,34,8
      GROUPBOX        IDS_DIRECTION,IDC_STATIC,70,18,64,32
    DLGEND

    DLGBEGIN DLGFLAGS,4,0,0,106,30,WS_EX_TOOLWINDOW
     CAPTION IDS_GOTO2
     FONT 8, "MS Sans Serif", 0, 0, 0x1
      EDITTEXT        IDC_LINENO,29,4,35,11,ES_AUTOHSCROLL or ES_NUMBER,WS_EX_CLIENTEDGE
      DEFPUSHBUTTON   IDS_OK,IDOK,70,4,31,11
      PUSHBUTTON      IDS_CANCEL,IDCANCEL,70,17,31,11
      LTEXT           IDS_LINE,IDC_STATIC,8,5,18,8
    DLGEND

    DLGBEGIN DLGFLAGS,6,0,0,186,33
     CAPTION IDS_REPLACE2
     FONT 8, "MS Sans Serif"
      EDITTEXT        IDC_FINDEDIT,51,3,84,12,ES_AUTOHSCROLL
      EDITTEXT        IDC_REPLACEEDIT,51,17,84,11,ES_AUTOHSCROLL
      DEFPUSHBUTTON   IDS_OK,IDOK,142,3,39,11
      PUSHBUTTON      IDS_CANCEL,IDCANCEL,142,17,39,11
      LTEXT           IDS_FINDWHAT,IDC_STATIC,3,4,34,8
      LTEXT           IDS_REPLACEWITH,IDC_STATIC,3,18,42,8
    DLGEND

    DLGBEGIN 0x0046000B,0x00009C4E,11,71,0x9C51,0,11
      db 70,0, 78,156,0,0, 11,0, 71,0, 81,156, 0,0
      db 11,0,82,0,80,156,0,0,3,0,114,0,79,156,0,0
      db 139,0,114,0,82,156,0,0
    DLGEND

RCEND

    end WinStart
```
The IDS strings are declared in separate files in this case. I added three languages, with one set of files in ASCII and one in UTF-8. The specific language file sets the Unicode/ASCII definition and the code page.

```
;
; Build: asmc -pe test.asm
;
__PE__   equ 1
_CType   equ <stdcall>
include lang/en.txt
include windows.inc
include richedit.inc
include winres.inc
```
```
LANGUAGEID      equ LANGID_NN
if 1 ; Set if not local
 _UNICODE       equ 1
 option         wstring:on
 option         codepage:865
endif
IDS_FILE        equ <"&Fil">
```
The resource macro RCBEGIN turns the Unicode option on, given that menus and dialogs are always Unicode, but the rest of the code may use A or W depending on location.

Added a second test with syntax highlighting. This uses the same method as Doszip Edit with support for numbers and quotes. I also added a font selector to the menu.
Added some changes to the @CStr() macro.

If used in the data segment with a return code the string will be created in the const segment. This will enable declaration of text directly to a pointer.

```
rterrmsgs   struct
rterrno     int_t ?     ;; error number
rterrtxt    wstring_t ? ;; text of error message
rterrmsgs   ends

.data

rterrmsgs { _RT_FLOAT, @CStr( _RT_FLOAT_TXT ) };
```

I have done a small test of the ASMC Unicode facilities. This is the first time I have used ASMC, so a few things are probably not right. But it works!

So, I made a project in Visual Studio as follows:

```
#include <conio.h>

extern void testASMC(void);

short *someRussianVS = L"VS1 - Встре́ча с медве́дем мо́жет быть о́чень опа́сна.\n";

int main()
{
    _cputws(someRussianVS);
    _cputws(L"VS2 - Встре́ча с медве́дем мо́жет быть о́чень опа́сна.\n\n");
    testASMC();
    return 0;
}
```
```
includelib G:\asmc\lib\libc.lib
include G:\asmc\include\conio.inc
include G:\asmc\include\ctype.inc
include G:\asmc\include\winnls.inc

;option wstring:on
OPTION CODEPAGE:CP_UTF8

    .const
someRussian dw "ASMC 1 - Встре́ча с медве́дем мо́жет быть о́чень опа́сна.",10,0

    .code
testASMC proc
    _cputws(offset someRussian)
    _cputws(@CStr( "ASMC 2 - Встре́ча с медве́дем мо́жет быть о́чень опа́сна.\n"));
    _getch()
    ret
testASMC endp
END
```

Output:
VS1 - Встре́ча с медве́дем мо́жет быть о́чень опа́сна.
VS2 - Встре́ча с медве́дем мо́жет быть о́чень опа́сна.

ASMC 1 - Встре́ча с медве́дем мо́жет быть о́чень опа́сна.
ASMC 2 - Встре́ча с медве́дем мо́жет быть о́чень опа́сна.

It is amazing that the same ASMC source code builds both for 32-bit and 64-bit without any change.
Even the same libc.lib works both for 32-bit and 64-bit. (In other words, why do I need to declare it if it is not needed at all in this program?)