Topic: Unicode strings

nidud

  • Member
  • *****
  • Posts: 1411
    • https://github.com/nidud/asmc
Unicode strings
« on: January 13, 2017, 06:52:12 AM »
The /ws switch has been added to convert "quoted strings" to Unicode. In addition, OPTION WSTRING:[ON|OFF] has been added. This allows a Unicode string to be declared directly in the data segment using DW:
Code:
option wstring:on

dw "Declaring a Unicode string",0
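
For anyone wondering what ends up in the data segment: as far as I understand it, with wstring on each character of a plain ASCII string becomes one 16-bit little-endian word, which is the same layout a UTF-16LE encoder produces. Below is a small Python model of that layout. It is an illustration only, not asmc code, and dw_wstring is a made-up helper name.

```python
def dw_wstring(text: str) -> bytes:
    """Model the bytes of:  dw "<text>",0  with option wstring:on (assumed layout)."""
    # One 16-bit little-endian word per character, then the explicit ",0" word.
    return text.encode("utf-16-le") + b"\x00\x00"

data = dw_wstring("Declaring a Unicode string")
print(len(data))   # 54 bytes: 26 characters plus the terminating word
print(data[:8])    # b'D\x00e\x00c\x00l\x00'
```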

Test case:
Code:
include conio.inc
include ctype.inc

.code

main proc

_cputws( "Type 'Y' when finished typing keys: " )

.repeat

toupper( _getwch()  )

.until al == 'Y'

_putwch( eax ) ; 'Y'
_putwch( 13 ) ; Carriage return
_putwch( 10 ) ; Line feed

xor eax,eax
ret

main endp

END main

Makefile:
Code:
_getwch.exe:
asmc -ws -pe -D__PE__ -D_WIN64 $*.asm
$@
pause

This also works with the @CStr() macro.

ragdog

  • Member
  • ****
  • Posts: 531
Re: Unicode strings
« Reply #1 on: January 13, 2017, 06:58:08 AM »
Nice idea :t

Vortex

  • Member
  • *****
  • Posts: 1734
Re: Unicode strings
« Reply #2 on: January 13, 2017, 07:16:06 AM »
Hi nidud,

Nice work with asmc. Poasm has the same approach: dw is used for words or Unicode strings. Why not use du, which looks more specific than dw?

nidud

Re: Unicode strings
« Reply #3 on: January 13, 2017, 08:23:21 AM »
> Poasm has the same approach: dw is used for words or Unicode strings. Why not use du, which looks more specific than dw?

The plan was to rewrite the string parsing for this implementation, only targeting the existing functionality of the string usage. However, I had to create a whole duplicated set of functions for this to work correctly, so it became rather extensive.

Extending the DW functionality, on the other hand, was rather simple, so I went with the lazy approach: you then only have to flip DB to DW in the end. Given that it's possible to define little-endian values using DW, this does create some compatibility issues, but at the same time it also adds some extended functionality to the data section.

Code:
dw "ab"
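
To spell out the compatibility issue: in MASM-style assemblers a two-character literal in a DW is, as far as I know, treated as a 16-bit number with the first character in the high byte, so dw "ab" stores the word 6162h. With wstring on, the same line emits two words instead. Here is a Python model of the two layouts. It is an illustration only, and both function names are hypothetical.

```python
import struct

def dw_classic(two_chars: str) -> bytes:
    """dw "ab" as a packed 16-bit constant (assumed MASM behaviour)."""
    value = (ord(two_chars[0]) << 8) | ord(two_chars[1])   # 'ab' -> 0x6162
    return struct.pack("<H", value)                        # stored little-endian

def dw_wide(text: str) -> bytes:
    """dw "ab" with option wstring:on: one 16-bit word per character."""
    return text.encode("utf-16-le")

print(dw_classic("ab"))  # b'ba'
print(dw_wide("ab"))     # b'a\x00b\x00'
```

So the same source line changes both its size and its byte order depending on the option.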

In addition to this the string-hash used for detecting duplicated strings still uses the original ASCII string, so switching the option on and off may have unintended consequences:
Code:
mov eax,@CStr( "abcdef" )
option wstring:on
mov eax,@CStr( "def" )

The last one will now use the offset of the first ASCII string + 3.
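
A rough model of why this goes wrong, assuming the pool is keyed on the ASCII text and a new string may reuse the tail of an earlier one. The StringPool class is hypothetical and only mimics the behaviour described above. It is not how asmc is implemented.

```python
class StringPool:
    """Toy literal pool that keys on the ASCII text and ignores character width."""

    def __init__(self):
        self.entries = []          # list of (ascii_text, wide_flag)

    def add(self, text, wide):
        # Look for an earlier literal that ends with this text.
        for label, (stored, _stored_wide) in enumerate(self.entries):
            if stored.endswith(text):
                # Width is not part of the key, so the flag is never compared.
                return label, len(stored) - len(text)
        self.entries.append((text, wide))
        return len(self.entries) - 1, 0

pool = StringPool()
print(pool.add("abcdef", wide=False))  # (0, 0)
print(pool.add("def", wide=True))      # (0, 3): points into the ASCII string
```

Making the width part of the lookup key would avoid the false match in this toy model.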

Well, for now the only usage is asmc -ws, so it won't create any backward compatibility issues, given that it will mainly be used when writing in this syntax. The implementation in the data segment may change later if this becomes an issue, and the DU option may then be a solution.

hutch--

  • Administrator
  • Member
  • ******
  • Posts: 4935
  • Mnemonic Driven API Grinder
    • The MASM32 SDK
Re: Unicode strings
« Reply #4 on: January 13, 2017, 09:25:15 AM »
 :biggrin:

> The switch /ws is added to convert "quoted strings" to Unicode. In addition to this the OPTION WSTRING:[ON|OFF] is added. This will allow declaration of a string array directly in the data segment using DW:

Compliments, this has needed to be done for a long time.  :t

nidud

Re: Unicode strings
« Reply #5 on: January 13, 2017, 10:14:26 AM »
Looking at the test case: it ends with RET, which assumes a start-up module, but then uses END main. It should probably end with exit(0).

Still works, though  :biggrin:

nidud

Re: Unicode strings
« Reply #6 on: March 26, 2017, 09:02:18 AM »
Some changes added to the string functions:
- added null ( "" ) string
- added ("Multi" "Text" "Lines")
- added (L"Unicode") string macro

test case:
Code:
.386
.model flat, stdcall
.code

S equ <" ">
T equ <"\t">
N equ <"\n">
A equ <"Auto">
U equ <"Unicode">
W equ <L"Unicode">

foo proc a1
foo endp
bar proc a1, a2
bar endp

foo( A ) ; DS0000
foo( "" ) ; DS0000[4]
foo( U ) ; DS0001
foo( W ) ; DS0002

bar( A, A ) ; 0,0
bar( A, U ) ; 0,1
bar( A, W ) ; 0,2

bar( U, A ) ; 1,0
bar( U, U ) ; 1,1
bar( U, W ) ; 1,2

bar( W, A ) ; 2,0
bar( W, U ) ; 2,1
bar( W, W ) ; 2,2

foo(
U T A N ; DS0003
A T U N
)
foo(
W T A N ; DS0004
A T U N
)
bar(
U T A N ; 3
A T U N,
W T A N ; 4
A T U N
)

mov eax,@CStr( "" ) ; 3 -- DS0003[26]
mov eax,@CStr( "Auto" ) ; 0
mov eax,@CStr( "Unicode" ) ; 1
mov eax,@CStr( L"Unicode" ) ; 2
mov eax,@CStr(
    "Unicode" "\t" "Auto" "\n" ; 3
    "Auto" "\t" "Unicode" "\n"
)

foo(
"xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\n"
"xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\n"
"xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\n"
"xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\n"
"xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\n"
"xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\n"
"xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\n"
"xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\n"
"xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\n"
"xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\n"
"xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\n"
"xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\n"
"xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\n"
"xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\n"
"xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\n"
"xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\n"
"xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\n"
"xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\n"
"xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\n"
"xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\n"
"xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\n"
"xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\n"
"xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\n"
"xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\n"
"xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\n"
"xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\n"
"xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\n"
"xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\n"
"xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\n"
)

END

A special flag is set if <L"> is found inside a proc( L"" ) or inside @CStr( L"" ). This will enable expansion of DW "string" without the -ws switch or wstring option set.

A macro named L will not be affected, but a text equate will have the L stripped.

The total size of the separated strings is currently limited to the maximum line size (2K).
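
The offsets in the test-case comments can be checked by hand. A null string seems to be placed on the terminator of an existing string, so its offset is that string's length. Here is a quick Python check of the two offsets shown (DS0000[4] and DS0003[26]), assuming \t and \n each expand to a single character.

```python
# Adjacent literals are joined into one string, as in ("Multi" "Text" "Lines").
auto = "Auto"                                                   # DS0000
multi = "Unicode" "\t" "Auto" "\n" "Auto" "\t" "Unicode" "\n"   # DS0003

# The empty string "" can share the terminating zero of an existing string,
# so its offset equals the length of that string.
print(len(auto))    # 4  -> DS0000[4]
print(len(multi))   # 26 -> DS0003[26]
```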

nidud

Re: Unicode strings
« Reply #7 on: May 07, 2017, 09:03:17 AM »
Added OPTION CODEPAGE:<value> for Unicode creation. This is basically the first argument to MultiByteToWideChar(). The default value is 0.

Code:
;
;  Code Page Default Values.
;
CP_ACP   equ 0   ; default to ANSI code page
CP_OEMCP   equ 1   ; default to OEM  code page
CP_MACCP   equ 2   ; default to MAC  code page
CP_THREAD_ACP   equ 3   ; current thread's ANSI code page
CP_SYMBOL   equ 42   ; SYMBOL translations

CP_UTF7   equ 65000   ; UTF-7 translation
CP_UTF8   equ 65001   ; UTF-8 translation

The switch /ws is also extended to /ws[[=]<value>]
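
What the code page changes is how bytes above 7Fh in the source file are mapped to Unicode. Below is a Python sketch that uses its built-in codecs as a stand-in for MultiByteToWideChar(). It is an approximation: the helper to_wide is hypothetical, and Python's tables may differ from Windows in corner cases.

```python
def to_wide(raw: bytes, codepage: str) -> bytes:
    """Decode source bytes with a code page, then emit 16-bit little-endian words."""
    return raw.decode(codepage).encode("utf-16-le")

# The same source byte gives different characters under different code pages.
print(ascii(b"\x9b".decode("cp865")))   # '\xf8'  (o with stroke, Nordic OEM)
print(ascii(b"\x9b".decode("cp437")))   # '\xa2'  (cent sign, US OEM)
print(to_wide(b"\x9b", "cp865"))        # b'\xf8\x00'
```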

nidud

Re: Unicode strings
« Reply #8 on: May 11, 2017, 11:45:06 PM »
I made a few changes to the @CStr() macro to enable usage in the data segment. Normally the macro inserts .data at the beginning and .code at the end. This will now be skipped if already in the data segment.

In addition to this, the macro normally returns the offset of the created string, but this will now be skipped if the macro is the first token of the line.

Example

Code:
; Conventional declaration
usage db 'Usage:',9,'NOLPT 1',9,9,'disable LPT1',10
db 9,'...',10
db 9,'NOLPT 4',9,9,'disable LPT4',10
db 9,'NOLPT 1U',9,'uninstall from LPT1',10
db 9,'etc.',10
db 0

; The same text declared with @CStr()
usage   label byte
@CStr( "Usage:\tNOLPT 1\t\tdisable LPT1\n"
"\t...\n"
"\tNOLPT 4\t\tdisable LPT4\n"
"\tNOLPT 1U\tuninstall from LPT1\n"
"\tetc.\n" )

This enables, in addition to the use of C escape characters, flipping from ASCII to Unicode in the data segment using the option or the switch.

Code:
string dd @CStr( "string" )
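
To show that the two usage declarations above describe the same bytes, here is the first line of each modelled in Python. This is an illustration only: it assumes \t expands to 9 and \n to 10, as in C.

```python
# db version: quoted text mixed with numeric bytes (9 = tab, 10 = line feed).
db_line = (b"Usage:" + bytes([9]) + b"NOLPT 1" + bytes([9, 9])
           + b"disable LPT1" + bytes([10]))

# @CStr() version: the same bytes written with C escapes.
cstr_line = "Usage:\tNOLPT 1\t\tdisable LPT1\n".encode("ascii")
print(db_line == cstr_line)              # True

# With the Unicode option on, every character becomes a 16-bit word.
wide_line = cstr_line.decode("ascii").encode("utf-16-le")
print(len(cstr_line), len(wide_line))    # 29 58
```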

nidud

Re: Unicode strings
« Reply #9 on: May 15, 2017, 04:34:36 AM »
Added auto-detection of the UTF-8 header (BOM).

Code:
; Build: asmc /pe test.asm

.486
.model flat, c
option dllimport:<msvcrt.dll>

printf proto :ptr, :vararg
exit proto :dword

.code
start:
printf("BOM detected\n")
exit(0)

end start
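
For reference, the UTF-8 BOM is the three bytes EF BB BF at the very start of the file. Below is a minimal Python sketch of such a check. The function name is hypothetical, and what the assembler does after detection (presumably treating the source as UTF-8) is my assumption.

```python
import codecs

def strip_utf8_bom(source: bytes):
    """Return (has_bom, remaining_bytes) for raw source-file contents."""
    if source.startswith(codecs.BOM_UTF8):          # EF BB BF
        return True, source[len(codecs.BOM_UTF8):]
    return False, source

print(strip_utf8_bom(b"\xef\xbb\xbf.486\n"))  # (True, b'.486\n')
print(strip_utf8_bom(b".486\n"))              # (False, b'.486\n')
```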

nidud

Re: Unicode strings
« Reply #10 on: June 24, 2017, 07:11:26 AM »
- fixed bug in EIP-related offsets in 64-bit

This applies to using strings in combination with the -pe switch in 64-bit. The logic of string creation is to reuse strings already created. perror( "Nothing to do.." ) followed by strcpy( &path, "." ) will reuse the end of the first string as argument: LEA RDX,DS0000[14]. This failed due to an error calculating the address + 14.
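
The offset 14 can be checked quickly: "." followed by its terminating zero already exists at the end of the first string. Here it is in Python, as an illustration of the lookup only, not of asmc's actual code.

```python
pool = b"Nothing to do..\x00"     # DS0000, including the terminating zero

# "." plus its terminator already exists at the end of the first string.
offset = pool.find(b".\x00")
print(offset)                      # 14 -> DS0000[14]
```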

Test case:
Code:
.x64
.model  flat, fastcall
option  dllimport:<msvcrt>
printf  proto :ptr, :vararg
exit    proto :dword

.data
string  db 16 dup(0)
format  db "%s",10,0
pointer db "%p",10,"%p",10,0

.code

main proc
    mov string[0],'a'
    mov string[1],'b'
    mov string[2],'c'
    mov string[3],'d'
    mov string[4],'e'
    mov string[5],'f'
    invoke printf,addr format, addr string
    lea rdx,string
    lea r8,string[1]
    invoke printf,addr pointer, rdx, r8
    invoke exit,0
main endp

    end main

Output

Old version:
Code:
a
0000000000403000
0000000000403002
...
000000000040100 | C6 05 F9 1F 00 00 61    | mov byte ptr ds:[403000],61
000000000040100 | C6 05 F4 1F 00 00 62    | mov byte ptr ds:[403002],62
000000000040100 | C6 05 EF 1F 00 00 63    | mov byte ptr ds:[403004],63
000000000040101 | C6 05 EA 1F 00 00 64    | mov byte ptr ds:[403006],64
000000000040101 | C6 05 E5 1F 00 00 65    | mov byte ptr ds:[403008],65
000000000040102 | C6 05 E0 1F 00 00 66    | mov byte ptr ds:[40300A],66

New version:
Code:
abcdef
0000000000403000
0000000000403001
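
The old listing can be decoded by hand. In 64-bit code the displacement is relative to the address of the next instruction, so target = next address + disp32. The instruction addresses are cut off in the listing above, so the sketch below assumes the first MOV sits at 401000h and that each MOV is 7 bytes (C6 05, disp32, imm8). The helper name is hypothetical.

```python
import struct

def rip_target(instr_addr: int, encoding: bytes) -> int:
    """Target of 'mov byte ptr [rip+disp32], imm8' (C6 05 disp32 imm8)."""
    disp = struct.unpack_from("<i", encoding, 2)[0]   # signed 32-bit displacement
    return instr_addr + len(encoding) + disp          # relative to next instruction

first = bytes.fromhex("C6 05 F9 1F 00 00 61")
second = bytes.fromhex("C6 05 F4 1F 00 00 62")

print(hex(rip_target(0x401000, first)))    # 0x403000 -> string[0]
print(hex(rip_target(0x401007, second)))   # 0x403002 -> should have been 0x403001
```

Under those assumptions, the old version's second displacement lands one byte too far, which matches the doubled index visible in the listing.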

nidud

Re: Unicode strings
« Reply #11 on: August 17, 2017, 04:40:47 AM »
More Unicode/ASCII testing.

I've added some macros for declaring a resource directly in the source file for the -pe switch in the winres.inc file. The base for the test is the RichEdit sample tut35.

The resource is declared at the end of the source file and looks similar to the .RC file. However, the top-level definition has to be done manually, so the declaration looks like this:
Code:
WinStart proc
    mov ebx,GetModuleHandle(0)
    ExitProcess(WinMain(ebx, 0, GetCommandLine(), SW_SHOWDEFAULT))
WinStart endp

RCBEGIN

    RCTYPES 3
    RCENTRY RT_MENU
    RCENTRY RT_DIALOG
    RCENTRY RT_ACCELERATOR

    RCENUMN 1
    RCENUMX IDR_MAINMENU
    RCENUMN 4
    RCENUMX IDD_OPTIONDLG
    RCENUMX IDD_FINDDLG
    RCENUMX IDD_GOTODLG
    RCENUMX IDD_REPLACEDLG
    RCENUMN 1
    RCENUMX 105
    REPEAT 6
    RCLANGX LANGUAGEID
    ENDM

    MENUBEGIN
      MENUNAME IDS_FILE
        MENUITEM 0, IDM_OPEN,   IDS_OPEN
        MENUITEM 0, IDM_CLOSE,  IDS_CLOSE
        MENUITEM 0, IDM_SAVE,   IDS_SAVE
        MENUITEM 0, IDM_SAVEAS, IDS_SAVEAS
        SEPARATOR
        MENUITEM MF_END, IDM_EXIT, IDS_EXIT
      MENUNAME IDS_EDIT
        MENUITEM 0, IDM_UNDO,   IDS_UNDO
        MENUITEM 0, IDM_REDO,   IDS_REDO
        MENUITEM 0, IDM_COPY,   IDS_COPY
        MENUITEM 0, IDM_CUT,    IDS_CUT
        MENUITEM 0, IDM_PASTE,  IDS_PASTE
        SEPARATOR
        MENUITEM 0, IDM_DELETE, IDS_DELETE
        SEPARATOR
        MENUITEM MF_END, IDM_SELECTALL, IDS_SELECTALL
      MENUNAME IDS_SEARCH
        MENUITEM 0, IDM_FIND,     IDS_FIND
        MENUITEM 0, IDM_FINDNEXT, IDS_FINDNEXT
        MENUITEM 0, IDM_FINDPREV, IDS_FINDPREV
        MENUITEM 0, IDM_REPLACE,  IDS_REPLACE
        SEPARATOR
        MENUITEM MF_END, IDM_GOTOLINE, IDS_GOTO
      MENUITEM MF_END, IDM_OPTION, IDS_OPTIONS
    MENUEND

    DLGFLAGS equ DS_MODALFRAME or DS_SETFONT or WS_POPUP or WS_VISIBLE or WS_CAPTION or WS_SYSMENU

    DLGBEGIN DLGFLAGS,7,0,0,196,60
     CAPTION IDS_OPTIONS
     FONT 8, "MS Sans Serif"
      DEFPUSHBUTTON   IDS_OK,IDOK,137,7,50,14;39,14
      PUSHBUTTON      IDS_CANCEL,IDCANCEL,137,25,50,14
      GROUPBOX        0,IDC_STATIC,5,0,124,49
      LTEXT           IDS_BACKGR,IDC_STATIC,20,14,60,8
      LTEXT           0,IDC_BACKCOLORBOX,85,11,28,14,SS_NOTIFY or WS_BORDER
      LTEXT           IDS_FOREGR,IDC_STATIC,20,33,35,8
      LTEXT           0,IDC_TEXTCOLORBOX,85,29,28,14,SS_NOTIFY or WS_BORDER
    DLGEND

    DLGBEGIN DLGFLAGS,9,0,0,186,54
     CAPTION IDS_FIND2
     FONT 8, "MS Sans Serif"
      EDITTEXT        IDC_FINDEDIT,42,3,94,12,ES_AUTOHSCROLL
      CONTROL         IDS_MATCHCASE,IDC_MATCHCASE,RC_BUTTON,BS_AUTOCHECKBOX or WS_TABSTOP,6,24,54,10
      CONTROL         IDS_WHOLEWORD,IDC_WHOLEWORD,RC_BUTTON,BS_AUTOCHECKBOX or WS_TABSTOP,6,37,56,10
      CONTROL         IDS_DOWN,IDC_DOWN,RC_BUTTON,BS_AUTORADIOBUTTON or WS_TABSTOP,83,27,35,10
      CONTROL         IDS_UP,IDC_UP,RC_BUTTON,BS_AUTORADIOBUTTON or WS_TABSTOP,83,38,25,10
      DEFPUSHBUTTON   IDS_OK,IDOK,141,3,39,12
      PUSHBUTTON      IDS_CANCEL,IDCANCEL,141,18,39,12
      LTEXT           IDS_FINDWHAT,IDC_STATIC,5,4,34,8
      GROUPBOX        IDS_DIRECTION,IDC_STATIC,70,18,64,32
    DLGEND

    DLGBEGIN DLGFLAGS,4,0,0,106,30,WS_EX_TOOLWINDOW
     CAPTION IDS_GOTO2
     FONT 8, "MS Sans Serif", 0, 0, 0x1
      EDITTEXT        IDC_LINENO,29,4,35,11,ES_AUTOHSCROLL or ES_NUMBER,WS_EX_CLIENTEDGE
      DEFPUSHBUTTON   IDS_OK,IDOK,70,4,31,11
      PUSHBUTTON      IDS_CANCEL,IDCANCEL,70,17,31,11
      LTEXT           IDS_LINE,IDC_STATIC,8,5,18,8
    DLGEND

    DLGBEGIN DLGFLAGS,6,0,0,186,33
     CAPTION IDS_REPLACE2
     FONT 8, "MS Sans Serif"
      EDITTEXT        IDC_FINDEDIT,51,3,84,12,ES_AUTOHSCROLL
      EDITTEXT        IDC_REPLACEEDIT,51,17,84,11,ES_AUTOHSCROLL
      DEFPUSHBUTTON   IDS_OK,IDOK,142,3,39,11
      PUSHBUTTON      IDS_CANCEL,IDCANCEL,142,17,39,11
      LTEXT           IDS_FINDWHAT,IDC_STATIC,3,4,34,8
      LTEXT           IDS_REPLACEWITH,IDC_STATIC,3,18,42,8
    DLGEND

    DLGBEGIN 0x0046000B,0x00009C4E,11,71,0x9C51,0,11
      db 70,0, 78,156,0,0, 11,0, 71,0, 81,156, 0,0
      db 11,0,82,0,80,156,0,0,3,0,114,0,79,156,0,0
      db 139,0,114,0,82,156,0,0
    DLGEND

RCEND

    end WinStart

The IDS-strings are declared in separate files in this case. I added three languages, with one set of files in ASCII and one in UTF-8. The specific language file sets the Unicode/ASCII definition and code page.

Code:
;
; Build: asmc -pe test.asm
;
__PE__   equ 1
_CType   equ <stdcall>

include lang/en.txt
include windows.inc
include richedit.inc
include winres.inc

Code:
LANGUAGEID      equ LANGID_NN

if 1 ; Set if not local
 _UNICODE       equ 1
 option         wstring:on
 option         codepage:865
endif

IDS_FILE        equ <"&Fil">

The resource macro RCBEGIN switches the Unicode option on, given that menus and dialogs are always Unicode, but the rest of the code may use A or W depending on location.

nidud

Re: Unicode strings
« Reply #12 on: August 20, 2017, 07:09:40 AM »
Added a second test with syntax highlighting. This uses the same method as Doszip Edit with support for numbers and quotes. I also added a font selector to the menu.