Topic: Unicode strings

nidud

  • Member
  • *****
  • Posts: 1294
    • https://github.com/nidud/asmc
Unicode strings
« on: January 13, 2017, 06:52:12 AM »
The switch /ws has been added to convert "quoted strings" to Unicode. In addition, OPTION WSTRING:[ON|OFF] has been added. This allows a Unicode string to be declared directly in the data segment using DW:
Code: [Select]
option wstring:on

dw "Declaring a Unicode string",0
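
As a rough illustration of the bytes such a declaration would be expected to produce, here is a small Python model (not Asmc code). It assumes each character becomes one 16-bit little-endian word and that the trailing ,0 becomes a zero word; the helper name dw_wide is made up for this sketch.

```python
def dw_wide(text: str) -> bytes:
    """Model of: dw "text",0 with wstring on (assumed layout)."""
    # One 16-bit little-endian word per character, then a zero word.
    return text.encode("utf-16-le") + b"\x00\x00"


data = dw_wide("ab")
print(data.hex(" "))  # 61 00 62 00 00 00
```

Under that assumption the declaration above takes two bytes per character plus two for the terminator.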

Test case:
Code: [Select]
include conio.inc
include ctype.inc

.code

main proc

    _cputws( "Type 'Y' when finished typing keys: " )

    .repeat

        toupper( _getwch() )

    .until al == 'Y'

    _putwch( eax )  ; 'Y'
    _putwch( 13 )   ; Carriage return
    _putwch( 10 )   ; Line feed

    xor eax,eax
    ret

main endp

    END main

Makefile:
Code: [Select]
_getwch.exe:
	asmc -ws -pe -D__PE__ -D_WIN64 $*.asm
	$@
	pause

This also works with the @CStr() macro.

ragdog

Re: Unicode strings
« Reply #1 on: January 13, 2017, 06:58:08 AM »
Nice idea :t

Vortex

Re: Unicode strings
« Reply #2 on: January 13, 2017, 07:16:06 AM »
Hi nidud,

Nice work with asmc. Poasm has the same approach: dw is used for words or Unicode strings. Why not use du, which looks more specific than dw?

nidud

Re: Unicode strings
« Reply #3 on: January 13, 2017, 08:23:21 AM »
> Poasm has the same approach: dw is used for words or Unicode strings. Why not use du, which looks more specific than dw?

The plan was to rewrite the string parsing for this implementation, only targeting the existing functionality of the string usage. However, I had to create a whole duplicated set of functions for this to work correctly, so it became rather extensive.

Extending the DW functionality, on the other hand, was rather simple, so I went with the lazy approach: you then only have to flip DB to DW in the end. Given that it is already possible to define little-endian values using DW, this does create some compatibility issues, but at the same time it also adds some extended functionality to the data section.

Code: [Select]
dw "ab"

In addition to this the string-hash used for detecting duplicated strings still uses the original ASCII string, so switching the option on and off may have unintended consequences:
Code: [Select]
mov eax,@CStr( "abcdef" )
option wstring:on
mov eax,@CStr( "def" )

The last one will now use the offset of the first ASCII string + 3.
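
The mismatch can be pictured with a few lines of Python. This is only a model of the situation described above, assuming the duplicate check matches "def" against the tail of the ASCII text "abcdef"; the variable names are illustrative.

```python
first = "abcdef"   # emitted while wstring was off
second = "def"     # requested after option wstring:on

# Bytes actually stored in the data segment for the first literal.
stored = first.encode("ascii") + b"\x00"

# Tail match on the ASCII text gives the reused offset.
offset = len(first) - len(second)

# Bytes a Unicode "def" would need (UTF-16LE plus a zero word).
wanted = second.encode("utf-16-le") + b"\x00\x00"

print(offset)           # 3
print(stored[offset:])  # b'def\x00'
print(wanted)           # b'd\x00e\x00f\x00\x00\x00'
```

The reused address points at single-byte characters, while the code after the option change expects 16-bit ones.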

Well, for now the only usage is Asmc -ws, so it won't create any backward compatibility issues, given that it will mainly be used when writing in this syntax. The implementation in the data segment may change later if this becomes an issue, and a DU option may then be a solution.

hutch--

Re: Unicode strings
« Reply #4 on: January 13, 2017, 09:25:15 AM »
 :biggrin:

> The switch /ws has been added to convert "quoted strings" to Unicode. In addition, OPTION WSTRING:[ON|OFF] has been added. This allows a Unicode string to be declared directly in the data segment using DW:

Compliments, this has needed to be done for a long time.  :t

nidud

Re: Unicode strings
« Reply #5 on: January 13, 2017, 10:14:26 AM »
Looking at the test case it assumes a start-up module, ending with RET, and then END main. Should probably end with exit(0).

Still works though  :biggrin:

nidud

Re: Unicode strings
« Reply #6 on: March 26, 2017, 09:02:18 AM »
Some changes added to the string functions:
- added null ( "" ) string
- added ("Multi" "Text" "Lines")
- added (L"Unicode") string macro

test case:
Code: [Select]
.386
.model flat, stdcall
.code

S equ <" ">
T equ <"\t">
N equ <"\n">
A equ <"Auto">
U equ <"Unicode">
W equ <L"Unicode">

foo proc a1
foo endp
bar proc a1, a2
bar endp

foo( A ) ; DS0000
foo( "" ) ; DS0000[4]
foo( U ) ; DS0001
foo( W ) ; DS0002

bar( A, A ) ; 0,0
bar( A, U ) ; 0,1
bar( A, W ) ; 0,2

bar( U, A ) ; 1,0
bar( U, U ) ; 1,1
bar( U, W ) ; 1,2

bar( W, A ) ; 2,0
bar( W, U ) ; 2,1
bar( W, W ) ; 2,2

foo(
U T A N ; DS0003
A T U N
)
foo(
W T A N ; DS0004
A T U N
)
bar(
U T A N ; 3
A T U N,
W T A N ; 4
A T U N
)

mov eax,@CStr( "" ) ; 3 -- DS0003[26]
mov eax,@CStr( "Auto" ) ; 0
mov eax,@CStr( "Unicode" ) ; 1
mov eax,@CStr( L"Unicode" ) ; 2
mov eax,@CStr(
    "Unicode" "\t" "Auto" "\n" ; 3
    "Auto" "\t" "Unicode" "\n"
)

foo(
"xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\n"
"xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\n"
"xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\n"
"xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\n"
"xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\n"
"xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\n"
"xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\n"
"xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\n"
"xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\n"
"xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\n"
"xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\n"
"xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\n"
"xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\n"
"xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\n"
"xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\n"
"xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\n"
"xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\n"
"xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\n"
"xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\n"
"xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\n"
"xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\n"
"xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\n"
"xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\n"
"xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\n"
"xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\n"
"xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\n"
"xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\n"
"xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\n"
"xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\n"
)

END

A special flag is set if <L"> is found inside a proc( L"" ) or inside @CStr( L"" ). This will enable expansion of DW "string" without the -ws switch or wstring option set.

A macro named L will not be affected, but a text equate will strip the L.

The total size of the separated strings is currently limited to the maximum line size (2K).

nidud

Re: Unicode strings
« Reply #7 on: May 07, 2017, 09:03:17 AM »
Added OPTION CODEPAGE:<value> for Unicode creation. This is basically the first argument to MultiByteToWideChar(). The default value is 0.

Code: [Select]
;
;  Code Page Default Values.
;
CP_ACP          equ 0       ; default to ANSI code page
CP_OEMCP        equ 1       ; default to OEM  code page
CP_MACCP        equ 2       ; default to MAC  code page
CP_THREAD_ACP   equ 3       ; current thread's ANSI code page
CP_SYMBOL       equ 42      ; SYMBOL translations

CP_UTF7         equ 65000   ; UTF-7 translation
CP_UTF8         equ 65001   ; UTF-8 translation

The switch /ws is also extended to /ws[[=]<value>]
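
To show why the code page matters, here is a small Python sketch of the same idea: the source bytes are first interpreted in some code page and only then widened to 16-bit units. The Python codec names stand in for Windows code pages ("cp1252" as one possible ANSI code page, "utf-8" for CP_UTF8); this mapping and the helper to_wide are assumptions for illustration, not how Asmc is implemented.

```python
def to_wide(raw: bytes, codec: str) -> bytes:
    """Interpret raw source bytes in the given code page, then emit UTF-16LE."""
    return raw.decode(codec).encode("utf-16-le")


# Two bytes as they might appear inside a quoted string in a UTF-8 source file.
raw = b"\xc3\xa9"

print(to_wide(raw, "cp1252").hex(" "))  # c3 00 a9 00
print(to_wide(raw, "utf-8").hex(" "))   # e9 00
```

The same two source bytes become two wide characters under one code page and a single wide character under the other.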

nidud

Re: Unicode strings
« Reply #8 on: May 11, 2017, 11:45:06 PM »
I made a few changes to the @CStr() macro to enable usage in the data segment. Normally the macro inserts .data at the beginning and .code at the end. This will now be skipped if already in the data segment.

In addition to this, the macro normally returns the offset of the created string, but now this will be skipped if the macro is the first token of the line.

Example

Code: [Select]
usage   db 'Usage:',9,'NOLPT 1',9,9,'disable LPT1',10
        db 9,'...',10
        db 9,'NOLPT 4',9,9,'disable LPT4',10
        db 9,'NOLPT 1U',9,'uninstall from LPT1',10
        db 9,'etc.',10
        db 0

usage   label byte
        @CStr( "Usage:\tNOLPT 1\t\tdisable LPT1\n"
               "\t...\n"
               "\tNOLPT 4\t\tdisable LPT4\n"
               "\tNOLPT 1U\tuninstall from LPT1\n"
               "\tetc.\n" )

This makes it possible, in addition to using C escape characters, to flip from ASCII to Unicode in the data segment using the option or the switch.

Code: [Select]
string dd @CStr( "string" )

nidud

Re: Unicode strings
« Reply #9 on: May 15, 2017, 04:34:36 AM »
Added auto-detection of the UTF-8 header (BOM).

Code: [Select]
; Build: asmc /pe test.asm

.486
.model flat, c
option dllimport:<msvcrt.dll>

printf proto :ptr, :vararg
exit proto :dword

.code
start:
printf("BOM detected\n")
exit(0)

end start
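
For readers unfamiliar with the BOM: a UTF-8 source file may start with the three bytes EF BB BF. A minimal Python sketch of such a check follows. The function name strip_utf8_bom is hypothetical, and what the assembler does after detection (presumably treating the strings as UTF-8) is an assumption here.

```python
import codecs


def strip_utf8_bom(raw: bytes):
    """Return (bom_found, remaining_bytes) for the start of a source file."""
    if raw.startswith(codecs.BOM_UTF8):  # b'\xef\xbb\xbf'
        return True, raw[len(codecs.BOM_UTF8):]
    return False, raw


source = codecs.BOM_UTF8 + b'printf("BOM detected\\n")\n'
found, body = strip_utf8_bom(source)
print(found)     # True
print(body[:6])  # b'printf'
```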

nidud

Re: Unicode strings
« Reply #10 on: June 24, 2017, 07:11:26 AM »
- fixed bug in EIP-related offsets in 64-bit

This applies to using strings in combination with the -pe switch in 64-bit. The logic of string creation is to reuse strings already created. perror( "Nothing to do.." ) followed by strcpy( &path, "." ) will reuse the end of the first string as argument: LEA RDX,DS0000[14]. This failed due to an error calculating the address + 14.
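
The reuse logic can be modelled in a few lines of Python. This is a sketch of the idea only, assuming a new literal is matched against the tail of strings already emitted; the function find_tail and the dictionary layout are made up for illustration.

```python
def find_tail(pool, text):
    """Return (label, offset) if text is the tail of a pooled string, else None."""
    for label, existing in pool.items():
        if existing.endswith(text):
            # The shared zero terminator makes the tail a valid string on its own.
            return label, len(existing) - len(text)
    return None


pool = {"DS0000": "Nothing to do.."}
print(find_tail(pool, "."))      # ('DS0000', 14)
print(find_tail(pool, "other"))  # None
```

The offset 14 matches the DS0000[14] operand mentioned above.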

Test case:
Code: [Select]
.x64
.model  flat, fastcall
option  dllimport:<msvcrt>
printf  proto :ptr, :vararg
exit    proto :dword

.data
string  db 16 dup(0)
format  db "%s",10,0
pointer db "%p",10,"%p",10,0

.code

main proc
    mov string[0],'a'
    mov string[1],'b'
    mov string[2],'c'
    mov string[3],'d'
    mov string[4],'e'
    mov string[5],'f'
    invoke printf,addr format, addr string
    lea rdx,string
    lea r8,string[1]
    invoke printf,addr pointer, rdx, r8
    invoke exit,0
main endp

    end main

Output

Old version:
Code: [Select]
a
0000000000403000
0000000000403002
...
000000000040100 | C6 05 F9 1F 00 00 61    | mov byte ptr ds:[403000],61
000000000040100 | C6 05 F4 1F 00 00 62    | mov byte ptr ds:[403002],62
000000000040100 | C6 05 EF 1F 00 00 63    | mov byte ptr ds:[403004],63
000000000040101 | C6 05 EA 1F 00 00 64    | mov byte ptr ds:[403006],64
000000000040101 | C6 05 E5 1F 00 00 65    | mov byte ptr ds:[403008],65
000000000040102 | C6 05 E0 1F 00 00 66    | mov byte ptr ds:[40300A],66

New version:
Code: [Select]
abcdef
0000000000403000
0000000000403001
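
The displacement bytes in the old listing can be checked by hand. In 64-bit mode the operand is encoded relative to the address of the next instruction, so disp32 = target - (instruction address + instruction length). The Python sketch below assumes the code starts at 0x401000 with 7-byte instructions (the address column in the listing is cut off, so these values are inferred), and the "old" column simply models the doubled index visible in that listing rather than the actual cause of the bug.

```python
def rip_disp(instr_addr: int, instr_len: int, target: int) -> int:
    """32-bit RIP-relative displacement: target minus address of next instruction."""
    return (target - (instr_addr + instr_len)) & 0xFFFFFFFF


BASE = 0x401000    # assumed start of the code
STRING = 0x403000  # address of 'string' from the output above
LEN = 7            # C6 05 <disp32> <imm8>

for i in range(3):
    addr = BASE + i * LEN
    good = rip_disp(addr, LEN, STRING + i)     # string[i]
    old = rip_disp(addr, LEN, STRING + 2 * i)  # index doubled, as in the old listing
    print(f"string[{i}]  correct {good:08X}  old {old:08X}")
```

For string[1] this gives 00001FF4 for the old encoding, matching the bytes F4 1F 00 00 in the listing, against 00001FF3 for the correct address.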