Hi Guys
I tried out a Dword to binary string converter using SSE2. Could someone benchmark it for me, please? (Many thanks to Peter Cordes for the tip :thumbsup:)
; only used for SSSE3
[<16 shuf_broadcast_hi_lo:
B$ 1,1,1,1, 1,1,1,1 ; broadcast the second 8 bits to the first 8 bytes
B$ 0,0,0,0, 0,0,0,0] ; broadcast the first 8 bits to the second 8 bytes
; select the relevant bit within each byte, from high to low for printing
[<16 bitmask: B$ 128, 64, 32, 16, ; 1<<7, 1<<6, 1<<5, 1<<4
B$ 8, 4, 2, 1, ; 1<<3, 1<<2, 1<<1, 1<<0
B$ 128, 64, 32, 16, ; 1<<7, 1<<6, 1<<5, 1<<4
B$ 8, 4, 2, 1] ; 1<<3, 1<<2, 1<<1, 1<<0
[<16 ascii_ones: '1' #16] ; Number "1" (in Ascii) duplicated 16 times.
Proc numberToBin:
Arguments @Number, @Output
movd xmm0 D@Number ; 32-bit load even though we only care about the low 16 bits.
mov eax D@Output ; Output buffer pointer
; to print left-to-right, we need the high bit to go in the first (low) byte
punpcklbw xmm0 xmm0 ; llhh (from low to high byte elements)
pshuflw xmm0 xmm0 5 ; 5 = 00000101b: hhhhllll
punpckldq xmm0 xmm0 ; hhhhhhhhllllllll
; or with SSSE3:
; pshufb xmm0 X$[shuf_broadcast_hi_lo] ; SSSE3
pand xmm0 X$bitmask ; each input bit is now isolated within the corresponding output byte
; compare it against zero
pxor xmm1 xmm1
pcmpeqb xmm0 xmm1 ; -1 in elements that are 0, 0 in elements with any non-zero bit.
paddb xmm0 X$ascii_ones ; '1' + (-1 or 0) = '0' or '1'
mov B$eax+16 0 ; terminating zero
movups X$eax xmm0
EndP
Example of usage:
[testing: B$ 0 #256]
call numberToBin 123456, testing
I personally prefer a version that does not require aligned data, but I tested this one first to see if it was working :) . pand and paddb with a memory operand need 16-byte aligned data, so perhaps using movdqu to load the bitmask and ascii_ones tables into registers first would be better, since that avoids the alignment requirement.
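For anyone who wants to try the unaligned variant outside RosAsm, here is a rough sketch of the same steps in C with SSE2 intrinsics. It is not the RosAsm routine itself: the function name number_to_bin16 and the plain byte array are hypothetical, and the table is read with an unaligned load (movdqu) so it needs no alignment directive.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>
#include <string.h>

/* Hypothetical C counterpart of numberToBin: writes the low 16 bits of
   `number` as 16 ASCII digits plus a terminating zero (17 bytes in total). */
void number_to_bin16(uint32_t number, char *out)
{
    /* No alignment attribute on purpose: the table is read with an
       unaligned load below. */
    static const uint8_t bitmask[16] = { 128, 64, 32, 16, 8, 4, 2, 1,
                                         128, 64, 32, 16, 8, 4, 2, 1 };
    assert(out != NULL);

    __m128i v = _mm_cvtsi32_si128((int)number);  /* movd */
    v = _mm_unpacklo_epi8(v, v);                 /* punpcklbw: llhh */
    v = _mm_shufflelo_epi16(v, 0x05);            /* pshuflw 00000101b: hhhhllll */
    v = _mm_unpacklo_epi32(v, v);                /* punpckldq: 8 x h, then 8 x l */

    /* movdqu + pand: isolate one input bit per output byte */
    v = _mm_and_si128(v, _mm_loadu_si128((const __m128i *)bitmask));
    v = _mm_cmpeq_epi8(v, _mm_setzero_si128());  /* -1 where the bit is clear */
    v = _mm_add_epi8(v, _mm_set1_epi8('1'));     /* '1' + (-1 or 0) */

    _mm_storeu_si128((__m128i *)out, v);         /* unaligned 16-byte store */
    out[16] = '\0';                              /* terminating zero */
}
```

With a 17-byte buffer, number_to_bin16(123456, buf) leaves "1110001001000000" in buf, because only the low 16 bits (0E240h) are converted.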
References:
https://stackoverflow.com/questions/40811218/creating-an-x86-assembler-program-that-converts-an-integer-to-a-16-bit-binary-st
https://www.agner.org/optimize
Quote from: guga on May 03, 2020, 07:24:13 PM
Why unaligned? You can use alignas() in C/C++.
For character algorithms, I have seen a combination of a non-SSE2 routine for the first and last few unaligned bytes and an aligned SSE2 routine for the middle, and I think that is a good alternative.
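A minimal sketch of that head/middle/tail pattern in C with SSE2 intrinsics, using byte counting as a stand-in character algorithm. The function count_byte is hypothetical and only illustrates the structure: scalar code until the address is 16-byte aligned, aligned 16-byte loads for the middle, then scalar code for the leftover bytes.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical example: count how many bytes in buf[0..len) equal target. */
size_t count_byte(const char *buf, size_t len, char target)
{
    size_t count = 0;
    size_t i = 0;
    assert(buf != NULL || len == 0);

    /* Head: scalar loop until buf + i is 16-byte aligned. */
    while (i < len && ((uintptr_t)(buf + i) & 15) != 0) {
        count += (buf[i] == target);
        i++;
    }

    /* Middle: aligned loads (movdqa), 16 bytes per iteration. */
    const __m128i needle = _mm_set1_epi8(target);
    for (; i + 16 <= len; i += 16) {
        __m128i chunk = _mm_load_si128((const __m128i *)(buf + i));
        unsigned mask =
            (unsigned)_mm_movemask_epi8(_mm_cmpeq_epi8(chunk, needle));
        for (; mask != 0; mask &= mask - 1)  /* clear the lowest set bit */
            count++;
    }

    /* Tail: scalar loop for the last few bytes. */
    for (; i < len; i++)
        count += (buf[i] == target);

    return count;
}
```

For a 16-byte result like the converter in this thread there is no middle section to speak of, so the pattern pays off mainly on longer buffers.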
One issue: the function seems to work only for 16 bits (a Word) and not for a Dword.
How can it be extended to work with 32-bit numbers?
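One possible approach, sketched here in C with SSE2 intrinsics rather than RosAsm: run the same 16-bit expansion twice, once on the high word and once on the low word, and store the two 16-byte results back to back. The names bits16_to_ascii and dword_to_bin32 are hypothetical, and the output buffer is assumed to hold at least 33 bytes.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>
#include <string.h>

/* Expand the low 16 bits of x into 16 ASCII '0'/'1' bytes, high bit first. */
static __m128i bits16_to_ascii(uint32_t x)
{
    static const uint8_t bitmask[16] = { 128, 64, 32, 16, 8, 4, 2, 1,
                                         128, 64, 32, 16, 8, 4, 2, 1 };
    __m128i v = _mm_cvtsi32_si128((int)x);       /* movd */
    v = _mm_unpacklo_epi8(v, v);                 /* punpcklbw: llhh */
    v = _mm_shufflelo_epi16(v, 0x05);            /* pshuflw 00000101b: hhhhllll */
    v = _mm_unpacklo_epi32(v, v);                /* punpckldq: 8 x h, then 8 x l */
    v = _mm_and_si128(v, _mm_loadu_si128((const __m128i *)bitmask));
    v = _mm_cmpeq_epi8(v, _mm_setzero_si128());  /* -1 where the bit is clear */
    return _mm_add_epi8(v, _mm_set1_epi8('1'));  /* '1' + (-1 or 0) */
}

/* Hypothetical 32-bit wrapper: out must have room for 33 bytes. */
void dword_to_bin32(uint32_t number, char *out)
{
    assert(out != NULL);
    /* High word first so the string reads left to right. */
    _mm_storeu_si128((__m128i *)out, bits16_to_ascii(number >> 16));
    _mm_storeu_si128((__m128i *)(out + 16), bits16_to_ascii(number));
    out[32] = '\0';                              /* terminating zero */
}
```

In the RosAsm routine this would presumably mean repeating the punpcklbw / pshuflw / punpckldq sequence on the input shifted right by 16, writing that result to the first 16 bytes, and moving the terminating zero to offset 32.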