### Author Topic: Rotate Bits SSE  (Read 5378 times)

#### jj2007

• Member
• Posts: 11476
• Assembler is fun ;-)
##### Re: Rotate Bits SSE
« Reply #30 on: April 30, 2020, 02:37:21 PM »
Another approach, no idea how fast it is:

include \masm32\MasmBasic\MasmBasic.inc
.data
Number         OWORD 10000110111111101111111011111110100011101111111011111110111111101001111011111110111111101111111010111110111111101111111011111110y
.code
RotateLeft128 proc pNumber      ; single shift left
mov eax, pNumber
movups xmm0, OWORD ptr [eax]
psllq xmm0, 1                 ; shift left two qwords by one bit
movsx edx, byte ptr [eax+7]   ; get sign of low qword
test sbyte ptr [eax+15], 128  ; get sign of high qword
movups OWORD ptr [eax], xmm0
.if Sign?
or byte ptr [eax], 1    ; rotate sign of high qword in
.endif
test edx, edx
.if Sign?
or byte ptr [eax+8], 1  ; set bit 0 of low qword
.endif
ret
RotateLeft128 endp
Init
Cls
deb 4, "in", b:Number:4       ; show number as binary, 4 dwords
invoke RotateLeft128, offset Number
deb 4, "out", b:Number:4
EndOfCode

Output:
Code: [Select]
in      b:Number:4      10000110111111101111111011111110 10001110111111101111111011111110 10011110111111101111111011111110 10111110111111101111111011111110
out     b:Number:4      00001101111111011111110111111101 00011101111111011111110111111101 00111101111111011111110111111101 01111101111111011111110111111101

#### daydreamer

• Member
• Posts: 1635
• building nextdoor
##### Re: Rotate Bits SSE
« Reply #31 on: April 30, 2020, 07:01:30 PM »

The code you posted is useful if you need to rotate 256 bits, then using 2 blocks of data and more registers is valid. The example I have in mind is to rotate 65 bits and then to rotate 129 bits ,193 bits and 255, assuming 256 bits in total.
An obstacle will arise because the counter of the shld or shl instruction is up to N bits. In this case I assume that using instructions greater than SSE2 may be feasible.
From what I read the compatible instruction set running on any 64-bit O.S. (x86-64) is SSE2.
specific case of rotate bits ala byte can be done
look below on RGB shuffle alternative instead of bitshift channels
http://masm32.com/board/index.php?topic=7687.0
SIMD fan and macro fan
why assembly is fastest is because its switch has no (brakes) breaks
:P
only in 16bit assembly you can get away with "Only words" :P

#### jj2007

• Member
• Posts: 11476
• Assembler is fun ;-)
##### Re: Rotate Bits SSE
« Reply #32 on: April 30, 2020, 08:44:16 PM »
Compliments to Marinus
Code: [Select]
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz
in      b:MyOword:4d    10000110111111101111111011111110100011101111111011111110111111101001111011111110111111101111111010111110111111101111111011111110
out J   b:MyOword:4d    00001101111111011111110111111101000111011111110111111101111111010011110111111101111111011111110101111101111111011111110111111101
out M   b:MyOword:4d    00011011111110111111101111111010001110111111101111111011111110100111101111111011111110111111101011111011111110111111101111111010
259 ms for Marinus
555 ms for Jochen

#### mineiro

• Member
• Posts: 697
##### Re: Rotate Bits SSE
« Reply #33 on: April 30, 2020, 09:33:06 PM »
sir jj code results in my machine:
Processor 0
Clock   Core cyc   Instruct       Uops    BrTaken
336500     345020     140012     217061      30000
306974     332504     140001     213686      30000
305980     331467     140001     213544      30000
307002     332557     140001     213615      30000

sir daydreamer, I'm reading your code and will check what can be done. Thanks.
I'd rather be this ambulant metamorphosis than to have that old opinion about everything

#### jj2007

• Member
• Posts: 11476
• Assembler is fun ;-)
##### Re: Rotate Bits SSE
« Reply #34 on: April 30, 2020, 10:23:27 PM »
sir jj code results in my machine:
Processor 0
Clock   Core cyc   Instruct       Uops    BrTaken
336500     345020     140012     217061      30000
306974     332504     140001     213686      30000
305980     331467     140001     213544      30000
307002     332557     140001     213615      30000

Can you explain these numbers?

#### Siekmanski

• Member
• Posts: 2362
##### Re: Rotate Bits SSE
« Reply #35 on: April 30, 2020, 10:24:04 PM »

Code: [Select]
Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz
in      b:MyOword:4d    10000110111111101111111011111110100011101111111011111110111111101001111011111110111111101111111010111110111111101111111011111110
out J   b:MyOword:4d    00001101111111011111110111111101000111011111110111111101111111010011110111111101111111011111110101111101111111011111110111111101
out M   b:MyOword:4d    00011011111110111111101111111010001110111111101111111011111110100111101111111011111110111111101011111011111110111111101111111010
226 ms for Marinus
418 ms for Jochen
Creative coders use backward thinking techniques as a strategy.

#### mineiro

• Member
• Posts: 697
##### Re: Rotate Bits SSE
« Reply #36 on: May 01, 2020, 01:28:29 AM »
Can you explain these numbers?
Thats better explained in agner fog blog (testp.pdf). I'm using PMCTestB64.asm file and assembling with uasm in linux.
Basically clock, core cycles, instructions, micro operations and branchs taken.

Clock is the clock count measured with the RDTSC instruction.
Core cyc is the clock count measured with the "core clock cycles" counter. The CPU can change its core frequency dynamically in order to save power. ...

7.14 What do I need the performance monitor counters for?
These counters are useful for counting instructions, micro-operations, cache misses, branch mispredictions and other events that are useful for testing program performance.
I'd rather be this ambulant metamorphosis than to have that old opinion about everything

#### jj2007

• Member
• Posts: 11476
• Assembler is fun ;-)
##### Re: Rotate Bits SSE
« Reply #37 on: May 01, 2020, 01:38:01 AM »
Yes, sure, I know all that, but how does it relate to the algos tested?

#### mineiro

• Member
• Posts: 697
##### Re: Rotate Bits SSE
« Reply #38 on: May 01, 2020, 01:45:00 AM »
good question, I don't know.
305980 / 140874 = 2,172
555ms  /  259ms = 2,1429
418ms  /  226ms = 1,849557522
I'd rather be this ambulant metamorphosis than to have that old opinion about everything

#### Siekmanski

• Member
• Posts: 2362
##### Re: Rotate Bits SSE
« Reply #39 on: May 01, 2020, 04:10:22 AM »

Time is the only constant factor. ( At least at sea level )
Different CPU architectures have their own pros and cons.
Creative coders use backward thinking techniques as a strategy.

#### mineiro

• Member
• Posts: 697
##### Re: Rotate Bits SSE
« Reply #40 on: May 01, 2020, 08:53:39 AM »
good question, I don't know.
305980 / 140874 = 2,172
555ms  /  259ms = 2,1429
418ms  /  226ms = 1,849557522

Code: [Select]
mineiro@assembly:~/asm\$ cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 94
model name      : Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz
stepping        : 3
microcode       : 0xcc
cpu MHz         : 800.000
cache size      : 8192 KB
physical id     : 0
siblings        : 8
core id         : 0
cpu cores       : 4
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 22
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch xsaveopt invpcid_single ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx smap dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp md_clear flush_l1d
bogomips        : 6816.06
clflush size    : 64
cache_alignment : 64
address sizes   : 39 bits physical, 48 bits virtual
power management:

rept 10000, constant_tsc and cpu MHz : 800.000,
305980 / 800000000 = 0,000382475 seconds
140874 / 800000000 = 0,0001760925 seconds

382475 / 176092  == 2,17201803603
I'd rather be this ambulant metamorphosis than to have that old opinion about everything

#### guga

• Member
• Posts: 1357
• Assembly is a state of art.
##### Re: Rotate Bits SSE
« Reply #41 on: May 01, 2020, 04:57:17 PM »
A few rapid tricks concerning rotating bits.

For faster rotation of specific bits we can also use pshufd mnemonic (Tks Siekmanski for the tip)

To do that, i create a table of equates of all possible 256 values used in pshufd. For rotation i created these:
Code: [Select]
[SSE_ROTATE_RIGHT_96BITS   147]    ; Dwords in xmm are copied from 1234 to 2341 ordering. Therefore, it is rotating right 96 bits. (Which is the same as rotating left 32 bits)
[SSE_ROTATE_LEFT_32BITS 147]        ; the same things and result as SSE_ROTATE_RIGHT_96BITS

[SSE_ROTATE_LEFT_96BITS     57]      ; Dwords in xmm are copied from 1234 to 4123 ordering. Therefore, it is rotating left 96 bits. (Which is the same as rotating right 32 bits)
[SSE_ROTATE_RIGHT_32BITS  57]      ; the same things and result as SSE_ROTATE_LEFT_96BITS

for 64 bits, we have a simple Swaping of both Qwords inside xmm, therefore we are rotating (left and right) 64 bits. Rotating left and right 64 bits are the same thing

[SSE_SWAP_QWORDS 78]        ; Dwords in xmm are copied from 1234 to 3412 ordering. Therefore, it is rotating right and left left 64 bits.
[SSE_ROTATE_LEFT_64BITS 78]      ; the same things and result as SSE_SWAP_QWORDS, SSE_ROTATE_RIGHT_64BITS, SSE_ROTATE_64BITS
[SSE_ROTATE_RIGHT_64BITS 78]    ; same as above
[SSE_ROTATE_64BITS 78]                ; same as above

We can also use pshufd to invert the order of dwords simple as:
[SSE_INVERT_DWORDS 27]      ; Dwords in xmm are copied from 1234 to 4321 ordering.

Example of usage:

pshufd xmm1 xmm0 SSE_INVERT_DWORDS ; xmm1 will contains the inverted order of dwords in xmm0
pshufd xmm0 xmm0 SSE_INVERT_DWORDS ; inverted the order of dwords in xmm0

pshufd xmm1 xmm0 SSE_SWAP_QWORDS ; swap qwords and copy them from xmm0 to xmm1
pshufd xmm0 xmm0 SSE_SWAP_QWORDS ; swap qwords in xmm0

pshufd xmm1 xmm0 SSE_ROTATE_RIGHT_96BITS ; rotate 96 bits left in xmm0 and copy them onto xmm1

etc etc

I´m finishing building the labels for all these equates related to pshufd and will put them on another thread. Since pshufd is a bit confusing in having to remember all the bits location etc, making equates with the proper label names is better to understand what to do and also, build a set of macros if needed (for rosasm, masm, nasm, fasm etc)

Coding in Assembly requires a mix of:
80% of brain, passion, intuition, creativity
10% of programming skills
10% of alcoholic levels in your blood.

My Code Sites:
http://rosasm.freeforums.org
http://winasm.tripod.com

#### Siekmanski

• Member
• Posts: 2362
##### Re: Rotate Bits SSE
« Reply #42 on: May 01, 2020, 05:42:28 PM »
You can use this MACRO for the shuffle positions,

Shuffle MACRO V0,V1,V2,V3
EXITM %((V0 shl 6) or (V1 shl 4) or (V2 shl 2) or (V3))
ENDM

pshufd  xmm0,xmm0,Shuffle(1,0,3,2)
is equal to:
pshufd  xmm0,xmm0,01001110b
Creative coders use backward thinking techniques as a strategy.

#### guga

• Member
• Posts: 1357
• Assembly is a state of art.
##### Re: Rotate Bits SSE
« Reply #43 on: May 02, 2020, 10:38:26 AM »
You can use this MACRO for the shuffle positions,

Shuffle MACRO V0,V1,V2,V3
EXITM %((V0 shl 6) or (V1 shl 4) or (V2 shl 2) or (V3))
ENDM

pshufd  xmm0,xmm0,Shuffle(1,0,3,2)
is equal to:
pshufd  xmm0,xmm0,01001110b

Great !!! Thanks, siekmanski. I[´ll adapt it to RosAsm and make some tests :)  using macros is much much easier to identify how to use this SSE mnemonic:)
Coding in Assembly requires a mix of:
80% of brain, passion, intuition, creativity
10% of programming skills
10% of alcoholic levels in your blood.

My Code Sites:
http://rosasm.freeforums.org
http://winasm.tripod.com

#### guga

• Member
• Posts: 1357
• Assembly is a state of art.
##### Re: Rotate Bits SSE
« Reply #44 on: May 04, 2020, 06:37:15 AM »
Hi Siekmanski

About the macro, how to make ones for pshufhw, pshuflw, pshufw, shufpd, shufps ?
Coding in Assembly requires a mix of:
80% of brain, passion, intuition, creativity
10% of programming skills
10% of alcoholic levels in your blood.

My Code Sites:
http://rosasm.freeforums.org
http://winasm.tripod.com