### Author Topic: Rotate Bits SSE  (Read 5976 times)

#### Siekmanski

• Member
• Posts: 2365
##### Re: Rotate Bits SSE
« Reply #15 on: April 28, 2020, 01:05:39 PM »
Hi guga,

Here is my contribution, left and right rotation in 1 routine.

Code: [Select]
`Number         oword 000102030405060708090a0b0c0d0e0fhNumRotateBits  dd -4    ; number of bits to rotate ( range = -127 to +127 )Rotate128Bits:    movdqu  xmm0,oword ptr [Number]    mov     ecx,64                  ; 64 bits Barrier.    mov     eax,NumRotateBits       ; Number of bits to rotate.    test    eax,eax                 ; Positive = Left and Negative = Right Bit Rotation    jns     TestDirection           ; Rotate Left or Right?    add     eax,128                 ; Correct the position. TestDirection:    cmp     eax,ecx    jb      Test64bitBarrier    pshufd  xmm0,xmm0,01001110b     ; Swap High and Low 64 bits if NumRotateBits > 64Test64bitBarrier:    and     eax,63                  ; Keep it within 64 bit.    movd    xmm2,eax                ; Bits to shift.    movd    xmm3,ecx                ; 64 bit Shift-Range.    movdqu  xmm1,xmm0               ; Make a copy of the 128 bits.    psubq   xmm3,xmm2               ; Calculate Shift-Range.    psllq   xmm0,xmm2               ; Do the bit shifts to the left.    psrlq   xmm1,xmm3               ; Do the bit shifts to the right.    pshufd  xmm1,xmm1,01001110b     ; Swap High and Low 64 bit.    pxor    xmm0,xmm1               ; Glue the 128 bits together.    ret                             ; result in xmm0`
Creative coders use backward thinking techniques as a strategy.

#### guga

• Member
• Posts: 1357
• Assembly is a state of art.
##### Re: Rotate Bits SSE
« Reply #16 on: April 29, 2020, 04:06:45 AM »
Tks you so much, Siekmanski.

Works like a charm.

This is for a routine i´m making to create a SHA3 Cypher. So far, the code and result works  ok, but there are some minor issues i´m, trying to optimize it more. For example....How to optimize this and port from MMX to SSE2 ?

Code: [Select]
`; Where the arguments State is a 25 dupe Qword (so, 50 qwords in total) = KeccakState: Q\$ 0 #25 and BC is a 5 dupe qword. (10 qwords). [BC: Q\$ 0 #5]Proc keccakf_Theta::    Arguments @State    Uses esi, edi, ecx, ebx, edx    mov esi D@State    movups XMM0 X\$esi    movups XMM1 X\$esi+40 | xorps XMM0 XMM1    movups XMM2 X\$esi+80 | xorps XMM0 XMM2    movups XMM2 X\$esi+120 | xorps XMM0 XMM2    movups XMM2 X\$esi+160 | xorps XMM0 XMM2    movups X\$BC XMM0    movups XMM0 X\$esi+16    movups XMM1 X\$esi+56 | xorps XMM0 XMM1    movups XMM2 X\$esi+96 | xorps XMM0 XMM2    movups XMM2 X\$esi+136 | xorps XMM0 XMM2    movups XMM2 X\$esi+176 | xorps XMM0 XMM2    movups X\$BC+16 XMM0    movq MM0 Q\$esi+32 | pxor MM0 Q\$esi+72 | pxor MM0 Q\$esi+112 | pxor MM0 Q\$esi+152 | pxor MM0 Q\$esi+192 | movq Q\$BC+32 MM0    mov edi BC    movq MM0 Q\$edi+8    call rotl64 1    pxor MM0 Q\$edi+32    movq MM1 Q\$esi | pxor MM1 MM0 | movq Q\$esi MM1; esi    movq MM1 Q\$esi+40 | pxor MM1 MM0 | movq Q\$esi+40 MM1 ; esi+40    movq MM1 Q\$esi+80 | pxor MM1 MM0 | movq Q\$esi+80 MM1; esi+80    movq MM1 Q\$esi+120 | pxor MM1 MM0 | movq Q\$esi+120 MM1;esi+120    movq MM1 Q\$esi+160 | pxor MM1 MM0 | movq Q\$esi+160 MM1;esi+160    movq MM0 Q\$edi+16    call rotl64 1    pxor MM0 Q\$edi    movq MM1 Q\$esi+8 | pxor MM1 MM0 | movq Q\$esi+8 MM1;esi+8    movq MM1 Q\$esi+48 | pxor MM1 MM0 | movq Q\$esi+48 MM1;esi+48    movq MM1 Q\$esi+88 | pxor MM1 MM0 | movq Q\$esi+88 MM1;esi+88    movq MM1 Q\$esi+128 | pxor MM1 MM0 | movq Q\$esi+128 MM1;esi+128    movq MM1 Q\$esi+168 | pxor MM1 MM0 | movq Q\$esi+168 MM1;esi+168    movq MM0 Q\$edi+24    call rotl64 1    pxor MM0 Q\$edi+8    movq MM1 Q\$esi+16 | pxor MM1 MM0 | movq Q\$esi+16 MM1 ; eax = 16    movq MM1 Q\$esi+56 | pxor MM1 MM0 | movq Q\$esi+56 MM1 ; eax = 56    movq MM1 Q\$esi+96 | pxor MM1 MM0 | movq Q\$esi+96 MM1 ; eax = 96    movq MM1 Q\$esi+136 | pxor MM1 MM0 | movq Q\$esi+136 MM1; eax = 136    movq MM1 Q\$esi+176 | pxor MM1 MM0 | movq Q\$esi+176 MM1 ; eax = 176    movq MM0 Q\$edi+32    call rotl64 1    pxor MM0 Q\$edi+16    movq MM1 Q\$esi+24 | pxor MM1 MM0 | movq Q\$esi+24 MM1 ; eax = 24    movq MM1 Q\$esi+64 | pxor MM1 MM0 | movq Q\$esi+64 MM1 ; eax = 64    movq MM1 Q\$esi+104 | pxor MM1 MM0 | movq Q\$esi+104 MM1 ; eax  = 104    movq MM1 Q\$esi+144 | pxor MM1 MM0 | movq Q\$esi+144 MM1 ; eax = 144    movq MM1 Q\$esi+184 | pxor MM1 MM0 | movq Q\$esi+184 MM1 ; eax = 184    movq MM0 Q\$edi    call rotl64 1    pxor MM0 Q\$edi+24    movq MM1 Q\$esi+32 | pxor MM1 MM0 | movq Q\$esi+32 MM1    movq MM1 Q\$esi+72 | pxor MM1 MM0 | movq Q\$esi+72 MM1    movq MM1 Q\$esi+112 | pxor MM1 MM0 | movq Q\$esi+112 MM1    movq MM1 Q\$esi+152 | pxor MM1 MM0 | movq Q\$esi+152 MM1    movq MM1 Q\$esi+192 | pxor MM1 MM0 | movq Q\$esi+192 MM1EndP`
; where rotl64 =
Code: [Select]
`Proc rotl64:    Arguments @Count    mov eax D@Count    movq MM2 MM0    movd MM3 eax    sub eax 64    neg eax    movd MM4 eax    psllq MM0 MM3    psrlq MM2 MM4    por MM0 MM2EndP`

The above is the unrolled loop routine from my keccakf routine related to compute the loop to calculate Theta,. I´ll post here laater if needed, since i´m trying to optimize other parts of the algo as well :)

So far, my SHA3 cypher works ok, for all different types such as 224, 256, 384 and 512. But more optimization is needed :)

Ex: To compute the SHA3 from string "guga" my algo returns the corrected values (Although, i believe it is still needs more optimization):

SHA3-224    fa6d153b62efd49a85c2b4ae0b53e2f2361d54e51dc2940281b6691d
SHA3-256    0aa747b1b0f38a900fccf52a50386ff57b59e7123ecfc2df1547973e2eeaa19d
SHA3-512    603c3235eb25a491b5fd8148065066ab307769b714e5abc43448a37f361864e4d786afaff22d5ac5c839d350413d36c953837bf0b29171cae18f7395d720762c

https://md5calc.com/hash/sha3-256/guga
Coding in Assembly requires a mix of:
80% of brain, passion, intuition, creativity
10% of programming skills
10% of alcoholic levels in your blood.

My Code Sites:
http://rosasm.freeforums.org
http://winasm.tripod.com

#### guga

• Member
• Posts: 1357
• Assembly is a state of art.
##### Re: Rotate Bits SSE
« Reply #17 on: April 29, 2020, 04:36:01 AM »
Or, maybe rotating each of the High and Low 64 bit at once. (Could also works for other needs as well, such as a routine to rotate bytes (16 at once), words (8 at once), qwords (2 at once) that are stored inside a XMM register)

For example:
XMM0 = 001000....0010    (1st 64 bits) - rotate this 64 bits
XMM0 = 010011....1011    (2nd 64 bits) - rotate this 64 bits

So, making rotate only by the 64 bits chunks, rather then the whole 128 bit chain. I.e: Forcing the data from "rotl64" to compute a rotation at 2 64bits chains (qword) at once.

Perhaps this could also helps optimizing a bit the sequence before it goes through the chunks of  pxor MM1 MM0 .....
Coding in Assembly requires a mix of:
80% of brain, passion, intuition, creativity
10% of programming skills
10% of alcoholic levels in your blood.

My Code Sites:
http://rosasm.freeforums.org
http://winasm.tripod.com

#### Siekmanski

• Member
• Posts: 2365
##### Re: Rotate Bits SSE
« Reply #18 on: April 29, 2020, 06:46:40 AM »
I have no knowledge regarding Hash Functions, so I had a look on the internet.
Couldn't find a logical explanation of the algorithm which I can understand.
It's a bit hard for me to decipher your code example.

Is it so you only rotate 1 bit at a time?
Creative coders use backward thinking techniques as a strategy.

#### jj2007

• Member
• Posts: 11526
• Assembler is fun ;-)
##### Re: Rotate Bits SSE
« Reply #19 on: April 29, 2020, 07:20:46 AM »
Here is my contribution, left and right rotation in 1 routine.

Looks comvincing
Code: [Select]
`  .Repeat inc NumRotateBits call Rotate128Bits deb 4, "Out", b:xmm0  .Until NumRotateBits>=64  .Repeat dec NumRotateBits call Rotate128Bits deb 4, "Out", b:xmm0  .Until NumRotateBits==-64`
Code: [Select]
`Out     b:xmm0          00011000000110100001110000011110Out     b:xmm0          00110000001101000011100000111100Out     b:xmm0          01100000011010000111000001111000Out     b:xmm0          11000000110100001110000011110000Out     b:xmm0          10000001101000011100000111100000Out     b:xmm0          00000011010000111000001111000000Out     b:xmm0          00000110100001110000011110000000Out     b:xmm0          00001101000011100000111100000000Out     b:xmm0          00011010000111000001111000000000Out     b:xmm0          00110100001110000011110000000000Out     b:xmm0          01101000011100000111100000000000Out     b:xmm0          11010000111000001111000000000000Out     b:xmm0          10100001110000011110000000000000Out     b:xmm0          01000011100000111100000000000000Out     b:xmm0          10000111000001111000000000000000Out     b:xmm0          00001110000011110000000000000001Out     b:xmm0          00011100000111100000000000000010Out     b:xmm0          00111000001111000000000000000100Out     b:xmm0          01110000011110000000000000001000Out     b:xmm0          11100000111100000000000000010000Out     b:xmm0          11000001111000000000000000100000Out     b:xmm0          10000011110000000000000001000000Out     b:xmm0          00000111100000000000000010000001Out     b:xmm0          00001111000000000000000100000010Out     b:xmm0          00011110000000000000001000000100Out     b:xmm0          00111100000000000000010000001000Out     b:xmm0          01111000000000000000100000010000Out     b:xmm0          11110000000000000001000000100000Out     b:xmm0          11100000000000000010000001000000Out     b:xmm0          11000000000000000100000010000000Out     b:xmm0          10000000000000001000000100000001Out     b:xmm0          00000000000000010000001000000011Out     b:xmm0          00000000000000100000010000000110Out     b:xmm0          00000000000001000000100000001100Out     b:xmm0          00000000000010000001000000011000Out     b:xmm0          00000000000100000010000000110000Out     b:xmm0          00000000001000000100000001100000Out     b:xmm0          00000000010000001000000011000001Out     b:xmm0          00000000100000010000000110000010Out     b:xmm0          00000001000000100000001100000100Out     b:xmm0          00000010000001000000011000001000Out     b:xmm0          00000100000010000000110000010000Out     b:xmm0          00001000000100000001100000100000Out     b:xmm0          00010000001000000011000001000000Out     b:xmm0          00100000010000000110000010000000Out     b:xmm0          01000000100000001100000100000001Out     b:xmm0          10000001000000011000001000000010Out     b:xmm0          00000010000000110000010000000101Out     b:xmm0          00000100000001100000100000001010Out     b:xmm0          00001000000011000001000000010100Out     b:xmm0          00010000000110000010000000101000Out     b:xmm0          00100000001100000100000001010000Out     b:xmm0          01000000011000001000000010100000Out     b:xmm0          10000000110000010000000101000001Out     b:xmm0          00000001100000100000001010000011Out     b:xmm0          00000011000001000000010100000110Out     b:xmm0          00000110000010000000101000001100Out     b:xmm0          00001100000100000001010000011000Out     b:xmm0          00011000001000000010100000110000Out     b:xmm0          00110000010000000101000001100000Out     b:xmm0          01100000100000001010000011000000Out     b:xmm0          11000001000000010100000110000001Out     b:xmm0          10000010000000101000001100000011Out     b:xmm0          00000100000001010000011000000111Out     b:xmm0          10000010000000101000001100000011Out     b:xmm0          11000001000000010100000110000001Out     b:xmm0          01100000100000001010000011000000Out     b:xmm0          00110000010000000101000001100000Out     b:xmm0          00011000001000000010100000110000Out     b:xmm0          00001100000100000001010000011000Out     b:xmm0          00000110000010000000101000001100Out     b:xmm0          00000011000001000000010100000110Out     b:xmm0          00000001100000100000001010000011Out     b:xmm0          10000000110000010000000101000001Out     b:xmm0          01000000011000001000000010100000Out     b:xmm0          00100000001100000100000001010000Out     b:xmm0          00010000000110000010000000101000Out     b:xmm0          00001000000011000001000000010100Out     b:xmm0          00000100000001100000100000001010Out     b:xmm0          00000010000000110000010000000101Out     b:xmm0          10000001000000011000001000000010Out     b:xmm0          01000000100000001100000100000001Out     b:xmm0          00100000010000000110000010000000Out     b:xmm0          00010000001000000011000001000000Out     b:xmm0          00001000000100000001100000100000Out     b:xmm0          00000100000010000000110000010000Out     b:xmm0          00000010000001000000011000001000Out     b:xmm0          00000001000000100000001100000100Out     b:xmm0          00000000100000010000000110000010Out     b:xmm0          00000000010000001000000011000001Out     b:xmm0          00000000001000000100000001100000Out     b:xmm0          00000000000100000010000000110000Out     b:xmm0          00000000000010000001000000011000Out     b:xmm0          00000000000001000000100000001100Out     b:xmm0          00000000000000100000010000000110Out     b:xmm0          00000000000000010000001000000011Out     b:xmm0          10000000000000001000000100000001Out     b:xmm0          11000000000000000100000010000000Out     b:xmm0          11100000000000000010000001000000Out     b:xmm0          11110000000000000001000000100000Out     b:xmm0          01111000000000000000100000010000Out     b:xmm0          00111100000000000000010000001000Out     b:xmm0          00011110000000000000001000000100Out     b:xmm0          00001111000000000000000100000010Out     b:xmm0          00000111100000000000000010000001Out     b:xmm0          10000011110000000000000001000000Out     b:xmm0          11000001111000000000000000100000Out     b:xmm0          11100000111100000000000000010000Out     b:xmm0          01110000011110000000000000001000Out     b:xmm0          00111000001111000000000000000100Out     b:xmm0          00011100000111100000000000000010Out     b:xmm0          00001110000011110000000000000001Out     b:xmm0          10000111000001111000000000000000Out     b:xmm0          01000011100000111100000000000000Out     b:xmm0          10100001110000011110000000000000Out     b:xmm0          11010000111000001111000000000000Out     b:xmm0          01101000011100000111100000000000Out     b:xmm0          00110100001110000011110000000000Out     b:xmm0          00011010000111000001111000000000Out     b:xmm0          00001101000011100000111100000000Out     b:xmm0          00000110100001110000011110000000Out     b:xmm0          00000011010000111000001111000000Out     b:xmm0          10000001101000011100000111100000Out     b:xmm0          11000000110100001110000011110000Out     b:xmm0          01100000011010000111000001111000Out     b:xmm0          00110000001101000011100000111100Out     b:xmm0          00011000000110100001110000011110Out     b:xmm0          00001100000011010000111000001111Out     b:xmm0          10000110000001101000011100000111Out     b:xmm0          11000011000000110100001110000011Out     b:xmm0          01100001100000011010000111000001Out     b:xmm0          10110000110000001101000011100000Out     b:xmm0          01011000011000000110100001110000Out     b:xmm0          00101100001100000011010000111000Out     b:xmm0          00010110000110000001101000011100Out     b:xmm0          00001011000011000000110100001110Out     b:xmm0          00000101100001100000011010000111Out     b:xmm0          10000010110000110000001101000011Out     b:xmm0          01000001011000011000000110100001Out     b:xmm0          10100000101100001100000011010000Out     b:xmm0          01010000010110000110000001101000Out     b:xmm0          00101000001011000011000000110100Out     b:xmm0          00010100000101100001100000011010Out     b:xmm0          00001010000010110000110000001101Out     b:xmm0          10000101000001011000011000000110Out     b:xmm0          01000010100000101100001100000011Out     b:xmm0          00100001010000010110000110000001Out     b:xmm0          10010000101000001011000011000000Out     b:xmm0          01001000010100000101100001100000Out     b:xmm0          00100100001010000010110000110000Out     b:xmm0          00010010000101000001011000011000Out     b:xmm0          00001001000010100000101100001100Out     b:xmm0          00000100100001010000010110000110Out     b:xmm0          00000010010000101000001011000011Out     b:xmm0          00000001001000010100000101100001Out     b:xmm0          10000000100100001010000010110000Out     b:xmm0          01000000010010000101000001011000Out     b:xmm0          00100000001001000010100000101100Out     b:xmm0          00010000000100100001010000010110Out     b:xmm0          00001000000010010000101000001011Out     b:xmm0          10000100000001001000010100000101Out     b:xmm0          11000010000000100100001010000010Out     b:xmm0          11100001000000010010000101000001Out     b:xmm0          01110000100000001001000010100000Out     b:xmm0          00111000010000000100100001010000Out     b:xmm0          00011100001000000010010000101000Out     b:xmm0          00001110000100000001001000010100Out     b:xmm0          00000111000010000000100100001010Out     b:xmm0          00000011100001000000010010000101Out     b:xmm0          10000001110000100000001001000010Out     b:xmm0          11000000111000010000000100100001Out     b:xmm0          01100000011100001000000010010000Out     b:xmm0          00110000001110000100000001001000Out     b:xmm0          00011000000111000010000000100100Out     b:xmm0          00001100000011100001000000010010Out     b:xmm0          00000110000001110000100000001001Out     b:xmm0          10000011000000111000010000000100Out     b:xmm0          01000001100000011100001000000010Out     b:xmm0          10100000110000001110000100000001Out     b:xmm0          01010000011000000111000010000000Out     b:xmm0          00101000001100000011100001000000Out     b:xmm0          00010100000110000001110000100000Out     b:xmm0          00001010000011000000111000010000Out     b:xmm0          00000101000001100000011100001000Out     b:xmm0          00000010100000110000001110000100Out     b:xmm0          00000001010000011000000111000010Out     b:xmm0          10000000101000001100000011100001Out     b:xmm0          01000000010100000110000001110000Out     b:xmm0          00100000001010000011000000111000Out     b:xmm0          00010000000101000001100000011100Out     b:xmm0          00001000000010100000110000001110Out     b:xmm0          00000100000001010000011000000111`

#### guga

• Member
• Posts: 1357
• Assembly is a state of art.
##### Re: Rotate Bits SSE
« Reply #20 on: April 29, 2020, 07:24:39 AM »
Hi Scarmatil

yes...Your algo works like a charm. Rotating 1 bit at a time.

Can it also work for 64 bits inside a XMM also rotating 1 bit at a time, but restricted to the low and half Qwords of the 128 bits ?

Maybe this could help better optimizing this.

About the SHA3 , i built one biased  to these ones:

http://gauss.ececs.uc.edu/Courses/c6053/lectures/Hashing/sha3
https://github.com/magurosan/sha3-odzhan

The second link, didn´t worked as expected and it was very slow, so i had to create a variation of it biased on both links to make it work properly.

I´ll post here the test i made containing the embeded source (RosAsm syntax) and also i´ll try to make one for masm syntax as well to make it easier for people read.

Coding in Assembly requires a mix of:
80% of brain, passion, intuition, creativity
10% of programming skills
10% of alcoholic levels in your blood.

My Code Sites:
http://rosasm.freeforums.org
http://winasm.tripod.com

#### Siekmanski

• Member
• Posts: 2365
##### Re: Rotate Bits SSE
« Reply #21 on: April 29, 2020, 08:17:30 AM »

First I was "PSHUFD" with Magnus, and now with Scarmatil.

Code: [Select]
`    ;rotate64left 1 bit    movq    xmm0,qword ptr [Number64]    movq    xmm1,xmm0    psllq   xmm0,1      psrlq   xmm1,63     pxor    xmm0,xmm1    ;rotate64right 1 bit    movq    xmm0,qword ptr [Number64]    movq    xmm1,xmm0    psllq   xmm0,63     psrlq   xmm1,1    pxor    xmm0,xmm1`
Creative coders use backward thinking techniques as a strategy.

#### guga

• Member
• Posts: 1357
• Assembly is a state of art.
##### Re: Rotate Bits SSE
« Reply #22 on: April 29, 2020, 08:47:52 AM »

First I was "PSHUFD" with Magnus, and now with Scarmatil.

Ooooppppsss... Sorry  Siekmanski. I was with RosAsm opene3d and one of the routines was made by an former contributor called Scarmatil

Tks for the code. I´l take a look and see if i can adapt it. I´m also trying to port my SHA3 cypher to masm, before i upload both versions.
Coding in Assembly requires a mix of:
80% of brain, passion, intuition, creativity
10% of programming skills
10% of alcoholic levels in your blood.

My Code Sites:
http://rosasm.freeforums.org
http://winasm.tripod.com

#### Siekmanski

• Member
• Posts: 2365
##### Re: Rotate Bits SSE
« Reply #23 on: April 29, 2020, 08:51:54 AM »

Creative coders use backward thinking techniques as a strategy.

#### mineiro

• Member
• Posts: 701
##### Re: Rotate Bits SSE
« Reply #24 on: April 29, 2020, 09:59:52 PM »
I ran some tests and in the 64-bit long mode, common instructions are more efficient than SSE to accomplish this goal.
As an example:
Code: [Select]
`;rotate 1 bit to leftmov rax,9090909090909090h ;64 bits numberrcl rax,1adc rax,0`
I'd rather be this ambulant metamorphosis than to have that old opinion about everything

#### Siekmanski

• Member
• Posts: 2365
##### Re: Rotate Bits SSE
« Reply #25 on: April 29, 2020, 11:16:31 PM »
Hi mineiro,

As far as I understand and observed, I don't see possibilities for parallel execution speed optimizations in guga's routine. ( could be I missed something... )
In 64 bit long mode, your solution fits his routine best.
It's a pity we don't have rotate bits instructions in SIMD.
Creative coders use backward thinking techniques as a strategy.

#### mineiro

• Member
• Posts: 701
##### Re: Rotate Bits SSE
« Reply #26 on: April 30, 2020, 12:47:04 AM »
You are right, your right and left solution was elegant and runs fast.
What I did was take a snippet of your code and adapt it to what I had done, the gain is minimal to say the least. And there are differences between machines, so the measurements can be different.

Some results to "rept 10000":
Processor 0
Clock   Core cyc   Instruct       Uops    BrTaken
164772     162016     210012     243834      30000
140896     152612     210001     240022      30000
140874     152643     210001     240013      30000
140954     152642     210001     240013      30000
Code: [Select]
`align 16rot3 procmov ecx,[counter]movdqu xmm0,oword ptr [number].if ecx >= 64 pshufd  xmm0,xmm0,01001110b sub ecx,64.endifmovdqu xmm1,xmm0movd xmm3,ecxpsllq xmm0,xmm3sub ecx,64neg ecxmovd xmm4,ecxpsrlq xmm1,xmm4pshufd  xmm1,xmm1,01001110bpor xmm0,xmm1retrot3 endp`Processor 0
Clock   Core cyc   Instruct       Uops    BrTaken
180560     174677     190012     233851      20000
135620     146905     190001     230022      20000
135610     146932     190001     230019      20000
135648     146918     190001     230019      20000

Code: [Select]
`align 16rot6 procmov ecx,countermov rax,[number+8]mov rdx,[number].if ecx >= 64    mov rbx,rax    mov rax,rdx    mov rdx,rbx    sub ecx,64.endifxor ebx,ebxshld rbx,rax,clshld rax,rdx,clshl rdx,clor rdx,rbxretrot6 endp`Processor 0
Clock   Core cyc   Instruct       Uops    BrTaken
148766     144041     180012     283841      20000
122292     132467     180001     280025      20000
122310     132504     180001     280019      20000
122304     132506     180001     280020      20000
I'd rather be this ambulant metamorphosis than to have that old opinion about everything

#### daydreamer

• Member
• Posts: 1681
• building nextdoor
##### Re: Rotate Bits SSE
« Reply #27 on: April 30, 2020, 04:59:22 AM »
You are right, your right and left solution was elegant and runs fast.
What I did was take a snippet of your code and adapt it to what I had done, the gain is minimal to say the least. And there are differences between machines, so the measurements can be different.

Some results to "rept 10000":
Processor 0
Clock   Core cyc   Instruct       Uops    BrTaken
164772     162016     210012     243834      30000
140896     152612     210001     240022      30000
140874     152643     210001     240013      30000
140954     152642     210001     240013      30000
Code: [Select]
`align 16rot3 procmov ecx,[counter]movdqu xmm0,oword ptr [number].if ecx >= 64 pshufd  xmm0,xmm0,01001110b sub ecx,64.endifmovdqu xmm1,xmm0movd xmm3,ecxpsllq xmm0,xmm3sub ecx,64neg ecxmovd xmm4,ecxpsrlq xmm1,xmm4pshufd  xmm1,xmm1,01001110bpor xmm0,xmm1retrot3 endp`Processor 0
Clock   Core cyc   Instruct       Uops    BrTaken
180560     174677     190012     233851      20000
135620     146905     190001     230022      20000
135610     146932     190001     230019      20000
135648     146918     190001     230019      20000

Code: [Select]
`align 16rot6 procmov ecx,countermov rax,[number+8]mov rdx,[number].if ecx >= 64    mov rbx,rax    mov rax,rdx    mov rdx,rbx    sub ecx,64.endifxor ebx,ebxshld rbx,rax,clshld rax,rdx,clshl rdx,clor rdx,rbxretrot6 endp`Processor 0
Clock   Core cyc   Instruct       Uops    BrTaken
148766     144041     180012     283841      20000
122292     132467     180001     280025      20000
122310     132504     180001     280019      20000
122304     132506     180001     280020      20000
the below code have much depency on earlier instruction
Code: [Select]
`xor ebx,ebxshld rbx,rax,clshld rax,rdx,clshl rdx,clor rdx,rbx`maybe if unroll it like this breaks depency
and if cpu has several shift execution units it starts working on 2shifts at a time
forgive me if r9,r10,r11 is wrong register choice,because I am not experienced with 64bit asm
Code: [Select]
`xor ebx,ebxxor r9,r9shld rbx,rax,clshld r9,r10,clshld rax,rdx,clshld r10,r11,clshl rdx,clshl r11,clor rdx,rbxor r11,r9`on the other hand there are some instructions that are starved on execution units,so it has no use unroll those except its only few cpus that has slow shift

because its suppose to be part of bigger code,macro it instead of proc could gain few cycles
SIMD fan and macro fan
why assembly is fastest is because its switch has no (brakes) breaks
:P
only in 16bit assembly you can get away with "Only words" :P

#### mineiro

• Member
• Posts: 701
##### Re: Rotate Bits SSE
« Reply #28 on: April 30, 2020, 06:03:54 AM »
hello sir daydreamer;
No problems regarding the calling convention and register to be preserved, they are just tests; I didn't use the usual Windows or Linux convention myself. If others join so this can be a problem.

I understood what you said about register related dependency, execution out of order. I didn't find a viable way to solve the dependency to rotate 128 bits.
The code you posted is useful if you need to rotate 256 bits, then using 2 blocks of data and more registers is valid. The example I have in mind is to rotate 65 bits and then to rotate 129 bits ,193 bits and 255, assuming 256 bits in total.
An obstacle will arise because the counter of the shld or shl instruction is up to N bits. In this case I assume that using instructions greater than SSE2 may be feasible.
From what I read the compatible instruction set running on any 64-bit O.S. (x86-64) is SSE2.
I'd rather be this ambulant metamorphosis than to have that old opinion about everything

#### guga

• Member
• Posts: 1357
• Assembly is a state of art.
##### Re: Rotate Bits SSE
« Reply #29 on: April 30, 2020, 08:13:00 AM »
Hi Guys, i finished porting the SHA3 algorithm to masm, i just need to test what i have done wrong with the porting because the result is wrong. I´ll open another post containingt the algorithm as soon i fix the masm version. This is the algorithm from where i´m using to test this SSE rotation functions to speed up a little bit. And also, creating some functions related to Rotate and Shift 128 (or more bits) in other situations can also be helpful to other things :)

Coding in Assembly requires a mix of:
80% of brain, passion, intuition, creativity
10% of programming skills
10% of alcoholic levels in your blood.

My Code Sites:
http://rosasm.freeforums.org
http://winasm.tripod.com