Rotate Bits SSE

Siekmanski · April 28, 2020, 01:05:39 PM

Hi guga,

Here is my contribution, left and right rotation in 1 routine.

Number         oword 000102030405060708090a0b0c0d0e0fh
NumRotateBits  dd -4    ; number of bits to rotate ( range = -127 to +127 )

Rotate128Bits:
    movdqu  xmm0,oword ptr [Number]
    mov     ecx,64                  ; 64 bits Barrier.
    mov     eax,NumRotateBits       ; Number of bits to rotate.
    test    eax,eax                 ; Positive = Left and Negative = Right Bit Rotation
    jns     TestDirection           ; Rotate Left or Right?
    add     eax,128                 ; Correct the position. 
TestDirection:
    cmp     eax,ecx
    jb      Test64bitBarrier
    pshufd  xmm0,xmm0,01001110b     ; Swap High and Low 64 bits if NumRotateBits > 64
Test64bitBarrier:
    and     eax,63                  ; Keep it within 64 bit.
    movd    xmm2,eax                ; Bits to shift.
    movd    xmm3,ecx                ; 64 bit Shift-Range.
    movdqu  xmm1,xmm0               ; Make a copy of the 128 bits.
    psubq   xmm3,xmm2               ; Calculate Shift-Range.
    psllq   xmm0,xmm2               ; Do the bit shifts to the left.
    psrlq   xmm1,xmm3               ; Do the bit shifts to the right.
    pshufd  xmm1,xmm1,01001110b     ; Swap High and Low 64 bit.
    pxor    xmm0,xmm1               ; Glue the 128 bits together.
    ret                             ; result in xmm0

guga · April 29, 2020, 04:06:45 AM

Tks you so much, Siekmanski.

Works like a charm.

This is for a routine i´m making to create a SHA3 Cypher. So far, the code and result works ok, but there are some minor issues i´m, trying to optimize it more. For example....How to optimize this and port from MMX to SSE2 ?

Code Select



; Where the arguments State is a 25 dupe Qword (so, 50 qwords in total) = KeccakState: Q$ 0 #25 and BC is a 5 dupe qword. (10 qwords). [BC: Q$ 0 #5]

Proc keccakf_Theta::
    Arguments @State
    Uses esi, edi, ecx, ebx, edx

    mov esi D@State

    movups XMM0 X$esi
    movups XMM1 X$esi+40 | xorps XMM0 XMM1
    movups XMM2 X$esi+80 | xorps XMM0 XMM2
    movups XMM2 X$esi+120 | xorps XMM0 XMM2
    movups XMM2 X$esi+160 | xorps XMM0 XMM2
    movups X$BC XMM0

    movups XMM0 X$esi+16
    movups XMM1 X$esi+56 | xorps XMM0 XMM1
    movups XMM2 X$esi+96 | xorps XMM0 XMM2
    movups XMM2 X$esi+136 | xorps XMM0 XMM2
    movups XMM2 X$esi+176 | xorps XMM0 XMM2
    movups X$BC+16 XMM0

    movq MM0 Q$esi+32 | pxor MM0 Q$esi+72 | pxor MM0 Q$esi+112 | pxor MM0 Q$esi+152 | pxor MM0 Q$esi+192 | movq Q$BC+32 MM0

    mov edi BC

    movq MM0 Q$edi+8
    call rotl64 1
    pxor MM0 Q$edi+32

    movq MM1 Q$esi | pxor MM1 MM0 | movq Q$esi MM1; esi
    movq MM1 Q$esi+40 | pxor MM1 MM0 | movq Q$esi+40 MM1 ; esi+40
    movq MM1 Q$esi+80 | pxor MM1 MM0 | movq Q$esi+80 MM1; esi+80
    movq MM1 Q$esi+120 | pxor MM1 MM0 | movq Q$esi+120 MM1;esi+120
    movq MM1 Q$esi+160 | pxor MM1 MM0 | movq Q$esi+160 MM1;esi+160

    movq MM0 Q$edi+16
    call rotl64 1
    pxor MM0 Q$edi

    movq MM1 Q$esi+8 | pxor MM1 MM0 | movq Q$esi+8 MM1;esi+8
    movq MM1 Q$esi+48 | pxor MM1 MM0 | movq Q$esi+48 MM1;esi+48
    movq MM1 Q$esi+88 | pxor MM1 MM0 | movq Q$esi+88 MM1;esi+88
    movq MM1 Q$esi+128 | pxor MM1 MM0 | movq Q$esi+128 MM1;esi+128
    movq MM1 Q$esi+168 | pxor MM1 MM0 | movq Q$esi+168 MM1;esi+168

    movq MM0 Q$edi+24
    call rotl64 1
    pxor MM0 Q$edi+8

    movq MM1 Q$esi+16 | pxor MM1 MM0 | movq Q$esi+16 MM1 ; eax = 16
    movq MM1 Q$esi+56 | pxor MM1 MM0 | movq Q$esi+56 MM1 ; eax = 56
    movq MM1 Q$esi+96 | pxor MM1 MM0 | movq Q$esi+96 MM1 ; eax = 96
    movq MM1 Q$esi+136 | pxor MM1 MM0 | movq Q$esi+136 MM1; eax = 136
    movq MM1 Q$esi+176 | pxor MM1 MM0 | movq Q$esi+176 MM1 ; eax = 176

    movq MM0 Q$edi+32
    call rotl64 1
    pxor MM0 Q$edi+16

    movq MM1 Q$esi+24 | pxor MM1 MM0 | movq Q$esi+24 MM1 ; eax = 24
    movq MM1 Q$esi+64 | pxor MM1 MM0 | movq Q$esi+64 MM1 ; eax = 64
    movq MM1 Q$esi+104 | pxor MM1 MM0 | movq Q$esi+104 MM1 ; eax  = 104
    movq MM1 Q$esi+144 | pxor MM1 MM0 | movq Q$esi+144 MM1 ; eax = 144
    movq MM1 Q$esi+184 | pxor MM1 MM0 | movq Q$esi+184 MM1 ; eax = 184

    movq MM0 Q$edi
    call rotl64 1
    pxor MM0 Q$edi+24

    movq MM1 Q$esi+32 | pxor MM1 MM0 | movq Q$esi+32 MM1
    movq MM1 Q$esi+72 | pxor MM1 MM0 | movq Q$esi+72 MM1
    movq MM1 Q$esi+112 | pxor MM1 MM0 | movq Q$esi+112 MM1
    movq MM1 Q$esi+152 | pxor MM1 MM0 | movq Q$esi+152 MM1
    movq MM1 Q$esi+192 | pxor MM1 MM0 | movq Q$esi+192 MM1

EndP

; where rotl64 =

Code Select


Proc rotl64:
    Arguments @Count

    mov eax D@Count
    movq MM2 MM0
    movd MM3 eax
    sub eax 64
    neg eax
    movd MM4 eax
    psllq MM0 MM3
    psrlq MM2 MM4
    por MM0 MM2

EndP

The above is the unrolled loop routine from my keccakf routine related to compute the loop to calculate Theta,. I´ll post here laater if needed, since i´m trying to optimize other parts of the algo as well :)

So far, my SHA3 cypher works ok, for all different types such as 224, 256, 384 and 512. But more optimization is needed :)

Ex: To compute the SHA3 from string "guga" my algo returns the corrected values (Although, i believe it is still needs more optimization):

SHA3-224    fa6d153b62efd49a85c2b4ae0b53e2f2361d54e51dc2940281b6691d
SHA3-256    0aa747b1b0f38a900fccf52a50386ff57b59e7123ecfc2df1547973e2eeaa19d
SHA3-384    e11df0d663cac016f88a24c36309075972adadc1bd5d6d13928849d334444354f6a68e15f18915d2a4b285960ceeada4
SHA3-512    603c3235eb25a491b5fd8148065066ab307769b714e5abc43448a37f361864e4d786afaff22d5ac5c839d350413d36c953837bf0b29171cae18f7395d720762c

https://md5calc.com/hash/sha3-256/guga

guga · April 29, 2020, 04:36:01 AM

Or, maybe rotating each of the High and Low 64 bit at once. (Could also works for other needs as well, such as a routine to rotate bytes (16 at once), words (8 at once), qwords (2 at once) that are stored inside a XMM register)

For example:
XMM0 = 001000....0010 (1st 64 bits) - rotate this 64 bits
XMM0 = 010011....1011 (2nd 64 bits) - rotate this 64 bits

So, making rotate only by the 64 bits chunks, rather then the whole 128 bit chain. I.e: Forcing the data from "rotl64" to compute a rotation at 2 64bits chains (qword) at once.

Perhaps this could also helps optimizing a bit the sequence before it goes through the chunks of pxor MM1 MM0 .....

Siekmanski · April 29, 2020, 06:46:40 AM

I have no knowledge regarding Hash Functions, so I had a look on the internet.
Couldn't find a logical explanation of the algorithm which I can understand.
It's a bit hard for me to decipher your code example.

Is it so you only rotate 1 bit at a time?

jj2007 · April 29, 2020, 07:20:46 AM

Quote from: Siekmanski on April 28, 2020, 01:05:39 PMHere is my contribution, left and right rotation in 1 routine.

Looks comvincing

Code Select

  .Repeat
	inc NumRotateBits
	call Rotate128Bits
	deb 4, "Out", b:xmm0
  .Until NumRotateBits>=64
  .Repeat
	dec NumRotateBits
	call Rotate128Bits
	deb 4, "Out", b:xmm0
  .Until NumRotateBits==-64

Code Select

Out     b:xmm0          00011000000110100001110000011110
Out     b:xmm0          00110000001101000011100000111100
Out     b:xmm0          01100000011010000111000001111000
Out     b:xmm0          11000000110100001110000011110000
Out     b:xmm0          10000001101000011100000111100000
Out     b:xmm0          00000011010000111000001111000000
Out     b:xmm0          00000110100001110000011110000000
Out     b:xmm0          00001101000011100000111100000000
Out     b:xmm0          00011010000111000001111000000000
Out     b:xmm0          00110100001110000011110000000000
Out     b:xmm0          01101000011100000111100000000000
Out     b:xmm0          11010000111000001111000000000000
Out     b:xmm0          10100001110000011110000000000000
Out     b:xmm0          01000011100000111100000000000000
Out     b:xmm0          10000111000001111000000000000000
Out     b:xmm0          00001110000011110000000000000001
Out     b:xmm0          00011100000111100000000000000010
Out     b:xmm0          00111000001111000000000000000100
Out     b:xmm0          01110000011110000000000000001000
Out     b:xmm0          11100000111100000000000000010000
Out     b:xmm0          11000001111000000000000000100000
Out     b:xmm0          10000011110000000000000001000000
Out     b:xmm0          00000111100000000000000010000001
Out     b:xmm0          00001111000000000000000100000010
Out     b:xmm0          00011110000000000000001000000100
Out     b:xmm0          00111100000000000000010000001000
Out     b:xmm0          01111000000000000000100000010000
Out     b:xmm0          11110000000000000001000000100000
Out     b:xmm0          11100000000000000010000001000000
Out     b:xmm0          11000000000000000100000010000000
Out     b:xmm0          10000000000000001000000100000001
Out     b:xmm0          00000000000000010000001000000011
Out     b:xmm0          00000000000000100000010000000110
Out     b:xmm0          00000000000001000000100000001100
Out     b:xmm0          00000000000010000001000000011000
Out     b:xmm0          00000000000100000010000000110000
Out     b:xmm0          00000000001000000100000001100000
Out     b:xmm0          00000000010000001000000011000001
Out     b:xmm0          00000000100000010000000110000010
Out     b:xmm0          00000001000000100000001100000100
Out     b:xmm0          00000010000001000000011000001000
Out     b:xmm0          00000100000010000000110000010000
Out     b:xmm0          00001000000100000001100000100000
Out     b:xmm0          00010000001000000011000001000000
Out     b:xmm0          00100000010000000110000010000000
Out     b:xmm0          01000000100000001100000100000001
Out     b:xmm0          10000001000000011000001000000010
Out     b:xmm0          00000010000000110000010000000101
Out     b:xmm0          00000100000001100000100000001010
Out     b:xmm0          00001000000011000001000000010100
Out     b:xmm0          00010000000110000010000000101000
Out     b:xmm0          00100000001100000100000001010000
Out     b:xmm0          01000000011000001000000010100000
Out     b:xmm0          10000000110000010000000101000001
Out     b:xmm0          00000001100000100000001010000011
Out     b:xmm0          00000011000001000000010100000110
Out     b:xmm0          00000110000010000000101000001100
Out     b:xmm0          00001100000100000001010000011000
Out     b:xmm0          00011000001000000010100000110000
Out     b:xmm0          00110000010000000101000001100000
Out     b:xmm0          01100000100000001010000011000000
Out     b:xmm0          11000001000000010100000110000001
Out     b:xmm0          10000010000000101000001100000011
Out     b:xmm0          00000100000001010000011000000111
Out     b:xmm0          10000010000000101000001100000011
Out     b:xmm0          11000001000000010100000110000001
Out     b:xmm0          01100000100000001010000011000000
Out     b:xmm0          00110000010000000101000001100000
Out     b:xmm0          00011000001000000010100000110000
Out     b:xmm0          00001100000100000001010000011000
Out     b:xmm0          00000110000010000000101000001100
Out     b:xmm0          00000011000001000000010100000110
Out     b:xmm0          00000001100000100000001010000011
Out     b:xmm0          10000000110000010000000101000001
Out     b:xmm0          01000000011000001000000010100000
Out     b:xmm0          00100000001100000100000001010000
Out     b:xmm0          00010000000110000010000000101000
Out     b:xmm0          00001000000011000001000000010100
Out     b:xmm0          00000100000001100000100000001010
Out     b:xmm0          00000010000000110000010000000101
Out     b:xmm0          10000001000000011000001000000010
Out     b:xmm0          01000000100000001100000100000001
Out     b:xmm0          00100000010000000110000010000000
Out     b:xmm0          00010000001000000011000001000000
Out     b:xmm0          00001000000100000001100000100000
Out     b:xmm0          00000100000010000000110000010000
Out     b:xmm0          00000010000001000000011000001000
Out     b:xmm0          00000001000000100000001100000100
Out     b:xmm0          00000000100000010000000110000010
Out     b:xmm0          00000000010000001000000011000001
Out     b:xmm0          00000000001000000100000001100000
Out     b:xmm0          00000000000100000010000000110000
Out     b:xmm0          00000000000010000001000000011000
Out     b:xmm0          00000000000001000000100000001100
Out     b:xmm0          00000000000000100000010000000110
Out     b:xmm0          00000000000000010000001000000011
Out     b:xmm0          10000000000000001000000100000001
Out     b:xmm0          11000000000000000100000010000000
Out     b:xmm0          11100000000000000010000001000000
Out     b:xmm0          11110000000000000001000000100000
Out     b:xmm0          01111000000000000000100000010000
Out     b:xmm0          00111100000000000000010000001000
Out     b:xmm0          00011110000000000000001000000100
Out     b:xmm0          00001111000000000000000100000010
Out     b:xmm0          00000111100000000000000010000001
Out     b:xmm0          10000011110000000000000001000000
Out     b:xmm0          11000001111000000000000000100000
Out     b:xmm0          11100000111100000000000000010000
Out     b:xmm0          01110000011110000000000000001000
Out     b:xmm0          00111000001111000000000000000100
Out     b:xmm0          00011100000111100000000000000010
Out     b:xmm0          00001110000011110000000000000001
Out     b:xmm0          10000111000001111000000000000000
Out     b:xmm0          01000011100000111100000000000000
Out     b:xmm0          10100001110000011110000000000000
Out     b:xmm0          11010000111000001111000000000000
Out     b:xmm0          01101000011100000111100000000000
Out     b:xmm0          00110100001110000011110000000000
Out     b:xmm0          00011010000111000001111000000000
Out     b:xmm0          00001101000011100000111100000000
Out     b:xmm0          00000110100001110000011110000000
Out     b:xmm0          00000011010000111000001111000000
Out     b:xmm0          10000001101000011100000111100000
Out     b:xmm0          11000000110100001110000011110000
Out     b:xmm0          01100000011010000111000001111000
Out     b:xmm0          00110000001101000011100000111100
Out     b:xmm0          00011000000110100001110000011110
Out     b:xmm0          00001100000011010000111000001111
Out     b:xmm0          10000110000001101000011100000111
Out     b:xmm0          11000011000000110100001110000011
Out     b:xmm0          01100001100000011010000111000001
Out     b:xmm0          10110000110000001101000011100000
Out     b:xmm0          01011000011000000110100001110000
Out     b:xmm0          00101100001100000011010000111000
Out     b:xmm0          00010110000110000001101000011100
Out     b:xmm0          00001011000011000000110100001110
Out     b:xmm0          00000101100001100000011010000111
Out     b:xmm0          10000010110000110000001101000011
Out     b:xmm0          01000001011000011000000110100001
Out     b:xmm0          10100000101100001100000011010000
Out     b:xmm0          01010000010110000110000001101000
Out     b:xmm0          00101000001011000011000000110100
Out     b:xmm0          00010100000101100001100000011010
Out     b:xmm0          00001010000010110000110000001101
Out     b:xmm0          10000101000001011000011000000110
Out     b:xmm0          01000010100000101100001100000011
Out     b:xmm0          00100001010000010110000110000001
Out     b:xmm0          10010000101000001011000011000000
Out     b:xmm0          01001000010100000101100001100000
Out     b:xmm0          00100100001010000010110000110000
Out     b:xmm0          00010010000101000001011000011000
Out     b:xmm0          00001001000010100000101100001100
Out     b:xmm0          00000100100001010000010110000110
Out     b:xmm0          00000010010000101000001011000011
Out     b:xmm0          00000001001000010100000101100001
Out     b:xmm0          10000000100100001010000010110000
Out     b:xmm0          01000000010010000101000001011000
Out     b:xmm0          00100000001001000010100000101100
Out     b:xmm0          00010000000100100001010000010110
Out     b:xmm0          00001000000010010000101000001011
Out     b:xmm0          10000100000001001000010100000101
Out     b:xmm0          11000010000000100100001010000010
Out     b:xmm0          11100001000000010010000101000001
Out     b:xmm0          01110000100000001001000010100000
Out     b:xmm0          00111000010000000100100001010000
Out     b:xmm0          00011100001000000010010000101000
Out     b:xmm0          00001110000100000001001000010100
Out     b:xmm0          00000111000010000000100100001010
Out     b:xmm0          00000011100001000000010010000101
Out     b:xmm0          10000001110000100000001001000010
Out     b:xmm0          11000000111000010000000100100001
Out     b:xmm0          01100000011100001000000010010000
Out     b:xmm0          00110000001110000100000001001000
Out     b:xmm0          00011000000111000010000000100100
Out     b:xmm0          00001100000011100001000000010010
Out     b:xmm0          00000110000001110000100000001001
Out     b:xmm0          10000011000000111000010000000100
Out     b:xmm0          01000001100000011100001000000010
Out     b:xmm0          10100000110000001110000100000001
Out     b:xmm0          01010000011000000111000010000000
Out     b:xmm0          00101000001100000011100001000000
Out     b:xmm0          00010100000110000001110000100000
Out     b:xmm0          00001010000011000000111000010000
Out     b:xmm0          00000101000001100000011100001000
Out     b:xmm0          00000010100000110000001110000100
Out     b:xmm0          00000001010000011000000111000010
Out     b:xmm0          10000000101000001100000011100001
Out     b:xmm0          01000000010100000110000001110000
Out     b:xmm0          00100000001010000011000000111000
Out     b:xmm0          00010000000101000001100000011100
Out     b:xmm0          00001000000010100000110000001110
Out     b:xmm0          00000100000001010000011000000111

guga · April 29, 2020, 07:24:39 AM

Hi Scarmatil

yes...Your algo works like a charm. Rotating 1 bit at a time.

Can it also work for 64 bits inside a XMM also rotating 1 bit at a time, but restricted to the low and half Qwords of the 128 bits ?

Maybe this could help better optimizing this.

About the SHA3 , i built one biased to these ones:

http://gauss.ececs.uc.edu/Courses/c6053/lectures/Hashing/sha3
https://github.com/magurosan/sha3-odzhan

The second link, didn´t worked as expected and it was very slow, so i had to create a variation of it biased on both links to make it work properly.

I´ll post here the test i made containing the embeded source (RosAsm syntax) and also i´ll try to make one for masm syntax as well to make it easier for people read.

Siekmanski · April 29, 2020, 08:17:30 AM

First I was "PSHUFD" with Magnus, and now with Scarmatil.

Code Select

    ;rotate64left 1 bit
    movq    xmm0,qword ptr [Number64]
    movq    xmm1,xmm0
    psllq   xmm0,1  
    psrlq   xmm1,63 
    pxor    xmm0,xmm1

    ;rotate64right 1 bit
    movq    xmm0,qword ptr [Number64]
    movq    xmm1,xmm0
    psllq   xmm0,63 
    psrlq   xmm1,1
    pxor    xmm0,xmm1

guga · April 29, 2020, 08:47:52 AM

Quote from: Siekmanski on April 29, 2020, 08:17:30 AM

First I was "PSHUFD" with Magnus, and now with Scarmatil.

Ooooppppsss... Sorry

Siekmanski.

I was with RosAsm opene3d and one of the routines was made by an former contributor called Scarmatil

Tks for the code. I´l take a look and see if i can adapt it. I´m also trying to port my SHA3 cypher to masm, before i upload both versions.

Siekmanski · April 29, 2020, 08:51:54 AM

mineiro · April 29, 2020, 09:59:52 PM

I ran some tests and in the 64-bit long mode, common instructions are more efficient than SSE to accomplish this goal.
As an example:

Code Select

;rotate 1 bit to left
mov rax,9090909090909090h	;64 bits number
rcl rax,1
adc rax,0

Siekmanski · April 29, 2020, 11:16:31 PM

Hi mineiro,

As far as I understand and observed, I don't see possibilities for parallel execution speed optimizations in guga's routine. ( could be I missed something... )
Most are 64 bit calculations with wide spreaded reads and writes.
In 64 bit long mode, your solution fits his routine best.
It's a pity we don't have rotate bits instructions in SIMD.

mineiro · April 30, 2020, 12:47:04 AM

You are right, your right and left solution was elegant and runs fast.
What I did was take a snippet of your code and adapt it to what I had done, the gain is minimal to say the least. And there are differences between machines, so the measurements can be different.

Some results to "rept 10000":
your rot
Processor 0
Clock Core cyc Instruct Uops BrTaken
164772 162016 210012 243834 30000
140896 152612 210001 240022 30000
140874 152643 210001 240013 30000
140954 152642 210001 240013 30000

Code Select


align 16
rot3 proc
mov ecx,[counter]
movdqu xmm0,oword ptr [number]
.if ecx >= 64
	pshufd  xmm0,xmm0,01001110b
	sub ecx,64
.endif
movdqu xmm1,xmm0
movd xmm3,ecx
psllq xmm0,xmm3
sub ecx,64
neg ecx
movd xmm4,ecx
psrlq xmm1,xmm4
pshufd  xmm1,xmm1,01001110b
por xmm0,xmm1
ret
rot3 endp

Processor 0
Clock Core cyc Instruct Uops BrTaken
180560 174677 190012 233851 20000
135620 146905 190001 230022 20000
135610 146932 190001 230019 20000
135648 146918 190001 230019 20000

Code Select


align 16
rot6 proc
mov ecx,counter
mov rax,[number+8]
mov rdx,[number]
.if ecx >= 64
    mov rbx,rax
    mov rax,rdx
    mov rdx,rbx
    sub ecx,64
.endif
xor ebx,ebx
shld rbx,rax,cl
shld rax,rdx,cl
shl rdx,cl
or rdx,rbx
ret
rot6 endp

Processor 0
Clock Core cyc Instruct Uops BrTaken
148766 144041 180012 283841 20000
122292 132467 180001 280025 20000
122310 132504 180001 280019 20000
122304 132506 180001 280020 20000

daydreamer · April 30, 2020, 04:59:22 AM

Quote from: mineiro on April 30, 2020, 12:47:04 AM
You are right, your right and left solution was elegant and runs fast.
What I did was take a snippet of your code and adapt it to what I had done, the gain is minimal to say the least. And there are differences between machines, so the measurements can be different.

Some results to "rept 10000":
your rot
Processor 0
Clock Core cyc Instruct Uops BrTaken
164772 162016 210012 243834 30000
140896 152612 210001 240022 30000
140874 152643 210001 240013 30000
140954 152642 210001 240013 30000
Code Select Expand
align 16 rot3 proc mov ecx,[counter] movdqu xmm0,oword ptr [number] .if ecx >= 64 pshufd xmm0,xmm0,01001110b sub ecx,64 .endif movdqu xmm1,xmm0 movd xmm3,ecx psllq xmm0,xmm3 sub ecx,64 neg ecx movd xmm4,ecx psrlq xmm1,xmm4 pshufd xmm1,xmm1,01001110b por xmm0,xmm1 ret rot3 endp
Processor 0
Clock Core cyc Instruct Uops BrTaken
180560 174677 190012 233851 20000
135620 146905 190001 230022 20000
135610 146932 190001 230019 20000
135648 146918 190001 230019 20000

Code Select Expand
align 16 rot6 proc mov ecx,counter mov rax,[number+8] mov rdx,[number] .if ecx >= 64 mov rbx,rax mov rax,rdx mov rdx,rbx sub ecx,64 .endif xor ebx,ebx shld rbx,rax,cl shld rax,rdx,cl shl rdx,cl or rdx,rbx ret rot6 endp
Processor 0
Clock Core cyc Instruct Uops BrTaken
148766 144041 180012 283841 20000
122292 132467 180001 280025 20000
122310 132504 180001 280019 20000
122304 132506 180001 280020 20000

the below code have much depency on earlier instruction

Code Select


xor ebx,ebx
shld rbx,rax,cl
shld rax,rdx,cl
shl rdx,cl
or rdx,rbx

maybe if unroll it like this breaks depency
and if cpu has several shift execution units it starts working on 2shifts at a time
forgive me if r9,r10,r11 is wrong register choice,because I am not experienced with 64bit asm

Code Select


xor ebx,ebx
xor r9,r9
shld rbx,rax,cl
shld r9,r10,cl
shld rax,rdx,cl
shld r10,r11,cl
shl rdx,cl
shl r11,cl

or rdx,rbx
or r11,r9

on the other hand there are some instructions that are starved on execution units,so it has no use unroll those except its only few cpus that has slow shift

because its suppose to be part of bigger code,macro it instead of proc could gain few cycles

mineiro · April 30, 2020, 06:03:54 AM

hello sir daydreamer;
No problems regarding the calling convention and register to be preserved, they are just tests; I didn't use the usual Windows or Linux convention myself. If others join so this can be a problem.

I understood what you said about register related dependency, execution out of order. I didn't find a viable way to solve the dependency to rotate 128 bits.
The code you posted is useful if you need to rotate 256 bits, then using 2 blocks of data and more registers is valid. The example I have in mind is to rotate 65 bits and then to rotate 129 bits ,193 bits and 255, assuming 256 bits in total.
An obstacle will arise because the counter of the shld or shl instruction is up to N bits. In this case I assume that using instructions greater than SSE2 may be feasible.
From what I read the compatible instruction set running on any 64-bit O.S. (x86-64) is SSE2.
Thanks for your comments and sugestion.

guga · April 30, 2020, 08:13:00 AM

Hi Guys, i finished porting the SHA3 algorithm to masm, i just need to test what i have done wrong with the porting because the result is wrong. I´ll open another post containingt the algorithm as soon i fix the masm version. This is the algorithm from where i´m using to test this SSE rotation functions to speed up a little bit. And also, creating some functions related to Rotate and Shift 128 (or more bits) in other situations can also be helpful to other things :)

The MASM Forum

News:

Rotate Bits SSE

Siekmanski

guga

guga

Siekmanski

jj2007

guga

Siekmanski

guga

Siekmanski

mineiro

Siekmanski

mineiro

daydreamer

mineiro

guga