The MASM Forum

General => The Laboratory => Topic started by: guga on April 27, 2020, 06:30:22 AM

Title: Rotate Bits SSE
Post by: guga on April 27, 2020, 06:30:22 AM
Hi Guys

I made a couple of routines to rotate bits from rigth and left using MMX instructions, but i´m failing to do it for Se2. Can someone help ? Also, can these be optimized ?

The MMX instruction are as this:

Rotate Left
Code: [Select]

      rotl64 -  Rotate Left bits in MM0 register N times.

       Parameters:
                   Count (Input). The total amount of times the register should be shifted left

       Return Value:
                    The shifted value is stored in MM0 register. (See also, remarks below)

       Remarks:
                  The function uses MM0 register to perform the rotation. So, on input, MM0 must already be filled.

       Example of usage:

      [tttt: Q$ 1] ; A qword variable that hold "1" as value
      movq MM0 Q$Value
      call rotl64 1 ; rotate 1 left by bit


Proc rotl64:
    Arguments @Count

    mov eax D@Count
    movq MM2 MM0
    movd MM3 eax
    sub eax 64
    neg eax
    movd MM4 eax
    psllq MM0 MM3
    psrlq MM2 MM4
    por MM0 MM2

EndP



Rotate Right
Code: [Select]

      rotr64 -  Rotate Right bits in MM0 register N times.

       Parameters:
                   Count (Input). The total amount of times the register should be shifted right

       Return Value:
                    The shifted value is stored in MM0 register. (See also, remarks below)

       Remarks:
                  The function uses MM0 register to perform the rotation. So, on input, MM0 must already be filled.

      [tttt: Q$ 1] ; A qword variable that hold "1" as value
      movq MM0 Q$Value
      call rotr64 1 ; rotate 1 right by bit

Proc rotr64:
    Arguments @Count

    mov eax D@Count
    movq MM2 MM0
    movd MM3 eax
    sub eax 64
    neg eax
    movd MM4 eax
    psrlq MM0 MM3
    psllq MM2 MM4
    por MM0 MM2

EndP


Now...How to do a similar thing (left and right) for 128 Bits ? I tried this one below, but it is not working :(

Code: [Select]
Proc rot128:
    Arguments @Count

    mov eax D@Count
    movdqu XMM2 XMM0
    movd XMM3 eax
    sub eax 128
    neg eax
    movd XMM4 eax
    psllq XMM0 XMM3
    psrlq XMM2 XMM4
    por XMM0 XMM2

EndP
Title: Re: Rotate Bits SSE
Post by: daydreamer on April 27, 2020, 03:55:20 PM

Now...How to do a similar thing (left and right) for 128 Bits ? I tried this one below, but it is not working :(

Code: [Select]
Proc rot128:
    Arguments @Count

    mov eax D@Count
    movdqu XMM2 XMM0
    movd XMM3 eax
    sub eax 128
    neg eax
    movd XMM4 eax
    psllq XMM0 XMM3
    psrlq XMM2 XMM4
    por XMM0 XMM2

EndP
you should use PSLLDQ (double quadword) instead of PSLLQ (only 64bit quadword) versions instead
Title: Re: Rotate Bits SSE
Post by: jj2007 on April 27, 2020, 08:24:14 PM
you should use PSLLDQ (double quadword) instead of PSLLQ (only 64bit quadword) versions instead

Marinus, Magnus, PSLLDQ shifts bytes while PSLLQ shifts bits.
Title: Re: Rotate Bits SSE
Post by: mineiro on April 27, 2020, 10:19:56 PM
The version below is not SSE2. Sorry, I didn't check for optimization or others instructions.

mov rax,0123456789abcdefh   ;high
mov rdx,0123456789abcdefh   ;low
mov ecx,8         ;count
shld rax,rdx,cl         ;shift left rotate higher
shl rdx,cl         ;shift left rotate lower

Rotating to the left is to multiply the number N times by 2. Depending, you may need two variables for the carry even to create a rcl or rol version at each step.
Maybe this can be usefull to others tries.

---edit---
I assembled your source file and now understand. It's something like:
Code: [Select]
mov rax,0f123456789abcdefh ;high
 mov rdx,0123456789abcdefh ;low
 mov rbx,0
 mov ecx,8 ;count
 shld rbx,rax,cl                 ;rbx=high bits of rax
 shld rax,rdx,cl ;rotate n bits agregating
 shl rdx,cl ;shift left rotate lower, inserting zeros at right side
 or rdx,rbx                ;join, logical add
Title: Re: Rotate Bits SSE
Post by: Siekmanski on April 27, 2020, 11:01:39 PM
you should use PSLLDQ (double quadword) instead of PSLLQ (only 64bit quadword) versions instead

Marinus, PSLLDQ shifts bytes while PSLLQ shifts bits.

Am I PSHUFD with Magnus?  :biggrin:
Title: Re: Rotate Bits SSE
Post by: mineiro on April 28, 2020, 12:46:41 AM
hello sir guga,
This is a quick try;  psllq deals with 2 qwords instead of 1 oword(128)

Code: [Select]
.data
align 32
number    dq 0123456789abcdefh,0fedcba9876543210h
low_mask dq 0ffffffffffffffffh,0000000000000000h ;reversed because using qword
high_mask  dq 0000000000000000h,0ffffffffffffffffh

.code
mov eax,4 ;count
movdqu xmm0,oword ptr [number] ;xmm0=fedcba98765432100123456789abcdef
movdqu xmm1,xmm0
movdqu xmm2,xmm0
pand xmm1,oword ptr [high_mask] ;xmm1=FEDCBA98765432100000000000000000
pand xmm2,oword ptr [low_mask] ;xmm2=00000000000000000123456789ABCDEF
movd xmm3,eax ;xmm3=counter
psllq xmm0,xmm3 ;xmm0=EDCBA98765432100123456789ABCDEF0

sub eax,64
neg eax
movd xmm4,eax ;xmm4=3c
psrlq xmm1,xmm4 ;xmm1=000000000000000F0000000000000000
psrlq xmm2,xmm4 ;xmm2=00000000000000000000000000000000
pxor xmm3,xmm3 ;zero
movhlps xmm3,xmm1 ;high part of 1 to lower part of 3
movlhps xmm3,xmm2 ;lower part of 2 to higher part of 3
;xmm3=0000000000000000000000000000000F
por xmm0,xmm3 ;concatenate

--edit---
I measure both procedures and the first that don't uses SSE perform better in my machine.
Title: Re: Rotate Bits SSE
Post by: jj2007 on April 28, 2020, 02:54:05 AM
Am I PSHUFD with Magnus?  :biggrin:

Corrected, sorry :badgrin:
Title: Re: Rotate Bits SSE
Post by: Siekmanski on April 28, 2020, 04:26:37 AM
No worries, 'The mouth speaks what the heart is full of'  :rofl:
Title: Re: Rotate Bits SSE
Post by: guga on April 28, 2020, 06:15:59 AM
Hi Mineiro, tks, but not working. Try with Rotating 64 and 65 Bits. The result will be 0.

number    dq 0123456789abcdefh,0fedcba9876543210h
The input are:
0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_0111_0110_0101_0100_0011_0010_0001_0000_1000_1001_1010_1011_1100_1101_1110_1111

and the output should be:

Rotate 64 Bts
1110_1100_1010_1000_0110_0100_0010_0001_0001_0011_0101_0111_1001_1011_1101_1110_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000

Rotate 65 bits
1101_1001_0101_0000_1100_1000_0100_0010_0010_0110_1010_1111_0011_0111_1011_1100_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_0001
Title: Re: Rotate Bits SSE
Post by: guga on April 28, 2020, 06:17:25 AM
Ops...edited, wrong :greensml: :greensml: :greensml: :greensml:
Title: Re: Rotate Bits SSE
Post by: mineiro on April 28, 2020, 06:29:27 AM
That's working or not to your needs sir guga? I can change the code to fit your needs.
I suppose shl deals with N-1, so, maximum counter can hold will be 63. Well, now that you said that I need review result with counter being zero, 64 or above.
Title: Re: Rotate Bits SSE
Post by: guga on April 28, 2020, 06:56:11 AM
Hi Mineiro

No. It´s not working. The goal is to rotate left and right all the bits on a 128 bit data. What the code was doing is shifting the bits (also only for 64) which causes the zeroing of the lo half of the 128 bits, rather then rotating all bits by XXX times.
Title: Re: Rotate Bits SSE
Post by: mineiro on April 28, 2020, 08:11:41 AM
Maybe this can work, please check. If this works we can deal with rotate right.
Code: [Select]
.data
 align 32
 number    dq 3,0
 low_mask dq 0ffffffffffffffffh,0000000000000000h ;reversed because using qword
 high_mask  dq 0000000000000000h,0ffffffffffffffffh

 .code
  mov eax,127 ;count range 0 to 128
  movdqu xmm0,oword ptr [number]
  movdqu xmm1,xmm0
  .if eax >= 64
      movhlps xmm0,xmm1 ;switch high and low
    movlhps xmm0,xmm1
    sub eax,64
  .endif

    movdqu xmm1,xmm0
    movdqu xmm2,xmm0

    pand xmm1,oword ptr [high_mask]
    pand xmm2,oword ptr [low_mask]
    movd xmm3,eax
    psllq xmm0,xmm3

    sub eax,64
    neg eax
    movd xmm4,eax
    psrlq xmm1,xmm4
    psrlq xmm2,xmm4
    pxor xmm3,xmm3
    movhlps xmm3,xmm1
    movlhps xmm3,xmm2
    por xmm0,xmm3
Title: Re: Rotate Bits SSE
Post by: Siekmanski on April 28, 2020, 09:21:41 AM
Cool challenge.

EDIT: Sorry, I misunderstood the question and posted the wrong code.  :rolleyes:
I'll try again and if it works, post some code that does the job.
Title: Re: Rotate Bits SSE
Post by: mineiro on April 28, 2020, 11:15:17 AM
by XXX times.
obscene  :rofl:

I figured it was not necessary to use masks, I cleaned the code.
Code: [Select]
  mov eax,127 ;count range 0 to 128
  movdqu xmm0,oword ptr [number]

  .if eax >= 64
    movdqu xmm1,xmm0
      movhlps xmm0,xmm1 ;switch high and low
    movlhps xmm0,xmm1
    sub eax,64
  .endif

    movdqu xmm1,xmm0
    movd xmm3,eax
    psllq xmm0,xmm3

    sub eax,64
    neg eax
    movd xmm4,eax
    psrlq xmm1,xmm4
    movhlps xmm3,xmm1
    movlhps xmm3,xmm1
    por xmm0,xmm3
Title: Re: Rotate Bits SSE
Post by: Siekmanski on April 28, 2020, 01:05:39 PM
Hi guga,

Here is my contribution, left and right rotation in 1 routine.

Code: [Select]
Number         oword 000102030405060708090a0b0c0d0e0fh
NumRotateBits  dd -4    ; number of bits to rotate ( range = -127 to +127 )

Rotate128Bits:
    movdqu  xmm0,oword ptr [Number]
    mov     ecx,64                  ; 64 bits Barrier.
    mov     eax,NumRotateBits       ; Number of bits to rotate.
    test    eax,eax                 ; Positive = Left and Negative = Right Bit Rotation
    jns     TestDirection           ; Rotate Left or Right?
    add     eax,128                 ; Correct the position.
TestDirection:
    cmp     eax,ecx
    jb      Test64bitBarrier
    pshufd  xmm0,xmm0,01001110b     ; Swap High and Low 64 bits if NumRotateBits > 64
Test64bitBarrier:
    and     eax,63                  ; Keep it within 64 bit.
    movd    xmm2,eax                ; Bits to shift.
    movd    xmm3,ecx                ; 64 bit Shift-Range.
    movdqu  xmm1,xmm0               ; Make a copy of the 128 bits.
    psubq   xmm3,xmm2               ; Calculate Shift-Range.
    psllq   xmm0,xmm2               ; Do the bit shifts to the left.
    psrlq   xmm1,xmm3               ; Do the bit shifts to the right.
    pshufd  xmm1,xmm1,01001110b     ; Swap High and Low 64 bit.
    pxor    xmm0,xmm1               ; Glue the 128 bits together.
    ret                             ; result in xmm0

Title: Re: Rotate Bits SSE
Post by: guga on April 29, 2020, 04:06:45 AM
Tks you so much, Siekmanski.

Works like a charm.


This is for a routine i´m making to create a SHA3 Cypher. So far, the code and result works  ok, but there are some minor issues i´m, trying to optimize it more. For example....How to optimize this and port from MMX to SSE2 ?

Code: [Select]

; Where the arguments State is a 25 dupe Qword (so, 50 qwords in total) = KeccakState: Q$ 0 #25 and BC is a 5 dupe qword. (10 qwords). [BC: Q$ 0 #5]

Proc keccakf_Theta::
    Arguments @State
    Uses esi, edi, ecx, ebx, edx

    mov esi D@State

    movups XMM0 X$esi
    movups XMM1 X$esi+40 | xorps XMM0 XMM1
    movups XMM2 X$esi+80 | xorps XMM0 XMM2
    movups XMM2 X$esi+120 | xorps XMM0 XMM2
    movups XMM2 X$esi+160 | xorps XMM0 XMM2
    movups X$BC XMM0

    movups XMM0 X$esi+16
    movups XMM1 X$esi+56 | xorps XMM0 XMM1
    movups XMM2 X$esi+96 | xorps XMM0 XMM2
    movups XMM2 X$esi+136 | xorps XMM0 XMM2
    movups XMM2 X$esi+176 | xorps XMM0 XMM2
    movups X$BC+16 XMM0

    movq MM0 Q$esi+32 | pxor MM0 Q$esi+72 | pxor MM0 Q$esi+112 | pxor MM0 Q$esi+152 | pxor MM0 Q$esi+192 | movq Q$BC+32 MM0

    mov edi BC

    movq MM0 Q$edi+8
    call rotl64 1
    pxor MM0 Q$edi+32

    movq MM1 Q$esi | pxor MM1 MM0 | movq Q$esi MM1; esi
    movq MM1 Q$esi+40 | pxor MM1 MM0 | movq Q$esi+40 MM1 ; esi+40
    movq MM1 Q$esi+80 | pxor MM1 MM0 | movq Q$esi+80 MM1; esi+80
    movq MM1 Q$esi+120 | pxor MM1 MM0 | movq Q$esi+120 MM1;esi+120
    movq MM1 Q$esi+160 | pxor MM1 MM0 | movq Q$esi+160 MM1;esi+160

    movq MM0 Q$edi+16
    call rotl64 1
    pxor MM0 Q$edi

    movq MM1 Q$esi+8 | pxor MM1 MM0 | movq Q$esi+8 MM1;esi+8
    movq MM1 Q$esi+48 | pxor MM1 MM0 | movq Q$esi+48 MM1;esi+48
    movq MM1 Q$esi+88 | pxor MM1 MM0 | movq Q$esi+88 MM1;esi+88
    movq MM1 Q$esi+128 | pxor MM1 MM0 | movq Q$esi+128 MM1;esi+128
    movq MM1 Q$esi+168 | pxor MM1 MM0 | movq Q$esi+168 MM1;esi+168

    movq MM0 Q$edi+24
    call rotl64 1
    pxor MM0 Q$edi+8

    movq MM1 Q$esi+16 | pxor MM1 MM0 | movq Q$esi+16 MM1 ; eax = 16
    movq MM1 Q$esi+56 | pxor MM1 MM0 | movq Q$esi+56 MM1 ; eax = 56
    movq MM1 Q$esi+96 | pxor MM1 MM0 | movq Q$esi+96 MM1 ; eax = 96
    movq MM1 Q$esi+136 | pxor MM1 MM0 | movq Q$esi+136 MM1; eax = 136
    movq MM1 Q$esi+176 | pxor MM1 MM0 | movq Q$esi+176 MM1 ; eax = 176

    movq MM0 Q$edi+32
    call rotl64 1
    pxor MM0 Q$edi+16

    movq MM1 Q$esi+24 | pxor MM1 MM0 | movq Q$esi+24 MM1 ; eax = 24
    movq MM1 Q$esi+64 | pxor MM1 MM0 | movq Q$esi+64 MM1 ; eax = 64
    movq MM1 Q$esi+104 | pxor MM1 MM0 | movq Q$esi+104 MM1 ; eax  = 104
    movq MM1 Q$esi+144 | pxor MM1 MM0 | movq Q$esi+144 MM1 ; eax = 144
    movq MM1 Q$esi+184 | pxor MM1 MM0 | movq Q$esi+184 MM1 ; eax = 184

    movq MM0 Q$edi
    call rotl64 1
    pxor MM0 Q$edi+24

    movq MM1 Q$esi+32 | pxor MM1 MM0 | movq Q$esi+32 MM1
    movq MM1 Q$esi+72 | pxor MM1 MM0 | movq Q$esi+72 MM1
    movq MM1 Q$esi+112 | pxor MM1 MM0 | movq Q$esi+112 MM1
    movq MM1 Q$esi+152 | pxor MM1 MM0 | movq Q$esi+152 MM1
    movq MM1 Q$esi+192 | pxor MM1 MM0 | movq Q$esi+192 MM1

EndP

; where rotl64 =
Code: [Select]
Proc rotl64:
    Arguments @Count

    mov eax D@Count
    movq MM2 MM0
    movd MM3 eax
    sub eax 64
    neg eax
    movd MM4 eax
    psllq MM0 MM3
    psrlq MM2 MM4
    por MM0 MM2

EndP



The above is the unrolled loop routine from my keccakf routine related to compute the loop to calculate Theta,. I´ll post here laater if needed, since i´m trying to optimize other parts of the algo as well :)

So far, my SHA3 cypher works ok, for all different types such as 224, 256, 384 and 512. But more optimization is needed :)

Ex: To compute the SHA3 from string "guga" my algo returns the corrected values (Although, i believe it is still needs more optimization):

SHA3-224    fa6d153b62efd49a85c2b4ae0b53e2f2361d54e51dc2940281b6691d
SHA3-256    0aa747b1b0f38a900fccf52a50386ff57b59e7123ecfc2df1547973e2eeaa19d
SHA3-384    e11df0d663cac016f88a24c36309075972adadc1bd5d6d13928849d334444354f6a68e15f18915d2a4b285960ceeada4
SHA3-512    603c3235eb25a491b5fd8148065066ab307769b714e5abc43448a37f361864e4d786afaff22d5ac5c839d350413d36c953837bf0b29171cae18f7395d720762c

https://md5calc.com/hash/sha3-256/guga
Title: Re: Rotate Bits SSE
Post by: guga on April 29, 2020, 04:36:01 AM
Or, maybe rotating each of the High and Low 64 bit at once. (Could also works for other needs as well, such as a routine to rotate bytes (16 at once), words (8 at once), qwords (2 at once) that are stored inside a XMM register)

For example:
XMM0 = 001000....0010    (1st 64 bits) - rotate this 64 bits
XMM0 = 010011....1011    (2nd 64 bits) - rotate this 64 bits

So, making rotate only by the 64 bits chunks, rather then the whole 128 bit chain. I.e: Forcing the data from "rotl64" to compute a rotation at 2 64bits chains (qword) at once.

Perhaps this could also helps optimizing a bit the sequence before it goes through the chunks of  pxor MM1 MM0 .....
Title: Re: Rotate Bits SSE
Post by: Siekmanski on April 29, 2020, 06:46:40 AM
I have no knowledge regarding Hash Functions, so I had a look on the internet.
Couldn't find a logical explanation of the algorithm which I can understand.
It's a bit hard for me to decipher your code example.

Is it so you only rotate 1 bit at a time?
Title: Re: Rotate Bits SSE
Post by: jj2007 on April 29, 2020, 07:20:46 AM
Here is my contribution, left and right rotation in 1 routine.

Looks comvincing :thumbsup:
Code: [Select]
  .Repeat
inc NumRotateBits
call Rotate128Bits
deb 4, "Out", b:xmm0
  .Until NumRotateBits>=64
  .Repeat
dec NumRotateBits
call Rotate128Bits
deb 4, "Out", b:xmm0
  .Until NumRotateBits==-64
Code: [Select]
Out     b:xmm0          00011000000110100001110000011110
Out     b:xmm0          00110000001101000011100000111100
Out     b:xmm0          01100000011010000111000001111000
Out     b:xmm0          11000000110100001110000011110000
Out     b:xmm0          10000001101000011100000111100000
Out     b:xmm0          00000011010000111000001111000000
Out     b:xmm0          00000110100001110000011110000000
Out     b:xmm0          00001101000011100000111100000000
Out     b:xmm0          00011010000111000001111000000000
Out     b:xmm0          00110100001110000011110000000000
Out     b:xmm0          01101000011100000111100000000000
Out     b:xmm0          11010000111000001111000000000000
Out     b:xmm0          10100001110000011110000000000000
Out     b:xmm0          01000011100000111100000000000000
Out     b:xmm0          10000111000001111000000000000000
Out     b:xmm0          00001110000011110000000000000001
Out     b:xmm0          00011100000111100000000000000010
Out     b:xmm0          00111000001111000000000000000100
Out     b:xmm0          01110000011110000000000000001000
Out     b:xmm0          11100000111100000000000000010000
Out     b:xmm0          11000001111000000000000000100000
Out     b:xmm0          10000011110000000000000001000000
Out     b:xmm0          00000111100000000000000010000001
Out     b:xmm0          00001111000000000000000100000010
Out     b:xmm0          00011110000000000000001000000100
Out     b:xmm0          00111100000000000000010000001000
Out     b:xmm0          01111000000000000000100000010000
Out     b:xmm0          11110000000000000001000000100000
Out     b:xmm0          11100000000000000010000001000000
Out     b:xmm0          11000000000000000100000010000000
Out     b:xmm0          10000000000000001000000100000001
Out     b:xmm0          00000000000000010000001000000011
Out     b:xmm0          00000000000000100000010000000110
Out     b:xmm0          00000000000001000000100000001100
Out     b:xmm0          00000000000010000001000000011000
Out     b:xmm0          00000000000100000010000000110000
Out     b:xmm0          00000000001000000100000001100000
Out     b:xmm0          00000000010000001000000011000001
Out     b:xmm0          00000000100000010000000110000010
Out     b:xmm0          00000001000000100000001100000100
Out     b:xmm0          00000010000001000000011000001000
Out     b:xmm0          00000100000010000000110000010000
Out     b:xmm0          00001000000100000001100000100000
Out     b:xmm0          00010000001000000011000001000000
Out     b:xmm0          00100000010000000110000010000000
Out     b:xmm0          01000000100000001100000100000001
Out     b:xmm0          10000001000000011000001000000010
Out     b:xmm0          00000010000000110000010000000101
Out     b:xmm0          00000100000001100000100000001010
Out     b:xmm0          00001000000011000001000000010100
Out     b:xmm0          00010000000110000010000000101000
Out     b:xmm0          00100000001100000100000001010000
Out     b:xmm0          01000000011000001000000010100000
Out     b:xmm0          10000000110000010000000101000001
Out     b:xmm0          00000001100000100000001010000011
Out     b:xmm0          00000011000001000000010100000110
Out     b:xmm0          00000110000010000000101000001100
Out     b:xmm0          00001100000100000001010000011000
Out     b:xmm0          00011000001000000010100000110000
Out     b:xmm0          00110000010000000101000001100000
Out     b:xmm0          01100000100000001010000011000000
Out     b:xmm0          11000001000000010100000110000001
Out     b:xmm0          10000010000000101000001100000011
Out     b:xmm0          00000100000001010000011000000111
Out     b:xmm0          10000010000000101000001100000011
Out     b:xmm0          11000001000000010100000110000001
Out     b:xmm0          01100000100000001010000011000000
Out     b:xmm0          00110000010000000101000001100000
Out     b:xmm0          00011000001000000010100000110000
Out     b:xmm0          00001100000100000001010000011000
Out     b:xmm0          00000110000010000000101000001100
Out     b:xmm0          00000011000001000000010100000110
Out     b:xmm0          00000001100000100000001010000011
Out     b:xmm0          10000000110000010000000101000001
Out     b:xmm0          01000000011000001000000010100000
Out     b:xmm0          00100000001100000100000001010000
Out     b:xmm0          00010000000110000010000000101000
Out     b:xmm0          00001000000011000001000000010100
Out     b:xmm0          00000100000001100000100000001010
Out     b:xmm0          00000010000000110000010000000101
Out     b:xmm0          10000001000000011000001000000010
Out     b:xmm0          01000000100000001100000100000001
Out     b:xmm0          00100000010000000110000010000000
Out     b:xmm0          00010000001000000011000001000000
Out     b:xmm0          00001000000100000001100000100000
Out     b:xmm0          00000100000010000000110000010000
Out     b:xmm0          00000010000001000000011000001000
Out     b:xmm0          00000001000000100000001100000100
Out     b:xmm0          00000000100000010000000110000010
Out     b:xmm0          00000000010000001000000011000001
Out     b:xmm0          00000000001000000100000001100000
Out     b:xmm0          00000000000100000010000000110000
Out     b:xmm0          00000000000010000001000000011000
Out     b:xmm0          00000000000001000000100000001100
Out     b:xmm0          00000000000000100000010000000110
Out     b:xmm0          00000000000000010000001000000011
Out     b:xmm0          10000000000000001000000100000001
Out     b:xmm0          11000000000000000100000010000000
Out     b:xmm0          11100000000000000010000001000000
Out     b:xmm0          11110000000000000001000000100000
Out     b:xmm0          01111000000000000000100000010000
Out     b:xmm0          00111100000000000000010000001000
Out     b:xmm0          00011110000000000000001000000100
Out     b:xmm0          00001111000000000000000100000010
Out     b:xmm0          00000111100000000000000010000001
Out     b:xmm0          10000011110000000000000001000000
Out     b:xmm0          11000001111000000000000000100000
Out     b:xmm0          11100000111100000000000000010000
Out     b:xmm0          01110000011110000000000000001000
Out     b:xmm0          00111000001111000000000000000100
Out     b:xmm0          00011100000111100000000000000010
Out     b:xmm0          00001110000011110000000000000001
Out     b:xmm0          10000111000001111000000000000000
Out     b:xmm0          01000011100000111100000000000000
Out     b:xmm0          10100001110000011110000000000000
Out     b:xmm0          11010000111000001111000000000000
Out     b:xmm0          01101000011100000111100000000000
Out     b:xmm0          00110100001110000011110000000000
Out     b:xmm0          00011010000111000001111000000000
Out     b:xmm0          00001101000011100000111100000000
Out     b:xmm0          00000110100001110000011110000000
Out     b:xmm0          00000011010000111000001111000000
Out     b:xmm0          10000001101000011100000111100000
Out     b:xmm0          11000000110100001110000011110000
Out     b:xmm0          01100000011010000111000001111000
Out     b:xmm0          00110000001101000011100000111100
Out     b:xmm0          00011000000110100001110000011110
Out     b:xmm0          00001100000011010000111000001111
Out     b:xmm0          10000110000001101000011100000111
Out     b:xmm0          11000011000000110100001110000011
Out     b:xmm0          01100001100000011010000111000001
Out     b:xmm0          10110000110000001101000011100000
Out     b:xmm0          01011000011000000110100001110000
Out     b:xmm0          00101100001100000011010000111000
Out     b:xmm0          00010110000110000001101000011100
Out     b:xmm0          00001011000011000000110100001110
Out     b:xmm0          00000101100001100000011010000111
Out     b:xmm0          10000010110000110000001101000011
Out     b:xmm0          01000001011000011000000110100001
Out     b:xmm0          10100000101100001100000011010000
Out     b:xmm0          01010000010110000110000001101000
Out     b:xmm0          00101000001011000011000000110100
Out     b:xmm0          00010100000101100001100000011010
Out     b:xmm0          00001010000010110000110000001101
Out     b:xmm0          10000101000001011000011000000110
Out     b:xmm0          01000010100000101100001100000011
Out     b:xmm0          00100001010000010110000110000001
Out     b:xmm0          10010000101000001011000011000000
Out     b:xmm0          01001000010100000101100001100000
Out     b:xmm0          00100100001010000010110000110000
Out     b:xmm0          00010010000101000001011000011000
Out     b:xmm0          00001001000010100000101100001100
Out     b:xmm0          00000100100001010000010110000110
Out     b:xmm0          00000010010000101000001011000011
Out     b:xmm0          00000001001000010100000101100001
Out     b:xmm0          10000000100100001010000010110000
Out     b:xmm0          01000000010010000101000001011000
Out     b:xmm0          00100000001001000010100000101100
Out     b:xmm0          00010000000100100001010000010110
Out     b:xmm0          00001000000010010000101000001011
Out     b:xmm0          10000100000001001000010100000101
Out     b:xmm0          11000010000000100100001010000010
Out     b:xmm0          11100001000000010010000101000001
Out     b:xmm0          01110000100000001001000010100000
Out     b:xmm0          00111000010000000100100001010000
Out     b:xmm0          00011100001000000010010000101000
Out     b:xmm0          00001110000100000001001000010100
Out     b:xmm0          00000111000010000000100100001010
Out     b:xmm0          00000011100001000000010010000101
Out     b:xmm0          10000001110000100000001001000010
Out     b:xmm0          11000000111000010000000100100001
Out     b:xmm0          01100000011100001000000010010000
Out     b:xmm0          00110000001110000100000001001000
Out     b:xmm0          00011000000111000010000000100100
Out     b:xmm0          00001100000011100001000000010010
Out     b:xmm0          00000110000001110000100000001001
Out     b:xmm0          10000011000000111000010000000100
Out     b:xmm0          01000001100000011100001000000010
Out     b:xmm0          10100000110000001110000100000001
Out     b:xmm0          01010000011000000111000010000000
Out     b:xmm0          00101000001100000011100001000000
Out     b:xmm0          00010100000110000001110000100000
Out     b:xmm0          00001010000011000000111000010000
Out     b:xmm0          00000101000001100000011100001000
Out     b:xmm0          00000010100000110000001110000100
Out     b:xmm0          00000001010000011000000111000010
Out     b:xmm0          10000000101000001100000011100001
Out     b:xmm0          01000000010100000110000001110000
Out     b:xmm0          00100000001010000011000000111000
Out     b:xmm0          00010000000101000001100000011100
Out     b:xmm0          00001000000010100000110000001110
Out     b:xmm0          00000100000001010000011000000111
Title: Re: Rotate Bits SSE
Post by: guga on April 29, 2020, 07:24:39 AM
Hi Scarmatil

yes...Your algo works like a charm. Rotating 1 bit at a time.

Can it also work for 64 bits inside a XMM also rotating 1 bit at a time, but restricted to the low and half Qwords of the 128 bits ?

Maybe this could help better optimizing this.

About the SHA3 , i built one biased  to these ones:

http://gauss.ececs.uc.edu/Courses/c6053/lectures/Hashing/sha3
https://github.com/magurosan/sha3-odzhan

The second link, didn´t worked as expected and it was very slow, so i had to create a variation of it biased on both links to make it work properly.

I´ll post here the test i made containing the embeded source (RosAsm syntax) and also i´ll try to make one for masm syntax as well to make it easier for people read.


Title: Re: Rotate Bits SSE
Post by: Siekmanski on April 29, 2020, 08:17:30 AM
 :biggrin:
First I was "PSHUFD" with Magnus, and now with Scarmatil.   :thumbsup:

Code: [Select]
    ;rotate64left 1 bit
    movq    xmm0,qword ptr [Number64]
    movq    xmm1,xmm0
    psllq   xmm0,1 
    psrlq   xmm1,63
    pxor    xmm0,xmm1

    ;rotate64right 1 bit
    movq    xmm0,qword ptr [Number64]
    movq    xmm1,xmm0
    psllq   xmm0,63
    psrlq   xmm1,1
    pxor    xmm0,xmm1
Title: Re: Rotate Bits SSE
Post by: guga on April 29, 2020, 08:47:52 AM
:biggrin:
First I was "PSHUFD" with Magnus, and now with Scarmatil.   :thumbsup:

Ooooppppsss... Sorry  :bgrin: :bgrin: :bgrin: :bgrin: :bgrin: Siekmanski. :greensml: :greensml: :greensml: I was with RosAsm opene3d and one of the routines was made by an former contributor called Scarmatil :greensml: :greensml: :greensml: :greensml: :greensml:

Tks for the code. I´l take a look and see if i can adapt it. I´m also trying to port my SHA3 cypher to masm, before i upload both versions.
Title: Re: Rotate Bits SSE
Post by: Siekmanski on April 29, 2020, 08:51:54 AM
 :thumbsup:
Title: Re: Rotate Bits SSE
Post by: mineiro on April 29, 2020, 09:59:52 PM
I ran some tests and in the 64-bit long mode, common instructions are more efficient than SSE to accomplish this goal.
As an example:
Code: [Select]
;rotate 1 bit to left
mov rax,9090909090909090h ;64 bits number
rcl rax,1
adc rax,0
Title: Re: Rotate Bits SSE
Post by: Siekmanski on April 29, 2020, 11:16:31 PM
Hi mineiro,

As far as I understand and observed, I don't see possibilities for parallel execution speed optimizations in guga's routine. ( could be I missed something... )
Most are 64 bit calculations with wide spreaded reads and writes.
In 64 bit long mode, your solution fits his routine best.
It's a pity we don't have rotate bits instructions in SIMD.  :sad:
Title: Re: Rotate Bits SSE
Post by: mineiro on April 30, 2020, 12:47:04 AM
You are right, your right and left solution was elegant and runs fast.
What I did was take a snippet of your code and adapt it to what I had done, the gain is minimal to say the least. And there are differences between machines, so the measurements can be different.

Some results to "rept 10000":
your rot
Processor 0
     Clock   Core cyc   Instruct       Uops    BrTaken
    164772     162016     210012     243834      30000
    140896     152612     210001     240022      30000
    140874     152643     210001     240013      30000
    140954     152642     210001     240013      30000
Code: [Select]
align 16
rot3 proc
mov ecx,[counter]
movdqu xmm0,oword ptr [number]
.if ecx >= 64
pshufd  xmm0,xmm0,01001110b
sub ecx,64
.endif
movdqu xmm1,xmm0
movd xmm3,ecx
psllq xmm0,xmm3
sub ecx,64
neg ecx
movd xmm4,ecx
psrlq xmm1,xmm4
pshufd  xmm1,xmm1,01001110b
por xmm0,xmm1
ret
rot3 endp
Processor 0
     Clock   Core cyc   Instruct       Uops    BrTaken
    180560     174677     190012     233851      20000
    135620     146905     190001     230022      20000
    135610     146932     190001     230019      20000
    135648     146918     190001     230019      20000

Code: [Select]
align 16
rot6 proc
mov ecx,counter
mov rax,[number+8]
mov rdx,[number]
.if ecx >= 64
    mov rbx,rax
    mov rax,rdx
    mov rdx,rbx
    sub ecx,64
.endif
xor ebx,ebx
shld rbx,rax,cl
shld rax,rdx,cl
shl rdx,cl
or rdx,rbx
ret
rot6 endp
Processor 0
     Clock   Core cyc   Instruct       Uops    BrTaken
    148766     144041     180012     283841      20000
    122292     132467     180001     280025      20000
    122310     132504     180001     280019      20000
    122304     132506     180001     280020      20000
Title: Re: Rotate Bits SSE
Post by: daydreamer on April 30, 2020, 04:59:22 AM
You are right, your right and left solution was elegant and runs fast.
What I did was take a snippet of your code and adapt it to what I had done, the gain is minimal to say the least. And there are differences between machines, so the measurements can be different.

Some results to "rept 10000":
your rot
Processor 0
     Clock   Core cyc   Instruct       Uops    BrTaken
    164772     162016     210012     243834      30000
    140896     152612     210001     240022      30000
    140874     152643     210001     240013      30000
    140954     152642     210001     240013      30000
Code: [Select]
align 16
rot3 proc
mov ecx,[counter]
movdqu xmm0,oword ptr [number]
.if ecx >= 64
pshufd  xmm0,xmm0,01001110b
sub ecx,64
.endif
movdqu xmm1,xmm0
movd xmm3,ecx
psllq xmm0,xmm3
sub ecx,64
neg ecx
movd xmm4,ecx
psrlq xmm1,xmm4
pshufd  xmm1,xmm1,01001110b
por xmm0,xmm1
ret
rot3 endp
Processor 0
     Clock   Core cyc   Instruct       Uops    BrTaken
    180560     174677     190012     233851      20000
    135620     146905     190001     230022      20000
    135610     146932     190001     230019      20000
    135648     146918     190001     230019      20000

Code: [Select]
align 16
rot6 proc
mov ecx,counter
mov rax,[number+8]
mov rdx,[number]
.if ecx >= 64
    mov rbx,rax
    mov rax,rdx
    mov rdx,rbx
    sub ecx,64
.endif
xor ebx,ebx
shld rbx,rax,cl
shld rax,rdx,cl
shl rdx,cl
or rdx,rbx
ret
rot6 endp
Processor 0
     Clock   Core cyc   Instruct       Uops    BrTaken
    148766     144041     180012     283841      20000
    122292     132467     180001     280025      20000
    122310     132504     180001     280019      20000
    122304     132506     180001     280020      20000
the below code have much depency on earlier instruction
Code: [Select]
xor ebx,ebx
shld rbx,rax,cl
shld rax,rdx,cl
shl rdx,cl
or rdx,rbx
maybe if unroll it like this breaks depency
and if cpu has several shift execution units it starts working on 2shifts at a time
forgive me if r9,r10,r11 is wrong register choice,because I am not experienced with 64bit asm
Code: [Select]
xor ebx,ebx
xor r9,r9
shld rbx,rax,cl
shld r9,r10,cl
shld rax,rdx,cl
shld r10,r11,cl
shl rdx,cl
shl r11,cl

or rdx,rbx
or r11,r9
on the other hand there are some instructions that are starved on execution units,so it has no use unroll those except its only few cpus that has slow shift

because its suppose to be part of bigger code,macro it instead of proc could gain few cycles
Title: Re: Rotate Bits SSE
Post by: mineiro on April 30, 2020, 06:03:54 AM
hello sir daydreamer;
No problems regarding the calling convention and register to be preserved, they are just tests; I didn't use the usual Windows or Linux convention myself. If others join so this can be a problem.

I understood what you said about register related dependency, execution out of order. I didn't find a viable way to solve the dependency to rotate 128 bits.
The code you posted is useful if you need to rotate 256 bits, then using 2 blocks of data and more registers is valid. The example I have in mind is to rotate 65 bits and then to rotate 129 bits ,193 bits and 255, assuming 256 bits in total.
An obstacle will arise because the counter of the shld or shl instruction is up to N bits. In this case I assume that using instructions greater than SSE2 may be feasible.
From what I read the compatible instruction set running on any 64-bit O.S. (x86-64) is SSE2.
Thanks for your comments and sugestion.
Title: Re: Rotate Bits SSE
Post by: guga on April 30, 2020, 08:13:00 AM
Hi Guys, i finished porting the SHA3 algorithm to masm, i just need to test what i have done wrong with the porting because the result is wrong. I´ll open another post containingt the algorithm as soon i fix the masm version. This is the algorithm from where i´m using to test this SSE rotation functions to speed up a little bit. And also, creating some functions related to Rotate and Shift 128 (or more bits) in other situations can also be helpful to other things :)

Title: Re: Rotate Bits SSE
Post by: jj2007 on April 30, 2020, 02:37:21 PM
Another approach, no idea how fast it is:

include \masm32\MasmBasic\MasmBasic.inc
.data
Number         OWORD 10000110111111101111111011111110100011101111111011111110111111101001111011111110111111101111111010111110111111101111111011111110y
.code
RotateLeft128 proc pNumber      ; single shift left
  mov eax, pNumber
  movups xmm0, OWORD ptr [eax]
  psllq xmm0, 1                 ; shift left two qwords by one bit
  movsx edx, byte ptr [eax+7]   ; get sign of low qword
  test sbyte ptr [eax+15], 128  ; get sign of high qword
  movups OWORD ptr [eax], xmm0
  .if Sign?
        or byte ptr [eax], 1    ; rotate sign of high qword in
  .endif
  test edx, edx
  .if Sign?
        or byte ptr [eax+8], 1  ; set bit 0 of low qword
  .endif
  ret
RotateLeft128 endp
  Init
  Cls
  deb 4, "in", b:Number:4       ; show number as binary, 4 dwords
  invoke RotateLeft128, offset Number
  deb 4, "out", b:Number:4
EndOfCode


Output:
Code: [Select]
in      b:Number:4      10000110111111101111111011111110 10001110111111101111111011111110 10011110111111101111111011111110 10111110111111101111111011111110
out     b:Number:4      00001101111111011111110111111101 00011101111111011111110111111101 00111101111111011111110111111101 01111101111111011111110111111101
Title: Re: Rotate Bits SSE
Post by: daydreamer on April 30, 2020, 07:01:30 PM

The code you posted is useful if you need to rotate 256 bits, then using 2 blocks of data and more registers is valid. The example I have in mind is to rotate 65 bits and then to rotate 129 bits ,193 bits and 255, assuming 256 bits in total.
An obstacle will arise because the counter of the shld or shl instruction is up to N bits. In this case I assume that using instructions greater than SSE2 may be feasible.
From what I read the compatible instruction set running on any 64-bit O.S. (x86-64) is SSE2.
Thanks for your comments and sugestion.
specific case of rotate bits ala byte can be done
look below on RGB shuffle alternative instead of bitshift channels
http://masm32.com/board/index.php?topic=7687.0 (http://masm32.com/board/index.php?topic=7687.0)
Title: Re: Rotate Bits SSE
Post by: jj2007 on April 30, 2020, 08:44:16 PM
Compliments to Marinus :thumbsup:
Code: [Select]
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz
in      b:MyOword:4d    10000110111111101111111011111110100011101111111011111110111111101001111011111110111111101111111010111110111111101111111011111110
out J   b:MyOword:4d    00001101111111011111110111111101000111011111110111111101111111010011110111111101111111011111110101111101111111011111110111111101
out M   b:MyOword:4d    00011011111110111111101111111010001110111111101111111011111110100111101111111011111110111111101011111011111110111111101111111010
259 ms for Marinus
555 ms for Jochen
Title: Re: Rotate Bits SSE
Post by: mineiro on April 30, 2020, 09:33:06 PM
sir jj code results in my machine:
Processor 0
     Clock   Core cyc   Instruct       Uops    BrTaken
    336500     345020     140012     217061      30000
    306974     332504     140001     213686      30000
    305980     331467     140001     213544      30000
    307002     332557     140001     213615      30000

sir daydreamer, I'm reading your code and will check what can be done. Thanks.
Title: Re: Rotate Bits SSE
Post by: jj2007 on April 30, 2020, 10:23:27 PM
sir jj code results in my machine:
Processor 0
     Clock   Core cyc   Instruct       Uops    BrTaken
    336500     345020     140012     217061      30000
    306974     332504     140001     213686      30000
    305980     331467     140001     213544      30000
    307002     332557     140001     213615      30000

Can you explain these numbers?
Title: Re: Rotate Bits SSE
Post by: Siekmanski on April 30, 2020, 10:24:04 PM
 :cool:
Code: [Select]
Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz
in      b:MyOword:4d    10000110111111101111111011111110100011101111111011111110111111101001111011111110111111101111111010111110111111101111111011111110
out J   b:MyOword:4d    00001101111111011111110111111101000111011111110111111101111111010011110111111101111111011111110101111101111111011111110111111101
out M   b:MyOword:4d    00011011111110111111101111111010001110111111101111111011111110100111101111111011111110111111101011111011111110111111101111111010
226 ms for Marinus
418 ms for Jochen
Title: Re: Rotate Bits SSE
Post by: mineiro on May 01, 2020, 01:28:29 AM
Can you explain these numbers?
Thats better explained in agner fog blog (testp.pdf). I'm using PMCTestB64.asm file and assembling with uasm in linux.
Basically clock, core cycles, instructions, micro operations and branchs taken.

Clock is the clock count measured with the RDTSC instruction.
Core cyc is the clock count measured with the "core clock cycles" counter. The CPU can change its core frequency dynamically in order to save power. ...

7.14 What do I need the performance monitor counters for?
These counters are useful for counting instructions, micro-operations, cache misses, branch mispredictions and other events that are useful for testing program performance.
Title: Re: Rotate Bits SSE
Post by: jj2007 on May 01, 2020, 01:38:01 AM
Yes, sure, I know all that, but how does it relate to the algos tested?
Title: Re: Rotate Bits SSE
Post by: mineiro on May 01, 2020, 01:45:00 AM
good question, I don't know.
305980 / 140874 = 2,172
555ms  /  259ms = 2,1429
418ms  /  226ms = 1,849557522
Title: Re: Rotate Bits SSE
Post by: Siekmanski on May 01, 2020, 04:10:22 AM
 :biggrin:
Time is the only constant factor. ( At least at sea level )
Different CPU architectures have their own pros and cons.
Title: Re: Rotate Bits SSE
Post by: mineiro on May 01, 2020, 08:53:39 AM
good question, I don't know.
305980 / 140874 = 2,172
555ms  /  259ms = 2,1429
418ms  /  226ms = 1,849557522

Code: [Select]
mineiro@assembly:~/asm$ cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 94
model name      : Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz
stepping        : 3
microcode       : 0xcc
cpu MHz         : 800.000
cache size      : 8192 KB
physical id     : 0
siblings        : 8
core id         : 0
cpu cores       : 4
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 22
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch xsaveopt invpcid_single ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx smap dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp md_clear flush_l1d
bogomips        : 6816.06
clflush size    : 64
cache_alignment : 64
address sizes   : 39 bits physical, 48 bits virtual
power management:

rept 10000, constant_tsc and cpu MHz : 800.000,
305980 / 800000000 = 0,000382475 seconds
140874 / 800000000 = 0,0001760925 seconds

382475 / 176092  == 2,17201803603
Title: Re: Rotate Bits SSE
Post by: guga on May 01, 2020, 04:57:17 PM
A few rapid tricks concerning rotating bits.

For faster rotation of specific bits we can also use pshufd mnemonic (Tks Siekmanski for the tip) :thumbsup: :thumbsup: :thumbsup:

To do that, i create a table of equates of all possible 256 values used in pshufd. For rotation i created these:
Code: [Select]
[SSE_ROTATE_RIGHT_96BITS   147]    ; Dwords in xmm are copied from 1234 to 2341 ordering. Therefore, it is rotating right 96 bits. (Which is the same as rotating left 32 bits)
[SSE_ROTATE_LEFT_32BITS 147]        ; the same things and result as SSE_ROTATE_RIGHT_96BITS

[SSE_ROTATE_LEFT_96BITS     57]      ; Dwords in xmm are copied from 1234 to 4123 ordering. Therefore, it is rotating left 96 bits. (Which is the same as rotating right 32 bits)
[SSE_ROTATE_RIGHT_32BITS  57]      ; the same things and result as SSE_ROTATE_LEFT_96BITS

for 64 bits, we have a simple Swaping of both Qwords inside xmm, therefore we are rotating (left and right) 64 bits. Rotating left and right 64 bits are the same thing

[SSE_SWAP_QWORDS 78]        ; Dwords in xmm are copied from 1234 to 3412 ordering. Therefore, it is rotating right and left left 64 bits.
[SSE_ROTATE_LEFT_64BITS 78]      ; the same things and result as SSE_SWAP_QWORDS, SSE_ROTATE_RIGHT_64BITS, SSE_ROTATE_64BITS
[SSE_ROTATE_RIGHT_64BITS 78]    ; same as above
[SSE_ROTATE_64BITS 78]                ; same as above


We can also use pshufd to invert the order of dwords simple as:
[SSE_INVERT_DWORDS 27]      ; Dwords in xmm are copied from 1234 to 4321 ordering.



Example of usage:

pshufd xmm1 xmm0 SSE_INVERT_DWORDS ; xmm1 will contains the inverted order of dwords in xmm0
pshufd xmm0 xmm0 SSE_INVERT_DWORDS ; inverted the order of dwords in xmm0

pshufd xmm1 xmm0 SSE_SWAP_QWORDS ; swap qwords and copy them from xmm0 to xmm1
pshufd xmm0 xmm0 SSE_SWAP_QWORDS ; swap qwords in xmm0

pshufd xmm1 xmm0 SSE_ROTATE_RIGHT_96BITS ; rotate 96 bits left in xmm0 and copy them onto xmm1

etc etc

I´m finishing building the labels for all these equates related to pshufd and will put them on another thread. Since pshufd is a bit confusing in having to remember all the bits location etc, making equates with the proper label names is better to understand what to do and also, build a set of macros if needed (for rosasm, masm, nasm, fasm etc)


Title: Re: Rotate Bits SSE
Post by: Siekmanski on May 01, 2020, 05:42:28 PM
You can use this MACRO for the shuffle positions,

Shuffle MACRO V0,V1,V2,V3
    EXITM %((V0 shl 6) or (V1 shl 4) or (V2 shl 2) or (V3))
ENDM

    pshufd  xmm0,xmm0,Shuffle(1,0,3,2)
    is equal to:
    pshufd  xmm0,xmm0,01001110b
Title: Re: Rotate Bits SSE
Post by: guga on May 02, 2020, 10:38:26 AM
You can use this MACRO for the shuffle positions,

Shuffle MACRO V0,V1,V2,V3
    EXITM %((V0 shl 6) or (V1 shl 4) or (V2 shl 2) or (V3))
ENDM

    pshufd  xmm0,xmm0,Shuffle(1,0,3,2)
    is equal to:
    pshufd  xmm0,xmm0,01001110b

Great !!! Thanks, siekmanski. I[´ll adapt it to RosAsm and make some tests :)  using macros is much much easier to identify how to use this SSE mnemonic:)
Title: Re: Rotate Bits SSE
Post by: guga on May 04, 2020, 06:37:15 AM
Hi Siekmanski

About the macro, how to make ones for pshufhw, pshuflw, pshufw, shufpd, shufps ?
Title: Re: Rotate Bits SSE
Post by: Siekmanski on May 04, 2020, 07:24:23 AM
The Shuffle MACRO can be used for all the 64/128bit shuffle instructions.
It sets the 8bit shuffle bitfield ( imm8 ) in the right order for each DWORD/REAL4 ( each DWORD/REAL4 has 2 bits per position (0-3) )
Each QWORD/REAL8 must be treated as 2 DWORD/REAL4 pairs.
The same for the 64bit shuffle, only then there are 2 position bits per WORD.

shuffle_instruction xmm0,xmm1,imm8
Title: Re: Rotate Bits SSE
Post by: guga on May 07, 2020, 06:15:34 PM
:biggrin:
First I was "PSHUFD" with Magnus, and now with Scarmatil.   :thumbsup:

Code: [Select]
    ;rotate64left 1 bit
    movq    xmm0,qword ptr [Number64]
    movq    xmm1,xmm0
    psllq   xmm0,1 
    psrlq   xmm1,63
    pxor    xmm0,xmm1

    ;rotate64right 1 bit
    movq    xmm0,qword ptr [Number64]
    movq    xmm1,xmm0
    psllq   xmm0,63
    psrlq   xmm1,1
    pxor    xmm0,xmm1

Hi Siekmanski. I tested the routine to rotate the qwords inside a xmm but it seems not working. On the High Qword Bit127 is not rotating to bit 64 and on LowQword Bit 63 is not rotating to bit 0

Ex:
For rotating left, say in xmm0 we have this (From bit 127 to 0):
10101000 10111010 10001001 10011100 00111101 01100100 11011101 10000100 11110111 01110110 01001010 00101111 10100010 11111111 11001111 11110100

So, the HiQword (Bit 127 to 64) is:
10101000 10111010 10001001 10011100 00111101 01100100 11011101 10000100

and the low qword is (Bit 63 to 0):
11110111 01110110 01001010 00101111 10100010 11111111 11001111 11110100


When i use the rotate64left routine, it turns onto:

New HiQword (Bit 127 to 64) is:
01010001 01110101 00010011 00111000 01111010 11001001 10111011 00001000

and the new low qword is (Bit 63 to 0):
11101110 11101100 10010100 01011111 01000101 11111111 10011111 11101000

When they should be:

New HiQword (Bit 127 to 64) is:
01010001 01110101 00010011 00111000 01111010 11001001 10111011 00001001

and the new low qword is (Bit 63 to 0):
11101110 11101100 10010100 01011111 01000101 11111111 10011111 11101001
Title: Re: Rotate Bits SSE
Post by: guga on May 07, 2020, 06:57:05 PM
I guess i found. It seems that replacing

 movq    xmm1,xmm0

with
 movdqu    xmm1,xmm0

Should do the trick, right ? :)


And how to rotate N bits and not only 1 ? I do like this ?

Code: [Select]
Proc rot64left:
    Arguments @Count

    mov eax D@Count
    mov ecx 64             ; 64 bits Barrier.
    movd xmm3 ecx                ; 64 bit Shift-Range.
    movdqu xmm1 xmm0
    movd xmm2 eax
    psubq xmm3 xmm2               ; Calculate Shift-Range.
    psllq xmm0 xmm2
    psrlq xmm1 xmm3
    pxor xmm0 xmm1
 
EndP
Title: Re: Rotate Bits SSE
Post by: Siekmanski on May 07, 2020, 10:30:38 PM
Hi guga,

Here is the logic of N bits rotation,

N = 7 
movdqu xmm1,xmm0
psllq  xmm0,N
psrlq  xmm1,64-N
pxor   xmm0,xmm1

the other direction

N = 7 
movdqu xmm1,xmm0
psllq  xmm0,64-N
psrlq  xmm1,N
pxor   xmm0,xmm1
Title: Re: Rotate Bits SSE
Post by: guga on May 08, 2020, 05:21:19 AM
Great :)

Tks a lot, Siekmanski

I´m trying to optimize that Sha3 Algorithm and it do uses those rotate left and right (and also shift) routines to MMX. I´m trying to port them to SSE2.

So far i succeeded to partially convert the MMX to SSE on the Theta routine inside the keccakf function, but still it have a long way to fully understand this algorithm.One good thing is that if i succed to port, then maybe (just a Huge maybe) there should have a way to reverse it back.

Some of the routines inside the keccakf are basically a copy of the data belonging to the state of the words. For example, i found that on the "Chi" routine, all the second "for" is simply a binary copy. Therefore, this routine is unnecessary and can be optimized

Code: [Select]
         Chi
         for (j = 0; j < 25; j += 5) {
           for (i = 0; i < 5; i++) <----------------- This is a simple copy from st (state) to bc variable.
             bc[i] = st[j + i];       <-----------------
           for (i = 0; i < 5; i++)
             st[j + i] ^= (~bc[(i+1)%5]) & bc[(i+2)%5];
         }

And even this copy of chunks maybe removed later. The main problem of this Algorithm is that it is hard to follow and understand what it is doing, but if i succeed to optimize and simplify it, then maybe this can be reverted (If not totally, at least part of it could be, i hope).

Why revert ? Well, because the algorithm seems to act more like an encoder then a hash per se. So, if (and it is a huge huge huge if) this can be reversed, then we could, in theory, use it to compress whatever file, text etc etc is needed forcing the data to be limited to a 256 or 512 etc etc bytes long. (Kind of impossible, i know, but the behavior of this parts of the routines i´m trying to work with, seems to act more like an encoder)
Title: Re: Rotate Bits SSE
Post by: daydreamer on June 17, 2020, 04:32:20 AM
so if I want to use SSE shift/rotate in combination with a winapi calls on modern win version,are there any safe XMM regs,similar like some gp registers are or I need to save/restore everytime in the loop(s)?