Hi Guys
I made a couple of routines to rotate bits right and left using MMX instructions, but I'm failing to do the same for SSE2. Can someone help? Also, can these be optimized?
The MMX routines are as follows:
Rotate Left
rotl64 - Rotate the bits in the MM0 register left N times.
Parameters:
Count (Input). The number of bit positions to rotate the register left.
Return Value:
The rotated value is stored in the MM0 register. (See also the remarks below.)
Remarks:
The function uses the MM0 register to perform the rotation. So, on input, MM0 must already be filled.
Example of usage:
[tttt: Q$ 1] ; A qword variable that holds "1" as its value
movq MM0 Q$tttt
call rotl64 1 ; rotate the value left by 1 bit
Proc rotl64:
Arguments @Count
mov eax D@Count
movq MM2 MM0
movd MM3 eax
sub eax 64
neg eax
movd MM4 eax
psllq MM0 MM3
psrlq MM2 MM4
por MM0 MM2
EndP
Rotate Right
rotr64 - Rotate the bits in the MM0 register right N times.
Parameters:
Count (Input). The number of bit positions to rotate the register right.
Return Value:
The rotated value is stored in the MM0 register. (See also the remarks below.)
Remarks:
The function uses the MM0 register to perform the rotation. So, on input, MM0 must already be filled.
[tttt: Q$ 1] ; A qword variable that holds "1" as its value
movq MM0 Q$tttt
call rotr64 1 ; rotate the value right by 1 bit
Proc rotr64:
Arguments @Count
mov eax D@Count
movq MM2 MM0
movd MM3 eax
sub eax 64
neg eax
movd MM4 eax
psrlq MM0 MM3
psllq MM2 MM4
por MM0 MM2
EndP
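For checking the two MMX routines against known-good values, a minimal C reference sketch may help. The names rotl64_ref and rotr64_ref are made up for this example. One difference is assumed on purpose: the sketch reduces the count modulo 64, while the MMX version returns 0 for counts above 64, because psllq/psrlq clear the register when the count exceeds 63.

```c
#include <stdint.h>

/* Rotate a 64-bit value left by n bits. The count is reduced modulo 64,
   and the back-shift uses (64 - n) & 63 so that n == 0 never shifts by 64,
   which would be undefined behaviour in C. */
uint64_t rotl64_ref(uint64_t x, unsigned n)
{
    n &= 63;
    return (x << n) | (x >> ((64 - n) & 63));
}

/* Rotate a 64-bit value right by n bits, same conventions as above. */
uint64_t rotr64_ref(uint64_t x, unsigned n)
{
    n &= 63;
    return (x >> n) | (x << ((64 - n) & 63));
}
```

Rotating left and then right by the same count should give back the input, which is a quick sanity check for any assembly version as well.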
Now... how to do a similar thing (left and right) for 128 bits? I tried the one below, but it is not working :(
Proc rot128:
Arguments @Count
mov eax D@Count
movdqu XMM2 XMM0
movd XMM3 eax
sub eax 128
neg eax
movd XMM4 eax
psllq XMM0 XMM3
psrlq XMM2 XMM4
por XMM0 XMM2
EndP
Quote from: guga on April 27, 2020, 06:30:22 AM
Now...How to do a similar thing (left and right) for 128 Bits ? I tried this one below, but it is not working :(
You should use PSLLDQ (double quadword) instead of the PSLLQ (64-bit quadword only) version.
Quote from: daydreamer on April 27, 2020, 03:55:20 PM
you should use PSLLDQ (double quadword) instead of PSLLQ (only 64bit quadword) versions instead
Marinus, Magnus, PSLLDQ shifts bytes while PSLLQ shifts bits.
The version below is not SSE2. Sorry, I didn't check for optimization or other instructions.
mov rax,0123456789abcdefh ;high
mov rdx,0123456789abcdefh ;low
mov ecx,8 ;count
shld rax,rdx,cl ;shift left rotate higher
shl rdx,cl ;shift left rotate lower
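Shifting left by N bits is the same as multiplying the number by 2 N times; a rotate additionally feeds the bits that fall out at the top back in at the bottom. Depending on the approach, you may need two variables for the carry, even to build an rcl or rol version at each step.
Maybe this can be useful for others who try.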
---edit---
I assembled your source file and now understand. It's something like:
mov rax,0f123456789abcdefh ;high
mov rdx,0123456789abcdefh ;low
mov rbx,0
mov ecx,8 ;count
shld rbx,rax,cl ;rbx=high bits of rax
shld rax,rdx,cl ;rotate n bits aggregating
shl rdx,cl ;shift left rotate lower, inserting zeros at right side
or rdx,rbx ;join, logical add
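The shld / shld / shl / or sequence above can be cross-checked with a small C model. This is only a sketch under assumptions: the type u128 and the name rotl128_small are invented here, hi stands for rax and lo for rdx, and the count is limited to 0..63 because the CPU masks CL to 6 bits for 64-bit operands.

```c
#include <stdint.h>

typedef struct { uint64_t lo, hi; } u128;   /* hypothetical helper type */

/* Rotate a 128-bit value left by n bits, 0 <= n <= 63. */
u128 rotl128_small(u128 v, unsigned n)
{
    u128 r;
    uint64_t wrap;

    n &= 63;                 /* same masking the CPU applies to CL */
    if (n == 0)
        return v;            /* avoids an undefined shift by 64 below */

    wrap = v.hi >> (64 - n);                   /* shld rbx,rax,cl with rbx = 0 */
    r.hi = (v.hi << n) | (v.lo >> (64 - n));   /* shld rax,rdx,cl */
    r.lo = (v.lo << n) | wrap;                 /* shl rdx,cl then or rdx,rbx */
    return r;
}
```

With the values from the post (high = 0f123456789abcdefh, low = 0123456789abcdefh, count = 8) the model gives high = 23456789abcdef01h and low = 23456789abcdeff1h.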
Quote from: jj2007 on April 27, 2020, 08:24:14 PM
Quote from: daydreamer on April 27, 2020, 03:55:20 PM
you should use PSLLDQ (double quadword) instead of PSLLQ (only 64bit quadword) versions instead
Marinus, PSLLDQ shifts bytes while PSLLQ shifts bits.
Am I PSHUFD with Magnus? :biggrin:
hello sir guga,
This is a quick try; psllq deals with 2 qwords instead of 1 oword(128)
.data
align 32
number dq 0123456789abcdefh,0fedcba9876543210h
low_mask dq 0ffffffffffffffffh,0000000000000000h ;reversed because using qword
high_mask dq 0000000000000000h,0ffffffffffffffffh
.code
mov eax,4 ;count
movdqu xmm0,oword ptr [number] ;xmm0=fedcba98765432100123456789abcdef
movdqu xmm1,xmm0
movdqu xmm2,xmm0
pand xmm1,oword ptr [high_mask] ;xmm1=FEDCBA98765432100000000000000000
pand xmm2,oword ptr [low_mask] ;xmm2=00000000000000000123456789ABCDEF
movd xmm3,eax ;xmm3=counter
psllq xmm0,xmm3 ;xmm0=EDCBA98765432100123456789ABCDEF0
sub eax,64
neg eax
movd xmm4,eax ;xmm4=3c
psrlq xmm1,xmm4 ;xmm1=000000000000000F0000000000000000
psrlq xmm2,xmm4 ;xmm2=00000000000000000000000000000000
pxor xmm3,xmm3 ;zero
movhlps xmm3,xmm1 ;high part of 1 to lower part of 3
movlhps xmm3,xmm2 ;lower part of 2 to higher part of 3
;xmm3=0000000000000000000000000000000F
por xmm0,xmm3 ;concatenate
--edit---
I measured both procedures, and the first one, which doesn't use SSE, performs better on my machine.
Quote from: Siekmanski on April 27, 2020, 11:01:39 PM
Am I PSHUFD with Magnus? :biggrin:
Corrected, sorry :badgrin:
No worries, 'The mouth speaks what the heart is full of' :rofl:
Hi Mineiro, tks, but it's not working. Try rotating by 64 and 65 bits. The result will be 0.
number dq 0123456789abcdefh,0fedcba9876543210h
The input is:
0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_0111_0110_0101_0100_0011_0010_0001_0000_1000_1001_1010_1011_1100_1101_1110_1111
and the output should be:
Rotate 64 bits
1110_1100_1010_1000_0110_0100_0010_0001_0001_0011_0101_0111_1001_1011_1101_1110_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000
Rotate 65 bits
1101_1001_0101_0000_1100_1000_0100_0010_0010_0110_1010_1111_0011_0111_1011_1100_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_0001
Oops... edited, wrong :greensml:
Is that working for your needs or not, sir guga? I can change the code to fit your needs.
I suppose shl deals with N-1, so the maximum the counter can hold is 63. Well, now that you said that, I need to review the result with the counter being zero, 64 or above.
Hi Mineiro
No. It's not working. The goal is to rotate all the bits of a 128-bit value left and right. What the code was doing is shifting the bits (and only within 64 bits), which zeroes the low half of the 128 bits, rather than rotating all bits by XXX times.
Maybe this can work, please check. If this works we can deal with rotate right.
.data
align 32
number dq 3,0
low_mask dq 0ffffffffffffffffh,0000000000000000h ;reversed because using qword
high_mask dq 0000000000000000h,0ffffffffffffffffh
.code
mov eax,127 ;count range 0 to 128
movdqu xmm0,oword ptr [number]
movdqu xmm1,xmm0
.if eax >= 64
movhlps xmm0,xmm1 ;switch high and low
movlhps xmm0,xmm1
sub eax,64
.endif
movdqu xmm1,xmm0
movdqu xmm2,xmm0
pand xmm1,oword ptr [high_mask]
pand xmm2,oword ptr [low_mask]
movd xmm3,eax
psllq xmm0,xmm3
sub eax,64
neg eax
movd xmm4,eax
psrlq xmm1,xmm4
psrlq xmm2,xmm4
pxor xmm3,xmm3
movhlps xmm3,xmm1
movlhps xmm3,xmm2
por xmm0,xmm3
Cool challenge.
EDIT: Sorry, I misunderstood the question and posted the wrong code. :rolleyes:
I'll try again and if it works, post some code that does the job.
Quote from: guga on April 28, 2020, 06:56:11 AM
by XXX times.
obscene :rofl:
I figured out that it was not necessary to use masks, so I cleaned up the code.
mov eax,127 ;count range 0 to 128
movdqu xmm0,oword ptr [number]
.if eax >= 64
movdqu xmm1,xmm0
movhlps xmm0,xmm1 ;switch high and low
movlhps xmm0,xmm1
sub eax,64
.endif
movdqu xmm1,xmm0
movd xmm3,eax
psllq xmm0,xmm3
sub eax,64
neg eax
movd xmm4,eax
psrlq xmm1,xmm4
movhlps xmm3,xmm1
movlhps xmm3,xmm1
por xmm0,xmm3
Hi guga,
Here is my contribution, left and right rotation in 1 routine.
Number oword 000102030405060708090a0b0c0d0e0fh
NumRotateBits dd -4 ; number of bits to rotate ( range = -127 to +127 )
Rotate128Bits:
movdqu xmm0,oword ptr [Number]
mov ecx,64 ; 64 bits Barrier.
mov eax,NumRotateBits ; Number of bits to rotate.
test eax,eax ; Positive = Left and Negative = Right Bit Rotation
jns TestDirection ; Rotate Left or Right?
add eax,128 ; Correct the position.
TestDirection:
cmp eax,ecx
jb Test64bitBarrier
pshufd xmm0,xmm0,01001110b ; Swap High and Low 64 bits if the bit count >= 64
Test64bitBarrier:
and eax,63 ; Keep it within 64 bit.
movd xmm2,eax ; Bits to shift.
movd xmm3,ecx ; 64 bit Shift-Range.
movdqu xmm1,xmm0 ; Make a copy of the 128 bits.
psubq xmm3,xmm2 ; Calculate Shift-Range.
psllq xmm0,xmm2 ; Do the bit shifts to the left.
psrlq xmm1,xmm3 ; Do the bit shifts to the right.
pshufd xmm1,xmm1,01001110b ; Swap High and Low 64 bit.
pxor xmm0,xmm1 ; Glue the 128 bits together.
ret ; result in xmm0
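The same idea (fold a negative count into a left rotation, swap the halves for counts of 64 or more, then rotate by the remaining 0..63 bits) can be written as a C reference for testing. This is a sketch, not a translation of the routine: u128 and rot128_signed are names invented for the example, and the zero-count case is handled with a branch because a C shift by 64 is undefined, whereas psrlq simply returns zero.

```c
#include <stdint.h>

typedef struct { uint64_t lo, hi; } u128;   /* hypothetical helper type */

/* Rotate 128 bits: positive n rotates left, negative n rotates right.
   Intended range is -127..127, as in the routine above. */
u128 rot128_signed(u128 v, int n)
{
    unsigned c = (unsigned)n & 127;   /* for n < 0 this equals n + 128 */

    if (c >= 64) {                    /* 64-bit barrier: swap the halves */
        uint64_t t = v.lo;
        v.lo = v.hi;
        v.hi = t;
    }
    c &= 63;                          /* remaining count within 64 bits */
    if (c != 0) {
        uint64_t lo = (v.lo << c) | (v.hi >> (64 - c));
        uint64_t hi = (v.hi << c) | (v.lo >> (64 - c));
        v.lo = lo;
        v.hi = hi;
    }
    return v;
}
```

For Number = 000102030405060708090a0b0c0d0e0fh and a count of -4 the model gives f000102030405060708090a0b0c0d0e0h, which is the input rotated right by one hex digit.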
Thank you so much, Siekmanski.
Works like a charm.
This is for a routine I'm making to create a SHA3 hash. So far, the code and result work OK, but there are some minor issues and I'm trying to optimize it more. For example... how to optimize this and port it from MMX to SSE2?
; Where the argument State is an array of 25 qwords (so, 50 dwords in total) = KeccakState: Q$ 0 #25 and BC is an array of 5 qwords (10 dwords). [BC: Q$ 0 #5]
Proc keccakf_Theta::
Arguments @State
Uses esi, edi, ecx, ebx, edx
mov esi D@State
movups XMM0 X$esi
movups XMM1 X$esi+40 | xorps XMM0 XMM1
movups XMM2 X$esi+80 | xorps XMM0 XMM2
movups XMM2 X$esi+120 | xorps XMM0 XMM2
movups XMM2 X$esi+160 | xorps XMM0 XMM2
movups X$BC XMM0
movups XMM0 X$esi+16
movups XMM1 X$esi+56 | xorps XMM0 XMM1
movups XMM2 X$esi+96 | xorps XMM0 XMM2
movups XMM2 X$esi+136 | xorps XMM0 XMM2
movups XMM2 X$esi+176 | xorps XMM0 XMM2
movups X$BC+16 XMM0
movq MM0 Q$esi+32 | pxor MM0 Q$esi+72 | pxor MM0 Q$esi+112 | pxor MM0 Q$esi+152 | pxor MM0 Q$esi+192 | movq Q$BC+32 MM0
mov edi BC
movq MM0 Q$edi+8
call rotl64 1
pxor MM0 Q$edi+32
movq MM1 Q$esi | pxor MM1 MM0 | movq Q$esi MM1; esi
movq MM1 Q$esi+40 | pxor MM1 MM0 | movq Q$esi+40 MM1 ; esi+40
movq MM1 Q$esi+80 | pxor MM1 MM0 | movq Q$esi+80 MM1; esi+80
movq MM1 Q$esi+120 | pxor MM1 MM0 | movq Q$esi+120 MM1;esi+120
movq MM1 Q$esi+160 | pxor MM1 MM0 | movq Q$esi+160 MM1;esi+160
movq MM0 Q$edi+16
call rotl64 1
pxor MM0 Q$edi
movq MM1 Q$esi+8 | pxor MM1 MM0 | movq Q$esi+8 MM1;esi+8
movq MM1 Q$esi+48 | pxor MM1 MM0 | movq Q$esi+48 MM1;esi+48
movq MM1 Q$esi+88 | pxor MM1 MM0 | movq Q$esi+88 MM1;esi+88
movq MM1 Q$esi+128 | pxor MM1 MM0 | movq Q$esi+128 MM1;esi+128
movq MM1 Q$esi+168 | pxor MM1 MM0 | movq Q$esi+168 MM1;esi+168
movq MM0 Q$edi+24
call rotl64 1
pxor MM0 Q$edi+8
movq MM1 Q$esi+16 | pxor MM1 MM0 | movq Q$esi+16 MM1 ; eax = 16
movq MM1 Q$esi+56 | pxor MM1 MM0 | movq Q$esi+56 MM1 ; eax = 56
movq MM1 Q$esi+96 | pxor MM1 MM0 | movq Q$esi+96 MM1 ; eax = 96
movq MM1 Q$esi+136 | pxor MM1 MM0 | movq Q$esi+136 MM1; eax = 136
movq MM1 Q$esi+176 | pxor MM1 MM0 | movq Q$esi+176 MM1 ; eax = 176
movq MM0 Q$edi+32
call rotl64 1
pxor MM0 Q$edi+16
movq MM1 Q$esi+24 | pxor MM1 MM0 | movq Q$esi+24 MM1 ; eax = 24
movq MM1 Q$esi+64 | pxor MM1 MM0 | movq Q$esi+64 MM1 ; eax = 64
movq MM1 Q$esi+104 | pxor MM1 MM0 | movq Q$esi+104 MM1 ; eax = 104
movq MM1 Q$esi+144 | pxor MM1 MM0 | movq Q$esi+144 MM1 ; eax = 144
movq MM1 Q$esi+184 | pxor MM1 MM0 | movq Q$esi+184 MM1 ; eax = 184
movq MM0 Q$edi
call rotl64 1
pxor MM0 Q$edi+24
movq MM1 Q$esi+32 | pxor MM1 MM0 | movq Q$esi+32 MM1
movq MM1 Q$esi+72 | pxor MM1 MM0 | movq Q$esi+72 MM1
movq MM1 Q$esi+112 | pxor MM1 MM0 | movq Q$esi+112 MM1
movq MM1 Q$esi+152 | pxor MM1 MM0 | movq Q$esi+152 MM1
movq MM1 Q$esi+192 | pxor MM1 MM0 | movq Q$esi+192 MM1
EndP
; where rotl64 =
Proc rotl64:
Arguments @Count
mov eax D@Count
movq MM2 MM0
movd MM3 eax
sub eax 64
neg eax
movd MM4 eax
psllq MM0 MM3
psrlq MM2 MM4
por MM0 MM2
EndP
The above is the unrolled loop from my keccakf routine that computes Theta. I'll post the rest here later if needed, since I'm trying to optimize other parts of the algo as well :)
So far, my SHA3 implementation works OK for all the different variants (224, 256, 384 and 512). But more optimization is needed :)
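For readers who find the unrolled keccakf_Theta hard to follow, the step can be written compactly in C. This is a reference sketch of the standard Theta step, assuming the state is laid out as 25 consecutive qwords indexed a[x + 5*y], which matches the offsets used in the routine (column 0 at +0, +40, +80, +120, +160). The function names are invented for the example.

```c
#include <stdint.h>

/* Rotate left by one bit, the only rotation Theta needs. */
static uint64_t rotl64_by1(uint64_t x)
{
    return (x << 1) | (x >> 63);
}

/* Theta step of Keccak-f[1600] on 25 lanes stored as a[x + 5*y]. */
void keccak_theta_ref(uint64_t a[25])
{
    uint64_t c[5];
    int x, y;

    /* Column parities: this is the BC array in the assembly version. */
    for (x = 0; x < 5; x++)
        c[x] = a[x] ^ a[x + 5] ^ a[x + 10] ^ a[x + 15] ^ a[x + 20];

    /* Each column x is XORed with BC[x-1] ^ rotl(BC[x+1], 1). */
    for (x = 0; x < 5; x++) {
        uint64_t d = c[(x + 4) % 5] ^ rotl64_by1(c[(x + 1) % 5]);
        for (y = 0; y < 25; y += 5)
            a[x + y] ^= d;
    }
}
```

Since every rotation in this step is by exactly one bit, the count never needs to be a run-time argument here.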
Ex: To compute the SHA3 of the string "guga", my algo returns the correct values (although I believe it still needs more optimization):
SHA3-224 fa6d153b62efd49a85c2b4ae0b53e2f2361d54e51dc2940281b6691d
SHA3-256 0aa747b1b0f38a900fccf52a50386ff57b59e7123ecfc2df1547973e2eeaa19d
SHA3-384 e11df0d663cac016f88a24c36309075972adadc1bd5d6d13928849d334444354f6a68e15f18915d2a4b285960ceeada4
SHA3-512 603c3235eb25a491b5fd8148065066ab307769b714e5abc43448a37f361864e4d786afaff22d5ac5c839d350413d36c953837bf0b29171cae18f7395d720762c
https://md5calc.com/hash/sha3-256/guga
Or maybe rotate each of the high and low 64 bits at once. (This could also work for other needs, such as a routine to rotate bytes (16 at once), words (8 at once) or qwords (2 at once) that are stored inside an XMM register.)
For example:
XMM0 = 001000....0010 (1st 64 bits) - rotate these 64 bits
XMM0 = 010011....1011 (2nd 64 bits) - rotate these 64 bits
So, rotating only within the 64-bit chunks, rather than the whole 128-bit chain. I.e., making "rotl64" compute a rotation of two 64-bit chains (qwords) at once.
Perhaps this could also help optimize the sequence a bit before it goes through the chunks of pxor MM1 MM0 .....
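What such a "two qwords at once" rotation should return can be pinned down with a short C model. It assumes the same psllq / psrlq / por pattern as rotl64, applied to an XMM register with a back-shift of 64 - n, so that both lanes are rotated by the same count independently. The type u128 and the name rotl64x2 are hypothetical.

```c
#include <stdint.h>

typedef struct { uint64_t lo, hi; } u128;   /* hypothetical helper type */

/* Rotate one 64-bit lane left by n (n already reduced to 0..63). */
static uint64_t rotl_lane(uint64_t x, unsigned n)
{
    return (x << n) | (x >> ((64 - n) & 63));
}

/* Rotate both qwords of a 128-bit value left by the same count.
   No bit ever crosses from one half to the other. */
u128 rotl64x2(u128 v, unsigned n)
{
    u128 r;
    n &= 63;
    r.lo = rotl_lane(v.lo, n);
    r.hi = rotl_lane(v.hi, n);
    return r;
}
```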
I have no knowledge regarding Hash Functions, so I had a look on the internet.
Couldn't find a logical explanation of the algorithm which I can understand.
It's a bit hard for me to decipher your code example.
Is it so you only rotate 1 bit at a time?
Quote from: Siekmanski on April 28, 2020, 01:05:39 PM
Here is my contribution, left and right rotation in 1 routine.
Looks convincing :thumbsup:
.Repeat
inc NumRotateBits
call Rotate128Bits
deb 4, "Out", b:xmm0
.Until NumRotateBits>=64
.Repeat
dec NumRotateBits
call Rotate128Bits
deb 4, "Out", b:xmm0
.Until NumRotateBits==-64
Out b:xmm0 00011000000110100001110000011110
Out b:xmm0 00110000001101000011100000111100
Out b:xmm0 01100000011010000111000001111000
Out b:xmm0 11000000110100001110000011110000
Out b:xmm0 10000001101000011100000111100000
Out b:xmm0 00000011010000111000001111000000
Out b:xmm0 00000110100001110000011110000000
Out b:xmm0 00001101000011100000111100000000
Out b:xmm0 00011010000111000001111000000000
Out b:xmm0 00110100001110000011110000000000
Out b:xmm0 01101000011100000111100000000000
Out b:xmm0 11010000111000001111000000000000
Out b:xmm0 10100001110000011110000000000000
Out b:xmm0 01000011100000111100000000000000
Out b:xmm0 10000111000001111000000000000000
Out b:xmm0 00001110000011110000000000000001
Out b:xmm0 00011100000111100000000000000010
Out b:xmm0 00111000001111000000000000000100
Out b:xmm0 01110000011110000000000000001000
Out b:xmm0 11100000111100000000000000010000
Out b:xmm0 11000001111000000000000000100000
Out b:xmm0 10000011110000000000000001000000
Out b:xmm0 00000111100000000000000010000001
Out b:xmm0 00001111000000000000000100000010
Out b:xmm0 00011110000000000000001000000100
Out b:xmm0 00111100000000000000010000001000
Out b:xmm0 01111000000000000000100000010000
Out b:xmm0 11110000000000000001000000100000
Out b:xmm0 11100000000000000010000001000000
Out b:xmm0 11000000000000000100000010000000
Out b:xmm0 10000000000000001000000100000001
Out b:xmm0 00000000000000010000001000000011
Out b:xmm0 00000000000000100000010000000110
Out b:xmm0 00000000000001000000100000001100
Out b:xmm0 00000000000010000001000000011000
Out b:xmm0 00000000000100000010000000110000
Out b:xmm0 00000000001000000100000001100000
Out b:xmm0 00000000010000001000000011000001
Out b:xmm0 00000000100000010000000110000010
Out b:xmm0 00000001000000100000001100000100
Out b:xmm0 00000010000001000000011000001000
Out b:xmm0 00000100000010000000110000010000
Out b:xmm0 00001000000100000001100000100000
Out b:xmm0 00010000001000000011000001000000
Out b:xmm0 00100000010000000110000010000000
Out b:xmm0 01000000100000001100000100000001
Out b:xmm0 10000001000000011000001000000010
Out b:xmm0 00000010000000110000010000000101
Out b:xmm0 00000100000001100000100000001010
Out b:xmm0 00001000000011000001000000010100
Out b:xmm0 00010000000110000010000000101000
Out b:xmm0 00100000001100000100000001010000
Out b:xmm0 01000000011000001000000010100000
Out b:xmm0 10000000110000010000000101000001
Out b:xmm0 00000001100000100000001010000011
Out b:xmm0 00000011000001000000010100000110
Out b:xmm0 00000110000010000000101000001100
Out b:xmm0 00001100000100000001010000011000
Out b:xmm0 00011000001000000010100000110000
Out b:xmm0 00110000010000000101000001100000
Out b:xmm0 01100000100000001010000011000000
Out b:xmm0 11000001000000010100000110000001
Out b:xmm0 10000010000000101000001100000011
Out b:xmm0 00000100000001010000011000000111
Out b:xmm0 10000010000000101000001100000011
Out b:xmm0 11000001000000010100000110000001
Out b:xmm0 01100000100000001010000011000000
Out b:xmm0 00110000010000000101000001100000
Out b:xmm0 00011000001000000010100000110000
Out b:xmm0 00001100000100000001010000011000
Out b:xmm0 00000110000010000000101000001100
Out b:xmm0 00000011000001000000010100000110
Out b:xmm0 00000001100000100000001010000011
Out b:xmm0 10000000110000010000000101000001
Out b:xmm0 01000000011000001000000010100000
Out b:xmm0 00100000001100000100000001010000
Out b:xmm0 00010000000110000010000000101000
Out b:xmm0 00001000000011000001000000010100
Out b:xmm0 00000100000001100000100000001010
Out b:xmm0 00000010000000110000010000000101
Out b:xmm0 10000001000000011000001000000010
Out b:xmm0 01000000100000001100000100000001
Out b:xmm0 00100000010000000110000010000000
Out b:xmm0 00010000001000000011000001000000
Out b:xmm0 00001000000100000001100000100000
Out b:xmm0 00000100000010000000110000010000
Out b:xmm0 00000010000001000000011000001000
Out b:xmm0 00000001000000100000001100000100
Out b:xmm0 00000000100000010000000110000010
Out b:xmm0 00000000010000001000000011000001
Out b:xmm0 00000000001000000100000001100000
Out b:xmm0 00000000000100000010000000110000
Out b:xmm0 00000000000010000001000000011000
Out b:xmm0 00000000000001000000100000001100
Out b:xmm0 00000000000000100000010000000110
Out b:xmm0 00000000000000010000001000000011
Out b:xmm0 10000000000000001000000100000001
Out b:xmm0 11000000000000000100000010000000
Out b:xmm0 11100000000000000010000001000000
Out b:xmm0 11110000000000000001000000100000
Out b:xmm0 01111000000000000000100000010000
Out b:xmm0 00111100000000000000010000001000
Out b:xmm0 00011110000000000000001000000100
Out b:xmm0 00001111000000000000000100000010
Out b:xmm0 00000111100000000000000010000001
Out b:xmm0 10000011110000000000000001000000
Out b:xmm0 11000001111000000000000000100000
Out b:xmm0 11100000111100000000000000010000
Out b:xmm0 01110000011110000000000000001000
Out b:xmm0 00111000001111000000000000000100
Out b:xmm0 00011100000111100000000000000010
Out b:xmm0 00001110000011110000000000000001
Out b:xmm0 10000111000001111000000000000000
Out b:xmm0 01000011100000111100000000000000
Out b:xmm0 10100001110000011110000000000000
Out b:xmm0 11010000111000001111000000000000
Out b:xmm0 01101000011100000111100000000000
Out b:xmm0 00110100001110000011110000000000
Out b:xmm0 00011010000111000001111000000000
Out b:xmm0 00001101000011100000111100000000
Out b:xmm0 00000110100001110000011110000000
Out b:xmm0 00000011010000111000001111000000
Out b:xmm0 10000001101000011100000111100000
Out b:xmm0 11000000110100001110000011110000
Out b:xmm0 01100000011010000111000001111000
Out b:xmm0 00110000001101000011100000111100
Out b:xmm0 00011000000110100001110000011110
Out b:xmm0 00001100000011010000111000001111
Out b:xmm0 10000110000001101000011100000111
Out b:xmm0 11000011000000110100001110000011
Out b:xmm0 01100001100000011010000111000001
Out b:xmm0 10110000110000001101000011100000
Out b:xmm0 01011000011000000110100001110000
Out b:xmm0 00101100001100000011010000111000
Out b:xmm0 00010110000110000001101000011100
Out b:xmm0 00001011000011000000110100001110
Out b:xmm0 00000101100001100000011010000111
Out b:xmm0 10000010110000110000001101000011
Out b:xmm0 01000001011000011000000110100001
Out b:xmm0 10100000101100001100000011010000
Out b:xmm0 01010000010110000110000001101000
Out b:xmm0 00101000001011000011000000110100
Out b:xmm0 00010100000101100001100000011010
Out b:xmm0 00001010000010110000110000001101
Out b:xmm0 10000101000001011000011000000110
Out b:xmm0 01000010100000101100001100000011
Out b:xmm0 00100001010000010110000110000001
Out b:xmm0 10010000101000001011000011000000
Out b:xmm0 01001000010100000101100001100000
Out b:xmm0 00100100001010000010110000110000
Out b:xmm0 00010010000101000001011000011000
Out b:xmm0 00001001000010100000101100001100
Out b:xmm0 00000100100001010000010110000110
Out b:xmm0 00000010010000101000001011000011
Out b:xmm0 00000001001000010100000101100001
Out b:xmm0 10000000100100001010000010110000
Out b:xmm0 01000000010010000101000001011000
Out b:xmm0 00100000001001000010100000101100
Out b:xmm0 00010000000100100001010000010110
Out b:xmm0 00001000000010010000101000001011
Out b:xmm0 10000100000001001000010100000101
Out b:xmm0 11000010000000100100001010000010
Out b:xmm0 11100001000000010010000101000001
Out b:xmm0 01110000100000001001000010100000
Out b:xmm0 00111000010000000100100001010000
Out b:xmm0 00011100001000000010010000101000
Out b:xmm0 00001110000100000001001000010100
Out b:xmm0 00000111000010000000100100001010
Out b:xmm0 00000011100001000000010010000101
Out b:xmm0 10000001110000100000001001000010
Out b:xmm0 11000000111000010000000100100001
Out b:xmm0 01100000011100001000000010010000
Out b:xmm0 00110000001110000100000001001000
Out b:xmm0 00011000000111000010000000100100
Out b:xmm0 00001100000011100001000000010010
Out b:xmm0 00000110000001110000100000001001
Out b:xmm0 10000011000000111000010000000100
Out b:xmm0 01000001100000011100001000000010
Out b:xmm0 10100000110000001110000100000001
Out b:xmm0 01010000011000000111000010000000
Out b:xmm0 00101000001100000011100001000000
Out b:xmm0 00010100000110000001110000100000
Out b:xmm0 00001010000011000000111000010000
Out b:xmm0 00000101000001100000011100001000
Out b:xmm0 00000010100000110000001110000100
Out b:xmm0 00000001010000011000000111000010
Out b:xmm0 10000000101000001100000011100001
Out b:xmm0 01000000010100000110000001110000
Out b:xmm0 00100000001010000011000000111000
Out b:xmm0 00010000000101000001100000011100
Out b:xmm0 00001000000010100000110000001110
Out b:xmm0 00000100000001010000011000000111
Hi Scarmatil
Yes... your algo works like a charm, rotating 1 bit at a time.
Can it also work on the 64-bit halves inside an XMM register, also rotating 1 bit at a time, but restricted to the low and high qwords of the 128 bits?
Maybe this could help optimize this better.
About the SHA3, I built one based on these:
http://gauss.ececs.uc.edu/Courses/c6053/lectures/Hashing/sha3
https://github.com/magurosan/sha3-odzhan
The second link didn't work as expected and was very slow, so I had to create a variation based on both links to make it work properly.
I'll post here the test I made containing the embedded source (RosAsm syntax), and I'll also try to make one in masm syntax to make it easier for people to read.
:biggrin:
First I was "PSHUFD" with Magnus, and now with Scarmatil. :thumbsup:
;rotate64left 1 bit
movq xmm0,qword ptr [Number64]
movq xmm1,xmm0
psllq xmm0,1
psrlq xmm1,63
pxor xmm0,xmm1
;rotate64right 1 bit
movq xmm0,qword ptr [Number64]
movq xmm1,xmm0
psllq xmm0,63
psrlq xmm1,1
pxor xmm0,xmm1
Quote from: Siekmanski on April 29, 2020, 08:17:30 AM
:biggrin:
First I was "PSHUFD" with Magnus, and now with Scarmatil. :thumbsup:
Ooops... Sorry :bgrin: Siekmanski. :greensml: I had RosAsm open, and one of the routines was made by a former contributor called Scarmatil :greensml:
Tks for the code. I'll take a look and see if I can adapt it. I'm also trying to port my SHA3 code to masm before I upload both versions.
:thumbsup:
I ran some tests, and in 64-bit long mode, general-purpose instructions are more efficient than SSE for this.
As an example:
;rotate 1 bit to left
mov rax,9090909090909090h ;64 bits number
shl rax,1 ;CF = old bit 63 (rcl would also shift the incoming CF into bit 0)
adc rax,0 ;add the carried-out bit back into bit 0
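The two-instruction trick is a left shift by one plus the bit that falls out of the top, added back into bit 0. A tiny C model makes that explicit. It assumes the carry flag does not contribute anything on entry, which is worth checking in assembly, since RCL rotates through CF and therefore also shifts the incoming carry into bit 0. The function name is invented for the example.

```c
#include <stdint.h>

/* One-bit rotate left built from a shift and an add-with-carry. */
uint64_t rotl1_via_carry(uint64_t x)
{
    uint64_t carry = x >> 63;   /* the bit shifted out, i.e. what lands in CF */
    return (x << 1) + carry;    /* adc rax,0 adds CF back into bit 0 */
}
```

For the value in the post, 9090909090909090h, the model returns 2121212121212121h. A single rol rax,1 gives the same result.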
Hi mineiro,
As far as I understand and have observed, I don't see possibilities for parallel-execution speed optimizations in guga's routine (could be I missed something...).
Most are 64-bit calculations with widely spread reads and writes.
In 64 bit long mode, your solution fits his routine best.
It's a pity we don't have rotate bits instructions in SIMD. :sad:
You are right, your right and left solution was elegant and runs fast.
What I did was take a snippet of your code and adapt it to what I had done, the gain is minimal to say the least. And there are differences between machines, so the measurements can be different.
Some results to "rept 10000":
your rot
Processor 0
Clock Core cyc Instruct Uops BrTaken
164772 162016 210012 243834 30000
140896 152612 210001 240022 30000
140874 152643 210001 240013 30000
140954 152642 210001 240013 30000
align 16
rot3 proc
mov ecx,[counter]
movdqu xmm0,oword ptr [number]
.if ecx >= 64
pshufd xmm0,xmm0,01001110b
sub ecx,64
.endif
movdqu xmm1,xmm0
movd xmm3,ecx
psllq xmm0,xmm3
sub ecx,64
neg ecx
movd xmm4,ecx
psrlq xmm1,xmm4
pshufd xmm1,xmm1,01001110b
por xmm0,xmm1
ret
rot3 endp
Processor 0
Clock Core cyc Instruct Uops BrTaken
180560 174677 190012 233851 20000
135620 146905 190001 230022 20000
135610 146932 190001 230019 20000
135648 146918 190001 230019 20000
align 16
rot6 proc
mov ecx,counter
mov rax,[number+8]
mov rdx,[number]
.if ecx >= 64
mov rbx,rax
mov rax,rdx
mov rdx,rbx
sub ecx,64
.endif
xor ebx,ebx
shld rbx,rax,cl
shld rax,rdx,cl
shl rdx,cl
or rdx,rbx
ret
rot6 endp
Processor 0
Clock Core cyc Instruct Uops BrTaken
148766 144041 180012 283841 20000
122292 132467 180001 280025 20000
122310 132504 180001 280019 20000
122304 132506 180001 280020 20000
Quote from: mineiro on April 30, 2020, 12:47:04 AM
You are right, your right and left solution was elegant and runs fast.
What I did was take a snippet of your code and adapt it to what I had done, the gain is minimal to say the least. And there are differences between machines, so the measurements can be different.
The code below depends heavily on the previous instruction at each step:
xor ebx,ebx
shld rbx,rax,cl
shld rax,rdx,cl
shl rdx,cl
or rdx,rbx
Maybe unrolling it like this breaks the dependency,
and if the CPU has several shift execution units it can start working on 2 shifts at a time.
Forgive me if r9, r10, r11 are the wrong register choice, because I am not experienced with 64-bit asm.
xor ebx,ebx
xor r9,r9
shld rbx,rax,cl
shld r9,r10,cl
shld rax,rdx,cl
shld r10,r11,cl
shl rdx,cl
shl r11,cl
or rdx,rbx
or r11,r9
On the other hand, some instructions are starved of execution units, so there is no use unrolling those, except that only a few CPUs have slow shifts.
Because it's supposed to be part of bigger code, making it a macro instead of a proc could gain a few cycles.
hello sir daydreamer;
No problems regarding the calling convention and the registers to be preserved, they are just tests; I didn't use the usual Windows or Linux convention myself. If others join in, this can become a problem.
I understood what you said about register related dependency, execution out of order. I didn't find a viable way to solve the dependency to rotate 128 bits.
The code you posted is useful if you need to rotate 256 bits, then using 2 blocks of data and more registers is valid. The example I have in mind is to rotate 65 bits and then to rotate 129 bits ,193 bits and 255, assuming 256 bits in total.
An obstacle will arise because the counter of the shld or shl instruction is up to N bits. In this case I assume that using instructions greater than SSE2 may be feasible.
From what I read the compatible instruction set running on any 64-bit O.S. (x86-64) is SSE2.
Thanks for your comments and suggestions.
Hi guys, I finished porting the SHA3 algorithm to masm; I just need to find what I did wrong in the port, because the result is wrong. I'll open another post containing the algorithm as soon as I fix the masm version. This is the algorithm I'm using to test these SSE rotation functions, to speed it up a little. Also, functions to rotate and shift 128 (or more) bits can be helpful for other things :)
Another approach, no idea how fast it is:
include \masm32\MasmBasic\MasmBasic.inc
.data
Number OWORD 10000110111111101111111011111110100011101111111011111110111111101001111011111110111111101111111010111110111111101111111011111110y
.code
RotateLeft128 proc pNumber ; single shift left
mov eax, pNumber
movups xmm0, OWORD ptr [eax]
psllq xmm0, 1 ; shift left two qwords by one bit
movsx edx, byte ptr [eax+7] ; get sign of low qword
test sbyte ptr [eax+15], 128 ; get sign of high qword
movups OWORD ptr [eax], xmm0
.if Sign?
or byte ptr [eax], 1 ; rotate sign of high qword in
.endif
test edx, edx
.if Sign?
or byte ptr [eax+8], 1 ; set bit 0 of high qword (carry out of the low qword)
.endif
ret
RotateLeft128 endp
Init
Cls
deb 4, "in", b:Number:4 ; show number as binary, 4 dwords
invoke RotateLeft128, offset Number
deb 4, "out", b:Number:4
EndOfCode
Output:
in b:Number:4 10000110111111101111111011111110 10001110111111101111111011111110 10011110111111101111111011111110 10111110111111101111111011111110
out b:Number:4 00001101111111011111110111111101 00011101111111011111110111111101 00111101111111011111110111111101 01111101111111011111110111111101
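The single-bit routine above shifts both qwords left with psllq and then patches bit 0 of each half with the top bit of the other half. A C model of that step, useful for checking the printed output, could look like this. The type u128 and the name rotl128_by1 are made up for the sketch.

```c
#include <stdint.h>

typedef struct { uint64_t lo, hi; } u128;   /* hypothetical helper type */

/* Rotate a 128-bit value left by exactly one bit. */
u128 rotl128_by1(u128 v)
{
    uint64_t hi_top = v.hi >> 63;   /* sign of high qword, wraps to bit 0   */
    uint64_t lo_top = v.lo >> 63;   /* sign of low qword, carries into high */
    u128 r;
    r.lo = (v.lo << 1) | hi_top;
    r.hi = (v.hi << 1) | lo_top;
    return r;
}
```

Fed with the Number from the post (dwords 86FEFEFEh, 8EFEFEFEh, 9EFEFEFEh, BEFEFEFEh from high to low), the model returns 0DFDFDFDh, 1DFDFDFDh, 3DFDFDFDh, 7DFDFDFDh, which matches the "out" line.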
Quote from: mineiro on April 30, 2020, 06:03:54 AM
The code you posted is useful if you need to rotate 256 bits, then using 2 blocks of data and more registers is valid. The example I have in mind is to rotate 65 bits and then to rotate 129 bits ,193 bits and 255, assuming 256 bits in total.
An obstacle will arise because the counter of the shld or shl instruction is up to N bits. In this case I assume that using instructions greater than SSE2 may be feasible.
From what I read the compatible instruction set running on any 64-bit O.S. (x86-64) is SSE2.
Thanks for your comments and suggestions.
A specific case, rotating bits a whole byte at a time, can be done.
Look below at the RGB shuffle alternative, used instead of bit-shifting channels:
http://masm32.com/board/index.php?topic=7687.0
Compliments to Marinus :thumbsup:
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz
in b:MyOword:4d 10000110111111101111111011111110100011101111111011111110111111101001111011111110111111101111111010111110111111101111111011111110
out J b:MyOword:4d 00001101111111011111110111111101000111011111110111111101111111010011110111111101111111011111110101111101111111011111110111111101
out M b:MyOword:4d 00011011111110111111101111111010001110111111101111111011111110100111101111111011111110111111101011111011111110111111101111111010
259 ms for Marinus
555 ms for Jochen
Sir jj's code results on my machine:
Processor 0
Clock Core cyc Instruct Uops BrTaken
336500 345020 140012 217061 30000
306974 332504 140001 213686 30000
305980 331467 140001 213544 30000
307002 332557 140001 213615 30000
sir daydreamer, I'm reading your code and will check what can be done. Thanks.
Quote from: mineiro on April 30, 2020, 09:33:06 PM
sir jj code results in my machine:
Processor 0
Clock Core cyc Instruct Uops BrTaken
336500 345020 140012 217061 30000
306974 332504 140001 213686 30000
305980 331467 140001 213544 30000
307002 332557 140001 213615 30000
Can you explain these numbers?
:cool:
Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz
in b:MyOword:4d 10000110111111101111111011111110100011101111111011111110111111101001111011111110111111101111111010111110111111101111111011111110
out J b:MyOword:4d 00001101111111011111110111111101000111011111110111111101111111010011110111111101111111011111110101111101111111011111110111111101
out M b:MyOword:4d 00011011111110111111101111111010001110111111101111111011111110100111101111111011111110111111101011111011111110111111101111111010
226 ms for Marinus
418 ms for Jochen
Quote from: jj2007 on April 30, 2020, 10:23:27 PM
Can you explain these numbers?
That's better explained on Agner Fog's site (testp.pdf). I'm using the PMCTestB64.asm file, assembled with UASM on Linux.
Basically: clock, core cycles, instructions, micro-operations and branches taken.
Clock is the clock count measured with the RDTSC instruction.
Core cyc is the clock count measured with the "core clock cycles" counter. The CPU can change its core frequency dynamically in order to save power. ...
7.14 What do I need the performance monitor counters for?
These counters are useful for counting instructions, micro-operations, cache misses, branch mispredictions and other events that are useful for testing program performance.
Yes, sure, I know all that, but how does it relate to the algos tested?
good question, I don't know.
305980 / 140874 = 2,172
555ms / 259ms = 2,1429
418ms / 226ms = 1,849557522
:biggrin:
Time is the only constant factor. ( At least at sea level )
Different CPU architectures have their own pros and cons.
Quote from: mineiro on May 01, 2020, 01:45:00 AM
good question, I don't know.
305980 / 140874 = 2,172
555ms / 259ms = 2,1429
418ms / 226ms = 1,849557522
mineiro@assembly:~/asm$ cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 94
model name : Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz
stepping : 3
microcode : 0xcc
cpu MHz : 800.000
cache size : 8192 KB
physical id : 0
siblings : 8
core id : 0
cpu cores : 4
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 22
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch xsaveopt invpcid_single ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx smap dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp md_clear flush_l1d
bogomips : 6816.06
clflush size : 64
cache_alignment : 64
address sizes : 39 bits physical, 48 bits virtual
power management:
With rept 10000, constant_tsc and cpu MHz : 800.000:
305980 / 800000000 = 0,000382475 seconds
140874 / 800000000 = 0,0001760925 seconds
382475 / 176092 == 2,17201803603
A few quick tricks concerning rotating bits.
For faster rotation by multiples of 32 bits we can also use the pshufd instruction (thanks to Siekmanski for the tip) :thumbsup: :thumbsup: :thumbsup:
To do that, I am creating a table of equates for all 256 possible values used in pshufd. For rotation I created these:
[SSE_ROTATE_RIGHT_96BITS 147] ; Dwords in xmm (listed from highest to lowest) are copied from 1234 to 2341 ordering. Therefore, it rotates right by 96 bits (which is the same as rotating left by 32 bits)
[SSE_ROTATE_LEFT_32BITS 147] ; same operation and result as SSE_ROTATE_RIGHT_96BITS
[SSE_ROTATE_LEFT_96BITS 57] ; Dwords in xmm are copied from 1234 to 4123 ordering. Therefore, it rotates left by 96 bits (which is the same as rotating right by 32 bits)
[SSE_ROTATE_RIGHT_32BITS 57] ; same operation and result as SSE_ROTATE_LEFT_96BITS
For 64 bits, we have a simple swap of both qwords inside the xmm register, so we are rotating (left or right) by 64 bits. Rotating left and right by 64 bits are the same thing.
[SSE_SWAP_QWORDS 78] ; Dwords in xmm are copied from 1234 to 3412 ordering. Therefore, it rotates right (and left) by 64 bits.
[SSE_ROTATE_LEFT_64BITS 78] ; same operation and result as SSE_SWAP_QWORDS, SSE_ROTATE_RIGHT_64BITS, SSE_ROTATE_64BITS
[SSE_ROTATE_RIGHT_64BITS 78] ; same as above
[SSE_ROTATE_64BITS 78] ; same as above
We can also use pshufd to invert the order of the dwords, simply as:
[SSE_INVERT_DWORDS 27] ; Dwords in xmm are copied from 1234 to 4321 ordering.
Example of usage:
pshufd xmm1 xmm0 SSE_INVERT_DWORDS ; xmm1 will contain the dwords of xmm0 in inverted order
pshufd xmm0 xmm0 SSE_INVERT_DWORDS ; invert the order of the dwords in xmm0
pshufd xmm1 xmm0 SSE_SWAP_QWORDS ; swap the qwords of xmm0 and copy the result to xmm1
pshufd xmm0 xmm0 SSE_SWAP_QWORDS ; swap qwords in xmm0
pshufd xmm1 xmm0 SSE_ROTATE_RIGHT_96BITS ; rotate xmm0 right by 96 bits (= left by 32 bits) and copy the result to xmm1
etc etc
I'm finishing the labels for all these pshufd equates and will post them in another thread. Since pshufd is a bit confusing (you have to remember all the bit positions), equates with descriptive label names make it easier to understand what is being done, and also make it possible to build a set of macros if needed (for RosAsm, MASM, NASM, FASM, etc.)
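To double-check equates like these without a debugger, the selection rule of pshufd can be modeled in a few lines of C. This is only a sketch: pshufd_model is a hypothetical name, and index 0 is assumed to be the lowest dword (bits 31..0) of the register.

```c
#include <stdint.h>

/* Model of "pshufd dst, src, imm8": result dword i is a copy of source
   dword ((imm8 >> 2*i) & 3). Index 0 is the lowest dword (bits 31..0). */
void pshufd_model(uint32_t dst[4], const uint32_t src[4], uint8_t imm8)
{
    uint32_t tmp[4];                    /* temporary so that dst may alias src */
    for (int i = 0; i < 4; i++)
        tmp[i] = src[(imm8 >> (2 * i)) & 3];
    for (int i = 0; i < 4; i++)
        dst[i] = tmp[i];
}
```

With imm8 = 147 every dword moves one position up (dst[1] = src[0], ..., dst[0] = src[3]), which is the rotate left by 32 bits. With 57 every dword moves one position down, 78 swaps the two qwords, and 27 reverses the four dwords.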
You can use this MACRO for the shuffle positions,
Shuffle MACRO V0,V1,V2,V3
EXITM %((V0 shl 6) or (V1 shl 4) or (V2 shl 2) or (V3))
ENDM
pshufd xmm0,xmm0,Shuffle(1,0,3,2)
is equal to:
pshufd xmm0,xmm0,01001110b
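For anyone checking the values from C, the same bit packing can be written as a preprocessor macro. A small sketch follows; SHUFFLE and the enum names are made up here for illustration (the `_MM_SHUFFLE` macro in `<xmmintrin.h>` packs its four arguments the same way).

```c
/* Pack four 2-bit source positions into an imm8, highest position first,
   mirroring the Shuffle MACRO above. */
#define SHUFFLE(v0, v1, v2, v3) \
    (((v0) << 6) | ((v1) << 4) | ((v2) << 2) | (v3))

enum {
    ROTATE_LEFT_32  = SHUFFLE(2, 1, 0, 3),  /* 10010011b = 147 */
    ROTATE_RIGHT_32 = SHUFFLE(0, 3, 2, 1),  /* 00111001b =  57 */
    SWAP_QWORDS     = SHUFFLE(1, 0, 3, 2),  /* 01001110b =  78 */
    INVERT_DWORDS   = SHUFFLE(0, 1, 2, 3)   /* 00011011b =  27 */
};
```

Each argument names the source dword that ends up in that position of the result, reading from the highest dword to the lowest.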
Quote from: Siekmanski on May 01, 2020, 05:42:28 PM
You can use this MACRO for the shuffle positions,
Shuffle MACRO V0,V1,V2,V3
EXITM %((V0 shl 6) or (V1 shl 4) or (V2 shl 2) or (V3))
ENDM
pshufd xmm0,xmm0,Shuffle(1,0,3,2)
is equal to:
pshufd xmm0,xmm0,01001110b
Great !!! Thanks, Siekmanski. I'll adapt it to RosAsm and make some tests :) Using macros makes it much easier to see how to use this SSE instruction :)
Hi Siekmanski
About the macro, how would I make similar ones for pshufhw, pshuflw, pshufw, shufpd, shufps ?
The Shuffle MACRO can be used for all the 64/128-bit shuffle instructions that take an imm8.
It builds the 8-bit shuffle bitfield ( imm8 ) in the right order for each DWORD/REAL4 ( each DWORD/REAL4 has 2 bits that select the source position (0-3) )
With these shuffles, each QWORD/REAL8 must be treated as a pair of DWORD/REAL4 ( shufpd itself only uses the low 2 bits of imm8, one bit per QWORD ).
The same goes for the 64-bit shuffle, only there the 2 position bits are per WORD.
shuffle_instruction xmm0,xmm1,imm8
Quote from: Siekmanski on April 29, 2020, 08:17:30 AM
:biggrin:
First I was "PSHUFD" with Magnus, and now with Scarmatil. :thumbsup:
;rotate64left 1 bit
movq xmm0,qword ptr [Number64]
movq xmm1,xmm0
psllq xmm0,1
psrlq xmm1,63
pxor xmm0,xmm1
;rotate64right 1 bit
movq xmm0,qword ptr [Number64]
movq xmm1,xmm0
psllq xmm0,63
psrlq xmm1,1
pxor xmm0,xmm1
Hi Siekmanski. I tested the routine to rotate the qwords inside an xmm register, but it does not seem to work. In the high qword, bit 127 is not rotating into bit 64, and in the low qword, bit 63 is not rotating into bit 0.
Ex:
For rotating left, say in xmm0 we have this (From bit 127 to 0):
10101000 10111010 10001001 10011100 00111101 01100100 11011101 10000100 11110111 01110110 01001010 00101111 10100010 11111111 11001111 11110100
So, the HiQword (Bit 127 to 64) is:
10101000 10111010 10001001 10011100 00111101 01100100 11011101 10000100
and the low qword is (Bit 63 to 0):
11110111 01110110 01001010 00101111 10100010 11111111 11001111 11110100
When I use the rotate64left routine, it turns into:
New HiQword (Bit 127 to 64) is:
01010001 01110101 00010011 00111000 01111010 11001001 10111011 00001000
and the new low qword is (Bit 63 to 0):
11101110 11101100 10010100 01011111 01000101 11111111 10011111 11101000
When they should be:
New HiQword (Bit 127 to 64) is:
01010001 01110101 00010011 00111000 01111010 11001001 10111011 00001001
and the new low qword is (Bit 63 to 0):
11101110 11101100 10010100 01011111 01000101 11111111 10011111 11101001
I guess I found it. It seems that replacing
movq xmm1,xmm0
with
movdqu xmm1,xmm0
should do the trick, right ? :)
And how do I rotate by N bits and not only 1 ? Like this ?
Proc rot64left:
Arguments @Count
mov eax D@Count
mov ecx 64 ; 64 bits Barrier.
movd xmm3 ecx ; 64 bit Shift-Range.
movdqu xmm1 xmm0
movd xmm2 eax
psubq xmm3 xmm2 ; Calculate Shift-Range.
psllq xmm0 xmm2
psrlq xmm1 xmm3
pxor xmm0 xmm1
EndP
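To check a routine like this against known values, the per-qword logic can be modeled in C. This is a sketch only: shl_q, shr_q and rot64left_model are hypothetical names, and lane[0] is assumed to be the low qword. One detail worth modeling is that psllq/psrlq with a count above 63 produce zero, whereas a C shift by 64 or more is undefined.

```c
#include <stdint.h>

/* psllq/psrlq with a count register: counts above 63 yield zero.
   In C a shift by 64 or more is undefined, so model that explicitly. */
uint64_t shl_q(uint64_t x, uint64_t count) { return count > 63 ? 0 : x << count; }
uint64_t shr_q(uint64_t x, uint64_t count) { return count > 63 ? 0 : x >> count; }

/* Rotate each 64-bit lane left by count (0..64), following the same
   steps as the rot64left routine: shift left by count, shift a copy
   right by (64 - count), then combine the two. */
void rot64left_model(uint64_t lane[2], uint64_t count)
{
    for (int i = 0; i < 2; i++)
        lane[i] = shl_q(lane[i], count) ^ shr_q(lane[i], 64 - count);
}
```

Fed with the 128-bit test value from the earlier post and a count of 1, this model returns the expected qwords ending in ...00001001 and ...11101001. Because the two shifted halves never overlap, pxor and por give the same result here.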
Hi guga,
Here is the logic of N bits rotation,
N = 7
movdqu xmm1,xmm0
psllq xmm0,N
psrlq xmm1,64-N
pxor xmm0,xmm1
the other direction
N = 7
movdqu xmm1,xmm0
psllq xmm0,64-N
psrlq xmm1,N
pxor xmm0,xmm1
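The same logic extends to the full 128-bit rotation asked about at the start of the thread: the bits that leave one qword have to enter the other one, so the right-shifted copy must come from the swapped qwords. A hedged C sketch follows; rotl128 is a hypothetical name and v[0] is assumed to be the low qword.

```c
#include <stdint.h>

/* Rotate a 128-bit value (v[1]:v[0], v[0] = low qword) left by n bits. */
void rotl128(uint64_t v[2], unsigned n)
{
    uint64_t lo = v[0], hi = v[1];

    n &= 127;
    if (n & 64) {       /* 64..127: swap the qwords first, then rotate by n - 64 */
        uint64_t t = lo; lo = hi; hi = t;
        n &= 63;
    }
    if (n) {            /* 1..63: each qword receives the other qword's top bits */
        uint64_t new_hi = (hi << n) | (lo >> (64 - n));
        uint64_t new_lo = (lo << n) | (hi >> (64 - n));
        hi = new_hi;
        lo = new_lo;
    }
    v[0] = lo;
    v[1] = hi;
}
```

In SSE2 terms this would presumably be: make the copy with pshufd and the qword-swap value 78 instead of movdqu, then psllq the original by N, psrlq the swapped copy by 64-N and combine the two. For N of 64 or more, swap the qwords first and use N-64.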
Great :)
Tks a lot, Siekmanski
I'm trying to optimize that SHA-3 algorithm, and it uses those rotate left and right (and also shift) routines in MMX. I'm trying to port them to SSE2.
So far I have succeeded in partially converting the MMX code to SSE in the Theta routine inside the keccakf function, but there is still a long way to go to fully understand this algorithm. One good thing is that if I succeed in porting it, then maybe (just a huge maybe) there is a way to reverse it.
Some of the routines inside keccakf are basically a copy of the data belonging to the state words. For example, I found that in the "Chi" routine, the whole second "for" is simply a binary copy. Therefore, this copy looks unnecessary and can probably be optimized.
Chi
for (j = 0; j < 25; j += 5) {
    for (i = 0; i < 5; i++)       /* this is a simple copy from st (state) */
        bc[i] = st[j + i];        /* to the bc variable */
    for (i = 0; i < 5; i++)
        st[j + i] ^= (~bc[(i+1)%5]) & bc[(i+2)%5];
}
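A note on that copy: it cannot be dropped entirely, because the last two updates of each row still need the original values of the first two words, which have been overwritten by then. Saving just those two words seems to be enough. A minimal sketch, assuming the state is 25 words of 64 bits as in keccakf (the function name chi is only illustrative):

```c
#include <stdint.h>

/* Chi step on the 25-word state. Only the first two words of each row
   are saved, since they are overwritten before their last use. */
void chi(uint64_t st[25])
{
    for (int j = 0; j < 25; j += 5) {
        uint64_t a0 = st[j], a1 = st[j + 1];    /* original words 0 and 1 */
        st[j]     ^= ~a1        & st[j + 2];
        st[j + 1] ^= ~st[j + 2] & st[j + 3];
        st[j + 2] ^= ~st[j + 3] & st[j + 4];
        st[j + 3] ^= ~st[j + 4] & a0;
        st[j + 4] ^= ~a0        & a1;
    }
}
```

The unrolled form also maps more directly onto pandn/pxor if the row is later moved into SSE registers.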
And even this copy of chunks may be removed later. The main problem with this algorithm is that it is hard to follow and understand what it is doing, but if I succeed in optimizing and simplifying it, then maybe it can be reversed (if not totally, at least part of it could be, I hope).
Why reverse it ? Well, because the algorithm seems to act more like an encoder than a hash per se. So, if (and it is a huge huge huge if) this can be reversed, then we could, in theory, use it to compress whatever file, text, etc. is needed, forcing the data to be limited to 256 or 512 etc. bytes in length. (Kind of impossible, I know, but the behavior of the parts of the routines I'm working with seems to be more like that of an encoder.)
So if I want to use SSE shift/rotate in combination with WinAPI calls on a modern Windows version, are there any safe XMM registers, similar to how some GP registers are preserved, or do I need to save/restore them every time in the loop(s)?