News:

Masm32 SDK description, downloads and other helpful links
Message to All Guests
NB: Posting URL's See here: Posted URL Change

Main Menu

Rotate Bits SSE

Started by guga, April 27, 2020, 06:30:22 AM

Previous topic - Next topic

jj2007

Another approach, no idea how fast it is:

include \masm32\MasmBasic\MasmBasic.inc
.data
Number         OWORD 10000110111111101111111011111110100011101111111011111110111111101001111011111110111111101111111010111110111111101111111011111110y
.code
RotateLeft128 proc pNumber      ; single shift left
  mov eax, pNumber
  movups xmm0, OWORD ptr [eax]
  psllq xmm0, 1                 ; shift left two qwords by one bit
  movsx edx, byte ptr [eax+7]   ; get sign of low qword
  test sbyte ptr [eax+15], 128  ; get sign of high qword
  movups OWORD ptr [eax], xmm0
  .if Sign?
        or byte ptr [eax], 1    ; rotate sign of high qword in
  .endif
  test edx, edx
  .if Sign?
        or byte ptr [eax+8], 1  ; set bit 0 of low qword
  .endif
  ret
RotateLeft128 endp
  Init
  Cls
  deb 4, "in", b:Number:4       ; show number as binary, 4 dwords
  invoke RotateLeft128, offset Number
  deb 4, "out", b:Number:4
EndOfCode


Output:
in      b:Number:4      10000110111111101111111011111110 10001110111111101111111011111110 10011110111111101111111011111110 10111110111111101111111011111110
out     b:Number:4      00001101111111011111110111111101 00011101111111011111110111111101 00111101111111011111110111111101 01111101111111011111110111111101

daydreamer

Quote from: mineiro on April 30, 2020, 06:03:54 AM

The code you posted is useful if you need to rotate 256 bits, then using 2 blocks of data and more registers is valid. The example I have in mind is to rotate 65 bits and then to rotate 129 bits ,193 bits and 255, assuming 256 bits in total.
An obstacle will arise because the counter of the shld or shl instruction is up to N bits. In this case I assume that using instructions greater than SSE2 may be feasible.
From what I read the compatible instruction set running on any 64-bit O.S. (x86-64) is SSE2.
Thanks for your comments and sugestion.
specific case of rotate bits ala byte can be done
look below on RGB shuffle alternative instead of bitshift channels
http://masm32.com/board/index.php?topic=7687.0
my none asm creations
https://masm32.com/board/index.php?topic=6937.msg74303#msg74303
I am an Invoker
"An Invoker is a mage who specializes in the manipulation of raw and elemental energies."
Like SIMD coding

jj2007

Compliments to Marinus :thumbsup:
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz
in      b:MyOword:4d    10000110111111101111111011111110100011101111111011111110111111101001111011111110111111101111111010111110111111101111111011111110
out J   b:MyOword:4d    00001101111111011111110111111101000111011111110111111101111111010011110111111101111111011111110101111101111111011111110111111101
out M   b:MyOword:4d    00011011111110111111101111111010001110111111101111111011111110100111101111111011111110111111101011111011111110111111101111111010
259 ms for Marinus
555 ms for Jochen

mineiro

sir jj code results in my machine:
Processor 0
     Clock   Core cyc   Instruct       Uops    BrTaken
    336500     345020     140012     217061      30000
    306974     332504     140001     213686      30000
    305980     331467     140001     213544      30000
    307002     332557     140001     213615      30000

sir daydreamer, I'm reading your code and will check what can be done. Thanks.
I'd rather be this ambulant metamorphosis than to have that old opinion about everything

jj2007

Quote from: mineiro on April 30, 2020, 09:33:06 PM
sir jj code results in my machine:
Processor 0
     Clock   Core cyc   Instruct       Uops    BrTaken
    336500     345020     140012     217061      30000
    306974     332504     140001     213686      30000
    305980     331467     140001     213544      30000
    307002     332557     140001     213615      30000

Can you explain these numbers?

Siekmanski

 :cool:
Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz
in      b:MyOword:4d    10000110111111101111111011111110100011101111111011111110111111101001111011111110111111101111111010111110111111101111111011111110
out J   b:MyOword:4d    00001101111111011111110111111101000111011111110111111101111111010011110111111101111111011111110101111101111111011111110111111101
out M   b:MyOword:4d    00011011111110111111101111111010001110111111101111111011111110100111101111111011111110111111101011111011111110111111101111111010
226 ms for Marinus
418 ms for Jochen
Creative coders use backward thinking techniques as a strategy.

mineiro

Quote from: jj2007 on April 30, 2020, 10:23:27 PM
Can you explain these numbers?
Thats better explained in agner fog blog (testp.pdf). I'm using PMCTestB64.asm file and assembling with uasm in linux.
Basically clock, core cycles, instructions, micro operations and branchs taken.

Clock is the clock count measured with the RDTSC instruction.
Core cyc is the clock count measured with the "core clock cycles" counter. The CPU can change its core frequency dynamically in order to save power. ...

7.14 What do I need the performance monitor counters for?
These counters are useful for counting instructions, micro-operations, cache misses, branch mispredictions and other events that are useful for testing program performance.
I'd rather be this ambulant metamorphosis than to have that old opinion about everything

jj2007

Yes, sure, I know all that, but how does it relate to the algos tested?

mineiro

good question, I don't know.
305980 / 140874 = 2,172
555ms  /  259ms = 2,1429
418ms  /  226ms = 1,849557522
I'd rather be this ambulant metamorphosis than to have that old opinion about everything

Siekmanski

 :biggrin:
Time is the only constant factor. ( At least at sea level )
Different CPU architectures have their own pros and cons.
Creative coders use backward thinking techniques as a strategy.

mineiro

Quote from: mineiro on May 01, 2020, 01:45:00 AM
good question, I don't know.
305980 / 140874 = 2,172
555ms  /  259ms = 2,1429
418ms  /  226ms = 1,849557522

mineiro@assembly:~/asm$ cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 94
model name      : Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz
stepping        : 3
microcode       : 0xcc
cpu MHz         : 800.000
cache size      : 8192 KB
physical id     : 0
siblings        : 8
core id         : 0
cpu cores       : 4
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 22
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch xsaveopt invpcid_single ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx smap dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp md_clear flush_l1d
bogomips        : 6816.06
clflush size    : 64
cache_alignment : 64
address sizes   : 39 bits physical, 48 bits virtual
power management:


rept 10000, constant_tsc and cpu MHz : 800.000,
305980 / 800000000 = 0,000382475 seconds
140874 / 800000000 = 0,0001760925 seconds

382475 / 176092  == 2,17201803603
I'd rather be this ambulant metamorphosis than to have that old opinion about everything

guga

A few rapid tricks concerning rotating bits.

For faster rotation of specific bits we can also use pshufd mnemonic (Tks Siekmanski for the tip) :thumbsup: :thumbsup: :thumbsup:

To do that, i create a table of equates of all possible 256 values used in pshufd. For rotation i created these:

[SSE_ROTATE_RIGHT_96BITS   147]    ; Dwords in xmm are copied from 1234 to 2341 ordering. Therefore, it is rotating right 96 bits. (Which is the same as rotating left 32 bits)
[SSE_ROTATE_LEFT_32BITS 147]        ; the same things and result as SSE_ROTATE_RIGHT_96BITS

[SSE_ROTATE_LEFT_96BITS     57]      ; Dwords in xmm are copied from 1234 to 4123 ordering. Therefore, it is rotating left 96 bits. (Which is the same as rotating right 32 bits)
[SSE_ROTATE_RIGHT_32BITS  57]      ; the same things and result as SSE_ROTATE_LEFT_96BITS

for 64 bits, we have a simple Swaping of both Qwords inside xmm, therefore we are rotating (left and right) 64 bits. Rotating left and right 64 bits are the same thing

[SSE_SWAP_QWORDS 78]        ; Dwords in xmm are copied from 1234 to 3412 ordering. Therefore, it is rotating right and left left 64 bits.
[SSE_ROTATE_LEFT_64BITS 78]      ; the same things and result as SSE_SWAP_QWORDS, SSE_ROTATE_RIGHT_64BITS, SSE_ROTATE_64BITS
[SSE_ROTATE_RIGHT_64BITS 78]    ; same as above
[SSE_ROTATE_64BITS 78]                ; same as above


We can also use pshufd to invert the order of dwords simple as:
[SSE_INVERT_DWORDS 27]      ; Dwords in xmm are copied from 1234 to 4321 ordering.



Example of usage:

pshufd xmm1 xmm0 SSE_INVERT_DWORDS ; xmm1 will contains the inverted order of dwords in xmm0
pshufd xmm0 xmm0 SSE_INVERT_DWORDS ; inverted the order of dwords in xmm0

pshufd xmm1 xmm0 SSE_SWAP_QWORDS ; swap qwords and copy them from xmm0 to xmm1
pshufd xmm0 xmm0 SSE_SWAP_QWORDS ; swap qwords in xmm0

pshufd xmm1 xmm0 SSE_ROTATE_RIGHT_96BITS ; rotate 96 bits left in xmm0 and copy them onto xmm1

etc etc


I´m finishing building the labels for all these equates related to pshufd and will put them on another thread. Since pshufd is a bit confusing in having to remember all the bits location etc, making equates with the proper label names is better to understand what to do and also, build a set of macros if needed (for rosasm, masm, nasm, fasm etc)


Coding in Assembly requires a mix of:
80% of brain, passion, intuition, creativity
10% of programming skills
10% of alcoholic levels in your blood.

My Code Sites:
http://rosasm.freeforums.org
http://winasm.tripod.com

Siekmanski

You can use this MACRO for the shuffle positions,

Shuffle MACRO V0,V1,V2,V3
    EXITM %((V0 shl 6) or (V1 shl 4) or (V2 shl 2) or (V3))
ENDM

    pshufd  xmm0,xmm0,Shuffle(1,0,3,2)
    is equal to:
    pshufd  xmm0,xmm0,01001110b
Creative coders use backward thinking techniques as a strategy.

guga

Quote from: Siekmanski on May 01, 2020, 05:42:28 PM
You can use this MACRO for the shuffle positions,

Shuffle MACRO V0,V1,V2,V3
    EXITM %((V0 shl 6) or (V1 shl 4) or (V2 shl 2) or (V3))
ENDM

    pshufd  xmm0,xmm0,Shuffle(1,0,3,2)
    is equal to:
    pshufd  xmm0,xmm0,01001110b

Great !!! Thanks, siekmanski. I[´ll adapt it to RosAsm and make some tests :)  using macros is much much easier to identify how to use this SSE mnemonic:)
Coding in Assembly requires a mix of:
80% of brain, passion, intuition, creativity
10% of programming skills
10% of alcoholic levels in your blood.

My Code Sites:
http://rosasm.freeforums.org
http://winasm.tripod.com

guga

Hi Siekmanski

About the macro, how to make ones for pshufhw, pshuflw, pshufw, shufpd, shufps ?
Coding in Assembly requires a mix of:
80% of brain, passion, intuition, creativity
10% of programming skills
10% of alcoholic levels in your blood.

My Code Sites:
http://rosasm.freeforums.org
http://winasm.tripod.com