News:

Masm32 SDK description, downloads and other helpful links
Message to All Guests
NB: Posting URL's See here: Posted URL Change

Main Menu

Easy one or may be not

Started by aw27, January 19, 2018, 05:34:50 AM

Previous topic - Next topic

aw27

How to swap the contents of rax and rdx using only logical instructions?

Vortex

xor rax,rdx
xor rdx,rax
xor rax,rdx


https://en.wikipedia.org/wiki/XOR_swap_algorithm

https://betterexplained.com/articles/swap-two-variables-using-xor/


jj2007

Quote from: Vortex on January 19, 2018, 06:45:45 AM
xor rax,rdx
xor rdx,rax
xor rax,rdx

It works, but where is the advantage? Xchg is shorter and twice as fast, see attachment.include \Masm32\MasmBasic\Res\JBasic.inc
Init ; OPT_64 1 ; put 0 for 32 bit, 1 for 64 bit assembly
  PrintLine Chr$("This code was assembled with ", @AsmUsed$(1), " in ", jbit$, "-bit format")
  mov rax, 123
  mov rdx, 456
  usedeb=1
  deb 4, "start", rax, rdx
  xor rax, rdx
  xor rdx, rax
  xor rax, rdx
  deb 4, "xored", rax, rdx
  xchg rax, rdx
  deb 4, "xchged", rax, rdx
EndOfCode


Output:start
rax     123
rdx     456
xored
rax     456
rdx     123
xchged
rax     123
rdx     456

aw27

Quote from: jj2007 on January 19, 2018, 02:00:32 PM
It works, but where is the advantage?
The advantage is to become aware that is good to have a CISC processor and not a RISC one.

jj2007

So processors that can xor regs are RISC, those who can xchg regs are CISC?

aw27

Quote from: jj2007 on January 19, 2018, 02:20:52 PM
So processors that can xor regs are RISC, those who can xchg regs are CISC?
No, things are more complicated. Actually current Intel processors are CISC outside and RISC inside, they use microcode for complex instructions. So when you look at an instruction like xchg you may not realize that 3 xors took place.

jj2007

Quote from: aw27 on January 19, 2018, 02:39:19 PMSo when you look at an instruction like xchg you may not realize that 3 xors took place.

Perhaps, but the CPU has also lots of shadow registers, so it might as well use a temporary reg, see third option below. In any case, the "inside" micro-coding of xchg rax, rdx is more than twice as fast as the "outside" RISC coding via three xor instructions:3*xor
rax     123
rdx     456
ticks 3*xor     rax     6630
xchg
rax     123
rdx     456
ticks 1*xchg    rax     2543
3*mov
rax     123
rdx     456
ticks 3*mov     rax     2528
This code was assembled with ml64 in 64-bit format


Timings on a Core i5. I have added a third option:@@:
  mov r8, rdx
  mov rdx, rax
  mov rax, r8
  dec rcx
  jne @B


Options 2+3 are a factor 2.6 faster than the 3*xor sequence, corrected for loop overhead (see attached project, *.asc opens in WordPad).

aw27

I believe that xchg might be faster, simply because it would be a shame for Intel/AMD if it could be replaced with advantage by a trio of xors.
However, I don't believe much in speed tests, this has been discussed many times before.

jj2007

#9
Quote from: aw27 on January 19, 2018, 08:37:56 PMI don't believe much in speed tests, this has been discussed many times before.

Sure, they are devil's work! Better to have a theory, empirics is for the Warmduscher fraction :greensml:
Core i5:
ticks 3*xor     rax     3322
ticks 1*xchg    rax     1294
ticks 3*mov     rax     1247

ticks 3*xor     rax     3339
ticks 1*xchg    rax     1295
ticks 3*mov     rax     1248

ticks 3*xor     rax     3308
ticks 1*xchg    rax     1295
ticks 3*mov     rax     1279

ticks 3*xor     rax     3323
ticks 1*xchg    rax     1280
ticks 3*mov     rax     1279


Intel(R) Celeron(R) CPU  N2840  @ 2.16GHz:ticks 3*xor     rax     7047
ticks 1*xchg    rax     2797
ticks 3*mov     rax     2843

ticks 3*xor     rax     6001
ticks 1*xchg    rax     3109
ticks 3*mov     rax     3156

ticks 3*xor     rax     6171
ticks 1*xchg    rax     3390
ticks 3*mov     rax     3124

ticks 3*xor     rax     7000
ticks 1*xchg    rax     3578
ticks 3*mov     rax     3094

aw27

Quote
empirics is for the Warmduscher fraction :greensml:
OK, I will make a note. :t