Hi everybody.
It has been a long time since the last MASM FOR FUN adventure.
Time to be back?
I hope so, at least for a while.
Let's start with a question.
I have a buffer of 4096 consecutive dwords, and I'd like to extract the low byte of each dword
and put it in a second buffer.
The operation in itself is not that difficult, but I'd like to do it with SIMD instructions, just to
have a little bit of fun and discover some SSE2/SSE3 instructions.
With SIMD instructions I can work on 8 or 16 bytes at a time, which should speed up the process as well.
I had a look at the Intel manuals, but [as n00bs do] I didn't find any suitable SIMD opcode.
Has anybody gone through this problem and found an SSE2/SSE3 solution?
In pseudocode:
xmm0 = XFDC GFTI DEWA HYTO
pckwhat? eax, xmm0, n
And we have CIAO in eax.
CIAO
Great to see you back.
Thanks, it's good to be back.
With the old instructions I get:
Intel(R) Pentium(R) 4 CPU 3.20GHz (SSE3)
28098 cycles for MOV DL
28009 cycles for MOV DL
--- ok ---
Attached the source and exe.
Frank
Intel(R) Core(TM)2 Duo CPU P8600 @ 2.40GHz (SSE4)
8127 cycles for MOV DL
8372 cycles for MOV DL
Does having an extra core really make that much difference ?
Andy
Well, the processor you are using is about 5 years younger, which
means it runs faster than the old one I'm using.
RAM clock, data bus, cache memory... there are many things that speed things up, indeed.
Frank
I'm not sure whether a single SIMD opcode can accomplish the task, but
probably more than one could do it.
If I have four xmm registers loaded with the dwords read from the buffer:
movd xmm0, [ebx]
movd xmm1, [ebx + 4]
movd xmm2, [ebx + 8]
movd xmm3, [ebx + 12]
then interleaving the bytes of the xmm registers and leaving out the unused
bytes should be possible.
While I wait for some suggestions, I'll carry on thinking and reading.
Frank
Hi Andy...
Multicore will perform better on an OS like Windows that is multi-tasking as some processes are assigned to each core, permitting much better multitasking than on single core machines.
However... inside your application, 1 core or 512 cores won't make much difference unless you are writing simultaneously executing multithreaded code so that some of your tasks can be spread out across different cores. In that case simply writing two simultaneous threads can (often) significantly increase the speed of your code.
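Maybe I've found a solution. I'll try it and test its performance.
If it works, it is a first step towards translating the code the SSE way.
mov eax, offset Dest
mov ebx, offset Source
mov ecx, 4096/4 ; 4 dwords per pass (loop control added to complete the fragment)
@@:
movd mm0, dword ptr [ebx] ; dword 0
movd mm1, dword ptr [ebx + 4] ; dword 1
movd mm2, dword ptr [ebx + 8] ; dword 2
movd mm3, dword ptr [ebx + 12] ; dword 3
punpcklbw mm0, mm2 ; interleave the bytes of dwords 0 and 2
punpcklbw mm1, mm3 ; interleave the bytes of dwords 1 and 3
punpcklbw mm0, mm1 ; low dword of mm0 = low bytes of dwords 0, 1, 2, 3
movd dword ptr [eax], mm0 ; store the four low bytes
add ebx, 16
add eax, 4
dec ecx
jnz @B
emms ; leave MMX state before any FPU code runs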
Well, it apparently works with some speed improvement:
Intel(R) Pentium(R) 4 CPU 3.20GHz (SSE3)
---------------------------------------------------
28315 cycles for MOV DL
16479 cycles for MOVD DWORD PTR
28394 cycles for MOV DL
16637 cycles for MOVD DWORD PTR
--- ok ---
I'll see if there is a better method, with fewer SSE opcodes, to get
better performance.
Frank
Ciao Frank,
Welcome to the forum, and thanks for the message in the other thread ;-)
What you are looking for is pshufb. There is a testpiece by Hutch here (http://www.masmforum.com/board/index.php?topic=15974.0).
You need to add this:
include \masm32\include\masm32rt.inc
.686p
.xmm
... and you need a modern CPU ;-)
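A minimal sketch of the PSHUFB idea, not taken from Hutch's testpiece. The names Src4, ShufMask and Dest4 are made up for illustration, and the sketch assumes an assembler that knows SSSE3 (a recent ML or JWasm) and a CPU that supports it. Each byte of the shuffle mask selects a source byte by index; a mask byte with bit 7 set produces a zero in the result.

```masm
include \masm32\include\masm32rt.inc
.686p
.xmm

.data
align 16
Src4     db "CxxxIxxxAxxxOxxx"     ; four dwords; the low byte of each sits at offset 0, 4, 8, 12
ShufMask db 0, 4, 8, 12            ; source byte index for result bytes 0..3
         db 12 dup (80h)           ; bit 7 set: PSHUFB writes zero to result bytes 4..15
Dest4    db 16 dup (0)             ; zero-filled, so the 4 stored bytes form an ASCIIZ string

.code
start:
    mov     edx, offset Src4
    movdqa  xmm0, [edx]            ; load the four source dwords
    mov     edx, offset ShufMask
    movdqa  xmm1, [edx]            ; load the shuffle mask
    pshufb  xmm0, xmm1             ; gather the four low bytes into bytes 0..3 of xmm0
    mov     edx, offset Dest4
    movd    dword ptr [edx], xmm0  ; store the four packed bytes
    print   offset Dest4, 13, 10
    inkey
    exit
end start
```

If the sketch is right, it should print CIAO.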
Ciao jj.
I have to switch to my second PC in order to use SSSE3 opcodes; I'll do it as soon
as I'm ready.
What do you think about the solution I used? It is MMX but not very fast, at
least on my P4 dual core 3.2 GHz.
Frank
Quote from: frktons on November 25, 2012, 08:04:02 AM
What do you think about the solution I used?
Not fast indeed. PSHUFB must be much better, but I can't test it here...
See http://www.rz.uni-karlsruhe.de/rz/docs/VTune/reference/About_IA-32_Instructions.htm#P-Instructions for some useful info.
Did you try the straightforward solution?
include \masm32\include\masm32rt.inc
.data
src db "xxxCxxxIxxxAxxxO", 0
dest db 20 dup(?)
.code
start:
mov esi, offset src+3
mov edi, offset dest
REPEAT 4
lodsd
stosb
ENDM
inkey offset dest
exit
end start
I've started some tests on PSHUFB.
In the meantime, here is your proposal together with
the previous ones. The MMX code still leads.
Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz (SSE4)
---------------------------------------------------------
12412 cycles for MOV DL
6199 cycles for MMX/MOVD DWORD PTR
21114 cycles for STOSB
---------------------------------------------------------
12401 cycles for MOV DL
6235 cycles for MMX/MOVD DWORD PTR
21137 cycles for STOSB
---------------------------------------------------------
--- ok ---
Frank
hiyas Frank - good to see you :t
this might have fewer dependencies...
@@: mov edx,[ebx]
add ebx,4
mov [eax],dl
dec ecx
lea eax,[eax+1]
jnz @B
Hi Dave. Nice to see you too.
The sequence you used makes me unsure about the JNZ:
dec ecx
lea eax,[eax+1]
jnz @B
Does the jnz refer to ECX or to EAX?
By the way, I inserted your code inside the pgm, but I get strange results:
invoke Sleep, 100
counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
mov eax, offset Dest
mov ecx, 4096
mov ebx, offset Source
@@: mov edx,[ebx]
add ebx,4
mov [eax],dl
dec ecx
lea eax,[eax+1]
jnz @B
print str$(eax), 9, "cycles for DAVE ", 13, 10, 13, 10
and:
Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz (SSE4)
---------------------------------------------------------
12396 cycles for MOV DL
6193 cycles for MMX/MOVD DWORD PTR
21098 cycles for STOSB
20677248 cycles for DAVE
---------------------------------------------------------
12394 cycles for MOV DL
6207 cycles for MMX/MOVD DWORD PTR
21093 cycles for STOSB
20677248 cycles for DAVE
---------------------------------------------------------
--- ok ---
Did I misspell something, or what?
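Looking only at the listing above, the closing counter_end macro appears to be missing. If so, str$(eax) prints whatever the loop left in EAX (the advanced Dest pointer) rather than a cycle count, which would explain a figure that is identical in both runs. A sketch of the test with the macro pair closed, assuming MichaelW's timers.asm sits in \masm32\macros and that counter_end returns the count in EAX as in the other tests; LOOP_COUNT and the buffer definitions are placeholders, not the attached source:

```masm
include \masm32\include\masm32rt.inc
.686
include \masm32\macros\timers.asm   ; assumed location of counter_begin / counter_end

LOOP_COUNT equ 100000               ; placeholder value

.data?
align 16
Source dd 4096 dup (?)
Dest   db 4096 dup (?)

.code
start:
    invoke Sleep, 100
    counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
        mov eax, offset Dest
        mov ecx, 4096
        mov ebx, offset Source
    @@: mov edx, [ebx]
        add ebx, 4
        mov [eax], dl
        dec ecx              ; sets ZF when the count reaches zero
        lea eax, [eax+1]     ; advances the pointer without touching the flags
        jnz @B               ; still tests the ZF produced by DEC ECX
    counter_end              ; without this line EAX still holds the Dest pointer
    print str$(eax), 9, "cycles for DAVE", 13, 10, 13, 10
    inkey
    exit
end start
```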
nope - i just write bad code :lol:
LEA does not affect the flags, so
lea eax,[eax+1]
adds one to EAX without altering the flags that were set by DEC ECX
the idea was to put something in between the instruction that sets the flags and the one that examines them
but, LEA is not a great performer on older CPU's
still, it shouldn't be that slow - lol
i got a slight improvement on my p4 prescott
counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
mov eax, offset Dest
mov ecx, 4096
mov ebx, offset Source
@@: mov edx,[ebx]
add ebx,4
mov [eax],dl
dec ecx
lea eax,[eax+1]
jnz @B
; @@:
; mov edx, [ebx]
; mov byte ptr [eax], dl
; add eax, 1
; add ebx, 4
; dec ecx
; jnz @B
counter_end
Maybe try
mov dl,[ebx]
mov dh,[ebx+4]
mov [eax],dx
Dave,
It was only the PIV that was a poor performer with LEA, PIII and earlier and Core2 onwards are fine.
i knew it was something like that, Hutch - lol
sinsi has the right idea, i think...
counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
mov ebx, offset Source
mov eax, offset Dest
mov ecx, 4096/4
@@: mov dh,[ebx+12]
mov dl,[ebx+8]
shl edx,16
mov dh,[ebx+4]
mov dl,[ebx]
add ebx,16
mov [eax],edx
dec ecx
lea eax,[eax+4]
jnz @B
counter_end
it would help if the destination array is 4-aligned - maybe the source, too
So on your puter Sinsi's solution is clearly the fastest. That's what I suspected ;-)
Not on mine, however:
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
9968 cycles for MOV AX
9622 cycles for LEA
5181 cycles for MMX/MOVD DWORD PTR
On a newer machine there is no contest:
Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz (SSE4)
---------------------------------------------------------
13874 cycles for MOV AX
13124 cycles for LEA
6193 cycles for MMX/MOVD DWORD PTR
18486 cycles for STOSB
---------------------------------------------------------
13856 cycles for MOV AX
13087 cycles for LEA
4129 cycles for MMX/MOVD DWORD PTR
18516 cycles for STOSB
---------------------------------------------------------
--- ok ---
later the pshufb solution that should win the race.
Here it is, the first quick shot with PSHUFB:
Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz (SSE4)
---------------------------------------------------------
13866 cycles for MOV AX
13084 cycles for LEA
6205 cycles for MMX/MOVD DWORD PTR
4848 cycles for PSHUFB / I shot
18487 cycles for STOSB
---------------------------------------------------------
13852 cycles for MOV AX
13083 cycles for LEA
6194 cycles for MMX/MOVD DWORD PTR
4730 cycles for PSHUFB / I shot
18518 cycles for STOSB
---------------------------------------------------------
--- ok ---
later I'll try to improve it.
Some more tests:
Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz (SSSE3)
---------------------------------------------------------
13927 cycles for MOV AX
13097 cycles for LEA
6203 cycles for MMX/PUNPCKLBW
4729 cycles for XMM/PSHUFB - I shot
3518 cycles for XMM/PSHUFB - II shot
15364 cycles for XMM/MASKMOVDQU - I shot
18506 cycles for STOSB
---------------------------------------------------------
13868 cycles for MOV AX
13096 cycles for LEA
6198 cycles for MMX/PUNPCKLBW
4732 cycles for XMM/PSHUFB - I shot
3520 cycles for XMM/PSHUFB - II shot
15360 cycles for XMM/MASKMOVDQU - I shot
18503 cycles for STOSB
---------------------------------------------------------
--- ok ---
Ciao Frank,
Apart from being slow, maskmovdqu does not do what you want:
source xxxCxxxIxxxAxxxO
wanted CIAO
effective C I A O
If the mask is set correctly, maskmovdqu should do the job:
F0h = byte to move, 00h = byte not moved, according to the Intel docs:
The most significant bit in each byte of the mask operand determines whether the
corresponding byte in the source operand is written to the corresponding byte location
in memory: 0 indicates no write and 1 indicates write.
At least my previous test showed it can do the job, but maybe I didn't test it enough. ::)
Edit: it probably only works fine with consecutive bytes, because each selected byte is written to its
corresponding location; as you said, it is not what I need, besides being slow.
Quote from: nidud on November 26, 2012, 04:34:28 AM
Intel(R) Core(TM) i3-2367M CPU @ 1.40GHz (SSE4)
---------------------------------------------------------
6334 cycles for MOV AX
5171 cycles for LEA
3123 cycles for MMX/MOVD DWORD PTR
2189 cycles for PSHUFB / I shot
10503 cycles for STOSB
---------------------------------------------------------
5243 cycles for MOV AX
5488 cycles for LEA
3150 cycles for MMX/MOVD DWORD PTR
2060 cycles for PSHUFB / I shot
9276 cycles for STOSB
---------------------------------------------------------
These SIMD instructions work a lot better with modern tech.
Try the last version only on I3.
Quote from: nidud on November 26, 2012, 07:28:02 AM
With adjustment for the loop (mov ecx,4096/16):
---------------------------------------------------------
1125 cycles for XMM/PSHUFB - I shot
1088 cycles for XMM/PSHUFB - II shot
---------------------------------------------------------
1124 cycles for XMM/PSHUFB - I shot
1140 cycles for XMM/PSHUFB - II shot
---------------------------------------------------------
In the first shot the loop count is 4096/4 because each pass processes 4 dwords, so it cannot be 4096/16.
The second one works on 16 dwords at a time, so its loop count is 4096/16.
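The relation between dwords per pass and loop count can be sketched as follows. This is an illustration only, not the attached "I shot" and "II shot" routines: the procedure names, the four masks and the register usage are assumptions, and it needs an SSSE3-capable assembler and CPU. PackBy4 handles 4 dwords per pass (4096/4 passes); PackBy16 handles 16 dwords per pass (4096/16 passes) by steering each group of four low bytes into a different lane and merging with POR.

```masm
include \masm32\include\masm32rt.inc
.686p
.xmm

.data
align 16
Mask0 db 0,4,8,12, 12 dup (80h)              ; low bytes -> result bytes 0..3
Mask1 db 4 dup (80h), 0,4,8,12, 8 dup (80h)  ; low bytes -> result bytes 4..7
Mask2 db 8 dup (80h), 0,4,8,12, 4 dup (80h)  ; low bytes -> result bytes 8..11
Mask3 db 12 dup (80h), 0,4,8,12              ; low bytes -> result bytes 12..15

.data?
align 16
Source dd 4096 dup (?)
Dest   db 4096 dup (?)

.code
PackBy4 proc
    mov     edx, offset Mask0
    movdqa  xmm4, [edx]            ; load the mask once, outside the loop
    mov     eax, offset Dest
    mov     edx, offset Source
    mov     ecx, 4096/4            ; 4 dwords per pass -> 1024 passes
@@: movdqa  xmm0, [edx]            ; 4 source dwords
    pshufb  xmm0, xmm4             ; their low bytes land in bytes 0..3
    movd    dword ptr [eax], xmm0  ; store 4 result bytes
    add     edx, 16
    add     eax, 4
    dec     ecx
    jnz     @B
    ret
PackBy4 endp

PackBy16 proc
    mov     edx, offset Mask0
    movdqa  xmm4, [edx]
    movdqa  xmm5, [edx+16]
    movdqa  xmm6, [edx+32]
    movdqa  xmm7, [edx+48]
    mov     eax, offset Dest
    mov     edx, offset Source
    mov     ecx, 4096/16           ; 16 dwords per pass -> 256 passes
@@: movdqa  xmm0, [edx]
    movdqa  xmm1, [edx+16]
    movdqa  xmm2, [edx+32]
    movdqa  xmm3, [edx+48]
    pshufb  xmm0, xmm4             ; dwords 0..3   -> bytes 0..3
    pshufb  xmm1, xmm5             ; dwords 4..7   -> bytes 4..7
    pshufb  xmm2, xmm6             ; dwords 8..11  -> bytes 8..11
    pshufb  xmm3, xmm7             ; dwords 12..15 -> bytes 12..15
    por     xmm0, xmm1
    por     xmm2, xmm3
    por     xmm0, xmm2             ; merge the four lanes
    movdqa  [eax], xmm0            ; store 16 result bytes
    add     edx, 64
    add     eax, 16
    dec     ecx
    jnz     @B
    ret
PackBy16 endp

start:
    mov     edi, offset Source
    mov     eax, 00000041h         ; low byte 'A', upper bytes zero
    mov     ecx, 4096
    rep     stosd                  ; known test data instead of uninitialized memory
    call    PackBy4
    call    PackBy16
    mov     byte ptr [Dest+16], 0  ; cut the buffer to 16 characters for display
    print   offset Dest, 13, 10
    inkey
    exit
end start
```

With the source filled this way, both routines should leave Dest full of 'A' bytes, so the program should display sixteen of them.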
I'm going to prepare a real test, with some data to make the masks
a little bit more accurate. They have not been tested so far, and
were used just to get an idea of their performance.
After testing on real data and adjusting the masks accordingly, the
test can be considered valid.
Up to now I've worked on uninitialized data, so there is no way to
know whether the sequence of bits/bytes in the masks is correct. ::)
Quote from: nidud on November 26, 2012, 09:43:40 AM
I think it does what it is supposed to do
Well, a good test should start with 4096 dwords initialized to 00000001h,
then run each routine on them and check that the Dest buffer contains only 01h
at the end; to verify that, you could use your routine.
Hi Frank,
Here is a testfile. The exe shows it, *.asc is the source in RTF/RichMasm format.
Hope it helps,
Jochen
Quote from: nidud on November 26, 2012, 11:54:27 AM
This was implemented using macros for each test. You need to reset the byte buffer (Dest) for each test, using 0 if source is 1, or 1 if source is 0 (as in this case).
Yes nidud, thanks.
Thanks Jochen, your help is always welcome.
I'll have a look at it as soon as I finish a couple of preliminary
things I'm working on.
Is there an opcode to compare two xmm registers to verify
whether they have the same content?
Again, SIMD instructions are a bit tricky for simple operations.
See the CMPxxx, COMxxx and PCMPxxx instructions: AMD64 Architecture Programmer's Manual Volume 4: 128-bit and 256 bit media instructions (http://support.amd.com/us/Processor_TechDocs/26568_APM_v4.pdf)
Yes qWord,
Let's assume I use:
PCMPEQD xmm0,xmm1
Considering that this and the others don't affect the flags,
how do I jmp somewhere after the test?
What tells me whether they are equal or not?
PTEST affects the Zero Flag, but the opcode is out of my league (SSE4.1).
The first correct test for SSE instructions with proc to check the results:
Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz (SSE4)
---------------------------------------------------------
13862 cycles for MOV AX - Test OK
13114 cycles for LEA - Test OK
6195 cycles for MMX/PUNPCKLBW - Test OK
3157 cycles for XMM/PSHUFB - I shot - Test OK
2375 cycles for XMM/PSHUFB - II shot - Test OK
12327 STOSB - Test OK
---------------------------------------------------------
9238 cycles for MOV AX - Test OK
8723 cycles for LEA - Test OK
4130 cycles for MMX/PUNPCKLBW - Test OK
3150 cycles for XMM/PSHUFB - I shot - Test OK
2375 cycles for XMM/PSHUFB - II shot - Test OK
16701 STOSB - Test OK
---------------------------------------------------------
--- ok ---
Attached last version.
Enjoy
Quote from: nidud on November 27, 2012, 11:12:38 AM
Seems to be possible to compare the low 8 bytes:
COMISD dest,source
The destination operand is an XMM register.
The source can be either an XMM register or a memory location.
The flags are set according to the following rules:
Result        ZF PF CF
Unordered     1  1  1
Greater than  0  0  0
Less than     0  0  1
Equal         1  0  0
Maybe it's possible to shift (or rotate) the regs and then compare the high 8 bytes?
Probably there are many ways to do it in more than one step.
I'm trying to find a single SIMD instruction for the task, like
PTEST, but available at the SSE3 level.
I'll do some more checking and see.
Quote from: frktons on November 27, 2012, 09:37:57 AM
Let's assume I use:
PCMPEQD xmm0,xmm1
considering this and the others don't affect the flags,
how do I jmp somewhere after the test?
If they are equal or not, what tells me that?
psubd xmm0, xmm1
pmovmskb eax, xmm0 ; set byte mask in eax
test eax, eax
Thanks Jochen, I'll arrange a new algo to test with your
suggestion.
I wrote a new CheckDestX PROC to use Jochen's suggestion:
; -----------------------------------------------------------------------------------------------
CheckDestX proc
lea eax, Dest
mov ebx, 32323232h
mov ecx, (4096/16)
movd xmm0, ebx
pshufd xmm0, xmm0, 0
@@:
movdqa xmm1, [eax]
psubd xmm1, xmm0
pmovmskb edx, xmm1 ; set byte mask in edx
test edx, edx
jne CheckErr
add eax, 16
dec ecx
jnz @B
CheckOK:
lea eax, Check
movq mm0, qword ptr TestOK
movq qword ptr [eax], mm0
jmp EndCheck
CheckErr:
lea eax, Check
movq mm0, qword ptr TestERR
movq qword ptr [eax], mm0
EndCheck:
ret
CheckDestX endp
It gives the same results as the CheckDest PROC and
is probably quite fast, but I haven't tested its performance yet.
But I'm still not satisfied with the CPUID results:
Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz (SSE4)
---------------------------------------------------------
13876 cycles for MOV AX - Test OK
8740 cycles for LEA - Test OK
4131 cycles for MMX/PUNPCKLBW - Test OK
3153 cycles for XMM/PSHUFB - I shot - Test OK
2376 cycles for XMM/PSHUFB - II shot - Test OK
12336 STOSB - Test OK
---------------------------------------------------------
9242 cycles for MOV AX - Test OK
8731 cycles for LEA - Test OK
4131 cycles for MMX/PUNPCKLBW - Test OK
3153 cycles for XMM/PSHUFB - I shot - Test OK
2376 cycles for XMM/PSHUFB - II shot - Test OK
12330 STOSB - Test OK
---------------------------------------------------------
--- ok ---
This time I've used PrintCpu and the MasmBasic include,
but the results are still not accurate. My PC has SSSE3
capability, not SSE4.
Only Alex's code, which I used a couple of years ago, gives
a more accurate result:
┌─────────────────────────────────────────────────────────────[27-Nov-2012 at 10:57 GMT]─┐
│OS : Microsoft Windows 7 Ultimate Edition, 64-bit Service Pack 1 (build 7601) │
│CPU : Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz with 2 logical core(s) with SSSE3 │
I've read the thread about the CPUID code, but didn't find anything new.
Should I still use Alex's code, or is there a more accurate routine for modern
CPUs?
CPU's may have changed a lot
but, operating systems change at a slower rate :biggrin:
i have a p4, which supports SSE3, running XP
XP does not support AVX instructions, nor does vista, as far as i know
our CPUID programs don't have to be updated very often, either - lol
while we might detect AVX support on a CPU (pretty easy),
it is another thing to judge the level of support offered by the OS (not so easy)
i would guess 97% of the ibm-compatible pc's in use today probably support SSE2
if you go any higher than SSE2, it might be a good idea to provide a fallback routine
it depends on what range of platforms you want your program to run on
Yes Dave, the reasoning is quite fair.
I'm talking about the incorrect data shown by old routines,
while we have newer routines, like Alex's, that are more
accurate, even if they don't go above SSE4.x.
Jochen's library is quite up to date and uses many SSE opcodes [I imagine],
but the PrintCpu macro [I think] should be updated to be more
accurate; it doesn't matter if it doesn't cover the latest AVX code or the like.
Well, it is just my opinion, of course. Even the CPUID utility that Intel gives us
http://downloadcenter.intel.com/Detail_Desc.aspx?DwnldID=7838
doesn't show that my PC has SSSE3 capabilities, but at least it doesn't say I have
SSE4.
oh - i see what you mean
well - there have been a few that report erroneously
but, to programmatically determine if a specific extension is supported is pretty easy
i.e., i wouldn't use "Alex's" or "Jochen's" or even "Dave's" routine
their purpose is to identify the CPU and capabilities, primarily for forum comparisons
that is a different function than identifying extension support for a program to select routines
what you want to do is actually much simpler :t
; 0_1 values come from CPUID function 1
; 8_1 values come from CPUID function 80000001h
;
; Source Description
;
; 0_1edx:23 MMX
; 8_1edx:22 MMX+ (AMD only)
; 8_1edx:31 3DNow! (AMD only)
; 8_1edx:30 3DNow!+ (AMD only)
; 0_1edx:25 SSE
; 0_1edx:26 SSE2
; 0_1ecx:00 SSE3
; 0_1ecx:09 SSSE3
; 0_1ecx:19 SSE4.1
; 0_1ecx:20 SSE4.2 (Intel only)
; 8_1ecx:06 SSE4a (AMD only)
; 8_1ecx:11 SSE5 (AMD only) - this became one of the AVX feature bits
you can get most of what you want to know by examining ECX and EDX after this...
mov eax,1
cpuid
for example, ECX bit 0 will be 1 if SSE3 is supported
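A short sketch of such a check for SSSE3 (function 1, ECX bit 9 in the table above), which is the level PSHUFB needs. The label names and messages are made up for illustration, and the sketch only tests the CPU feature bit; it does not cover OS support, which matters for AVX but not for SSSE3.

```masm
include \masm32\include\masm32rt.inc
.686p

.code
start:
    mov     eax, 1
    cpuid                        ; feature flags come back in ECX and EDX (EBX is clobbered too)
    bt      ecx, 9               ; copy ECX bit 9 (SSSE3) into the carry flag
    jnc     no_ssse3
    print   "SSSE3 supported: the PSHUFB routine can be used", 13, 10
    jmp     finish
no_ssse3:
    print   "SSSE3 not supported: fall back to the MMX or integer routine", 13, 10
finish:
    inkey
    exit
end start
```

In a real program the two branches would select a routine (for instance by storing a procedure address) instead of printing a message.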
Thanks Dave.
CPUID is still unknown territory for me; I've never been in that bit area.
Your introduction to the matter looks interesting, I'll give it a try. :t
i updated it a little Frank - you may want to reload the page :P
oh - and you have to use .586 or higher to use CPUID :t
I SEE SSE on the SEASHORE :icon_eek: 8)
Good to know.
say that 5 times real fast :lol:
Quote from: jj2007 on November 27, 2012, 11:27:06 AM
psubd xmm0, xmm1
pmovmskb eax, xmm0 ; set byte mask in eax
test eax, eax
This code is a little bit faster on my Core 2 duo:
psubd xmm1, xmm0
pmovmskb edx, xmm1 ; set byte mask in dx
cmp dx, 0
Well nidud :t
this seems to work as well as psubd, at the same performance.
So we have a couple of alternatives, at least.
nidud's code:
Intel(R) Core(TM) i7-3610QM CPU @ 2.30GHz (SSE4)
---------------------------------------------------------
2988 cycles for XMM/pcmpeqd
3004 cycles for XMM/psubd
---------------------------------------------------------
2987 cycles for XMM/pcmpeqd
3012 cycles for XMM/psubd
---------------------------------------------------------
2978 cycles for XMM/pcmpeqd
3001 cycles for XMM/psubd
---------------------------------------------------------
--- ok ---
----------------------------------------------------
Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz
Instructions: MMX, SSE1, SSE2, SSE3, SSSE3
----------------------------------------------------
9242 cycles for MOV AX - Test OK
8731 cycles for LEA - Test OK
4144 cycles for MMX/PUNPCKLBW - Test OK
3158 cycles for XMM/PSHUFB - I shot - Test OK
2368 cycles for XMM/PSHUFB - II shot - Test OK
12328 cycles for STOSB - Test OK
2070 cycles for CheckDest - Test OK
547 cycles for CheckDestC - Test OK
544 cycles for CheckDestX - Test OK
----------------------------------------------------
9241 cycles for MOV AX - Test OK
8728 cycles for LEA - Test OK
4130 cycles for MMX/PUNPCKLBW - Test OK
3153 cycles for XMM/PSHUFB - I shot - Test OK
2379 cycles for XMM/PSHUFB - II shot - Test OK
12335 cycles for STOSB - Test OK
2069 cycles for CheckDest - Test OK
548 cycles for CheckDestC - Test OK
543 cycles for CheckDestX - Test OK
----------------------------------------------------
CheckDestC is nidud's code modified. For the CPU and SSE level
I used Alex's routine.
nidud's latest code produces this on my laptop:
---------------------------------------------------------
Intel(R) Core(TM) i7-3610QM CPU @ 2.30GHz
Instructions: MMX, SSE1, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2
---------------------------------------------------------
6675 cycles for STOSB - Test OK
4240 cycles for LEA - Test OK
3353 cycles for MOV DX - Test OK
3276 cycles for MOV AX - Test OK
1924 cycles for MMX/PUNPCKLBW - Test OK
1213 cycles for XMM/PSHUFB - xmm0,xmm1 - Test OK
832 cycles for XMM/PSHUFB - I shot - Test OK
1539 cycles for XMM/PSHUFB - II shot - Test OK
---------------------------------------------------------
6093 cycles for STOSB - Test OK
3806 cycles for LEA - Test OK
3403 cycles for MOV DX - Test OK
3277 cycles for MOV AX - Test OK
1945 cycles for MMX/PUNPCKLBW - Test OK
808 cycles for XMM/PSHUFB - xmm0,xmm1 - Test OK
904 cycles for XMM/PSHUFB - I shot - Test OK
1490 cycles for XMM/PSHUFB - II shot - Test OK
---------------------------------------------------------
6289 cycles for STOSB - Test OK
3805 cycles for LEA - Test OK
3668 cycles for MOV DX - Test OK
3684 cycles for MOV AX - Test OK
3044 cycles for MMX/PUNPCKLBW - Test OK
888 cycles for XMM/PSHUFB - xmm0,xmm1 - Test OK
832 cycles for XMM/PSHUFB - I shot - Test OK
901 cycles for XMM/PSHUFB - II shot - Test OK
---------------------------------------------------------
6289 cycles for STOSB - Test OK
3805 cycles for LEA - Test OK
3240 cycles for MOV DX - Test OK
3255 cycles for MOV AX - Test OK
2527 cycles for MMX/PUNPCKLBW - Test OK
833 cycles for XMM/PSHUFB - xmm0,xmm1 - Test OK
832 cycles for XMM/PSHUFB - I shot - Test OK
858 cycles for XMM/PSHUFB - II shot - Test OK
---------------------------------------------------------
--- ok ---
Quote from: nidud on November 28, 2012, 05:53:33 PM
I rewrote the test file with a common loop count for all tests to even the result. I was wondering if using xmm0 register might be faster than xmm1, but the test seems to have random results, at least on this machine.
With regards to using pcmpeqd or psubd , I think the last one would be the better choice since this returns 0.
Edit: renamed test_pshufb to test_pshufb0
Since you changed the structure of some routines, the results are a little
bit different, I mean quite a lot different.
I still don't understand the logic of comparing two XMM registers with PSUBD.
If they are equal the result is zero, and after the PMOVMSKB it is possible to
test the final result register for zero.
But what happens if the source register is 1 greater than the destination one?
Does the PMOVMSKB detect the difference or not? According to what I've
got up to now, it shouldn't. ::)
Quote from: frktons on November 28, 2012, 10:51:37 PM
The PMOVMSKB does or doesn't detect the difference?
It does. Launch some tests with Olly to see what happens. Anyway, PCM*** does the same job as PSUBD, and they are equally fast (e.g. one cycle on my AMD).
I compared two XMM registers, with one of them greater
than the other.
According to this test, PSUBD doesn't detect the difference ::)
------------------------------------
Test on PCMPEQD - Test ERR
------------------------------------
Test on PSUBD - Test OK
------------------------------------
Press any key to continue ...
This is the code I used. Did I make any error?
; ---------------------------------------------------------------------
; TEST_PSUBD.ASM--
; http://www.masm32.com/board/index.php?topic=770.0
;-------------------------------------------------------------------------------
; Test the difference between PCMPEQD and PSUBD when comparing two XMM
; registers.
; 28/Nov/2012 - MASM FORUM - frktons
;-------------------------------------------------------------------------------
.nolist
include \masm32\include\masm32rt.inc
.686
.xmm
.data
align 8
Check db 8 dup(20h),0,0,0,0
PtrCheck dd Check
align 8
TestOK db "Test OK ",0,0,0,0
align 8
TestERR db "Test ERR",0,0,0,0
.code
start:
print "---------------------------------------------------------", 13, 10
print "Test on PCMPEQD - "
call PCMP_TEST
print PtrCheck, 13, 10
print "---------------------------------------------------------", 13, 10
print "Test on PSUBD - "
call PSUB_TEST
print PtrCheck, 13, 10
print "---------------------------------------------------------", 13, 10, 13, 10
inkey
exit
; -----------------------------------------------------------------------------------------------
PSUB_TEST proc
mov ebx, 32323232h
mov edx, 00000001h
movd xmm2, edx
pshufd xmm2, xmm2, 0
movd xmm0, ebx
pshufd xmm0, xmm0, 0
movdqa xmm1, xmm0
paddd xmm1, xmm2
psubd xmm1,xmm0
pmovmskb edx, xmm1 ; set byte mask in dx
cmp dx, 0
jne CheckErr
CheckOK:
lea eax, Check
movq mm0, qword ptr TestOK
movq qword ptr [eax], mm0
jmp EndCheck
CheckErr:
lea eax, Check
movq mm0, qword ptr TestERR
movq qword ptr [eax], mm0
EndCheck:
ret
PSUB_TEST endp
; -----------------------------------------------------------------------------------------------
PCMP_TEST proc
mov ebx, 32323232h
mov edx, 00000001h
movd xmm2, edx
pshufd xmm2, xmm2, 0
movd xmm0, ebx
pshufd xmm0, xmm0, 0
movdqa xmm1, xmm0
paddd xmm1, xmm2
pcmpeqd xmm1,xmm0
pmovmskb edx, xmm1 ; set byte mask in dx
cmp dx, 0FFFFh
jne CheckErr
CheckOK:
lea eax, Check
movq mm0, qword ptr TestOK
movq qword ptr [eax], mm0
jmp EndCheck
CheckErr:
lea eax, Check
movq mm0, qword ptr TestERR
movq qword ptr [eax], mm0
EndCheck:
ret
PCMP_TEST endp
end start
The logic is inverted:
pcmpeqb for xmm1=xmm0: xmm1 becomes ffffffffffffffffh
psubd for xmm1=xmm0: xmm1 becomes 0h
So what is my error? I was aware that the logic is inverted
and I tested:
cmp dx, 0
jne CheckErr
for PSUBD, and
cmp dx, 0FFFFh
jne CheckErr
for PCMPEQD. ::)
It seems pmovmskb always returns zero unless the most significant bit of an xmm byte is set (e.g. the bytes are FFh)...
---------------------------------------------------------
Test on PCMPEQD -
pcmpeqd in
xmm1 3617008641903833650
xmm0 3617008641903833650
pcmpeqd out xmm1 -1
pmovmskb
xmm1 -1
edx 65535
Test OK
---------------------------------------------------------
Test on PSUBD -
PSubD in
xmm1 3617008641903833650
xmm0 3617008641903833650
PSubD out xmm1 0
pmovmskb
xmm1 0
dx 0
Test OK
---------------------------------------------------------
---------------------------------------------------------
Test on PCMPEQD -
pcmpeqd in
xmm1 3617008646198800947
xmm0 3617008641903833650
pcmpeqd out xmm1 0
pmovmskb
xmm1 0
edx 0
Test ERR
---------------------------------------------------------
Test on PSUBD -
PSubD in
xmm1 3617008646198800947
xmm0 3617008641903833650
PSubD out xmm1 4294967297
pmovmskb
xmm1 4294967297
dx 0
Test OK
---------------------------------------------------------
When I read about PMOVMSKB in the Intel manuals,
I found something that doesn't fit with using it to
compare two XMM registers for equality:
Creates a mask made up of the most significant bit of each byte of the source
operand (second operand) and stores the result in the low byte or word of the destination
operand (first operand).
If only the most significant bits are stored into the destination operand, and the difference is in other
bits, it will not be detected.
So my idea is that after PSUBD we have to use a different opcode to
detect whether there are differences in bits other than the MSBs of the xmm registers we are testing.
On the other hand, using PCMPEQD we can test for both equality and difference
between the xmm registers, using PMOVMSKB.
This is what I've understood so far.
Using PSUBD is a smart solution, but it needs to be followed by something
other than PMOVMSKB, in my opinion.
So far nidud's solution is the one I understand. I'm waiting for other solutions.
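The PCMPEQD route can be sketched as a small helper. This is an illustration under stated assumptions, not nidud's code: XmmEqual, its return convention and the use of xmm2 as scratch are made up here. The point is that PCMPEQD turns every equal dword into FFFFFFFFh, so all sixteen most significant bits collected by PMOVMSKB are set only when the registers match completely; a difference in any bit clears a whole dword and is therefore visible.

```masm
include \masm32\include\masm32rt.inc
.686p
.xmm

.code
; Returns eax = 1 if xmm0 and xmm1 hold the same 128 bits, otherwise eax = 0.
; xmm2 is used as scratch so that both inputs survive the call.
XmmEqual proc
    movdqa   xmm2, xmm0
    pcmpeqd  xmm2, xmm1        ; equal dword -> FFFFFFFFh, different dword -> 00000000h
    pmovmskb eax, xmm2         ; bit 7 of each of the 16 bytes -> bits 0..15 of eax
    cmp      eax, 0FFFFh       ; all 16 bits set only if all four dwords matched
    sete     al
    movzx    eax, al
    ret
XmmEqual endp

start:
    mov      eax, 32323232h
    movd     xmm0, eax
    pshufd   xmm0, xmm0, 0     ; xmm0 = 32323232h in all four dwords
    mov      eax, 32323233h
    movd     xmm1, eax
    pshufd   xmm1, xmm1, 0     ; xmm1 differs from xmm0 only in bit 0 of each dword
    call     XmmEqual
    .if eax
        print "equal", 13, 10
    .else
        print "different", 13, 10
    .endif
    inkey
    exit
end start
```

With these inputs, which repeat the 32323232h versus 32323233h case from the test above, the program should report "different".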
Quote from: nidud on November 29, 2012, 04:37:01 AM
Maybe you could use CMPNEQPS
The result should then be zero if equal
Yes, probably this opcode will work as well.
Quote
What does pxor xmm0, xmm0 do?
The same thing as xor rax, rax?
Yes again. So far I think the PCMPEQD variant is the complete one
for testing equality. Something is missing, in my opinion, for PSUBD.
What do you want to compare? FP or integer value?
Quote from: qWord on November 29, 2012, 04:50:32 AM
What do you want to compare? FP or integer value?
xmm integer packed data is what I'd like to compare.
I want to know, for example, if xmm0 is equal to xmm1.
Quote from: frktons on November 29, 2012, 04:54:25 AM
Quote from: qWord on November 29, 2012, 04:50:32 AM
What do you want to compare? FP or integer value?
xmm integer packed data is what I'd like to compare.
I want to know, for example, if xmm0 is equal to xmm1.
So you are right with PCMPEQxx + PMOVMSKB.
Quote from: qWord on November 29, 2012, 04:50:32 AM
What do you want to compare? FP or integer value?
Test for equality only, so FP or INT won't make a difference. Although I wonder how CMPNEQPS aka CMPPS xmmDest, xmmSrc, 4 handles exotic cases (NaN vs 0 etc). In any case, PCMPEQxx is the right choice, as qWord already wrote.
Quote from: jj2007 on November 29, 2012, 05:02:03 AM
Although I wonder how CMPNEQPS aka CMPPS xmmDest, xmmSrc, 4 handles exotic cases (NaN vs 0 etc).
That can't work, because there are N possible ways to represent a NaN, whereas CMPxxPS returns true for all pairs of NaNs.
Quote from: qWord on November 29, 2012, 05:07:05 AM
That can't work, because there are N possible ways to represent a NaN, whereas CMPxxPS returns true for all pairs of NaNs.
Grazie, good to know :t
Quote from: jj2007 on November 29, 2012, 05:13:34 AM
Quote from: qWord on November 29, 2012, 05:07:05 AM
That can't work, because there are N possible ways to represent a NaN, whereas CMPxxPS returns true for all pairs of NaNs.
Grazie, good to know :t
Oops..., that applies to the unordered compare; for CMPEQPS it always returns false.
I'm glad to see everybody agreed eventually. :t
Now let's go further.
How do I compare for greater than?
:P
Same registers, same data type.
reverse the operands ?
Quote from: frktons on November 29, 2012, 05:24:12 AM
Now let's go further.
How do I compare for greater than?
:P
Same registers, same data type.
PCMPGTD ::)
Quote from: qWord on November 29, 2012, 06:20:50 AM
Quote from: frktons on November 29, 2012, 05:24:12 AM
Now let's go further.
How do I compare for greater than?
:P
Same registers, same data type.
PCMPGTD ::)
Thanks qWord, maybe this time I'll take less time to get the info.
Well it looks quite simple to manage:
Quote
If a data element in the destination operand is greater
than the corresponding data element in the source operand, the corresponding data
element in the destination operand is set to all 1s; otherwise, it is set to all 0s.
Quote from: nidud on November 29, 2012, 08:32:39 AM
To conclude:
pcmpeqb compares 16 bytes, pcmpeqd compares 4 doublewords, result in xmm0 is the same.
pcmpgtb compares 16 bytes, pcmpgtd compares 4 doublewords, result in xmm0 is not the same.
pcmpeqb xmm0,xmm1 ; xmm0 to FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFh if equal
pmovmskb eax,xmm0 ; eax to 0000FFFFh
cmp ax,0FFFFh
je is_equal
pcmpgtb xmm0,xmm1 ; xmm0 to FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFh if greater
pmovmskb eax,xmm0 ; eax to 0000FFFFh
cmp ax,8000h ; ?
jle is_great
I don't think so about the test for GT. So far I got:
when we use pcmpgtb we have FF only in the bytes that are greater, not in all of them,
and the same goes for the dwords: with pcmpgtd, only the dwords that are greater
are switched to FF; the remaining ones are switched to
00 if they are equal or less than.
Instead of this:
pcmpgtb xmm0,xmm1 ; xmm0 to FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFh if greater
pmovmskb eax,xmm0 ; eax to 0000FFFFh
cmp ax,8000h ; ?
jle is_great
Something like:
pcmpgtd xmm0,xmm1
pmovmskb eax,xmm0
.if bit ax, 15
jmp IsGreater
.endif
If bit 15 is not 1, the way is longer, and we have to test
other things:
pcmpgtd xmm2,xmm3; same values in reverse order
pmovmskb ebx,xmm2
.if bit bx, 15
jmp IsGreater; The second original value tested
.endif
.if ax == bx
jmp AreEqual
.elseif ax > bx
jmp IsGreater
.else
jmp IsLessThan
.endif
Not tested yet, but this is the idea.
Quote from: nidud on November 29, 2012, 10:59:37 AM
The upper byte should always be the same (FF,? or FFFFFFFF,? if greater)
Yes and no. It depends on whether the byte or dword tested is greater, not the whole xmm register.
But if the upper byte, word or dword is FF, you know the entire xmm register is greater; otherwise
you have to check other things.
Quote from: frktons on November 29, 2012, 09:52:49 AM
pcmpgtd xmm2,xmm3; same values in reverse order
pmovmskb ebx,xmm2
.if bit bx, 15
jmp IsGreater; The second original value tested
.endif
.if ax == bx
jmp AreEqual
.else
jmp IsLessThan; The second original value tested
.endif
Not tested yet, but this is the idea.
assuming ax is zero, I think that is correct
I modified the code, there was a logical error, have a look.
It should work with whatever value ax and bx assume.
Quote
same as:
test ah,80h
jnz is_great
test ax,ax
jz is_equal
jmp is_less
I think you have to use two registers, not only ax. As I understand it,
you cannot check for greater, equal or less than in a single pass. Only
if you are lucky do you find the answer in the first check: if the upper byte is FF,
you can say xmm0 is greater than xmm1.
So the complete test for greater, less than or equal should
be something like this:
pcmpgtd xmm0,xmm1
pmovmskb eax,xmm0
.if bit ax, 15
jmp IsGreater
.endif
pcmpgtd xmm2,xmm3; same values in reverse order
pmovmskb ebx,xmm2
.if bit bx, 15
jmp IsLessThan
.endif
.if ax == bx
jmp AreEqual
.elseif ax > bx
jmp IsGreater
.else
jmp IsLessThan
.endif
I'll try the code and let you know.
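The idea above can be written out in plain MASM as a rough sketch. It assumes that the two values are already in xmm0 and xmm1, that each dword is compared as a signed value (which is what PCMPGTD does), and that the highest dword is the most significant one. The name Cmp4Dwords is hypothetical. Because a lane cannot be both greater and less, the two masks never share a set bit, so a single unsigned compare of the masks tells which side owns the highest differing lane; the separate bit 15 checks are then not needed.

```masm
.686
.xmm
.model flat, stdcall
.code
; Hypothetical helper (not from the thread).
; In:  xmm0, xmm1 = four signed dwords each, highest lane most significant
; Out: eax = 1 if xmm0 > xmm1, 0 if equal, -1 if xmm0 < xmm1
; Clobbers xmm2, xmm3 and edx.
Cmp4Dwords PROC
    movdqa   xmm3, xmm0      ; copy of the first value
    movdqa   xmm2, xmm1      ; copy of the second value
    pcmpgtd  xmm3, xmm1      ; lanes where xmm0 > xmm1 become FFFFFFFFh
    pcmpgtd  xmm2, xmm0      ; lanes where xmm1 > xmm0 become FFFFFFFFh
    pmovmskb eax, xmm3       ; "greater" mask, 4 bits per dword lane
    pmovmskb edx, xmm2       ; "less" mask, 4 bits per dword lane
    cmp      eax, edx        ; unsigned compare of the two masks
    mov      eax, 0          ; mov does not change the flags set by cmp
    je       done            ; both masks are zero: all lanes equal
    mov      eax, 1
    ja       done            ; highest differing lane is a "greater" lane
    mov      eax, -1         ; otherwise it is a "less" lane
done:
    ret
Cmp4Dwords ENDP
END
```

For a true unsigned comparison the sign bit of each dword would have to be flipped first (for example by XORing both inputs with 80000000h in every lane), which this sketch does not do.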
Read my last posts again: yes, we can know if the
compare gives GT, LT or EQ.
my test for atol:
Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz (SSE4)
--------------------------------------------------------
429117 cycles for atol LODSB
350976 cycles for atol SHL
411231 cycles for atol LEA
--------------------------------------------------------
430242 cycles for atol LODSB
509282 cycles for atol SHL
395102 cycles for atol LEA
--------------------------------------------------------
well they are a bit random, as you said.
Memcpy:
Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz (SSE4)
--------------------------------------------------------
757223 cycles for memcpy A
288975 cycles for memcpy movdqa xmm0 A
1352024 cycles for memcpy movdqu xmm0 A
1367569 cycles for memcpy movdqu xmm0..xmm7 A
5668726 cycles for memcpy movdqu xmm0 U
4563076 cycles for memcpy movdqu xmm0..xmm7 U
--------------------------------------------------------
749649 cycles for memcpy A
302916 cycles for memcpy movdqa xmm0 A
1737163 cycles for memcpy movdqu xmm0 A
1841807 cycles for memcpy movdqu xmm0..xmm7 A
6136384 cycles for memcpy movdqu xmm0 U
4055501 cycles for memcpy movdqu xmm0..xmm7 U
--------------------------------------------------------
and the last routines:
Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz (SSE4)
--------------------------------------------------------
5825483 cycles for crt_memcpy A
7352188 cycles for memcpy A
7269901 cycles for memcpyd A
4146083 cycles for memcpy movdqa A
8700368 cycles for memcpy movdqu A
27212513 cycles for memcpy movdqu U
--------------------------------------------------------
6180618 cycles for crt_memcpy A
9100718 cycles for memcpy A
7028934 cycles for memcpyd A
4151090 cycles for memcpy movdqa A
11923657 cycles for memcpy movdqu A
29665081 cycles for memcpy movdqu U
--------------------------------------------------------
You might find the test we did a couple of years ago interesting:
memcpy?? Looks familiar (http://www.masmforum.com/board/index.php?topic=11454.msg87610#msg87610) :biggrin:
Quote from: nidud on December 04, 2012, 06:48:06 AM
We either all use the same CPU or stick to the basic then :lol:
Optimization is quite a strange beast indeed, it comes and goes
depending on many [maybe too many] factors.
Nevertheless we can try and find something that we didn't expect :lol:
Quote from: nidud on December 04, 2012, 06:48:06 AM
Intel(R) Core(TM) i3 CPU 540 @ 3.07GHz (SSE4)
6.065.640 cycles for RtlZeroMemory
5.992.196 cycles for rep stosd
7.240.756 cycles for MOVNTDQ
It seems Intel is still working on rep stosd..!
I tried it for fun (haven't even read messages here though so this is prolly bad code) :
Gabriel,
I tried to use the 3 algos to see what they did but I could not get it working, would it be possible for you to put the algos in a test piece ?
I think I did basically what the original post asked for, which was to convert an array of 4096 dwords to an array of 4096 bytes (src is the 4096 dword array and dst the 4096 byte array)
I'll try making a more flexible algorithm that doesn't assume size nor alignment I guess lol.
The different functions do the same thing, just with different extensions (progressing through them)
So I made a few new revised ones (attached), which should probably be a lot easier to test.
Basically they do what the original post asked to do, which is to move sz dwords from the array pointed to by src, transform them into bytes, and then put them into the array pointed to by dst.
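For readers without the attachment, here is a minimal SSE2 sketch of that operation. It is not Gabriel's attached code: the name DwordsToBytesSketch is hypothetical, and it assumes that src and dst are 16-byte aligned and that sz is a multiple of 16 (any remainder is simply ignored). Each dword is masked down to its low byte, then PACKSSDW and PACKUSWB narrow the values without saturating, since every value is already in the range 0..255.

```masm
.686
.xmm
.model flat, stdcall
.code
; Hypothetical sketch (assumptions: aligned buffers, sz a multiple of 16).
; dst = pointer to sz bytes, src = pointer to sz dwords, sz = element count.
DwordsToBytesSketch PROC uses esi edi dst:DWORD, src:DWORD, sz:DWORD
    mov      esi, src
    mov      edi, dst
    mov      ecx, sz
    shr      ecx, 4           ; 16 dwords are converted per iteration
    jz       done
    pcmpeqd  xmm7, xmm7       ; xmm7 = all ones
    psrld    xmm7, 24         ; each dword lane = 000000FFh (low-byte mask)
next16:
    movdqa   xmm0, [esi]      ; dwords 0..3
    movdqa   xmm1, [esi+16]   ; dwords 4..7
    movdqa   xmm2, [esi+32]   ; dwords 8..11
    movdqa   xmm3, [esi+48]   ; dwords 12..15
    pand     xmm0, xmm7       ; keep only the low byte of every dword
    pand     xmm1, xmm7
    pand     xmm2, xmm7
    pand     xmm3, xmm7
    packssdw xmm0, xmm1       ; 8 words, order preserved, no saturation
    packssdw xmm2, xmm3       ; 8 more words
    packuswb xmm0, xmm2       ; 16 bytes in the original order
    movdqa   [edi], xmm0
    add      esi, 64
    add      edi, 16
    dec      ecx
    jnz      next16
done:
    ret
DwordsToBytesSketch ENDP
END
```

With the buffer from the original question it would be called as `invoke DwordsToBytesSketch, addr dstBuf, addr srcBuf, 4096`, where dstBuf and srcBuf are placeholder names for the caller's aligned buffers.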
What have you built this with ? I downloaded the latest version of ML.EXE but it throws an error on "movd".
Microsoft (R) Macro Assembler Version 14.15.26730.0
Copyright (C) Microsoft Corporation. All rights reserved.
Assembling: K:\asm32\gabriel\2\dwordToByte.asm
K:\asm32\gabriel\2\dwordToByte.asm(90) : error A2070:invalid instruction operands
K:\asm32\gabriel\2\dwordToByte.asm(91) : error A2070:invalid instruction operands
K:\asm32\gabriel\2\dwordToByte.asm(182) : error A2070:invalid instruction operands
OK, that was easy to fix, just added DWORD PTR to the three lines and it builds into an object module with no problems.
I prototyped the three procedures so it should be callable using normal MASM "invoke".
dwordToByte PROTO dst:dword, src:dword, sz:dword
dwordToByteSSE2 PROTO dst:dword, src:dword, sz:dword
dwordToByteSSSE3 PROTO dst:dword, src:dword, sz:dword
Now all I need is the data format that calls the three procedures.
I used JWasm, I guess it's more flexible and knows that movd always uses dword ptr for memory operands
Also, I used dword instead of ptr for dst and src lel.
But um dst is a pointer to an array of sz bytes (so its size is sz), src is a pointer to an array of sz dwords (so its size is sz * 4), and sz is the size.
Also I attached a version that should work with MASM. I also just found Uasm, so I'll prolly be able to make versions for AVX2 and later.
You will certainly do better with UASM as John has done some very good work there, JWASM was rough around the edges and was not properly MASM compatible. Don't be afraid to have a look at nidud's ASMC either as he has done a lot of good work as well. It is pretty much the case that ML.EXE is the only 32 bit MASM compatible assembler as it is the reference.
In the Intel 64 and IA-32 Architectures Optimization Reference Manual there is a chapter on Data Gather and Scatter, which is actually what we are talking about here.