Comparing 128-bit numbers aka OWORDs

Antariy · August 27, 2013, 10:16:33 PM

Hi Steve! Thank you :t

Quote from: FORTRANS on August 27, 2013, 09:47:36 PM
Okay, here it is. 16-bit, but would be easy to convert to 32-bits.

Here is the algo that uses this idea, too :t
It uses SSE2 instruction PCMPEQB to find non-mathing bytes, instead of CMPSB, but other logic is the same.

Cmp128JJAlexSSE_1 MACRO ow0:REQ, ow1:REQ
LOCAL @l1, @l2
   movups xmm0,[ow0]
   movups xmm1,[ow1]
   pcmpeqb   xmm0,xmm1
   pmovmskb ecx,xmm0
   xor ecx,0FFFFh
   jz @l2
   and ecx,7FFFh
   bsr ecx,ecx
   mov ah,byte ptr [ow0+15]
   mov dh,byte ptr [ow1+15]
   mov al,byte ptr [ow0+ecx]
   mov dl,byte ptr [ow1+ecx]
   cmp ax,dx
   @l2:
ENDM

(this version is faster than earlier included in the testbed)

FORTRANS · August 27, 2013, 10:20:43 PM

Code Select

 
Mobile Intel(R) Celeron(R) processor     600MHz (SSE2)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
------------------------------------------------------
720551	cycles [x][x][x] - Cmp128Nidud
760062	cycles [x][x][x] - Cmp128NidudSSE
1047587	cycles [x][x][x] - Cmp128Dave
2544535	cycles [x][x][x] - Cmp128Dave2
1452129	cycles [x][x][x] - Cmp128JJAlexSSE_1
1704401	cycles [x][x][x] - Cmp128JJAlexSSE_2
1711250	cycles [x][x][x] - Cmp128JJAlexSSE_3
832572	cycles [x][x][ ] - Cmp128Alex
1228790	cycles [x][x][x] - Cmp128Alex_2
1249508	cycles [x][x][x] - Cmp128Alex_3
1849514	cycles [x][x][ ] - Cmp128JJSSE
959424	cycles [x][x][ ] - AxCMP128bitProc3
1008975	cycles [x][x][ ] - AxCMP128bitProc3c (cmov)
652808	cycles [x][ ][ ] - Cmp128DaveU
649697	cycles [x][ ][ ] - Cmp128NidudU

--- ok ---

FORTRANS · August 27, 2013, 10:25:40 PM

Quote from: Antariy on August 27, 2013, 10:16:33 PM

Here is the algo that uses this idea, too :t
It uses SSE2 instruction PCMPEQB to find non-matching bytes, instead of CMPSB, but other logic is the same.

Hi Alex,

Interesting. Nice to see the algorithm reworked.

Thanks,

Steve N.

Antariy · August 27, 2013, 10:30:40 PM

Quote from: FORTRANS on August 27, 2013, 10:25:40 PM
Interesting. Nice to see the algorithm reworked.

Yes, it always interesting to see different implementations of an idea

Thank you for the test, Steve! :t

nidud · August 28, 2013, 01:36:24 AM

deleted

FORTRANS · August 28, 2013, 04:13:24 AM

Hi,

First some more results.

Code Select


pre-P4 (SSE1)

    All Flags Set: OV NG ZR AC PE CY
            BSF 1: OV NG NZ AC PE CY
            BSR 1: OV NG NZ AC PE CY

All Flags Cleared: NV PL NZ NA PO NC
            BSF 1: NV PL NZ NA PO NC
            BSR 1: NV PL NZ NA PO NC

Press any key to continue ...

Intel(R) Pentium(R) M processor 1.70GHz (SSE2)

    All Flags Set: OV NG ZR AC PE CY
            BSF 1: OV NG NZ AC PE CY
            BSR 1: OV NG NZ AC PE CY

All Flags Cleared: NV PL NZ NA PO NC
            BSF 1: NV PL NZ NA PO NC
            BSR 1: NV PL NZ NA PO NC

Press any key to continue ...

pre-P4
    All Flags Set: OV NG ZR AC PE CY
            BSF 1: NV NG NZ AC PE NC
            BSR 1: OV NG NZ AC PE NC

All Flags Cleared: NV PL NZ NA PO NC
            BSF 1: NV NG NZ AC PE NC
            BSR 1: OV NG NZ AC PE NC

Press any key to continue ...

Mobile Intel(R) Celeron(R) processor     600MHz (SSE2)

    All Flags Set: OV NG ZR AC PE CY
            BSF 1: OV NG NZ AC PE CY
            BSR 1: OV NG NZ AC PE CY

All Flags Cleared: NV PL NZ NA PO NC
            BSF 1: NV PL NZ NA PO NC
            BSR 1: NV PL NZ NA PO NC

Press any key to continue ...

Second, I tried to visualize the problems with setting the flags
correctly. So I wrote a program to plot the flags for a normal
byte comparison and then for a truncated byte representing a
partial data set. It did not help me in any particular way. But
here it is anyway. Mode 12H graphics.

Regards,

Steve N.

Antariy · August 28, 2013, 09:19:12 AM

Quote from: nidud on August 28, 2013, 01:36:24 AM
I like the "Intel 8086 Family Architecture" document because it includes timings for the upcodes, and I assumed backward compatible on all instruction set architectures based on the Intel 8086 CPU.

QuoteMany additions and extensions have been added to the x86 instruction set over the years, almost consistently with full backward compatibility

It's not your error :t (BTW in the Intel's 80386 reference information is as in your reference)

As for clocks - after PMMX and especially after P6 family released, old clocks numbers information may be very outdated, the more so this for more or less modern CPUs (here we have totally unpredictable timings not only between manufacturers, but even inside different models of one microarchitecture).

Quote from: FORTRANS on August 28, 2013, 04:13:24 AM
Hi,

First some more results.

Code Select Expand
pre-P4 (SSE1) All Flags Set: OV NG ZR AC PE CY BSF 1: OV NG NZ AC PE CY BSR 1: OV NG NZ AC PE CY All Flags Cleared: NV PL NZ NA PO NC BSF 1: NV PL NZ NA PO NC BSR 1: NV PL NZ NA PO NC Press any key to continue ... Intel(R) Pentium(R) M processor 1.70GHz (SSE2) All Flags Set: OV NG ZR AC PE CY BSF 1: OV NG NZ AC PE CY BSR 1: OV NG NZ AC PE CY All Flags Cleared: NV PL NZ NA PO NC BSF 1: NV PL NZ NA PO NC BSR 1: NV PL NZ NA PO NC Press any key to continue ... pre-P4 All Flags Set: OV NG ZR AC PE CY BSF 1: NV NG NZ AC PE NC BSR 1: OV NG NZ AC PE NC All Flags Cleared: NV PL NZ NA PO NC BSF 1: NV NG NZ AC PE NC BSR 1: OV NG NZ AC PE NC Press any key to continue ... Mobile Intel(R) Celeron(R) processor 600MHz (SSE2) All Flags Set: OV NG ZR AC PE CY BSF 1: OV NG NZ AC PE CY BSR 1: OV NG NZ AC PE CY All Flags Cleared: NV PL NZ NA PO NC BSF 1: NV PL NZ NA PO NC BSR 1: NV PL NZ NA PO NC Press any key to continue ...

Steve, is the third result for your PMMX?

The program looks very representative (especially the screen with all the graphs combined and colored), but I'm not sure I understand how to read the graphs :icon_redface: Can you help?

FORTRANS · August 28, 2013, 10:26:34 PM

Quote from: Antariy on August 28, 2013, 09:19:12 AM
Quote from: FORTRANS on August 28, 2013, 04:13:24 AM
Hi,

First some more results.

Steve, is the third result for your PMMX?

Yes, the P-MMX with Windows 98. A pity that it
does not follow Intel's rules.

Quote
The program looks very representative (especially the screen with all the graphs combined and colored), but I'm not sure I understand how to read the graphs :icon_redface: Can you help?

Maybe. I was trying to visualize what information from
the high nybble / byte / double, in a byte / word / OWORD
would tell me about the complete result. So I took the row
and column counters, both bytes, and plotted the results of
a compare for each of the four flags we were looking at.

Code Select


        MOV     AH,[RowCount]
        MOV     AL,[ColCount]
        CMP     AL,AH
        MOV     [ucPixel],0
        MOV     BX,[SaveBX]
        MOV     SI,[TestTable+BX]
        PUSHF
        POP     DI
        TEST    SI,DI
        JZ      Start5
        MOV     [ucPixel],15
Start5:
        CALL SetPixel10

If the flag is set, the pixel is white, otherwise black for the first
four plots or graphs. I then do the same for a truncated case.

Code Select


        MOV     AH,[RowCount]
        MOV     AL,[ColCount]
        AND     AH,0F0H
        AND     AL,0F0H
        CMP     AL,AH
        MOV     [ucPixel],0
        MOV     BX,[SaveBX]
        MOV     SI,[TestTable+BX]
        PUSHF
        POP     DI
        TEST    SI,DI
        JZ      Start8
        MOV     [ucPixel],15
Start8:
        CALL SetPixel10

I was hoping that it would show a short-cut or some such.
All it showed was that for the majority of cases there is no such
short-cut that I could possibly see. You have to do a bit better.
(I think, at least that is what I took away from this.)

For the color plot, I am using the Mode 12H planar graphics
mode. Sixteen colors with four planes. So I assigned a flag to
its own plane. So the colors show what flags are set by the
comparison. I probably should change the palette to make the
results be clearer. But I saw what I needed to, and so did not
bother. I could update that if you think it would help. (And you
see a good and proper mapping of the four flags to three primary
colors.)

The only notable fact that I saw was if the Zero Flag is set, no
others being considered are set as well. So you can take an early
exit from the algorithm if the CMPS result is zero. I did not bother
with mine.* (Though I, or someone, should time both versions.)

Given the other colors, any combination of the other three flags
is possible.** And again, no obvious simplification was apparent to
me.

I hope that explains most if not all of your question.

Regards,

Steve N.

Edit:

* Zero is set, on average, once out of 256 times for the byte
comparison. And it just gets worse as the size increases.

SRN

Edit:

** Said that wrong. The colors show which flags can be set
together, as shown by the labels. Not all the possibilities occur.

SRN

dedndave · August 28, 2013, 10:28:14 PM

that is really strange
if it isn't going to execute the instruction correctly, at least it could BSOD or something

nidud · August 29, 2013, 12:33:05 AM

deleted

dedndave · August 29, 2013, 02:52:20 AM

prescott w/htt

Code Select

Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
------------------------------------------------------
2610027 cycles [x][x][x] - Cmp128Nidud
2886072 cycles [x][x][x] - Cmp128NidudSSE
2616611 cycles [x][x][x] - Cmp128Dave
3812015 cycles [x][x][x] - Cmp128Dave2
1698629 cycles [x][x][x] - Cmp128JJAlexSSE_1
1574406 cycles [x][x][x] - Cmp128JJAlexSSE_2
1786771 cycles [x][x][x] - Cmp128JJAlexSSE_3
953366  cycles [x][x][ ] - Cmp128Alex
1705162 cycles [x][x][x] - Cmp128Alex_2
1775305 cycles [x][x][x] - Cmp128Alex_3
1896256 cycles [x][x][ ] - Cmp128JJSSE
1310303 cycles [x][x][ ] - AxCMP128bitProc3
1255525 cycles [x][x][ ] - AxCMP128bitProc3c (cmov)
697520  cycles [x][ ][ ] - Cmp128DaveU
717197  cycles [x][ ][ ] - Cmp128NidudU

jj2007 · August 29, 2013, 03:43:56 AM

Code Select

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
------------------------------------------------------
987     kCycles [x][x][x] - Cmp128Nidud
1066    kCycles [x][x][x] - Cmp128NidudSSE
956     kCycles [x][x][x] - Cmp128Dave
2577    kCycles [x][x][x] - Cmp128Dave2
820     kCycles [x][x][x] - Cmp128JJAlexSSE_1 <<<<<<<<<<
929     kCycles [x][x][x] - Cmp128JJAlexSSE_2
950     kCycles [x][x][x] - Cmp128JJAlexSSE_3
688     kCycles [x][x][ ] - Cmp128Alex
1066    kCycles [x][x][x] - Cmp128Alex_2
1112    kCycles [x][x][x] - Cmp128Alex_3
1109    kCycles [x][x][ ] - Cmp128JJSSE
862     kCycles [x][x][ ] - AxCMP128bitProc3
870     kCycles [x][x][ ] - AxCMP128bitProc3c (cmov)
598     kCycles [x][ ][ ] - Cmp128DaveU
586     kCycles [x][ ][ ] - Cmp128NidudU

nidud · August 29, 2013, 04:44:55 AM

deleted

FORTRANS · August 30, 2013, 06:09:11 AM

Hi,

I made the PlotFlag program with an improved color scheme.
And I fixed an error in the labeling of the combined flags. Oops.
I updated Reply #260 with the fixed program. The Overflow,
Sign, and Carry flags are assigned to blue. green, and red so
that their combinations should make sense.

Regards,

Steve N.

Antariy · August 30, 2013, 12:22:20 PM

Hi Steve! Thanks for your explanation

Yes, now it explains things - earlier I did not actually get how the the graps correspond to flags, after your more detailed explanation it got clear.

Quote from: nidud on August 29, 2013, 12:33:05 AM
ok, the BSF thing didn't work, so it's one step back :P

Code Select


Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
------------------------------------------------------
2799747 cycles [x][x][x] - Cmp128Nidud
3028917 cycles [x][x][x] - Cmp128NidudSSE
2740311 cycles [x][x][x] - Cmp128Dave
4045304 cycles [x][x][x] - Cmp128Dave2
1664740 cycles [x][x][x] - Cmp128JJAlexSSE_1
1633226 cycles [x][x][x] - Cmp128JJAlexSSE_2
1878213 cycles [x][x][x] - Cmp128JJAlexSSE_3
973406  cycles [x][x][ ] - Cmp128Alex
1835600 cycles [x][x][x] - Cmp128Alex_2
1878425 cycles [x][x][x] - Cmp128Alex_3
1968746 cycles [x][x][ ] - Cmp128JJSSE
1374576 cycles [x][x][ ] - AxCMP128bitProc3
1290053 cycles [x][x][ ] - AxCMP128bitProc3c (cmov)
739767  cycles [x][ ][ ] - Cmp128DaveU
744369  cycles [x][ ][ ] - Cmp128NidudU

--- ok ---

BTW, in my testbed I replaced couple of macroses but not posted them (the results above are not for the updated procs, though):

Cmp128JJAlexSSE_1 MACRO ow0:REQ, ow1:REQ
LOCAL @l1, @l2
   movups xmm0,[ow0]
   movups xmm1,[ow1]
   pcmpeqb   xmm0,xmm1
   pmovmskb ecx,xmm0
   xor ecx,0FFFFh
   jz @l2
   and ecx,7FFFh
   bsr ecx,ecx
   mov ah,byte ptr [ow0+15]
   mov dh,byte ptr [ow1+15]
   mov al,byte ptr [ow0+ecx]
   mov dl,byte ptr [ow1+ecx]
   cmp ax,dx
   @l2:
ENDM

and

Cmp128JJAlexSSE_2 MACRO ow0:REQ, ow1:REQ
LOCAL @l1, @l2
   movups xmm0,[ow0]
   movups xmm1,[ow1]
   pcmpeqb   xmm0,xmm1
   pmovmskb ecx,xmm0
   xor ecx,0FFFFh
   jz @l2
   and ecx,7FFFh
   bsr ecx,ecx
   movzx eax,byte ptr [ow0+14]
   movzx edx,byte ptr [ow1+14]
   mov al,byte ptr [ow0+ecx]
   mov dl,byte ptr [ow1+ecx]
   cmp ax,dx
   @l2:
ENDM

Quote from: nidud on August 29, 2013, 04:44:55 AM
Hmm, maybe the CPU's which "cheat" with the BSF/BSR flags are faster
Alex/JJ use BSR

My test: "no cheat"
Code Select Expand
701224 cycles [x][x][x] - Cmp128NidudSSE 1068347 cycles [x][x][x] - Cmp128JJAlexSSE_1
Siekmanski: "no cheat"
Code Select Expand
730067 cycles [x][x][x] - Cmp128NidudSSE 784587 cycles [x][x][x] - Cmp128JJAlexSSE_1

Alex: "cheat"
Code Select Expand
2441613 cycles [x][x][x] - Cmp128NidudSSE 1724226 cycles [x][x][x] - Cmp128JJAlexSSE_1
Dave: "cheat"
Code Select Expand
2886072 cycles [x][x][x] - Cmp128NidudSSE 1698629 cycles [x][x][x] - Cmp128JJAlexSSE_1

Well, actually it should not be so, becase, as we see desktop PIV models (my and Dave's Prescotts) not just trash all the flags, but rather set them logically correct (zero all "unused after instruction" flags, set parity flag according to the result - though it should not even bother with it, and set zero flag as defined) - instead of leaving them in unchanged state, so, it should be even slower on our CPUs

But the same PIV models are much more slower than more modern CPUs with some other instructions, too, and those are the real reason: SSE instructions, SBB instruction, LAHF/SAHF.

Cmp128JJAlexSSE_1 uses far more faster (on PIV) code in GRP part without SUBs/SBBs.
Fully-GPR Cmp128Alex_2 is faster than Cmp128Nidud because of this, too.

On modern CPUs your GPR code "should be" faster than my GPR code, for an instance. But for SSE code there are very different timings in the thread, probably, on very modern Intel CPUs models my SSE code "should be" faster.

The MASM Forum

News:

Comparing 128-bit numbers aka OWORDs

Antariy

FORTRANS

FORTRANS

Antariy

nidud

FORTRANS

Antariy

FORTRANS

dedndave

nidud

dedndave

jj2007

nidud

FORTRANS

Antariy