News:

Masm32 SDK description, downloads and other helpful links
Message to All Guests

Main Menu

Comparing 128-bit numbers aka OWORDs

Started by jj2007, August 12, 2013, 08:25:24 PM

Previous topic - Next topic

Antariy

Hi Steve! Thank you :t

Quote from: FORTRANS on August 27, 2013, 09:47:36 PM
   Okay, here it is.  16-bit, but would be easy to convert to 32-bits.

Here is the algo that uses this idea, too :t
It uses SSE2 instruction PCMPEQB to find non-mathing bytes, instead of CMPSB, but other logic is the same.


Cmp128JJAlexSSE_1 MACRO ow0:REQ, ow1:REQ
LOCAL @l1, @l2
   movups xmm0,[ow0]
   movups xmm1,[ow1]
   pcmpeqb   xmm0,xmm1
   pmovmskb ecx,xmm0
   xor ecx,0FFFFh
   jz @l2
   and ecx,7FFFh
   bsr ecx,ecx
   mov ah,byte ptr [ow0+15]
   mov dh,byte ptr [ow1+15]
   mov al,byte ptr [ow0+ecx]
   mov dl,byte ptr [ow1+ecx]
   cmp ax,dx
   @l2:
ENDM


(this version is faster than earlier included in the testbed)

FORTRANS


Mobile Intel(R) Celeron(R) processor     600MHz (SSE2)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
------------------------------------------------------
720551 cycles [x][x][x] - Cmp128Nidud
760062 cycles [x][x][x] - Cmp128NidudSSE
1047587 cycles [x][x][x] - Cmp128Dave
2544535 cycles [x][x][x] - Cmp128Dave2
1452129 cycles [x][x][x] - Cmp128JJAlexSSE_1
1704401 cycles [x][x][x] - Cmp128JJAlexSSE_2
1711250 cycles [x][x][x] - Cmp128JJAlexSSE_3
832572 cycles [x][x][ ] - Cmp128Alex
1228790 cycles [x][x][x] - Cmp128Alex_2
1249508 cycles [x][x][x] - Cmp128Alex_3
1849514 cycles [x][x][ ] - Cmp128JJSSE
959424 cycles [x][x][ ] - AxCMP128bitProc3
1008975 cycles [x][x][ ] - AxCMP128bitProc3c (cmov)
652808 cycles [x][ ][ ] - Cmp128DaveU
649697 cycles [x][ ][ ] - Cmp128NidudU

--- ok ---

FORTRANS

Quote from: Antariy on August 27, 2013, 10:16:33 PM

Here is the algo that uses this idea, too :t
It uses SSE2 instruction PCMPEQB to find non-matching bytes, instead of CMPSB, but other logic is the same.

Hi Alex,

   Interesting.  Nice to see the algorithm reworked.

Thanks,

Steve N.

Antariy

Quote from: FORTRANS on August 27, 2013, 10:25:40 PM
   Interesting.  Nice to see the algorithm reworked.

Yes, it always interesting to see different implementations of an idea :biggrin:

Thank you for the test, Steve! :t

nidud

#259
deleted

FORTRANS

#260
Hi,

   First some more results.


pre-P4 (SSE1)

    All Flags Set: OV NG ZR AC PE CY
            BSF 1: OV NG NZ AC PE CY
            BSR 1: OV NG NZ AC PE CY

All Flags Cleared: NV PL NZ NA PO NC
            BSF 1: NV PL NZ NA PO NC
            BSR 1: NV PL NZ NA PO NC

Press any key to continue ...

Intel(R) Pentium(R) M processor 1.70GHz (SSE2)

    All Flags Set: OV NG ZR AC PE CY
            BSF 1: OV NG NZ AC PE CY
            BSR 1: OV NG NZ AC PE CY

All Flags Cleared: NV PL NZ NA PO NC
            BSF 1: NV PL NZ NA PO NC
            BSR 1: NV PL NZ NA PO NC

Press any key to continue ...

pre-P4
    All Flags Set: OV NG ZR AC PE CY
            BSF 1: NV NG NZ AC PE NC
            BSR 1: OV NG NZ AC PE NC

All Flags Cleared: NV PL NZ NA PO NC
            BSF 1: NV NG NZ AC PE NC
            BSR 1: OV NG NZ AC PE NC

Press any key to continue ...

Mobile Intel(R) Celeron(R) processor     600MHz (SSE2)

    All Flags Set: OV NG ZR AC PE CY
            BSF 1: OV NG NZ AC PE CY
            BSR 1: OV NG NZ AC PE CY

All Flags Cleared: NV PL NZ NA PO NC
            BSF 1: NV PL NZ NA PO NC
            BSR 1: NV PL NZ NA PO NC

Press any key to continue ...


   Second, I tried to visualize the problems with setting the flags
correctly.  So I wrote a program to plot the flags for a normal
byte comparison and then for a truncated byte representing a
partial data set.  It did not help me in any particular way.  But
here it is anyway.  Mode 12H graphics.

Regards,

Steve N.

Antariy

Quote from: nidud on August 28, 2013, 01:36:24 AM
I like the "Intel 8086 Family Architecture" document because it includes timings for the upcodes, and I assumed backward compatible on all instruction set architectures based on the Intel 8086 CPU.

QuoteMany additions and extensions have been added to the x86 instruction set over the years, almost consistently with full backward compatibility

It's not your error :t (BTW in the Intel's 80386 reference information is as in your reference)

As for clocks - after PMMX and especially after P6 family released, old clocks numbers information may be very outdated, the more so this for more or less modern CPUs (here we have totally unpredictable timings not only between manufacturers, but even inside different models of one microarchitecture).


Quote from: FORTRANS on August 28, 2013, 04:13:24 AM
Hi,

   First some more results.


pre-P4 (SSE1)

    All Flags Set: OV NG ZR AC PE CY
            BSF 1: OV NG NZ AC PE CY
            BSR 1: OV NG NZ AC PE CY

All Flags Cleared: NV PL NZ NA PO NC
            BSF 1: NV PL NZ NA PO NC
            BSR 1: NV PL NZ NA PO NC

Press any key to continue ...

Intel(R) Pentium(R) M processor 1.70GHz (SSE2)

    All Flags Set: OV NG ZR AC PE CY
            BSF 1: OV NG NZ AC PE CY
            BSR 1: OV NG NZ AC PE CY

All Flags Cleared: NV PL NZ NA PO NC
            BSF 1: NV PL NZ NA PO NC
            BSR 1: NV PL NZ NA PO NC

Press any key to continue ...

pre-P4
    All Flags Set: OV NG ZR AC PE CY
            BSF 1: NV NG NZ AC PE NC
            BSR 1: OV NG NZ AC PE NC

All Flags Cleared: NV PL NZ NA PO NC
            BSF 1: NV NG NZ AC PE NC
            BSR 1: OV NG NZ AC PE NC

Press any key to continue ...

Mobile Intel(R) Celeron(R) processor     600MHz (SSE2)

    All Flags Set: OV NG ZR AC PE CY
            BSF 1: OV NG NZ AC PE CY
            BSR 1: OV NG NZ AC PE CY

All Flags Cleared: NV PL NZ NA PO NC
            BSF 1: NV PL NZ NA PO NC
            BSR 1: NV PL NZ NA PO NC

Press any key to continue ...


Steve, is the third result for your PMMX?

The program looks very representative (especially the screen with all the graphs combined and colored), but I'm not sure I understand how to read the graphs :icon_redface: Can you help?

FORTRANS

#262
Quote from: Antariy on August 28, 2013, 09:19:12 AM
Quote from: FORTRANS on August 28, 2013, 04:13:24 AM
Hi,

   First some more results.


Steve, is the third result for your PMMX?


   Yes, the P-MMX with Windows 98.  A pity that it
does not follow Intel's rules.

Quote
The program looks very representative (especially the screen with all the graphs combined and colored), but I'm not sure I understand how to read the graphs :icon_redface: Can you help?

   Maybe.  I was trying to visualize what information from
the high nybble / byte / double, in a byte / word / OWORD
would tell me about the complete result.  So I took the row
and column counters, both bytes, and plotted the results of
a compare for each of the four flags we were looking at.


        MOV     AH,[RowCount]
        MOV     AL,[ColCount]
        CMP     AL,AH
        MOV     [ucPixel],0
        MOV     BX,[SaveBX]
        MOV     SI,[TestTable+BX]
        PUSHF
        POP     DI
        TEST    SI,DI
        JZ      Start5
        MOV     [ucPixel],15
Start5:
        CALL SetPixel10


   If the flag is set, the pixel is white, otherwise black for the first
four plots or graphs.  I then do the same for a truncated case.


        MOV     AH,[RowCount]
        MOV     AL,[ColCount]
        AND     AH,0F0H
        AND     AL,0F0H
        CMP     AL,AH
        MOV     [ucPixel],0
        MOV     BX,[SaveBX]
        MOV     SI,[TestTable+BX]
        PUSHF
        POP     DI
        TEST    SI,DI
        JZ      Start8
        MOV     [ucPixel],15
Start8:
        CALL SetPixel10


   I was hoping that it would show a short-cut or some such.
All it showed was that for the majority of cases there is no such
short-cut that I could possibly see.  You have to do a bit better.
(I think, at least that is what I took away from this.)

   For the color plot, I am using the Mode 12H planar graphics
mode.  Sixteen colors with four planes.  So I assigned a flag to
its own plane.  So the colors show what flags are set by the
comparison.  I probably should change the palette to make the
results be clearer.  But I saw what I needed to, and so did not
bother.  I could update that if you think it would help.  (And you
see a good and proper mapping of the four flags to three primary
colors.)

   The only notable fact that I saw was if the Zero Flag is set, no
others being considered are set as well.  So you can take an early
exit from the algorithm if the CMPS result is zero.  I did not bother
with mine.*  (Though I, or someone, should time both versions.)

    Given the other colors, any combination of the other three flags
is possible.**  And again, no obvious simplification was apparent to
me.

   I hope that explains most if not all of your question.

Regards,

Steve N.

Edit:

*  Zero is set, on average, once out of 256 times for the byte
comparison.  And it just gets worse as the size increases.

SRN

Edit:

**  Said that wrong.  The colors show which flags can be set
together, as shown by the labels.  Not all the possibilities occur.

SRN

dedndave

that is really strange
if it isn't going to execute the instruction correctly, at least it could BSOD or something   :biggrin:

nidud

#264
deleted

dedndave

prescott w/htt
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
------------------------------------------------------
2610027 cycles [x][x][x] - Cmp128Nidud
2886072 cycles [x][x][x] - Cmp128NidudSSE
2616611 cycles [x][x][x] - Cmp128Dave
3812015 cycles [x][x][x] - Cmp128Dave2
1698629 cycles [x][x][x] - Cmp128JJAlexSSE_1
1574406 cycles [x][x][x] - Cmp128JJAlexSSE_2
1786771 cycles [x][x][x] - Cmp128JJAlexSSE_3
953366  cycles [x][x][ ] - Cmp128Alex
1705162 cycles [x][x][x] - Cmp128Alex_2
1775305 cycles [x][x][x] - Cmp128Alex_3
1896256 cycles [x][x][ ] - Cmp128JJSSE
1310303 cycles [x][x][ ] - AxCMP128bitProc3
1255525 cycles [x][x][ ] - AxCMP128bitProc3c (cmov)
697520  cycles [x][ ][ ] - Cmp128DaveU
717197  cycles [x][ ][ ] - Cmp128NidudU

jj2007

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
------------------------------------------------------
987     kCycles [x][x][x] - Cmp128Nidud
1066    kCycles [x][x][x] - Cmp128NidudSSE
956     kCycles [x][x][x] - Cmp128Dave
2577    kCycles [x][x][x] - Cmp128Dave2
820     kCycles [x][x][x] - Cmp128JJAlexSSE_1 <<<<<<<<<<
929     kCycles [x][x][x] - Cmp128JJAlexSSE_2
950     kCycles [x][x][x] - Cmp128JJAlexSSE_3
688     kCycles [x][x][ ] - Cmp128Alex
1066    kCycles [x][x][x] - Cmp128Alex_2
1112    kCycles [x][x][x] - Cmp128Alex_3
1109    kCycles [x][x][ ] - Cmp128JJSSE
862     kCycles [x][x][ ] - AxCMP128bitProc3
870     kCycles [x][x][ ] - AxCMP128bitProc3c (cmov)
598     kCycles [x][ ][ ] - Cmp128DaveU
586     kCycles [x][ ][ ] - Cmp128NidudU

nidud

#267
deleted

FORTRANS

Hi,

   I made the PlotFlag program with an improved color scheme.
And I fixed an error in the labeling of the combined flags.  Oops.
I updated Reply #260 with the fixed program.  The Overflow,
Sign, and Carry flags are assigned to blue. green, and red so
that their combinations should make sense.

Regards,

Steve N.

Antariy

Hi Steve! Thanks for your explanation :biggrin: Yes, now it explains things - earlier I did not actually get how the the graps correspond to flags, after your more detailed explanation it got clear.

Quote from: nidud on August 29, 2013, 12:33:05 AM
ok, the BSF thing didn't work, so it's one step back  :P


Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
------------------------------------------------------
2799747 cycles [x][x][x] - Cmp128Nidud
3028917 cycles [x][x][x] - Cmp128NidudSSE
2740311 cycles [x][x][x] - Cmp128Dave
4045304 cycles [x][x][x] - Cmp128Dave2
1664740 cycles [x][x][x] - Cmp128JJAlexSSE_1
1633226 cycles [x][x][x] - Cmp128JJAlexSSE_2
1878213 cycles [x][x][x] - Cmp128JJAlexSSE_3
973406  cycles [x][x][ ] - Cmp128Alex
1835600 cycles [x][x][x] - Cmp128Alex_2
1878425 cycles [x][x][x] - Cmp128Alex_3
1968746 cycles [x][x][ ] - Cmp128JJSSE
1374576 cycles [x][x][ ] - AxCMP128bitProc3
1290053 cycles [x][x][ ] - AxCMP128bitProc3c (cmov)
739767  cycles [x][ ][ ] - Cmp128DaveU
744369  cycles [x][ ][ ] - Cmp128NidudU

--- ok ---


BTW, in my testbed I replaced couple of macroses but not posted them (the results above are not for the updated procs, though):

Cmp128JJAlexSSE_1 MACRO ow0:REQ, ow1:REQ
LOCAL @l1, @l2
   movups xmm0,[ow0]
   movups xmm1,[ow1]
   pcmpeqb   xmm0,xmm1
   pmovmskb ecx,xmm0
   xor ecx,0FFFFh
   jz @l2
   and ecx,7FFFh
   bsr ecx,ecx
   mov ah,byte ptr [ow0+15]
   mov dh,byte ptr [ow1+15]
   mov al,byte ptr [ow0+ecx]
   mov dl,byte ptr [ow1+ecx]
   cmp ax,dx
   @l2:
ENDM


and


Cmp128JJAlexSSE_2 MACRO ow0:REQ, ow1:REQ
LOCAL @l1, @l2
   movups xmm0,[ow0]
   movups xmm1,[ow1]
   pcmpeqb   xmm0,xmm1
   pmovmskb ecx,xmm0
   xor ecx,0FFFFh
   jz @l2
   and ecx,7FFFh
   bsr ecx,ecx
   movzx eax,byte ptr [ow0+14]
   movzx edx,byte ptr [ow1+14]
   mov al,byte ptr [ow0+ecx]
   mov dl,byte ptr [ow1+ecx]
   cmp ax,dx
   @l2:
ENDM


Quote from: nidud on August 29, 2013, 04:44:55 AM
Hmm, maybe the CPU's which "cheat" with the BSF/BSR flags are faster
Alex/JJ use BSR

My test: "no cheat"

701224 cycles [x][x][x] - Cmp128NidudSSE
1068347 cycles [x][x][x] - Cmp128JJAlexSSE_1

Siekmanski: "no cheat"

730067  cycles [x][x][x] - Cmp128NidudSSE
784587  cycles [x][x][x] - Cmp128JJAlexSSE_1


Alex: "cheat"

2441613 cycles [x][x][x] - Cmp128NidudSSE
1724226 cycles [x][x][x] - Cmp128JJAlexSSE_1

Dave: "cheat"

2886072 cycles [x][x][x] - Cmp128NidudSSE
1698629 cycles [x][x][x] - Cmp128JJAlexSSE_1


Well, actually it should not be so, becase, as we see desktop PIV models (my and Dave's Prescotts) not just trash all the flags, but rather set them logically correct (zero all "unused after instruction" flags, set parity flag according to the result - though it should not even bother with it, and set zero flag as defined) - instead of leaving them in unchanged state, so, it should be even slower on our CPUs :biggrin:

But the same PIV models are much more slower than more modern CPUs with some other instructions, too, and those are the real reason: SSE instructions, SBB instruction, LAHF/SAHF.

Cmp128JJAlexSSE_1 uses far more faster (on PIV) code in GRP part without SUBs/SBBs.
Fully-GPR Cmp128Alex_2 is faster than Cmp128Nidud because of this, too.

On modern CPUs your GPR code "should be" faster than my GPR code, for an instance. But for SSE code there are very different timings in the thread, probably, on very modern Intel CPUs models my SSE code "should be" faster.