ASM for FUN NEW step #1

Started by frktons, November 09, 2021, 09:04:58 AM


TimoVJL

AMD Ryzen 5 3400G with Radeon Vega Graphics     (SSE4)

536     cycles for 100 * C2D_J
1365    cycles for 100 * atodw
517     cycles for 100 * min2cvt
353     cycles for 100 * min1cvt (SSE3)
615     cycles for 100 * ConvertLUT
2123    cycles for 100 * FSM (Hutch)
618     cycles for 100 * MSVC (Timo)

546     cycles for 100 * C2D_J
1363    cycles for 100 * atodw
516     cycles for 100 * min2cvt
350     cycles for 100 * min1cvt (SSE3)
615     cycles for 100 * ConvertLUT
2152    cycles for 100 * FSM (Hutch)
620     cycles for 100 * MSVC (Timo)

536     cycles for 100 * C2D_J
1371    cycles for 100 * atodw
568     cycles for 100 * min2cvt
350     cycles for 100 * min1cvt (SSE3)
610     cycles for 100 * ConvertLUT
2142    cycles for 100 * FSM (Hutch)
624     cycles for 100 * MSVC (Timo)

54      bytes for C2D_J
10      bytes for atodw
98      bytes for min2cvt
50      bytes for min1cvt (SSE3)
118     bytes for ConvertLUT
8       bytes for FSM (Hutch)
70      bytes for MSVC (Timo)

1234    = eax C2D_J
1234    = eax atodw
1234    = eax min2cvt
1234    = eax min1cvt (SSE3)
1234    = eax ConvertLUT
1234    = eax FSM (Hutch)
1234    = eax MSVC (Timo)
May the source be with you

daydreamer

Quote from: jj2007 on November 15, 2021, 12:46:38 AM
Quote from: daydreamer on November 15, 2021, 12:23:28 AM
For ASCII to double conversion, I think Raymond's approach would be more versatile: it takes care of "1234.5678", but also "123.45678" and "12345.678".

Really? How exactly would you do that?

ConvertLUT:
  mov edx, [eax]          ; eax = pointer to string, e.g. "1234"; load its 4 ASCII digits
  and edx, 0f0f0f0fh      ; convert each ASCII digit to its value ("1" -> 1)
  movzx ecx, dl           ; 1000s
  mov eax, t1000[ecx*4]
  movzx ecx, dh           ; 100s
  add eax, t100[ecx*4]
  bswap edx               ; reverse the byte order so the last two digits land in dh/dl
  movzx ecx, dh           ; 10s
  add eax, t10[ecx*4]
  movzx ecx, dl           ; units
  add eax, ecx
  retn
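
For readers who do not follow the register juggling, here is a minimal C sketch of the same table-lookup idea. It is not the code that was benchmarked: the names `lut1000`, `lut100`, `lut10` and `convert_lut` are made up for this illustration and stand in for the integer tables that the routine above indexes, and the sketch reads the digits byte by byte instead of using the `and`/`bswap` trick.

```c
#include <stdint.h>

/* Hypothetical integer tables, one per digit position.
   Sized 16 so that (byte & 0x0F) can never index out of bounds;
   entries 10..15 are zero-initialized. */
static const uint32_t lut1000[16] = {0, 1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000};
static const uint32_t lut100[16]  = {0, 100, 200, 300, 400, 500, 600, 700, 800, 900};
static const uint32_t lut10[16]   = {0, 10, 20, 30, 40, 50, 60, 70, 80, 90};

/* Assumption: s points at exactly four ASCII digits, e.g. "1234".
   No validation is done, just like in the assembly routine. */
static uint32_t convert_lut(const char *s)
{
    const unsigned char *p = (const unsigned char *)s;
    uint32_t d0 = p[0] & 0x0Fu;   /* 1000s: '1' (0x31) -> 1 */
    uint32_t d1 = p[1] & 0x0Fu;   /* 100s  */
    uint32_t d2 = p[2] & 0x0Fu;   /* 10s   */
    uint32_t d3 = p[3] & 0x0Fu;   /* units */
    return lut1000[d0] + lut100[d1] + lut10[d2] + d3;
}
```

Calling `convert_lut("1234")` yields 1234, the value every routine in the benchmark output reports in eax.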


.data
;integer part
t1000 real4 0.0,1000.0,2000.0,3000.0,4000.0,5000.0,6000.0,7000.0,8000.0,9000.0
t100 real4 0.0,100.0,200.0,300.0,400.0,500.0,600.0,700.0,800.0,900.0
t10 real4 0.0,10.0,20.0,30.0,40.0,50.0,60.0,70.0,80.0,90.0
t1 real4 0.0,1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0,9.0
;decimals
t0dot1 real4 0.0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9
t0dot01 real4 0.0,0.01,0.02,0.03,0.04,0.05,0.06,0.07,0.08,0.09
t0dot001 real4 0.0,0.001,0.002,0.003,0.004,0.005,0.006,0.007,0.008,0.009
t0dot0001 real4 0.0,0.0001,0.0002,0.0003,0.0004,0.0005,0.0006,0.0007,0.0008,0.0009
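
The tables above hold real4 values, so the integer `add eax, ...` of the ConvertLUT routine cannot be applied to them as-is. The following C sketch shows one way such tables might be used. It is an assumption-laden illustration, not daydreamer's code: the helper name `convert_lut_real` is invented here, and it only accepts the fixed layout "dddd.dddd". Handling "123.45678" or "12345.678" would additionally require locating the decimal point and having tables for the extra positions.

```c
#include <stddef.h>

/* Tables copied from the .data section above (real4 = 32-bit float). */
static const float t1000[10] = {0.0f, 1000.0f, 2000.0f, 3000.0f, 4000.0f, 5000.0f, 6000.0f, 7000.0f, 8000.0f, 9000.0f};
static const float t100[10]  = {0.0f, 100.0f, 200.0f, 300.0f, 400.0f, 500.0f, 600.0f, 700.0f, 800.0f, 900.0f};
static const float t10[10]   = {0.0f, 10.0f, 20.0f, 30.0f, 40.0f, 50.0f, 60.0f, 70.0f, 80.0f, 90.0f};
static const float t1[10]    = {0.0f, 1.0f, 2.0f, 3.0f, 4.0f, 5.0f, 6.0f, 7.0f, 8.0f, 9.0f};
static const float t0dot1[10]    = {0.0f, 0.1f, 0.2f, 0.3f, 0.4f, 0.5f, 0.6f, 0.7f, 0.8f, 0.9f};
static const float t0dot01[10]   = {0.0f, 0.01f, 0.02f, 0.03f, 0.04f, 0.05f, 0.06f, 0.07f, 0.08f, 0.09f};
static const float t0dot001[10]  = {0.0f, 0.001f, 0.002f, 0.003f, 0.004f, 0.005f, 0.006f, 0.007f, 0.008f, 0.009f};
static const float t0dot0001[10] = {0.0f, 0.0001f, 0.0002f, 0.0003f, 0.0004f, 0.0005f, 0.0006f, 0.0007f, 0.0008f, 0.0009f};

/* Hypothetical helper: converts a string of the exact form "dddd.dddd".
   Returns 0 on success, -1 if the layout does not match. */
static int convert_lut_real(const char *s, float *out)
{
    /* Validate first; a short string fails at its terminating NUL. */
    for (size_t i = 0; i < 9; ++i) {
        if (i == 4) {
            if (s[i] != '.') return -1;
        } else if (s[i] < '0' || s[i] > '9') {
            return -1;
        }
    }
    /* Integer part: one lookup per digit position. */
    float v = t1000[s[0] - '0'] + t100[s[1] - '0'] + t10[s[2] - '0'] + t1[s[3] - '0'];
    /* Fractional part: same idea with the decimal tables. */
    v += t0dot1[s[5] - '0'] + t0dot01[s[6] - '0'] + t0dot001[s[7] - '0'] + t0dot0001[s[8] - '0'];
    *out = v;
    return 0;
}
```

Because values such as 0.1 have no exact binary representation, the sum is only approximately 1234.5678 for the input "1234.5678"; with real4 the result is good to roughly seven significant digits.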
my none asm creations
https://masm32.com/board/index.php?topic=6937.msg74303#msg74303
I am an Invoker
"An Invoker is a mage who specializes in the manipulation of raw and elemental energies."
Like SIMD coding

jj2007

Ok, these are the data, and now the code please, daydreamer.

nidud

#63
deleted

raymond

Quote from: nidud on November 16, 2021, 03:18:06 AM
IMUL is actually a rather fast instruction, even when used with memory operands, so there may not be that much gain coding around it (LEA/LUT). BSWAP however is a relatively slow instruction so that should be avoided if possible.

The qeditor Help section has a listing for opcodes, along with the expected clock cycles for each. The BSWAP one has the following:
Quote
Usage:  BSWAP   reg32

Modifies flags: none

Changes the byte order of a 32 bit register from big endian to
little endian or vice versa.   Result left in destination register
is undefined if the operand is a 16 bit register.

                 Clocks                  Size
Operands    808x   286   386   486      Bytes
reg32         -     -     -     1         2

0F C8+ rd BSWAP r32 Reverses the byte order of a 32-bit register.
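
As an illustration of what the help text describes, here is a small C sketch of the byte reversal. The function name `bswap32` is chosen for this example only; compilers typically offer their own intrinsics for the instruction.

```c
#include <stdint.h>

/* Reverse the four bytes of a 32-bit value, as BSWAP does on a register. */
static uint32_t bswap32(uint32_t x)
{
    return (x >> 24)
         | ((x >> 8) & 0x0000FF00u)
         | ((x << 8) & 0x00FF0000u)
         | (x << 24);
}
```

In the ConvertLUT routine from this thread, the string "1234" loaded on a little-endian x86 and masked with 0f0f0f0fh gives 04030201h; after the swap it is 01020304h, so the tens digit sits in dh and the units digit in dl.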

If you consider BSWAP as a "relatively slow instruction", maybe Hutch should review all the tables of that Help section.

Then, in addition, you mention that "IMUL is actually a rather fast instruction", BUT the imul instruction help section in qeditor shows the following:
Quote
                            Clocks                    Size
Operands             808x      286    386      486    Bytes
reg8                 80-98     13     9-14     13-18  2
reg16                128-154   21     9-22     13-26  2
reg32                -         -      9-38     12-42  2
mem8                 86-104    16     12-17    13-18  2-4
mem16                134-160   24     12-25    13-26  2-4
mem32                -         -      12-41    13-42  2-4
reg16,reg16          -         -      9-22     13-26  3-5
reg32,reg32          -         -      9-38     13-42  3-5
reg16,mem16          -         -      12-25    13-26  3-5
reg32,mem32          -         -      12-41    13-42  3-5
reg16,immed          -         21     9-22     13-26  3
reg32,immed          -         21     9-38     13-42  3-6
reg16,reg16,immed    -         2      9-22     13-26  3-6
reg32,reg32,immed    -         21     9-38     13-42  3-6
reg16,mem16,immed    -         24     12-25    13-26  3-6
reg32,mem32,immed    -         24     12-41    13-42  3-6

Any explanation for this????
Could you provide us with YOUR sources for clock cycles?
Whenever you assume something, you risk being wrong half the time.
https://masm32.com/masmcode/rayfil/index.html

jj2007

Quote from: nidud on November 16, 2021, 03:18:06 AM
IMUL is actually a rather fast instruction, even when used with memory operands, so there may not be that much gain coding around it (LEA/LUT). BSWAP however is a relatively slow instruction so that should be avoided if possible.
...
    86804 cycles 3.asm: pmaddwd
    89518 cycles 1.asm: imul
    94277 cycles 4.asm: pmaddwd+bswap
   108254 cycles 2.asm: imul+bswap
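
Nidud's posts are deleted in this thread, so the code behind "imul" is not shown. As a generic sketch of the two alternatives the quote contrasts, the C functions below convert four digits with a plain multiply by 10 and with the shift-and-add form that LEA allows in assembly. The names `convert_mul` and `convert_lea` are invented for this example.

```c
#include <stdint.h>

/* Multiply version: v = v*10 + digit for each character (what IMUL would do). */
static uint32_t convert_mul(const char *s)
{
    uint32_t v = 0;
    for (int i = 0; i < 4; ++i)
        v = v * 10u + (uint32_t)(s[i] - '0');
    return v;
}

/* "Coding around" the multiply: v*10 == (v + v*4) * 2.
   In assembly this maps to two LEAs, e.g.
     lea eax, [eax+eax*4]   ; eax = 5*v
     lea eax, [ecx+eax*2]   ; eax = digit + 10*v */
static uint32_t convert_lea(const char *s)
{
    uint32_t v = 0;
    for (int i = 0; i < 4; ++i) {
        uint32_t five_v = v + (v << 2);
        v = (uint32_t)(s[i] - '0') + (five_v << 1);
    }
    return v;
}
```

Both functions assume exactly four ASCII digits and return the same value; which one is faster depends on the CPU, which is what the timings in this thread are trying to measure.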

Thanks, Nidud. Here are my results:
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

546     cycles for 100 * C2D_J
793     cycles for 100 * imul (Nidud)
762     cycles for 100 * min2cvt
479     cycles for 100 * min1cvt (SSE3)
625     cycles for 100 * ConvertLUT
2243    cycles for 100 * FSM (Hutch)
709     cycles for 100 * MSVC (Timo)

554     cycles for 100 * C2D_J
793     cycles for 100 * imul (Nidud)
775     cycles for 100 * min2cvt
476     cycles for 100 * min1cvt (SSE3)
627     cycles for 100 * ConvertLUT
2245    cycles for 100 * FSM (Hutch)
696     cycles for 100 * MSVC (Timo)

546     cycles for 100 * C2D_J
792     cycles for 100 * imul (Nidud)
766     cycles for 100 * min2cvt
475     cycles for 100 * min1cvt (SSE3)
643     cycles for 100 * ConvertLUT
2252    cycles for 100 * FSM (Hutch)
703     cycles for 100 * MSVC (Timo)

54      bytes for C2D_J
58      bytes for imul (Nidud)
98      bytes for min2cvt
50      bytes for min1cvt (SSE3)
122     bytes for ConvertLUT
8       bytes for FSM (Hutch)
70      bytes for MSVC (Timo)

1234    = eax C2D_J
1234    = eax atodw
1234    = eax imul (Nidud)
1234    = eax min2cvt
1234    = eax min1cvt (SSE3)
1234    = eax ConvertLUT
1234    = eax FSM (Hutch)
1234    = eax MSVC (Timo)


Quote from: raymond on November 16, 2021, 04:21:42 AM
Could you provide us with YOUR sources for clock cycles?

I don't know what Nidud's sources are, but here is mine: Agner Fog, who lists one cycle for bswap.

TimoVJL

AMD Ryzen 5 3400G with Radeon Vega Graphics     (SSE4)

537     cycles for 100 * C2D_J
723     cycles for 100 * imul (Nidud)
571     cycles for 100 * min2cvt
339     cycles for 100 * min1cvt (SSE3)
521     cycles for 100 * ConvertLUT
2133    cycles for 100 * FSM (Hutch)
630     cycles for 100 * MSVC (Timo)

542     cycles for 100 * C2D_J
632     cycles for 100 * imul (Nidud)
510     cycles for 100 * min2cvt
340     cycles for 100 * min1cvt (SSE3)
510     cycles for 100 * ConvertLUT
2134    cycles for 100 * FSM (Hutch)
621     cycles for 100 * MSVC (Timo)

537     cycles for 100 * C2D_J
639     cycles for 100 * imul (Nidud)
511     cycles for 100 * min2cvt
341     cycles for 100 * min1cvt (SSE3)
510     cycles for 100 * ConvertLUT
2131    cycles for 100 * FSM (Hutch)
619     cycles for 100 * MSVC (Timo)

54      bytes for C2D_J
58      bytes for imul (Nidud)
98      bytes for min2cvt
50      bytes for min1cvt (SSE3)
122     bytes for ConvertLUT
8       bytes for FSM (Hutch)
70      bytes for MSVC (Timo)

1234    = eax C2D_J
1234    = eax atodw
1234    = eax imul (Nidud)
1234    = eax min2cvt
1234    = eax min1cvt (SSE3)
1234    = eax ConvertLUT
1234    = eax FSM (Hutch)
1234    = eax MSVC (Timo)
May the source be with you

nidud

#67
deleted

nidud

#68
deleted

daydreamer

my none asm creations
https://masm32.com/board/index.php?topic=6937.msg74303#msg74303
I am an Invoker
"An Invoker is a mage who specializes in the manipulation of raw and elemental energies."
Like SIMD coding

jj2007

Quote from: daydreamer on November 16, 2021, 07:45:15 AM
wonder how fast avx2 will be?

We are still waiting for your code in reply #61, daydreamer.

nidud

#71
deleted

nidud

#72
deleted

Siekmanski

AMD Ryzen 9 5950X 16-Core Processor             (SSE4)

343     cycles for 100 * C2D_J
660     cycles for 100 * imul (Nidud)
449     cycles for 100 * min2cvt
346     cycles for 100 * min1cvt (SSE3)
424     cycles for 100 * ConvertLUT
984     cycles for 100 * FSM (Hutch)
350     cycles for 100 * MSVC (Timo)

343     cycles for 100 * C2D_J
659     cycles for 100 * imul (Nidud)
449     cycles for 100 * min2cvt
347     cycles for 100 * min1cvt (SSE3)
424     cycles for 100 * ConvertLUT
970     cycles for 100 * FSM (Hutch)
546     cycles for 100 * MSVC (Timo)

341     cycles for 100 * C2D_J
662     cycles for 100 * imul (Nidud)
449     cycles for 100 * min2cvt
354     cycles for 100 * min1cvt (SSE3)
412     cycles for 100 * ConvertLUT
969     cycles for 100 * FSM (Hutch)
347     cycles for 100 * MSVC (Timo)

54      bytes for C2D_J
58      bytes for imul (Nidud)
98      bytes for min2cvt
50      bytes for min1cvt (SSE3)
122     bytes for ConvertLUT
8       bytes for FSM (Hutch)
70      bytes for MSVC (Timo)

1234    = eax C2D_J
1234    = eax atodw
1234    = eax imul (Nidud)
1234    = eax min2cvt
1234    = eax min1cvt (SSE3)
1234    = eax ConvertLUT
1234    = eax FSM (Hutch)
1234    = eax MSVC (Timo)

-
Creative coders use backward thinking techniques as a strategy.

Greenhorn

AMD Ryzen 7 3700X 8-Core Processor              (SSE4)

362 cycles for 100 * C2D_J
496 cycles for 100 * imul (Nidud)
365 cycles for 100 * min2cvt
282 cycles for 100 * min1cvt (SSE3)
438 cycles for 100 * ConvertLUT
1093 cycles for 100 * FSM (Hutch)
304 cycles for 100 * MSVC (Timo)

362 cycles for 100 * C2D_J
494 cycles for 100 * imul (Nidud)
365 cycles for 100 * min2cvt
277 cycles for 100 * min1cvt (SSE3)
439 cycles for 100 * ConvertLUT
1092 cycles for 100 * FSM (Hutch)
292 cycles for 100 * MSVC (Timo)

374 cycles for 100 * C2D_J
495 cycles for 100 * imul (Nidud)
364 cycles for 100 * min2cvt
279 cycles for 100 * min1cvt (SSE3)
437 cycles for 100 * ConvertLUT
1100 cycles for 100 * FSM (Hutch)
292 cycles for 100 * MSVC (Timo)

54 bytes for C2D_J
58 bytes for imul (Nidud)
98 bytes for min2cvt
50 bytes for min1cvt (SSE3)
122 bytes for ConvertLUT
8 bytes for FSM (Hutch)
70 bytes for MSVC (Timo)

1234 = eax C2D_J
1234 = eax atodw
1234 = eax imul (Nidud)
1234 = eax min2cvt
1234 = eax min1cvt (SSE3)
1234 = eax ConvertLUT
1234 = eax FSM (Hutch)
1234 = eax MSVC (Timo)

--- ok ---
Kole Feut un Nordenwind gift en krusen Büdel un en lütten Pint.