Fast Log10 approximation

Started by guga, August 16, 2020, 06:01:01 AM


guga

Quote from: Siekmanski on August 20, 2020, 08:13:58 AM
If you need a fast SSE2 horizontal addition for 4 packed floats,

    movaps  xmm1,xmm0
    shufps  xmm1,xmm0,10110001b
    addps   xmm0,xmm1
    movhlps xmm1,xmm0
    addss   xmm0,xmm1



Great. Many thanks, Marinus.

It should be easy, and very useful, to wrap this in a macro that simulates the haddps opcode  :thumbsup: :thumbsup: :thumbsup: :thumbsup: :thumbsup:
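
For anyone following along in C rather than MASM, the five instructions quoted above map one-to-one onto SSE intrinsics. What follows is only a sketch under assumptions: the names `hsum4` and `hsum4_selftest`, the unaligned load, and the scalar fallback for non-x86 builds are illustrative and do not come from the thread.

```c
#include <assert.h>

#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define HAVE_SSE 1
#endif

/* Sum of four packed floats, mirroring the movaps/shufps/addps/movhlps/addss sequence. */
static float hsum4(const float v[4])
{
#ifdef HAVE_SSE
    __m128 x0 = _mm_loadu_ps(v);               /* x0 = [a0, a1, a2, a3]              */
    __m128 x1 = _mm_shuffle_ps(x0, x0, 0xB1);  /* 10110001b -> [a1, a0, a3, a2]      */
    x0 = _mm_add_ps(x0, x1);                   /* [a0+a1, a0+a1, a2+a3, a2+a3]       */
    x1 = _mm_movehl_ps(x1, x0);                /* low half of x1 = high half of x0   */
    x0 = _mm_add_ss(x0, x1);                   /* lane 0 = (a0+a1) + (a2+a3)         */
    return _mm_cvtss_f32(x0);
#else
    /* Portable fallback (assumption): same association order as the SSE path. */
    return (v[0] + v[1]) + (v[2] + v[3]);
#endif
}

/* Small self-check with values that are exact in single precision. */
static void hsum4_selftest(void)
{
    const float v[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    assert(hsum4(v) == 10.0f);
}
```

Only lane 0 of the result is meaningful here; the upper lanes hold partial sums, which is why the routine returns a scalar.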
Coding in Assembly requires a mix of:
80% of brain, passion, intuition, creativity
10% of programming skills
10% of alcoholic levels in your blood.

My Code Sites:
http://rosasm.freeforums.org
http://winasm.tripod.com

daydreamer

Quote from: guga on August 20, 2020, 08:28:46 AM
Quote from: Siekmanski on August 20, 2020, 08:13:58 AM
If you need a fast SSE2 horizontal addition for 4 packed floats,

    movaps  xmm1,xmm0
    shufps  xmm1,xmm0,10110001b
    addps   xmm0,xmm1
    movhlps xmm1,xmm0
    addss   xmm0,xmm1



Great. Many thanks marinus.

It should be easy, and very useful, to wrap this in a macro that simulates the haddps opcode  :thumbsup: :thumbsup: :thumbsup: :thumbsup: :thumbsup:
:thumbsup: :thumbsup:
In MASM you can replace the HADDPS mnemonic with a macro of the same name: just use NOKEYWORD at the very beginning of the source file, listing the mnemonics you want to reprogram as macros.
And now I'll probably get flamed by the bare-metal coders for this heresy :bgrin:
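
One detail matters if a macro is to carry the HADDPS name: the real instruction does not reduce a register to a single sum. It adds adjacent pairs from both operands, giving [a0+a1, a2+a3, b0+b1, b2+b3]. A rough C sketch of that lane layout using only SSE1-level operations (two shuffles and one packed add) could look like this. The names `haddps_emul` and `haddps_selftest` and the scalar fallback are assumptions made for illustration, not anything posted in the thread.

```c
#include <assert.h>

#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define HAVE_SSE 1
#endif

/* out = [a0+a1, a2+a3, b0+b1, b2+b3], the lane layout HADDPS produces. */
static void haddps_emul(const float a[4], const float b[4], float out[4])
{
#ifdef HAVE_SSE
    __m128 va   = _mm_loadu_ps(a);
    __m128 vb   = _mm_loadu_ps(b);
    __m128 even = _mm_shuffle_ps(va, vb, 0x88);  /* 10001000b -> [a0, a2, b0, b2] */
    __m128 odd  = _mm_shuffle_ps(va, vb, 0xDD);  /* 11011101b -> [a1, a3, b1, b3] */
    _mm_storeu_ps(out, _mm_add_ps(even, odd));   /* pairwise sums                 */
#else
    /* Portable fallback (assumption) for builds without SSE. */
    out[0] = a[0] + a[1];
    out[1] = a[2] + a[3];
    out[2] = b[0] + b[1];
    out[3] = b[2] + b[3];
#endif
}

/* Small self-check with values that are exact in single precision. */
static void haddps_selftest(void)
{
    const float a[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    const float b[4] = {10.0f, 20.0f, 30.0f, 40.0f};
    float r[4];
    haddps_emul(a, b, r);
    assert(r[0] == 3.0f && r[1] == 7.0f && r[2] == 30.0f && r[3] == 70.0f);
}
```

In assembly the same idea costs one extra register copy, since shufps overwrites its destination.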
My non-asm creations
https://masm32.com/board/index.php?topic=6937.msg74303#msg74303
I am an Invoker
"An Invoker is a mage who specializes in the manipulation of raw and elemental energies."
Like SIMD coding

guga

Hi Daydreamer :bgrin: :bgrin: :bgrin:

I'll probably do this for RosAsm. A set of SSE2 macros that simulate the behaviour of SSE3 and SSE4 instructions is needed for both Masm and RosAsm, especially when we want to write code that adapts to the user's processor (or, in my case, because I haven't built the SSE3/SSE4 opcodes yet).

These could be very handy :azn: :azn: :azn:

jj2007

Hi Guga & friends,

Can I have some timings, please? This is work in progress; building it requires the MasmBasic version of 2 September 2020. The core is here:

FastMath FastLog10                        ; define a math function
  For_ fct=0.0 To 10.0 Step 0.5
        fld fct                           ; X
        fstp REAL10 ptr [edi]
        void Log10(fct)                   ; Y (built-in MasmBasic function)
        fstp REAL10 ptr [edi+REAL10]
        add edi, 2*REAL10
  Next
FastMath                                  ; -------- done -------------


Usage in the attached source (*.asc opens in RichMasm, WordPad, MS Word):
TestC proc
  mov ebx, AlgoLoops-1            ; loop e.g. 100x
  align 4
  .Repeat
        void FastLog10(MyExpo)    ; <<<<<<< put result in ST(0)
        fstp res8
        dec ebx
  .Until Sign?
  ret
TestC endp


Timings:
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
2244 µs for initialising FastLog10

3170    cycles for 100 * Log10 (Guga)
4783    cycles for 100 * MasmBasic Log10
1553    cycles for 100 * FastMath Log10

3148    cycles for 100 * Log10 (Guga)
4795    cycles for 100 * MasmBasic Log10
1554    cycles for 100 * FastMath Log10

3173    cycles for 100 * Log10 (Guga)
4788    cycles for 100 * MasmBasic Log10
1552    cycles for 100 * FastMath Log10

3319    cycles for 100 * Log10 (Guga)
4781    cycles for 100 * MasmBasic Log10
1552    cycles for 100 * FastMath Log10

3175    cycles for 100 * Log10 (Guga)
4786    cycles for 100 * MasmBasic Log10
1540    cycles for 100 * FastMath Log10

536     bytes for Log10 (Guga)
16      bytes for MasmBasic Log10
162     bytes for FastMath Log10

Real8   0.6989700043360187465
Real8   0.6989700043360188575
Real8   0.6989700043360188575

Siekmanski

Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz (SSE4)
28 µs for initialising FastLog10

3470    cycles for 100 * Log10 (Guga)
5838    cycles for 100 * MasmBasic Log10
1912    cycles for 100 * FastMath Log10

3473    cycles for 100 * Log10 (Guga)
5838    cycles for 100 * MasmBasic Log10
1907    cycles for 100 * FastMath Log10

3484    cycles for 100 * Log10 (Guga)
5833    cycles for 100 * MasmBasic Log10
1916    cycles for 100 * FastMath Log10

3481    cycles for 100 * Log10 (Guga)
5834    cycles for 100 * MasmBasic Log10
1913    cycles for 100 * FastMath Log10

3477    cycles for 100 * Log10 (Guga)
5838    cycles for 100 * MasmBasic Log10
1913    cycles for 100 * FastMath Log10

536     bytes for Log10 (Guga)
16      bytes for MasmBasic Log10
162     bytes for FastMath Log10

Real8   0.6989700043360187465
Real8   0.6989700043360188575
Real8   0.6989700043360188575

--- ok ---
Creative coders use backward thinking techniques as a strategy.

jj2007

New version, I had forgotten to switch off the range checks:

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
88 µs for initialising FastLog10

2882    cycles for 100 * Log10 (Guga)
4780    cycles for 100 * MasmBasic Log10
1399    cycles for 100 * FastMath Log10

2918    cycles for 100 * Log10 (Guga)
5372    cycles for 100 * MasmBasic Log10
1502    cycles for 100 * FastMath Log10

2912    cycles for 100 * Log10 (Guga)
4787    cycles for 100 * MasmBasic Log10
1397    cycles for 100 * FastMath Log10

2882    cycles for 100 * Log10 (Guga)
4774    cycles for 100 * MasmBasic Log10
1387    cycles for 100 * FastMath Log10

2881    cycles for 100 * Log10 (Guga)
4774    cycles for 100 * MasmBasic Log10
1392    cycles for 100 * FastMath Log10

516     bytes for Log10 (Guga)
16      bytes for MasmBasic Log10
67      bytes for FastMath Log10

Real8   0.6989700043360187465
Real8   0.6989700043360188575
Real8   0.6989700043360188575


Guga's version is g from this post.

Siekmanski

Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz (SSE4)
78 µs for initialising FastLog10

3142    cycles for 100 * Log10 (Guga)
5851    cycles for 100 * MasmBasic Log10
1698    cycles for 100 * FastMath Log10

3144    cycles for 100 * Log10 (Guga)
5839    cycles for 100 * MasmBasic Log10
1694    cycles for 100 * FastMath Log10

3154    cycles for 100 * Log10 (Guga)
5834    cycles for 100 * MasmBasic Log10
1693    cycles for 100 * FastMath Log10

3142    cycles for 100 * Log10 (Guga)
5838    cycles for 100 * MasmBasic Log10
1695    cycles for 100 * FastMath Log10

3150    cycles for 100 * Log10 (Guga)
5836    cycles for 100 * MasmBasic Log10
1702    cycles for 100 * FastMath Log10

516     bytes for Log10 (Guga)
16      bytes for MasmBasic Log10
67      bytes for FastMath Log10

Real8   0.6989700043360187465
Real8   0.6989700043360188575
Real8   0.6989700043360188575

--- ok ---

jj2007

Here is another one, with a FastSqrt added:

FastMath FastSqrt      ; define a math function
  For_ fct=0.0 To 10.0 Step 0.5
        fld fct                 ; X
        fld st
        fstp REAL10 ptr [edi]
        fsqrt                   ; Y
        fstp REAL10 ptr [edi+REAL10]
        add edi, 2*REAL10
  Next
FastMath                       ; -------- done -------------


The speed gain is very modest, though:
1859    cycles for 100 * fsqrt
1398    cycles for 100 * FastSqrt


The tangent is more impressive:
10920   cycles for 100 * fptan
1390    cycles for 100 * FastTan

TouEnMasm

Hello,
I tried to run your exe and got this answer:
Quote
  This file content an indesirable Virus and will be deleted soon
and the file was deleted.
What did you compile your files with?
Fa is a musical note to play with CL

jj2007

With MASM (ML.exe), but it 'compiles' also with UAsm and AsmC. Quality AV products at VirusTotal say it's clean.

I suggest you move your crappy AV software into the recycle bin. Is it a French product? "This file content an indesirable Virus" should read "contains" :cool:

TouEnMasm


Quote
I suggest you move your crappy AV software into the recycle bin.
You must be talking about Windows 10 Home edition, on a perfectly standard PC (4 GB RAM, 1 TB disk)

Siekmanski

Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz (SSE4)
44 µs for initialising FastLog10

3175    cycles for 100 * Log10 (Guga)
5845    cycles for 100 * MasmBasic Log10
1693    cycles for 100 * FastMath Log10
1545    cycles for 100 * fsqrt
1698    cycles for 100 * FastSqrt

3149    cycles for 100 * Log10 (Guga)
5838    cycles for 100 * MasmBasic Log10
1692    cycles for 100 * FastMath Log10
1545    cycles for 100 * fsqrt
1690    cycles for 100 * FastSqrt

3147    cycles for 100 * Log10 (Guga)
5836    cycles for 100 * MasmBasic Log10
1690    cycles for 100 * FastMath Log10
1549    cycles for 100 * fsqrt
1692    cycles for 100 * FastSqrt

3142    cycles for 100 * Log10 (Guga)
5840    cycles for 100 * MasmBasic Log10
1696    cycles for 100 * FastMath Log10
1542    cycles for 100 * fsqrt
1695    cycles for 100 * FastSqrt

3140    cycles for 100 * Log10 (Guga)
5841    cycles for 100 * MasmBasic Log10
1692    cycles for 100 * FastMath Log10
1544    cycles for 100 * fsqrt
1691    cycles for 100 * FastSqrt

516     bytes for Log10 (Guga)
16      bytes for MasmBasic Log10
67      bytes for FastMath Log10
14      bytes for fsqrt
67      bytes for FastSqrt

Real8   0.6989700043360187465
Real8   0.6989700043360188575
Real8   0.6989700043360188575
Real8   2.236067977499789805
Real8   2.236067977499789805

--- ok ---

six_L

Quote
Intel(R) Core(TM) i5-9400H CPU @ 2.50GHz (SSE4)
31 µs for initialising FastLog10

1622    cycles for 100 * Log10 (Guga)
3402    cycles for 100 * MasmBasic Log10
670     cycles for 100 * FastMath Log10
351     cycles for 100 * fsqrt
669     cycles for 100 * FastSqrt

1637    cycles for 100 * Log10 (Guga)
3474    cycles for 100 * MasmBasic Log10
669     cycles for 100 * FastMath Log10
328     cycles for 100 * fsqrt
667     cycles for 100 * FastSqrt

1583    cycles for 100 * Log10 (Guga)
3785    cycles for 100 * MasmBasic Log10
653     cycles for 100 * FastMath Log10
339     cycles for 100 * fsqrt
663     cycles for 100 * FastSqrt

1593    cycles for 100 * Log10 (Guga)
3320    cycles for 100 * MasmBasic Log10
653     cycles for 100 * FastMath Log10
341     cycles for 100 * fsqrt
663     cycles for 100 * FastSqrt

1590    cycles for 100 * Log10 (Guga)
3434    cycles for 100 * MasmBasic Log10
667     cycles for 100 * FastMath Log10
340     cycles for 100 * fsqrt
664     cycles for 100 * FastSqrt

516     bytes for Log10 (Guga)
16      bytes for MasmBasic Log10
67      bytes for FastMath Log10
14      bytes for fsqrt
67      bytes for FastSqrt

Real8   0.6989700043360187465
Real8   0.6989700043360188575
Real8   0.6989700043360188575
Real8   2.236067977499789805
Real8   2.236067977499789805

--- ok ---
Say you, Say me, Say the codes together for ever.

jj2007

Siekmanski & six_L, your built-in fsqrt is faster than the police allowed :cool:

FastMath shines for more complex functions, see this post where it replaces the GSL Bessel function. The speed gain is roughly a factor of 55 on my trusty old Core i5 :tongue:

TimoVJL

AMD Ryzen 5 3400G with Radeon Vega Graphics     (SSE4)
36 µs for initialising FastLog10

2396    cycles for 100 * Log10 (Guga)
6316    cycles for 100 * MasmBasic Log10
1584    cycles for 100 * FastMath Log10
690     cycles for 100 * fsqrt
1518    cycles for 100 * FastSqrt

2418    cycles for 100 * Log10 (Guga)
6323    cycles for 100 * MasmBasic Log10
1542    cycles for 100 * FastMath Log10
699     cycles for 100 * fsqrt
1585    cycles for 100 * FastSqrt

2431    cycles for 100 * Log10 (Guga)
6362    cycles for 100 * MasmBasic Log10
1523    cycles for 100 * FastMath Log10
660     cycles for 100 * fsqrt
1598    cycles for 100 * FastSqrt

2552    cycles for 100 * Log10 (Guga)
6347    cycles for 100 * MasmBasic Log10
1523    cycles for 100 * FastMath Log10
660     cycles for 100 * fsqrt
1547    cycles for 100 * FastSqrt

2466    cycles for 100 * Log10 (Guga)
6344    cycles for 100 * MasmBasic Log10
1520    cycles for 100 * FastMath Log10
660     cycles for 100 * fsqrt
1518    cycles for 100 * FastSqrt

516     bytes for Log10 (Guga)
16      bytes for MasmBasic Log10
67      bytes for FastMath Log10
14      bytes for fsqrt
67      bytes for FastSqrt

Real8   0.6989700043360187465
Real8   0.6989700043360188575
Real8   0.6989700043360188575
Real8   2.236067977499789805
Real8   2.236067977499789805

-
May the source be with you