Test results for AVX and AVX-512 needed

Started by Gunther, December 21, 2017, 11:43:25 AM


hutch--

I have seen this happen before. When a test is designed for hand-written code and does not allow for compiler optimisation, the compiler will just short-circuit the test method if it is a simple repeat. It looks flashy, but it is simply not making the comparison that the author had in mind. The test needs to be designed so that the compiler can no longer short-circuit the work on the data; once the compiler cannot do this, the test may be valid.
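
One common way to keep an optimizer from collapsing a repeated call can be sketched in C as follows. This is only an illustration under assumptions: `sum_plain` and `run_benchmark` are hypothetical names and are not taken from the sources in this thread. The input is changed slightly before every pass and each result is stored to a `volatile` object, so the compiler cannot simply reuse the first result.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel under test: a plain C sum. */
float sum_plain(const float *x, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += x[i];
    return s;
}

/* Stores to a volatile object are observable side effects, so the
   compiler has to perform every one of them. */
static volatile float g_sink;

/* Calls the kernel `repeats` times. x[0] is changed before each call,
   so every pass produces a different result. Returns the sum of all
   per-pass results. */
float run_benchmark(float *x, size_t n, int repeats)
{
    assert(x != NULL && n > 0);
    float saved = x[0];
    float total = 0.0f;
    for (int r = 0; r < repeats; r++) {
        x[0] = saved + (float)r;   /* perturb the input */
        float s = sum_plain(x, n);
        g_sink = s;                /* the result must really be produced */
        total += s;
    }
    x[0] = saved;                  /* restore the caller's data */
    return total;
}
```

The `clock()` calls would go around `run_benchmark`. The perturbation costs one store per pass, which should be negligible next to a sum over a large array.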

Gunther

#16
Steve,

Quote from: hutch--
It looks flashy, but it is simply not making the comparison that the author had in mind.
Flashy? That's putting it very mildly. But that's exactly the point.

Quote from: aw27
The trickery was simple. VS simply noticed that
   for (i = 0; i < FINISH; i++) {
      sum1 = Sum1AccuC(X, N);
   }
never changes the value of sum1, so why bother to repeat the loop if they already got the result.  :icon_eek:
That was my guess. It wasn't intended, and it isn't fair, is it? It comes close to disqualifying the programmer. Two things, by the way: what does your machine say now with the new code? And: the product line you offer on your website is interesting. Furthermore, take care with VS; it has weaknesses. One weakness of the MS libc was discussed in this thread.

Thank you, Marinus, for testing. Your results look very reasonable.

Gunther
You have to know the facts before you can distort them.

aw27

You should not ask the compiler to sub-optimize the overall C code when you do your best to optimize the ASM.
I am sure you can produce a better test, that will also prevent the cache effects of a tight loop like this.
And, I am not interested in compiler wars. :t

Gunther

#18
Hi aw27,

Quote from: aw27
And, I am not interested in compiler wars. :t
No offense intended. You surely have good reasons to use VS. I had a copy of VS running here, but it's outdated, so I have to renew it through my university. There is no chance of doing that between Christmas and New Year.

Quote from: aw27
You should not ask the compiler to sub-optimize the overall C code when you do your best to optimize the ASM.
Good point. On the other hand, the C code isn't the bottleneck.

Quote from: aw27
I am sure you can produce a better test, that will also prevent the cache effects of a tight loop like this.
That may or may not be so. It's possible that I didn't think it through. But with 11 MB of L3 cache there shouldn't be many cache misses.

On the other hand, I have a hard time with many low blows behind me, and I am glad to be able to work reasonably well in the forum again. The last year was really not easy for me. Maybe I'll talk about some things soon in the Soap Box. Maybe at the moment I lack the sophistication to write reasonably good software. I'll do my best to change that, but it was a school of hard knocks. I hope you'll understand.

Nevertheless, many thanks for the help of so many forum members, for example Steve (aka Hutch--), Marinus, Jochen, Habran and many more. I am also very grateful for your help, aw27.

Gunther

HSE

Quote from: Gunther on December 22, 2017, 02:38:38 AM
Do you have an AMD CPU?

No. Intel Pentium B960@2.2GHZ

I installed ArkDasm, and it shows me:
; --------------------------------------------------------------------------
; sub_00401cc2
; --------------------------------------------------------------------------
sub_00401cc2   proc
main:0000000000401cc2 mov eax, 0x80
main:0000000000401cc7 vmovaps ymm0, ymmword ptr [rcx]   <-- this is highlighted

rcx = 000000000022bd80

Features say: MMX, CMOV and FCOMI, SSE, SSE2, SSE3, SSSE3, SSE4.1,
POPCNT, SSE4.2

featurenumber = 13

Equations in Assembly: SmplMath

hutch--

 :biggrin:

Gunther,

It won't take long to get back up to pace. On the bright side, when you are on the comeback trail you re-emerge with new ideas and a fresh look at things you may have missed before.  :t

Gunther

Steve,

thank you very much for your warm and heartfelt words. I'll do my best to contribute some ideas and code snippets. In general, my contributions are not very valuable; there's plenty of room for improvement.

HSE,
Quote from: HSE
Features say:
MMX, CMOV and FCOMI, SSE, SSE2, SSE3, SSSE3, SSE4.1,
POPCNT, SSE4.2

featurenumber = 13
With featurenumber 13 you should have AVX2 available.

Could you run Isets.exe from the first post of this thread, please? It's in IsetsAVX-512.zip and should do no harm on your machine. The program output should be similar to this, without the AVX-512 section:

        Supported Features by Processor and Operating System
        ====================================================

Vendor String: GenuineIntel
Brand  String: Intel(R)Core(TM)i7-7820XCPU@3.60GHz

        Instruction Sets
        ----------------

MMX  SSE  SSE2  SSE3  SSSE3  SSE4.1  SSE4.2  AVX  AVX2
AVX-512 F  - Fundamental Instructions
AVX-512 DQ - Double and Quad Word Instructions
AVX-512 CD - Conflict Detection Instructions
AVX-512 BW - Byte and Word Support Instructions
AVX-512 ER - Exponential and Reciprocal Instructions


Please, press enter to end the application ...

You'll find the explanation in this post. That should help.

Gunther

aw27

Quote
With featurenumber 13 you should have AVX2 available
I don't think the featurenumber is correctly calculated. I made my tests on a SandyBridge which has no support for AVX2 and it was considering it with featurenumber 12. It did not crash because the tests did not require AVX2.  Intel Pentium B960 is a SandyBridge without AVX support.
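
The point that a reported feature level can overstate what the CPU really supports can be illustrated with a small C sketch. It shows the documented checks for AVX and AVX2; it is not the thread's `featurenumber` calculation, whose source is not shown here. The struct `cpu_regs` and both function names are hypothetical, and the register values are assumed to have been read beforehand with the CPUID and XGETBV instructions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Raw register values, assumed to be read elsewhere (hypothetical struct). */
struct cpu_regs {
    uint32_t leaf1_ecx;  /* CPUID with EAX=1: ECX          */
    uint32_t leaf7_ebx;  /* CPUID with EAX=7, ECX=0: EBX   */
    uint64_t xcr0;       /* XGETBV with ECX=0              */
};

#define LEAF1_ECX_OSXSAVE (1u << 27)  /* OS has enabled XSAVE/XGETBV */
#define LEAF1_ECX_AVX     (1u << 28)  /* CPU implements AVX          */
#define LEAF7_EBX_AVX2    (1u << 5)   /* CPU implements AVX2         */
#define XCR0_XMM_YMM      0x6u        /* XMM (bit 1) and YMM (bit 2) */

/* AVX is usable only if the CPU has it AND the OS saves YMM state. */
int avx_usable(const struct cpu_regs *r)
{
    assert(r != NULL);
    uint32_t need = LEAF1_ECX_OSXSAVE | LEAF1_ECX_AVX;
    if ((r->leaf1_ecx & need) != need)
        return 0;
    return (r->xcr0 & XCR0_XMM_YMM) == XCR0_XMM_YMM;
}

/* AVX2 has its own flag in leaf 7; AVX support alone does not imply it. */
int avx2_usable(const struct cpu_regs *r)
{
    return avx_usable(r) && (r->leaf7_ebx & LEAF7_EBX_AVX2) != 0;
}
```

Checking each instruction set by its own flag, rather than deriving it from a single running count, avoids reporting AVX2 on a CPU that only has AVX, or AVX on one that has neither.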

Quote
On the other hand, the C code isn't the bottleneck
The simple fact is that the compiler is particularly good at optimizing traditional C code and particularly bad with vector instructions.
This has been shown in this article https://www.codeproject.com/Articles/1182676/Need-for-Speed-Cplusplus-versus-Assembly-Language .
Although a few people said the ASM was not well optimized, more than 6 months later the author is still sitting and waiting for someone to produce a version able to beat the compiler.



hutch--

 :biggrin:

> more than 6 months later the author is still sitting and waiting for someone to produce a version able to beat the compiler

I wonder with such open ended comparisons if anyone could be bothered responding.

aw27

Quote from: hutch-- on December 22, 2017, 07:21:19 PM
:biggrin:

> more than 6 months later the author is still sitting and waiting for someone to produce a version able to beat the compiler

I wonder with such open ended comparisons if anyone could be bothered responding.
With open ended comparisons each response should be seen as a unique opinion, not statistically significant.  :biggrin:

HSE

        Supported Features by Processor and Operating System
        ====================================================

Vendor String: GenuineIntel
Brand  String: Intel(R)Pentium(R)CPUB960@2.20GHz

        Instruction Sets
        ----------------

MMX  SSE  SSE2  SSE3  SSSE3  SSE4.1  SSE4.2


Please, press enter to end the application ...


Evidently there is no AVX, but I was still expecting some automatic detection in the program. Thanks.

Gunther

Hi HSE,

Quote from: HSE
Evidently there is no AVX, but I was still expecting some automatic detection in the program. Thanks.
I think that aw27 is right. He wrote:
Quote from: aw27
I don't think the featurenumber is correctly calculated. I made my tests on a SandyBridge which has no support for AVX2 and it was considering it with featurenumber 12. It did not crash because the tests did not require AVX2.  Intel Pentium B960 is a SandyBridge without AVX support.
That's the point. On the other hand, the main procedure should make sure that the AVX code path isn't called. It contains the following code snippet:

// Check AVX support

    if (featurenumber >= 12){                 // can we use AVX?
        printf("\nAssembly Language with 4 YMM accumulators:\n");
        printf("------------------------------------------\n");
        start = clock();                      // yes
        for(i = 0; i < FINISH; i++){
            sum4 = Sum4AccuYMM(X, N);
        }
        stop = clock();
        t4 = (float)(stop-start)/(float)CLOCKS_PER_SEC;
        boost = t1*100/t4;
        printf("sum4              = %.2f\n", sum4);
        printf("Elapsed Time      = %.2f Seconds\n", t4);
        printf("Performance Boost = %.0f%%\n",boost);
    }
    else{                                     // no: print a message
        printf("\nYour current CPU doesn't support the AVX instruction set.\n");
        printf("You'll need at least the Sandy Bridge or Ivy Bridge architecture.\n\n");
        printf("The application terminates now.\n");
    }

I'm not sure if I have posted the latest version in the forum. So I recompiled the sources, and attached to the first post of this thread you'll find the latest versions of both applications: float.exe and floatassembly.exe (with sources, of course). I hope that neither will crash on your machine. Could you test that, please? Please excuse the inconvenience.
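
The snippet above times a routine labeled "4 YMM accumulators". For readers unfamiliar with the idea, here is a scalar C sketch of summing with several independent accumulators. It is not the assembly routine `Sum4AccuYMM`, whose source is not shown in this thread; the function name `sum_four_accumulators` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Sums n floats with four independent accumulators, so consecutive
   additions do not have to wait for one another. */
float sum_four_accumulators(const float *x, size_t n)
{
    assert(x != NULL || n == 0);
    float a0 = 0.0f, a1 = 0.0f, a2 = 0.0f, a3 = 0.0f;
    size_t i = 0;

    /* Main loop: four elements per pass, one per accumulator. */
    for (; i + 4 <= n; i += 4) {
        a0 += x[i];
        a1 += x[i + 1];
        a2 += x[i + 2];
        a3 += x[i + 3];
    }

    /* Tail: up to three leftover elements. */
    for (; i < n; i++)
        a0 += x[i];

    /* Combine the partial sums once at the end. */
    return (a0 + a1) + (a2 + a3);
}
```

Because floating-point addition is not associative, the result can differ in the last digits from a single-accumulator sum over the same data. That is worth keeping in mind when the printed sums of two code paths are compared.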

Gunther

Gunther

Hi aw27,

Quote from: aw27
The simple fact is that the compiler is particularly good at optimizing traditional C code and particularly bad with vector instructions.
This has been shown in this article https://www.codeproject.com/Articles/1182676/Need-for-Speed-Cplusplus-versus-Assembly-Language .
Although a few people said the ASM was not well optimized, more than 6 months later the author is still sitting and waiting for someone to produce a version able to beat the compiler.

I could only skim the article, but it seems to be very instructive. It could be a great challenge for the members of our forum. I've seen more than once how the compiler's excellent code has been improved and beaten. I'm just a simple lover of assembler code. There are many forum members who can do that better and contribute higher quality posts and applications. But my experience is this: at the beginning, let the compiler do the job at the highest optimization level. That's the right starting point. Then check the resulting assembler code very carefully and start making improvements. Only then does one have a chance to win. Of course, someone can say: That's not fair. Well, life is not fair.

As I said, that's just my humble point of view.

Gunther

aw27

Hi Gunther,

Learning assembly language is important, even on the day compilers are able to do better than humans in every case. That day will arrive. I still remember the days when most people believed it was impossible for a machine to beat a grandmaster at chess, because machines had no global vision, could not recognize patterns, had no sense of position, and were not able to think in strategic terms; they could only use brute force. They were wrong: most chess programs nowadays easily beat every chess grandmaster.

felipe

Haha, and here we go again, right?  ;)

:lol:

I would say that if that were totally true, machines and computers would do everything better some day, but I think that's not correct. It's just a simple generalization. Humans will always be smarter than machines, even if we don't realize it.  :biggrin:

Btw, I always question the real importance of chess. Maybe it's a stupid game. To a large extent, humans have given machines the role of doing stupid and brutal things. So they can win a chess game, but a cat can piss on a computer.  :lol: