Test results for AVX and AVX-512 needed

Started by Gunther, December 21, 2017, 11:43:25 AM


aw27

Just to see whether the ZMM part works (with my modified ASM used in floatasm), I ran it under the Intel emulator, selecting an Icelake CPU.


Calculating the sum of a float array with Intel Emulator (150000 iterations only)...
Emulating Icelake CPU


Simple C implementation:
------------------------
sum1              = 8390656.00
Elapsed Time      = 0.71 Seconds

C implementation with 4 accumulators:
-------------------------------------
sum2              = 8390656.00
Elapsed Time      = 0.28 Seconds
Performance Boost = 252%

Assembly language with 4 XMM accumulators:
------------------------------------------
sum3              = 8390656.00
Elapsed Time      = 0.61 Seconds
Performance Boost = 117%

Assembly Language with 4 YMM accumulators:
------------------------------------------
sum4              = 8390656.00
Elapsed Time      = 0.49 Seconds
Performance Boost = 143%

Assembly Language with 4 ZMM accumulators:
------------------------------------------
sum5              = 8390656.00
Elapsed Time      = 2.85 Seconds
Performance Boost = 25%






Gunther

Thank you for the link, felipe.  :t

aw27,

Did you really doubt that the ZMM code works? Another point: four scalar additions per loop cycle take 0.28 seconds, while 16 vectorized additions per loop cycle need 0.61 seconds. That's strange.

Gunther
You have to know the facts before you can distort them.

aw27

Not really doubts, I found this more interesting from a programmer's point of view.

Quote
Another point: four scalar additions per loop cycle take 0.28 seconds, while 16 vectorized additions per loop cycle need 0.61 seconds.
I have not checked how the emulator works, but I believe it was not built with performance in mind, simply to emulate the instructions. This is usually done by single-stepping through the code and replacing unsupported instructions with a software routine.

Gunther

Hi aw27,

Quote from: aw27
I have not checked how the emulator works, but I believe it was not built with performance in mind, simply to emulate the instructions. This is usually done by single-stepping through the code and replacing unsupported instructions with a software routine.

Well, be that as it may, the strangeness remains.

Gunther
You have to know the facts before you can distort them.

aw27

I don't find it very strange, because almost certainly well over 90% of the time is spent outside the exercise's sum computation.

Gunther

Hi aw27,

Quote from: aw27 on December 24, 2017, 03:37:33 AM
I don't find it very strange, because almost certainly well over 90% of the time is spent outside the exercise's sum computation.
That's a very pessimistic estimate. I really do not want fruitless discussions here, and I won't be penny-wise and pound-foolish. But can it be right that the small loop overhead (a very simple for loop), passing two parameters, plus the call and return, needs more time than the floating-point calculation of the sum of an array of 4096 elements?

Incidentally, this is not an unusual scenario. I'll give a practical example: for several years, our software for fractal image compression has been running at CERN. The encoder consists of about 150,000 lines of code, as does the decoder. Decoding is no problem and runs in real time. Encoding is very expensive - hard number crunching. We profiled the entire encoding process and found the following: there are 6 procedures, less than 0.5% of the code, that form the bottleneck. With a small image of 256x256 pixels, they are called over 33 million times, and with each doubling of the image size the effort increases by a factor of four. Those were the time wasters.

What happens inside these procedures? Very simple things: calculating the arithmetic mean, the root mean square, and the scalar product, always of 2 vectors. No witchcraft. So we implemented only these procedures in assembly language, with different code paths: classic use of the good old FPU, SSE2 code, and AVX code. The reason is simple: the PC farm at CERN consists of just over 20,000 computers and is very heterogeneous, for historical reasons. Since these are true-color images, the calculation of the individual color planes was of course parallelized. That was not easy, because there is loose coupling inside the cluster and tight coupling between the cores, but it works well now. To sum up: we hope that we can approach real time with the AVX-512 code path. That would be a large step forward.

That's what I'm doing at the moment. It is my small, modest contribution to the hunt for the Higgs boson and other particles. That's why I designed and wrote the test bed in this form and not differently. Although it looks a bit simplistic at first glance, it has a real background. For the long-time members of the forum this was certainly a boring repetition, because they have known what I've been doing for years. I wrote it down only for the newer members and ask for leniency.

All in all, all the testers have helped me a lot, and for that I would like to thank you once again.

Gunther
You have to know the facts before you can distort them.

aw27

Hi Gunther,
Maybe we are talking about different things. This runs under an emulator, and the emulator takes a lot of time because it runs in single-step mode: it sets the trap flag, which invokes INT 01 after each instruction, causing a call to an ISR in the kernel. The ISR checks the next instruction to be executed and replaces it with a call to a user-mode routine, if necessary. This procedure is repeated until the program ends. That is what I believe happens; I don't know about alternatives, but they may exist. I cannot check against the emulator's source code because it is not available.

hutch--

Gunther,

I have not really kept up with this conversation but your last post had some interesting stuff in it in relation to the isolation of the main bottleneck in doing very large counts of number crunching calculations. I gather the 20000 computers are x86/64 based rather than Itaniums and that there must be some dedicated hardware to interface them so that the calculation load can be distributed in a useful manner. What I wonder is if the hotspot where the time is being taken can be tuned so that a section of the processing power deals directly with this bottleneck.

It's the usual stuff here: bash the guts of the calculation code in its single-thread form to extract maximum speed, then find a technique to parallel-process the workload, then multithread the work based on the thread count of each processor to get more hardware working on the main bottleneck.

Gunther

Steve,

Quote from: hutch-- on December 24, 2017, 07:13:50 AM
I gather the 20000 computers are x86/64 based rather than Itaniums
That's the situation. We have machines with 2, 4, 6, 8, 12 and 16 cores. Machines with 16 cores are not that common - there is a shortage of money. Anyway, the boxes can be interconnected into a large cluster. We're using Open MPI to handle all the complicated stuff. It's robust and rock solid.

The situation is, roughly speaking, as follows. Each image is broken down into domains and ranges. We do that recursively; this results in a domain pool and a range pool. Each domain is twice as large as a corresponding range at the appropriate recursion level. This has to do with the contraction mapping theorem of Stefan Banach, which forms the basis of the whole method. The ranges must not overlap (a disjoint set structure). The domains can overlap, and they do: we simply move a domain window column by column and row by row over the image. Before the comparison, the domain is shrunk down to the appropriate range size. And now comes the bottleneck: we want to find the best fit of a given domain against each range. If we can't find a fit, a new recursion level is required. If we find a fit, that part of the image is marked and we've found the fractal code for it.

Finding a fit with our eyes is easy, but the computer can't see; we have to reduce it to a calculation. For this we use techniques from least-squares adjustment and regression. We calculate various mean values, the slope of the regression line (that is the contrast) and the intercept of the line (that is the brightness). That must be done for every comparison and for 3 color planes. We don't use RGB but YUV, which has advantages for the parallelization: Y is simply a grey-scale image, while U and V are coarse color-difference signals. So we need only 2 cores per image; on one core runs the Y encoding, and on the other the U and V encoding in several threads. That's all tested and proven. Several of my students have successfully written their master's theses on this under my supervision. My task now is to consolidate the different solutions and add the AVX-512 code path, because so far we had no hardware.

I hope that I have not bored anyone with these many technical details.

Gunther 
You have to know the facts before you can distort them.

Siekmanski

Hi Gunther,

Sounds like a cool job for the GPU using the pixel shader.

A CPU consists of a few cores optimized for sequential serial processing while a GPU has a massively parallel architecture consisting of thousands of smaller, more efficient cores designed for handling multiple tasks simultaneously.
Creative coders use backward thinking techniques as a strategy.

Gunther

Hi Marinus,

Quote from: Siekmanski on December 24, 2017, 09:50:18 PM
Hi Gunther,

Sounds like a cool job for the GPU using the pixel shader.
Yes and no. We had a long and hard discussion about CUDA some months ago. The tricky point is that the CERN PC farm is very heterogeneous. We have Intel, AMD, Cyrix, Transmeta and Apple boxes running Linux or BSD, often with special hardware and some very exotic graphics cards. Furthermore, with CUDA you are tied to Nvidia; all the ATI cards (ATI is now AMD), VIA and S3 won't work for that. Not to forget: the GPU data types are a bit exotic - a byte is sometimes 9 bits, or you get 12-bit fixed-point arithmetic. That doesn't really help for our tasks. But let's wait and see, nothing is set in stone. Maybe OpenCL brings an improvement here.

Gunther
You have to know the facts before you can distort them.

FORTRANS

Hi Gunther,

Quote from: Gunther on December 24, 2017, 10:42:30 AM
I hope that I have not bored anyone with these many technical details.

   Actually I find it interesting.  Given the simple description of the
process you gave, I may be off-base here.  Would a Fourier transform
from spatial to frequency space aid in feature classification?  A fairly
complex operation, so it would have to help a lot, I guess.

Cheers,

Steve N.

Gunther

Steve,

Quote from: FORTRANS on December 25, 2017, 12:20:42 AM
Would a Fourier transform
from spatial to frequency space aid in feature classification?  A fairly
complex operation, so it would have to help a lot, I guess.
You're speaking about JPEG or MPEG, aren't you? Yes, they use the discrete cosine transform. That works fine, and one can find a lot of code for it.

But there is a big disadvantage: these methods are not independent of the resolution of the image. Let me explain this a bit. If you enlarge a given image by a factor of two, the entire image must be encoded in JPEG again. In contrast, fractal coding is independent of the resolution. For example, you can encode a 512x512 image fractally and decode it at 8192x8192 without any loss of quality. This is because we do not store the pixels of the image; in the encoding we store only the generating functions. That's the trick: each picture is represented by a set of generating functions. This was proved in 1982 by the Australian mathematician Hutchinson (nomen est omen). All the effort is spent on finding these generating functions. When decoding, we use a technique similar to that in PostScript, because PostScript (or its binary relative, PDF) is resolution independent. We have a picture space and a device space. The complete decoding is done in picture space; only at the very end does the transfer to the device space (printer, plotter, screen - whatever you want) take place. For this we use special transformation matrices.

But as I said, decoding is not the problem at all. There we already have real time, and even a high-level language (C++) suffices. We'll probably have to try assembler for 4K movies later, but that is not a fundamental problem.

I still want to make an important comment, though I cannot go into the details at this point. Anyone who deals with such things lives dangerously; that is not exaggerated, and I do not suffer from paranoia. I had to go through very bad experiences myself, which will not let me go until my end. One of my former students came to a tragic end. If you are interested, you can read about this, at least in part, here. You will find another dark side of Wikipedia here, because that's not even a quarter of the truth. I am not speaking like a blind man about color, because I have had to experience much of this myself.

Normally all master's theses are available in our university library. But as a consequence of this case, our dean has decided that the work of my graduates will be kept in a safe place. Very few people, whom I can absolutely trust, know the entire content of the various master's theses. This practice is very unusual, but I have a responsibility to my graduates, and nobody takes that away from me. These are all young, diligent graduates, some of whom have families and small children; all this is worth protecting.

Gunther   
You have to know the facts before you can distort them.

GoneFishing

Hi Gunther,
That sounds like a detective story, or maybe even a thriller. It's a pity that we can't hear the whole story.

FORTRANS

Hi Gunther,

Quote from: Gunther on December 25, 2017, 01:41:34 AM
Steve,

Quote from: FORTRANS on December 25, 2017, 12:20:42 AM
Would a Fourier transform
from spatial to frequency space aid in feature classification?  A fairly
complex operation, so it would have to help a lot, I guess.
You're speaking about JPEG or MPEG, aren't you? Yes, they use the discrete cosine transform. That works fine, and one can find a lot of code for it.

But there is a big disadvantage: these methods are not independent of the resolution of the image. Let me explain this a bit. If you enlarge a given image by a factor of two, the entire image must be encoded in JPEG again. In contrast, fractal coding is independent of the resolution.

   No, I was not thinking of JPEG compression.  Back in the old days
I saw people making seekers that would "look" for a target.  One
group was trying to use a Fourier transform to look for recognizable
patterns in the frequency domain.  Something like "look for a maximum
peak and compare it to the peaks near it, or peaks at integer multiples
of that frequency".  They were looking for a straight line or rectangle
shaped areas (I think).  They seemed to be having fun, but I never
found out how well they were doing.  (Actually, since they, more or
less, went quietly away from my perspective, they probably did not
have much success.)

   Some things showed up in the frequency domain as recognizable
that were too difficult to "classify" in the spatial domain.  And some
size or orientation estimates could be made quickly as well.

Regards,

Steve N.