MOVDQA/MOVDQU vs. MOVNTDQA/MOVNTDQ: What's a good cache strategy?

Started by dawnraider, October 26, 2016, 11:02:41 AM


dawnraider

Hi all,

I am implementing image processing algorithms on Intel-based X86/X64 hardware using AVX/AVX2 and need to clarify my understanding
of temporal/non-temporal hints, and when they are useful.

My understanding is that a non-temporal hint allows a memory read/write to bypass the cache and avoid cache pollution, so the
cache contents are left undisturbed and the hardware prefetchers (or any temporal hints laid down in code) can keep the cache levels working optimally.

This may work well for algorithms that either (1) read a lot from memory and write little, or (2) read little from memory but write a lot, but in the
field of image processing one typically does as many reads as writes, especially for raster-based/pixel-based algorithms.

So, I was wondering:

1) Where the amount of data read per "processing round" is the same as the amount of data written (or at most, a 2:1 ratio either way),
     are non-temporal hints of any use?

2) If, for every pixel I read, I have to write that pixel back, on which side should the non-temporal hint go: the read from memory or the write
     to memory? (The sketch below shows the shape of the loop I mean.)
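
To make question 2 concrete, the inner loop is currently shaped roughly like this (a simplified sketch only; rsi/rdi as source/destination
pointers and rcx as a byte count are just placeholders, and the real per-pixel processing is omitted). The question is whether the
vmovdqu should become a vmovntdqa, or the vmovdqa a vmovntdq:

pixel_loop:
vmovdqu ymm0,[rsi]     ; unaligned temporal read of 32 bytes of pixels
; ... per-pixel processing on ymm0 ...
vmovdqa [rdi],ymm0     ; aligned temporal write of the result
add rsi,32
add rdi,32
sub rcx,32
jnz pixel_loop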

If anybody has any guidance I would be extremely grateful. Thanks!

hutch--

The rough guide I have found in practice is to use cached reads and non-cached writes, thus keeping each data stream separate from the other. I have not really tested the AVX versions of the mnemonics and would suggest that you do a number of test pieces and time the results.
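
Something like this is what I mean, a cached read and a cache bypass write (only a rough sketch, rsi/rdi/rcx and the 16 byte
granularity are placeholders for whatever your own code uses, and both buffers are assumed 16 byte aligned):

copy_loop:
movdqa xmm0,[rsi]      ; temporal read through the cache
; ... processing on xmm0 ...
movntdq [rdi],xmm0     ; non-temporal write, bypasses the cache
add rsi,16
add rdi,16
sub rcx,16
jnz copy_loop
sfence                 ; make sure the streaming stores have completed before the data is used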

dawnraider

Quote
I have not really tested the AVX versions of the mnemonics...

Does this mean that you have had some experience of using hinting with the SSE versions of these instructions?
If so, did you notice any benefits in the hint direction and placement?

hutch--

I tried out the hinting mnemonics back in the Pentium era when they were introduced, but I never saw anything go faster from using them, so I rarely ever bother with them. Instruction choice matters, and so does benchmarking to see what is the fastest. I recommended cached reads because they are generally faster, but writing back through the cache makes everything slower, so you use the cache-bypassing writes.

jj2007

Quote from: hutch-- on October 26, 2016, 02:10:27 PM
I never saw anything go faster from using them

It's not easy to find a practical case for using them. Typically, you are reading or writing to memory because you intend to use the data somehow. That excludes cases where you copy 64k, process them, copy again, process etc.

I wonder if the OS uses these instructions when it loads 100MB from disk to memory...

What does your case look like?

Have a look at
http://www.masmforum.com/board/index.php?topic=15389.0
http://www.masmforum.com/board/index.php?topic=14685.msg119939#msg119939

And googling MOVNTDQA site:software.intel.com might also be a good idea.

dawnraider

As I mentioned previously, the use case is image processing where all pixels are first read, processed, and written back.
Sometimes, there are two pixels required on the input for every pixel output.

Quote
http://www.masmforum.com/board/index.php?topic=15389.0
http://www.masmforum.com/board/index.php?topic=14685.msg119939#msg119939

I had already seen these threads. However, as they only deal with writing data very quickly and not reading it, they didn't
really answer my questions.

The Intel 64 and IA-32 Architectures Software Developer's Manual recommends using non-temporal stores in combination with
temporal loads and software-specified prefetch hints (Feb 2014 Edition, Vol. 1 Chapter 10.4.6.2).
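
As I read it, that recommendation amounts to something like the following (my own sketch, not code from the manual; the
prefetcht0 distance of 512 bytes is an arbitrary placeholder that would need tuning, and rdi is assumed 16 byte aligned for the
non-temporal store):

process_loop:
prefetcht0 [rsi+512]   ; software prefetch hint, pull source data towards the cache ahead of use
movdqu xmm0,[rsi]      ; temporal (cached) load of the source pixels
; ... processing ...
movntdq [rdi],xmm0     ; non-temporal store of the result, bypassing the cache
add rsi,16
add rdi,16
sub rcx,16
jnz process_loop
sfence                 ; fence the streaming stores before the output buffer is read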

I have existing code that I can easily modify for non-temporal stores, so I might just give that a go. However, testing
non-temporal loads will take a significant amount of redesigning, as the code currently uses unaligned loads/aligned
stores for maximum code compactness/performance. As non-temporal loads (MOVNTDQA) appear to be aligned only,
I would possibly have to rewrite the code to use aligned loads, otherwise any performance gains from the non-temporal
hints may be hidden by poorer memory alignment at the input(s).

rrr314159

@dawnraider, with an i5 a couple of years ago I tried non-temporal hints and found no performance improvement. It seemed that the CPU's own algorithm was good enough already, and it ignored my hints. Aligned vs. unaligned also made little difference; I read that it had been important in earlier generations of processors but not so much any more. I was doing image processing, which sounds not unlike your project. So, good luck, but don't count on a huge improvement. If a lot of rewriting is required, try to check it out first to see if it's worth the trouble. And please let us know what you find!
I am NaN ;)

johnsa


From my experience, I've not found any benefit to using the cache hinting operations like prefetcht0, prefetchnta etc. I've run thousands of tests on them with various offset amounts and never really found any significant performance difference that pokes its head out of the general measurement noise.

That said, I generally still stick to using vmovaps/movaps/vmovdqa/movdqa wherever possible instead of the unaligned versions.

I also found that using movlps/movhps combinations was considerably faster than movups, i.e.:

; Write HDR pixels (movlps/hps is far better than movups).
vmovq [rdi],xmm4
vmovhps [rdi+8],xmm4
vmovq [rdi+rdx],xmm6
vmovhps [rdi+rdx+8],xmm6


With regard to the non-temporal operations, it can be a bit hit and miss. I've not found the load operations to be of any benefit, but depending on the algorithm
and how you use the data they can be considerably faster for writes. Basically, if you access your pixel buffer/data in a random way with small strides, the non-temporal versions will be worse.
If you need to read the data back again at any point in the near future, they will be worse; if the strides are massive and causing cache misses every time, non-temporal will be better.
If the data is massive (i.e. you're writing sequentially, one time only, an image buffer of 4096x4096 or something), then they will be better. I normally put both in and, while working, switch back and forth to see,
as it sometimes changes in favour of one or the other during development depending on what else is happening.
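
For the big one-pass sequential case I mean something along these lines (sketch only; it assumes rdi is 32 byte aligned, rcx is a
byte count that is a multiple of 32, and ymm0 already holds the data to be written, e.g. filling or clearing a large buffer):

stream_loop:
vmovntdq [rdi],ymm0    ; 32 byte streaming store, no cache line fill for the destination
add rdi,32
sub rcx,32
jnz stream_loop
sfence                 ; order the streaming stores before the buffer is consumed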