Memory and cache questions

Started by Leo, February 19, 2014, 02:38:02 AM

Leo

I have an algorithm that seems to be constrained by some combination of memory bandwidth and cache size, and I am trying to think of ways to improve its performance. The algorithm receives data in an input array and writes data to an output array. Both arrays can be very large. The input array is never written to, while the output array is never read from (inside the main algorithm). The access pattern is not sequential, but it is predictable: in most iterations the algorithm reads and writes data that is close in memory to data that was read/written N iterations earlier. N is constant for a whole run, but changes between runs. Data is read 4 bytes at a time and written one byte at a time, and the input array is twice as large as the output array.
One thing that I found helps is issuing prefetcht0 on the input array a few iterations before it is read. I also tried using non-temporal writes with maskmovq; this gave large improvements for some usage patterns on some CPUs, and even larger penalties in other cases, so in the end I just used ordinary memory writes. The question is what else I can do to improve performance. I read that there is a way to make all memory writes to some region behave like maskmovq, but I never managed to get it working. How can it be done? Any other ideas?
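
Roughly, one inner iteration with the prefetch and the masked write looks like this (a sketch only: register roles, the prefetch distance and the byte mask are placeholders, not my real values):

DIST equ 16                                  ; iterations to prefetch ahead, tuned per CPU

    mov eax, [esi+ebx*4]                     ; 4-byte read from the input array
    prefetcht0 byte ptr [esi+ebx*4+DIST*4]   ; request a line a few iterations ahead
    ; ... processing leaves the result byte in byte 0 of mm0 ...
    maskmovq mm0, mm1                        ; mm1 byte 0 = 80h: non-temporal store of
                                             ; that one byte to the implicit [edi]
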
Thanks

jj2007

Hi Leo,

Without a proper testbed, it is close to impossible to give any useful advice. Try rep movsd, it might be the fastest option (and now you are wondering whether I am serious, of course ;))

The Lab is full of such questions, and in most cases we could identify a "winner" in the end. I suggest you try to reduce the problem to its essential elements, and then we set up a testbed.

Leo

The first question is whether I can exploit the fact that the output array is never read from inside the algorithm. I understand that using non-temporal writes can help; the problem is that I need to write the data one byte at a time. I tried using movdqa, but the results varied wildly across different CPUs and usage patterns. In the end I settled for an ordinary memory write with pextrb. I read that there is a way to mark a whole memory region so that all memory writes to it behave like movdqa, but I never got it working. Any ideas on what can be done here?

A rough approximation of the algorithm:
loop 1: for i = 0 to MaxI
{
    calculate c (changes slightly from the previous iteration)
    loop 2: for j = 0 to MaxJ
    {
        calculate data index k ~ j*c + i
        read data from InputArray[k]
        prefetcht0 InputArray[k a few iterations in the future]   // this helps
        do some processing
        write data to OutputArray[k]
    }
}

The formula for k is more complicated, but it is generally similar to what I wrote. Assume that the flow of the algorithm can't be changed: although making the access sequential is the obvious solution in the simplified algorithm I wrote here, the data access pattern can't be made sequential in the actual algorithm. What I am trying to understand is whether the memory performance can be improved. One issue this algorithm has, if MaxJ is large enough, is that it reads and writes only a few bytes at a time, yet a whole cache line is loaded into cache each time. For example, with 64-byte lines, the inner loop pulls up to MaxJ*64 bytes through the cache per outer iteration, so once that exceeds the cache size, by the next time it comes to read from about the same place for i=n+1 the relevant cache line has been evicted, and it is loaded again and again. I don't know if anything can be done about it.

hutch--

Leo,

The problem you are getting is called "page thrashing": reads and writes go to memory locations that are too far apart to be served from the same memory page. In the worst case, if you do one read then one write to locations far enough apart, you get a memory page load for every single read and write. If it is possible with the task you are performing, try to load more data on each read and write more data on each write; that reduces the number of memory page loads for any given amount of data you process.
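
For example, something along these lines (just a sketch, with esi as the source pointer and edi as the destination pointer, both assumed to be set up already):

    mov eax, [esi]          ; a burst of reads from the source page ...
    mov ebx, [esi+4]
    mov ecx, [esi+8]
    mov edx, [esi+12]
    mov [edi], eax          ; ... then a burst of writes to the destination page
    mov [edi+4], ebx
    mov [edi+8], ecx
    mov [edi+12], edx
    add esi, 16
    add edi, 16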

jj2007

Quote from: Leo on February 19, 2014, 11:29:57 PM
I tried using movdqa, but the results varied wildly across different CPUs and usage patterns. In the end I settled for an ordinary memory write with pextrb.

movlps is often faster than movdqa. I have no experience with pextrb, but it doesn't look like a particularly fast instruction, and I wonder if the conversion from xmm to reg32 could be avoided altogether?
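
For example (registers and alignment assumed):

    movdqa [edi], xmm0      ; 16-byte store, needs 16-byte alignment
    movlps [edi], xmm0      ; 8-byte store of the low qword, often cheaper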

Again, it would be helpful if you could set up a testbed - just the two loops, with real instructions. If you don't want to reveal your professional secrets, use some dummy calculations for "calculate c".

How did you time your tests so far? rdtsc, QPC?

MichaelW

In your algorithm does k change between the read and the write?

Leo

Quote from: jj2007 on February 20, 2014, 01:50:52 AM
I have no experience with pextrb, but it doesn't look like a particularly fast instruction, and I wonder if the conversion from xmm to reg32 could be avoided altogether?

pextrb writes directly from xmm to memory. No general registers are involved.

Quote from: MichaelW on February 20, 2014, 02:48:19 AM
In your algorithm does k change between the read and the write?

The input array and the output array are distinct, so even if it is the same k (and it is much of the time), these are completely different places in memory.

MichaelW

Quote from: Leo on February 20, 2014, 03:29:29 AM
The input array and the output array are distinct, so even if it is the same k (and it is much of the time), these are completely different places in memory.

Is this layout necessary, or can you make the element and array sizes the same, and then interleave the arrays?

jj2007

Quote from: Leo on February 20, 2014, 03:29:29 AM
pextrb writes directly from xmm to memory. No general registers are involved.

I had looked at its older brother, PEXTRW r32, xmm, imm8 (that one dates back to the Pentium III...), but PEXTRB reg/m8, xmm2, imm8 does indeed write directly from xmm to memory :t
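
So the byte store would be something like this (operands assumed):

    pextrb byte ptr [edi], xmm0, 0    ; byte 0 of xmm0 straight to memory (SSE4.1)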

Nonetheless, I'd be curious to see real code here. You need a very good excuse to write to memory bytewise ::)

hutch--

Instruction choice will not solve the problem of page thrashing, and neither will direct writes that bypass the cache. You must address the memory page load times and try to reduce the swapping from read to write and back. If it is possible, redesign the algorithm so that you read more data on each read and write more data on each write. Until you solve this problem, instruction choice will make only trivial differences.

MichaelW

This is an attempt to compare the data transfer time between two separate arrays, and one combined array where the source and destination elements are interleaved.

;==============================================================================
include \masm32\include\masm32rt.inc
.686
include \masm32\macros\timers.asm
;==============================================================================
ARRAY_SIZE equ 10000000
;==============================================================================
.data
.code
;==============================================================================
start:
;==============================================================================

    mov esi, alloc(ARRAY_SIZE*4)    ; source array, ARRAY_SIZE dwords
    mov edi, alloc(ARRAY_SIZE*4)    ; separate destination array
    printf("%d\t%d\n\n",esi,edi)

    invoke Sleep, 5000

    timer_begin 1, REALTIME_PRIORITY_CLASS
        xor ebx, ebx
        .WHILE ebx < ARRAY_SIZE
            mov eax, [esi+ebx*4]    ; read a dword from the source array
            mov [edi+ebx*4], eax    ; write it to the separate destination array
            add ebx, 1
        .ENDW
    timer_end
    printf("%dms\n\n", eax)

    free esi
    free edi

    mov esi, alloc(ARRAY_SIZE*8)    ; one combined array, source and destination interleaved 1:1

    invoke Sleep, 5000

    timer_begin 1, REALTIME_PRIORITY_CLASS
        xor ebx, ebx
        .WHILE ebx < ARRAY_SIZE
            mov eax, [esi+ebx*8]    ; read the source element
            mov [esi+ebx*8+4], eax  ; write the interleaved destination element
            add ebx, 1
        .ENDW
    timer_end
    printf("%dms\n\n", eax)

    free esi

    inkey
    exit
;==============================================================================
end start


I tested on a 3.06GHz P4 Northwood with HT, a 533MHz FSB, 512KB L2 cache (AFAIK), and 512MB of memory. This shows the ARRAY_SIZE value and the approximate speed advantage of the interleaved array:

ARRAY_SIZE   advantage
10000000     ~6%
20000000     ~10%
30000000     ~0

I'm not sure whether the drop off in speed advantage is a caching effect, a swap-file effect, or just an indicator that my concept is flawed.

Leo

Quote from: MichaelW on February 20, 2014, 03:37:50 AM
Is this layout necessary, or can you make the element and array sizes the same, and then interleave the arrays?

I can interleave the arrays, but if I do, I am going to end up with the result array 3-4 times larger than it needs to be. Then I will either need to convert it to a smaller one or work with a big array that wastes memory and is slow. I am also unsure about the benefit this will bring.

Quote from: MichaelW on February 20, 2014, 04:16:10 PM
This is an attempt to compare the data transfer time between two separate arrays, and one combined array where the source and destination elements are interleaved.

It would more closely represent what I do if the memory access were made non-sequential. Assume you have square two-dimensional arrays and two loops that process the data, with the catch that the index order is reversed: the outer loop runs over the minor (fast-varying) index and the inner loop over the major index, so consecutive inner iterations touch locations that are far apart in memory. This is not exactly what happens in the algorithm, but it is close enough. The loop order can not be reversed. The input array is twice the size of the output array: input elements are words, while output elements are bytes.
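
In code, a sketch of that pattern would be (array names, DIM and register roles are placeholders):

    xor esi, esi                          ; i, the minor index (outer loop)
OuterLoop:
    xor edi, edi                          ; j, the major index (inner loop)
InnerLoop:
    mov edx, edi
    imul edx, DIM                         ; k = j*DIM + i, stride of DIM per j
    add edx, esi
    movzx eax, word ptr InputArray[edx*2] ; word read from the input
    ; ... processing ...
    mov OutputArray[edx], al              ; byte write to the output
    inc edi
    cmp edi, DIM
    jb InnerLoop
    inc esi
    cmp esi, DIM
    jb OuterLoop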

Leo

Quote from: hutch-- on February 20, 2014, 10:09:32 AM
Instruction choice will not solve the problem of page thrashing, and neither will direct writes that bypass the cache. You must address the memory page load times and try to reduce the swapping from read to write and back. If it is possible, redesign the algorithm so that you read more data on each read and write more data on each write. Until you solve this problem, instruction choice will make only trivial differences.

While small improvements might be made to the algorithm, I don't think the data access pattern can be changed. The question is whether, for this access pattern, the memory performance can be improved.

MichaelW

Quote from: Leo on February 21, 2014, 01:26:11 AM
I can interleave the arrays, but if I do, I am going to end up with the result array 3-4 times larger than it needs to be. Then I will either need to convert it to a smaller one or work with a big array that wastes memory and is slow. I am also unsure about the benefit this will bring.

I suspect that accessing the result array a byte at a time is one of your performance limitations, but I did not bother to test this because I don't have any recent hardware (the same reason I did not provide detailed results). I interleaved the arrays to improve the caching. While I have doubts that interleaving them 1:1 is necessary, doing it that way made the addressing simple and efficient. I was assuming that you would start with a dword input array, construct the interleaved array, do the processing, and then extract the result array.
Why not run my code on your system, where you have enough memory and cache to handle a really large array, and see what it does?