AVX code to calculate sum of array or sum of squares of array

Started by Mark44, February 20, 2014, 05:44:03 PM

Mark44

Hi all! Brand new member here, but I have had an interest in assembly programming for a long time. I just retired in December from a large software firm in Redmond, WA, where I worked as a programming writer for 15 years. Now that I'm retired, I'm able to spend time on whatever interests me, which at the moment is finding out the capabilities of the new computer I just bought. The CPU is an Intel Core i7 (Ivy Bridge), so it supports SSE up to 4.2 and AVX, but not AVX2.

For my own amusement I've been poking around with these technologies, first using inline assembly in C code, and more recently using the MASM that comes with Visual Studio 10, including ml64.

Here is some 64-bit assembly code that uses AVX and other instructions to calculate either the sum of the array passed to it, or the sum of the squares of its elements. The routine loads 16 floats at a time, starting 16 floats from the end of the array, and works its way to the front. The code assumes that the number of elements is a multiple of 16 and that the array is aligned on a 32-byte boundary.
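As a cross-check for the traversal described above, here is a minimal scalar sketch in plain C. It is not the posted routine: the name sum_reference is hypothetical, and it only assumes the same three arguments (array pointer, length minus 16, flag) and the same back-to-front walk in blocks of 16.

```c
#include <assert.h>
#include <stddef.h>

/* Scalar reference with the same arguments as the AVX routine.
   start_index is (array length - 16); the length is assumed to be a
   positive multiple of 16, as the post states. The name is hypothetical. */
float sum_reference(const float *input, int start_index, int flag)
{
    assert(input != NULL);
    assert(start_index >= 0 && start_index % 16 == 0);

    float total = 0.0f;
    /* Walk backwards one 16-float block at a time, like the assembly loop. */
    for (int block = start_index; block >= 0; block -= 16) {
        float block_sum = 0.0f;
        for (int i = 0; i < 16; ++i) {
            float x = input[block + i];
            block_sum += flag ? x * x : x;   /* flag != 0 selects squares */
        }
        total += block_sum;
    }
    return total;
}
```

Because the additions happen in a different order than in the vector code, results on non-integer data may differ in the last few bits. Testing with at least three blocks (48 or more elements) also exercises how the running sum is carried from one iteration to the next.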

The code shown here runs and produces correct results. I've written a simple, crude timing routine (using rdtscp) but haven't used it on this code yet. If you have any comments about the code, let me know.

Mark

.data

zero dd 4 dup(0)
.code
sum_of_x_squared PROC C
; C prototype: float sum_of_x_squared(float * inputArray, int array_len_minus_16, int flag)
; The sum_of_x_squared proc returns either the sum of the elements of a passed array or
; the sum of the squares of the elements of the passed array. What is returned is determined by
; the value of the third parameter: 0 for sum of elements, 1 for sum of squares of elements.
; The procedure cycles through an array of floats, and works its way from the end
; back to the beginning.
; In each iteration, ymm1 is set with 8 floats, and ymm2 is set with the next 8 floats in memory. 
; Input args: inputArray passed in RCX, (arr_len - 16) is passed in RDX, flag in R8.
; If flag == 0, the sum of the array is returned.
; If flag == 1, the sum of the squares is returned.
; Return value: the requested sum (of elements or of squares) is returned in XMM0.

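; RDX and R8 carry 32-bit ints, so their upper halves are not guaranteed to be clean.
movsxd rdx, edx ; Sign-extend the starting index before it is used in an address.
; Zero only the running-sum register. vzeroall would also clear xmm6-xmm15,
; which the Windows x64 calling convention requires a callee to preserve.
vxorps xmm3, xmm3, xmm3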

mainLoop:
; Clear xmm0, xmm4, and xmm5.
vmovaps xmm0, xmmword ptr [zero]
vmovaps xmm4, xmmword ptr [zero]
vmovaps xmm5, xmmword ptr [zero]

; Accumulate sum from previous loop iteration.
addps xmm0, xmm3

; Store 8 floats in ymm1, and the following 8 floats in ymm2.
vmovaps ymm1, ymmword ptr[rcx + rdx*4]
vmovaps ymm2, ymmword ptr[rcx + rdx*4 + 32]

; Square each element if flag == 1. Skip the squaring if flag == 0.
test r8d, r8d ; flag is a 32-bit int, so test only the low half of R8.
jz skipSquares
vmulps ymm1, ymm1, ymm1
vmulps ymm2, ymm2, ymm2

skipSquares:
; Use horizontal addition three times. After the third addition, ymm1 has part of the sum
; in the low 128 bits, and the other part in the high 128 bits.
vhaddps ymm1, ymm1, ymm2
vhaddps ymm1, ymm1, ymm1
vhaddps ymm1, ymm1, ymm1

; Extract the low 128 bits from ymm1, and store in xmm4. Extract the high 128 bits and store in xmm5.
vextractf128 xmm4, ymm1, 0
vextractf128 xmm5, ymm1, 1

; Add xmm4 and xmm5. This is the sum of all 16 floats.
addps xmm4, xmm5
; And store the sum in xmm0, the return value register.
addps xmm0, xmm4

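vmovaps xmm3, xmm0 ; Copy the running total into xmm3 for the next iteration (adding here would count it twice).
sub rdx, 16
jge mainLoop
vzeroupper ; Clear the upper YMM halves before returning; xmm0 is unaffected.
RET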
sum_of_x_squared ENDP
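
To see why three horizontal additions followed by a low/high lane add give the sum of 16 floats, the reduction can be modeled in scalar C. This is only a sketch: hadd256_model and sum16_hadd_model are hypothetical names, and the model assumes the documented VHADDPS behavior of adding adjacent pairs within each 128-bit lane.

```c
#include <assert.h>
#include <stddef.h>

/* Scalar model of VHADDPS ymm_dst, a, b: adjacent pairs are added within
   each 128-bit lane (elements 0-3 and 4-7), never across lanes. */
static void hadd256_model(float dst[8], const float a[8], const float b[8])
{
    float r[8];
    for (int lane = 0; lane < 8; lane += 4) {
        r[lane + 0] = a[lane + 0] + a[lane + 1];
        r[lane + 1] = a[lane + 2] + a[lane + 3];
        r[lane + 2] = b[lane + 0] + b[lane + 1];
        r[lane + 3] = b[lane + 2] + b[lane + 3];
    }
    for (int i = 0; i < 8; ++i)
        dst[i] = r[i];              /* copy last, so dst may alias a or b */
}

/* Models the loop body for one block of 16 floats (hypothetical name). */
float sum16_hadd_model(const float *p)
{
    assert(p != NULL);
    float y1[8];
    hadd256_model(y1, p, p + 8);    /* vhaddps ymm1, ymm1, ymm2 */
    hadd256_model(y1, y1, y1);      /* vhaddps ymm1, ymm1, ymm1 */
    hadd256_model(y1, y1, y1);      /* vhaddps ymm1, ymm1, ymm1 */
    /* Each lane now holds its own 8-float partial sum in every element;
       adding the low and high lanes gives the total for the block. */
    return y1[0] + y1[4];
}
```

For the inputs 1 through 16, the low lane ends up holding 52 in every element and the high lane 84, which add to 136.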

Gunther

Hi Mark,

thank you for providing the code and welcome to the forum.  :t

Your idea is good, but you should provide a small test bed (with timings) for testing purposes. Furthermore, the code is better placed inside the 64-bit subforum, because it'll only run under 64-bit Windows.
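
A test bed of the kind suggested here could look roughly like the following C sketch. Everything in it is an assumption rather than code from this thread: run_test_bed and sum_stand_in are hypothetical names, the stand-in only replaces the assembly routine so the harness builds on its own, and clock() is used as a coarse portable timer in place of rdtscp.

```c
#include <assert.h>
#include <stdio.h>
#include <time.h>

/* Same shape as the C prototype quoted in the routine's header comment. */
typedef float (*sum_fn)(float *inputArray, int array_len_minus_16, int flag);

/* Hypothetical stand-in for the routine under test. */
float sum_stand_in(float *a, int start, int flag)
{
    float total = 0.0f;
    for (int i = start + 15; i >= 0; --i)
        total += flag ? a[i] * a[i] : a[i];
    return total;
}

/* Calls fn `repeats` times, prints the average time per call, and returns
   1 if the last result equals `expected`, 0 otherwise. */
int run_test_bed(sum_fn fn, float *data, int len, int flag,
                 float expected, int repeats)
{
    assert(fn != NULL && data != NULL);
    assert(len >= 16 && len % 16 == 0 && repeats > 0);

    volatile float result = 0.0f;   /* volatile keeps the calls from being elided */
    clock_t t0 = clock();
    for (int r = 0; r < repeats; ++r)
        result = fn(data, len - 16, flag);
    clock_t t1 = clock();

    double per_call = (double)(t1 - t0) / CLOCKS_PER_SEC / repeats;
    printf("len=%d flag=%d result=%g avg=%.3e s/call\n",
           len, flag, (double)result, per_call);
    return result == expected;
}
```

To test the assembly routine, pass sum_of_x_squared in place of the stand-in and make sure the array is 32-byte aligned (for example with __declspec(align(32)) under MSVC). The exact comparison is only meaningful for data whose sums are exactly representable, such as small integers; for general data a tolerance would be needed.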

Gunther 
You have to know the facts before you can distort them.

Mark44

Sorry to misplace my post. Although I looked at a number of forum sections, I didn't notice the 64-bit section.

I plan to post a few more examples of stuff I've been working on. Is there anything else needed besides source code? The executable?

GoneFishing

Hello and welcome, Mark.
The 64-bit board is here:
The MASM Forum » Projects » 64 Bit Assembler

Generally, the complete source code is quite enough.
Though I personally don't have an Ivy Bridge CPU, I look forward to seeing more of your asm examples.

Gunther

Mark,

Quote from: Mark44 on February 21, 2014, 01:28:33 PM
I plan to post a few more examples of stuff I've been working on. Is there anything else needed besides source code? The executable?

that's a good plan. Go forward. :t

The complete source code is usually enough. But there are some special cases, for example code that needs a particular compiler, for which the EXE is necessary. If you're in doubt, add the binaries to the ZIP archive and attach it to your post.

Gunther