Hi all! I'm a brand new member here, but I've had an interest in assembly programming for a long time. I just retired in December from a large software firm in Redmond, WA, where I worked as a programming writer for 15 years. Now that I'm retired, I'm able to spend time on whatever interests me, which at the moment is finding out the capabilities of the new computer I just bought. The CPU is an Intel Core i7 (Ivy Bridge), so it supports SSE up to 4.2 and AVX, but not AVX2.

For my own amusement I've been poking around with these technologies, first using inline assembly in C code, and more recently using the MASM that comes with Visual Studio 10, including ml64.

Here is some 64-bit assembly code that uses AVX and other instructions to calculate either the sum of the array passed to it or the sum of the squares of its elements. The routine loads 16 floats at a time, starting 16 floats from the end of the array, and works its way toward the front. The code assumes that the number of elements is a nonzero multiple of 16 and that the array is aligned on a 32-byte boundary.
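For checking the assembly's output, a plain scalar version in C can serve as a reference. The sketch below is only an illustration: the name `sum_of_x_squared_ref` is hypothetical, and it assumes the same conventions as the assembly routine (the second argument is the length minus 16, and the length is a nonzero multiple of 16).

```c
#include <assert.h>

/* Scalar reference (hypothetical name, for comparison only).
 * Mirrors the assembly routine's conventions: the caller passes
 * (length - 16), and the array is walked from the last block of
 * 16 floats back to the first. */
float sum_of_x_squared_ref(const float *inputArray, int array_len_minus_16, int flag)
{
    /* Assumed precondition: length is a nonzero multiple of 16. */
    assert(array_len_minus_16 >= 0 && array_len_minus_16 % 16 == 0);

    float total = 0.0f;
    for (int i = array_len_minus_16; i >= 0; i -= 16) {
        for (int j = 0; j < 16; ++j) {
            float x = inputArray[i + j];
            total += flag ? x * x : x;   /* flag != 0: sum of squares */
        }
    }
    return total;
}
```

Because the scalar loop adds the elements in a different order than the SIMD code does, the two results can differ in the last few bits for general data, so a comparison with a small tolerance is probably the safer check.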

The code shown here runs and produces correct results. I've written a simple, crude timing routine (using rdtscp) but haven't used it on this code yet. If you have any comments about the code, let me know.

Mark

.data

zero dd 4 dup(0)

.code

sum_of_x_squared PROC C

; C prototype: float sum_of_x_squared(float * inputArray, int array_len_minus_16, int flag)

; The sum_of_x_squared proc returns either the sum of the elements of a passed array or

; the sum of the squares of the elements of the passed array. What is returned is determined by

; the value of the third parameter: 0 for sum of elements, 1 for sum of squares of elements.

; The procedure cycles through an array of floats, and works its way from the end

; back to the beginning.

; In each iteration, ymm1 is loaded with 8 floats, and ymm2 with the next 8 floats in memory.

; Input args: inputArray in RCX, (arr_len - 16) in EDX, flag in R8D (both are ints, so only the low 32 bits of RDX and R8 are defined).

; If flag == 0, the sum of the array is returned.

; If flag == 1, the sum of the squares is returned.

; Return value: the sum (or the sum of squares) is returned in the low float of XMM0.

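movsxd rdx, edx ; Sign-extend the 32-bit int index; the upper half of RDX is undefined on entry.

vxorps xmm3, xmm3, xmm3 ; Zero the running sum. vzeroall is avoided because it also clears xmm6-xmm15, which the Windows x64 calling convention requires the callee to preserve.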

mainLoop:

; Clear xmm0, xmm4, and xmm5.

vmovaps xmm0, xmmword ptr [zero]

vmovaps xmm4, xmmword ptr [zero]

vmovaps xmm5, xmmword ptr [zero]

; Accumulate sum from previous loop iteration.

vaddps xmm0, xmm0, xmm3 ; VEX form, to avoid mixing legacy SSE with AVX code.

; Load 8 floats into ymm1, and the following 8 floats into ymm2.

vmovaps ymm1, ymmword ptr[rcx + rdx*4]

vmovaps ymm2, ymmword ptr[rcx + rdx*4 + 32]

; Square each element if flag == 1. Skip the squaring if flag == 0.

test r8d, r8d ; flag is a 32-bit int, so test only the low 32 bits of R8.

jz skipSquares

vmulps ymm1, ymm1, ymm1

vmulps ymm2, ymm2, ymm2

skipSquares:

; Use horizontal addition three times. vhaddps works within each 128-bit lane, so after the third

; addition every element of the low lane holds one partial sum and every element of the high lane holds the other.

vhaddps ymm1, ymm1, ymm2

vhaddps ymm1, ymm1, ymm1

vhaddps ymm1, ymm1, ymm1

; Extract the low 128 bits from ymm1, and store in xmm4. Extract the high 128 bits and store in xmm5.

vextractf128 xmm4, ymm1, 0

vextractf128 xmm5, ymm1, 1

; Add xmm4 and xmm5. This is the sum of all 16 floats.

vaddps xmm4, xmm4, xmm5

; And store the sum in xmm0, the return value register.

vaddps xmm0, xmm0, xmm4

;vmovss temp, xmm0

vmovaps xmm3, xmm0 ; Save the running sum for the next iteration. Copy rather than add: xmm0 already includes the previous xmm3.

sub rdx, 16

jge mainLoop

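vzeroupper ; Clear the upper halves of the YMM registers before returning to code that may use legacy SSE.

RET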

sum_of_x_squared ENDP
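To make the three vhaddps instructions in the loop body easier to follow, here is a small model of the reduction in plain C. It is a sketch rather than a drop-in replacement: `hadd256` and `reduce16_model` are hypothetical names, and the arrays simply stand in for the YMM registers. The key point it illustrates is that vhaddps adds adjacent pairs separately within each 128-bit lane, so the low and high lanes still have to be added together at the end.

```c
#include <assert.h>

/* Model of VHADDPS with 256-bit operands. Each 128-bit lane (4 floats)
 * is handled independently: the result lane holds the two pairwise sums
 * from a, followed by the two pairwise sums from b. */
static void hadd256(float dst[8], const float a[8], const float b[8])
{
    float r[8];   /* temporary, so dst may alias a or b */
    for (int lane = 0; lane < 8; lane += 4) {
        r[lane + 0] = a[lane + 0] + a[lane + 1];
        r[lane + 1] = a[lane + 2] + a[lane + 3];
        r[lane + 2] = b[lane + 0] + b[lane + 1];
        r[lane + 3] = b[lane + 2] + b[lane + 3];
    }
    for (int i = 0; i < 8; ++i)
        dst[i] = r[i];
}

/* Reduce one block of 16 floats the way the loop body does:
 * three horizontal adds, then low lane + high lane. */
float reduce16_model(const float block[16])
{
    float y1[8];
    hadd256(y1, block, block + 8);   /* vhaddps ymm1, ymm1, ymm2 */
    hadd256(y1, y1, y1);             /* vhaddps ymm1, ymm1, ymm1 */
    hadd256(y1, y1, y1);             /* vhaddps ymm1, ymm1, ymm1 */

    /* For non-NaN data, every element of a lane now holds the same
     * partial sum. */
    assert(y1[0] == y1[3] && y1[4] == y1[7]);

    return y1[0] + y1[4];            /* vextractf128 x2, then add */
}
```

For the block 1, 2, ..., 16, the low lane ends up holding 52 and the high lane 84, which add up to 136.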