Can someone please test the app on other machines?

So far the results are accurate (although a bit slow, after something like 3000 * 3000 * 500 * 4 internal loop iterations :)

I disabled the edit control to increase/decrease the iterations because 3000, as it is marked, is only the total number of "good results" the app collects. The total number of iterations is something around the number above (including the calibration).
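To illustrate the difference between "good results" and total iterations, here is a rough C sketch. The acceptance rule is only an assumption for illustration (a sample counts as "good" when it is within a tolerance of the fastest sample seen); it is not necessarily the rule the app uses, and `pass_stats` and `collect_good` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bookkeeping for one collection run (hypothetical structure). */
typedef struct {
    size_t   good;   /* samples accepted as "good results"        */
    size_t   total;  /* samples examined before we had enough     */
    uint64_t sum;    /* sum of the accepted samples (nanoseconds) */
} pass_stats;

/* Walk through the samples until `wanted` good ones are collected.
   ASSUMPTION: "good" means no more than tol_percent above the fastest
   sample in the batch. The real app may use a different rule. */
static pass_stats collect_good(const uint64_t *samples, size_t n,
                               size_t wanted, unsigned tol_percent)
{
    assert(n > 0);

    uint64_t fastest = samples[0];
    for (size_t i = 1; i < n; i++)
        if (samples[i] < fastest)
            fastest = samples[i];

    uint64_t limit = fastest + fastest * tol_percent / 100;

    pass_stats st = { 0, 0, 0 };
    for (size_t i = 0; i < n && st.good < wanted; i++) {
        st.total++;                 /* every examined sample is an iteration */
        if (samples[i] <= limit) {  /* only some of them are "good"          */
            st.good++;
            st.sum += samples[i];
        }
    }
    return st;
}
```

With a rule like this, `total` is always at least as large as `good`, which is why the number shown in the edit control is smaller than the real iteration count.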

This example counts the total amount of time (in nanoseconds) that the function memcpy_SSE takes, and also tries to compute how much overhead is found while trying to stabilize the results.
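The general idea of timing a copy and subtracting the measurement overhead can be sketched in C as below. This is not the app's code: it assumes the C11 `timespec_get` wall clock as the timer (which is not guaranteed monotonic, so a backwards step is clamped to 0), it times the standard `memcpy` instead of memcpy_SSE, and the helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

/* Current time in nanoseconds, using the C11 wall clock (an assumption;
   a monotonic high-resolution timer would be a better choice). */
static uint64_t now_ns(void)
{
    struct timespec ts;
    int base = timespec_get(&ts, TIME_UTC);
    assert(base == TIME_UTC);
    (void)base;
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Difference between two readings, clamped at 0 if the clock stepped back. */
static uint64_t elapsed_ns(uint64_t t0, uint64_t t1)
{
    return t1 >= t0 ? t1 - t0 : 0;
}

/* Calibration: the smallest cost of two back-to-back timer reads. */
static uint64_t timer_overhead_ns(int tries)
{
    assert(tries > 0);
    uint64_t best = UINT64_MAX;
    for (int i = 0; i < tries; i++) {
        uint64_t t0 = now_ns();
        uint64_t t1 = now_ns();
        uint64_t d = elapsed_ns(t0, t1);
        if (d < best)
            best = d;
    }
    return best;
}

/* Time one copy and remove the calibrated overhead (never below 0). */
static uint64_t time_copy_ns(void *dst, const void *src, size_t len,
                             uint64_t overhead)
{
    uint64_t t0 = now_ns();
    memcpy(dst, src, len);
    uint64_t t1 = now_ns();
    uint64_t d = elapsed_ns(t0, t1);
    return d > overhead ? d - overhead : 0;
}
```

A single call is far too noisy to trust, which is why the measurement has to be repeated many times and filtered.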

I'm analysing it for stability of the results. So, the algorithm with the least variation is the one that represents the most accurate value.
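One possible way to put a number on "least variation" is the mean absolute deviation divided by the mean. The C sketch below is only an illustration of that idea; the metric and the function names are my assumption, not what the app necessarily computes.

```c
#include <assert.h>
#include <stddef.h>

/* Arithmetic mean of n timing samples (nanoseconds). */
static double mean_ns(const double *s, size_t n)
{
    assert(n > 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += s[i];
    return sum / (double)n;
}

/* Mean absolute deviation relative to the mean.
   0.0 means every sample is identical; larger means less stable. */
static double relative_spread(const double *s, size_t n)
{
    double m = mean_ns(s, n);
    assert(m > 0.0);
    double dev = 0.0;
    for (size_t i = 0; i < n; i++)
        dev += (s[i] > m) ? (s[i] - m) : (m - s[i]);
    return (dev / (double)n) / m;
}
```

Given the samples of two algorithms, the one with the smaller `relative_spread` would be the more stable one under this metric.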

I'll try optimizing it a bit and review how many "iterations for good passes" are necessary to collect stable results. In previous tests, something around 300 was more than enough for stabilization, but I'll take a look at it later.

Btw, I have not implemented any warning message saying that the app is busy on some specific part (nor enabled a progress bar yet). So if you can tell me how long it takes to run on your machine, I'd really appreciate it.

This version tests the total amount of time (in nanoseconds) that the code below takes to run:

`Proc memcpy_SSE:`

Arguments @pDest, @pSource, @Length

Uses esi, edi, ecx, edx, eax

mov edi D@pDest

mov esi D@pSource

; we copy the memory in blocks of 128 bits (16 bytes) at a time

mov ecx D@Length

mov eax ecx | shr ecx 4 ; integer count. Divide by 16 (4 dwords)

jz L0> ; The memory size is smaller than 16 bytes. Jmp over

; Now we must compute the remainder, to see how many bytes are left over after the 16-byte blocks

mov edx ecx | shl edx 4 | sub eax edx ; remainder. It can only be 0 to 15 bytes

mov edx 0 ; here it is used as an index

L1:

; movlps XMM1 X$esi+edx*8 ; copy the low 2 dwords (8 bytes) from esi into the low half of XMM1

;movhps XMM1 X$esi+edx*8+8 ; copy the next 2 dwords (8 bytes) from esi into the high half of XMM1

;movlps X$edi+edx*8 XMM1 ; copy the low 2 dwords from XMM1 to edi

;movhps X$edi+edx*8+8 XMM1 ; copy the high 2 dwords from XMM1 to edi

movsd | movsd | movsd | movsd

; movupd XMM1 X$esi+edx*8 ; copy the 1st 4 dwords from esi to register XMM

; movupd X$edi+edx*8 XMM1 ; copy the 1st 4 dwords from register XMM to edi

dec ecx

;lea edx D$edx+2

jnz L1<

emms ; resets the FPU tag word after MMX code. Not needed for the XMM registers or for movsd

test eax eax | jz L4> ; No remainders ? Exit

jmp L3> ; jmp to the remainder computation

L0:

; If we are here, it means that the data is smaller than 16 bytes, and we need to compute the remainder.

mov edx ecx | shl edx 4 | sub eax edx ; remainder. It can only be 0 to 15 bytes

L2:

; If the length is not a multiple of 4 dwords (16 bytes), some remainder bytes are left here. So, just copy them.

test eax eax | jz L4> ; No remainders ? Exit

L9:

lea edi D$edi+edx*8 ; mul edx by 8 to get the pos

mov eax eax ; avoid potential stalls

lea esi D$esi+edx*8 ; mul edx by 8 to get the pos

L3: movsb | dec eax | jnz L3<

L4:

EndP
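For anyone who wants to check the block/remainder arithmetic without RosAsm, here is a rough C model of the same structure: 16-byte blocks first, then the 0 to 15 leftover bytes one at a time. It is only a reference sketch under my own naming (`memcpy_blocks16` is hypothetical); it says nothing about the speed of the assembly version.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy len bytes from src to dest: whole 16-byte blocks first,
   then the leftover bytes, mirroring the layout of the proc above. */
static void *memcpy_blocks16(void *dest, const void *src, size_t len)
{
    unsigned char *d = dest;
    const unsigned char *s = src;

    size_t blocks = len >> 4;                /* shr ecx 4                 */
    size_t remainder = len - (blocks << 4);  /* shl edx 4 | sub eax edx   */
    assert(remainder < 16);

    while (blocks--) {                       /* L1: one 16-byte block     */
        memcpy(d, s, 16);                    /* movsd x 4 in the original */
        d += 16;
        s += 16;
    }
    while (remainder--)                      /* L3: movsb loop            */
        *d++ = *s++;

    return dest;
}
```

Comparing its output with the standard `memcpy` for lengths such as 0, 5, 16 and 37 covers the "smaller than 16", "exact multiple" and "blocks plus remainder" paths.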