Topic: Code location sensitivity of timings


Re: Code location sensitivity of timings
« Reply #75 on: August 19, 2014, 05:08:29 AM »

your timings:
Code: [Select]
Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)

21892   cycles for 100 * memchr scasb
3007    cycles for 100 * memchr SSE2 lps/hps
2690    cycles for 100 * memchr SSE2 nidud
2500    cycles for 100 * memchr SSE2 ups

21951   cycles for 100 * memchr scasb
2981    cycles for 100 * memchr SSE2 lps/hps
2721    cycles for 100 * memchr SSE2 nidud
6211    cycles for 100 * memchr SSE2 ups

21827   cycles for 100 * memchr scasb
3003    cycles for 100 * memchr SSE2 lps/hps
2510    cycles for 100 * memchr SSE2 nidud
2721    cycles for 100 * memchr SSE2 ups

36      bytes for memchr scasb
88      bytes for memchr SSE2 lps/hps
92      bytes for memchr SSE2 nidud
84      bytes for memchr SSE2 ups

--- ok ---



Re: Code location sensitivity of timings
« Reply #76 on: March 22, 2015, 02:17:44 AM »
Some notes about the SIMD functions in this thread.

As rrr pointed out in this post, reading ahead more than one byte may run into a protected memory page and blow up.

The situation will be like this:
Code: [Select]
[PAGE_NOACCESS][H -- current page -- T][PAGE_NOACCESS]
Reading past the Tail byte (T) or before the Head byte (H) will fault in this case. If the pointer is the address of T and the function starts by reading 4 bytes, the read overlaps the protected area by 3 bytes.

Pages are aligned to 4096 (1000h) bytes, so the offset of T is xFFF and the offset of H is x000. The fix is to align the pointer down to the chunk size: with 4-byte chunks the first read starts at xFFC, and with 16-byte chunks at xFF0.
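
The address arithmetic above can be checked with a small C sketch. It is only an illustration, assuming 4096-byte pages and power-of-two chunk sizes; the names `PAGE_BYTES`, `first_read_addr` and `stays_in_page` are made up for this example and do not come from the code in this thread.

```c
#include <stdint.h>

#define PAGE_BYTES 4096u  /* assumption: 4 KB pages (1000h) */

/* Round an address down to a multiple of chunk (chunk must be a power of two). */
static uint32_t first_read_addr(uint32_t addr, uint32_t chunk)
{
    return addr & ~(chunk - 1u);
}

/* Nonzero if reading `chunk` bytes at addr stays inside the page holding addr. */
static int stays_in_page(uint32_t addr, uint32_t chunk)
{
    return (addr & (PAGE_BYTES - 1u)) + chunk <= PAGE_BYTES;
}
```

With the pointer at T (offset xFFF), an unaligned 4-byte read runs 3 bytes past the page, while the aligned reads at xFFC or xFF0 end exactly on the page boundary.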

In the latter case up to 15 bytes need to be removed from the first test, so this adds overhead to all of these functions: the pointer has to be aligned to the chunk size before the fun begins.
Code: [Select]
    mov     ecx,eax
    and     ecx,16-1            ; unaligned bytes - 0..15
    and     eax,-16             ; align 16
    or      edx,-1              ; create mask
    shl     edx,cl
    pxor    xmm0,xmm0           ; populate zero in xmm0
    pcmpeqb xmm0,[eax]          ; check for zero
    pmovmskb ecx,xmm0
    and     ecx,edx             ; remove bytes in front
    jnz     done
    pxor    xmm0,xmm0
    add     eax,16
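
The masking step in the fragment above can also be written out in C to make the bit logic easier to follow. This is a rough sketch and not part of the original code; the function name `drop_leading_matches` is hypothetical, and `bits` stands for the value that PMOVMSKB delivers for the aligned 16-byte block.

```c
#include <stdint.h>

/*
 * bits: one bit per byte of the aligned 16-byte block (bit 0 = lowest
 *       address), as produced by PMOVMSKB.
 * addr: the original, possibly unaligned, pointer value.
 * Returns bits with the matches that lie in front of addr cleared.
 */
static uint32_t drop_leading_matches(uint32_t bits, uint32_t addr)
{
    uint32_t skip = addr & 15u;           /* and ecx,16-1           */
    uint32_t mask = 0xFFFFFFFFu << skip;  /* or edx,-1 / shl edx,cl */
    return bits & mask;                   /* and ecx,edx            */
}
```

A zero byte that happens to sit in the 0..15 bytes in front of the string is therefore ignored, and only matches at or after the real pointer survive.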

Here is an updated version of strlen using 32 byte

This is a "page safe" version of strchr:
Code: [Select]
.model flat, stdcall

; The arguments are addressed relative to ESP, so this listing assumes that
; no frame prologue is generated (e.g. OPTION PROLOGUE:NONE, EPILOGUE:NONE).

strchr  proc string, char
    push    edi
    push    edx
    mov     ecx,8[esp+4]        ; string
    mov     eax,8[esp+8]        ; char
    and     eax,000000FFh
    imul    eax,eax,01010101h
    movd    xmm1,eax
    pshufd  xmm1,xmm1,0         ; populate char in xmm1
    xorps   xmm0,xmm0           ; populate zero in xmm0
    mov     edi,ecx
    and     ecx,16-1            ; unaligned bytes - 0..15
    and     edi,-16             ; align EDI 16
    movdqa  xmm2,[edi]          ; load 16 bytes
    movdqa  xmm3,xmm2           ; copy to xmm3
    or      edx,-1              ; create mask
    shl     edx,cl
    pcmpeqb xmm3,xmm1           ; check for char
    pmovmskb ecx,xmm3
    pcmpeqb xmm2,xmm0           ; check for zero
    pmovmskb eax,xmm2
    or      ecx,eax             ; combine result
    and     ecx,edx             ; remove bytes in front
    jnz     done
@@:
    movdqa  xmm2,[edi+16]       ; continue testing 16-byte blocks
    movdqa  xmm3,xmm2
    lea     edi,[edi+16]
    pcmpeqb xmm3,xmm1
    pmovmskb ecx,xmm3
    pcmpeqb xmm2,xmm0
    pmovmskb eax,xmm2
    or      ecx,eax
    jz      @B
done:
    bsf     ecx,ecx             ; get index of bit
    lea     eax,[edi+ecx]       ; load address to result
    cmp     byte ptr [eax],1    ; points to zero or char
    sbb     ecx,ecx             ; if zero CARRY flag set
    not     ecx                 ; 0 or -1
    and     eax,ecx             ; 0 or pointer
    pop     edx
    pop     edi
    ret     8
strchr  endp
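
Two of the tricks in this listing may be easier to see in C: replicating the character into every byte of a dword, and turning "points to zero or char" into a result without a branch. The sketch below is only an illustration; `broadcast_byte` and `hit_or_null` are hypothetical names, and the pointer/integer casts assume a flat address space where a null pointer is all zero bits.

```c
#include <stdint.h>

/* Replicate the low byte of c into all four bytes
   (and eax,000000FFh / imul eax,eax,01010101h). */
static uint32_t broadcast_byte(uint32_t c)
{
    return (c & 0xFFu) * 0x01010101u;
}

/* Return p if it points at a non-zero byte, otherwise a null pointer,
   using a mask instead of a branch (cmp / sbb / not / and in the listing). */
static const char *hit_or_null(const char *p)
{
    uintptr_t keep = (uintptr_t)0 - (uintptr_t)(*p != 0); /* all ones or zero */
    return (const char *)((uintptr_t)p & keep);
}
```

One consequence of the final test is that the routine returns 0 whenever the byte it stopped on is zero, including when the character searched for is itself 0, whereas the C library strchr returns a pointer to the terminator in that case.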


This is a new "safe" version of memcpy that copies 32 bytes per iteration. The problem with the earlier version was that it read 16 bytes before the size test.
Code: [Select]
memcpy  proc dst, src, count
    push    esi
    push    edi
    mov     eax,[esp+12]        ; dst -- return value
    mov     esi,[esp+16]        ; src
    mov     ecx,[esp+20]        ; count
    cmp     ecx,64              ; skip short strings
    jb      copy_0_31
    movdqu  xmm3,[esi]          ; save aligned and tail bytes
    movdqu  xmm4,[esi+16]
    movdqu  xmm5,[esi+ecx-16]
    movdqu  xmm6,[esi+ecx-32]
    mov     edi,eax             ; align pointer
    neg     edi
    and     edi,32-1
    add     esi,edi
    sub     ecx,edi
    add     edi,eax
    and     ecx,-32
    cmp     esi,edi             ; get direction of copy
    ja      move_R
loop_L:
    sub     ecx,32
    movdqu  xmm1,[esi+ecx]      ; copy 32 bytes
    movdqu  xmm2,[esi+ecx+16]
    movdqa  [edi+ecx],xmm1
    movdqa  [edi+ecx+16],xmm2
    jnz     loop_L
    jmp     done
    db      13 dup(90h)         ; align loop_R 16
move_R:
    add     edi,ecx
    add     esi,ecx
    neg     ecx
loop_R:
    movdqu  xmm1,[esi+ecx]
    movdqu  xmm2,[esi+ecx+16]
    movdqa  [edi+ecx],xmm1
    movdqa  [edi+ecx+16],xmm2
    add     ecx,dword ptr 32
    jnz     loop_R
done:
    mov     ecx,[esp+20]        ; fixup after copy
    movdqu  [eax],xmm3          ; 0..31 unaligned from start
    movdqu  [eax+16],xmm4       ;
    movdqu  [eax+ecx-16],xmm5   ; 0..31 unaligned tail bytes
    movdqu  [eax+ecx-32],xmm6   ;
toend:
    pop     edi
    pop     esi
    ret     12
    ; Copy 0..63 bytes
copy_0_31:
    test    ecx,ecx
    jz      toend
    test    ecx,-2
    jz      copy_1
    test    ecx,-4
    jz      copy_2
    test    ecx,-8
    jz      copy_4
    test    ecx,-16
    jz      copy_8
    test    ecx,-32
    movdqu  xmm1,[esi]
    movdqu  xmm3,[esi+ecx-16]
    jz      copy_16
    movdqu  xmm2,[esi+16]
    movdqu  xmm4,[esi+ecx-32]
    movdqu  [eax+ecx-32],xmm4
    movdqu  [eax+16],xmm2
copy_16:
    movdqu  [eax],xmm1
    movdqu  [eax+ecx-16],xmm3
    jmp     toend
copy_8:
    movq    xmm2,[esi]
    movq    xmm1,[esi+ecx-8]
    movq    [eax],xmm2
    movq    [eax+ecx-8],xmm1
    jmp     toend
copy_4:
    mov     edi,[esi]
    mov     esi,[esi+ecx-4]
    mov     [eax],edi
    mov     [eax+ecx-4],esi
    jmp     toend
copy_2:
    mov     di,[esi]
    mov     si,[esi+ecx-2]
    mov     [eax+ecx-2],si
    mov     [eax],di
    jmp     toend
copy_1:
    mov     cl,[esi]
    mov     [eax],cl
    jmp     toend
memcpy  endp
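
The short-copy branches never read or write outside the `count` bytes because each size class is handled as a head move plus a tail move that may overlap. A rough C sketch of the 8..15 byte case and of the alignment count used by the main loop follows; the names `copy_8_to_16` and `bytes_to_align32` are made up for this illustration, and the fixed-size `memcpy` calls stand in for the MOVQ loads and stores.

```c
#include <stddef.h>
#include <string.h>

/* Copy n bytes, 8 <= n <= 16, as two 8-byte moves that may overlap.
   Both loads are done before either store, so overlapping buffers work. */
static void copy_8_to_16(unsigned char *dst, const unsigned char *src, size_t n)
{
    unsigned char head[8], tail[8];
    memcpy(head, src, 8);           /* movq xmm2,[esi]       */
    memcpy(tail, src + n - 8, 8);   /* movq xmm1,[esi+ecx-8] */
    memcpy(dst, head, 8);           /* movq [eax],xmm2       */
    memcpy(dst + n - 8, tail, 8);   /* movq [eax+ecx-8],xmm1 */
}

/* Bytes to skip so that dst + skip is 32-byte aligned
   (mov edi,eax / neg edi / and edi,32-1). */
static size_t bytes_to_align32(size_t dst)
{
    return (0 - dst) & 31u;
}
```

The main loop relies on the same idea: the first and last 32 bytes are saved in xmm3..xmm6 before the aligned loop runs and are written back afterwards, which covers the unaligned head and tail.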

With regard to functions that take two pointers, this may be difficult to do safely. Consider strcmp(T, H): if both pointers are moved back by 15 bytes to align T, the Head pointer will reach 15 bytes into the preceding page. The samples above handle two pointers by reading ahead in chunks of 16 bytes and comparing, which is not very safe.
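
The two-pointer problem can be put into numbers with a short C sketch. It assumes 4096-byte pages and 16-byte chunks; `PAGE_BYTES`, `page_of` and `can_align_together` are hypothetical helpers written for this illustration only.

```c
#include <stdint.h>

#define PAGE_BYTES 4096u  /* assumption: 4 KB pages (1000h) */

/* Page number of an address. */
static uint32_t page_of(uint32_t addr)
{
    return addr / PAGE_BYTES;
}

/* Nonzero if one common backward adjustment can 16-align both pointers,
   i.e. if both have the same offset within a 16-byte block. */
static int can_align_together(uint32_t a, uint32_t b)
{
    return ((a ^ b) & 15u) == 0;
}
```

For T at offset xFFF and H at offset x000, `can_align_together` is zero, and stepping H back by the 15 bytes that would align T gives an address whose `page_of` value differs from that of H, so the read would touch the preceding page.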