Sorting strings

Started by RuiLoureiro, May 29, 2014, 06:15:48 AM

RuiLoureiro

#120
Hi,
        Now I have used the crt_memcpy, memcpy_1, memcpy_2 and memcpy_3
        procedures.
        Copying 16 bytes at a time does not give the best results on my P4.

        The versions YZZE, YZZG and YZZH copy 16 bytes at a time using
        the 32-bit registers.
        The strings are not aligned.
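
A rough model of the loop shape described here, written in Python only to show the control flow: a main loop that moves 16 bytes per pass as four 32-bit words, followed by a byte-wise tail. The function name and the tail handling are assumptions for illustration; the actual YZZE/YZZG/YZZH procedures are assembly and are not shown in this thread, so this says nothing about their timing.

```python
import struct

def copy_16_bytes_at_a_time(src: bytes, length: int) -> bytearray:
    """Copy `length` bytes from src, 16 bytes per loop pass (hypothetical helper)."""
    dst = bytearray(length)
    offset = 0
    # Main loop: four 32-bit little-endian words per pass, standing in
    # for four register-sized loads followed by four stores.
    while length - offset >= 16:
        words = struct.unpack_from("<4I", src, offset)
        struct.pack_into("<4I", dst, offset, *words)
        offset += 16
    # Tail: the remaining 0..15 bytes, one byte at a time.
    while offset < length:
        dst[offset] = src[offset]
        offset += 1
    return dst

data = bytes(range(53))          # 53 bytes: three full passes plus a 5-byte tail
copied = copy_16_bytes_at_a_time(data, len(data))
```

The test sizes in the tables (15, 53, 103, ...) are not multiples of 16, so every run exercises both the main loop and the tail.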
       
        Gunther, could you post your results, if you don't mind?
        Thank you.
       
        These are my results.
Quote
NOT ALIGNED
-----------------------------------------------------
Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
-----------------------------------------------------
***** Time table *****
  32 cycles, COPYAtoB_SSEDS-  15 bytes- copy 16 BYTES+MOVZX- uses ESP
  32 cycles, COPYAtoB_SSEES-  15 bytes- copy 16 BYTES+MOVZX- uses ESP
  32 cycles, COPYAtoB_XZZFS-  15 bytes- copy lenght DWORDS+MOVZX- uses ESP
  33 cycles, COPYAtoB_XZES-   15 bytes- copy lenght DWORDS+MOVZX- uses ESP
  34 cycles, COPYAtoB_XZZES-  15 bytes- copy lenght DWORDS+MOVZX- uses ESP
  35 cycles, COPYAtoB_XZZCS-  15 bytes- copy lenght DWORDS+MOVZX- uses ESP
  35 cycles, COPYAtoB_SSEFS-  15 bytes- copy 16 BYTES+MOVZX- uses ESP
  35 cycles, crt_memcpy-      15 bytes- copy crt_memcpy
  35 cycles, COPYAtoB_SSED-   15 bytes- copy 16 BYTES+MOVZX
  35 cycles, COPYAtoB_XZZF-   15 bytes- copy lenght DWORDS+MOVZX
  36 cycles, COPYAtoB_XZZC-   15 bytes- copy lenght DWORDS+MOVZX

  37 cycles, COPYAtoB_YZZH-   15 bytes- copy length 16 BYTES+MOVZX  <<<<---
 
  37 cycles, COPYAtoB_SSEF-   15 bytes- copy 16 BYTES+MOVZX
  38 cycles, COPYAtoB_SSEE-   15 bytes- copy 16 BYTES+MOVZX
  39 cycles, COPYAtoB_XZZE-   15 bytes- copy lenght DWORDS+MOVZX
  41 cycles, COPYAtoB_YZZG-   15 bytes- copy length 16 BYTES+MOVZX  <<<<---
 
  41 cycles, COPYAtoB_YZZE-   15 bytes- copy length 16 BYTES+MOVZX
  45 cycles, COPYAtoB_XZE-    15 bytes- copy lenght DWORDS+MOVZX
  50 cycles, COPYAtoB_WZZE-   15 bytes- copy 16 BYTES+MOVZX
  63 cycles, COPYAtoB_XZES-   53 bytes- copy lenght DWORDS+MOVZX- uses ESP
  65 cycles, COPYAtoB_XZZFS-  53 bytes- copy lenght DWORDS+MOVZX- uses ESP

  69 cycles, memcpy_1-        53 bytes- copy regcopy                >>>>---
  72 cycles, COPYAtoB_XZZES-  53 bytes- copy lenght DWORDS+MOVZX- uses ESP
  74 cycles, COPYAtoB_XZZCS-  53 bytes- copy lenght DWORDS+MOVZX- uses ESP

  74 cycles, COPYAtoB_YZZG-   53 bytes- copy length 16 BYTES+MOVZX  <<<<---

  75 cycles, COPYAtoB_XZZE-   53 bytes- copy lenght DWORDS+MOVZX
  75 cycles, COPYAtoB_XZE-    53 bytes- copy lenght DWORDS+MOVZX
  75 cycles, COPYAtoB_XZZF-   53 bytes- copy lenght DWORDS+MOVZX
  76 cycles, COPYAtoB_YZZE-   53 bytes- copy length 16 BYTES+MOVZX
  77 cycles, COPYAtoB_XZZC-   53 bytes- copy lenght DWORDS+MOVZX

  78 cycles, memcpy_1-        15 bytes- copy regcopy                >>>>---
 
  78 cycles, COPYAtoB_WZZE-   53 bytes- copy 16 BYTES+MOVZX       

  82 cycles, COPYAtoB_YZZH-   53 bytes- copy length 16 BYTES+MOVZX  <<<<---
 
  85 cycles, COPYAtoB_SSEDS-  53 bytes- copy 16 BYTES+MOVZX- uses ESP
  90 cycles, COPYAtoB_SSEFS-  53 bytes- copy 16 BYTES+MOVZX- uses ESP
  91 cycles, COPYAtoB_SSEES-  53 bytes- copy 16 BYTES+MOVZX- uses ESP
  93 cycles, COPYAtoB_SSEF-   53 bytes- copy 16 BYTES+MOVZX
  93 cycles, COPYAtoB_SSED-   53 bytes- copy 16 BYTES+MOVZX
  93 cycles, COPYAtoB_SSEE-   53 bytes- copy 16 BYTES+MOVZX
106 cycles, COPYAtoB_SSEHS-  53 bytes- copy 16 BYTES+MOVZX- uses ESP

106 cycles, memcpy_1-       103 bytes- copy regcopy                >>>>---
107 cycles, crt_memcpy-      53 bytes- copy crt_memcpy
109 cycles, COPYAtoB_YZZH-  103 bytes- copy length 16 BYTES+MOVZX  <<<<---
109 cycles, COPYAtoB_YZZG-  103 bytes- copy length 16 BYTES+MOVZX  <<<<---

110 cycles, COPYAtoB_SSEH-   53 bytes- copy 16 BYTES+MOVZX
110 cycles, COPYAtoB_SSEG-   53 bytes- copy 16 BYTES+MOVZX
110 cycles, COPYAtoB_SSEGS-  53 bytes- copy 16 BYTES+MOVZX- uses ESP
111 cycles, COPYAtoB_YZZE-  103 bytes- copy length 16 BYTES+MOVZX
113 cycles, COPYAtoB_WZZE-  103 bytes- copy 16 BYTES+MOVZX
117 cycles, COPYAtoB_XZZCS- 103 bytes- copy lenght DWORDS+MOVZX- uses ESP
117 cycles, COPYAtoB_XZZFS- 103 bytes- copy lenght DWORDS+MOVZX- uses ESP
117 cycles, COPYAtoB_XZZES- 103 bytes- copy lenght DWORDS+MOVZX- uses ESP
117 cycles, COPYAtoB_XZES-  103 bytes- copy lenght DWORDS+MOVZX- uses ESP
119 cycles, COPYAtoB_XZZE-  103 bytes- copy lenght DWORDS+MOVZX
120 cycles, COPYAtoB_XZZC-  103 bytes- copy lenght DWORDS+MOVZX
120 cycles, COPYAtoB_XZE-   103 bytes- copy lenght DWORDS+MOVZX
122 cycles, COPYAtoB_XZZF-  103 bytes- copy lenght DWORDS+MOVZX
141 cycles, crt_memcpy-     103 bytes- copy crt_memcpy

173 cycles, COPYAtoB_YZZH-  203 bytes- copy length 16 BYTES+MOVZX  <<<<---
173 cycles, COPYAtoB_YZZG-  203 bytes- copy length 16 BYTES+MOVZX  <<<<---

174 cycles, COPYAtoB_YZZE-  203 bytes- copy length 16 BYTES+MOVZX
175 cycles, COPYAtoB_SSEH-   15 bytes- copy 16 BYTES+MOVZX
176 cycles, COPYAtoB_SSEG-   15 bytes- copy 16 BYTES+MOVZX
177 cycles, COPYAtoB_SSEHS-  15 bytes- copy 16 BYTES+MOVZX- uses ESP
177 cycles, memcpy_1-       203 bytes- copy regcopy                >>>>---

177 cycles, COPYAtoB_SSEGS-  15 bytes- copy 16 BYTES+MOVZX- uses ESP
188 cycles, COPYAtoB_SSEDS- 103 bytes- copy 16 BYTES+MOVZX- uses ESP
205 cycles, COPYAtoB_XZES-  203 bytes- copy lenght DWORDS+MOVZX- uses ESP
206 cycles, COPYAtoB_XZZFS- 203 bytes- copy lenght DWORDS+MOVZX- uses ESP
207 cycles, COPYAtoB_XZZCS- 203 bytes- copy lenght DWORDS+MOVZX- uses ESP
208 cycles, COPYAtoB_SSEFS- 103 bytes- copy 16 BYTES+MOVZX- uses ESP
209 cycles, COPYAtoB_SSEHS- 103 bytes- copy 16 BYTES+MOVZX- uses ESP
210 cycles, COPYAtoB_SSED-  103 bytes- copy 16 BYTES+MOVZX
210 cycles, COPYAtoB_XZZC-  203 bytes- copy lenght DWORDS+MOVZX
211 cycles, COPYAtoB_SSEES- 103 bytes- copy 16 BYTES+MOVZX- uses ESP
211 cycles, COPYAtoB_XZZE-  203 bytes- copy lenght DWORDS+MOVZX
211 cycles, COPYAtoB_XZZF-  203 bytes- copy lenght DWORDS+MOVZX
212 cycles, COPYAtoB_XZE-   203 bytes- copy lenght DWORDS+MOVZX
213 cycles, COPYAtoB_SSEF-  103 bytes- copy 16 BYTES+MOVZX
213 cycles, COPYAtoB_SSEE-  103 bytes- copy 16 BYTES+MOVZX
217 cycles, COPYAtoB_SSEH-  103 bytes- copy 16 BYTES+MOVZX
217 cycles, COPYAtoB_WZZE-  203 bytes- copy 16 BYTES+MOVZX
223 cycles, crt_memcpy-     203 bytes- copy crt_memcpy
224 cycles, COPYAtoB_SSEGS- 103 bytes- copy 16 BYTES+MOVZX- uses ESP
229 cycles, COPYAtoB_SSEG-  103 bytes- copy 16 BYTES+MOVZX
248 cycles, memcpy_2-        15 bytes- copy memcpy SSE
252 cycles, memcpy_3-        15 bytes- copy memcpyxmmU SSE
259 cycles, COPYAtoB_XZZES- 203 bytes- copy lenght DWORDS+MOVZX- uses ESP
276 cycles, memcpy_3-        53 bytes- copy memcpyxmmU SSE
300 cycles, memcpy_2-        53 bytes- copy memcpy SSE
379 cycles, memcpy_3-       103 bytes- copy memcpyxmmU SSE

395 cycles, COPYAtoB_YZZH-  503 bytes- copy length 16 BYTES+MOVZX  <<<<---
397 cycles, COPYAtoB_YZZG-  503 bytes- copy length 16 BYTES+MOVZX  <<<<---

403 cycles, COPYAtoB_YZZE-  503 bytes- copy length 16 BYTES+MOVZX
406 cycles, COPYAtoB_SSEE-  203 bytes- copy 16 BYTES+MOVZX
406 cycles, COPYAtoB_SSEFS- 203 bytes- copy 16 BYTES+MOVZX- uses ESP
407 cycles, COPYAtoB_SSEF-  203 bytes- copy 16 BYTES+MOVZX
412 cycles, memcpy_2-       103 bytes- copy memcpy SSE
417 cycles, COPYAtoB_SSEES- 203 bytes- copy 16 BYTES+MOVZX- uses ESP
419 cycles, COPYAtoB_XZZES- 503 bytes- copy lenght DWORDS+MOVZX- uses ESP
419 cycles, COPYAtoB_XZZF-  503 bytes- copy lenght DWORDS+MOVZX
420 cycles, COPYAtoB_XZZCS- 503 bytes- copy lenght DWORDS+MOVZX- uses ESP
421 cycles, COPYAtoB_XZES-  503 bytes- copy lenght DWORDS+MOVZX- uses ESP
421 cycles, COPYAtoB_XZZC-  503 bytes- copy lenght DWORDS+MOVZX
422 cycles, COPYAtoB_XZE-   503 bytes- copy lenght DWORDS+MOVZX
424 cycles, COPYAtoB_XZZFS- 503 bytes- copy lenght DWORDS+MOVZX- uses ESP
425 cycles, COPYAtoB_WZZE-  503 bytes- copy 16 BYTES+MOVZX
427 cycles, COPYAtoB_XZZE-  503 bytes- copy lenght DWORDS+MOVZX
428 cycles, COPYAtoB_SSEDS- 203 bytes- copy 16 BYTES+MOVZX- uses ESP
435 cycles, COPYAtoB_SSED-  203 bytes- copy 16 BYTES+MOVZX
443 cycles, crt_memcpy-     503 bytes- copy crt_memcpy

447 cycles, memcpy_1-       503 bytes- copy regcopy                >>>>---

466 cycles, COPYAtoB_SSEGS- 203 bytes- copy 16 BYTES+MOVZX- uses ESP
466 cycles, COPYAtoB_SSEHS- 203 bytes- copy 16 BYTES+MOVZX- uses ESP
466 cycles, COPYAtoB_SSEG-  203 bytes- copy 16 BYTES+MOVZX
472 cycles, COPYAtoB_SSEH-  203 bytes- copy 16 BYTES+MOVZX
624 cycles, memcpy_2-       203 bytes- copy memcpy SSE
634 cycles, memcpy_3-       203 bytes- copy memcpyxmmU SSE

728 cycles, memcpy_1-      1027 bytes- copy regcopy                >>>>---
730 cycles, COPYAtoB_YZZH- 1027 bytes- copy length 16 BYTES+MOVZX  <<<<---
760 cycles, COPYAtoB_YZZG- 1027 bytes- copy length 16 BYTES+MOVZX  <<<<---

784 cycles, COPYAtoB_YZZE- 1027 bytes- copy length 16 BYTES+MOVZX
792 cycles, COPYAtoB_XZZES-1027 bytes- copy lenght DWORDS+MOVZX- uses ESP
793 cycles, COPYAtoB_XZE-  1027 bytes- copy lenght DWORDS+MOVZX
793 cycles, COPYAtoB_XZZCS-1027 bytes- copy lenght DWORDS+MOVZX- uses ESP
796 cycles, COPYAtoB_XZZC- 1027 bytes- copy lenght DWORDS+MOVZX
799 cycles, COPYAtoB_XZZE- 1027 bytes- copy lenght DWORDS+MOVZX
801 cycles, COPYAtoB_XZZF- 1027 bytes- copy lenght DWORDS+MOVZX
808 cycles, COPYAtoB_XZES- 1027 bytes- copy lenght DWORDS+MOVZX- uses ESP
812 cycles, COPYAtoB_WZZE- 1027 bytes- copy 16 BYTES+MOVZX
817 cycles, COPYAtoB_XZZFS-1027 bytes- copy lenght DWORDS+MOVZX- uses ESP
831 cycles, crt_memcpy-    1027 bytes- copy crt_memcpy
962 cycles, COPYAtoB_SSEDS- 503 bytes- copy 16 BYTES+MOVZX- uses ESP
968 cycles, COPYAtoB_SSEGS- 503 bytes- copy 16 BYTES+MOVZX- uses ESP
969 cycles, COPYAtoB_SSED-  503 bytes- copy 16 BYTES+MOVZX
980 cycles, COPYAtoB_SSEFS- 503 bytes- copy 16 BYTES+MOVZX- uses ESP
981 cycles, COPYAtoB_SSEE-  503 bytes- copy 16 BYTES+MOVZX
982 cycles, COPYAtoB_SSEG-  503 bytes- copy 16 BYTES+MOVZX
984 cycles, COPYAtoB_SSEES- 503 bytes- copy 16 BYTES+MOVZX- uses ESP
987 cycles, COPYAtoB_SSEF-  503 bytes- copy 16 BYTES+MOVZX
991 cycles, COPYAtoB_SSEH-  503 bytes- copy 16 BYTES+MOVZX
995 cycles, COPYAtoB_SSEHS- 503 bytes- copy 16 BYTES+MOVZX- uses ESP
1140 cycles, memcpy_2-       503 bytes- copy memcpy SSE
1162 cycles, memcpy_3-       503 bytes- copy memcpyxmmU SSE

1527 cycles, COPYAtoB_YZZH- 2062 bytes- copy length 16 BYTES+MOVZX  <<<<---
1546 cycles, COPYAtoB_YZZG- 2062 bytes- copy length 16 BYTES+MOVZX  <<<<---

1557 cycles, COPYAtoB_XZZE- 2062 bytes- copy lenght DWORDS+MOVZX
1558 cycles, memcpy_1-      2062 bytes- copy regcopy                >>>>---

1568 cycles, COPYAtoB_XZZC- 2062 bytes- copy lenght DWORDS+MOVZX
1569 cycles, COPYAtoB_XZZF- 2062 bytes- copy lenght DWORDS+MOVZX
1570 cycles, COPYAtoB_XZZFS-2062 bytes- copy lenght DWORDS+MOVZX- uses ESP
1571 cycles, COPYAtoB_WZZE- 2062 bytes- copy 16 BYTES+MOVZX
1573 cycles, COPYAtoB_XZZCS-2062 bytes- copy lenght DWORDS+MOVZX- uses ESP
1576 cycles, COPYAtoB_XZE-  2062 bytes- copy lenght DWORDS+MOVZX
1577 cycles, COPYAtoB_XZZES-2062 bytes- copy lenght DWORDS+MOVZX- uses ESP
1580 cycles, COPYAtoB_XZES- 2062 bytes- copy lenght DWORDS+MOVZX- uses ESP
1589 cycles, COPYAtoB_YZZE- 2062 bytes- copy length 16 BYTES+MOVZX
1635 cycles, crt_memcpy-    2062 bytes- copy crt_memcpy
1908 cycles, COPYAtoB_SSEES-1027 bytes- copy 16 BYTES+MOVZX- uses ESP
2026 cycles, COPYAtoB_SSEFS-1027 bytes- copy 16 BYTES+MOVZX- uses ESP
2030 cycles, COPYAtoB_SSEF- 1027 bytes- copy 16 BYTES+MOVZX
2043 cycles, COPYAtoB_SSEHS-1027 bytes- copy 16 BYTES+MOVZX- uses ESP
2045 cycles, COPYAtoB_SSEE- 1027 bytes- copy 16 BYTES+MOVZX
2046 cycles, COPYAtoB_SSED- 1027 bytes- copy 16 BYTES+MOVZX
2078 cycles, COPYAtoB_SSEH- 1027 bytes- copy 16 BYTES+MOVZX
2087 cycles, COPYAtoB_SSEDS-1027 bytes- copy 16 BYTES+MOVZX- uses ESP
2109 cycles, memcpy_3-      1027 bytes- copy memcpyxmmU SSE
2118 cycles, memcpy_2-      1027 bytes- copy memcpy SSE
2123 cycles, COPYAtoB_SSEGS-1027 bytes- copy 16 BYTES+MOVZX- uses ESP
2150 cycles, COPYAtoB_SSEG- 1027 bytes- copy 16 BYTES+MOVZX
3949 cycles, COPYAtoB_SSEGS-2062 bytes- copy 16 BYTES+MOVZX- uses ESP
4010 cycles, COPYAtoB_SSEG- 2062 bytes- copy 16 BYTES+MOVZX
4020 cycles, COPYAtoB_SSEES-2062 bytes- copy 16 BYTES+MOVZX- uses ESP
4046 cycles, COPYAtoB_SSEE- 2062 bytes- copy 16 BYTES+MOVZX
4052 cycles, COPYAtoB_SSEDS-2062 bytes- copy 16 BYTES+MOVZX- uses ESP
4054 cycles, COPYAtoB_SSEFS-2062 bytes- copy 16 BYTES+MOVZX- uses ESP
4060 cycles, COPYAtoB_SSED- 2062 bytes- copy 16 BYTES+MOVZX
4081 cycles, COPYAtoB_SSEHS-2062 bytes- copy 16 BYTES+MOVZX- uses ESP
4086 cycles, COPYAtoB_SSEF- 2062 bytes- copy 16 BYTES+MOVZX
4099 cycles, COPYAtoB_SSEH- 2062 bytes- copy 16 BYTES+MOVZX
4116 cycles, memcpy_2-      2062 bytes- copy memcpy SSE
4368 cycles, memcpy_3-      2062 bytes- copy memcpyxmmU SSE
********** END III **********
_YZZH    mean: 436.14
_YZZG    mean: 444.57
memcpy_1 mean: 451.86
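
The means above appear to be plain averages of the cycle counts over the seven tested sizes. A minimal sketch that recomputes two of them from rows of the P4 time table; the regular expression and the function name are assumptions for illustration, and the 451.86 figure matches the memcpy_1 rows.

```python
import re
from collections import defaultdict

# One time-table row: "<cycles> cycles, <name>- <size> bytes- ..."
ROW = re.compile(r"^\s*(\d+) cycles, (\w+)-\s*(\d+) bytes")

# Rows copied from the Pentium 4 table in this post.
SAMPLE = """\
  37 cycles, COPYAtoB_YZZH-   15 bytes- copy length 16 BYTES+MOVZX
  82 cycles, COPYAtoB_YZZH-   53 bytes- copy length 16 BYTES+MOVZX
 109 cycles, COPYAtoB_YZZH-  103 bytes- copy length 16 BYTES+MOVZX
 173 cycles, COPYAtoB_YZZH-  203 bytes- copy length 16 BYTES+MOVZX
 395 cycles, COPYAtoB_YZZH-  503 bytes- copy length 16 BYTES+MOVZX
 730 cycles, COPYAtoB_YZZH- 1027 bytes- copy length 16 BYTES+MOVZX
1527 cycles, COPYAtoB_YZZH- 2062 bytes- copy length 16 BYTES+MOVZX
  78 cycles, memcpy_1-        15 bytes- copy regcopy
  69 cycles, memcpy_1-        53 bytes- copy regcopy
 106 cycles, memcpy_1-       103 bytes- copy regcopy
 177 cycles, memcpy_1-       203 bytes- copy regcopy
 447 cycles, memcpy_1-       503 bytes- copy regcopy
 728 cycles, memcpy_1-      1027 bytes- copy regcopy
1558 cycles, memcpy_1-      2062 bytes- copy regcopy
"""

def mean_cycles(table: str) -> dict:
    """Average the cycle counts of each procedure over all sizes in the table."""
    cycles = defaultdict(list)
    for line in table.splitlines():
        match = ROW.match(line)
        if match:  # skip blank lines, headers and separators
            cycles[match.group(2)].append(int(match.group(1)))
    return {name: sum(values) / len(values) for name, values in cycles.items()}

means = mean_cycles(SAMPLE)
for name, mean in means.items():
    print(f"{name}: {mean:.2f}")
```

An unweighted mean like this is dominated by the 1027 and 2062 byte rows, so it mostly ranks the procedures by their large-copy speed.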

Gunther

Hi Rui,

The results of #47 and #48 are attached. The archive's name is copy.zip.

Gunther
You have to know the facts before you can distort them.

RuiLoureiro

Thank you Gunther  :t

On your powerful i7, _SSEE is the best.

Now I would like to know the results for CopyString53,
if you don't mind.

Quote
--------------------------------------------------------------
Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
--------------------------------------------------------------
***** Time table *****

  5 cycles, MOVEAtoB_SSEF-  13 bytes- copy 128 BITS+MOVZX
  5 cycles, MOVEAtoB_SSEE-  13 bytes- copy 128 BITS+MOVZX
  6 cycles, MOVEAtoB_SSED-  13 bytes- copy 128 BITS+MOVZX
  7 cycles, MOVEAtoB_XZZF-  13 bytes- copy lenght DWORDS+MOVZX
  7 cycles, MOVEAtoB_XZZE-  13 bytes- copy lenght DWORDS+MOVZX
  7 cycles, MOVEAtoB_XZZC-  13 bytes- copy lenght DWORDS+MOVZX
 
11 cycles, MOVEAtoB_SSEE-  53 bytes- copy 128 BITS+MOVZX
12 cycles, MOVEAtoB_SSEF-  53 bytes- copy 128 BITS+MOVZX
14 cycles, MOVEAtoB_XZE-   13 bytes- copy lenght DWORDS+MOVZX
15 cycles, MOVEAtoB_SSED-  53 bytes- copy 128 BITS+MOVZX
19 cycles, MOVEAtoB_SSEE- 103 bytes- copy 128 BITS+MOVZX
20 cycles, MOVEAtoB_SSEF- 103 bytes- copy 128 BITS+MOVZX
21 cycles, MOVEAtoB_XZZE-  53 bytes- copy lenght DWORDS+MOVZX
22 cycles, MOVEAtoB_XZZC-  53 bytes- copy lenght DWORDS+MOVZX

25 cycles, MOVEAtoB_SSED- 103 bytes- copy 128 BITS+MOVZX
29 cycles, MOVEAtoB_XZE-   53 bytes- copy lenght DWORDS+MOVZX
30 cycles, MOVEAtoB_SSEH-  53 bytes- copy 128 BITS+MOVZX
32 cycles, MOVEAtoB_SSEG-  53 bytes- copy 128 BITS+MOVZX
34 cycles, MOVEAtoB_XZZF-  53 bytes- copy lenght DWORDS+MOVZX

37 cycles, MOVEAtoB_SSEE- 203 bytes- copy 128 BITS+MOVZX
37 cycles, MOVEAtoB_SSEF- 203 bytes- copy 128 BITS+MOVZX
47 cycles, MOVEAtoB_XZE-  103 bytes- copy lenght DWORDS+MOVZX
47 cycles, MOVEAtoB_XZZF- 103 bytes- copy lenght DWORDS+MOVZX
47 cycles, MOVEAtoB_XZZE- 103 bytes- copy lenght DWORDS+MOVZX
48 cycles, MOVEAtoB_SSEH- 103 bytes- copy 128 BITS+MOVZX
49 cycles, MOVEAtoB_SSED- 203 bytes- copy 128 BITS+MOVZX
50 cycles, MOVEAtoB_XZZC- 103 bytes- copy lenght DWORDS+MOVZX
52 cycles, MOVEAtoB_SSEG- 103 bytes- copy 128 BITS+MOVZX
63 cycles, MOVEAtoB_SSEH-  13 bytes- copy 128 BITS+MOVZX

63 cycles, MOVEAtoB_SSEE- 503 bytes- copy 128 BITS+MOVZX
63 cycles, MOVEAtoB_SSEF- 503 bytes- copy 128 BITS+MOVZX
66 cycles, MOVEAtoB_SSEG-  13 bytes- copy 128 BITS+MOVZX
72 cycles, MOVEAtoB_SSEG- 203 bytes- copy 128 BITS+MOVZX
78 cycles, MOVEAtoB_SSEH- 203 bytes- copy 128 BITS+MOVZX
90 cycles, MOVEAtoB_SSED- 503 bytes- copy 128 BITS+MOVZX
105 cycles, MOVEAtoB_XZZC- 203 bytes- copy lenght DWORDS+MOVZX
106 cycles, MOVEAtoB_XZE-  203 bytes- copy lenght DWORDS+MOVZX
106 cycles, MOVEAtoB_XZZE- 203 bytes- copy lenght DWORDS+MOVZX
110 cycles, MOVEAtoB_SSEH- 503 bytes- copy 128 BITS+MOVZX
128 cycles, MOVEAtoB_SSEG- 503 bytes- copy 128 BITS+MOVZX

129 cycles, MOVEAtoB_SSEF-1027 bytes- copy 128 BITS+MOVZX
129 cycles, MOVEAtoB_SSEE-1027 bytes- copy 128 BITS+MOVZX
130 cycles, MOVEAtoB_SSEG-1027 bytes- copy 128 BITS+MOVZX
136 cycles, MOVEAtoB_XZZF- 203 bytes- copy lenght DWORDS+MOVZX
158 cycles, MOVEAtoB_SSED-1027 bytes- copy 128 BITS+MOVZX
165 cycles, MOVEAtoB_SSEH-1027 bytes- copy 128 BITS+MOVZX
237 cycles, MOVEAtoB_XZZE- 503 bytes- copy lenght DWORDS+MOVZX
238 cycles, MOVEAtoB_XZZC- 503 bytes- copy lenght DWORDS+MOVZX
238 cycles, MOVEAtoB_XZE-  503 bytes- copy lenght DWORDS+MOVZX
240 cycles, MOVEAtoB_XZZF- 503 bytes- copy lenght DWORDS+MOVZX
465 cycles, MOVEAtoB_XZZC-1027 bytes- copy lenght DWORDS+MOVZX
466 cycles, MOVEAtoB_XZE- 1027 bytes- copy lenght DWORDS+MOVZX
466 cycles, MOVEAtoB_XZZE-1027 bytes- copy lenght DWORDS+MOVZX
495 cycles, MOVEAtoB_XZZF-1027 bytes- copy lenght DWORDS+MOVZX
********** END III **********

Gunther

Hi Rui,

Quote from: RuiLoureiro on July 09, 2014, 09:44:04 PM
Now I would like to know the results for CopyString53,
if you don't mind.

No problem. Results are attached.

Gunther

nidud

#124
deleted

RuiLoureiro

Thank you so much Gunther  :t

In this test, for copying large blocks
(.../1027/2062 bytes), memcpy_3 seems to be the best.
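
A small sketch of how that reading can be checked: pick the row with the fewest cycles for each block size. The rows are copied from Gunther's i7 table below (only three procedures, to keep it short); the function name is hypothetical.

```python
# (cycles, procedure, bytes) rows taken from the i7-3770 time table.
ROWS = [
    (69, "memcpy_3", 503), (72, "COPYAtoB_SSEE", 503), (93, "crt_memcpy", 503),
    (115, "memcpy_3", 1027), (135, "COPYAtoB_SSEE", 1027), (137, "crt_memcpy", 1027),
    (192, "memcpy_3", 2062), (216, "crt_memcpy", 2062), (275, "COPYAtoB_SSEE", 2062),
]

def fastest_per_size(rows):
    """Return {size: (cycles, procedure)} holding the fewest cycles per size."""
    best = {}
    for cycles, name, size in rows:
        if size not in best or cycles < best[size][0]:
            best[size] = (cycles, name)
    return best

best = fastest_per_size(ROWS)
for size in sorted(best):
    cycles, name = best[size]
    print(f"{size:5d} bytes: {name} ({cycles} cycles)")
```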

nidud,
        I will add your new procedure
        in the next post.

--------------------------
Results from Gunther
---------------------------
---------------------------------------------
    names used by nidud
---------------------------------------------
memcpy_1  -> is regcopy
memcpy_2  -> is memcpy SSE
memcpy_3  -> is memcpyxmmU SSE
---------------------------------------------
Quote
--------------------------------------------------------------
Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
--------------------------------------------------------------
***** Time table *****

13 cycles, COPYAtoB_YZZH-   15 bytes- copy length 16 BYTES+MOVZX
13 cycles, crt_memcpy-      15 bytes- copy crt_memcpy
14 cycles, COPYAtoB_SSEE-   15 bytes- copy 16 BYTES+MOVZX

15 cycles, COPYAtoB_SSED-   15 bytes- copy 16 BYTES+MOVZX
15 cycles, COPYAtoB_SSEES-  15 bytes- copy 16 BYTES+MOVZX- uses ESP
16 cycles, COPYAtoB_SSEDS-  15 bytes- copy 16 BYTES+MOVZX- uses ESP
16 cycles, COPYAtoB_SSEFS-  15 bytes- copy 16 BYTES+MOVZX- uses ESP
16 cycles, COPYAtoB_WZZE-   15 bytes- copy 16 BYTES+MOVZX
17 cycles, COPYAtoB_YZZE-   15 bytes- copy length 16 BYTES+MOVZX
17 cycles, COPYAtoB_SSED-   53 bytes- copy 16 BYTES+MOVZX
17 cycles, COPYAtoB_SSEES-  53 bytes- copy 16 BYTES+MOVZX- uses ESP
17 cycles, COPYAtoB_SSEF-   53 bytes- copy 16 BYTES+MOVZX
17 cycles, COPYAtoB_SSEF-   15 bytes- copy 16 BYTES+MOVZX
17 cycles, COPYAtoB_SSEFS-  53 bytes- copy 16 BYTES+MOVZX- uses ESP
17 cycles, COPYAtoB_XZZE-   15 bytes- copy lenght DWORDS+MOVZX
17 cycles, COPYAtoB_XZZC-   15 bytes- copy lenght DWORDS+MOVZX
18 cycles, COPYAtoB_XZZCS-  15 bytes- copy lenght DWORDS+MOVZX- uses ESP
18 cycles, COPYAtoB_XZES-   15 bytes- copy lenght DWORDS+MOVZX- uses ESP
18 cycles, COPYAtoB_SSEDS-  53 bytes- copy 16 BYTES+MOVZX- uses ESP
18 cycles, COPYAtoB_YZZG-   15 bytes- copy length 16 BYTES+MOVZX
18 cycles, COPYAtoB_XZZES-  15 bytes- copy lenght DWORDS+MOVZX- uses ESP
19 cycles, COPYAtoB_XZZFS-  15 bytes- copy lenght DWORDS+MOVZX- uses ESP
19 cycles, COPYAtoB_XZZF-   15 bytes- copy lenght DWORDS+MOVZX
21 cycles, COPYAtoB_SSEE-   53 bytes- copy 16 BYTES+MOVZX
22 cycles, COPYAtoB_YZZH-   53 bytes- copy length 16 BYTES+MOVZX
22 cycles, COPYAtoB_YZZG-   53 bytes- copy length 16 BYTES+MOVZX
24 cycles, COPYAtoB_WZZE-   53 bytes- copy 16 BYTES+MOVZX
25 cycles, memcpy_1-        53 bytes- copy regcopy
25 cycles, COPYAtoB_XZZFS-  53 bytes- copy lenght DWORDS+MOVZX- uses ESP
26 cycles, COPYAtoB_YZZE-   53 bytes- copy length 16 BYTES+MOVZX
26 cycles, COPYAtoB_XZZES-  53 bytes- copy lenght DWORDS+MOVZX- uses ESP
26 cycles, COPYAtoB_XZZF-   53 bytes- copy lenght DWORDS+MOVZX
27 cycles, COPYAtoB_XZZE-   53 bytes- copy lenght DWORDS+MOVZX
28 cycles, COPYAtoB_XZZC-   53 bytes- copy lenght DWORDS+MOVZX
28 cycles, memcpy_1-        15 bytes- copy regcopy
31 cycles, COPYAtoB_SSEGS-  53 bytes- copy 16 BYTES+MOVZX- uses ESP
32 cycles, COPYAtoB_SSEG-   53 bytes- copy 16 BYTES+MOVZX
32 cycles, memcpy_2-        15 bytes- copy memcpy SSE
32 cycles, COPYAtoB_SSEH-   53 bytes- copy 16 BYTES+MOVZX
34 cycles, memcpy_3-        15 bytes- copy memcpyxmmU SSE
34 cycles, COPYAtoB_SSEHS-  53 bytes- copy 16 BYTES+MOVZX- uses ESP
36 cycles, COPYAtoB_XZZCS-  53 bytes- copy lenght DWORDS+MOVZX- uses ESP
37 cycles, crt_memcpy-      53 bytes- copy crt_memcpy

37 cycles, COPYAtoB_YZZG-  103 bytes- copy length 16 BYTES+MOVZX
37 cycles, COPYAtoB_XZES-   53 bytes- copy lenght DWORDS+MOVZX- uses ESP
37 cycles, COPYAtoB_YZZH-  103 bytes- copy length 16 BYTES+MOVZX
38 cycles, memcpy_2-        53 bytes- copy memcpy SSE
38 cycles, COPYAtoB_SSEES- 103 bytes- copy 16 BYTES+MOVZX- uses ESP
38 cycles, COPYAtoB_SSEDS- 103 bytes- copy 16 BYTES+MOVZX- uses ESP
38 cycles, COPYAtoB_SSEFS- 103 bytes- copy 16 BYTES+MOVZX- uses ESP
39 cycles, memcpy_1-       103 bytes- copy regcopy
39 cycles, memcpy_3-        53 bytes- copy memcpyxmmU SSE
40 cycles, COPYAtoB_SSED-  103 bytes- copy 16 BYTES+MOVZX
40 cycles, COPYAtoB_SSEF-  103 bytes- copy 16 BYTES+MOVZX
40 cycles, COPYAtoB_WZZE-  103 bytes- copy 16 BYTES+MOVZX
40 cycles, COPYAtoB_SSEE-  103 bytes- copy 16 BYTES+MOVZX
41 cycles, COPYAtoB_XZE-    15 bytes- copy lenght DWORDS+MOVZX
42 cycles, COPYAtoB_YZZE-  103 bytes- copy length 16 BYTES+MOVZX
47 cycles, crt_memcpy-     103 bytes- copy crt_memcpy
48 cycles, COPYAtoB_SSEGS- 103 bytes- copy 16 BYTES+MOVZX- uses ESP
48 cycles, COPYAtoB_SSEH-  103 bytes- copy 16 BYTES+MOVZX
49 cycles, memcpy_2-       103 bytes- copy memcpy SSE
49 cycles, COPYAtoB_XZZES- 103 bytes- copy lenght DWORDS+MOVZX- uses ESP
50 cycles, COPYAtoB_XZZFS- 103 bytes- copy lenght DWORDS+MOVZX- uses ESP
50 cycles, COPYAtoB_SSEHS- 103 bytes- copy 16 BYTES+MOVZX- uses ESP
51 cycles, memcpy_3-       103 bytes- copy memcpyxmmU SSE  <<<<<<<-----

51 cycles, COPYAtoB_SSEES- 203 bytes- copy 16 BYTES+MOVZX- uses ESP
51 cycles, COPYAtoB_XZZE-  103 bytes- copy lenght DWORDS+MOVZX
51 cycles, COPYAtoB_SSEFS- 203 bytes- copy 16 BYTES+MOVZX- uses ESP
52 cycles, COPYAtoB_SSED-  203 bytes- copy 16 BYTES+MOVZX
52 cycles, COPYAtoB_SSEDS- 203 bytes- copy 16 BYTES+MOVZX- uses ESP
52 cycles, COPYAtoB_SSEG-  103 bytes- copy 16 BYTES+MOVZX
52 cycles, COPYAtoB_XZZC-  103 bytes- copy lenght DWORDS+MOVZX
52 cycles, COPYAtoB_XZZF-  103 bytes- copy lenght DWORDS+MOVZX
52 cycles, COPYAtoB_SSEF-  203 bytes- copy 16 BYTES+MOVZX
52 cycles, COPYAtoB_XZE-   103 bytes- copy lenght DWORDS+MOVZX
52 cycles, COPYAtoB_SSEE-  203 bytes- copy 16 BYTES+MOVZX
55 cycles, memcpy_3-       203 bytes- copy memcpyxmmU SSE      <<<<<<<-----
59 cycles, COPYAtoB_XZE-    53 bytes- copy lenght DWORDS+MOVZX
59 cycles, memcpy_2-       203 bytes- copy memcpy SSE
64 cycles, COPYAtoB_YZZH-  203 bytes- copy length 16 BYTES+MOVZX
64 cycles, COPYAtoB_YZZG-  203 bytes- copy length 16 BYTES+MOVZX
65 cycles, COPYAtoB_WZZE-  203 bytes- copy 16 BYTES+MOVZX
68 cycles, COPYAtoB_YZZE-  203 bytes- copy length 16 BYTES+MOVZX
68 cycles, memcpy_1-       203 bytes- copy regcopy
69 cycles, crt_memcpy-     203 bytes- copy crt_memcpy

69 cycles, memcpy_3-       503 bytes- copy memcpyxmmU SSE
70 cycles, COPYAtoB_SSEG-  203 bytes- copy 16 BYTES+MOVZX
72 cycles, COPYAtoB_XZZCS- 103 bytes- copy lenght DWORDS+MOVZX- uses ESP
72 cycles, COPYAtoB_SSEE-  503 bytes- copy 16 BYTES+MOVZX
72 cycles, COPYAtoB_XZES-  103 bytes- copy lenght DWORDS+MOVZX- uses ESP
75 cycles, COPYAtoB_SSEGS- 203 bytes- copy 16 BYTES+MOVZX- uses ESP
76 cycles, COPYAtoB_SSEH-   15 bytes- copy 16 BYTES+MOVZX
79 cycles, COPYAtoB_SSEGS-  15 bytes- copy 16 BYTES+MOVZX- uses ESP
80 cycles, COPYAtoB_SSEG-   15 bytes- copy 16 BYTES+MOVZX
82 cycles, COPYAtoB_SSEH-  203 bytes- copy 16 BYTES+MOVZX
84 cycles, COPYAtoB_SSEGS- 503 bytes- copy 16 BYTES+MOVZX- uses ESP
87 cycles, COPYAtoB_SSEHS- 203 bytes- copy 16 BYTES+MOVZX- uses ESP
88 cycles, COPYAtoB_SSEG-  503 bytes- copy 16 BYTES+MOVZX
92 cycles, COPYAtoB_SSEHS-  15 bytes- copy 16 BYTES+MOVZX- uses ESP
93 cycles, crt_memcpy-     503 bytes- copy crt_memcpy
93 cycles, COPYAtoB_SSED-  503 bytes- copy 16 BYTES+MOVZX
94 cycles, COPYAtoB_SSEFS- 503 bytes- copy 16 BYTES+MOVZX- uses ESP
94 cycles, COPYAtoB_SSEDS- 503 bytes- copy 16 BYTES+MOVZX- uses ESP
94 cycles, COPYAtoB_SSEF-  503 bytes- copy 16 BYTES+MOVZX
107 cycles, COPYAtoB_SSEES- 503 bytes- copy 16 BYTES+MOVZX- uses ESP
108 cycles, COPYAtoB_XZZFS- 203 bytes- copy lenght DWORDS+MOVZX- uses ESP
109 cycles, COPYAtoB_XZZES- 203 bytes- copy lenght DWORDS+MOVZX- uses ESP
110 cycles, COPYAtoB_XZZF-  203 bytes- copy lenght DWORDS+MOVZX
110 cycles, COPYAtoB_XZZC-  203 bytes- copy lenght DWORDS+MOVZX
111 cycles, COPYAtoB_XZZE-  203 bytes- copy lenght DWORDS+MOVZX
111 cycles, COPYAtoB_SSEH-  503 bytes- copy 16 BYTES+MOVZX
111 cycles, COPYAtoB_XZE-   203 bytes- copy lenght DWORDS+MOVZX

115 cycles, memcpy_3-      1027 bytes- copy memcpyxmmU SSE
129 cycles, COPYAtoB_SSEHS- 503 bytes- copy 16 BYTES+MOVZX- uses ESP
130 cycles, memcpy_2-       503 bytes- copy memcpy SSE
135 cycles, COPYAtoB_SSEE- 1027 bytes- copy 16 BYTES+MOVZX

137 cycles, crt_memcpy-    1027 bytes- copy crt_memcpy
137 cycles, COPYAtoB_XZES-  203 bytes- copy lenght DWORDS+MOVZX- uses ESP
137 cycles, COPYAtoB_XZZCS- 203 bytes- copy lenght DWORDS+MOVZX- uses ESP
143 cycles, COPYAtoB_YZZE-  503 bytes- copy length 16 BYTES+MOVZX
145 cycles, memcpy_1-       503 bytes- copy regcopy
150 cycles, COPYAtoB_YZZH-  503 bytes- copy length 16 BYTES+MOVZX
150 cycles, COPYAtoB_YZZG-  503 bytes- copy length 16 BYTES+MOVZX
153 cycles, COPYAtoB_SSEG- 1027 bytes- copy 16 BYTES+MOVZX
153 cycles, COPYAtoB_SSEGS-1027 bytes- copy 16 BYTES+MOVZX- uses ESP
160 cycles, COPYAtoB_SSEDS-1027 bytes- copy 16 BYTES+MOVZX- uses ESP
161 cycles, COPYAtoB_SSED- 1027 bytes- copy 16 BYTES+MOVZX
161 cycles, COPYAtoB_SSEF- 1027 bytes- copy 16 BYTES+MOVZX
162 cycles, COPYAtoB_SSEFS-1027 bytes- copy 16 BYTES+MOVZX- uses ESP
162 cycles, COPYAtoB_SSEES-1027 bytes- copy 16 BYTES+MOVZX- uses ESP
165 cycles, COPYAtoB_SSEHS-1027 bytes- copy 16 BYTES+MOVZX- uses ESP
169 cycles, COPYAtoB_WZZE-  503 bytes- copy 16 BYTES+MOVZX
170 cycles, COPYAtoB_SSEH- 1027 bytes- copy 16 BYTES+MOVZX
185 cycles, memcpy_2-      1027 bytes- copy memcpy SSE

192 cycles, memcpy_3-      2062 bytes- copy memcpyxmmU SSE
216 cycles, crt_memcpy-    2062 bytes- copy crt_memcpy
239 cycles, COPYAtoB_XZZES- 503 bytes- copy lenght DWORDS+MOVZX- uses ESP
241 cycles, COPYAtoB_XZZE-  503 bytes- copy lenght DWORDS+MOVZX
241 cycles, COPYAtoB_XZE-   503 bytes- copy lenght DWORDS+MOVZX
242 cycles, COPYAtoB_XZZC-  503 bytes- copy lenght DWORDS+MOVZX
242 cycles, COPYAtoB_XZZFS- 503 bytes- copy lenght DWORDS+MOVZX- uses ESP
243 cycles, COPYAtoB_XZZF-  503 bytes- copy lenght DWORDS+MOVZX
268 cycles, COPYAtoB_XZES-  503 bytes- copy lenght DWORDS+MOVZX- uses ESP
269 cycles, COPYAtoB_XZZCS- 503 bytes- copy lenght DWORDS+MOVZX- uses ESP
275 cycles, COPYAtoB_SSEE- 2062 bytes- copy 16 BYTES+MOVZX

278 cycles, COPYAtoB_YZZE- 1027 bytes- copy length 16 BYTES+MOVZX
283 cycles, COPYAtoB_YZZH- 1027 bytes- copy length 16 BYTES+MOVZX
284 cycles, COPYAtoB_YZZG- 1027 bytes- copy length 16 BYTES+MOVZX
286 cycles, memcpy_1-      1027 bytes- copy regcopy

292 cycles, COPYAtoB_SSED- 2062 bytes- copy 16 BYTES+MOVZX
293 cycles, COPYAtoB_SSEF- 2062 bytes- copy 16 BYTES+MOVZX
294 cycles, COPYAtoB_SSEES-2062 bytes- copy 16 BYTES+MOVZX- uses ESP
294 cycles, COPYAtoB_SSEDS-2062 bytes- copy 16 BYTES+MOVZX- uses ESP
294 cycles, COPYAtoB_SSEFS-2062 bytes- copy 16 BYTES+MOVZX- uses ESP
296 cycles, memcpy_2-      2062 bytes- copy memcpy SSE
310 cycles, COPYAtoB_SSEG- 2062 bytes- copy 16 BYTES+MOVZX
319 cycles, COPYAtoB_SSEGS-2062 bytes- copy 16 BYTES+MOVZX- uses ESP
329 cycles, COPYAtoB_WZZE- 1027 bytes- copy 16 BYTES+MOVZX
331 cycles, COPYAtoB_SSEHS-2062 bytes- copy 16 BYTES+MOVZX- uses ESP
331 cycles, COPYAtoB_SSEH- 2062 bytes- copy 16 BYTES+MOVZX
468 cycles, COPYAtoB_XZZFS-1027 bytes- copy lenght DWORDS+MOVZX- uses ESP
469 cycles, COPYAtoB_XZZES-1027 bytes- copy lenght DWORDS+MOVZX- uses ESP
470 cycles, COPYAtoB_XZZF- 1027 bytes- copy lenght DWORDS+MOVZX
471 cycles, COPYAtoB_XZZE- 1027 bytes- copy lenght DWORDS+MOVZX
471 cycles, COPYAtoB_XZZC- 1027 bytes- copy lenght DWORDS+MOVZX
473 cycles, COPYAtoB_XZE-  1027 bytes- copy lenght DWORDS+MOVZX
497 cycles, COPYAtoB_XZES- 1027 bytes- copy lenght DWORDS+MOVZX- uses ESP
499 cycles, COPYAtoB_XZZCS-1027 bytes- copy lenght DWORDS+MOVZX- uses ESP
509 cycles, COPYAtoB_YZZE- 2062 bytes- copy length 16 BYTES+MOVZX
568 cycles, COPYAtoB_YZZH- 2062 bytes- copy length 16 BYTES+MOVZX
570 cycles, COPYAtoB_YZZG- 2062 bytes- copy length 16 BYTES+MOVZX
583 cycles, memcpy_1-      2062 bytes- copy regcopy
677 cycles, COPYAtoB_WZZE- 2062 bytes- copy 16 BYTES+MOVZX
919 cycles, COPYAtoB_XZZFS-2062 bytes- copy lenght DWORDS+MOVZX- uses ESP
919 cycles, COPYAtoB_XZZES-2062 bytes- copy lenght DWORDS+MOVZX- uses ESP
920 cycles, COPYAtoB_XZZE- 2062 bytes- copy lenght DWORDS+MOVZX
921 cycles, COPYAtoB_XZE-  2062 bytes- copy lenght DWORDS+MOVZX
922 cycles, COPYAtoB_XZZC- 2062 bytes- copy lenght DWORDS+MOVZX
922 cycles, COPYAtoB_XZZF- 2062 bytes- copy lenght DWORDS+MOVZX
948 cycles, COPYAtoB_XZZCS-2062 bytes- copy lenght DWORDS+MOVZX- uses ESP
948 cycles, COPYAtoB_XZES- 2062 bytes- copy lenght DWORDS+MOVZX- uses ESP
********** END III **********

RuiLoureiro

Hi,
        Now I have used the crt_memcpy, memcpy_1, memcpy_2, memcpy_3 and memcpy_4
        procedures and some SSE procedures.
       
        Gunther, could you post the results from your i7,
        if you don't mind?
        Thank you.
       
        These are my results.
Quote
NOT ALIGNED
----------------------------------------------------
Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
-----------------------------------------------------
***** Time table *****

  34 cycles, COPYAtoB_SSEE-  15 bytes- copy 16 BYTES+MOVZX
  35 cycles, COPYAtoB_YZZI-  15 bytes- copy length 16 BYTES+MOVZX
  35 cycles, crt_memcpy-     15 bytes- copy crt_memcpy
  35 cycles, COPYAtoB_XZZF-  15 bytes- copy lenght DWORDS+MOVZX
  37 cycles, COPYAtoB_WZZE-  15 bytes- copy 16 BYTES+MOVZX
  37 cycles, COPYAtoB_YZZH-  15 bytes- copy length 16 BYTES+MOVZX
  37 cycles, COPYAtoB_XZZE-  15 bytes- copy lenght DWORDS+MOVZX
  37 cycles, COPYAtoB_XZZC-  15 bytes- copy lenght DWORDS+MOVZX
  39 cycles, COPYAtoB_YZZG-  15 bytes- copy length 16 BYTES+MOVZX
  39 cycles, COPYAtoB_YZZE-  15 bytes- copy length 16 BYTES+MOVZX
  47 cycles, COPYAtoB_XZE-   15 bytes- copy lenght DWORDS+MOVZX
  50 cycles, COPYAtoB_SSEI-  15 bytes- copy 16 BYTES+MOVZX
  51 cycles, COPYAtoB_SSEJ-  15 bytes- copy 16 BYTES+MOVZX
  61 cycles, COPYAtoB_SSEH-  15 bytes- copy 16 BYTES+MOVZX
  68 cycles, memcpy_1-       53 bytes- copy regcopy
  72 cycles, memcpy_1-       15 bytes- copy regcopy
 
  72 cycles, COPYAtoB_YZZI-  53 bytes- copy length 16 BYTES+MOVZX
  73 cycles, memcpy_4-       15 bytes- copy memcpy * 8
  73 cycles, COPYAtoB_YZZH-  53 bytes- copy length 16 BYTES+MOVZX
  75 cycles, COPYAtoB_YZZG-  53 bytes- copy length 16 BYTES+MOVZX
  75 cycles, COPYAtoB_XZZC-  53 bytes- copy lenght DWORDS+MOVZX
  76 cycles, COPYAtoB_XZE-   53 bytes- copy lenght DWORDS+MOVZX
  76 cycles, COPYAtoB_XZZF-  53 bytes- copy lenght DWORDS+MOVZX
  78 cycles, COPYAtoB_XZZE-  53 bytes- copy lenght DWORDS+MOVZX
  78 cycles, COPYAtoB_YZZE-  53 bytes- copy length 16 BYTES+MOVZX
  89 cycles, COPYAtoB_WZZE-  53 bytes- copy 16 BYTES+MOVZX
  90 cycles, COPYAtoB_SSEJ-  53 bytes- copy 16 BYTES+MOVZX
  94 cycles, COPYAtoB_SSEI-  53 bytes- copy 16 BYTES+MOVZX
  98 cycles, COPYAtoB_SSEH-  53 bytes- copy 16 BYTES+MOVZX
100 cycles, COPYAtoB_SSEE-  53 bytes- copy 16 BYTES+MOVZX

103 cycles, memcpy_4-       53 bytes- copy memcpy * 8
106 cycles, memcpy_1-      103 bytes- copy regcopy
108 cycles, COPYAtoB_YZZG- 103 bytes- copy length 16 BYTES+MOVZX
108 cycles, COPYAtoB_YZZI- 103 bytes- copy length 16 BYTES+MOVZX
108 cycles, crt_memcpy-     53 bytes- copy crt_memcpy
109 cycles, COPYAtoB_YZZH- 103 bytes- copy length 16 BYTES+MOVZX
111 cycles, COPYAtoB_YZZE- 103 bytes- copy length 16 BYTES+MOVZX
112 cycles, COPYAtoB_WZZE- 103 bytes- copy 16 BYTES+MOVZX
121 cycles, COPYAtoB_XZZF- 103 bytes- copy lenght DWORDS+MOVZX
122 cycles, COPYAtoB_XZZC- 103 bytes- copy lenght DWORDS+MOVZX
125 cycles, COPYAtoB_XZE-  103 bytes- copy lenght DWORDS+MOVZX
130 cycles, COPYAtoB_XZZE- 103 bytes- copy lenght DWORDS+MOVZX
132 cycles, memcpy_4-      103 bytes- copy memcpy * 8
139 cycles, crt_memcpy-    103 bytes- copy crt_memcpy

172 cycles, COPYAtoB_YZZG- 203 bytes- copy length 16 BYTES+MOVZX
173 cycles, COPYAtoB_YZZI- 203 bytes- copy length 16 BYTES+MOVZX
173 cycles, COPYAtoB_YZZE- 203 bytes- copy length 16 BYTES+MOVZX
174 cycles, COPYAtoB_YZZH- 203 bytes- copy length 16 BYTES+MOVZX
177 cycles, memcpy_1-      203 bytes- copy regcopy
206 cycles, COPYAtoB_SSEJ- 103 bytes- copy 16 BYTES+MOVZX
209 cycles, COPYAtoB_SSEI- 103 bytes- copy 16 BYTES+MOVZX
211 cycles, COPYAtoB_SSEE- 103 bytes- copy 16 BYTES+MOVZX
212 cycles, COPYAtoB_WZZE- 203 bytes- copy 16 BYTES+MOVZX
212 cycles, memcpy_4-      203 bytes- copy memcpy * 8
213 cycles, COPYAtoB_XZZE- 203 bytes- copy lenght DWORDS+MOVZX
213 cycles, COPYAtoB_XZZC- 203 bytes- copy lenght DWORDS+MOVZX
214 cycles, COPYAtoB_SSEH- 103 bytes- copy 16 BYTES+MOVZX
214 cycles, COPYAtoB_XZZF- 203 bytes- copy lenght DWORDS+MOVZX
225 cycles, crt_memcpy-    203 bytes- copy crt_memcpy
231 cycles, COPYAtoB_XZE-  203 bytes- copy lenght DWORDS+MOVZX
250 cycles, memcpy_2-       15 bytes- copy memcpy SSE
256 cycles, memcpy_3-       15 bytes- copy memcpyxmmU SSE
277 cycles, memcpy_3-       53 bytes- copy memcpyxmmU SSE
278 cycles, memcpy_2-       53 bytes- copy memcpy SSE
379 cycles, memcpy_3-      103 bytes- copy memcpyxmmU SSE
393 cycles, memcpy_2-      103 bytes- copy memcpy SSE
394 cycles, COPYAtoB_YZZI- 503 bytes- copy length 16 BYTES+MOVZX
395 cycles, COPYAtoB_YZZG- 503 bytes- copy length 16 BYTES+MOVZX
398 cycles, COPYAtoB_YZZH- 503 bytes- copy length 16 BYTES+MOVZX

412 cycles, COPYAtoB_SSEE- 203 bytes- copy 16 BYTES+MOVZX
414 cycles, memcpy_1-      503 bytes- copy regcopy
418 cycles, COPYAtoB_YZZE- 503 bytes- copy length 16 BYTES+MOVZX
427 cycles, COPYAtoB_XZZC- 503 bytes- copy lenght DWORDS+MOVZX
428 cycles, COPYAtoB_WZZE- 503 bytes- copy 16 BYTES+MOVZX
429 cycles, COPYAtoB_SSEH- 203 bytes- copy 16 BYTES+MOVZX
430 cycles, COPYAtoB_SSEI- 203 bytes- copy 16 BYTES+MOVZX
431 cycles, COPYAtoB_SSEJ- 203 bytes- copy 16 BYTES+MOVZX

436 cycles, memcpy_4-      503 bytes- copy memcpy * 8
442 cycles, COPYAtoB_XZZF- 503 bytes- copy lenght DWORDS+MOVZX
444 cycles, COPYAtoB_XZZE- 503 bytes- copy lenght DWORDS+MOVZX
454 cycles, crt_memcpy-    503 bytes- copy crt_memcpy
503 cycles, COPYAtoB_XZE-  503 bytes- copy lenght DWORDS+MOVZX
620 cycles, memcpy_2-      203 bytes- copy memcpy SSE
631 cycles, memcpy_3-      203 bytes- copy memcpyxmmU SSE

726 cycles, memcpy_1-     1027 bytes- copy regcopy
734 cycles, COPYAtoB_YZZG-1027 bytes- copy length 16 BYTES+MOVZX
769 cycles, COPYAtoB_YZZI-1027 bytes- copy length 16 BYTES+MOVZX
787 cycles, COPYAtoB_YZZH-1027 bytes- copy length 16 BYTES+MOVZX
788 cycles, COPYAtoB_YZZE-1027 bytes- copy length 16 BYTES+MOVZX
808 cycles, COPYAtoB_XZZE-1027 bytes- copy lenght DWORDS+MOVZX
815 cycles, COPYAtoB_XZE- 1027 bytes- copy lenght DWORDS+MOVZX
815 cycles, COPYAtoB_XZZC-1027 bytes- copy lenght DWORDS+MOVZX
817 cycles, COPYAtoB_WZZE-1027 bytes- copy 16 BYTES+MOVZX
825 cycles, memcpy_4-     1027 bytes- copy memcpy * 8
835 cycles, crt_memcpy-   1027 bytes- copy crt_memcpy
851 cycles, COPYAtoB_XZZF-1027 bytes- copy lenght DWORDS+MOVZX

965 cycles, COPYAtoB_SSEH- 503 bytes- copy 16 BYTES+MOVZX
968 cycles, COPYAtoB_SSEJ- 503 bytes- copy 16 BYTES+MOVZX
970 cycles, COPYAtoB_SSEI- 503 bytes- copy 16 BYTES+MOVZX
989 cycles, COPYAtoB_SSEE- 503 bytes- copy 16 BYTES+MOVZX
1142 cycles, memcpy_2-      503 bytes- copy memcpy SSE
1163 cycles, memcpy_3-      503 bytes- copy memcpyxmmU SSE
1529 cycles, COPYAtoB_YZZI-2062 bytes- copy length 16 BYTES+MOVZX
1538 cycles, COPYAtoB_YZZG-2062 bytes- copy length 16 BYTES+MOVZX
1563 cycles, COPYAtoB_XZZC-2062 bytes- copy lenght DWORDS+MOVZX
1581 cycles, COPYAtoB_YZZE-2062 bytes- copy length 16 BYTES+MOVZX
1585 cycles, COPYAtoB_WZZE-2062 bytes- copy 16 BYTES+MOVZX
1603 cycles, memcpy_1-     2062 bytes- copy regcopy
1622 cycles, COPYAtoB_XZZF-2062 bytes- copy lenght DWORDS+MOVZX
1622 cycles, memcpy_4-     2062 bytes- copy memcpy * 8
1626 cycles, crt_memcpy-   2062 bytes- copy crt_memcpy
1626 cycles, COPYAtoB_XZE- 2062 bytes- copy lenght DWORDS+MOVZX
1630 cycles, COPYAtoB_XZZE-2062 bytes- copy lenght DWORDS+MOVZX
1734 cycles, COPYAtoB_YZZH-2062 bytes- copy length 16 BYTES+MOVZX
2029 cycles, COPYAtoB_SSEE-1027 bytes- copy 16 BYTES+MOVZX
2060 cycles, COPYAtoB_SSEH-1027 bytes- copy 16 BYTES+MOVZX
2074 cycles, COPYAtoB_SSEJ-1027 bytes- copy 16 BYTES+MOVZX
2097 cycles, memcpy_3-     1027 bytes- copy memcpyxmmU SSE
2119 cycles, COPYAtoB_SSEI-1027 bytes- copy 16 BYTES+MOVZX
2123 cycles, memcpy_2-     1027 bytes- copy memcpy SSE
4043 cycles, COPYAtoB_SSEE-2062 bytes- copy 16 BYTES+MOVZX
4065 cycles, COPYAtoB_SSEI-2062 bytes- copy 16 BYTES+MOVZX
4070 cycles, COPYAtoB_SSEH-2062 bytes- copy 16 BYTES+MOVZX
4086 cycles, COPYAtoB_SSEJ-2062 bytes- copy 16 BYTES+MOVZX
4088 cycles, memcpy_2-     2062 bytes- copy memcpy SSE
4391 cycles, memcpy_3-     2062 bytes- copy memcpyxmmU SSE
********** END III **********

Gunther

Hi Rui,

the results for CopyString54 are attached as CopyStrin54.zip. I hope that helps.

Gunther
You have to know the facts before you can distort them.

RuiLoureiro

Hi Gunther,
            It helps, thank you.  :t
            Now I want to test another procedure.
            Could you run CopyString55, please?
            Thanks.
       
These are my results.
Quote
-----------------------------------------------------
Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
----------------------------------------------------
***** Time table *****

34 cycles, COPYAtoB_YZZI-   15 bytes- copy length 16 BYTES+MOVZX
35 cycles, COPYAtoB_SSEE-   15 bytes- copy 16 BYTES+MOVZX
35 cycles, crt_memcpy-      15 bytes- copy crt_memcpy
35 cycles, COPYAtoB_XZZF-   15 bytes- copy lenght DWORDS+MOVZX
36 cycles, COPYAtoB_YZZH-   15 bytes- copy length 16 BYTES+MOVZX
36 cycles, COPYAtoB_XZZC-   15 bytes- copy lenght DWORDS+MOVZX
37 cycles, COPYAtoB_XZZE-   15 bytes- copy lenght DWORDS+MOVZX
37 cycles, COPYAtoB_WZZE-   15 bytes- copy 16 BYTES+MOVZX
39 cycles, COPYAtoB_YZZG-   15 bytes- copy length 16 BYTES+MOVZX
39 cycles, COPYAtoB_YZZE-   15 bytes- copy length 16 BYTES+MOVZX
44 cycles, COPYAtoB_XZE-    15 bytes- copy lenght DWORDS+MOVZX
48 cycles, COPYAtoB_SSEK-   15 bytes- copy 16 BYTES+MOVZX
50 cycles, COPYAtoB_SSEJ-   15 bytes- copy 16 BYTES+MOVZX
51 cycles, COPYAtoB_SSEI-   15 bytes- copy 16 BYTES+MOVZX
59 cycles, COPYAtoB_SSEH-   15 bytes- copy 16 BYTES+MOVZX
68 cycles, memcpy_1-        53 bytes- copy regcopy
71 cycles, COPYAtoB_YZZI-   53 bytes- copy length 16 BYTES+MOVZX
71 cycles, COPYAtoB_YZZG-   53 bytes- copy length 16 BYTES+MOVZX
71 cycles, memcpy_4-        15 bytes- copy memcpy * 8
74 cycles, COPYAtoB_WZZE-   53 bytes- copy 16 BYTES+MOVZX
75 cycles, COPYAtoB_XZZE-   53 bytes- copy lenght DWORDS+MOVZX
75 cycles, COPYAtoB_XZZF-   53 bytes- copy lenght DWORDS+MOVZX
77 cycles, COPYAtoB_YZZE-   53 bytes- copy length 16 BYTES+MOVZX
79 cycles, COPYAtoB_XZZC-   53 bytes- copy lenght DWORDS+MOVZX
81 cycles, memcpy_1-        15 bytes- copy regcopy
83 cycles, COPYAtoB_XZE-    53 bytes- copy lenght DWORDS+MOVZX
84 cycles, COPYAtoB_SSEK-   53 bytes- copy 16 BYTES+MOVZX
89 cycles, COPYAtoB_SSEI-   53 bytes- copy 16 BYTES+MOVZX
91 cycles, COPYAtoB_SSEE-   53 bytes- copy 16 BYTES+MOVZX
94 cycles, COPYAtoB_SSEJ-   53 bytes- copy 16 BYTES+MOVZX
100 cycles, COPYAtoB_YZZH-   53 bytes- copy length 16 BYTES+MOVZX
100 cycles, COPYAtoB_SSEH-   53 bytes- copy 16 BYTES+MOVZX
102 cycles, memcpy_4-        53 bytes- copy memcpy * 8
107 cycles, crt_memcpy-      53 bytes- copy crt_memcpy
107 cycles, memcpy_1-       103 bytes- copy regcopy
108 cycles, COPYAtoB_YZZG-  103 bytes- copy length 16 BYTES+MOVZX
108 cycles, COPYAtoB_YZZI-  103 bytes- copy length 16 BYTES+MOVZX
110 cycles, COPYAtoB_YZZE-  103 bytes- copy length 16 BYTES+MOVZX
110 cycles, COPYAtoB_YZZH-  103 bytes- copy length 16 BYTES+MOVZX
112 cycles, COPYAtoB_WZZE-  103 bytes- copy 16 BYTES+MOVZX
121 cycles, COPYAtoB_XZZF-  103 bytes- copy lenght DWORDS+MOVZX
122 cycles, COPYAtoB_XZZC-  103 bytes- copy lenght DWORDS+MOVZX
123 cycles, COPYAtoB_XZZE-  103 bytes- copy lenght DWORDS+MOVZX
133 cycles, memcpy_4-       103 bytes- copy memcpy * 8
134 cycles, COPYAtoB_XZE-   103 bytes- copy lenght DWORDS+MOVZX
139 cycles, crt_memcpy-     103 bytes- copy crt_memcpy
170 cycles, COPYAtoB_YZZG-  203 bytes- copy length 16 BYTES+MOVZX
175 cycles, COPYAtoB_YZZH-  203 bytes- copy length 16 BYTES+MOVZX
175 cycles, memcpy_1-       203 bytes- copy regcopy
179 cycles, COPYAtoB_YZZE-  203 bytes- copy length 16 BYTES+MOVZX
188 cycles, COPYAtoB_YZZI-  203 bytes- copy length 16 BYTES+MOVZX
198 cycles, COPYAtoB_SSEK-  103 bytes- copy 16 BYTES+MOVZX
207 cycles, COPYAtoB_SSEI-  103 bytes- copy 16 BYTES+MOVZX
210 cycles, COPYAtoB_WZZE-  203 bytes- copy 16 BYTES+MOVZX
211 cycles, COPYAtoB_SSEE-  103 bytes- copy 16 BYTES+MOVZX
211 cycles, COPYAtoB_XZZC-  203 bytes- copy lenght DWORDS+MOVZX
212 cycles, COPYAtoB_SSEH-  103 bytes- copy 16 BYTES+MOVZX
212 cycles, COPYAtoB_XZZE-  203 bytes- copy lenght DWORDS+MOVZX
213 cycles, memcpy_4-       203 bytes- copy memcpy * 8
214 cycles, COPYAtoB_XZZF-  203 bytes- copy lenght DWORDS+MOVZX
216 cycles, COPYAtoB_SSEJ-  103 bytes- copy 16 BYTES+MOVZX
224 cycles, crt_memcpy-     203 bytes- copy crt_memcpy
233 cycles, COPYAtoB_XZE-   203 bytes- copy lenght DWORDS+MOVZX
250 cycles, memcpy_2-        15 bytes- copy memcpy SSE
254 cycles, memcpy_3-        15 bytes- copy memcpyxmmU SSE
276 cycles, memcpy_3-        53 bytes- copy memcpyxmmU SSE
299 cycles, memcpy_2-        53 bytes- copy memcpy SSE
381 cycles, memcpy_3-       103 bytes- copy memcpyxmmU SSE
392 cycles, COPYAtoB_YZZG-  503 bytes- copy length 16 BYTES+MOVZX
392 cycles, COPYAtoB_YZZI-  503 bytes- copy length 16 BYTES+MOVZX
393 cycles, memcpy_1-       503 bytes- copy regcopy
397 cycles, COPYAtoB_YZZH-  503 bytes- copy length 16 BYTES+MOVZX
403 cycles, memcpy_2-       103 bytes- copy memcpy SSE
408 cycles, COPYAtoB_YZZE-  503 bytes- copy length 16 BYTES+MOVZX
408 cycles, COPYAtoB_SSEE-  203 bytes- copy 16 BYTES+MOVZX
421 cycles, COPYAtoB_XZZC-  503 bytes- copy lenght DWORDS+MOVZX
424 cycles, COPYAtoB_XZZE-  503 bytes- copy lenght DWORDS+MOVZX
428 cycles, COPYAtoB_SSEH-  203 bytes- copy 16 BYTES+MOVZX
429 cycles, COPYAtoB_SSEI-  203 bytes- copy 16 BYTES+MOVZX
435 cycles, COPYAtoB_SSEJ-  203 bytes- copy 16 BYTES+MOVZX
437 cycles, COPYAtoB_WZZE-  503 bytes- copy 16 BYTES+MOVZX
440 cycles, crt_memcpy-     503 bytes- copy crt_memcpy
447 cycles, COPYAtoB_XZZF-  503 bytes- copy lenght DWORDS+MOVZX
448 cycles, COPYAtoB_XZE-   503 bytes- copy lenght DWORDS+MOVZX
452 cycles, COPYAtoB_SSEK-  203 bytes- copy 16 BYTES+MOVZX
464 cycles, memcpy_4-       503 bytes- copy memcpy * 8
623 cycles, memcpy_2-       203 bytes- copy memcpy SSE
631 cycles, memcpy_3-       203 bytes- copy memcpyxmmU SSE
728 cycles, COPYAtoB_YZZI- 1027 bytes- copy length 16 BYTES+MOVZX
732 cycles, COPYAtoB_YZZG- 1027 bytes- copy length 16 BYTES+MOVZX
732 cycles, memcpy_1-      1027 bytes- copy regcopy
733 cycles, COPYAtoB_YZZH- 1027 bytes- copy length 16 BYTES+MOVZX
787 cycles, COPYAtoB_YZZE- 1027 bytes- copy length 16 BYTES+MOVZX
806 cycles, COPYAtoB_XZZF- 1027 bytes- copy lenght DWORDS+MOVZX
807 cycles, COPYAtoB_XZE-  1027 bytes- copy lenght DWORDS+MOVZX
830 cycles, memcpy_4-      1027 bytes- copy memcpy * 8
830 cycles, COPYAtoB_XZZE- 1027 bytes- copy lenght DWORDS+MOVZX
850 cycles, crt_memcpy-    1027 bytes- copy crt_memcpy
864 cycles, COPYAtoB_WZZE- 1027 bytes- copy 16 BYTES+MOVZX
874 cycles, COPYAtoB_XZZC- 1027 bytes- copy lenght DWORDS+MOVZX
960 cycles, COPYAtoB_SSEK-  503 bytes- copy 16 BYTES+MOVZX
964 cycles, COPYAtoB_SSEI-  503 bytes- copy 16 BYTES+MOVZX
965 cycles, COPYAtoB_SSEJ-  503 bytes- copy 16 BYTES+MOVZX
965 cycles, COPYAtoB_SSEH-  503 bytes- copy 16 BYTES+MOVZX
986 cycles, COPYAtoB_SSEE-  503 bytes- copy 16 BYTES+MOVZX
1166 cycles, memcpy_2-      503 bytes- copy memcpy SSE
1241 cycles, memcpy_3-      503 bytes- copy memcpyxmmU SSE
1526 cycles, COPYAtoB_YZZH-2062 bytes- copy length 16 BYTES+MOVZX
1527 cycles, COPYAtoB_YZZG-2062 bytes- copy length 16 BYTES+MOVZX
1536 cycles, COPYAtoB_YZZI-2062 bytes- copy length 16 BYTES+MOVZX
1556 cycles, memcpy_1-     2062 bytes- copy regcopy
1559 cycles, COPYAtoB_XZZE-2062 bytes- copy lenght DWORDS+MOVZX
1561 cycles, COPYAtoB_XZE- 2062 bytes- copy lenght DWORDS+MOVZX
1571 cycles, COPYAtoB_XZZF-2062 bytes- copy lenght DWORDS+MOVZX
1574 cycles, COPYAtoB_YZZE-2062 bytes- copy length 16 BYTES+MOVZX
1577 cycles, COPYAtoB_WZZE-2062 bytes- copy 16 BYTES+MOVZX
1585 cycles, COPYAtoB_XZZC-2062 bytes- copy lenght DWORDS+MOVZX
1606 cycles, memcpy_4-     2062 bytes- copy memcpy * 8
1620 cycles, crt_memcpy-   2062 bytes- copy crt_memcpy
2019 cycles, COPYAtoB_SSEE-1027 bytes- copy 16 BYTES+MOVZX
2038 cycles, COPYAtoB_SSEK-1027 bytes- copy 16 BYTES+MOVZX
2049 cycles, COPYAtoB_SSEH-1027 bytes- copy 16 BYTES+MOVZX
2051 cycles, COPYAtoB_SSEJ-1027 bytes- copy 16 BYTES+MOVZX
2061 cycles, COPYAtoB_SSEI-1027 bytes- copy 16 BYTES+MOVZX
2110 cycles, memcpy_3-     1027 bytes- copy memcpyxmmU SSE
2116 cycles, memcpy_2-     1027 bytes- copy memcpy SSE
4029 cycles, COPYAtoB_SSEI-2062 bytes- copy 16 BYTES+MOVZX
4031 cycles, COPYAtoB_SSEJ-2062 bytes- copy 16 BYTES+MOVZX
4042 cycles, COPYAtoB_SSEE-2062 bytes- copy 16 BYTES+MOVZX
4046 cycles, COPYAtoB_SSEK-2062 bytes- copy 16 BYTES+MOVZX
4057 cycles, COPYAtoB_SSEH-2062 bytes- copy 16 BYTES+MOVZX
4085 cycles, memcpy_2-     2062 bytes- copy memcpy SSE
4389 cycles, memcpy_3-     2062 bytes- copy memcpyxmmU SSE
********** END III **********

Gunther

Rui,

results are attached.

Gunther

RuiLoureiro

Hi Gunther,
            Thank you for your work  :t
            I am trying to understand the behaviour of
            each procedure.
            COPYAtoB_SSEJ is EQUAL to COPYAtoB_SSEK,
            but 'J' has 1 push + 1 pop and 'K' does not (30 cycles more).
            I don't believe that 'J' is better than 'K'.
            Something seems to be wrong here.
---------------------------
Results from Gunther
---------------------------
Quote
CopyString54.txt
-------------------------------------------------------------
Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
--------------------------------------------------------------
***** Time table *****

14 cycles, crt_memcpy-     15 bytes- copy crt_memcpy
17 cycles, COPYAtoB_SSEJ-  53 bytes- copy 16 BYTES+MOVZX           <<<<---J

23 cycles, COPYAtoB_SSEJ- 103 bytes- copy 16 BYTES+MOVZX           <<<<---J
25 cycles, memcpy_1-       53 bytes- copy regcopy
27 cycles, memcpy_1-       15 bytes- copy regcopy
27 cycles, COPYAtoB_SSEJ-  15 bytes- copy 16 BYTES+MOVZX
30 cycles, memcpy_4-       15 bytes- copy memcpy * 8
32 cycles, memcpy_2-       15 bytes- copy memcpy SSE
34 cycles, memcpy_3-       15 bytes- copy memcpyxmmU SSE
35 cycles, memcpy_4-       53 bytes- copy memcpy * 8
36 cycles, crt_memcpy-     53 bytes- copy crt_memcpy
38 cycles, memcpy_2-       53 bytes- copy memcpy SSE
39 cycles, memcpy_3-       53 bytes- copy memcpyxmmU SSE
39 cycles, memcpy_1-      103 bytes- copy regcopy

41 cycles, COPYAtoB_SSEJ- 203 bytes- copy 16 BYTES+MOVZX           <<<<---J
44 cycles, memcpy_2-      103 bytes- copy memcpy SSE
46 cycles, memcpy_4-      103 bytes- copy memcpy * 8
47 cycles, crt_memcpy-    103 bytes- copy crt_memcpy
48 cycles, memcpy_3-      103 bytes- copy memcpyxmmU SSE
50 cycles, memcpy_3-      203 bytes- copy memcpyxmmU SSE
60 cycles, memcpy_2-      203 bytes- copy memcpy SSE
67 cycles, memcpy_4-      203 bytes- copy memcpy * 8
68 cycles, memcpy_1-      203 bytes- copy regcopy
68 cycles, crt_memcpy-    203 bytes- copy crt_memcpy

69 cycles, memcpy_3-      503 bytes- copy memcpyxmmU SSE           >>>>>---
82 cycles, COPYAtoB_SSEJ- 503 bytes- copy 16 BYTES+MOVZX           <<<<---J
86 cycles, memcpy_4-      503 bytes- copy memcpy * 8
93 cycles, crt_memcpy-    503 bytes- copy crt_memcpy

102 cycles, memcpy_3-     1027 bytes- copy memcpyxmmU SSE           >>>>>---
122 cycles, memcpy_4-     1027 bytes- copy memcpy * 8
129 cycles, crt_memcpy-   1027 bytes- copy crt_memcpy
132 cycles, memcpy_2-      503 bytes- copy memcpy SSE
132 cycles, COPYAtoB_SSEJ-1027 bytes- copy 16 BYTES+MOVZX           <<<<---J
146 cycles, memcpy_1-      503 bytes- copy regcopy

184 cycles, memcpy_3-     2062 bytes- copy memcpyxmmU SSE           >>>>>---
187 cycles, memcpy_2-     1027 bytes- copy memcpy SSE
210 cycles, memcpy_4-     2062 bytes- copy memcpy * 8
215 cycles, crt_memcpy-   2062 bytes- copy crt_memcpy
262 cycles, COPYAtoB_SSEJ-2062 bytes- copy 16 BYTES+MOVZX           <<<<---J

286 cycles, memcpy_1-     1027 bytes- copy regcopy
297 cycles, memcpy_2-     2062 bytes- copy memcpy SSE
582 cycles, memcpy_1-     2062 bytes- copy regcopy
********** END III **********
Quote
CopyString55.txt
--------------------------------------------------------------
Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
--------------------------------------------------------------
***** Time table *****

14 cycles, crt_memcpy-     15 bytes- copy crt_memcpy
17 cycles, COPYAtoB_SSEK-  53 bytes- copy 16 BYTES+MOVZX       <<<<<---- K
17 cycles, COPYAtoB_SSEJ-  53 bytes- copy 16 BYTES+MOVZX       <<<<<---- J

25 cycles, memcpy_1-       53 bytes- copy regcopy
26 cycles, COPYAtoB_SSEJ-  15 bytes- copy 16 BYTES+MOVZX       <<<<<---- J
29 cycles, memcpy_1-       15 bytes- copy regcopy
30 cycles, COPYAtoB_SSEK-  15 bytes- copy 16 BYTES+MOVZX       <<<<<---- K
31 cycles, memcpy_4-       15 bytes- copy memcpy * 8
33 cycles, memcpy_2-       15 bytes- copy memcpy SSE
35 cycles, memcpy_3-       15 bytes- copy memcpyxmmU SSE
36 cycles, memcpy_4-       53 bytes- copy memcpy * 8
37 cycles, COPYAtoB_SSEJ- 103 bytes- copy 16 BYTES+MOVZX       <<<<<---- J
37 cycles, COPYAtoB_SSEK- 103 bytes- copy 16 BYTES+MOVZX       <<<<<---- K

37 cycles, crt_memcpy-     53 bytes- copy crt_memcpy
38 cycles, memcpy_2-       53 bytes- copy memcpy SSE
39 cycles, memcpy_3-       53 bytes- copy memcpyxmmU SSE
41 cycles, memcpy_1-      103 bytes- copy regcopy
46 cycles, memcpy_4-      103 bytes- copy memcpy * 8
47 cycles, crt_memcpy-    103 bytes- copy crt_memcpy
48 cycles, COPYAtoB_SSEJ- 203 bytes- copy 16 BYTES+MOVZX       <<<<<---- J
51 cycles, COPYAtoB_SSEK- 203 bytes- copy 16 BYTES+MOVZX       <<<<<---- K

52 cycles, memcpy_2-      103 bytes- copy memcpy SSE
53 cycles, memcpy_3-      103 bytes- copy memcpyxmmU SSE
55 cycles, memcpy_3-      203 bytes- copy memcpyxmmU SSE
62 cycles, memcpy_2-      203 bytes- copy memcpy SSE
67 cycles, COPYAtoB_SSEJ- 503 bytes- copy 16 BYTES+MOVZX       <<<<<---- J

68 cycles, memcpy_4-      203 bytes- copy memcpy * 8
69 cycles, memcpy_1-      203 bytes- copy regcopy
70 cycles, crt_memcpy-    203 bytes- copy crt_memcpy
72 cycles, memcpy_3-      503 bytes- copy memcpyxmmU SSE       >>>>>>>--
88 cycles, memcpy_4-      503 bytes- copy memcpy * 8
94 cycles, COPYAtoB_SSEK- 503 bytes- copy 16 BYTES+MOVZX       <<<<<---- K
97 cycles, crt_memcpy-    503 bytes- copy crt_memcpy

119 cycles, memcpy_3-     1027 bytes- copy memcpyxmmU SSE       >>>>>>--
126 cycles, memcpy_2-      503 bytes- copy memcpy SSE
133 cycles, memcpy_4-     1027 bytes- copy memcpy * 8
133 cycles, COPYAtoB_SSEJ-1027 bytes- copy 16 BYTES+MOVZX       <<<<<---- J
138 cycles, crt_memcpy-   1027 bytes- copy crt_memcpy
149 cycles, memcpy_1-      503 bytes- copy regcopy
162 cycles, COPYAtoB_SSEK-1027 bytes- copy 16 BYTES+MOVZX       <<<<<---- K
194 cycles, memcpy_2-     1027 bytes- copy memcpy SSE

196 cycles, memcpy_3-     2062 bytes- copy memcpyxmmU SSE       >>>>>>--
219 cycles, crt_memcpy-   2062 bytes- copy crt_memcpy
220 cycles, memcpy_4-     2062 bytes- copy memcpy * 8
262 cycles, COPYAtoB_SSEJ-2062 bytes- copy 16 BYTES+MOVZX       <<<<<---- J
289 cycles, memcpy_1-     1027 bytes- copy regcopy
296 cycles, COPYAtoB_SSEK-2062 bytes- copy 16 BYTES+MOVZX       <<<<<---- K

306 cycles, memcpy_2-     2062 bytes- copy memcpy SSE
597 cycles, memcpy_1-     2062 bytes- copy regcopy
********** END III **********
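The J versus K question can be checked directly against the CopyString55 table quoted above. The following is a minimal sketch in C, not part of the posted test programs: the helper names `count_faster` and `max_gap` are hypothetical, and the arrays simply transcribe the COPYAtoB_SSEJ and COPYAtoB_SSEK cycle counts from that one i7-3770 run.

```c
#include <assert.h>
#include <stddef.h>

/* Cycle counts transcribed from the CopyString55 run on the i7-3770.
   One entry per tested size: 15, 53, 103, 203, 503, 1027, 2062 bytes. */
#define NSIZES 7
const int sizes_bytes[NSIZES] = {15, 53, 103, 203, 503, 1027, 2062};
const int cycles_j[NSIZES]    = {26, 17,  37,  48,  67,  133,  262};
const int cycles_k[NSIZES]    = {30, 17,  37,  51,  94,  162,  296};

/* Hypothetical helper: number of sizes where a needs strictly fewer cycles than b. */
int count_faster(const int *a, const int *b, size_t n)
{
    int wins = 0;
    assert(a != NULL && b != NULL);
    for (size_t i = 0; i < n; ++i) {
        if (a[i] < b[i]) {
            ++wins;
        }
    }
    return wins;
}

/* Hypothetical helper: largest value of b[i] - a[i] over all sizes (0 if n == 0). */
int max_gap(const int *a, const int *b, size_t n)
{
    int gap = 0;
    assert(a != NULL && b != NULL);
    for (size_t i = 0; i < n; ++i) {
        int d = b[i] - a[i];
        if (i == 0 || d > gap) {
            gap = d;
        }
    }
    return gap;
}
```

Calling `count_faster(cycles_j, cycles_k, NSIZES)` and `max_gap(cycles_j, cycles_k, NSIZES)` summarizes this single run only. A gap of a few cycles at small sizes is probably within measurement noise, so any conclusion about push/pop cost would need repeated runs.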

RuiLoureiro

Hi Gunther,

            Now I want to test SSEL on your i7.
            Could you run CopyString56, please?
            Thanks.
       
These are my results.
Quote
-----------------------------------------------------
Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
-----------------------------------------------------
...   
  59 cycles, COPYAtoB_SSEL,   15 bytes - copy 16 BYTES+MOVZX
  64 cycles, COPYAtoB_SSEL,   53 bytes - copy 16 BYTES+MOVZX
199 cycles, COPYAtoB_SSEL,  103 bytes - copy 16 BYTES+MOVZX
445 cycles, COPYAtoB_SSEL,  203 bytes - copy 16 BYTES+MOVZX
959 cycles, COPYAtoB_SSEL,  503 bytes - copy 16 BYTES+MOVZX
2026 cycles, COPYAtoB_SSEL, 1027 bytes - copy 16 BYTES+MOVZX
4025 cycles, COPYAtoB_SSEL, 2062 bytes - copy 16 BYTES+MOVZX

  58 cycles, COPYAtoB_SSEJ,   15 bytes - copy 16 BYTES+MOVZX
  91 cycles, COPYAtoB_SSEJ,   53 bytes - copy 16 BYTES+MOVZX
204 cycles, COPYAtoB_SSEJ,  103 bytes - copy 16 BYTES+MOVZX
430 cycles, COPYAtoB_SSEJ,  203 bytes - copy 16 BYTES+MOVZX
974 cycles, COPYAtoB_SSEJ,  503 bytes - copy 16 BYTES+MOVZX
2046 cycles, COPYAtoB_SSEJ, 1027 bytes - copy 16 BYTES+MOVZX
4025 cycles, COPYAtoB_SSEJ, 2062 bytes - copy 16 BYTES+MOVZX

  50 cycles, COPYAtoB_SSEK,   15 bytes - copy 16 BYTES+MOVZX
  88 cycles, COPYAtoB_SSEK,   53 bytes - copy 16 BYTES+MOVZX
189 cycles, COPYAtoB_SSEK,  103 bytes - copy 16 BYTES+MOVZX
429 cycles, COPYAtoB_SSEK,  203 bytes - copy 16 BYTES+MOVZX
962 cycles, COPYAtoB_SSEK,  503 bytes - copy 16 BYTES+MOVZX
2035 cycles, COPYAtoB_SSEK, 1027 bytes - copy 16 BYTES+MOVZX
4044 cycles, COPYAtoB_SSEK, 2062 bytes - copy 16 BYTES+MOVZX

Gunther

Rui,

results are attached in c56.zip. I think it's time now to sum up and provide the source code for the best procedures.

Gunther

RuiLoureiro

Quote from: Gunther on July 11, 2014, 12:06:29 AM
Rui,
...
I think it's time now to sum up and provide the source code for the best procedures.
Gunther
Hi Gunther,
            You are doing very good work.
            Thank you so much for it.  :t

I am trying to find the best procedure that uses SSE (on your i7)
without alignment. I think I still need to adjust some code.
But if someone wants to see one of the procs, I will share it.
If you want, I will send it to you.

Only now can we compare COPYAtoB_SSEJ with COPYAtoB_SSEK,
and we can see that K is better than J, as I said before.
But it is not good for 15 bytes.
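For readers who have not seen the procedures, here is a minimal sketch in C of the general idea behind the "copy 16 BYTES" label: move 16 bytes per iteration with unaligned SSE2 loads and stores, then finish the remainder one byte at a time. This is not the MASM source of any COPYAtoB_SSEx procedure (that code is not posted here). The name `copy16_unaligned` is hypothetical, and the byte-wise tail is only an assumption standing in for whatever tail handling the real procedures use.

```c
#include <assert.h>
#include <stddef.h>

/* Use SSE2 intrinsics when the compiler targets them, otherwise fall back
   to the plain byte loop so the sketch still builds on other targets. */
#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>
#define HAVE_SSE2 1
#else
#define HAVE_SSE2 0
#endif

/* Hypothetical routine: copies len bytes from src to dst.
   Neither pointer has to be 16-byte aligned; the buffers must not overlap. */
void copy16_unaligned(void *dst, const void *src, size_t len)
{
    unsigned char *d = (unsigned char *)dst;
    const unsigned char *s = (const unsigned char *)src;
    size_t i = 0;

    assert(d != NULL && s != NULL);

#if HAVE_SSE2
    /* Main loop: one unaligned 16-byte load and one unaligned 16-byte store. */
    for (; i + 16 <= len; i += 16) {
        __m128i chunk = _mm_loadu_si128((const __m128i *)(s + i));
        _mm_storeu_si128((__m128i *)(d + i), chunk);
    }
#endif

    /* Tail: the remaining 0..15 bytes (or everything, without SSE2). */
    for (; i < len; ++i) {
        d[i] = s[i];
    }
}
```

With a length of 53, for example, the SSE2 path runs three 16-byte iterations and the tail loop copies the last 5 bytes. Unaligned loads and stores cost much more on some older CPUs than on newer ones, which may be one reason the same routine ranks differently on the P4 and on the i7.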

---------------------------------
Some results from Gunther
----------------------------------
Quote
--------------------------------------------------------------
Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
--------------------------------------------------------------
13 cycles, crt_memcpy,   15 bytes - copy crt_memcpy
35 cycles, crt_memcpy,   53 bytes - copy crt_memcpy


46 cycles, crt_memcpy,  103 bytes - copy crt_memcpy
72 cycles, crt_memcpy,  203 bytes - copy crt_memcpy
95 cycles, crt_memcpy,  503 bytes - copy crt_memcpy
138 cycles, crt_memcpy, 1027 bytes - copy crt_memcpy
219 cycles, crt_memcpy, 2062 bytes - copy crt_memcpy

40 cycles, memcpy_1,   15 bytes - copy memcpy_1
24 cycles, memcpy_1,   53 bytes - copy memcpy_1
39 cycles, memcpy_1,  103 bytes - copy memcpy_1
69 cycles, memcpy_1,  203 bytes - copy memcpy_1
147 cycles, memcpy_1,  503 bytes - copy memcpy_1
294 cycles, memcpy_1, 1027 bytes - copy memcpy_1
606 cycles, memcpy_1, 2062 bytes - copy memcpy_1

33 cycles, memcpy_2,   15 bytes - copy memcpy_2
38 cycles, memcpy_2,   53 bytes - copy memcpy_2
54 cycles, memcpy_2,  103 bytes - copy memcpy_2
62 cycles, memcpy_2,  203 bytes - copy memcpy_2
125 cycles, memcpy_2,  503 bytes - copy memcpy_2
189 cycles, memcpy_2, 1027 bytes - copy memcpy_2
304 cycles, memcpy_2, 2062 bytes - copy memcpy_2
-----------------------------------------------------------
43 cycles, COPYAtoB_SSEL,   15 bytes - copy 16 BYTES+MOVZX
16 cycles, COPYAtoB_SSEL,   53 bytes - copy 16 BYTES+MOVZX

36 cycles, COPYAtoB_SSEL,  103 bytes - copy 16 BYTES+MOVZX
51 cycles, COPYAtoB_SSEL,  203 bytes - copy 16 BYTES+MOVZX

109 cycles, COPYAtoB_SSEL,  503 bytes - copy 16 BYTES+MOVZX
165 cycles, COPYAtoB_SSEL, 1027 bytes - copy 16 BYTES+MOVZX
301 cycles, COPYAtoB_SSEL, 2062 bytes - copy 16 BYTES+MOVZX
-----------------------------------------------------------
34 cycles, memcpy_3,   15 bytes - copy memcpy_3
39 cycles, memcpy_3,   53 bytes - copy memcpy_3

52 cycles, memcpy_3,  103 bytes - copy memcpy_3
54 cycles, memcpy_3,  203 bytes - copy memcpy_3

71 cycles, memcpy_3,  503 bytes - copy memcpy_3
117 cycles, memcpy_3, 1027 bytes - copy memcpy_3
199 cycles, memcpy_3, 2062 bytes - copy memcpy_3
----------------------------------------------------------
32 cycles, memcpy_4,   15 bytes - copy memcpy_4
35 cycles, memcpy_4,   53 bytes - copy memcpy_4
46 cycles, memcpy_4,  103 bytes - copy memcpy_4
69 cycles, memcpy_4,  203 bytes - copy memcpy_4
87 cycles, memcpy_4,  503 bytes - copy memcpy_4
130 cycles, memcpy_4, 1027 bytes - copy memcpy_4
214 cycles, memcpy_4, 2062 bytes - copy memcpy_4

16 cycles, COPYAtoB_SSEE,   15 bytes - copy 16 BYTES+MOVZX
18 cycles, COPYAtoB_SSEE,   53 bytes - copy 16 BYTES+MOVZX
40 cycles, COPYAtoB_SSEE,  103 bytes - copy 16 BYTES+MOVZX
53 cycles, COPYAtoB_SSEE,  203 bytes - copy 16 BYTES+MOVZX
110 cycles, COPYAtoB_SSEE,  503 bytes - copy 16 BYTES+MOVZX
164 cycles, COPYAtoB_SSEE, 1027 bytes - copy 16 BYTES+MOVZX
300 cycles, COPYAtoB_SSEE, 2062 bytes - copy 16 BYTES+MOVZX

48 cycles, COPYAtoB_SSEH,   15 bytes - copy 16 BYTES+MOVZX
18 cycles, COPYAtoB_SSEH,   53 bytes - copy 16 BYTES+MOVZX
41 cycles, COPYAtoB_SSEH,  103 bytes - copy 16 BYTES+MOVZX
55 cycles, COPYAtoB_SSEH,  203 bytes - copy 16 BYTES+MOVZX
72 cycles, COPYAtoB_SSEH,  503 bytes - copy 16 BYTES+MOVZX
135 cycles, COPYAtoB_SSEH, 1027 bytes - copy 16 BYTES+MOVZX
282 cycles, COPYAtoB_SSEH, 2062 bytes - copy 16 BYTES+MOVZX

27 cycles, COPYAtoB_SSEI,   15 bytes - copy 16 BYTES+MOVZX
15 cycles, COPYAtoB_SSEI,   53 bytes - copy 16 BYTES+MOVZX
39 cycles, COPYAtoB_SSEI,  103 bytes - copy 16 BYTES+MOVZX
49 cycles, COPYAtoB_SSEI,  203 bytes - copy 16 BYTES+MOVZX
68 cycles, COPYAtoB_SSEI,  503 bytes - copy 16 BYTES+MOVZX
134 cycles, COPYAtoB_SSEI, 1027 bytes - copy 16 BYTES+MOVZX
268 cycles, COPYAtoB_SSEI, 2062 bytes - copy 16 BYTES+MOVZX

41 cycles, COPYAtoB_SSEJ,   15 bytes - copy 16 BYTES+MOVZX
20 cycles, COPYAtoB_SSEJ,   53 bytes - copy 16 BYTES+MOVZX
39 cycles, COPYAtoB_SSEJ,  103 bytes - copy 16 BYTES+MOVZX
59 cycles, COPYAtoB_SSEJ,  203 bytes - copy 16 BYTES+MOVZX
115 cycles, COPYAtoB_SSEJ,  503 bytes - copy 16 BYTES+MOVZX
163 cycles, COPYAtoB_SSEJ, 1027 bytes - copy 16 BYTES+MOVZX
309 cycles, COPYAtoB_SSEJ, 2062 bytes - copy 16 BYTES+MOVZX

41 cycles, COPYAtoB_SSEK,   15 bytes - copy 16 BYTES+MOVZX
19 cycles, COPYAtoB_SSEK,   53 bytes - copy 16 BYTES+MOVZX
38 cycles, COPYAtoB_SSEK,  103 bytes - copy 16 BYTES+MOVZX
52 cycles, COPYAtoB_SSEK,  203 bytes - copy 16 BYTES+MOVZX
87 cycles, COPYAtoB_SSEK,  503 bytes - copy 16 BYTES+MOVZX
139 cycles, COPYAtoB_SSEK, 1027 bytes - copy 16 BYTES+MOVZX
277 cycles, COPYAtoB_SSEK, 2062 bytes - copy 16 BYTES+MOVZX

RuiLoureiro

#134
Hi Gunther,

        I would like to see the results from CopyString57.
        Could you run it, please?
        Thanks.
(Unfortunately I do not yet have an i7 like yours to run it on.)

Here are some results.
Quote
memcpy_3 doesn't preserve ESI, EDI

-----------------------------------------------------
Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
-----------------------------------------------------
  35 cycles, crt_memcpy,   15 bytes - copy crt_memcpy
108 cycles, crt_memcpy,   53 bytes - copy crt_memcpy
139 cycles, crt_memcpy,  103 bytes - copy crt_memcpy
226 cycles, crt_memcpy,  203 bytes - copy crt_memcpy
446 cycles, crt_memcpy,  503 bytes - copy crt_memcpy
836 cycles, crt_memcpy, 1027 bytes - copy crt_memcpy
1670 cycles, crt_memcpy, 2062 bytes - copy crt_memcpy

  82 cycles, memcpy_1,   15 bytes - copy memcpy_1
  68 cycles, memcpy_1,   53 bytes - copy memcpy_1
106 cycles, memcpy_1,  103 bytes - copy memcpy_1
179 cycles, memcpy_1,  203 bytes - copy memcpy_1
400 cycles, memcpy_1,  503 bytes - copy memcpy_1
726 cycles, memcpy_1, 1027 bytes - copy memcpy_1
1596 cycles, memcpy_1, 2062 bytes - copy memcpy_1

251 cycles, memcpy_2,   15 bytes - copy memcpy_2
278 cycles, memcpy_2,   53 bytes - copy memcpy_2
383 cycles, memcpy_2,  103 bytes - copy memcpy_2
623 cycles, memcpy_2,  203 bytes - copy memcpy_2
1141 cycles, memcpy_2,  503 bytes - copy memcpy_2
2115 cycles, memcpy_2, 1027 bytes - copy memcpy_2
4113 cycles, memcpy_2, 2062 bytes - copy memcpy_2

  61 cycles, COPYAtoB_SSEL,   15 bytes - copy 16 BYTES+MOVZX
  73 cycles, COPYAtoB_SSEL,   53 bytes - copy 16 BYTES+MOVZX
198 cycles, COPYAtoB_SSEL,  103 bytes - copy 16 BYTES+MOVZX
441 cycles, COPYAtoB_SSEL,  203 bytes - copy 16 BYTES+MOVZX
957 cycles, COPYAtoB_SSEL,  503 bytes - copy 16 BYTES+MOVZX
2028 cycles, COPYAtoB_SSEL, 1027 bytes - copy 16 BYTES+MOVZX
4037 cycles, COPYAtoB_SSEL, 2062 bytes - copy 16 BYTES+MOVZX

256 cycles, memcpy_3,   15 bytes - copy memcpy_3
279 cycles, memcpy_3,   53 bytes - copy memcpy_3
380 cycles, memcpy_3,  103 bytes - copy memcpy_3
634 cycles, memcpy_3,  203 bytes - copy memcpy_3
1154 cycles, memcpy_3,  503 bytes - copy memcpy_3
2088 cycles, memcpy_3, 1027 bytes - copy memcpy_3
4364 cycles, memcpy_3, 2062 bytes - copy memcpy_3

  72 cycles, memcpy_4,   15 bytes - copy memcpy_4
102 cycles, memcpy_4,   53 bytes - copy memcpy_4
132 cycles, memcpy_4,  103 bytes - copy memcpy_4
213 cycles, memcpy_4,  203 bytes - copy memcpy_4
431 cycles, memcpy_4,  503 bytes - copy memcpy_4
828 cycles, memcpy_4, 1027 bytes - copy memcpy_4
1628 cycles, memcpy_4, 2062 bytes - copy memcpy_4

  50 cycles, COPYAtoB_SSEJ,   15 bytes - copy 16 BYTES+MOVZX
  90 cycles, COPYAtoB_SSEJ,   53 bytes - copy 16 BYTES+MOVZX
209 cycles, COPYAtoB_SSEJ,  103 bytes - copy 16 BYTES+MOVZX
432 cycles, COPYAtoB_SSEJ,  203 bytes - copy 16 BYTES+MOVZX
963 cycles, COPYAtoB_SSEJ,  503 bytes - copy 16 BYTES+MOVZX
2048 cycles, COPYAtoB_SSEJ, 1027 bytes - copy 16 BYTES+MOVZX
4021 cycles, COPYAtoB_SSEJ, 2062 bytes - copy 16 BYTES+MOVZX

  47 cycles, COPYAtoB_SSEK,   15 bytes - copy 16 BYTES+MOVZX
  85 cycles, COPYAtoB_SSEK,   53 bytes - copy 16 BYTES+MOVZX
186 cycles, COPYAtoB_SSEK,  103 bytes - copy 16 BYTES+MOVZX
431 cycles, COPYAtoB_SSEK,  203 bytes - copy 16 BYTES+MOVZX
961 cycles, COPYAtoB_SSEK,  503 bytes - copy 16 BYTES+MOVZX
2038 cycles, COPYAtoB_SSEK, 1027 bytes - copy 16 BYTES+MOVZX
4067 cycles, COPYAtoB_SSEK, 2062 bytes - copy 16 BYTES+MOVZX

  28 cycles, COPYAtoB_SSEM,   15 bytes - copy 16 BYTES+MOVZX
  61 cycles, COPYAtoB_SSEM,   53 bytes - copy 16 BYTES+MOVZX
177 cycles, COPYAtoB_SSEM,  103 bytes - copy 16 BYTES+MOVZX
416 cycles, COPYAtoB_SSEM,  203 bytes - copy 16 BYTES+MOVZX
962 cycles, COPYAtoB_SSEM,  503 bytes - copy 16 BYTES+MOVZX
2056 cycles, COPYAtoB_SSEM, 1027 bytes - copy 16 BYTES+MOVZX
4019 cycles, COPYAtoB_SSEM, 2062 bytes - copy 16 BYTES+MOVZX

  36 cycles, COPYAtoB_SSEN,   15 bytes - copy 16 BYTES+MOVZX
  66 cycles, COPYAtoB_SSEN,   53 bytes - copy 16 BYTES+MOVZX
158 cycles, COPYAtoB_SSEN,  103 bytes - copy 16 BYTES+MOVZX
421 cycles, COPYAtoB_SSEN,  203 bytes - copy 16 BYTES+MOVZX
950 cycles, COPYAtoB_SSEN,  503 bytes - copy 16 BYTES+MOVZX
2043 cycles, COPYAtoB_SSEN, 1027 bytes - copy 16 BYTES+MOVZX
4032 cycles, COPYAtoB_SSEN, 2062 bytes - copy 16 BYTES+MOVZX

  34 cycles, COPYAtoB_SSEA,   15 bytes - copy 16 BYTES+MOVZX
  97 cycles, COPYAtoB_SSEA,   53 bytes - copy 16 BYTES+MOVZX
212 cycles, COPYAtoB_SSEA,  103 bytes - copy 16 BYTES+MOVZX
446 cycles, COPYAtoB_SSEA,  203 bytes - copy 16 BYTES+MOVZX
988 cycles, COPYAtoB_SSEA,  503 bytes - copy 16 BYTES+MOVZX
2065 cycles, COPYAtoB_SSEA, 1027 bytes - copy 16 BYTES+MOVZX
4093 cycles, COPYAtoB_SSEA, 2062 bytes - copy 16 BYTES+MOVZX