The fastest way to fill a dword array with string values

Started by frktons, December 09, 2012, 02:49:23 AM

MichaelW

For Dave's code I get 2292 2299 2303 2302 2290 on a P4 Northwood and 2674 2676 2674 2677 2690 on a P3. Frank's EXE will not run on my P3 Windows 2000 system:

"Test_FillArray.exe is not a valid Win32 application."

But on my P4 Northwood Windows XP system I typically get:

------------------------------------------------------------------------
Intel(R) Pentium(R) 4 CPU 3.00GHz

Instructions: MMX, SSE1, SSE2
------------------------------------------------------------------------
2676    cycles for Dedndave code
------------------------------------------------------------------------
2655    cycles for Dedndave code
------------------------------------------------------------------------
2306    cycles for Dedndave code
------------------------------------------------------------------------
2277    cycles for Dedndave code
------------------------------------------------------------------------


But when I run it multiple times, the first one or two results are significantly higher than the others, suggesting that the code does not wait long enough after loading before it starts counting cycles. After adding

invoke Sleep, 5000

below the start label, and assembling and linking with ML 6.15.8803 and Link 5.12.8078, typical results are:


------------------------------------------------------------------------
Intel(R) Pentium(R) 4 CPU 3.00GHz

Instructions: MMX, SSE1, SSE2
------------------------------------------------------------------------
2278    cycles for Dedndave code
------------------------------------------------------------------------
2296    cycles for Dedndave code
------------------------------------------------------------------------
2263    cycles for Dedndave code
------------------------------------------------------------------------
2285    cycles for Dedndave code
------------------------------------------------------------------------
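
The warm-up effect described above is not specific to this testbed. As a rough illustration only, here is a Python sketch (not the MASM32 timing macros) that lets the process settle and discards the first runs before collecting samples. The names `timed_samples`, `settle_seconds` and `warmup` are hypothetical, and it measures nanoseconds with `time.perf_counter_ns`, not CPU cycles.

```python
import time


def timed_samples(func, settle_seconds=0.0, warmup=2, runs=5):
    """Return `runs` timings of func() in nanoseconds.

    settle_seconds: optional pause before measuring, in the spirit of the
                    Sleep call added below the start label (assumption).
    warmup:         number of untimed calls whose results are thrown away.
    """
    if settle_seconds > 0:
        time.sleep(settle_seconds)

    # Untimed calls: caches and the scheduler get a chance to settle.
    for _ in range(warmup):
        func()

    samples = []
    for _ in range(runs):
        start = time.perf_counter_ns()
        func()
        samples.append(time.perf_counter_ns() - start)
    return samples
```

Reporting the minimum or the median of the returned samples is one common way to reduce the influence of the occasional slow run.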


And the EXE will then run on my P3 system:

------------------------------------------------------------------------
Pentium III

Instructions: MMX, SSE1
------------------------------------------------------------------------
2457    cycles for Dedndave code
------------------------------------------------------------------------
2437    cycles for Dedndave code
------------------------------------------------------------------------
2437    cycles for Dedndave code
------------------------------------------------------------------------
2439    cycles for Dedndave code
------------------------------------------------------------------------

Well Microsoft, here's another nice mess you've gotten us into.

jj2007

That will be a close race between Dave and Frank...
:biggrin:
Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
loop overhead is approx. 189/100 cycles

1556    cycles for 100 * FA Dave
541     cycles for 100 * FA Jochen
1557    cycles for 100 * FA Frank

1556    cycles for 100 * FA Dave
539     cycles for 100 * FA Jochen
1557    cycles for 100 * FA Frank

1558    cycles for 100 * FA Dave
542     cycles for 100 * FA Jochen
1564    cycles for 100 * FA Frank

230     bytes for FA Dave
281     bytes for FA Jochen
350     bytes for FA Frank

frktons

Quote from: jj2007 on December 09, 2012, 01:21:46 PM
That will be a close race between Dave and Frank...
:biggrin:

Actually, I didn't write any code; I only posted Dave's code in
another testbed. With your routine you demonstrated that
Dave's good work could be optimized further with better registers
and parallel computation.
There are only two days a year when you can't do anything: one is called yesterday, the other is called tomorrow, so today is the right day to love, believe, do and, above all, live.

Dalai Lama

dedndave

i don't think Frank has any working code, per se
looks like you got the speed thing   :t
did you verify the resulting table ?   :P

Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
++18 of 20 tests valid, loop overhead is approx. 246/100 cycles

2069    cycles for 100 * FA Dave
915     cycles for 100 * FA Jochen
2230    cycles for 100 * FA Frank

2029    cycles for 100 * FA Dave
913     cycles for 100 * FA Jochen
2592    cycles for 100 * FA Frank

2034    cycles for 100 * FA Dave
930     cycles for 100 * FA Jochen
2217    cycles for 100 * FA Frank

frktons

Jochen,
the source you posted is in the MasmBasic editor's format, I suppose.
Could you post a plain ASCII text file, so that we can see what it
actually does?

Frank

dedndave

it's a "rich" text file, Frank
i think you can open it with WordPad, if you don't have anything else   :t

dedndave

this is Jochen's data and code
align 16
TestB_s:
Src0123 db "   0   1   2   3"
Add4444 dd 04000000h, 04000000h, 04000000h, 04000000h ; xmm1
Add44xx dd 04000000h, 04000000h, 0FA110000h, 0FA110000h ; xmm2
Addxx44 dd 0FA110000h, 0FA110000h, 04000000h, 04000000h ; xmm3
Addxxxx dd 0FA010000h, 0FA010000h, 0FA010000h, 0FA010000h ; xmm4

Add244xx dd 04000000h, 04000000h, 0FA010000h, 0FA010000h ; xmm2
Add2xx44 dd 0FA010000h, 0FA010000h, 04000000h, 04000000h ; xmm3
Add2xxxx dd 0FA010000h, 0FA010000h, 0FA010000h, 0FA010000h ; xmm4

Sub100a dd 00009EF00h, 00009EF00h, 00009EF00h, 00009EF00h
Sub100b dd 00009FF00h, 00009FF00h, 00009FF00h, 00009FF00h

NameB equ FA Jochen ; assign a descriptive name here
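TestB proc
    mov esi, offset Src0123
    mov edi, offset MyArray
    push edi                        ; keep the array address for the return value
    xor ecx, ecx
    movaps xmm0, [esi]              ; "   0   1   2   3"
    movaps xmm1, [esi+16]           ; Add4444
    movaps xmm2, [esi+32]           ; Add44xx
    movaps xmm3, [esi+48]           ; Addxx44
    movaps xmm4, [esi+64]           ; Addxxxx
    lea edx, [edi+4000]             ; end of the array (1000 dwords)
    m2m ecx, -5                     ; counts iterations of 20 entries each
    ; align 4
    .Repeat
        movaps [edi], xmm0
        paddd xmm0, xmm1            ; 4444
        movaps [edi+16], xmm0
        paddd xmm0, xmm2            ; 44xx
        movaps [edi+32], xmm0
        paddd xmm0, xmm3            ; xx44
        movaps [edi+48], xmm0
        paddd xmm0, xmm1            ; 4444
        movaps [edi+64], xmm0
        paddd xmm0, xmm4            ; xxxx
        inc ecx
        .if Zero?                   ; 0..99 written: carry into the hundreds column
            psubd xmm0, oword ptr Sub100a
        .elseif ecx==-4             ; 0..19 written: switch to the two-digit carry constants
            movaps xmm2, [esi+80]   ; Add244xx
            movaps xmm3, [esi+96]   ; Add2xx44
            movaps xmm4, [esi+112]  ; Add2xxxx
        .elseif ecx==5              ; another 100 written: next hundreds digit
            psubd xmm0, oword ptr Sub100b
            xor ecx, ecx
        .endif
        add edi, 80                 ; five 16-byte stores per iteration
    .Until edi>=edx
    pop eax                         ; eax = offset MyArray
    ret
TestB endp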
TestB_endp:
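
To check what the constants above do, here is a small Python model of the loop. It is only a sketch: `paddd` and `psubd` are emulated as four independent 32-bit lane operations, the function name `fill_array_model` is made up, and it says nothing about speed. It assumes the intended result is 1000 four-byte entries, each number right-justified with leading spaces, as Src0123 suggests.

```python
import struct

MASK32 = 0xFFFFFFFF


def paddd(a, b):
    """Lane-wise 32-bit add with wraparound, like SSE2 paddd."""
    return [(x + y) & MASK32 for x, y in zip(a, b)]


def psubd(a, b):
    """Lane-wise 32-bit subtract with wraparound, like SSE2 psubd."""
    return [(x - y) & MASK32 for x, y in zip(a, b)]


def fill_array_model():
    """Emulate the TestB loop and return the 4000-byte table."""
    xmm0 = list(struct.unpack("<4I", b"   0   1   2   3"))
    add4444 = [0x04000000] * 4
    add44xx = [0x04000000, 0x04000000, 0xFA110000, 0xFA110000]
    addxx44 = [0xFA110000, 0xFA110000, 0x04000000, 0x04000000]
    addxxxx = [0xFA010000] * 4
    add2_44xx = [0x04000000, 0x04000000, 0xFA010000, 0xFA010000]
    add2_xx44 = [0xFA010000, 0xFA010000, 0x04000000, 0x04000000]
    add2_xxxx = [0xFA010000] * 4
    sub100a = [0x0009EF00] * 4
    sub100b = [0x0009FF00] * 4

    xmm1, xmm2, xmm3, xmm4 = add4444, add44xx, addxx44, addxxxx
    out = []
    ecx = -5
    while len(out) < 1000:
        # Five stores per iteration, each followed by one packed add.
        for step in (xmm1, xmm2, xmm3, xmm1, xmm4):
            out.extend(xmm0)
            xmm0 = paddd(xmm0, step)
        ecx += 1
        if ecx == 0:            # 0..99 done: carry into the hundreds column
            xmm0 = psubd(xmm0, sub100a)
        elif ecx == -4:         # 0..19 done: two-digit carry constants
            xmm2, xmm3, xmm4 = add2_44xx, add2_xx44, add2_xxxx
        elif ecx == 5:          # another 100 done: next hundreds digit
            xmm0 = psubd(xmm0, sub100b)
            ecx = 0
    return struct.pack("<1000I", *out)
```

Comparing the returned bytes with a table built by ordinary string formatting is one way to confirm that the add and subtract constants produce the expected entries.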


jj2007

Quote from: dedndave on December 09, 2012, 02:04:35 PM
it's a "rich" text file, Frank
i think you can open it with WordPad, if you don't have anything else   :t

Frank does have something else: \Masm32\RichMasm\RichMasm.exe
Drag *.asc over RichMasm.exe, then click on the bookmarks "Test A" and "Test B" on the right :icon_mrgreen:

By the way: performance drops dramatically if you replace movaps with movups. But there is a well-known workaround, see attachment.

For MisAlign=0 and MisAlign=8, performance is almost like movaps (ca. 8 cycles slower, see below). For all other values, it is still almost 500 cycles faster than Dave's non-SSE code.

MisAlign=0:
Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
loop overhead is approx. 189/100 cycles

1558    cycles for 100 * FA Dave
540     cycles for 100 * FA Jochen
547     cycles for 100 * FA Jochen unaligned

1557    cycles for 100 * FA Dave
539     cycles for 100 * FA Jochen
548     cycles for 100 * FA Jochen unaligned


MisAlign=3:
Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
loop overhead is approx. 189/100 cycles

1845    cycles for 100 * FA Dave
1365    cycles for 100 * FA Jochen unaligned

1853    cycles for 100 * FA Dave
1362    cycles for 100 * FA Jochen unaligned

sinsi

I got times around 850 for my code (no sse) but half the size.
Here's your latest jj

AMD Phenom(tm) II X6 1100T Processor (SSE3)
loop overhead is approx. 231/100 cycles

856     cycles for 100 * FA Dave
449     cycles for 100 * FA Jochen
436     cycles for 100 * FA Jochen unaligned

803     cycles for 100 * FA Dave
445     cycles for 100 * FA Jochen
435     cycles for 100 * FA Jochen unaligned

809     cycles for 100 * FA Dave
454     cycles for 100 * FA Jochen
433     cycles for 100 * FA Jochen unaligned

230     bytes for FA Dave
281     bytes for FA Jochen
141     bytes for FA Jochen unaligned


TouEnMasm

Here is a source code to verify mmx is faster:
Quote
   .data
   array dd 1000 dup (0)
   controle dd 0
   Nomfichier db "resultat.txt",0
   Hfile dd 0
   NumberOfBytesWritten dd 0
   retourligne db 13,10,0
   .code
   
   start:
   mov eax,30303020h ; 000
   mov edx,offset array
   unit:
   mov [edx],eax   ;000
   add eax,1000000h
   add edx,4
   mov [edx],eax   ;1
   add eax,1000000h
   add edx,4   
   mov [edx],eax   ;2
   add eax,1000000h
   add edx,4   
   mov [edx],eax   ;3
   add eax,1000000h
   add edx,4   
   mov [edx],eax   ;4
   add eax,1000000h
   add edx,4   
   mov [edx],eax   ;5
   add eax,1000000h
   add edx,4   
   mov [edx],eax   ;6
   add eax,1000000h
   add edx,4   
   mov [edx],eax   ;7
   add eax,1000000h
   add edx,4   
   mov [edx],eax   ;8
   add eax,1000000h
   add edx,4   
   mov [edx],eax   ;9
   ;-----------------------------
   sub eax,9000000h
   add eax,10000h
   add edx,4
   mov ecx,eax
   shr ecx,16
   .if cl == 3Ah     ;tens digit went past '9'
      sub eax,0A0000h
      inc ah   ;next hundreds digit
      .if ah == 3Ah
         ;done
         jmp fin
      .else
         jmp unit
      .endif
   .else
      jmp unit
   .endif   
   fin:
   lea edx,controle   ;debug: the guard dword after the array must still be 0
   lea edx,array   ;debug: inspect what is in memory
   invoke CreateFile,addr Nomfichier,GENERIC_WRITE,NULL,\
         NULL,CREATE_ALWAYS,FILE_ATTRIBUTE_NORMAL,0
   mov Hfile,eax
   mov edx,offset array
   mov ecx,0
   ecrire:
   push edx
   push ecx
   invoke WriteFile,Hfile,edx,400,addr NumberOfBytesWritten,NULL
   invoke WriteFile,Hfile,addr retourligne,2,addr NumberOfBytesWritten,NULL
   pop ecx
   pop edx
   add edx,400
   add ecx,100
   .if ecx != 1000
      jmp ecrire
   .endif
      
   invoke CloseHandle,Hfile
   invoke ExitProcess,0
;################################################################   
   end start      
The ASCII table, 000 to 999, is written to the text file.
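
For readers who want to follow the digit arithmetic without assembling anything, here is a Python model of the loop above. It is a sketch under assumptions: the function name `fill_with_leading_zeros` is invented, eax is modelled as a plain integer, and only the table generation is covered, not the file output.

```python
def fill_with_leading_zeros():
    """Model of the scalar loop: one dword per entry, digits kept as ASCII.

    Byte layout of eax (little-endian in memory):
      byte 0 = space, byte 1 = hundreds, byte 2 = tens, byte 3 = units.
    """
    eax = 0x30303020                    # " 000"
    out = bytearray()
    while True:
        # Ten stores; the units digit is bumped between them.
        for _ in range(9):
            out += eax.to_bytes(4, "little")
            eax += 0x01000000
        out += eax.to_bytes(4, "little")    # units digit is '9' here

        eax -= 0x09000000               # units back to '0'
        eax += 0x00010000               # next tens digit
        if (eax >> 16) & 0xFF == 0x3A:  # tens went past '9'
            eax -= 0x000A0000           # tens back to '0'
            eax += 0x00000100           # inc ah: next hundreds digit
            if (eax >> 8) & 0xFF == 0x3A:
                return bytes(out)       # hundreds went past '9': finished
```

Note that this model, like the code it follows, keeps the leading zeros: the first entry is " 000" and the last is " 999".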


Fa is a musical note to play with CL

sinsi

116 bytes of slackware

TestA proc
    pusha
   
    mov edi,offset MyArray
    mov ebp,00303030h
    mov ecx,10
l2: mov edx,10
    push ebp
l1: call proc0
    add ebp,00000100h
    sub edx,1
    jnz l1
    pop ebp
    inc ebp ;add ebp,00000001h
    sub ecx,1
    jnz l2
    popa
    mov eax,offset MyArray
    ret
proc0:
    push ecx
    push edx
   
    lea eax,[ebp]
    lea ebx,[ebp+00010000h]
    lea ecx,[ebp+00020000h]
    lea edx,[ebp+00030000h]
    lea esi,[ebp+00040000h]

    mov ebp,00050000h
   
    mov [edi],eax
    mov [edi+4],ebx
    mov [edi+8],ecx
    mov [edi+12],edx
    mov [edi+16],esi
   
    add eax,ebp
    add ebx,ebp
    add ecx,ebp
    add edx,ebp
    add esi,ebp

    mov [edi+20],eax
    mov [edi+24],ebx
    mov [edi+28],ecx
    mov [edi+32],edx
    mov [edi+36],esi
   
    add edi,40
    pop edx
    pop ecx
    ret
   


TestA endp

This would be nice to do in 64-bit with all those regs, but the xmm's are much better.

jj2007

Quote from: ToutEnMasm on December 09, 2012, 07:03:32 PM
Here is a source code to verify mmx is faster:

Where's da mmx, Yves?

Quote from: sinsi on December 09, 2012, 07:15:06 PM
116 bytes of slackware
...
This would be nice to do in 64-bit with all those regs, but the xmm's are much better.

SSE2 is difficult to beat indeed...

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
loop overhead is approx. 189/100 cycles

1557    cycles for 100 * FA Dave
540     cycles for 100 * FA Jochen
1838    cycles for 100 * FA Sinsi
547     cycles for 100 * FA Jochen unaligned
1540    cycles for 100 * FA Yves

sinsi

Just goes to show, these tests are of academic interest only  :biggrin:

AMD Phenom(tm) II X6 1100T Processor (SSE3)
loop overhead is approx. 228/100 cycles

786     cycles for 100 * FA Dave
441     cycles for 100 * FA Jochen
849     cycles for 100 * FA Sinsi
419     cycles for 100 * FA Jochen unaligned
1021    cycles for 100 * FA Yves

786     cycles for 100 * FA Dave
437     cycles for 100 * FA Jochen
849     cycles for 100 * FA Sinsi
431     cycles for 100 * FA Jochen unaligned
1029    cycles for 100 * FA Yves

779     cycles for 100 * FA Dave
435     cycles for 100 * FA Jochen
846     cycles for 100 * FA Sinsi
417     cycles for 100 * FA Jochen unaligned
1023    cycles for 100 * FA Yves

230     bytes for FA Dave
281     bytes for FA Jochen
116     bytes for FA Sinsi
141     bytes for FA Jochen unaligned
141     bytes for FA Yves

4208864 = eax FA Dave
4208864 = eax FA Jochen
4208864 = eax FA Sinsi
4208864 = eax FA Jochen unaligned
4208864 = eax FA Yves


dedndave

it should generate leading-zero suppressed strings, right-justified in a field of 3 spaces
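
A requirement like this is easy to check mechanically. The following Python sketch is one possible checker, not part of any testbed in this thread: `expected_table` and `first_mismatch` are hypothetical names, and it assumes each entry occupies one dword (4 bytes), as in Src0123, with the number right-justified and padded with spaces.

```python
ENTRY_SIZE = 4      # assumption: one dword per entry
COUNT = 1000        # entries 0..999


def expected_table() -> bytes:
    """Reference table: leading zeros suppressed, right-justified."""
    return b"".join(
        str(n).rjust(ENTRY_SIZE).encode("ascii") for n in range(COUNT)
    )


def first_mismatch(table: bytes):
    """Return the index of the first wrong entry, or None if all match."""
    if len(table) != ENTRY_SIZE * COUNT:
        raise ValueError("table must be exactly 4000 bytes")
    want = expected_table()
    for n in range(COUNT):
        lo = n * ENTRY_SIZE
        if table[lo:lo + ENTRY_SIZE] != want[lo:lo + ENTRY_SIZE]:
            return n
    return None
```

Feeding it a dump of MyArray taken after each routine has run would show which of the candidates actually produce the requested format, for example a table with leading zeros fails at entry 0.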

Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
loop overhead is approx. 240/100 cycles

2040    cycles for 100 * FA Dave
919     cycles for 100 * FA Jochen
2609    cycles for 100 * FA Sinsi
1226    cycles for 100 * FA Jochen unaligned
1854    cycles for 100 * FA Yves

2040    cycles for 100 * FA Dave
943     cycles for 100 * FA Jochen
2606    cycles for 100 * FA Sinsi
1197    cycles for 100 * FA Jochen unaligned
1816    cycles for 100 * FA Yves

2037    cycles for 100 * FA Dave
920     cycles for 100 * FA Jochen
2602    cycles for 100 * FA Sinsi
1196    cycles for 100 * FA Jochen unaligned
1892    cycles for 100 * FA Yves

jj2007

Quote from: sinsi on December 09, 2012, 08:05:45 PM
Just goes to show, these tests are of academic interest only  :biggrin:

Absolutely :greensml:

New version:

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
loop overhead is approx. 189/100 cycles

1556    cycles for 100 * FA Dave
1811    cycles for 100 * FA Sinsi
441     cycles for 100 * FA Jochen unaligned
1541    cycles for 100 * FA Yves


P.S.: If you don't agree with the suffix "FINAL", write a faster algo :bgrin: