reason to switch to 64 Bit Assembler

habran · February 22, 2013, 08:56:19 AM

as I said before, this routine is made SPECIFICALLY for UNALIGNED data
that is why I use the MOVDQU instruction
there is no reason to create a sophisticated algorithm for aligned data
you can just use the fastest instruction your machine supports

Code Select


    ;r8 can contain sizeof(buffer)
    .for (rcx=dest,rdx=src,r8=count,r8>>=5¦r8¦rcx+=32,rdx+=32,r8--)
             vmovdqa ymm4,[rdx]
             vmovdqa [rcx],ymm4
    .endfor
    ;or for 16 byte xmm:

    ;r8 can contain sizeof(buffer)
    .for (rcx=dest,rdx=src,r8=count,r8>>=4¦r8¦rcx+=16,rdx+=16,r8--)
             movdqa xmm4,[rdx]
             movdqa [rcx],xmm4
    .endfor

    ;for 32 bit machine
    ;eax can contain sizeof(buffer)
    .for (ecx=dest,edx=src,eax=count,eax>>=4¦eax¦ecx+=16,edx+=16,eax--)
             movdqa xmm4,[edx]
             movdqa [ecx],xmm4
    .endfor
;or for JJ2007
     mov ecx,dest
     mov edx,src
     mov eax,sizeof(buffer)
     shr eax,4
     .while (eax)
             movdqa xmm4,[edx]
             movdqa [ecx],xmm4
             add edx,16
             add ecx,16
             dec eax
      .endw
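
A side note on the shift counts above: `r8>>=5` and `r8>>=4` turn the byte count into a block count, so any bytes beyond the last whole block are not touched by these loops. Here is a small C sketch of that arithmetic; the names `copy_split` and `split_count` are made up for illustration and are not part of the routine.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not from the thread: how a byte count splits into
   whole SIMD blocks plus a leftover tail that the loop never copies. */
typedef struct {
    size_t blocks; /* number of loop iterations */
    size_t tail;   /* bytes left over after the last whole block */
} copy_split;

/* block_shift is 4 for 16-byte xmm moves and 5 for 32-byte ymm moves,
   mirroring the >>=4 and >>=5 in the loops above. */
static copy_split split_count(size_t count, unsigned block_shift)
{
    copy_split s;
    assert(block_shift < sizeof(size_t) * 8);
    s.blocks = count >> block_shift;
    s.tail   = count & (((size_t)1 << block_shift) - 1);
    return s;
}
```

If the buffer size is a multiple of the block size the tail is zero and nothing is lost; otherwise the tail would need a separate byte-wise copy.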

habran · February 22, 2013, 09:17:31 AM

we can also use this:
mov ecx,dest
mov edx,src
mov eax,sizeof(buffer)   ;must be a multiple of 16
sub eax,16
.while (SDWORD eax >= 0) ;>= 0 so that the block at offset 0 is copied too
movdqa xmm4,[edx+eax]
movdqa [ecx+eax],xmm4
sub eax,16
.endw
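
A rough C model of this descending-offset loop, only to show the control flow. The signed test has to admit offset 0, otherwise the first 16 bytes are never copied. `memcpy` of 16 bytes stands in for the `movdqa` load/store pair, and the function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Model of the loop: copy 16-byte blocks starting at the last block and
   walking down to offset 0. size must be a multiple of 16, just as it
   must be for the movdqa version. */
static void copy_blocks_descending(unsigned char *dest,
                                   const unsigned char *src, size_t size)
{
    assert(size % 16 == 0);
    /* Signed offset, like SDWORD eax: it goes negative after the block
       at offset 0 has been copied, and that ends the loop. */
    for (ptrdiff_t off = (ptrdiff_t)size - 16; off >= 0; off -= 16)
        memcpy(dest + off, src + off, 16);
}
```

One register serves as both the index and the loop counter, which is why the assembler version needs only a single `sub` per iteration.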

habran · February 22, 2013, 09:35:21 AM

we can use macros rather than subroutines
like this:

Code Select


xmcopy16 MACRO dest,src,size
     mov ecx,dest
     mov edx,src
     mov eax,size                ;size must be a multiple of 16
     sub eax,16
     .while (SDWORD eax >= 0)
         movdqa xmm4,[edx+eax]   ;dest and src must be 16-byte aligned
         movdqa [ecx+eax],xmm4
         sub eax,16
     .endw
ENDM
xmcopy32 MACRO dest,src,size
     mov rcx,dest
     mov rdx,src
     mov rax,size                ;size must be a multiple of 32
     sub rax,32
     .while (SQWORD rax >= 0)
         vmovdqa ymm4,[rdx+rax]  ;AVX, dest and src must be 32-byte aligned
         vmovdqa [rcx+rax],ymm4
         sub rax,32
     .endw
ENDM

habran · February 22, 2013, 10:19:03 AM

here is the test of JJ's exe on my computer

Code Select


Intel(R) Core(TM) i7-3610QM CPU @ 2.30GHz (SSE4)

Algo           memcpy   MemCo1   MemCo2  MemCoC3  MemCoP4  MemCoC2   MemCoL  xmemcpy
Description       CRT rep movs   movdqa  lps+hps   movdqa   movdqa   Masm32 Habran's
                       dest-al    psllq CeleronM  dest-al   src-al  library  Ferrari
Code size           ?       70      291      222      200      269       33      104
------------------------------------------------------------------------------------
2048, d0s0-0      133      184      205      203      202      204      184      238
2048, d1s1-0      225      206      226      223      227      223      225      249
2048, d7s7-0      225      208      229      225      228      216      225      249
2048, d7s8-1      228      223      501      367      209      219      221      246
2048, d7s9-2      225      218      498      365      206      217      221      245
2048, d8s7+1      221      217      502      390      206      219      221      244
2048, d8s8-0      221      205      238      229      232      235      221      244
2048, d8s9-1      222      218      492      365      204      218      222      245
2048, d9s7+2      220      217      488      390      206      219      221      244
2048, d9s8+1      226      218      491      390      206      219      221      245
2048, d9s9-0      221      206      224      222      224      226      221      245
2048, d15s15      221      206      226      222      225      226      221      245


--- ok ---

It is interesting how steady the speed of my code is across different sizes
and it is interesting how older processors perform differently from newer ones

thank you JJ for taking the time to write the testing programs :t
however, I suspect that you are pulling my leg, because I have neither the time nor the desire to learn your BASIC$ ;)
(for the reason I mentioned before)
when I talk about the beauty of source code I mean its visual effect, readability and functionality
sometimes your programs may even be faster than someone else's, but no one will try to read them
because most of your MULTO IMPORTANTE routines are hidden either in $$$$$ macros or %$#% external functions
however, it is a pleasure to exchange opinions and see the diversity in programming techniques

habran · February 22, 2013, 10:28:04 AM

Japheth,

Quote
The fastest machine that I have available is an 5 year old AMD 64 X2 5000+.

I saw on Google that they are advertising new laptops for $249 (probably with AVX)

dedndave · February 22, 2013, 11:03:36 AM

actually, Jochen's MasmBasic is a very productive library
you can bet many of the routines are quite fast
and - many of the functions aren't found in the masm32 library
i would use it more often, myself, except for one thing....

i am trying to learn assembler for windows
high-level constructs mask the assembler code i am trying to learn
the same may be said for many of your macros

habran · February 22, 2013, 01:24:53 PM

hi dedndave,

Quote
actually, Jochen's MasmBasic is a very productive library
you can bet many of the routines are quite fast
and - many of the functions aren't found in the masm32 library

there is no doubt about that :t
we are talking here about readability of sources

as soon as I look at his source code I feel like piercing my eyes with a cactus thorn
programs that look like this: "LET$!@#$%^&*@#$%^&*!"
who nowadays has enough patience and concentration to follow this code?
"Mission Impossible 32" with JJ2007 as the main actor (Tom Cruise refused the role because of his age)
and he is hiding his most important sources from public eyes like double agent 007
another drawback is that Jochen's MasmBasic is 32 bit and I program only in 64 bit

I love assembler, that's why I joined this forum; otherwise I would be a member of some BASIC community

please don't tell JJ about our conversation, I don't want him to feel bad because I like him and appreciate his brains

Macros are helpful to make programs more readable but they should be visible to programmers and named properly

habran · February 22, 2013, 01:51:55 PM

thanks Gunther for your contribution to this topic :t

Quote
Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)

Speedy Gonzales like my "Ferrari Testarossa xmemcpy"
(AVX tires would make it even faster)

thanks to our God Father JJ Corleone for naming it so

habran · February 24, 2013, 09:04:41 AM

Japheth,

Quote
I fully agree! However - almost 100% faster than MB - which allegedly is already rocket-science? How is this possible? You must do something wrong...

I found this explanation in "INTEL® 64 AND IA-32 PROCESSOR ARCHITECTURES" manual

Quote
2.3.5.1 Efficient Handling of Alignment Hazards
The cache and memory subsystems handles a significant percentage of instructions
in every workload. Different address alignment scenarios will produce varying performance
impact for memory and cache operations. For example, 1-cycle throughput of
L1 (see Table 2-21) generally applies to naturally-aligned loads from L1 cache. But
using unaligned load instructions (e.g. MOVUPS, MOVUPD, MOVDQU, etc.) to access
data from L1 will experience varying amount of delays depending on specific microarchitectures
and alignment scenarios.
Table 2-21. Performance Impact of Address Alignments of MOVDQU from L1

Throughput (cycle)                 Intel Core i7   45 nm Intel Core    65 nm Intel Core
                                   Processor       Microarchitecture   Microarchitecture
Alignment Scenario                 06_1AH          06_17H              06_0FH
---------------------------------------------------------------------------------------
16B aligned                        1               2                   2
Not-16B aligned, not cache split   1               ~2                  ~2
Split cache line boundary          ~4.5            ~20                 ~20

Because my processor is a 2.3 GHz Core i7 with a lot of cache,
it takes only 1 cycle for either MOVDQU or MOVDQA (as long as the load does not split a cache line)
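
To make the three rows of the table concrete, here is a small C sketch that sorts a 16-byte access into the same three alignment scenarios. It assumes 64-byte cache lines, which is my assumption and is not stated in the quoted passage. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum access_kind { ALIGNED_16, UNALIGNED_NO_SPLIT, CACHE_LINE_SPLIT };

/* Classify one 16-byte load at addr. Assumes 64-byte cache lines. */
static enum access_kind classify_16b_access(uintptr_t addr)
{
    if ((addr & 15) == 0)
        return ALIGNED_16;
    /* A 16-byte access starting at line offset 49..63 spills into the
       next cache line. */
    if ((addr & 63) > 48)
        return CACHE_LINE_SPLIT;
    return UNALIGNED_NO_SPLIT;
}

/* Count how many of the consecutive 16-byte loads covering
   [start, start + bytes) split a cache line. */
static size_t count_split_loads(uintptr_t start, size_t bytes)
{
    size_t splits = 0;
    assert(bytes % 16 == 0);
    for (size_t off = 0; off < bytes; off += 16)
        if (classify_16b_access(start + off) == CACHE_LINE_SPLIT)
            splits++;
    return splits;
}
```

Under that assumption, a 2048-byte copy from a misaligned source has one split load in every four, so the per-load cost of the split case in the table still matters for unaligned data.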

drifter · February 26, 2013, 05:49:48 PM

Intel(R) Core(TM) i7 CPU 860 @ 2.80GHz (SSE4)

Algo           memcpy   MemCo1   MemCo2  MemCoC3  MemCoP4  MemCoC2   MemCoL  xmemcpy
Description       CRT rep movs   movdqa  lps+hps   movdqa   movdqa   Masm32 Habran's
                       dest-al    psllq CeleronM  dest-al   src-al  library  Ferrari
Code size           ?       70      291      222      200      269       33      104
------------------------------------------------------------------------------------
2048, d0s0-0      196      257      252      252      583      235      600      444
2048, d1s1-0      460      274      687      690      284      277      704      444
2048, d7s7-0      468      277      286      286      289      281      293      444
2048, d7s8-1      302      299      732      521      240      253      705      444
2048, d7s9-2      302      300      867      607      256      253      294      445
2048, d8s7+1      294      726      640      551      239      247      265      444
2048, d8s8-0      471      280      700      288      287      282      293      443
2048, d8s9-1      295      303      637      522      272      253      704      444
2048, d9s7+2      300      301      634      553      289      593      703      444
2048, d9s8+1      301      724      633      552      277      247      292      444
2048, d9s9-0      469      670      694      269      696      282      294      446
2048, d15s15      415      280      289      251      289      284      293      447

--- ok ---

on: February 10, 2013, 11:57:14 PM Gunther wrote:

Quote
There are a few applications which really need more than 4 GB RAM (large data bases for example), but others do not.

The transporters of the future will need to access 7,000,000,000,000,000,000,000,000,000 points of data - that's a 795,807,864,054,000.1 terabyte address space :icon_eek:

Gunther · February 26, 2013, 09:24:23 PM

Hi drifter,

Quote from: drifter on February 26, 2013, 05:49:48 PM
The transporters of the future will need to access 7,000,000,000,000,000,000,000,000,000 points of data - that's a 795,807,864,054,000.1 terabyte address space :icon_eek:

that might be, but that could be reached with a 64 bit architecture. But what about the whole bunch of other applications? By the way, you'll find a few 64 bit applications in the forum, which I've written.

Gunther

habran · February 26, 2013, 10:08:02 PM

hello drifter,
welcome to the forum

it is interesting to see the difference in speed between different processors
yours is an i7 at 2.8 GHz and mine is an i7 at 2.3 GHz, yet the speed on mine is about double
I am curious why that is
Gunther, yours is an i7 at 3.4 GHz and it is still slower than qWord's and mine

habran · February 26, 2013, 10:29:12 PM

here are specifications:

Specifications (Essentials)  Intel® Core™ i7-3610QM       Intel® Core™ i7-3770
                             (6M Cache, up to 3.30 GHz)   (8M Cache, up to 3.90 GHz)
------------------------------------------------------------------------------------
Status                       Launched                     Launched
Launch Date                  Q2'12                        Q2'12
Processor Number             i7-3610QM                    i7-3770
# of Cores                   4                            4
# of Threads                 8                            8
Clock Speed                  2.3 GHz                      3.4 GHz
Max Turbo Frequency          3.3 GHz                      3.9 GHz
Intel® Smart Cache           6 MB                         8 MB
Bus/Core Ratio               23                           34
DMI                          5 GT/s                       5 GT/s
Instruction Set              64-bit                       64-bit
Instruction Set Extensions   AVX                          SSE4.1/4.2, AVX
Embedded Options Available   No                           Yes
Lithography                  22 nm                        22 nm
Max TDP                      45 W                         77 W
Recommended Customer Price   TRAY: $378.00                TRAY: $294.00, BOX: $305.00

dedndave · February 26, 2013, 11:55:54 PM

the number of clock cycles it takes for a processor to do something doesn't make a very good benchmark
you are comparing one algo to another
not comparing one cpu to another
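
One way to see the point: a cycle count only becomes a time once it is divided by the clock rate, and the clock rate differs between machines (and changes with turbo). A minimal C sketch, assuming the table figures are cycle counts and that the nominal clock applies, neither of which the thread confirms. The function name is hypothetical.

```c
#include <assert.h>

/* Convert a cycle count to nanoseconds for a given clock rate in GHz.
   One cycle at f GHz lasts 1/f nanoseconds. */
static double cycles_to_ns(double cycles, double clock_ghz)
{
    assert(clock_ghz > 0.0);
    return cycles / clock_ghz;
}
```

Under those assumptions, 245 cycles at 2.3 GHz is roughly 107 ns and 444 cycles at 2.8 GHz is roughly 159 ns, so the gap in wall-clock time is not the same as the gap in cycles. The actual core clock during the test is unknown, so even these figures are only indicative.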

Gunther · February 27, 2013, 01:07:02 AM

Hi habran,

Quote from: dedndave on February 26, 2013, 11:55:54 PM
the number of clock cycles it takes for a processor to do something doesn't make a very good benchmark

that's the answer.

Gunther

The MASM Forum
