reason to switch to 64 Bit Assembler

Started by habran, February 10, 2013, 08:03:46 PM


habran

as I said before, this routine is made SPECIFICALLY for UNALIGNED data
that is why I use the MOVDQU instruction
there is no reason to create a sophisticated algorithm for aligned data
you can just use the fastest instruction your machine supports:

    ;r8 can contain sizeof(buffer)
    .for (rcx=dest,rdx=src,r8=count,r8>>=5¦r8¦rcx+=32,rdx+=32,r8--)
             vmovdqa ymm4,[rdx]
             vmovdqa [rcx],ymm4
    .endfor
    ;or for 16 byte xmm:

    ;r8 can contain sizeof(buffer)
    .for (rcx=dest,rdx=src,r8=count,r8>>=4¦r8¦rcx+=16,rdx+=16,r8--)
             movdqa xmm4,[rdx]
             movdqa [rcx],xmm4
    .endfor

    ;for 32 bit machine
    ;eax can contain sizeof(buffer)
    .for (ecx=dest,edx=src,eax=count,eax>>=4¦eax¦ecx+=16,edx+=16,eax--)
             movdqa xmm4,[edx]
             movdqa [ecx],xmm4
    .endfor
;or for JJ2007
     mov ecx,dest
     mov edx,src
     mov eax,sizeof(buffer)
     shr eax,4
     .while (eax)
             movdqa xmm4,[edx]
             movdqa [ecx],xmm4
             add edx,16
             add ecx,16
             dec eax
      .endw     





Cod-Father
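
A note on the aligned loops in the post above: MOVDQA faults if its memory operand is not 16-byte aligned, and shifting the count right by 4 drops any bytes past the last whole 16-byte block. Below is a minimal C sketch of both checks. The helper names (`is_aligned16`, `blocks16`, `tail_bytes16`) are made up for illustration and are not part of any library.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: nonzero if p is 16-byte aligned, i.e. usable with movdqa. */
static int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}

/* Hypothetical helper: number of whole 16-byte blocks, same as count >> 4. */
static size_t blocks16(size_t count)
{
    return count >> 4;
}

/* Hypothetical helper: bytes left over after the whole blocks are copied. */
static size_t tail_bytes16(size_t count)
{
    return count & 15u;
}
```

For example, a 100-byte buffer gives 6 whole blocks plus 4 tail bytes that a loop of this shape does not copy, while the 2048-byte test buffer divides evenly into 128 blocks.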

habran

we can also use this:
     mov ecx,dest
     mov edx,src
     mov eax,sizeof(buffer)
     sub eax,16
     .while (SDWORD eax >= 0)
             movdqa xmm4,[edx+eax]
             movdqa [ecx+eax],xmm4
              sub eax,16
      .endw     
Cod-Father
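
A descending loop like this has to reach offset 0, so the signed comparison needs to be `>= 0`. With `> 0` the first 16 bytes of the buffer are never copied. A small C sketch that counts how many 16-byte blocks each condition visits (the function name `blocks_visited` is only for illustration):

```c
#include <stdint.h>

/* Counts the blocks touched by a descending-offset loop of the form
 *     off = size - 16; while (cond) { copy 16 bytes at off; off -= 16; }
 * include_zero != 0 models the ">= 0" condition, 0 models "> 0".
 * The offset is signed, like SDWORD eax in the assembler version. */
static int blocks_visited(int32_t size, int include_zero)
{
    int visited = 0;
    int32_t off = size - 16;
    while (include_zero ? (off >= 0) : (off > 0)) {
        visited++;
        off -= 16;
    }
    return visited;
}
```

For a 64-byte buffer the `>= 0` form visits offsets 48, 32, 16 and 0 (4 blocks), while the `> 0` form stops after offset 16 (3 blocks).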

habran

we can use macros rather than subs
like this:

xmcopy16 MACRO dest,src,size
     mov ecx,dest
     mov edx,src
     mov eax,size
     sub eax,16
     .while (SDWORD eax >= 0)       ;copy 16 bytes per pass, from the end down to offset 0
         movdqa xmm4,[edx+eax]
         movdqa [ecx+eax],xmm4
         sub eax,16
     .endw
ENDM
xmcopy32 MACRO dest,src,size
     mov rcx,dest
     mov rdx,src
     mov rax,size
     sub rax,32
     .while (SQWORD rax >= 0)       ;32 bytes per pass, so a ymm register (AVX) is needed
         vmovdqa ymm4,[rdx+rax]
         vmovdqa [rcx+rax],ymm4
         sub rax,32
     .endw
ENDM
Cod-Father

habran

here is test on my computer for JJ's exe

Intel(R) Core(TM) i7-3610QM CPU @ 2.30GHz (SSE4)

Algo           memcpy   MemCo1   MemCo2  MemCoC3  MemCoP4  MemCoC2   MemCoL  xmemcpy
Description       CRT rep movs   movdqa  lps+hps   movdqa   movdqa   Masm32 Habran's
                       dest-al    psllq CeleronM  dest-al   src-al  library  Ferrari
Code size           ?       70      291      222      200      269       33      104
------------------------------------------------------------------------------------
2048, d0s0-0      133      184      205      203      202      204      184    238
2048, d1s1-0      225      206      226      223      227      223      225    249
2048, d7s7-0      225      208      229      225      228      216      225    249
2048, d7s8-1      228      223      501      367      209      219      221    246
2048, d7s9-2      225      218      498      365      206      217      221    245
2048, d8s7+1      221      217      502      390      206      219      221    244
2048, d8s8-0      221      205      238      229      232      235      221    244
2048, d8s9-1      222      218      492      365      204      218      222    245
2048, d9s7+2      220      217      488      390      206      219      221    244
2048, d9s8+1      226      218      491      390      206      219      221    245
2048, d9s9-0      221      206      224      222      224      226      221    245
2048, d15s15      221      206      226      222      225      226      221    245


--- ok ---

It is interesting how my code keeps a steady speed across different sizes
and it is interesting how older processors perform differently from newer ones

thank you JJ for taking time to write testing programs :t
however, I suspect that you are pulling my leg because I have neither the time nor the desire to learn your BSIC$  ;)
(for the reason I mentioned before)
when I talk about the beauty of source code I mean its visual effect, readability and functionality
sometimes your programs may even be faster than someone else's but no one will try to read them
because most of your MULTO IMPORTANTE routines are hidden either in $$$$$ macros or %$#% external functions
however, it is a pleasure to exchange opinions and different programming techniques  :biggrin:
Cod-Father

habran

Japheth,
     
Quote: The fastest machine that I have available is an 5 year old AMD 64 X2 5000+.
I saw on Google that they are advertising new laptops for $249 (probably with AVX) :biggrin:
Cod-Father

dedndave

actually, Jochen's MasmBasic is a very productive library
you can bet many of the routines are quite fast
and - many of the functions aren't found in the masm32 library
i would use it more often, myself, except for one thing....

i am trying to learn assembler for windows
high-level constructs mask the assembler code i am trying to learn
the same may be said for many of your macros

habran

hi dedndave,

Quote: actually, Jochen's MasmBasic is a very productive library
you can bet many of the routines are quite fast
and - many of the functions aren't found in the masm32 library
there is no doubt about it :t
we are talking here about the readability of sources  :bgrin:
as soon as I look at his source code I feel like piercing my eyes with a cactus thorn
programs that look like this: "LET$!@#$%^&*@#$%^&*!"
who nowadays has enough patience and concentration to follow such code?
"Mission Impossible 32" with JJ2007 as the lead actor (Tom Cruise refused the role because of his age)
and he hides his most important sources from the public eye like double agent 007
another drawback is that Jochen's MasmBasic is 32 bit and I program only in 64 bit

I love assembler, that's why I joined this forum, otherwise I would be a member of some BASIC community

please don't tell JJ about our conversation, I don't want him to feel bad because I like him and appreciate his brains

Macros help make programs more readable but they should be visible to programmers and named properly :biggrin:






Cod-Father

habran

thanks Gunther for your contribution to this topic :t
Quote: Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
Speedy Gonzales like my "Ferrari Testarossa xmemcpy"
(AVX tires would make it even faster)

thanks to our Godfather JJ Corleone for naming it so

Cod-Father

habran

Japheth,
Quote: I fully agree! However - almost 100% faster than MB - which allegedly is already rocket-science? How is this possible? You must do something wrong...
I found this explanation in the "INTEL® 64 AND IA-32 PROCESSOR ARCHITECTURES" manual:
Quote
2.3.5.1 Efficient Handling of Alignment Hazards
The cache and memory subsystems handles a significant percentage of instructions
in every workload. Different address alignment scenarios will produce varying performance
impact for memory and cache operations. For example, 1-cycle throughput of
L1 (see Table 2-21) generally applies to naturally-aligned loads from L1 cache. But
using unaligned load instructions (e.g. MOVUPS, MOVUPD, MOVDQU, etc.) to access
data from L1 will experience varying amount of delays depending on specific microarchitectures
and alignment scenarios.

Table 2-21. Performance Impact of Address Alignments of MOVDQU from L1

Throughput (cycle)                  Intel Core i7   45 nm Intel Core    65 nm Intel Core
                                    Processor       Microarchitecture   Microarchitecture
Alignment Scenario                  06_1AH          06_17H              06_0FH
________________________________________________________________________________________
16B aligned                         1               2                   2
Not-16B aligned, not cache split    1               ~2                  ~2
Split cache line boundary           ~4.5            ~20                 ~20
________________________________________________________________________________________

Because my processor is a 2.3 GHz Core i7 with a lot of cache,
it takes only 1 cycle for either MOVDQU or MOVDQA (as long as the access does not split a cache line)
Cod-Father
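
The "split cache line boundary" row in the quoted table can be checked with a little arithmetic: an access splits when it starts in one cache line and ends in the next. A minimal C sketch, assuming 64-byte cache lines (the function name `splits_cache_line` and the `CACHE_LINE` constant are illustrative assumptions, not taken from the manual):

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: 64-byte cache lines. */
#define CACHE_LINE 64u

/* Returns 1 if an access of len bytes starting at addr crosses a
 * cache line boundary, 0 if it stays inside one line. */
static int splits_cache_line(uintptr_t addr, size_t len)
{
    return (addr % CACHE_LINE) + len > CACHE_LINE;
}
```

For a 16-byte MOVDQU, only start offsets 49 to 63 within a line cause a split. In a copy loop that advances by 16 bytes from a misaligned start, that works out to one load in every four.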

drifter

Intel(R) Core(TM) i7 CPU         860  @ 2.80GHz (SSE4)

Algo           memcpy   MemCo1   MemCo2  MemCoC3  MemCoP4  MemCoC2   MemCoL  xmemcpy
Description       CRT rep movs   movdqa  lps+hps   movdqa   movdqa   Masm32 Habran's
                       dest-al    psllq CeleronM  dest-al   src-al  library  Ferrari
Code size           ?       70      291      222      200      269       33      104
------------------------------------------------------------------------------------
2048, d0s0-0      196      257      252      252      583      235      600      444
2048, d1s1-0      460      274      687      690      284      277      704      444
2048, d7s7-0      468      277      286      286      289      281      293      444
2048, d7s8-1      302      299      732      521      240      253      705      444
2048, d7s9-2      302      300      867      607      256      253      294      445
2048, d8s7+1      294      726      640      551      239      247      265      444
2048, d8s8-0      471      280      700      288      287      282      293      443
2048, d8s9-1      295      303      637      522      272      253      704      444
2048, d9s7+2      300      301      634      553      289      593      703      444
2048, d9s8+1      301      724      633      552      277      247      292      444
2048, d9s9-0      469      670      694      269      696      282      294      446
2048, d15s15      415      280      289      251      289      284      293      447


--- ok ---


on: February 10, 2013, 11:57:14 PM Gunther wrote:
Quote: There are a few applications which really need more than 4 GB RAM (large data bases for example), but others do not.

The transporters of the future will need to access 7,000,000,000,000,000,000,000,000,000 points of data - that's a 795,807,864,054,000.1 terabyte address space  :icon_eek:

Gunther

Hi drifter,

Quote from: drifter on February 26, 2013, 05:49:48 PM
The transporters of the future will need to access 7,000,000,000,000,000,000,000,000,000 points of data - that's a 795,807,864,054,000.1 terabyte address space  :icon_eek:

that might be, but that could be reached with a 64 bit architecture. But what about the whole bunch of other applications? By the way, you'll find a few 64 bit applications that I've written in the forum.

Gunther
You have to know the facts before you can distort them.

habran

hello drifter,
welcome to the forum :biggrin:
interesting to see the difference in speed between different processors
yours is an i7 at 2.8 GHz and mine is an i7 at 2.3 GHz, but mine is about twice as fast
I am curious why that is
Gunther, yours is an i7 at 3.4 GHz and still slower than qWord's and mine


Cod-Father

habran

here are specifications:

                               Intel® Core™ i7-3610QM       Intel® Core™ i7-3770
                               (6M Cache, up to 3.30 GHz)   (8M Cache, up to 3.90 GHz)
Specifications / Essentials
Status                         Launched                     Launched
Launch Date                    Q2'12                        Q2'12
Processor Number               i7-3610QM                    i7-3770
# of Cores                     4                            4
# of Threads                   8                            8
Clock Speed                    2.3 GHz                      3.4 GHz
Max Turbo Frequency            3.3 GHz                      3.9 GHz
Intel® Smart Cache             6 MB                         8 MB
Bus/Core Ratio                 23                           34
DMI                            5 GT/s                       5 GT/s
Instruction Set                64-bit                       64-bit
Instruction Set Extensions     AVX                          SSE4.1/4.2, AVX
Embedded Options Available     No                           Yes
Lithography                    22 nm                        22 nm
Max TDP                        45 W                         77 W
Recommended Customer Price     TRAY: $378.00                TRAY: $294.00
                                                            BOX : $305.00
Cod-Father

dedndave

the number of clock cycles it takes for a processor to do something doesn't make a very good benchmark
you are comparing one algo to another
not comparing one cpu to another
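
To put numbers on this: a cycle count only becomes comparable across machines once it is converted to time, and that needs the frequency the counter actually ticks at. The sketch below is a rough C illustration that assumes the counter runs at the nominal clock speed, which may not hold with turbo or power saving. The helper name `cycles_to_ns` is hypothetical.

```c
#include <assert.h>

/* Hypothetical helper: converts a cycle count to nanoseconds.
 * Assumption: the counter ticks at ghz billion cycles per second,
 * so one cycle lasts 1/ghz nanoseconds. */
static double cycles_to_ns(double cycles, double ghz)
{
    assert(ghz > 0.0);
    return cycles / ghz;
}
```

Under that assumption, 238 cycles at 2.3 GHz is about 103.5 ns and 444 cycles at 2.8 GHz is about 158.6 ns, so a 2:1 gap in cycles is not automatically a 2:1 gap in time.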

Gunther

Hi habran,

Quote from: dedndave on February 26, 2013, 11:55:54 PM
the number of clock cycles it takes for a processor to do something doesn't make a very good benchmark

that's the answer.

Gunther
You have to know the facts before you can distort them.