CodeTune Timming Analyzer - V 1.0 (Update 09/01/16)

Started by guga, December 30, 2015, 08:33:13 AM

guga

Thanks guys


I updated the library in the 1st post.

The new version includes:

  • CreateTimeProfileEx API, which uses all available algo methods
  • STDEx structure, which stores the fastest value found by the function above
  • Cleaned-up source code
  • Added Masm and RosAsm examples
  • The functions are faster and more stable


TODO:

  • Add APIs to estimate how long CreateTimeProfile and CreateTimeProfileEx will take to finish. (Useful if the user wants to show a progress bar or a timer, or simply to know how long the API will take to analyse his code.)
  • Build converter functions (nanoseconds to milliseconds, to gigahertz, to megahertz, to clock cycles, etc.)
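
A minimal sketch of what such converter functions could look like, written in C for readability. The function names are hypothetical and are not part of the library; the clock frequency is assumed to be supplied by the caller.

```c
/* Hypothetical converter helpers; none of these names exist in the library. */

/* Nanoseconds to milliseconds: 1 ms = 1,000,000 ns. */
double ns_to_ms(double ns)
{
    return ns / 1e6;
}

/* Nanoseconds to clock cycles at a given clock frequency in Hz:
   cycles = seconds * Hz. */
double ns_to_cycles(double ns, double clock_hz)
{
    return ns * 1e-9 * clock_hz;
}

/* Clock cycles back to nanoseconds at a given clock frequency in Hz. */
double cycles_to_ns(double cycles, double clock_hz)
{
    return cycles / clock_hz * 1e9;
}

/* A period in nanoseconds expressed as a frequency:
   a 1 ns period is 1 GHz, which is 1000 MHz. */
double period_ns_to_ghz(double period_ns)
{
    return 1.0 / period_ns;
}

double period_ns_to_mhz(double period_ns)
{
    return 1000.0 / period_ns;
}
```

For example, ns_to_cycles(1000.0, 2e9) gives about 2000 cycles, and period_ns_to_ghz(0.5) gives 2 GHz.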


I also updated the API Guide in PDF format, but because of the forum's size limit I uploaded it to 4shared (the link is also in the 1st post).

A small tip: to improve the accuracy of the library, it helps to align your code. I don't know how to do a proper alignment in Masm, but the best approach is to align the start of each function in your app to a 16-byte boundary (I'm not talking about aligning the PE sections, just your functions). This will make your app work better and also improve the accuracy of the library itself.
Although the library's APIs are already aligned, if you align your own apps this way you will probably get better performance (at least on 32-bit x86; I don't know how it is on 64-bit).
Coding in Assembly requires a mix of:
80% of brain, passion, intuition, creativity
10% of programming skills
10% of alcoholic levels in your blood.

My Code Sites:
http://rosasm.freeforums.org
http://winasm.tripod.com

guga

Damn, I found something that is tearing the library apart :(

Has anyone ever tested the Blowfish algorithm?

Here it only works if I set
sample: 50 to 300
iteration = 100

The result is 86708 nanoseconds!!!! = 254350 clock cycles!!!!

Any iteration value higher than that takes ages to finish.

I realize that Blowfish internally performs a few thousand loops, but is it really that slow?

Jochen, have you ever tried this algorithm? If you need it, I can post a working example here. I ported Paul Kocher's file to asm.

I'm testing the initialization routine (Blowfish_Init), not the encrypter/decrypter.

How is it possible that a function takes about 0.08 milliseconds to run for a single call?
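
As a cross-check on those two figures, here is a small C sketch. It assumes the routine follows the standard Blowfish key schedule, which runs 521 block encryptions to fill the 18 P-array entries and the 4 x 256 S-box entries, two words per encryption. The function names are hypothetical.

```c
/* Hypothetical helpers for sanity-checking a timing result. */

/* Clock frequency implied by a (cycles, nanoseconds) pair.
   Cycles per nanosecond is numerically the frequency in GHz. */
double implied_clock_ghz(double cycles, double ns)
{
    return cycles / ns;
}

/* Average cycles per block encryption, assuming the standard
   Blowfish key schedule (521 encryptions) and attributing all
   the measured time to those encryptions. */
double cycles_per_block(double total_cycles)
{
    const double encryptions = (18.0 + 4.0 * 256.0) / 2.0; /* = 521 */
    return total_cycles / encryptions;
}
```

With 254350 cycles and 86708 ns, the implied clock is about 2.93 GHz, so the two numbers agree with each other. Spread over 521 encryptions, that is roughly 488 cycles per 16-round block, so the setup cost is large but not out of line for what the routine has to do.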

Grincheux

I sent you a mail with the Sph dll, which contains many algorithms. It is too big to post here.
Kenavo (Bye)
----------------------
Help me if you can, I'm feeling down...

guga

Hmm... the dll is for x64 only. Do you have a 32-bit version?

Grincheux

I sent you the source and the Pelles C project. Just modify the project to select 32 bits.

guga

I can't compile it. :(

I truly hate Visual Studio :icon_mrgreen: Whenever I try to compile blowfish.c from the tom library, it says unreferenced library XXXX.....

Grincheux

Try this link to download Pelles C Compiler : http://www.smorgasbordet.com/pellesc/

guga

Well... whoever "blew the fish" back in the 90's deserves to be buried upside-down in a hole filled with sand, crabs and some ants, to be eaten alive sloooowly  :greensml: :greensml: :greensml: :greensml:

The Blowfish value is, in fact, around 87247 ns, but the API takes 990 seconds to finish, even after I made a couple of preliminary tests to try to stop the hang. (300 samples x 3000 iterations = 9 to 27 million loops using all the algos in CreateTimeProfileEx, not counting the multiplication by the millions of internal loops of the tested function itself. OUCH!!!)
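
A rough way to see where the 990 seconds go is to multiply the number of calls by the time per call. The C sketch below assumes a hypothetical `passes` parameter standing in for however many times the profiler repeats the sample/iteration grid across its algo methods; the post gives only the 9 to 27 million range, not that number.

```c
/* Hypothetical run-time estimate: total calls times time per call.
   `passes` is an assumed stand-in for the number of algo methods or
   repeats; it is not a documented parameter of the library. */
double estimate_seconds(unsigned long samples,
                        unsigned long iterations,
                        unsigned long passes,
                        double ns_per_call)
{
    double calls = (double)samples * (double)iterations * (double)passes;
    return calls * ns_per_call * 1e-9; /* nanoseconds to seconds */
}
```

With 300 samples, 3000 iterations and 87247 ns per call, a single pass comes to about 78.5 seconds. The observed 990 seconds then corresponds to roughly 11 million calls, which sits inside the 9 to 27 million range quoted above.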

I'm having to make some design choices about how to handle tested functions that contain seemingly endless loops internally. I'm reviewing the good-sample criteria to make sure the new changes will not break the library for "normal" functions.

I chose to treat a sample as good when the Variance is smaller than the Mean, and also when the STD/Variance values are smaller than the time the user's processor takes to perform 1 clock cycle, which I use as the basis for enabling or disabling the other good-sample criteria.

As for the Variance being bigger than the Mean: to overcome the problem with Blowfish I had to use Mean^2 (the Mean squared) so that it is closer in scale to the Variance (the STD is the square root of the Variance, so...). It worked, but it is still not the best choice for other functions.
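
The criteria described above can be sketched in C as follows. This is one reading of the description, not the library's actual code: the names are hypothetical, the variance is the population variance, and the STD test is done on squared values so that no square root is needed.

```c
#include <stddef.h>

/* Hypothetical summary of one batch of timings, in nanoseconds. */
typedef struct {
    double mean;
    double variance; /* population variance, in ns squared */
} Stats;

Stats compute_stats(const double *x, size_t n)
{
    Stats s = {0.0, 0.0};
    if (n == 0)
        return s;

    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += x[i];
    s.mean = sum / (double)n;

    double sq = 0.0;
    for (size_t i = 0; i < n; i++) {
        double d = x[i] - s.mean;
        sq += d * d;
    }
    s.variance = sq / (double)n;
    return s;
}

/* Original criterion: Variance smaller than the Mean. */
int variance_below_mean(Stats s)
{
    return s.variance < s.mean;
}

/* Relaxed criterion: Variance smaller than Mean^2,
   which is the same as STD smaller than the Mean. */
int variance_below_mean_squared(Stats s)
{
    return s.variance < s.mean * s.mean;
}

/* STD smaller than one clock period; since both sides are
   non-negative this equals Variance < period^2. */
int std_below_cycle(Stats s, double cycle_ns)
{
    return s.variance < cycle_ns * cycle_ns;
}
```

One reason the first test breaks on slow functions is scale: the Variance grows with the square of the timings while the Mean grows linearly, so samples around 87000 ns with a spread of a few hundred ns fail Variance < Mean but pass Variance < Mean^2.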

I'm thinking about a general solution that does not ruin the performance and accuracy just because some tested functions do extremely "heavy" computations of their own.

guga

I think I succeeded in fixing that particular problem. I'm running some tests to be sure before releasing the update.

Also, I'm porting Agner Fog's code to test rdmsr as another algo method. If it really works as expected, it may be helpful to have more information about what is going on in your code, like branch predictions etc... But I confess it is hard to port.

Since he uses C and I have never converted WDM header files to RosAsm before, it is painful for me to understand but... I'm trying :)

Also, I had to create a Fastcall macro, since ntoskrnl uses the fastcall calling convention for some APIs.

guga

Btw... has anyone ever used Agner Fog's PMCTest files on XP (32 bits) before?

I can't run startcounter here, nor can I compile the main file. All I was able to compile was the textB app. :( But I'm unable to understand the complete functionality.

What exactly is this?



SCounterDefinition CounterDefinitions[] = {
    //  id   scheme cpu    countregs eventreg event  mask   name
    {100,  S_P4, PRALL,  4,   7,     0,      9,      7,  "Uops"     }, // uops from any source
    {101,  S_P4, PRALL,  4,   7,     0,      9,      2,  "UopsTC"   }, // uops from trace cache
    {102,  S_P4, PRALL,  4,   7,     0,      9,      1,  "UopsDec"  }, // uops directly from decoder
    {103,  S_P4, PRALL,  4,   7,     0,      9,      4,  "UopsMCode"}, // uops from microcode ROM
    {110,  S_P4, PRALL, 12,  17,     4,      1,      1,  "UopsNB"   }, // uops non-bogus
    {111,  S_P4, PRALL, 12,  17,     4,      2,   0x0c,  "UopsBogus"}, // uops bogus
    {150,  S_P4, PRALL,  8,  11,     1,      4, 0x8000,  "UopsFP"   }, // uops floating point, except move etc.
    {151,  S_P4, PRALL,  8,  11,     1,   0x2e,      8,  "UopsFPMov"}, // uops floating point and SIMD move
    {152,  S_P4, PRALL,  8,  11,     1,   0x2e,   0x10,  "UopsFPLd" }, // uops floating point and SIMD load
    {160,  S_P4, PRALL,  8,  11,     1,      2, 0x8000,  "UopsMMX"  }, // uops 64-bit MMX
    {170,  S_P4, PRALL,  8,  11,     1,   0x1a, 0x8000,  "UopsXMM"  }, // uops 128-bit integer XMM
    {200,  S_P4, PRALL, 12,  17,     5,      6,   0x0f,  "Branch"   }, // branches
    {201,  S_P4, PRALL, 12,  17,     5,      6,   0x0c,  "BrTaken"  }, // branches taken
    {202,  S_P4, PRALL, 12,  17,     5,      6,   0x03,  "BrNTaken" }, // branches not taken
    {203,  S_P4, PRALL, 12,  17,     5,      6,   0x05,  "BrPredict"}, // branches predicted
    {204,  S_P4, PRALL, 12,  17,     4,      3,   0x01,  "BrMispred"}, // branches mispredicted
    {210,  S_P4, PRALL,  4,   7,     2,      5,   0x02,  "CondJMisp"}, // conditional jumps mispredicted
    {211,  S_P4, PRALL,  4,   7,     2,      5,   0x04,  "CallMisp" }, // indirect call mispredicted
    {212,  S_P4, PRALL,  4,   7,     2,      5,   0x08,  "RetMisp"  }, // return mispredicted
    {220,  S_P4, PRALL,  4,   7,     2,      5,   0x10,  "IndirMisp"}, // indirect calls, jumps and returns mispredicted
    {310,  S_P4, PRALL,  0,   3,     0,      3,   0x01,  "TCMiss"   }, // trace cache miss
    {320,  S_P4, PRALL,  0,   3,     7,   0x0c,  0x100,  "Cach2Miss"}, // level 2 cache miss
    {321,  S_P4, PRALL,  0,   3,     7,   0x0c,  0x200,  "Cach3Miss"}, // level 3 cache miss
    {330,  S_P4, PRALL,  0,   3,     3,   0x18,   0x02,  "ITLBMiss" }, // instructions TLB Miss
    {340,  S_P4, PRALL,  0,   3,     2,      3,   0x3a,  "LdReplay" }, // memory load replay


    //  id   scheme cpu    countregs eventreg event  mask   name
    {  9,  S_P1, PRALL,  0,   1,     0,   0x16,        2,  "Instruct" }, // instructions executed
    { 11,  S_P1, PRALL,  0,   1,     0,   0x17,        2,  "InstVpipe"}, // instructions executed in V-pipe
    {202,  S_P1, PRALL,  0,   1,     0,   0x15,        2,  "Flush"    }, // pipeline flush due to branch misprediction or serializing event   
    {310,  S_P1, PRALL,  0,   1,     0,   0x0e,        2,  "CodeMiss" }, // code cache miss
    {311,  S_P1, PRALL,  0,   1,     0,   0x29,        2,  "DataMiss" }, // data cache miss


    //  id   scheme  cpu     countregs eventreg event  mask   name
    {  9, S_P2MC, PRALL,    0,   1,     0,   0xc0,     0,  "Instruct" }, // instructions executed
    { 10, S_P2MC, PRALL,    0,   1,     0,   0xd0,     0,  "IDecode"  }, // instructions decoded
    { 20, S_P2MC, PRALL,    0,   1,     0,   0x80,     0,  "IFetch"   }, // instruction fetches
    { 21, S_P2MC, PRALL,    0,   1,     0,   0x86,     0,  "IFetchStl"}, // instruction fetch stall
    { 22, S_P2MC, PRALL,    0,   1,     0,   0x87,     0,  "ILenStal" }, // instruction length decoder stalls
    {100, S_P2MC, INTEL_PM, 0,   1,     0,   0xc2,     0,  "Uops(F)"  }, // microoperations in fused domain
    {100, S_P2MC, PRALL,    0,   1,     0,   0xc2,     0,  "Uops"     }, // microoperations
    {110, S_P2MC, INTEL_PM, 0,   1,     0,   0xa0,     0,  "Uops(UF)" }, // unfused microoperations submitted to execution units (Undocumented counter!)
    {104, S_P2MC, INTEL_PM, 0,   1,     0,   0xda,     0,  "UopsFused"}, // fused uops
    {115, S_P2MC, INTEL_PM, 0,   1,     0,   0xd3,     0,  "SynchUops"}, // stack synchronization uops
    {121, S_P2MC, PRALL,    0,   1,     0,   0xd2,     0,  "PartRStl" }, // partial register access stall
    {130, S_P2MC, PRALL,    0,   1,     0,   0xa2,     0,  "Rs Stall" }, // all resource stalls
    {201, S_P2MC, PRALL,    0,   1,     0,   0xc9,     0,  "BrTaken"  }, // branches taken
    {204, S_P2MC, PRALL,    0,   1,     0,   0xc5,     0,  "BrMispred"}, // mispredicted branches
    {205, S_P2MC, PRALL,    0,   1,     0,   0xe6,     0,  "BTBMiss"  }, // static branch prediction made
    {310, S_P2MC, PRALL,    0,   1,     0,   0x28,  0x0f,  "CodeMiss" }, // level 2 cache code fetch
    {311, S_P2MC, INTEL_P23,0,   1,     0,   0x29,  0x0f,  "L1D Miss" }, // level 2 cache data fetch
(...)
    //  end of list   
    {0, S_UNKNOWN, PRUNKNOWN, 0,  0,     0,      0,     0,    0     }  // list must end with a record of all 0
};
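
Going by the column header in the comments, each row appears to describe one hardware performance event: an id that the caller asks for, the counter scheme and CPU family the row applies to, the first and last counter register that can hold it, and the event select register, event code and mask to program. The C sketch below illustrates that reading with a lookup by id and scheme. The struct layout, field names and enum values are assumptions taken from the header comment, not Agner Fog's actual declarations, and only three rows are copied from the table.

```c
#include <stddef.h>

/* Assumed enums; the real values in the original source may differ. */
typedef enum { S_UNKNOWN = 0, S_P1, S_P4, S_P2MC } Scheme;
typedef enum { PRUNKNOWN = 0, PRALL = 0xFFFF } CpuFamily;

/* Assumed layout, one field per column of the header comment. */
typedef struct {
    int id;              /* counter id requested by the caller */
    Scheme scheme;       /* counter scheme of the CPU generation */
    CpuFamily cpu;       /* CPU families the row applies to */
    int count_first;     /* first usable counter register */
    int count_last;      /* last usable counter register */
    int eventreg;        /* event select register */
    int event;           /* event code */
    int mask;            /* event mask */
    const char *name;    /* short label for the output column */
} CounterDef;

/* Three rows copied from the table above, plus the all-zero terminator. */
static const CounterDef counters[] = {
    {100, S_P4,   PRALL,  4,  7, 0,    9,    7, "Uops"},
    {204, S_P4,   PRALL, 12, 17, 4,    3, 0x01, "BrMispred"},
    {204, S_P2MC, PRALL,  0,  1, 0, 0xc5,    0, "BrMispred"},
    {0, S_UNKNOWN, PRUNKNOWN, 0, 0, 0, 0, 0, 0}
};

/* Walk the table until the all-zero record and return the first row
   that matches both the id and the scheme, or NULL if none does. */
const CounterDef *find_counter(const CounterDef *table, int id, Scheme scheme)
{
    for (const CounterDef *p = table; p->id != 0; p++) {
        if (p->id == id && p->scheme == scheme)
            return p;
    }
    return NULL;
}
```

This would explain why the same id (for example 204, mispredicted branches) shows up several times: each scheme needs a different register, event code and mask for the same logical event.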

guga

I'm making some design choices here.

I succeeded in porting Agner Fog's code to RosAsm (weell.. kind of  :icon_mrgreen:), but I needed to update RosAsm so that it allows those MONSTER WDM structures.

I didn't know how weird those structures were until now. But I hope I ported it right. I'll run some tests tonight.

dedndave

isn't that the TIB structure ?
it looks like it (thread info block aka thread environment block)

i.e.,
1) you don't have to initialize it
2) all you really need are offsets to work with specific values

i often use something like

    mov     eax,fs:[8]

notice that it's the stack limit - but my code doesn't need to know the name - lol

guga

Hi Dave

No, the driver uses something called KPCR http://masm32.com/board/index.php?topic=5023.0;topicseen

It took me hours to try to port this beast. I already ported the NT_TIB/TEB structure before, but nothing compares to this monster. The worst thing is that the M$ header I had was utterly wrong. It said that KPCR was a simple structure, but in fact it contains more than 1700 members!!!

I had to compare the info from 2 different sites and port all the other structures used to build such a thing. I also had to take a look at my Windows kernel sources (the one M$ released for research to the public/students, and the leaked one I had, the same used by ReactOS, btw)
http://msdn.moonsols.com/win7rtm_x86/KPCR.html
http://www.nirsoft.net/kernel_struct/vista/KPCR.html

I also made a macro called KeGetCurrentProcessorNumber that behaves the same way as his.

[KPCR.NumberDis 81] ; equate value related to the KPCR structure
; KeGetCurrentProcessorNumber
[KeGetCurrentProcessorNumber | movzx eax B$fs:KPCR.NumberDis | cdq]


The original disassembly of this was:

mov cl B$fs:051 (in hex)
movzx eax cl
cdq


I hope I succeeded in porting it correctly. :greensml:


Btw, some routines may be useful for you, Dave. For example, how to enable/disable rdtsc/rdpmc. I ported it as:

                .Else_If D@command = PMC_ENABLE
                    mov eax CR4 ; Read CR4
                    and eax 0-5 ; Enable RDTSC
                    or eax 0100 ; Enable RDPMC
                    mov CR4 eax ; Write CR4
                .Else_If D@command = PMC_DISABLE
                    mov eax CR4 ; Read CR4
                    and eax 0FFFFFEFF ; 0-257 Disable RDPMC
                    mov CR4 eax ; Write CR4


What I don't know yet is how he found the values after the and/or instructions (-5, 0100, etc.), i.e. how he knew the proper bits to enable or disable. I would like to create proper equates for them instead of using the hexadecimal values (to make it easier to read).
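
The values correspond to two documented flag bits in CR4: bit 2 is TSD (Time Stamp Disable) and bit 8 is PCE (Performance-Monitoring Counter Enable). Clearing TSD lets RDTSC run at any privilege level, and setting PCE does the same for RDPMC. The C sketch below shows how the masks fall out of the bit positions. The macro and function names are hypothetical, and the code only transforms a value; actually reading or writing CR4 is possible only in ring 0.

```c
#include <stdint.h>

/* CR4 flag bits; the bit positions are from the Intel manuals,
   the macro names are chosen here for illustration. */
#define CR4_TSD (1u << 2) /* Time Stamp Disable: 0x004 */
#define CR4_PCE (1u << 8) /* Perf-Monitoring Counter Enable: 0x100 */

/* PMC_ENABLE: clear TSD (and eax 0-5), then set PCE (or eax 0100). */
uint32_t pmc_enable(uint32_t cr4)
{
    return (cr4 & ~CR4_TSD) | CR4_PCE;
}

/* PMC_DISABLE: clear PCE (and eax 0FFFFFEFF, i.e. 0-257). */
uint32_t pmc_disable(uint32_t cr4)
{
    return cr4 & ~CR4_PCE;
}
```

So -5 is simply NOT 4 (all bits set except bit 2), 0100 is bit 8 on its own, and 0FFFFFEFF is NOT 0100. Equates for the two bits, with the masks derived from them, would replace all three magic numbers.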