
Self modifying code

Started by Siekmanski, February 27, 2016, 10:18:07 AM


Siekmanski

Is it legal to post and discuss self-modifying code here at The Masm Forum?

- To improve performance by unrolling loops into L1/L2/L3 code cache memory (skipping additions/subtractions of memory data pointers, changing immediate values, etc. - see the sketch below)
- To compress code that is decompressed and executed at runtime (minimizes executable size)
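
To make the "changing immediate values" idea concrete, here is a minimal, untested sketch (MASM32 syntax; every name in it is invented for illustration): make the code page writable, patch a 32-bit immediate inside an instruction once at start-up, and call the routine normally afterwards.

.386
.model flat, stdcall
option casemap:none
include \masm32\include\windows.inc
include \masm32\include\kernel32.inc
includelib \masm32\lib\kernel32.lib

.data
oldProt dd 0

.code
; mov eax, imm32 written as raw bytes so the immediate gets its own label
ScaleFactor proc
    db 0B8h                     ; opcode of mov eax, imm32
patch_imm dd 0DEADBEEFh         ; placeholder immediate, overwritten below
    ret
ScaleFactor endp

start:
    ; make the code page writable, patch once, restore the old protection
    invoke VirtualProtect, offset patch_imm, 4, PAGE_EXECUTE_READWRITE, addr oldProt
    mov dword ptr [patch_imm], 42
    invoke GetCurrentProcess
    invoke FlushInstructionCache, eax, offset patch_imm, 4
    invoke VirtualProtect, offset patch_imm, 4, oldProt, addr oldProt

    call ScaleFactor            ; eax now returns 42
    invoke ExitProcess, eax
end start

The patch happens exactly once, so the code cache is only disturbed at start-up, never inside the hot loop.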

Marinus
Creative coders use backward thinking techniques as a strategy.

Magnum

I see no mention of it in the rules.

Take care,
                   Andy

Ubuntu-mate-18.04-desktop-amd64

http://www.goodnewsnetwork.org

TouEnMasm

Pirate techniques are not allowed in this forum; that is a clear rule.
Better to wait for Hutch's permission, but I don't think he will agree with that.
Still, there are some samples in the forum.
Fa is a musical note to play with CL

jj2007

Greetings from Italy: Our logic here is "everything that is not explicitly forbidden is allowed".

Which is in sharp contrast to German logic: "everything that is not explicitly allowed is forbidden" 8)

Anyway, it's for Hutch to decide (does SMF have hidden members-only sub-forums?).

Your ideas look interesting, although from my experience self-modifying code is a no-no for performance, as the code cache gets invalidated. But you seem to be aware of that...

Siekmanski

These ideas are not intended for pirate techniques.  :eusa_naughty:

My idea is to reserve process memory and unroll/self-modify the code loops once.
Not on the fly at each execution, so there is no repeated code cache invalidation.
Then prefetch the code into the cache level you want it in (preventing cache miss penalties).
So, you can take advantage of the different L1/L2/L3 cache sizes of different machines and skip the opcodes you don't need anymore.

This should boost performance.  8)
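
As a rough illustration of what I mean (an untested sketch in MASM32 syntax, with made-up names): copy a small loop body into executable memory N times at start-up, terminate it with a RET, and call the generated routine; N would be chosen from the detected cache size.

.386
.model flat, stdcall
option casemap:none
include \masm32\include\windows.inc
include \masm32\include\kernel32.inc
includelib \masm32\lib\kernel32.lib

.data
codeBuf dd 0

.code
; template: one iteration of the work, without loop counter or jump
body_start:
    add eax, 1                  ; stand-in for the real per-element work
body_end:
BODY_LEN equ body_end - body_start

BuildUnrolled proc count:DWORD
    push esi
    push edi
    invoke VirtualAlloc, NULL, 4096, MEM_COMMIT, PAGE_EXECUTE_READWRITE
    mov codeBuf, eax
    mov edi, eax
    mov ecx, count
    cld
copy_body:
    push ecx
    mov esi, offset body_start
    mov ecx, BODY_LEN
    rep movsb                   ; append one copy of the loop body
    pop ecx
    dec ecx
    jnz copy_body
    mov byte ptr [edi], 0C3h    ; terminate the generated code with RET
    pop edi
    pop esi
    ret
BuildUnrolled endp

start:
    invoke BuildUnrolled, 16    ; unroll factor picked to fit the code cache
    xor eax, eax
    call dword ptr codeBuf      ; run the generated routine, eax = 16 here
    invoke ExitProcess, eax
end start

The generated buffer could also be touched or prefetched right after it is built, so the first real call does not pay the cold-cache penalty.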

@Hutch, do you allow this kind of discussion?
Creative coders use backward thinking techniques as a strategy.

avcaballero

It seems that Hutch has chosen "the German method" to say "no"  :biggrin:

I don't know a thing about this subject, but I think it is an interesting one, so I'd be glad to hear more about it.

Regards

FORTRANS

Hi,

Quote from: Siekmanski on February 27, 2016, 10:18:07 AM
Is it legal to post and discuss self-modifying code here at The Masm Forum ?

- To improve performance by unrolling loops into L1/L2/L3 code cache memory ( skipping additions / subtractions of memory data pointers and changing immediate values etc.)

   I confess I do not see how you would improve performance with self-modifying code.  What information would you make use of at run time that was not available at compile time?  Even granting that I am not being imaginative enough about such conditions, it would seem difficult to improve performance by copying things around to unroll loops or the like.

Quote
- Compressing code to be decompressed and executed at runtime. ( minimizes executable size )

   I have done this using the EXEPACK option of the linker, admittedly for 16-bit code.  It reduced the executable's size from 63 KB to 25 KB.  It was for the HP 200LX palmtop, where storage can be limited.  The program has the same memory footprint as the uncompressed version once it is loaded, but it uses less storage space.

Regards,

Steve N.

Raistlin

Quote from: Siekmanski
- To improve performance by unrolling loops into L1/L2/L3 code cache memory (skipping additions/subtractions of memory data pointers and changing immediate values etc.)

I understand this statement and am interested in the theory for the specific reason it was suggested, as I am working on a project that does "just that" -> enumerating the underlying hardware dynamically.
My own path was to look at cache granularity with L1 code preservation and L1/L2/L3 data optimizations.
This possible feature enhancement would be of use if found to be practical - I for one would like to know more.
If there's a way we can chat about this and keep the riff-raff out of it - I'm game. :t


Are you pondering what I'm pondering? It's time to take over the world ! - let's use ASSEMBLY...

jimg

I used self-modifying code in my sort routines to save space and decrease CPU ticks.  Coding for the various possibilities in each type of sort would take many compares and internal jumps to the correct section for the situation, which takes time.  The alternative of duplicating the code for each type without the jumps would make the routine many times its size.  Since the routine loops internally millions of times over a large chunk of data to sort, a simple precompiler at the start that changes several compare-type instructions and result handlers makes for a smaller and faster general-purpose routine.

Besides, it's a lot of fun :)   
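
For anyone curious, here is a stripped-down, untested sketch of that kind of patch (hypothetical labels, MASM32 syntax): a single scan loop whose conditional-jump opcode is patched once before the scan, so the same routine returns either the maximum or the minimum of an array with no per-iteration branch on the direction.

.386
.model flat, stdcall
option casemap:none
include \masm32\include\windows.inc
include \masm32\include\kernel32.inc
includelib \masm32\lib\kernel32.lib

.data
oldProt dd 0
values  dd 7, 3, 9, 1, 5
NUMVALS equ 5

.code
; returns in eax the "best" element according to the patched condition below
ScanBest proc pArr:DWORD, nElems:DWORD
    push esi
    mov esi, pArr
    mov ecx, nElems
    mov eax, [esi]              ; current best = first element
next_elem:
    add esi, 4
    dec ecx
    jz done
    cmp eax, [esi]
patch_jcc::
    jge next_elem               ; 7Dh = JGE: keep the larger value (max scan)
    mov eax, [esi]              ; new best found
    jmp next_elem
done:
    pop esi
    ret
ScanBest endp

start:
    ; patch the condition byte once: 7Dh = JGE (find max), 7Eh = JLE (find min)
    invoke VirtualProtect, offset patch_jcc, 1, PAGE_EXECUTE_READWRITE, addr oldProt
    mov byte ptr [patch_jcc], 7Eh          ; switch the routine to "find min"
    invoke GetCurrentProcess
    invoke FlushInstructionCache, eax, offset patch_jcc, 1
    invoke ScanBest, addr values, NUMVALS  ; eax = 1, the minimum
    invoke ExitProcess, eax
end start

In a real sort routine, the same kind of one-byte patch would select ascending or descending compares before the main loop runs.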

Siekmanski

Hi Steve,

The advantage could be to prepare a routine on the fly ONCE, so that it runs as fast as possible on each different architecture.
You only need to know the available instruction sets, cores and cache sizes to construct the "fastest" code for the specific architecture it runs on.
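
For example (an untested sketch with made-up names; a real detector would also validate the CPUID leaf limits first), the feature bits and the L2 size can be read once at start-up and fed into whatever builds the streamlined routine:

.686
.model flat, stdcall
option casemap:none
include \masm32\include\windows.inc
include \masm32\include\kernel32.inc
includelib \masm32\lib\kernel32.lib

.data
hasSSE2  dd 0
hasSSE41 dd 0
L2SizeKB dd 0

.code
DetectCpu proc uses ebx
    ; leaf 1: feature flags
    mov eax, 1
    cpuid
    xor eax, eax
    bt  edx, 26                 ; EDX bit 26 = SSE2
    setc al
    mov hasSSE2, eax
    xor eax, eax
    bt  ecx, 19                 ; ECX bit 19 = SSE4.1
    setc al
    mov hasSSE41, eax

    ; extended leaf 80000006h: ECX bits 31..16 = L2 cache size in KB
    ; (a real detector would first verify the maximum extended leaf via eax = 80000000h)
    mov eax, 80000006h
    cpuid
    shr ecx, 16
    mov L2SizeKB, ecx
    ret
DetectCpu endp

start:
    invoke DetectCpu
    invoke ExitProcess, 0
end start

The unroll factor and the instruction mix would then be chosen from those values once, before the code is generated.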

Hi Raistlin,

This idea came up when I picked up my FFT routines again and wanted them to be as fast as possible.
BTW, I found a way to construct the time domain decomposition table without the need for bit-reversing.....
Special cases for the first 2 Log2 loops and the last Log2 loop.
A special case if the imaginary data is zero or not available when you need Real-Imag output.
Still have to write a Real FFT. ( where only real data is processed )

So, you now know how many different routines are needed to accomplish the same thing...

Still no answer from Hutch on whether this is allowed to be discussed here.....

If not, PM me your email address to discuss it further.
Creative coders use backward thinking techniques as a strategy.

avcaballero

Here is some sample code. As we don't know yet whether we can talk about it, I won't post the executable.

org 100h                ; DOS .COM program
use16
start:
mov al,13h
int 10h                 ; set VGA mode 13h, 320x200x256 (assumes AH=0 at .COM entry)
mov bl,3                ; BX = 0003h, reused at the end to restore text mode
mov si,0a0a0h
mov ds,si               ; DS points just above the A000h video segment
again: mov cx,0c8bh     ; the first byte (0B9h) is skipped by "jnz again+1"
xor ch,[bx+si]
mov [si+0fec2h],ch      ; write a pixel into video memory
dec si
jnz again+1             ; re-enter one byte into the MOV, see below
int 16h ;ah=00h - wait for a key press
xchg ax,bx ;ax=0003h
int 10h                 ; back to 80x25 text mode
ret

If you look closer, you will see "jnz again+1". Jumping to one byte after the "again" label makes the executed loop different:

010C 8B0C             MOV   CX,[SI]
010E 3228             XOR   CH,[BX+SI]
0110 88ACC2FE         MOV   [FEC2+SI],CH
0114 4E               DEC   SI
0115 75F5             JNZ    010C

If you compile and run the code (up to you), you will see a Sierpinski triangle fractal, in just 29 bytes.

nidud

deleted

Siekmanski

Quote from: nidud on March 01, 2016, 03:48:58 AM
The conventional way will be hardware specific DLL-files where the application loads the appropriate ones on start-up.

If you think about it, having multiple versions of the same function and selecting one based on the available hardware at runtime will create a larger EXE (at least in memory) than loading a DLL.

With regards to self-modifying code this forum is loaded with samples using VirtualProtect() to gain write access to the code segment so I wouldn't worry too much about that  :P

:biggrin: I just wanted to be sure out of respect for this great forum.


You need a lot of DLL files for all the possible combinations of caches and instruction sets.

http://www.sandpile.org/x86/cpuid.htm#level_0000_0002h

This is why I want to create streamlined versions from the basic routines in memory.
Creative coders use backward thinking techniques as a strategy.

jj2007

Quote from: Siekmanski on March 01, 2016, 07:27:10 AM
You need a lot of DLL files for all the possible combinations of caches and instruction sets.

One should expect, for example, that gdi32.dll and gdiplus.dll look different for major families of CPUs, such as AMD / Intel i? / Intel Celeron etc. ::)

Raistlin

Quote from: Siekmanski
BTW, found out a way to construct the time domain decomposition table without the need of bit-reversing....

Probably shouldn't waste a post like this - BUT AWESOME! - I'm all ears.
Are you pondering what I'm pondering? It's time to take over the world ! - let's use ASSEMBLY...