Self modifying code

Started by Siekmanski, February 27, 2016, 10:18:07 AM

Siekmanski

Is it legal to post and discuss self-modifying code here at The Masm Forum ?

- To improve performance by unrolling loops into L1/L2/L3 code cache memory ( skipping additions / subtractions of memory data pointers and changing immediate values etc.)
- Compressing code to be decompressed and executed at runtime. ( minimizes executable size )

Marinus
Creative coders use backward thinking techniques as a strategy.

Magnum

I see no mention of it in the rules.

Take care,
                   Andy

Ubuntu-mate-18.04-desktop-amd64

http://www.goodnewsnetwork.org

TouEnMasm

Pirate techniques are not allowed in this forum; that is a clear rule.
It is better to wait for hutch's permission, but I don't think he will agree with that.
Still, there are some samples in the forum.
Fa is a musical note to play with CL

jj2007

Greetings from Italy: Our logic here is "everything that is not explicitly forbidden is allowed".

Which is in sharp contrast to German logic: "everything that is not explicitly allowed is forbidden" 8)

Anyway, for Hutch to decide (does SMF have hidden members-only sub-forums?).

Your ideas look interesting, although from my experience self-modifying code is a no-no for performance, as the code cache gets invalidated. But you seem to be aware of that...

Siekmanski

These ideas are not intended for pirate techniques.  :eusa_naughty:

My idea is to reserve process memory and unroll/self-modify the code loops there once.
Not on the fly at each execution, so there is no code cache invalidation.
Then prefetch the code into the cache level you want it in (preventing cache miss penalties).
That way you can take advantage of the different L1/L2/L3 cache sizes of different machines and skip the opcodes you don't need anymore.

This should boost performance.  8)
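
Something like this is what I have in mind - just an untested sketch with made-up names and dummy instructions; the real thing would emit the unrolled inner loop from a template and patch the pointers/immediates per machine:

include \masm32\include\masm32rt.inc

.data
  pCode dd 0                            ; pointer to the generated routine

.code
start:
  ; reserve executable process memory ONCE
    invoke VirtualAlloc, NULL, 4096, MEM_COMMIT or MEM_RESERVE, PAGE_EXECUTE_READWRITE
    mov pCode, eax

  ; emit 8 unrolled "add eax,imm32" instructions (05 xx xx xx xx) and a ret (C3)
    mov edi, eax
    mov ecx, 8
  emit:
    mov byte ptr [edi], 05h             ; opcode for add eax, imm32
    mov dword ptr [edi+1], 1            ; the immediate - this is what gets patched
    add edi, 5
    dec ecx
    jnz emit
    mov byte ptr [edi], 0C3h            ; ret

  ; make sure the CPU sees the freshly written code before jumping into it
    invoke GetCurrentProcess
    invoke FlushInstructionCache, eax, pCode, 4096

    xor eax, eax
    call pCode                          ; run the generated code, eax = 8 afterwards
    inkey str$(eax)
    invoke ExitProcess, 0
end start

Choosing the unroll factor from the detected cache sizes is then only a matter of how many copies you emit.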

@Hutch, do you allow this kind of discussion?
Creative coders use backward thinking techniques as a strategy.

avcaballero

It seems that Hutch has chosen "the German method" to say "no"  :biggrin:

I don't know a thing about this subject, but I think it is an interesting one, so I'd be glad to hear about it.

Regards

FORTRANS

Hi,

Quote from: Siekmanski on February 27, 2016, 10:18:07 AM
Is it legal to post and discuss self-modifying code here at The Masm Forum ?

- To improve performance by unrolling loops into L1/L2/L3 code cache memory ( skipping additions / subtractions of memory data pointers and changing immediate values etc.)

   I confess I do not see how you would improve performance with
self-modifying code.  What information would you make use of at
run time that was not available at compile time?  Even given that I
am not being imaginative enough about such conditions, it would
seem difficult to improve performance by copying things around to
unroll loops or the like.

Quote
- Compressing code to be decompressed and executed at runtime. ( minimizes executable size )

   I have done this using the EXEPACK option of the linker, admittedly
for 16-bit code.  It reduced the executable's size from 63 KB to 25 KB.
It was for the HP 200LX palmtop, where storage can be limited.  So it
has a similar memory footprint to the uncompressed program, but uses
less storage space.
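
   If I remember the old linker's syntax correctly, it is just one extra
switch on the link line (prog is only a placeholder name):

   LINK /EXEPACK prog.obj, prog.exe;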

Regards,

Steve N.

Raistlin

Quote
- To improve performance by unrolling loops into L1/L2/L3 code cache memory ( skipping additions / subtractions of memory data pointers and changing immediate values etc.)

I understand this statement and I am interested in the theory, for the specific reason it was suggested:
I am working on a project that does "just that" -> enumerate the underlying hardware dynamically.
My own path was to look at cache granularity, with L1 code preservation and L1/L2/L3 data optimizations.
This possible feature enhancement would be of use if it proves practical - I for one would like to know more.
If there's a way we can chat about this and keep the riff-raff out of it - I'm game. :t


Are you pondering what I'm pondering? It's time to take over the world ! - let's use ASSEMBLY...

jimg

I used self-modifying code in my sort routines to save space and decrease CPU ticks.  To code for the various possibilities in each type of sort would take many compares and internal jumps to the correct section for the situation, which take time.  The alternative of duplicating the code for each type without the jumps would make the routine many times its size.  Since the routine loops internally millions of times on a large chunk of data to sort, a simple precompiler at the start that changes several compare-type instructions and result handlers makes for a smaller and faster general-purpose routine.
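
Roughly like this, stripped down to a single patch spot - just an untested sketch with made-up names, not the actual sort code (the real routine changes several compares and result handlers the same way):

include \masm32\include\masm32rt.inc

.data
  oldprot dd 0

.code

; returns the larger of its two arguments; after patching, the smaller one
PickOne proc a:DWORD, b:DWORD
    mov eax, a
    mov ecx, b
    cmp eax, ecx
  patch_spot::                          ; double colon so the label is visible outside the proc
    jg  keep                            ; 7Fh xx - the "precompiler" rewrites 7Fh to 7Ch (jl)
    mov eax, ecx
  keep:
    ret
PickOne endp

start:
    invoke PickOne, 3, 7
    print str$(eax), 13, 10             ; 7

  ; the one-time "precompile": make the code writable and flip jg into jl
    invoke VirtualProtect, ADDR patch_spot, 1, PAGE_EXECUTE_READWRITE, ADDR oldprot
    mov byte ptr patch_spot, 7Ch
    invoke VirtualProtect, ADDR patch_spot, 1, oldprot, ADDR oldprot

    invoke PickOne, 3, 7
    inkey str$(eax)                     ; now 3
    invoke ExitProcess, 0
end start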

Besides, it's a lot of fun :)   

Siekmanski

Hi Steve,

The advantage could be to prepare a routine on the fly ONCE, so that it runs as fast as possible on each different architecture.
You only need to know the available instruction sets, cores and cache sizes to construct the "fastest" code for the specific architecture it runs on.
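
The instruction-set part of that check is the easy bit - an untested sketch, only two feature bits shown, and the "building..." messages stand in for the real code generators:

include \masm32\include\masm32rt.inc
.686                                    ; cpuid needs at least a .586 directive

.code
start:
    mov eax, 1
    cpuid
    test ecx, 1 shl 19                  ; CPUID.1:ECX bit 19 = SSE4.1
    jnz build_sse41
    test edx, 1 shl 26                  ; CPUID.1:EDX bit 26 = SSE2
    jnz build_sse2
    print "building the generic routine", 13, 10
    jmp done
  build_sse41:
    print "building the SSE4.1 routine", 13, 10
    jmp done
  build_sse2:
    print "building the SSE2 routine", 13, 10
  done:
    inkey
    invoke ExitProcess, 0
end start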

Hi Raistlin,

This idea came up when I picked up my FFT routines again and wanted them to be as fast as possible.
BTW, I found a way to construct the time domain decomposition table without the need for bit-reversing.....
Special cases for the first 2 Log2 loops and the last Log2 loop.
Special case if Imaginary data is zero or not available if you need Real-Imag output.
Still have to write a Real FFT. ( where only Real data is processed )

So, you now know how many different routines are needed to accomplish the same thing...

Still no answer from Hutch on whether we are allowed to discuss this here.....

Otherwise, PM me your email address to discuss it further.
Creative coders use backward thinking techniques as a strategy.

avcaballero

Here is some sample code. As we don't know yet whether we can talk about it, I won't post the executable.

org 100h
use16
start:
mov al,13h           ; AH is assumed 0 at .COM entry -> set video mode 13h (320x200, 256 colours)
int 10h
mov bl,3
mov si,0a0a0h
mov ds,si            ; DS now points into the VGA frame buffer
again: mov cx,0c8bh  ; bytes B9 8B 0C - entered at again+1 they decode as mov cx,[si]
xor ch,[bx+si]
mov [si+0fec2h],ch
dec si
jnz again+1          ; jump one byte past the "again" label
int 16h ;ah=00h -> wait for a key
xchg ax,bx
int 10h ;ax=0003h -> back to text mode
ret

If you look closer, you will see "jnz again+1". Jumping to one byte after the "again" label makes the loop different:

010C 8B0C             MOV   CX,[SI]
010E 3228             XOR   CH,[BX+SI]
0110 88ACC2FE         MOV   [FEC2+SI],CH
0114 4E               DEC   SI
0115 75F5             JNZ    010C

If you compile and run the code (up to you) you will see a Sierpinski triangle fractal, in just 29 bytes.

nidud

#11
deleted

Siekmanski

Quote from: nidud on March 01, 2016, 03:48:58 AM
The conventional way will be hardware specific DLL-files where the application loads the appropriate ones on start-up.

If you think about it, having multiple versions of the same function and selecting one based on available hardware at runtime will create a larger EXE (at least in memory) than loading a DLL.

With regards to self-modifying code this forum is loaded with samples using VirtualProtect() to gain write access to the code segment so I wouldn't worry too much about that  :P

:biggrin: I just wanted to be sure out of respect for this great forum.


You need a lot of DLL files for all the possible combinations of caches and instruction sets.

http://www.sandpile.org/x86/cpuid.htm#level_0000_0002h

This is why I want to create streamlined versions from the basic routines in memory.
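
For the L2 size alone you don't even need the leaf-2 descriptor tables from that page; extended leaf 80000006h reports it directly. Untested sketch - a real program would first check leaf 80000000h for the highest supported extended leaf:

include \masm32\include\masm32rt.inc
.686

.code
start:
    mov eax, 80000006h                  ; extended leaf: L2 cache and TLB information
    cpuid
    mov eax, ecx
    shr eax, 16                         ; ECX[31:16] = L2 cache size in KB
    print "L2 cache size in KB: "
    inkey str$(eax)
    invoke ExitProcess, 0
end start
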
Creative coders use backward thinking techniques as a strategy.

jj2007

Quote from: Siekmanski on March 01, 2016, 07:27:10 AM
You need a lot of DLL files for all the possible combinations of caches and instruction sets.

One should expect, for example, that gdi32.dll and gdiplus.dll look different for major families of cpus, such as amd/intel i?/intel celeron etc ::)

Raistlin

Quote
BTW, I found a way to construct the time domain decomposition table without the need for bit-reversing.....

Probably shouldn't waste a post like this - BUT AWESOME! - I'm all ears
Are you pondering what I'm pondering? It's time to take over the world ! - let's use ASSEMBLY...