For my implementation of multi-core support, I basically took the following approach:
1) Create a thread-pool of size N (N == logical core count).
2) The threads themselves run a custom user-space scheduler, so when I need a thread I don't pay the overhead of creating one: I just assign a task, and a sleeping worker wakes up and runs it.
This has yielded huge performance gains for me, especially when you want to fire off a bunch of parallel tasks at a high rate (i.e., short-lived work, not long-running threads).
3) The threads are pinned to every even processor first, then every odd one, so that on a machine with 4 physical cores + 4 HT cores, a request for 4 parallel tasks lands on the physical cores; the HT siblings only come into play once the number of tasks exceeds the number of physical cores.
4) The locking and synchronization between these threads and the custom scheduler is implemented primarily with user-mode spinlocks and atomic operations, which I've found to be considerably faster than the kernel primitives (see the sketch after this list).
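Roughly, the shape of it looks like the sketch below. This is not my actual code, just a minimal C++ illustration of points 1-4: a pool of workers, a spinlock-and-atomics task queue, and even-cores-first pinning. `Spinlock`, `TaskPool`, and `cpu_for` are made-up names, it assumes Linux/pthreads for the affinity calls, and it assumes even logical CPU indices map to distinct physical cores first, which depends on your OS/BIOS topology.

```cpp
// Minimal sketch of points 1-4, not my actual code. Assumes Linux/pthreads
// for affinity, and assumes even logical CPU indices land on distinct
// physical cores -- that mapping varies by OS/BIOS, so check your
// machine's topology before relying on it.
#include <atomic>
#include <functional>
#include <queue>
#include <thread>
#include <vector>
#include <pthread.h>
#include <sched.h>

// Point 4: a user-mode spinlock built on a single atomic flag.
class Spinlock {
    std::atomic_flag flag_ = ATOMIC_FLAG_INIT;
public:
    void lock()   { while (flag_.test_and_set(std::memory_order_acquire)) {} }
    void unlock() { flag_.clear(std::memory_order_release); }
};

class TaskPool {
public:
    // Point 1: one worker per logical core by default.
    explicit TaskPool(unsigned n = std::thread::hardware_concurrency()) {
        for (unsigned i = 0; i < n; ++i) {
            workers_.emplace_back([this] { run(); });
            pin(workers_.back(), cpu_for(i, n));  // point 3: even cores first
        }
    }
    ~TaskPool() {
        stop_.store(true, std::memory_order_release);
        for (auto& w : workers_) w.join();
    }
    // Point 2: assigning a task is just a push; a spinning worker picks it up.
    void submit(std::function<void()> task) {
        lock_.lock();
        tasks_.push(std::move(task));
        lock_.unlock();
    }

private:
    // Worker i -> logical CPU 0,2,4,... then 1,3,5,...
    static unsigned cpu_for(unsigned i, unsigned n) {
        unsigned evens = (n + 1) / 2;
        return i < evens ? 2 * i : 2 * (i - evens) + 1;
    }
    static void pin(std::thread& t, unsigned cpu) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        pthread_setaffinity_np(t.native_handle(), sizeof(set), &set);
    }
    void run() {
        // Workers poll the queue instead of blocking in the kernel; real code
        // would back off harder (pause/yield) to avoid burning idle cores.
        while (!stop_.load(std::memory_order_acquire)) {
            std::function<void()> task;
            lock_.lock();
            if (!tasks_.empty()) { task = std::move(tasks_.front()); tasks_.pop(); }
            lock_.unlock();
            if (task) task(); else std::this_thread::yield();
        }
    }

    std::vector<std::thread> workers_;
    std::queue<std::function<void()>> tasks_;
    Spinlock lock_;
    std::atomic<bool> stop_{false};
};
```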
One way to use this setup dynamically, for example during a rendering loop, is to start some tasks with, say, 4 threads, then every 5 iterations raise the thread count until the frame time in ms is no longer dropping.
That way you know you've reached the limit for the particular machine, given its RAM, cores, latency, and whatever else may be at play. It's basically run-time profiling to determine the optimal thread count, as in the sketch below.
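A hypothetical version of that tuning loop; `tuneThreadCount` and `renderFrame` are made-up names, and `renderFrame` is assumed to render one frame using the given number of worker threads. Averaging over 5 frames per step smooths out per-frame noise before deciding whether the extra thread helped.

```cpp
// Hypothetical auto-tuning helper, not my actual code; tuneThreadCount and
// renderFrame are made-up names. renderFrame(t) is assumed to render one
// frame using t worker threads.
#include <chrono>

template <typename RenderFn>
unsigned tuneThreadCount(RenderFn renderFrame, unsigned maxThreads) {
    using clock = std::chrono::steady_clock;
    unsigned threads = 4;       // starting thread count from the text
    double best = 1e300;        // best average frame time so far, in ms
    while (threads <= maxThreads) {
        auto t0 = clock::now();
        for (int i = 0; i < 5; ++i)           // 5 iterations per step
            renderFrame(threads);
        double ms = std::chrono::duration<double, std::milli>(
                        clock::now() - t0).count() / 5.0;
        if (ms >= best) break;  // frame time no longer dropping: stop
        best = ms;
        ++threads;              // otherwise try one more thread
    }
    return threads - 1;         // last count that still improved things
}
```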