Parallell mixer
The idea behind implementing a parallell mixer is to utilize all possible CPU power on systems that support it. The Buzé engine supports switching mixers in realtime, and can be toggled from View -> Preferences -> Player -> Use parallell mixer.
The concept of "parallell tree processing" is obviously not new. I found and read from these papers on this subject:
- Tree Traversal Scheduling: A Global Scheduling Technique.. - Zhou, Jennings, Conte (2001)
- Treegion scheduling for highly parallel processors - S. Banerjia, W. A. Havanki, and T. M. Conte (1997)
Although these papers were on a wholly different subject, the techniques they describe gave meat to my own ideas.
The technique that is currently used can be described in two steps.
The first step assigns each machine a superblock ID (or a "subtask", or just "task") similar to the groupings displayed in this image:
http://www.batman.no/images/parallelldiagram.gif
The head of each superblock is put in a task list ordered by their distance from the master. Those farthest away will be processed first. Every time something changes in the song tree, such as a machine is added, deleted, disconnected etc, the superblock mappings must be re-calculated.
The second step is what happens during playback. The mixer sets up as many threads as necessary. For every frame, each thread requests and works superblocks until the task list is empty and the entire tree is processed. When a subtask has an effect/master as head machine, the it may be necessary to wait for the heads inputs to finish mixing.
Pseudoish code:
- Precalculating
- find depths for all machines, i.e how many connections there are to the master
- run an algorithm to find superblocks in any given tree of machines
- find the generator with the higest depth
- add machine to tasklist
- machine.superblock=next_id
- for each output in machine
- if outputmachine has only one input, recurse from 3 with outputmachine as machine and next_id unchanged, else
- if outputmachine has more than one input, count how many of these inputs are not assigned a superblock if the count is zero, add this machine to the tasklist and recurse from 3 with outputmachine as machine and next_id+1
- return next_id so recursions know which superblock-id is next available
- Multi-threaded traversal
- order the task list by depth, we now know which order to process
- each parallell thread enters a loop:
- retreive the deepest task (i.e a generator, later entries are also effects with inputs>1)
- wait for all inputs to signal they're done (generators skip this step)
- find each of the endpoints in the superblock of this machine
- mix each of them
- goto 1 until tasklist is empty
Unfortunately, unofficial benchmarks show there is nearly no advantage of the current implementation because of the parallellization overhead itself. It is unclear whether all work must be discarded or the answer is optimization. Here are some links on the subject:
From AMD - http://developer.amd.com/articles.aspx?id=4&num=5 :
To improve the utilization of the L2 cache on a dual-core processor (or frankly, in any multi-threaded system), avoid any instances of having multiple threads write to the same variables; this will require that all caches reload the data from main memory. Also, avoid having one thread frequently writing to a memory location that another thread is using.
From MSDN - http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dndllpro/html/msdn_threadli.asp :
As a final note, please remember that threads are a system resource that is not free; thus, there may be a greater penalty to multithreading than performance hits alone. As a rule of thumb, I would argue that you employ multithreading intelligently and conservatively. Use threads where they can benefit your application design, but avoid them wherever serial execution can achieve the same effect.
About buffer sizes and hyperthreading: http://www.intel.com/cd/ids/developer/asmo-na/eng/20461.htm?prn=Y
