I am currently writing a bulk-processing algorithm for pitch detection in audio streamed from disk. I have tightened up my algorithm so that it runs in nearly real time on serially streamed data.
Ideally I'd like the system to run much faster than real time, so that I could hand it real-time data and, after a short delay, be generating the pitch-track data.
Now what strikes me is that the serial processing of the data is where I could gain a great deal of speedup. I'm running on a quad-core i7 (with 8 hardware threads), so I ought to be able to improve the speed significantly by spreading the processing across multiple blocks.
As it stands, I currently do the following:
Now it strikes me that once I have a window I could easily copy that data into a given thread work buffer (as well as providing a memory location the result will be written to). This way I could, effectively buffer up to 7 (Leave thread 8 open to pump the data) threads worth of data that a thread pool would then process.
When I try to submit the 8th window of audio I want the pool to block until a thread is available to process the data and so on. The idea being that I would keep 7 threads constantly working processing the data. From previous experience I would expect to see about 5x speed up from doing this.
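The blocking-submit behaviour described above can be sketched with a bounded `BlockingCollection<T>`; this is only one way to get that semantics, and `ProcessWindow` here is a hypothetical stand-in for the pitch-detection kernel:

```csharp
using System.Collections.Concurrent;
using System.Threading.Tasks;

public class WindowPool
{
    // Bounded to 7 entries: Add() blocks once 7 windows are queued,
    // giving exactly the back-pressure described above.
    static readonly BlockingCollection<float[]> queue =
        new BlockingCollection<float[]>(boundedCapacity: 7);

    public static void Main()
    {
        // 7 consumer threads, leaving one hardware thread for the producer.
        var workers = new Task[7];
        for (int i = 0; i < workers.Length; i++)
            workers[i] = Task.Factory.StartNew(() =>
            {
                foreach (var window in queue.GetConsumingEnumerable())
                    ProcessWindow(window);   // hypothetical pitch-detection kernel
            }, TaskCreationOptions.LongRunning);

        // Producer: would stream windows from disk (stubbed here).
        for (int w = 0; w < 100; w++)
            queue.Add(new float[1024]);      // blocks while 7 items are pending

        queue.CompleteAdding();              // lets consumers drain and exit
        Task.WaitAll(workers);
    }

    static void ProcessWindow(float[] window) { /* placeholder */ }
}
```

`GetConsumingEnumerable` handles the wake-up/shutdown logic, so no hand-rolled lock-free queue is needed for this throughput level.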
In the past I wrote my own task-based system in C++ that would do the job perfectly, but this app is being developed in C#. To get good parallelism with low overhead under C++, I spent a significant amount of time building a good lock-free queueing mechanism.
I was rather hoping that, under C#, someone would have taken the pain out of doing this for me. However, I can't find anything that seems to work. I've looked at System.Threading.ThreadPool, and it appears to have no way of checking how many threads are currently in action; the overhead also seems prohibitive. The bigger problem is that I can't re-use an existing pre-allocated structure (which is important in my processing), forcing me to re-create it each time I submit a work item. This has the huge disadvantage that I then generate work faster than I can process it, so not only do I waste tonnes of time setting up structures and workspaces that I really ought not to need, but my memory usage spirals out of control.
I then found System.Threading.Tasks, but that doesn't seem to offer the functionality I'm after either.
I suppose I could just use my C++ task manager via interop, but I really assumed that in this day and age someone would already have built something similar. So am I missing something? Or can anyone point me to such a task-management engine?
The Task Parallel Library was designed and implemented for exactly the kind of task you are trying to solve! You can also pipeline this process.
So you have to ensure:
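As a minimal sketch of the TPL approach: `Parallel.ForEach` with a capped degree of parallelism pulls work on demand, so the producer is throttled by the consumers. `ReadWindows` and `ProcessWindow` are hypothetical stand-ins for your streaming and analysis code:

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;

public class TplExample
{
    public static void Main()
    {
        // Cap parallelism at 7 so one hardware thread stays free for I/O.
        var options = new ParallelOptions { MaxDegreeOfParallelism = 7 };

        // The loop pulls windows from the enumerable as workers free up,
        // so work is never generated faster than it can be processed.
        Parallel.ForEach(ReadWindows(), options, window =>
        {
            ProcessWindow(window);   // hypothetical per-window analysis
        });
    }

    // Stand-in for streaming windows from disk.
    static IEnumerable<float[]> ReadWindows()
    {
        for (int i = 0; i < 100; i++)
            yield return new float[1024];
    }

    static void ProcessWindow(float[] window) { /* placeholder */ }
}
```

Note that the default partitioner may buffer chunks of the enumerable; `Partitioner.Create` with `EnumerablePartitionerOptions.NoBuffering` can disable that if strict one-at-a-time hand-off matters.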
Well, as always in these cases, I recommend using ZeroMQ. It'll allow you to control the number of consumers quite easily.
As for your scratch-pad areas: first, 0.5 GB is not a lot of memory these days. I think my phone has more RAM than that, let alone my desktop... If you want to go really easy on memory consumption, just create one scratch-pad area per thread, put all of them in a pool, and have the producer grab a scratch-pad area before queuing a task, attaching that area to the task. When the consumer is done, it returns the scratch-pad area to the pool.
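That pool idea can be sketched in a few lines; a `BlockingCollection<T>` doubles as the pool, so `Take` blocks (throttling the producer) whenever every pre-allocated scratch-pad is in use. The class name and sizes are illustrative:

```csharp
using System.Collections.Concurrent;

// One pre-allocated scratch-pad per worker, recycled through a pool,
// so no per-task allocation ever happens.
public class ScratchPool
{
    readonly BlockingCollection<float[]> pool = new BlockingCollection<float[]>();

    public ScratchPool(int count, int size)
    {
        for (int i = 0; i < count; i++)
            pool.Add(new float[size]);   // allocate once, up front
    }

    // Blocks when all scratch-pads are in use.
    public float[] Take()
    {
        return pool.Take();
    }

    // Consumer hands the buffer back when done.
    public void Return(float[] pad)
    {
        pool.Add(pad);
    }
}
```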
I would use the TPL Dataflow library here. It's designed to allow the creation of processing blocks that can be chained together, with explicit control over the degree of parallelism and blocking semantics.
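A rough sketch of what that looks like, assuming a hypothetical `DetectPitch` kernel (the Dataflow blocks themselves are the library's real API; the package is `System.Threading.Tasks.Dataflow`):

```csharp
using System;
using System.Threading.Tasks.Dataflow;

public class DataflowExample
{
    public static void Main()
    {
        // Analysis stage: up to 7 windows in flight; BoundedCapacity caps
        // the input queue so SendAsync applies back-pressure when full.
        var analyse = new TransformBlock<float[], double>(
            window => DetectPitch(window),   // hypothetical pitch kernel
            new ExecutionDataflowBlockOptions
            {
                MaxDegreeOfParallelism = 7,
                BoundedCapacity = 7
            });

        // Collection stage: single-threaded; Dataflow preserves window order.
        var collect = new ActionBlock<double>(pitch =>
            Console.WriteLine(pitch));

        analyse.LinkTo(collect,
            new DataflowLinkOptions { PropagateCompletion = true });

        for (int i = 0; i < 100; i++)
            analyse.SendAsync(new float[1024]).Wait();  // blocks when pipeline is full

        analyse.Complete();
        collect.Completion.Wait();
    }

    static double DetectPitch(float[] window) { return 0.0; }
}
```

A nice property for pitch tracking: `TransformBlock` emits results in the order the windows were posted, even though they are analysed in parallel.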