Use threads when traversing a tree


1

I would like to speed up the process of traversing a tree. Here is an example of a node:

    class Node
    {
        public List<Node> Children { get; set; }
        public int SomeProperty { get; set; }
        public String SomeOtherProperty { get; set; }
    }

The way I traverse the tree is:

    static void TraverseTree(Node ParentNode)
    {
        if (ParentNode.Children == null)
            return;

        foreach (var child in ParentNode.Children)
        {
            TraverseTree(child);               
        }
    }

Getting ParentNode.Children takes about 1 millisecond, because a Node actually represents a file or a directory; I just used this example of a node to illustrate my point better.

So, if you think about it: if the first node has 4 children and each of those children has 10,000,000 descendants, we could speed up this traversal by traversing each of those 4 children on a separate thread, taking advantage of parallel programming. If that were the scenario, I would have taken that approach. But if I don't know the structure of the tree in advance, how could I do it?

I have been thinking about:

1) Start traversing the tree, place the first 10 nodes that have children on a stack, then start the traversal of each on a separate thread.

2) Do something like:

    static void TraverseTree(Node ParentNode)
    {
        if (ParentNode.Children == null)
            return;

        foreach (var child in ParentNode.Children)
        {
            // Note: this method returns before the queued work items finish,
            // and under C# 4 the foreach variable is shared across iterations,
            // so the captured `child` can change underneath the lambda;
            // both are likely sources of the "strange results" mentioned below.
            ThreadPool.QueueUserWorkItem(new WaitCallback((x) =>
            {
                TraverseTree(child);
            }), null);
        }
    }

This often gives me strange results, but it is significantly faster.
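
For reference, here is a sketch (not the original code) of one way to make option 2 wait for its workers, assuming .NET 4's CountdownEvent; the method name TraverseTreeWaiting and the per-iteration copy are illustrative:

    static void TraverseTreeWaiting(Node parentNode)
    {
        if (parentNode.Children == null)
            return;

        using (var done = new CountdownEvent(parentNode.Children.Count))
        {
            foreach (var child in parentNode.Children)
            {
                var c = child; // per-iteration copy (C# 4 closure semantics)
                ThreadPool.QueueUserWorkItem(_ =>
                {
                    try { TraverseTree(c); }   // the sequential version above
                    finally { done.Signal(); } // count this subtree as done
                });
            }
            done.Wait(); // block until every child subtree has been processed
        }
    }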


Results

Using tasks improved the speed of the algorithm by roughly 48% (from 5.81 s down to 3.01 s). Here are the results:

Scanning my whole C:\ drive took about 5.81 seconds with the following algorithm:

    // directoryPath = "C:\"
    var now = DateTime.Now;

    // a single task scans the whole drive
    Task<List<ScanItem>> t1 = new Task<List<ScanItem>>(() =>
    {
        return GetAllFilesInDirectory(directoryPath);
    });

    t1.Start();
    t1.Wait();

    var done = DateTime.Now - now;  // done = 5.81 seconds on average

Scanning my whole C:\ drive took about 3.01 seconds with the following algorithm:

    // directoryPath = "C:\"
    var now = DateTime.Now;

    // get all the top-level directories on the C: drive
    var directories = Directory.GetDirectories(directoryPath);

    // directories = 17 directories: inetpub, MSOCache, PerfLogs, Program Files, Program Files (x86), etc.

    Task<List<ScanItem>>[] myTasks = new Task<List<ScanItem>>[directories.Length];

    // create a task for each directory on the C:\ drive
    for (int k = 0; k < myTasks.Length; k++)
    {
        var currentDir = directories[k];
        myTasks[k] = new Task<List<ScanItem>>(() =>
        {
            return GetAllFilesInDirectory(currentDir);
        });
    }

    // start all the tasks
    for (int k = 0; k < myTasks.Length; k++)
        myTasks[k].Start();

    Task.WaitAll(myTasks); // wait for all tasks to finish

    var done = DateTime.Now - now;  // about 3.01 seconds on average
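
As an aside, DateTime.Now has fairly coarse resolution (on the order of 15 ms on Windows); System.Diagnostics.Stopwatch is the usual tool for timings like these. A sketch of the same measurement:

    var sw = System.Diagnostics.Stopwatch.StartNew();
    // ... run either scanning algorithm above ...
    sw.Stop();
    var seconds = sw.Elapsed.TotalSeconds; // high-resolution elapsed time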

If I traverse the resulting list, the first algorithm returns 318,222 files and directories (which is the correct number); the second algorithm returns 318,195, which is very close. I don't understand why, though...

I am testing this on a computer that has 8 cores. Maybe if I were to run this on a computer with 2 cores, using one task would be faster than creating these 17 tasks.

If you want to know what algorithm I use to get the files that fast, look at https://stackoverflow.com/a/724184/637142

2012-04-05 16:14
by Tono Nam
It's worth noting that there is overhead associated with parallelizing the traversal. If you're doing a lot with each node you can see some fairly big gains; but if you're doing very little CPU work per node then the overhead of creating and scheduling new tasks can actually hurt more than it helps - Servy 2012-04-05 16:46
Are you forgetting to add the files in the top-level directory to the 318,195? - johnnycrash 2012-04-05 23:00
Yeah, that is the reason I got different results... I was forgetting to add the directories and files in the C:\ dir itself - Tono Nam 2012-04-06 15:11
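
For reference, a sketch of the fix described in that last comment, reusing the myTasks and directories variables from the second algorithm (requires System.Linq): count the files and directories sitting directly under C:\ in addition to the per-directory task results.

    // total = per-directory results + the items directly under the root
    int total = myTasks.Sum(t => t.Result.Count)          // results of the 17 tasks
              + Directory.GetFiles(directoryPath).Length  // files right under C:\
              + directories.Length;                       // the 17 directories themselves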


12

Use the Task Parallel Library, rather than rolling your own parallelism code. It is ideally suited to solve this sort of problem.

The way the TPL works is rather than you assigning threads to a problem, you simply break up the problem into "tasks" and let the TPL take care of figuring out how to parallelize the work amongst a pool of available workers. Just make a task for each sub-branch of the tree; those tasks can in turn spawn off tasks of their own for their sub-branches. The TPL will assign threads out of a pool until the processors are saturated.
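For instance, here is a minimal sketch of that idea using the Node class from the question (Process is a hypothetical per-node work method). TaskCreationOptions.AttachedToParent makes each parent task wait for its children, so waiting on the root task waits for the entire tree:

    static void TraverseTreeParallel(Node node)
    {
        Process(node); // hypothetical per-node work

        if (node.Children == null)
            return;

        foreach (var child in node.Children)
        {
            var c = child; // per-iteration copy (C# 4 closure semantics)
            Task.Factory.StartNew(() => TraverseTreeParallel(c),
                CancellationToken.None,
                TaskCreationOptions.AttachedToParent, // parent waits for children
                TaskScheduler.Default);
        }
    }

    // usage: wrap the root call in a task so child tasks have a parent to attach to
    var root = Task.Factory.StartNew(() => TraverseTreeParallel(rootNode));
    root.Wait(); // returns only after every attached descendant task completes
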

Because of this, it is important to let the TPL know whether your tasks are going to be gated on the CPU or the I/O:

  • If the tasks are CPU-bound then the TPL will assign one pooled thread per CPU and make the other tasks wait until there is a core available; that maximizes throughput and saturates all the processors. That is exactly what you want: if you bought a machine with four processors and two of them are idle then you paid for two cores that you're not using.

  • If a single task is I/O bound then you can use the LongRunning option when creating the task to indicate to the TPL that this task should not consume an entire core; other tasks should be given a turn at that core. (A sketch follows this list.)

  • If, as seems to be the case here, you have many I/O-bound tasks then you should consider using a TaskCompletionSource instead, as that allows for more efficient use of "continuation" callbacks. Consider also using the new async/await feature of C# 5 to schedule continuations; it affords a far more pleasant way of writing asynchronous code.
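
To illustrate the LongRunning hint, a sketch reusing the author's GetAllFilesInDirectory helper:

    // LongRunning tells the scheduler this task will mostly block on I/O,
    // so it gets a dedicated thread instead of tying up a thread-pool core.
    var scan = Task.Factory.StartNew(
        () => GetAllFilesInDirectory(directoryPath),
        CancellationToken.None,
        TaskCreationOptions.LongRunning,
        TaskScheduler.Default);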

And of course, do not forget that if the problem actually is saturating the I/O capability of the machine then no amount of processor parallelism is going to make a dent. If you're filling a swimming pool, adding more hoses to the same faucet doesn't increase the flow through that faucet.

2012-04-05 16:38
by Eric Lippert
Make sure that you give hints to the TPL about whether the task is going to be processor bound - How? - ebb 2012-04-05 17:36
@ebb: I've clarified the paragraph - Eric Lippert 2012-04-05 17:49


2

If you want to traverse a tree in parallel, you must:

  • Partition the tree such that separate threads are guaranteed to work on different parts of the tree (e.g. starting at the root, you could assign descendant subtrees to new threads until the maximum degree of parallelism has been reached; see the sketch after this list).
  • Ensure that your tree structure can safely be traversed by multiple threads in general (i.e. traversing does not cause state-altering side effects in the tree implementation).
  • Ensure that no thread is updating the tree during traversal.
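
A minimal sketch of that root-level partitioning, assuming .NET 4's Parallel.ForEach and the TraverseTree method from the question:

    // Each top-level subtree is handed to one worker; within a subtree the
    // traversal stays sequential, so no two workers touch the same nodes.
    Parallel.ForEach(
        rootNode.Children,
        new ParallelOptions { MaxDegreeOfParallelism = Environment.ProcessorCount },
        subtree => TraverseTree(subtree));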

If you are getting "strange results", one of the above is probably not true. Keep in mind that the order in which nodes are traversed is non-deterministic in the multi-threaded example. Did you account for that when declaring the results "strange"?

Even so:

  • In the directory example, you may well end up with I/O contention limiting the effectiveness of your multithreading approach.
  • Traversing in-memory nodes will tend to kick things out of the cache, reducing the return on investment of using multiple threads (false sharing).

2012-04-05 16:19
by Eric J.
So that's why the first time I traverse my C:\ it takes about 40 seconds and the second time about 8 seconds? Is that what you mean by the cache? And if I were to traverse each of the folders in the C:\ drive separately, would that slow down the process? - Tono Nam 2012-04-05 16:25
@Tono: The second time you traverse your file system, yes, information about the directory structure will be cached by the operating system. With an in-memory node collection (i.e. one that does not have to go to disk), there's a similar issue in that certain memory lines will be in the CPU cache. Accessing them serially is then potentially much faster than accessing them randomly. So, multithreading can actually slow things down if your workload isn't CPU intensive - Eric J. 2012-04-05 16:27
I placed the results in my edit... I am interested in testing this out on a computer that has just 1 core - Tono Nam 2012-04-05 17:59


1

Keep in mind that multithreading is only useful when your application is using 100% of the CPU time on a single core; if CPU usage is low (because it is waiting on the hard drive or the network), you won't see any benefit from running your code in parallel.

2012-04-05 16:29
by Lâm Tran Duy


-1

Recently I had to create an algorithm that can discover a huge tree structure (a file system, actually, but it could be anything) and execute an async operation on every item. I came up with a small library (built using the .NET TPL and a concurrent queue) that can do that:

  • discovers a huge tree in parallel
  • parent items are always processed before children
  • resource usage depends on the given max degree of parallelism, not on tree size
  • works asynchronously

Parallel async TreeWalker

2016-07-24 19:33
by Miklós Tóth