No speedup in multithreaded program



I was playing with Go language concurrency and found something which is kinda opaque to me.

I wrote a parallel matrix multiplication, that is, each task computes a single row of the product matrix by multiplying the corresponding rows and columns of the source matrices.

Here is the Java program:

public static double[][] parallelMultiply(int nthreads, final double[][] m1, final double[][] m2) {
    final int n = m1.length, m = m1[0].length, l = m2[0].length;
    assert m1[0].length == m2.length;

    double[][] r = new double[n][];

    ExecutorService e = Executors.newFixedThreadPool(nthreads);
    List<Future<double[]>> results = new LinkedList<Future<double[]>>();
    for (int ii = 0; ii < n; ++ii) {
        final int i = ii;
        Future<double[]> result = e.submit(new Callable<double[]>() {
            public double[] call() throws Exception {
                double[] row = new double[l];
                for (int j = 0; j < l; ++j) {
                    for (int k = 0; k < m; ++k) {
                        row[j] += m1[i][k]*m2[k][j];
                    }
                }
                return row;
            }
        });
        results.add(result);
    }
    try {
        e.shutdown();
        e.awaitTermination(1, TimeUnit.HOURS);
        int i = 0;
        for (Future<double[]> result : results) {
            r[i] = result.get();
            ++i;
        }
    } catch (Exception ex) {
        ex.printStackTrace();
        return null;
    }

    return r;
}

and this is the Go program:

type Matrix struct {
    n, m int
    data [][]float64
}

func New(n, m int) *Matrix {
    data := make([][]float64, n)
    for i := range data {
        data[i] = make([]float64, m)
    }
    return &Matrix{n, m, data}
}

func (m *Matrix) Get(i, j int) float64 {
    return m.data[i][j]
}

func (m *Matrix) Set(i, j int, v float64) {
    m.data[i][j] = v
}

func MultiplyParallel(m1, m2 *Matrix) *Matrix {
    r := New(m1.n, m2.m)

    c := make(chan interface{}, m1.n)
    for i := 0; i < m1.n; i++ {
        go func(i int) {
            innerLoop(r, m1, m2, i)
            c <- nil
        }(i)
    }

    for i := 0; i < m1.n; i++ {
        <-c
    }

    return r
}

func innerLoop(r, m1, m2 *Matrix, i int) {
    for j := 0; j < m2.m; j++ {
        s := 0.0
        for k := 0; k < m1.m; k++ {
            s = s + m1.Get(i, k) * m2.Get(k, j)
        }
        r.Set(i, j, s)
    }
}
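
The complete program is not shown here; a minimal, hypothetical driver for the code above (the 500x500 size and random fill are assumptions for illustration) could look like this:

// Hypothetical minimal driver for the code above: fills two square
// matrices with random values and multiplies them.
package main

import (
    "fmt"
    "math/rand"
)

func randomMatrix(n, m int) *Matrix {
    a := New(n, m)
    for i := 0; i < n; i++ {
        for j := 0; j < m; j++ {
            a.Set(i, j, rand.Float64())
        }
    }
    return a
}

func main() {
    const n = 500
    m1 := randomMatrix(n, n)
    m2 := randomMatrix(n, n)
    r := MultiplyParallel(m1, m2)
    fmt.Println("r[0][0] =", r.Get(0, 0))
}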

When I run the Java program with nthreads=1 and nthreads=2, there is nearly a twofold speedup on my dual-core N450 Atom netbook. When I run the Go program with GOMAXPROCS=1 and GOMAXPROCS=2, there is no speedup at all!
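
To rule out the environment variable simply not being picked up, the effective setting can also be checked from inside the program; a small sketch, assuming the standard runtime and fmt packages are imported:

func init() {
    // runtime.GOMAXPROCS(0) reports the current setting without changing it;
    // runtime.NumCPU() reports the number of logical CPUs on the machine.
    fmt.Println("GOMAXPROCS:", runtime.GOMAXPROCS(0), "NumCPU:", runtime.NumCPU())
}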

Even though the Java code uses additional storage for the Futures and then collects their values into the result matrix, instead of updating the array directly in the worker code (which is what the Go version does), it performs much faster on several cores than the Go version.

What is especially odd is that the Go version with GOMAXPROCS=2 loads both cores (htop displays 100% load on both processors while the program runs), yet the computation time is the same as with GOMAXPROCS=1 (where htop displays 100% load on only one core).

Another concern is that the Java program is faster than the Go one even in the simple single-threaded multiplication, but that is not exactly unexpected (taking the benchmarks from here into account) and should not affect the multicore scaling factor.

What am I doing incorrectly here? Is there a way to speed up the Go program?

UPD: it seems I found what I was doing incorrectly. I was measuring the Java program's time using System.currentTimeMillis() and the Go program's time using the time shell command. I mistakenly took the 'user' time from the zsh output as the program's running time instead of the 'total' one. Now I have double-checked the computation speed and it shows a nearly twofold speedup as well (though it is slightly smaller than Java's):

% time env GOMAXPROCS=2 ./4-2-go -n 500 -q
env GOMAXPROCS=2 ./4-2-go -n 500 -q  22,34s user 0,04s system 99% cpu 22,483 total
% time env GOMAXPROCS=2 ./4-2-go -n 500 -q -p
env GOMAXPROCS=2 ./4-2-go -n 500 -q -p  24,09s user 0,10s system 184% cpu 13,080 total

Seems I have to be more attentive.
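
A more robust way is to measure wall-clock time inside the program itself; a small sketch, assuming the standard fmt and time packages are imported and m1/m2 are the already-built matrices:

// Reports wall-clock (elapsed) time, which is what the 'total' column of
// the zsh output corresponds to, rather than accumulated CPU ('user') time.
func timedMultiply(m1, m2 *Matrix) *Matrix {
    start := time.Now()
    r := MultiplyParallel(m1, m2)
    fmt.Printf("MultiplyParallel took %v\n", time.Since(start))
    return r
}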

Still, the Java program runs about five times faster on the same case. But that is a matter for another question, I think.

2012-04-04 18:19
by Vladimir Matveev
I compiled and ran your code and did not observe this. Instead, for a single multiply of a 1000x1000 matrix, I saw the expected speedup. You should post a complete program and give details of how you are timing it - Sonia 2012-04-04 18:56
Thank you for your answer. Please see my update in the question - Vladimir Matveev 2012-04-05 07:18



You are probably experiencing the effects of false sharing. In a nutshell, if two pieces of data happen to fall onto the same CPU cache line, modifying these two pieces of data from threads that execute on different CPU cores will trigger the expensive cache coherency protocol.

This kind of cache "ping-pong" is extremely hard to diagnose, and can happen on logically completely unrelated data, just because they happen to be placed close enough in memory. The 100% CPU load is typical of false sharing - your cores really are working 100%, they are just not working on your program - they are working on synchronizing their caches.

The fact that in the Java program you have thread-private data until the time comes to "integrate" it into the final result is what saves you from false sharing. I'm not familiar with Go, but judging by your own words, the threads are writing directly to the common array, which is exactly the kind of thing that could trigger false sharing. This is an example of how perfectly valid single-threaded reasoning does exactly the opposite in a multi-threaded environment!
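
Roughly, and treating this as an untested sketch of the pattern rather than a verified fix, the "thread-private until the end" approach would translate to the question's Go code like this:

// Sketch of the "thread-private until the end" pattern, reusing the Matrix
// type from the question: the inner loops write only to a goroutine-local
// slice, and the shared result matrix is touched once per row, at the end.
func innerLoopPrivate(r, m1, m2 *Matrix, i int) {
    row := make([]float64, m2.m) // local to this goroutine
    for j := 0; j < m2.m; j++ {
        s := 0.0
        for k := 0; k < m1.m; k++ {
            s += m1.Get(i, k) * m2.Get(k, j)
        }
        row[j] = s
    }
    copy(r.data[i], row) // one burst of writes to shared memory per row
}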

For more in-depth discussion on the topic, I warmly recommend Herb Sutter's article: Eliminate False Sharing, or a lecture: Machine Architecture: Things Your Programming Language Never Told You (and associated PDF slides).

2012-04-04 18:53
by Branko Dimitrijevic
Thank you very much for your detailed answer. As follows from my update to the question, it is likely that this is not my case (except that the Go program is still 5 times slower than Java, both parallel and non-parallel), but I will still mark your answer as accepted - Vladimir Matveev 2012-04-05 07:21
@Branko See "Anchoring" at http://en.wikipedia.org/wiki/List_of_cognitive_biases - NoName 2012-04-06 07:15
@Atom This wasn't just an instinctive reaction to a vaguely familiar situation. There were telltale signs of false sharing: zero scaling, 100% CPU utilization, separate vs. "near" data. This was all done before I realized the question was edited and the OP admitted making a measurement mistake. But even with that mistake, there is still less-than-perfect scaling, so I'm probably still right after all - Branko Dimitrijevic 2012-04-06 08:13
@googolplex You still have less-than-perfect scaling. The false sharing might be playing a role after all - Branko Dimitrijevic 2012-04-06 08:15
@Branko In the Go code, there is just 1 r.Set(i, j, s) per m1.m iterations of the innermost loop. When m1.m is about 500 (as specified in the question) then false sharing shouldn't be a major problem. In the case of false sharing r.data[i][j] is in the CPU's cache, not in memory - so handling the conflict can be expected to be reasonably fast in comparison to the time it takes to process the innermost loop. And if r.data[i][j] is in memory then it is an issue of the effectiveness of cache use by the program (so it isn't a false sharing issue) - NoName 2012-04-06 09:44



If you are able to run this code in a Linux environment, you can use perf to measure the false sharing effect.

2012-04-04 19:20
by vasste
Thanks, that would be useful. Didn't know about this tool - Vladimir Matveev 2012-04-05 07:19



For Linux and 32- and 64-bit Windows there are also AMD's CodeXL and CodeAnalyst. They will profile an application running on an AMD processor in much greater detail than one from Intel, since the applicable performance registers are different.

2014-01-16 13:54
by Olof Forshell