Jvm native code compilation crazyness - I seem to suffer odd performance penalties for some time even after the code is compiled. Why?

Go To StackoverFlow.com

13

Issue

When benchmarking a simple QuickSort implementation in Java, I faced unexpected humps in the n vs time graphics I was plotting:

enter image description here

I know HotSpot will attempt to compile code to native after it seems certain methods are being heavily used, so I ran the JVM with -XX:+PrintCompilation. After repeated trials, it seems to be compiling the algorithm's methods always in the same way:

@ iteration 6 -> sorting.QuickSort::swap (15 bytes)
@ iteration 7 -> sorting.QuickSort::partition (66 bytes)
@ iteration 7 -> sorting.QuickSort::quickSort (29 bytes)

I am repeating the above graphic with this added info, just to make things a bit clearer:

enter image description here

At this point, we must all be asking ourselves : why are we still getting those ugly humps AFTER the code is compiled? Maybe it has something to do with the algorithm itself? It sure could be, and luckily for us there's a quick way to sort that out, with -XX:CompileThreshold=0:

enter image description here

Bummer! It really must be something the JVM is doing in the background. But what? I theorized that although code is being compiled, it may take a while until the compiled code actually starts to be used. Maybe adding a couple of Thread.sleep()s here and there could help us a bit sorting this issue out?

enter image description here

Ouch! The green colored function is the QuickSort's code ran with a 1000ms internal between each run (details in the appendix), while the blue colored function is our old one (just for comparison).

At fist, giving time to the HotSpot only seems to make matters worse! Maybe it only seems worse by some other factor, such as caching issues?

Disclaimer : I am running 1000 trials for each point of the shown graphics, and using System.nanoTime() to measure the results.

EDIT

Some of you may at this stage wonder how the use of sleep() might distort the results. I ran the Red Plot (no native compilation) again, now with the sleeps in-between:

enter image description here

Scary!

Appendix

Here I present the QuickSort code I am using, just in case:

public class QuickSort {

    public <T extends Comparable<T>> void sort(int[] table) {
        quickSort(table, 0, table.length - 1);
    }

    private static <T extends Comparable<T>> void quickSort(int[] table,
            int first, int last) {
        if (first < last) { // There is data to be sorted.
            // Partition the table.
            int pivotIndex = partition(table, first, last);
            // Sort the left half.
            quickSort(table, first, pivotIndex - 1);
            // Sort the right half.
            quickSort(table, pivotIndex + 1, last);
        }
    }

    /**
     * @author http://en.wikipedia.org/wiki/Quick_Sort
     */
    private static <T extends Comparable<T>> int partition(int[] table,
            int first, int last) {
        int pivotIndex = (first + last) / 2;
        int pivotValue = table[pivotIndex];
        swap(table, pivotIndex, last);
        int storeIndex = first;
        for (int i = first; i < last; i++) {
            if (table[i]-(pivotValue) <= 0) {
                swap(table, i, storeIndex);
                storeIndex++;
            }
        }
        swap(table, storeIndex, last);
        return storeIndex;
    }

    private static <T> void swap(int[] a, int i, int j) {
        int h = a[i];
        a[i] = a[j];
        a[j] = h;
    }
}

as well the code I am using to run my benchmarks:

public static void main(String[] args) throws InterruptedException, IOException {
    QuickSort quickSort = new QuickSort();

    int TRIALS = 1000;

    File file = new File(Long.toString(System.currentTimeMillis()));
    System.out.println("Saving @ \"" + file.getAbsolutePath() + "\"");

    for (int x = 0; x < 30; ++x) {
    //          if (x > 4 && x < 17)
    //              Thread.sleep(1000);

        int[] values = new int[x];

        long start = System.nanoTime();

        for (int i = 0; i < TRIALS; ++i)
            quickSort.sort(values);

        double duration = (System.nanoTime() - start) / TRIALS;
        String line = x + "\t" + duration;
        System.out.println(line);
        FileUtils.writeStringToFile(file, line + "\r\n", true);
    }
}
2012-04-05 21:28
by devoured elysium
(the original question was erased - devoured elysium 2012-04-05 21:33
+1 nicely written question... However I've got one question: is the goal to make a fast quicksort to be able to sort gigantic data sets? If so a (correctly done) concurrent quicksort may help on today's cpus - TacticalCoder 2012-04-05 21:34
@TacticalCoder: A couple of colleagues faced this issue in their own projects some weeks ago and I'm making use of my week-long vacation to try to understand why this puzzling oddity is happening. I have no goal of implementing the shown algorithm, only to understand how the JVM works - devoured elysium 2012-04-05 21:36
gotcha... I pretty much guessed so but curiosity killed the cat ; - TacticalCoder 2012-04-05 21:44
Keep in mind that many JITCs will RECOMPILE already compiled code as it becomes "hotter" and "hotter", adding optimizations that were not in the original version - Hot Licks 2012-04-05 22:58
Yes, but that is not happening (otherwise I'd be notified of it) - devoured elysium 2012-04-05 23:36


5

Well, it seems that I sorted the issue out on my own.

I was right about the idea that compiled code could take a while to kick in. The problem was a flaw in the way I actually implemented my benchmarking code:

if (x > 4 && x < 17)
    Thread.sleep(1000);

in here I assumed that as the only "affected" area would be between 4 and 17, I could go on and just do a sleep over those values. This is simply not so. The following plot may be enlightening:

enter image description here

Here I am comparing the original no compilation function (red) to another no compilation function, but separed with sleeps in-between. As you may see, they work in different orders of magnitude, that meaning that mixing results of code with and without sleeps will yield unsound results, as I was guilty of doing.

The original question remains unaswered, yet. What causes the humps to ocurr even after compilation took place? Let's try to find that out, putting a 1s sleep in ALL points taken:

enter image description here

That yields the expected result. The odd humps were happening the native code still didn't kick in.

Comparing a sleep 50ms with a sleep 1000ms function yields yet again, the expected result:

enter image description here

(the gray one seems to still show a bit of delay)

2012-04-05 22:08
by devoured elysium
Ads