String.Join performance issue in C#

Go To StackoverFlow.com

13

I've been researching a question that was presented to me: How to write a function that takes a string as input and returns a string with spaces between the characters. The function is to be written to optimize performance when it is called thousands of times per second.

  1. I know that .net has a function called String.Join, to which I may pass in the space character as a separator along with the original string.

  2. Barring the use of String.Join, I can use the StringBuilder class to append spaces after each character.

  3. Another way to accomplish this task is to declare a character array with 2*n-1 characters (You have to add n-1 characters for the spaces). The character array can be filled in a loop and then passed to the String constructor.

I've written some .net code that runs each of these algorithms one millions times each with the parameter "Hello, World" and measures how long it takes to execute. Method (3) is much, much faster than (1) or (2).

I know that (3) should be very fast because it avoids creating any additional string references to be garbage collected, but it seems to me that a built-in .net function such as String.Join should yield good performance. Why is using String.Join so much slower than doing the work by hand?

public static class TestClass
{
    // 491 milliseconds for 1 million iterations
    public static string Space1(string s) 
    {            
        return string.Join(" ", s.AsEnumerable());
    }

    //190 milliseconds for 1 million iterations
    public static string Space2(string s) 
    {
        if (s.Length < 2)
            return s;
        StringBuilder sb = new StringBuilder();
        sb.Append(s[0]);
        for (int i = 1; i < s.Length; i++)
        {
            sb.Append(' ');
            sb.Append(s[i]);
        }            
        return sb.ToString();
    }

    // 50 milliseconds for 1 million iterations
    public static string Space3(string s) 
    {
        if (s.Length < 2)
            return s;
        char[] array = new char[s.Length * 2 - 1];
        array[0] = s[0];
        for (int i = 1; i < s.Length; i++)
        {
            array[2*i-1] = ' ';
            array[2*i] = s[i];
        }
        return new string(array);
    }

Update: I have changed my project to "Release" mode and updated my elapsed times in the question accordingly.

2012-04-05 16:57
by Daniel Allen Langdon
If you're comparing performance, are you in release build with optimizations on - Servy 2012-04-05 17:01
When you create the StringBuilder in option 2 you can pass in an initial capacity of 2*n-1, which should prevent it from re-creating and copying it's internal buffer on larger strings - Servy 2012-04-05 17:02
@Servy, I simply created a new Console project in Visual Studio 2010 - Daniel Allen Langdon 2012-04-05 17:04
@RiceFlourCookies: Then you are probably in debug, which means it didn't happen. Make sure you are in release mode or there is no point in discussing performanc - Ed S. 2012-04-05 17:06
You should take a look at http://www.codinghorror.com/blog/2009/01/the-sad-tragedy-of-micro-optimization-theater.html for reference. It might be helpful for your end result - Erik Philips 2012-04-05 17:09
On your second method, you can use the overload constructor to indicate the final string length to avoid multiple resizing: StringBuilder sb = new StringBuilder(s.Length * 2) - Alejandro B. 2012-04-05 18:01
BTW - Running in "release" isn't enough - also make sure that you're running outside of the VS test host. That being said, it'll most likely just make your version even faster (relatively) ; - Reed Copsey 2012-06-14 17:31


7

Why is using String.Join so much slower than doing the work by hand?

The reason String.Join is slower in this case is that you can write an algorithm that has prior knowledge of the exact nature of your IEnumerable<T>.

String.Join<T>(string, IEnumerable<T>) (the overload you're using), on the other hand, is intended to work with any arbitrary enumerable type, which means it cannot pre-allocate to the proper size. In this case, it's trading flexibility for pure performance and speed.

Many of the framework methods do handle certain cases where things could be sped up by checking for conditions, but this typically is only done when that "special case" is going to be common.

In this case, you're effectively creating an edge case where a hand-written routine will be faster, but it is not a common use case of String.Join. In this case, since you know, exactly, in advance what is required, you have the ability to avoid all of the overhead required to have a flexible design by pre-allocating an array of exactly the right size, and building the results manually.

You'll find that, in general, it's often possible to write a method that will out perform some of the framework routines for specific input data. This is common, as the framework routines have to work with any dataset, which means that you can't optimize for a specific input scenario.

2012-06-14 17:30
by Reed Copsey


4

Your String.Join example works on an IEnumerable<char>. Enumerating an IEnumerable<T> with foreach is often slower than executing a for loop (it depends on the the collection type and other circumstances, as Dave Black pointed out in a comment). Even if Join uses a StringBuilder, the internal buffer of the StringBuilder will have to be increased several times, since the number of items to append is not known in advance.

2012-04-05 17:08
by Olivier Jacot-Descombes
But the number of appends is known in advance, and the StringBuilder constructor has the option of setting an initial capacity, so that can be overcome. For the first option there's no good way around the issue - Servy 2012-04-05 17:11
@Servy: No, it does not. It does not know the size of an IEnumerable<T>. Taking a look via reflector, it does indeed use a StringBuilder, but it does not/cannot set a capacity. Only LINQ gives you a way to get the size of an enumerable, but it is an O(n) operation - Ed S. 2012-04-05 17:16
@EdS. I'm referring to the comparison between cases 2 and 3, not between 1 and 3. case 2 is fixable, and once fixed should be at least close to case 3. Case 1 will always be slower - Servy 2012-04-05 17:17
I now feel like a tool. I thought that I couldn't pass a string directly to String.Join without casting it to array or enumerable, but I was wrong. Just using String.Join(" ", s) runs in about 40ms compared to 50ms for the code that I wrote using a character array - Daniel Allen Langdon 2012-04-05 17:21
@Servy: Yeah but it isn't doing what you want it to do. Look at what it outputs when you pass only a string. It's just going to return the original string. You're using the version which takes a params object[], but only supplying one value, so it does nothing, which of course... makes it fast :) If you need it to be fast (seems like you do) just write your own function that can take advantage of knowledge you have that the library does not - Ed S. 2012-04-05 17:22
@Servy: That said, I would have thought that the option that takes a string[] would have been fast, but it is not - Ed S. 2012-04-05 17:26
OK, survey, now I really feel like a tool. You should deduct some points from my rep for that :- - Daniel Allen Langdon 2012-04-05 17:31
There is no string[] involved; however the string could be converted to a char[] with s.ToCharArray() and passed to the Join overload with the object[] parameter - Olivier Jacot-Descombes 2012-04-05 17:38
@OlivierJacot-Descombes: There is an overload that takes a string[]. It doesn't really matter though; they're all significantly slower than doing it yourself - Ed S. 2012-04-05 17:46
The question is, "Why?" Why would the .net framework include a function that is magnitudes slower than doing it yourself - Daniel Allen Langdon 2012-04-05 18:25
@RiceFlourCookies It's much slower when you want to join individual characters. It's perfectly acceptable, and about as quick as doing it yourself, when joining strings. It's also something that a lot of people do often in contexts in which performance is not an issue. it's a convenience method, not a performance method, by design - Servy 2012-04-05 19:04
@EdS: The OP passes a string as paramter to Join, not a string[] - Olivier Jacot-Descombes 2012-04-05 19:11
I know, I thought you were responding to my comment where I mention a string[] - Ed S. 2012-04-05 19:20
@OlivierJacot-Descombes you make a bold sweeping statement that is quite inaccurate "...Enumerating an IEnumerable with foreach is much slowler than executing a for statement." It completely depends on the type you're iterating over - Dave Black 2015-08-20 15:22
@DaveBlack: I updated my answer. Thank you for the constructive criticism. The C# compiler apparently makes optimizations when looping arrays with foreach; however, for is about twice as fast for lists as foreach - Olivier Jacot-Descombes 2015-08-20 19:49


3

Since you aren't using the Release build (which should have optimizations checked by default) and/or you're debugging through visual studio then the JITer will be prevented from making a lot of it's optimizations. Because of this you're just not getting a good picture of how long each operation really takes. Once you add in the optimizations you can get the real picture of what's going on.

It's also important that you not be debugging in visual studio. Go to the bin/release folder and double click the executable entirely outside of visual studio.

2012-04-05 17:07
by Servy
I tested this myself, release build, not run under VS. Option 3 is fastest by far, option 1 is slowest. (67ms and 536ms respectively for 1,000,000 iterations - Ed S. 2012-04-05 17:11
@EdS. It's certainly possible for the relative times to be the same, but there's sufficient possibility for differences that it's an important first step before any other analysis takes place - Servy 2012-04-05 17:13
I'm not sure what you mean by that. I tested it "properly" and came back with the same result as the OP. I am now looking at the non-generic version(s) of String.Join - Ed S. 2012-04-05 17:14


2

In your first method, you are using the overload of String.Join that operates on an Enumerable, which requires that the method walk the characters of the string using an enumerator. Internally, this uses a StringBuilder as the exact number of characters is unknown.

Have you considered using the String.Join overload that takes a string (or string array) instead? That implementation allows a fixed length buffer to be used (similar to your third method) along with some internal unsafe string operations for speed. The call would change to - String.Join(" ", s); Without actually doing the legwork to measure, I would expect this to be on par or faster than your third approach.

2012-04-05 17:28
by Jesse Squire
I tried the approach you suggested and it seems to yield no increase in performance at all - Daniel Allen Langdon 2012-04-05 17:32
Nope, they're all relatively slow. I don't understand why the versions that take an array, where the size is known, aren't faster than they are - Ed S. 2012-04-05 17:47
Interesting and surprising. Looking at the code in Reflector shows that overload uses a fixed length "UnSafeCharBuffer" along with raw pointer access to the underlying character array to copy the source. I'd have expected that to be orders of magnitude faster than the enumerator-abstracted version. Looks like something that may be fun to play with when I get some time to investigate in more depth - Jesse Squire 2012-04-05 18:50
hrmm. It appears that the overload that I'm referencing takes an array of string, not of characters. I'd speculate that using ToCharArray() is causing it to fall back to the enumerable overload. Answer fixed - Jesse Squire 2012-04-05 18:58


1

The bad performance is not coming from String.Join, but from the way you handle each character. In this case, since characters have to be handled individually, your first method will create much more intermediate strings and the second method suffers from two .Append method calls for each character. Your third method does not involve a lots of intermediate strings or methods calls and that's the reason why your third method is the fastest.

2012-04-05 17:12
by Codism


0

When you have passed an IEnumerable to String.Join, it has no idea on how much memory needs to be allocated. I allocates a chunk of memory, resizes it if it is insufficient and repeats the process until it gets enough memory to accommodate all the strings.

The array version is faster because we know the amount of memory allocated well ahead.

Also please not that when you are running the 1st version, GC might have occurred.

2012-04-05 17:14
by Sandeep
Ads