If you have 65,536 random English words, each 1-32 characters long, that you need to count occurrences of and sort by dictionary order or by appearance rank, how would you structure the data and what sorting technique would you use to process it the fastest?
65,000 words is, in all seriousness, a minuscule sorting problem. Unless you have to re-sort it many times per minute, I would suggest you just use the qsort()
provided by the standard library. That's what it's for, after all.
I'd suggest using a simple array of char pointers for the alpha order. For maintaining the frequency order, you can use a structure such as:
typedef struct {
char *word; // points to one of the strings.
int frequency; // counts the number of occurrences.
} tFreq;
in another array, which you can populate fully whenever you create or modify the alpha-sorted array of pointers (see below for my reasoning why this seemingly inefficient process is suitable).
As an example of the speed, consider the following piece of code:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define MAXWDS 66000

static char *words[MAXWDS];

static int compFn (const void *p1, const void *p2) {
    return strcmp (*((const char **)p1), *((const char **)p2));
}

int main (void) {
    char *pWord;
    int i, j, sz;
    time_t t0, t1;

    // Generate MAXWDS random "words" of 1 to 32 uppercase letters.
    srand (time (0));
    for (i = 0; i < MAXWDS; i++) {
        sz = rand() % 32 + 1;
        pWord = words[i] = malloc (sz + 1);
        for (j = 0; j < sz; j++)
            pWord[j] = 'A' + (rand() % 26);
        pWord[sz] = '\0';
    }

    // Time the alphabetic sort of the pointer array.
    t0 = time(0);
    qsort (words, MAXWDS, sizeof (char*), compFn);
    t1 = time(0);

    printf ("Time taken for %7d elements was %2d second(s)\n",
        MAXWDS, (int)(t1 - t0));
    return 0;
}
On a 3GHz dual core Intel chip, here are the outputs for a few selected values of MAXWDS:
MAXWDS Output
--------- ------
66,000 Time taken for 66000 elements was 0 second(s)
100,000 Time taken for 100000 elements was 0 second(s)
500,000 Time taken for 500000 elements was 0 second(s)
600,000 Time taken for 600000 elements was 1 second(s)
1,000,000 Time taken for 1000000 elements was 1 second(s)
2,000,000 Time taken for 2000000 elements was 2 second(s)
3,000,000 Time taken for 3000000 elements was 5 second(s)
4,000,000 Time taken for 4000000 elements was 7 second(s)
5,000,000 Time taken for 5000000 elements was 9 second(s)
6,000,000 Time taken for 6000000 elements was 10 second(s)
7,000,000 Time taken for 7000000 elements was 11 second(s)
9,999,999 Time taken for 9999999 elements was 21 second(s)
So, as you can see, qsort is fairly efficient for the data set sizes you're talking about.
In fact, implementing the whole process for maintaining two sorted lists, as shown in the code below, shows you exactly how insignificant 66,000 elements are. The basic premise is to:
- sort the word array alphabetically (t0 to t1).
- transfer the words into the frequency array, counting duplicates as you go (t1 to t2).
- sort the frequency array by descending frequency (t2 to t3).
The following code shows how that's done. The only mildly tricky bit is the transfer from the alpha array to the frequency array.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define MAXWDS 66000

typedef struct {
    char *word;
    int frequency;
} tFreq;

static char *words[MAXWDS];
static tFreq freq[MAXWDS];
static int numFreq;

static int compFn (const void *p1, const void *p2) {
    return strcmp (*((const char **)p1), *((const char **)p2));
}

static int compFn2 (const void *p1, const void *p2) {
    return ((tFreq*)p2)->frequency - ((tFreq*)p1)->frequency;
}

int main (void) {
    char *pWord;
    int i, j, sz;
    time_t t0, t1, t2, t3;

    // Generate random words.
    srand (time (0));
    for (i = 0; i < MAXWDS; i++) {
        sz = rand() % 32 + 1;
        pWord = words[i] = malloc (sz + 1);
        for (j = 0; j < sz; j++)
            pWord[j] = 'A' + (rand() % 26);
        pWord[sz] = '\0';
    }

    t0 = time(0);

    // Alpha sort.
    qsort (words, MAXWDS, sizeof (char*), compFn);

    t1 = time(0);

    // Pre-condition to simplify loop: make first word with zero frequency.
    freq[0].word = words[0];
    freq[0].frequency = 0;

    // Transfer to frequency array.
    for (i = numFreq = 0; i < MAXWDS; i++) {
        // If new word, add it and set frequency to 0.
        if (strcmp (freq[numFreq].word, words[i]) != 0) {
            numFreq++;
            freq[numFreq].word = words[i];
            freq[numFreq].frequency = 0;
        }
        // Increment frequency (for old and new words).
        freq[numFreq].frequency++;
    }
    numFreq++;

    t2 = time(0);

    // Sort frequency array by descending frequency.
    qsort (freq, numFreq, sizeof (tFreq), compFn2);

    t3 = time(0);

    // Output stats.
    printf ("Time taken for sorting %5d elements was %d seconds.\n",
        MAXWDS, (int)(t1 - t0));
    printf ("Time taken for transferring %5d elements was %d seconds.\n",
        numFreq, (int)(t2 - t1));
    printf ("Time taken for sorting %5d elements was %d seconds.\n",
        numFreq, (int)(t3 - t2));
    printf ("Time taken for everything %5s was %d seconds.\n\n",
        "", (int)(t3 - t0));

    // Print the most frequent entries so the sort can be verified.
    for (i = 0; i < 28; i++) {
        printf ("[%-15s] %5d\n", freq[i].word, freq[i].frequency);
    }
    return 0;
}
And the output for 66,000 random strings is (the first few strings are there so you can see that the sort worked):
Time taken for sorting 66000 elements was 0 seconds.
Time taken for transferring 62422 elements was 0 seconds.
Time taken for sorting 62422 elements was 0 seconds.
Time taken for everything was 0 seconds.
[Z ] 105
[H ] 97
[X ] 95
[P ] 90
[D ] 89
[K ] 87
[T ] 86
[J ] 85
[G ] 84
[F ] 83
[Q ] 83
[W ] 81
[V ] 81
[M ] 80
[I ] 79
[O ] 78
[A ] 78
[B ] 75
[U ] 74
[N ] 73
[C ] 73
[S ] 70
[Y ] 68
[L ] 65
[E ] 60
[R ] 59
[NQ ] 8
[XD ] 8
So, even if you do those operations every single time you insert or delete a value, they will have no discernible impact (unless, obviously, you're doing it more than once every couple of seconds, in which case you should think about batching the changes for efficiency).
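If you did go the batching route, one simple scheme is to mark the word list dirty on each change and re-sort lazily, only when a sorted view is actually requested. The sketch below illustrates the idea only; the names (addWord, getAlphaSorted, wordsDirty) are placeholders of mine, not part of the code above:

#include <stdlib.h>
#include <string.h>

#define MAXWDS 66000

// Hypothetical lazy re-sort scheme: insertions only set a flag, and the
// expensive qsort runs at most once per batch of changes.
static char *words[MAXWDS];
static size_t numWords = 0;
static int wordsDirty = 0;

static int compFn (const void *p1, const void *p2) {
    return strcmp (*((const char **)p1), *((const char **)p2));
}

static void addWord (const char *newWord) {
    if (numWords < MAXWDS) {
        char *copy = malloc (strlen (newWord) + 1);
        strcpy (copy, newWord);              // take a private copy of the word
        words[numWords++] = copy;
        wordsDirty = 1;                      // sorted view is now stale
    }
}

static char **getAlphaSorted (void) {
    if (wordsDirty) {                        // pay the sort cost only when stale
        qsort (words, numWords, sizeof (char *), compFn);
        wordsDirty = 0;
    }
    return words;
}

The same dirty flag would also trigger rebuilding the frequency array, exactly as in the transfer loop above.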
Check out http://www.sorting-algorithms.com/ for a nice visual representation of different sorting methods.
Oh dear, what a poorly worded homework question - the tutor should know better than this. The last part is "to process it the fastest". Unfortunately, it is very, very difficult to determine in advance how long an algorithm will take to execute. Big O notation doesn't help, as it is a measure of complexity, not speed (for more information about this, see Raymond Chen's recent blog entry). The only practical way is to implement the algorithm, run it and measure the time taken. The input data also affects the execution time - for example, a quicksort with a naive pivot choice and an unbalanced binary tree are both non-optimal for data that is already sorted.
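For instance, a tiny measurement harness along these lines (a sketch only; timeSort and the sortFn signature are placeholders of mine) is usually all you need:

#include <stdio.h>
#include <time.h>

// Hypothetical harness: sortFn is whatever candidate algorithm you want to
// test, applied to an array of n word pointers.
static void timeSort (void (*sortFn)(char **, size_t), char **words, size_t n) {
    clock_t start = clock();
    sortFn (words, n);                       // run the algorithm under test
    clock_t end = clock();
    printf ("Sorted %lu elements in %.3f seconds\n",
        (unsigned long)n, (double)(end - start) / CLOCKS_PER_SEC);
}

Since already-sorted input can skew the numbers, as noted above, it's worth timing a fresh copy of the real data on each run.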
You could write an entire paper on "fastest" and still not get a concrete answer.
I would use whatever sorting algorithm my runtime library happened to provide. Typically, sort() uses quicksort.
Don't worry about the choice of sort algorithm until you know that your standard one isn't working well enough because you've measured it.
Merge sort should work well for this and is pretty easy to get working in C.
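For instance, a minimal top-down merge sort over the array of word pointers might look like this (a sketch of mine, not code from the answers above; mergeSortWords and the scratch-buffer convention are assumptions):

#include <string.h>

// Sorts words[lo..hi-1] in place, using tmp as scratch space of at least
// (hi - lo) pointers.
static void mergeSortWords (char **words, char **tmp, size_t lo, size_t hi) {
    if (hi - lo < 2)
        return;                              // 0 or 1 element: already sorted

    size_t mid = lo + (hi - lo) / 2;
    mergeSortWords (words, tmp, lo, mid);    // sort left half
    mergeSortWords (words, tmp, mid, hi);    // sort right half

    // Merge the two sorted halves into tmp, then copy back.
    size_t i = lo, j = mid, k = 0;
    while (i < mid && j < hi)
        tmp[k++] = (strcmp (words[i], words[j]) <= 0) ? words[i++] : words[j++];
    while (i < mid)
        tmp[k++] = words[i++];
    while (j < hi)
        tmp[k++] = words[j++];
    memcpy (words + lo, tmp, k * sizeof (char *));
}

You would call it as mergeSortWords (words, tmp, 0, MAXWDS), with tmp malloc'd to hold MAXWDS pointers.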
You can use
struct elem { char *word; int frequency; };  // word points to the 'string'; frequency counts occurrences
struct elem dict[1<<16];                     // 1<<16 = 65,536 words
Use the standard qsort to sort by word or by frequency, or use a second array if you need both orders at the same time.
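A minimal sketch of the two comparators that approach implies (the names byWord and byFreq, and the choice of descending frequency, are my assumptions):

#include <stdlib.h>
#include <string.h>

// Compare by word, for dictionary order.
static int byWord (const void *a, const void *b) {
    return strcmp (((const struct elem *)a)->word,
                   ((const struct elem *)b)->word);
}

// Compare by frequency, highest first, for appearance rank.
static int byFreq (const void *a, const void *b) {
    return ((const struct elem *)b)->frequency
         - ((const struct elem *)a)->frequency;
}

// Sort one way or the other as needed (numElems = distinct words filled in):
//   qsort (dict, numElems, sizeof (struct elem), byWord);
//   qsort (dict, numElems, sizeof (struct elem), byFreq);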
Choosing a sorting algorithm depends on the amount of data you have (65k items aren't many) and the tradeoff between time and memory you choose. If the data should be retrieved really fast, you will have to use more memory. On the other hand, you can't find the records as fast if you decide to save on memory.
The choice of algorithm is pretty easy - use whatever your languages library offers unless you have proof that this doesn't work well enough.
You need the data sorted by two criteria, so you actually need two sorted arrays. Both of them should be arrays of pointers of some kind.
It sounds as though you must sort it in two different ways: by dictionary order and by frequency of appearance.
Use a trie. That way, both "sorts" will be simple traversals of the graph.
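For illustration, here is a minimal sketch of that idea, assuming uppercase A-Z words as in the random-word generator above; the node layout and the names trieInsert and trieWalk are mine, not from the answer:

#include <stdio.h>
#include <stdlib.h>

// Minimal trie: 26 children per node, one count per complete word.
typedef struct TrieNode {
    struct TrieNode *child[26];
    int count;                       // occurrences of the word ending here
} TrieNode;

static TrieNode *newNode (void) {
    return calloc (1, sizeof (TrieNode));
}

// Insert a word, bumping its count.
static void trieInsert (TrieNode *root, const char *word) {
    TrieNode *node = root;
    for (; *word != '\0'; word++) {
        int idx = *word - 'A';
        if (node->child[idx] == NULL)
            node->child[idx] = newNode();
        node = node->child[idx];
    }
    node->count++;
}

// Depth-first walk: visiting children in index order prints the distinct
// words in dictionary order, with their counts. buf must hold at least 33
// characters for 32-letter words.
static void trieWalk (const TrieNode *node, char *buf, int depth) {
    if (node->count > 0) {
        buf[depth] = '\0';
        printf ("%-32s %d\n", buf, node->count);
    }
    for (int i = 0; i < 26; i++) {
        if (node->child[i] != NULL) {
            buf[depth] = 'A' + i;
            trieWalk (node->child[i], buf, depth + 1);
        }
    }
}

Built like this, the trie gives you the counting and the dictionary-order traversal for free; for the appearance-rank order you would still collect the (word, count) pairs during the walk and sort them, for example with qsort and a frequency comparator as in the answers above.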