I have a large set of strings. I want to divide the strings into subsets such that every string in a subset shares a contiguous sequence of characters, and that shared sequence defines the subset.
For example given the following set of names:
Alan,Larry,Alfred,Barbara,Alphonse,Carl
I can divide this set into two subsets of equal size. Subset 1, defined by the contiguous characters "Al", would be
Alan, Alfred, Alphonse
Subset 2, defined by the contiguous characters "ar", would be
Larry, Barbara, Carl.
I am looking for an algorithm that would do this for any arbitrary set of strings. The number of resulting subsets does not have to be two, but it should be as small as possible, and the subsets should be approximately equal in size.
Elliott
Have a look at http://en.wikipedia.org/wiki/Suffix_array. It is possible that what you really want to do is to create a suffix array for each document, and then merge all the suffix arrays, with pointers back to the original versions, so that you can search the collection as one for a string by looking for it as a prefix of some suffix in the array.
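For a single string, the idea can be sketched in a few lines (a toy illustration only; a real implementation would use an O(n log n) construction rather than sorting full suffixes):

```python
def suffix_array(s):
    # Start positions of all suffixes of s, sorted lexicographically.
    return sorted(range(len(s)), key=lambda i: s[i:])

def contains(s, sa, query):
    # Every substring of s is a prefix of some suffix, so a binary
    # search over the sorted suffixes finds any substring occurrence.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if s[sa[mid]:sa[mid] + len(query)] < query:
            lo = mid + 1
        else:
            hi = mid
    return lo < len(sa) and s[sa[lo]:].startswith(query)

text = "barbara"
sa = suffix_array(text)
print(contains(text, sa, "arb"))  # True
print(contains(text, sa, "alf"))  # False
```

Merging the per-document arrays (keeping a document id next to each suffix start) would then let you search the whole collection in one pass.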
This is tricky. I wonder if there's some higher purpose (like word indexing) or is this just an academic question?
It's not solvable in general, unless you accept the trivial solution of a single set defined by the empty sequence (which occurs in all words). For example, take the strings "a", "ab", and "b":
"a" must go into the set defined by "a".
"b" must go into the set defined by "b".
"ab" must go into both, because it contains both subsequences.
Will a similar example occur with the kind of words you're dealing with? I don't know. Perhaps you can deal with words mapping to more than one set, or you can have a tie-breaking system to determine where to put them.
Assuming this isn't a problem, the Burrows-Wheeler transform might help with finding good substrings.
Or how about something like:
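Something along these lines, perhaps (a naive greedy sketch; the two-character window, the case-folding, and every name in it are my own assumptions, not anyone's established method):

```python
from collections import defaultdict

def partition(strings, min_len=2):
    # Greedy sketch: repeatedly find the substring of at least min_len
    # characters shared by the most remaining strings, peel those strings
    # off as one subset keyed by that substring, and repeat.
    remaining = set(strings)
    subsets = {}
    while remaining:
        counts = defaultdict(set)
        for s in remaining:
            low = s.lower()  # assume matching ignores case, as in "Al"/"al"
            for i in range(len(low) - min_len + 1):
                counts[low[i:i + min_len]].add(s)
        if not counts:
            break
        best = max(counts, key=lambda k: len(counts[k]))
        if len(counts[best]) < 2:
            break  # no substring is shared by two or more strings
        subsets[best] = counts[best]
        remaining -= counts[best]
    if remaining:
        subsets[""] = remaining  # leftovers, keyed by the empty substring
    return subsets

names = ["Alan", "Larry", "Alfred", "Barbara", "Alphonse", "Carl"]
# Two subsets for the example: "al" -> Alan/Alfred/Alphonse,
# "ar" -> Larry/Barbara/Carl.
print(partition(names))
```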
This algorithm is probably broken but it might give you some ideas about a solution (or at least some idea of the trickiness of your question ;-).