Algorithm for dividing a set of strings into a minimum set of mutually exclusive groups roughly the same size

3

I have a large set of strings. I want to divide the strings into subsets such that:

  1. Each item in a subset shares 1 or more contiguous characters.
  2. The shared contiguous characters that define a subset are unique for the set of subsets (i.e. the shared characters are sufficient for defining a subset of strings that stands in a mutually exclusive relationship with other subsets).
  3. The subsets are roughly the same size.
  4. The resulting set of subsets is the minimal number of subsets needed that fit the above criteria.

For example given the following set of names:

Alan,Larry,Alfred,Barbara,Alphonse,Carl

I can divide this set into two subsets of equal size. Subset 1 defined by the contiguous characters "Al" would be

Alan, Alfred, Alphonse

Subset 2 defined by contiguous characters "ar" would be

Larry, Barbara, Carl.

I am looking for an algorithm that would do this for any arbitrary set of strings. The number of resulting subsets does not have to be 2, but it should be as small as possible, and the subsets should be approximately equal in size.
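As a quick sanity check of the example (my snippet, not part of the question): treating the markers case-sensitively, the shared runs "Al" and "ar" really do split the six names into two disjoint, equal-size subsets.

```python
# Sanity check for the example above: "Al" and "ar" (case-sensitive)
# partition the six names into two disjoint subsets of size 3.
names = ["Alan", "Larry", "Alfred", "Barbara", "Alphonse", "Carl"]
groups = {marker: [n for n in names if marker in n] for marker in ("Al", "ar")}
print(groups)

# Each name falls into exactly one group, so the markers are mutually exclusive.
assert all(sum(n in g for g in groups.values()) == 1 for n in names)
```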

Elliott

2012-04-05 01:07
by Elliott
For a subset, do the contiguous characters always have to be at the start of the member strings? - Chetter Hummin 2012-04-05 01:11
No. The contiguous characters can be anywhere in the string. - Elliott 2012-04-05 01:21


2

Have a look at http://en.wikipedia.org/wiki/Suffix_array. It is possible that what you really want to do is to create a suffix array for each document, and then merge all the suffix arrays, with pointers back to the original versions, so that you can search the whole collection for a string by looking for it as a prefix of a suffix in the array.
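A minimal illustration of that idea (my sketch, not the answerer's code): a naive generalized suffix array over the whole collection, where every suffix keeps a pointer back to its source string, so one binary search finds every string containing a pattern. Real implementations build the array in linear time rather than by sorting all suffixes.

```python
import bisect

docs = ["Alan", "Larry", "Alfred", "Barbara", "Alphonse", "Carl"]

# Every suffix of every document, tagged with its document index.
suffixes = sorted(
    (doc[i:], d) for d, doc in enumerate(docs) for i in range(len(doc))
)
keys = [s for s, _ in suffixes]

def docs_containing(pattern):
    """Indices of the documents containing `pattern`.

    A document contains `pattern` iff some suffix of it starts with `pattern`,
    so the matches form one contiguous block in the sorted suffix list.
    """
    lo = bisect.bisect_left(keys, pattern)
    hi = bisect.bisect_right(keys, pattern + "\uffff")  # upper bound of the prefix block
    return {d for _, d in suffixes[lo:hi]}

print(sorted(docs[d] for d in docs_containing("ar")))  # → ['Barbara', 'Carl', 'Larry']
```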

2012-04-05 04:24
by mcdowella


2

This is tricky. I wonder if there's some higher purpose (like word indexing) or is this just an academic question?

It's not solvable in general, unless you accept the trivial solution of a single set defined by the empty sequence (which occurs in all words). For example, take the strings: a, ab, b.

  1. a must go into the set defined by a.
  2. b must go into the set defined by b.
  3. ab must go into both, because it contains both subsequences.

Will a similar example occur with the kind of words you're dealing with? I don't know. Perhaps you can deal with words mapping to more than one set, or you can have a tie-breaking system to determine where to put it.

Assuming this isn't a problem, the Burrows-Wheeler transform might help with finding good substrings.
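For reference, the transform itself can be sketched in a few lines (my illustration, not part of the answer; practical implementations derive it from a suffix array rather than sorting all rotations):

```python
# Minimal Burrows-Wheeler transform: sort all rotations of the input
# (with a unique sentinel appended) and take the last column.
def bwt(s, end="\0"):
    s += end  # sentinel smaller than any character, makes the transform invertible
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(r[-1] for r in rotations)

print(repr(bwt("banana")))  # → 'annb\x00aa'
```

The transform groups characters with similar right-context together, which is what makes it useful for spotting frequently shared substrings.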

Or how about something like:

  1. Generate all subsequences in the words.
  2. Build an interference graph of the subsequences, with an edge connecting two subsequences if they both occur in a single word.
  3. Colour the graph.
  4. Pick a representative subsequence for each colour.
  5. Make a set defined by each representative subsequence. If all words of that colour have that substring, put them all into that set.
  6. Otherwise, delete that substring from the graph, and repeat from step 3.

This algorithm is probably broken but it might give you some ideas about a solution (or at least some idea of the trickiness of your question ;-).
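For what it's worth, steps 1-4 can be rendered with the standard library (my rough sketch of the idea above, with a greedy colouring standing in for proper graph colouring; steps 5-6 are omitted):

```python
from itertools import combinations

words = ["Alan", "Larry", "Alfred", "Barbara", "Alphonse", "Carl"]

def substrings(w):
    """Step 1: every contiguous substring of w."""
    return {w[i:j] for i in range(len(w)) for j in range(i + 1, len(w) + 1)}

# Record which words each substring occurs in.
occurs = {}
for w in words:
    for s in substrings(w):
        occurs.setdefault(s, set()).add(w)

# Step 2: interference edge between two substrings that co-occur in some word.
subs = sorted(occurs)
adj = {s: set() for s in subs}
for a, b in combinations(subs, 2):
    if occurs[a] & occurs[b]:
        adj[a].add(b)
        adj[b].add(a)

# Step 3: greedy colouring -- smallest colour not used by any neighbour.
colour = {}
for s in subs:
    used = {colour[n] for n in adj[s] if n in colour}
    colour[s] = next(c for c in range(len(subs) + 1) if c not in used)

# Step 4: one representative substring per colour (longest, as a heuristic).
reps = {}
for s, c in colour.items():
    if c not in reps or len(s) > len(reps[c]):
        reps[c] = s
print(len(reps), "colours")
```

Since all substrings of a single word form a clique, the colour count is at least the number of substrings in the longest word, which already hints at why the scheme struggles.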

2012-04-05 01:33
by Edmund
The higher purpose is developing an efficient way to search for a large number of strings in thousands of documents. Grouping the strings into subsets will allow me to rapidly eliminate possibilities. If the contiguous characters that define a subset are not in the document, I need not search for any of the members of the subset. Referring to the example above, if the string "ar" is not in the document, I know I don't need to search for the names Carl, Barbara, or Larry. - Elliott 2012-04-05 01:43
Have you thought about partitioning them in levels? E.g. find a substring that occurs in approximately half of the words in your set, and split on that (the set including it, and the set without it). Then do the same for each subset. - Edmund 2012-04-05 01:57
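That level-wise idea can be sketched as a greedy recursion (my sketch of the suggestion, not tested at scale): at each level pick the substring whose frequency is closest to half the current group, split on membership, and recurse until the groups are small.

```python
# Greedy level-wise split: at each step choose the substring occurring in a
# number of words closest to half the group, then recurse on both halves.
def substrings(w):
    return {w[i:j] for i in range(len(w)) for j in range(i + 1, len(w) + 1)}

def split(words, min_size=2):
    if len(words) < min_size * 2:
        return [words]
    counts = {}
    for w in words:
        for s in substrings(w):
            counts[s] = counts.get(s, 0) + 1
    # Substring whose document frequency is closest to half the group.
    pivot = min(counts, key=lambda s: abs(counts[s] - len(words) / 2))
    with_p = [w for w in words if pivot in w]
    without = [w for w in words if pivot not in w]
    if not with_p or not without:
        return [words]  # no useful split exists
    return split(with_p) + split(without)

names = ["Alan", "Larry", "Alfred", "Barbara", "Alphonse", "Carl"]
print(split(names))
```

On the example names this recovers the "Al" / "ar" partition from the question, though as a greedy heuristic it gives no minimality guarantee in general.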