Split sentence in words with most weight - 【StackMirror】|nlp|combinatorics

I'm working on a game where I need to find the biggest weight for a specific sentence.

Suppose I have the sentence "the quick brown fox" and assume only single words with their defined weight: "the" -> 10, "quick" -> 5, "brown" -> 3, "fox" -> 8

In this case the problem is trivial, as the solution consists in adding each words' weight.

Now assume we also add double words, so besides the above words, we also have "the quick" -> 5, "quick brown" -> 10, "brown fox" -> 1

I'd like to know which combination of single and double words provides the biggest weight, in this case it would be "the", "quick brown", "fox"

My question is, besides the obvious brute force approach, is there any other possible way to obtain a solution? Needless to say, I'm looking for some optimal way to achive this for larger sentences.

Thank you.

2012-04-03 22:21
by Dan

So the score for the sentence The quick. is 10+5+5 - mbatchkarov 2012-04-04 16:54

First off, the sentence should contain all the words, either single or double. In the case I showed, the total score would be 10+10+8. Please note that scores are applied either to single or double word, not both - Dan 2012-04-04 17:44

You can look at Integer Linear Program libraries like lp_solve. In this case, you will want to maximize the scores, and your objective function will contain the weights. Then you can subject it to constraints, like you cannot have "quick brown" and "brown" at the same time.

For word alignment, this was used in this paper, but your problem is way simpler than that, but you can browse through the paper to get an idea on how ILP was used. There's probably other algorithms other than ILP that can be used to solve this optimally, but ILP can solve it optimally and efficiently for small problems.

2012-04-04 03:17
by Kenston Choi

Thanks, this seems to be very useful to what I want to achieve. Will take a look to the paper and hopefully find out how to map my problem into this method - Dan 2012-04-04 17:42

This feels like a dynamic programming question.

I can imagine the k words of the sentence placed beside each other with a light bulb in between each word (i.e. k-1 light bulbs in total). If a light bulb is switched on, that means that the words adjoining it are part of a single phrase, and if its off, they are not. So any configuration of these light bulbs indicates a possible combination of weights.. of course many configurations are not even possible because we don't have any scores for they phrases they require. So k-1 light bulbs mean there are a max of 2^(k-1) possible answers for us to go through.

Rather than brute forcing it, we can recognize that there are parts of each computation that we can reuse for other computations, so for (The)(quick)(brown fox ... lazy dog) and (The quick)(brown fox ... lazy dog), we can compute the maximum score for (brown fox ... lazy dog) only once, memoize it and re-use it without doing any extra work the next time we see it.

Before we even start, we should first get rid of the light bulbs that can have only 1 possible value (suppose we did not have the phrase 'brown fox' or any bigger phrase with that phrase in it, then the light bulb between 'brown' and 'fox' would always have to be turned off).. Each removed bulb halves the solution space.

If w1, w2, w3 are the words, then the bulbs would be w1w2, w2w3, w3w4, etc. So

Optimal(w1w2 w2w3 w3w4 ...) = max(Optimal(w2w3 w3w4 ...) given w1w2 is on, Optimal(w2w3 w3w4 ...) given w1w2 is off)

(Caveat if we reach something where we have no possible solution, we just return MIN_INT and things should work out)

We can solve the problem like this, but we can probably save even more time if were clever about the order in which we approached the bulbs. Maybe attacking the center bulbs first might help.. I am not sure about this part.

2012-04-04 17:43
by Aditya Mukherji

"the" -> 10, "quick" -> 5, "brown" -> 3, "fox" -> 8 Say for the above individual words , I shall take an array [10,5,3,8] for words 0,1,2,3 Traverse through the list and get if the combination of two scores is less than the combined score for example 10+5 >5 the + quick >the quick 5+3 < 10 quick brown > quick + brown . Mark This and so on

While marking the combined solution mark them along continuous ranges . for example if words scores are words = [1,2,5,3,1,4,6,2,6,8] and [4,6,9,7,8,2,9,1,2] marked ranges (inclusive of both ends) are [0,1],[2,5],[6,7]

Pseudo code is given below

traverse from 0 to word length - 1

if number not in range :
add word[number] to overall sum.
else:
if length of range = 1 :
    add combined_word_score [ lower_end_number]
 else if length of range = 2 :
    add combined_word_score [ lower_end_number+next number]
else if length of range > 2 and is odd number :
    add max (alternate_score_starting at lower_end_number , 
              word[lower_end]+word[higher_end]+alternate_score_starting at 
                next_number)
else if length of range > 2 and is even number :
     add max (alternate_score_starting at lower_end_number +word[higher_end],
              word[lower_end]+alternate_score_starting at 
                next_number).

2019-02-19 10:22
by Manish Mansel