Algorithm to extract simple sentences from complex (mixed) sentences?

4

Is there an algorithm that can be used to extract simple sentences from paragraphs?

My ultimate goal is to later run another algorithm on the resulting simple sentences to determine the author's sentiment.

I've researched this in sources such as Chae-Deug Park, but none of them discusses preparing simple sentences as training data.

Thanks in advance

2012-04-04 22:58
by John Rambo
What exactly do you mean by "simple sentence"? Just a sentence as compared to a paragraph -- in which case your question is about sentence boundary detection. Or a sentence that contains only one main predicate (as opposed to a complex sentence with subordinate clauses etc. in it)? Or something entirely different - jogojapan 2012-04-11 03:19
Hi jogojapan, yes, that is correct. I meant just a sentence as compared to a paragraph. - John Rambo 2012-04-14 22:39
You don't properly define what you mean by a simple sentence, so it's hard for anybody to answer your question. Maybe you want to use something like the Stanford Parser to get the parse tree for each sentence, and discard all sentences that are not of the type 'NP VP', i.e. sentences that consist of a noun phrase followed by a verb phrase (e.g. '[John] [sat on a bench]', '[Mary and Jill] [ate their sandwiches]', etc.) - Aditya Mukherji 2012-04-17 07:21
A simple sentence is a well-defined notion in English grammar. I don't see why it needs to be defined in a SO question, especially one tagged nlp. For readers not involved in NLP, I suppose @JohnRambo could provide a link to the definition (e.g. http://grammar.about.com/od/rs/g/simpsenterm.htm). - Chthonic Project 2014-10-16 01:08


1

I have just used OpenNLP for the same task.

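import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;
import java.util.List;
import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;
import opennlp.tools.util.InvalidFormatException;

public static List<String> breakIntoSentencesOpenNlp(String paragraph) throws FileNotFoundException, IOException,
        InvalidFormatException {

    // try-with-resources closes the model stream even if loading or detection throws
    try (InputStream is = new FileInputStream("resources/models/en-sent.bin")) {
        SentenceModel model = new SentenceModel(is);
        SentenceDetectorME sdetector = new SentenceDetectorME(model);

        return Arrays.asList(sdetector.sentDetect(paragraph));
    }
}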

Example

    //Failed at Hi.
    paragraph = "Hi. How are you? This is Mike.";
    SentenceDetector.breakIntoSentencesOpenNlp(paragraph).forEach(sentence -> System.out.println(sentence));

    //Failed at Door.Noone
    paragraph = "Close the Door.Noone is out there";
    SentenceDetector.breakIntoSentencesOpenNlp(paragraph).forEach(sentence -> System.out.println(sentence));//not able to break on noone

    paragraph = "Really!! I cant believe. Mr. Wilson can come any moment to receive mrs. watson.";
    SentenceDetector.breakIntoSentencesOpenNlp(paragraph).forEach(sentence -> System.out.println(sentence));

    //Failed at dr.
    paragraph = "Radhika, Mohan, and Shaik went to meet dr. Kashyap to raise fund for poor patients.";
    SentenceDetector.breakIntoSentencesOpenNlp(paragraph).forEach(sentence -> System.out.println(sentence));//breaking on dr.

    paragraph = "This is how I tried to split a paragraph into a sentence. But, there is a problem. My paragraph includes dates like Jan.13, 2014 , words like U.S. and numbers like 2.2. They all got splitted by the above code.";
    SentenceDetector.breakIntoSentencesOpenNlp(paragraph).forEach(sentence -> System.out.println(sentence));

    paragraph = "www.thinkzarahatke.com is the second site I developed. You can send mail to admin@thinkzarahatke.com";
    SentenceDetector.breakIntoSentencesOpenNlp(paragraph).forEach(sentence -> System.out.println(sentence));

It failed only where the input itself contains a mistake. E.g. the abbreviation "Dr." should have a capital D, and at least one space is expected between two sentences.
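The missing-space case ("Door.Noone") can often be repaired before the text reaches the sentence detector. The following is only a minimal sketch using the standard java.util.regex package; the class name SentenceSpacingFixer and the method insertMissingSpaces are hypothetical and not part of OpenNLP. It assumes that a lowercase letter, then '.', '!' or '?', then an uppercase letter with nothing in between marks a missing space, which leaves "U.S.", "2.2", "Jan.13" and lowercase URLs untouched but would wrongly split tokens such as "example.Com".

```java
import java.util.regex.Pattern;

public class SentenceSpacingFixer {
    // Lowercase letter, sentence-ending punctuation, then an uppercase letter
    // with no space in between, e.g. "Door.Noone". This rule is an assumption
    // made for illustration, not something OpenNLP provides.
    private static final Pattern MISSING_SPACE =
            Pattern.compile("(?<=[a-z])([.!?])(?=[A-Z])");

    /** Inserts the missing space so a sentence detector can see the boundary. */
    public static String insertMissingSpaces(String paragraph) {
        return MISSING_SPACE.matcher(paragraph).replaceAll("$1 ");
    }
}
```

Calling insertMissingSpaces on a paragraph and passing the result to breakIntoSentencesOpenNlp keeps the detector itself unchanged.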

You can also do this with a regular expression, as follows:

public static List<String> breakIntoSentencesCustomRESplitter(String paragraph){
    List<String> sentences = new ArrayList<>();
    // A sentence runs up to '.', '!' or '?' (plus an optional closing quote) followed by whitespace or end of input.
    // Pattern.COMMENTS was dropped: the pattern contains no literal whitespace or '#', so it had no effect.
    Pattern re = Pattern.compile("[^.!?\\s][^.!?]*(?:[.!?](?!['\"]?\\s|$)[^.!?]*)*[.!?]?['\"]?(?=\\s|$)", Pattern.MULTILINE);
    Matcher reMatcher = re.matcher(paragraph);
    while (reMatcher.find()) {
        sentences.add(reMatcher.group());
    }
    return sentences;
}

Example

    paragraph = "Hi. How are you? This is Mike.";
    SentenceDetector.breakIntoSentencesCustomRESplitter(paragraph).forEach(sentence -> System.out.println(sentence));

    //Failed at Door.Noone
    paragraph = "Close the Door.Noone is out there";
    SentenceDetector.breakIntoSentencesCustomRESplitter(paragraph).forEach(sentence -> System.out.println(sentence));

    //Failed at Mr., mrs.
    paragraph = "Really!! I cant believe. Mr. Wilson can come any moment to receive mrs. watson.";
    SentenceDetector.breakIntoSentencesCustomRESplitter(paragraph).forEach(sentence -> System.out.println(sentence));

    //Failed at dr.
    paragraph = "Radhika, Mohan, and Shaik went to meet dr. Kashyap to raise fund for poor patients.";
    SentenceDetector.breakIntoSentencesCustomRESplitter(paragraph).forEach(sentence -> System.out.println(sentence));

    //Failed at U.S.
    paragraph = "This is how I tried to split a paragraph into a sentence. But, there is a problem. My paragraph includes dates like Jan.13, 2014 , words like U.S. and numbers like 2.2. They all got splitted by the above code.";
    SentenceDetector.breakIntoSentencesCustomRESplitter(paragraph).forEach(sentence -> System.out.println(sentence));

    paragraph = "www.thinkzarahatke.com is the second site I developed. You can send mail to admin@thinkzarahatke.com";
    SentenceDetector.breakIntoSentencesCustomRESplitter(paragraph).forEach(sentence -> System.out.println(sentence));

But its error rate is comparatively high. Another way is to use BreakIterator:

public static List<String> breakIntoSentencesBreakIterator(String paragraph){
    List<String> sentences = new ArrayList<>();
    BreakIterator sentenceIterator =
            BreakIterator.getSentenceInstance(Locale.ENGLISH);
    sentenceIterator.setText(paragraph);

    // Walk the boundaries front to back so the sentences keep their original order
    int start = sentenceIterator.first();
    for (int end = sentenceIterator.next();
         end != BreakIterator.DONE;
         start = end, end = sentenceIterator.next()) {
        sentences.add(paragraph.substring(start, end));
    }

    return sentences;
}

Example:

    paragraph = "Hi. How are you? This is Mike.";
    SentenceDetector.breakIntoSentencesBreakIterator(paragraph).forEach(sentence -> System.out.println(sentence));

    //Failed at Door.Noone
    paragraph = "Close the Door.Noone is out there";
    SentenceDetector.breakIntoSentencesBreakIterator(paragraph).forEach(sentence -> System.out.println(sentence));

    //Failed at Mr.
    paragraph = "Really!! I cant believe. Mr. Wilson can come any moment to receive mrs. watson.";
    SentenceDetector.breakIntoSentencesBreakIterator(paragraph).forEach(sentence -> System.out.println(sentence));

    //Failed at dr.
    paragraph = "Radhika, Mohan, and Shaik went to meet dr. Kashyap to raise fund for poor patients.";
    SentenceDetector.breakIntoSentencesBreakIterator(paragraph).forEach(sentence -> System.out.println(sentence));


    paragraph = "This is how I tried to split a paragraph into a sentence. But, there is a problem. My paragraph includes dates like Jan.13, 2014 , words like U.S. and numbers like 2.2. They all got splitted by the above code.";
    SentenceDetector.breakIntoSentencesBreakIterator(paragraph).forEach(sentence -> System.out.println(sentence));

    paragraph = "www.thinkzarahatke.com is the second site I developed. You can send mail to admin@thinkzarahatke.com";
    SentenceDetector.breakIntoSentencesBreakIterator(paragraph).forEach(sentence -> System.out.println(sentence));
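
Several of the failures marked in the examples are titles such as "Mr.", "mrs." and "dr." being treated as sentence ends. One common workaround is a post-processing pass that glues a piece back onto the next one when it ends in a known abbreviation. This is a minimal sketch under that assumption; the class AbbreviationMerger, its method names and the short abbreviation list are all hypothetical, and a sentence that genuinely ends with one of these abbreviations would be merged by mistake.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class AbbreviationMerger {
    // Hypothetical, deliberately small list; extend it for real data.
    private static final Set<String> ABBREVIATIONS =
            new HashSet<>(Arrays.asList("mr.", "mrs.", "ms.", "dr.", "prof."));

    /** Re-joins a piece with the following one when it ends in a known abbreviation. */
    public static List<String> mergeAbbreviations(List<String> pieces) {
        List<String> merged = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String piece : pieces) {
            String trimmed = piece.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            if (current.length() > 0) {
                current.append(' ');
            }
            current.append(trimmed);
            // Only close the sentence if its last token is not an abbreviation.
            if (!endsWithAbbreviation(trimmed)) {
                merged.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            merged.add(current.toString());   // input ended on an abbreviation
        }
        return merged;
    }

    private static boolean endsWithAbbreviation(String piece) {
        String lastToken = piece.substring(piece.lastIndexOf(' ') + 1);
        return ABBREVIATIONS.contains(lastToken.toLowerCase(Locale.ROOT));
    }
}
```

The pass works on the output of any of the three splitters, for example mergeAbbreviations(breakIntoSentencesBreakIterator(paragraph)). Comparing in lowercase is what lets it catch "dr." and "mrs." as well as "Dr." and "Mrs.".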

Benchmarking:

  • custom RE : 7 ms
  • BreakIterator : 143 ms
  • openNlp : 255 ms
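
How these timings were measured is not stated. To reproduce a comparison of this kind, a small harness along the following lines could be used. It is only a sketch: SplitterTimer and averageMillis are hypothetical names, and the warm-up loop is an assumption about what makes a fair JVM measurement. Note that breakIntoSentencesOpenNlp as written reloads the model file on every call, so its figure probably includes model loading as well as detection.

```java
import java.util.List;
import java.util.function.Function;

public class SplitterTimer {
    /** Returns the average wall-clock time per call, in milliseconds. */
    public static double averageMillis(Function<String, List<String>> splitter,
                                       String paragraph, int warmupRuns, int timedRuns) {
        if (timedRuns <= 0) {
            throw new IllegalArgumentException("timedRuns must be positive");
        }
        for (int i = 0; i < warmupRuns; i++) {
            splitter.apply(paragraph);   // untimed runs so the JIT can warm up
        }
        long start = System.nanoTime();
        for (int i = 0; i < timedRuns; i++) {
            splitter.apply(paragraph);
        }
        long elapsedNanos = System.nanoTime() - start;
        return elapsedNanos / 1_000_000.0 / timedRuns;
    }
}
```

A splitter that throws no checked exceptions can be passed directly, for example averageMillis(SentenceDetector::breakIntoSentencesBreakIterator, paragraph, 100, 1000); the OpenNLP method declares IOException and would need a small wrapper lambda.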
2015-07-26 11:13
by Amit Kumar Gupta


2

Take a look at Apache OpenNLP; it has a Sentence Detector module. The documentation has examples of how to use it from the command line and from the API.

2012-04-17 15:16
by wcolen