Creating a corpus from data in a custom format - 【StackMirror】|python|nlp|nltk

I have hundreds of files containing text I want to use with NLTK. Here is one such file:

বে,বচা ইয়াণ্ঠা,র্চা ঢার্বিত তোখাটহ নতুন, অ প্রবঃাশিত।
তবে ' এ বং মুশায়েরা ' পত্রিব্যায় প্রকাশিত তিনটি লেখাই বইযে
সংব্যজান ব্যরার জনা বিশেষভাবে পরিবর্ধিত। পাচ দাপনিকেব
ড:বন নিয়ে এই বই তৈরি বাবার পরিব্যল্পনাও ম্ভ্রাসুনতন
সামন্তেরই। তার আর তার সহকারীদেব নিষ্ঠা ছাডা অল্প সময়ে
এই বই প্রব্যাশিত হতে পারত না।,তাঁদের সকলকে আমাধ
নমস্কার জানাই।
বতাব্যাতা শ্রাবন্তা জ্জাণ্ণিক
জানুয়ারি ২ ণ্ট ণ্ট ৮ 
Total characters: 378

Note that line breaks do not mark sentence boundaries. Rather, the sentence terminator (the equivalent of the period in English) is the '।' symbol.

Could someone please help me create my corpus? If imported into a variable MyData, I would need to access MyData.words() and MyData.sents(). Also, the last line should not appear in the corpus (it merely contains a character count).

Please note that I will need to run operations on data from all the files at once.

Thanks in advance!

2012-04-04 07:11
by Velvet Ghost

Perhaps if you explain what a corpus is.. - C2H5OH 2012-04-04 07:19

A corpus is a large body of text. I plan to use an NLTK corpus reader (or write one myself if necessary). People using NLTK would know what a corpus is - Velvet Ghost 2012-04-04 07:31

@C2H5OH http://en.wikipedia.org/wiki/Text_corpu - javanna 2012-04-04 08:55

This is very much a "write my code for me" question. Study the NLTK corpus API, try implementing that for your text corpus and ask specific questions when you get stuck - Fred Foo 2012-04-05 08:15

There's nothing wrong with the question. The NLTK is huge and sprawling, and it's not obvious where to start or what the appropriate tools are. Reading the NLTK book would suggest an answer after a few chapters, but that's much too high a standard for most SO questions - alexis 2012-04-07 12:25

Thanks @alexis. Yes, NLTK is huge and it's a big learning curve. Especially for me, because I'm almost new to Python too - Velvet Ghost 2012-04-09 06:22

You don't need to input the files yourself or to provide words and sents methods. Read in your corpus with PlaintextCorpusReader, and it will provide those for you. The corpus reader constructor accepts arguments for the path and filename pattern of the files, and for the input encoding (be sure to specify it).

The constructor also has optional arguments for the sentence and word tokenization functions, so you can pass it your own method to break up the text into sentences. If word and sentence detection is really simple, i.e., if the | character has no other uses, you can configure a tokenization function from NLTK's RegexpTokenizer family, or you can write your own from scratch. (Before you write your own, study the docs and code or write a stub to find out what kind of input it's called with.)
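A minimal sketch of that setup, assuming the files are UTF-8 .txt files in one directory and that the trailing character-count line has already been stripped from each of them. The function name build_corpus, the filename pattern, and both regular expressions are illustrative choices rather than anything the question specifies; the terminator is the danda (।) shown in the question.

```python
from nltk.corpus.reader import PlaintextCorpusReader
from nltk.tokenize import RegexpTokenizer

# Assumption: a sentence ends at the danda; the pattern describes the gaps
# between sentences, so the danda and surrounding whitespace are dropped.
sent_tokenizer = RegexpTokenizer(r"\s*।\s*", gaps=True)

# Assumption: a word is any run of non-space characters other than the
# danda, and a danda on its own counts as a separate token.
word_tokenizer = RegexpTokenizer(r"[^\s।]+|।")


def build_corpus(root, fileids=r".*\.txt"):
    """Return a reader over every file under root whose name matches fileids."""
    return PlaintextCorpusReader(
        root,
        fileids,
        word_tokenizer=word_tokenizer,
        sent_tokenizer=sent_tokenizer,
        encoding="utf-8",
    )
```

With MyData = build_corpus('/path/to/files'), MyData.words() and MyData.sents() run over all matching files at once, and passing a file name to either method restricts it to that file.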

If recognizing sentence boundaries is non-trivial, you can later figure out how to train NLTK's PunktSentenceTokenizer, which uses an unsupervised statistical algorithm to learn which uses of the sentence terminator actually end a sentence.

If the configuration of your corpus reader is fairly complex, you may find it useful to create a class that specializes PlaintextCorpusReader. But much of the time that's not necessary. Take a look at the NLTK code to see how the gutenberg corpus is implemented: it's just a PlaintextCorpusReader instance with appropriate arguments for the constructor.

2012-04-07 12:33
by alexis

Thanks a lot! That should be very helpful. I'll try it out and get back to you. The sentence rules are simple - it's always the | character, which has no other uses. So I don't think I'll need to train a PunktSentenceTokenizer - Velvet Ghost 2012-04-09 06:27

1) Getting rid of the last line is straightforward.

with open('corpus.txt', 'r') as f:
    # readlines() returns a list, so [:-1] drops the final line
    for l in f.readlines()[:-1]:
        ...

The [:-1] slice in the for loop skips the last line for you.

2) The readlines() method of a file object breaks the content of the file into lines, using the newline character as the delimiter. So you need to write some code to cache the lines until a '|' is seen. When a '|' is encountered, treat the cached lines as one single sentence and put it in your MyData class.
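The buffering idea in point 2 can be sketched with the standard library alone. This assumes UTF-8 input, uses the danda (।) from the question as the terminator, and drops the final line on the assumption that it is always the character count. The name read_sentences is made up for illustration.

```python
def read_sentences(path, terminator="।"):
    """Return the sentences in one file, ignoring its final count line."""
    with open(path, encoding="utf-8") as f:
        lines = f.read().splitlines()[:-1]  # drop the trailing count line

    sentences = []
    buffer = []  # pieces of the sentence currently being assembled
    for line in lines:
        # A single line may hold zero, one, or several terminators.
        *finished, rest = line.split(terminator)
        for piece in finished:
            buffer.append(piece)
            # Join the cached pieces and collapse runs of whitespace.
            sentence = " ".join(" ".join(buffer).split())
            if sentence:
                sentences.append(sentence)
            buffer = []
        buffer.append(rest)
    # Text after the last terminator (e.g. a signature) is discarded here.
    return sentences
```

Calling read_sentences on every file and concatenating the results gives one list of sentences for the whole collection; whether trailing unterminated text should be kept instead of discarded depends on the data.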

2012-04-05 08:02
by Anthony Kong

Unless you know that you need random access on the lines, it would be cleaner to do for l in itertools.islice(f, -1): - aaronasterling 2012-04-07 22:34
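itertools.islice does not accept a negative stop value, so islice(f, -1) raises ValueError. A lazy alternative, sketched here on the assumption that only the final line must be dropped, holds each line back until the next one arrives. The helper name all_but_last is invented for this example.

```python
def all_but_last(iterable):
    """Yield every item except the final one, without loading them all."""
    it = iter(iterable)
    try:
        previous = next(it)
    except StopIteration:
        return  # empty input: nothing to yield
    for current in it:
        # previous is now known not to be the last item, so release it.
        yield previous
        previous = current
```

Used as for l in all_but_last(f):, this reads the file one line at a time and never yields the final line.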