I have hundreds of files containing text I want to use with NLTK. Here is one such file:
বে,বচা ইয়াণ্ঠা,র্চা ঢার্বিত তোখাটহ নতুন, অ প্রবঃাশিত। তবে ' এ বং মুশায়েরা ' পত্রিব্যায় প্রকাশিত তিনটি লেখাই বইযে সংব্যজান ব্যরার জনা বিশেষভাবে পরিবর্ধিত। পাচ দাপনিকেব ড:বন নিয়ে এই বই তৈরি বাবার পরিব্যল্পনাও ম্ভ্রাসুনতন সামন্তেরই। তার আর তার সহকারীদেব নিষ্ঠা ছাডা অল্প সময়ে এই বই প্রব্যাশিত হতে পারত না।,তাঁদের সকলকে আমাধ নমস্কার জানাই। বতাব্যাতা শ্রাবন্তা জ্জাণ্ণিক জানুয়ারি ২ ণ্ট ণ্ট ৮ Total characters: 378
Note that lines do not correspond to sentences. Rather, the sentence terminator - the equivalent of the period in English - is the '।' symbol.
Could someone please help me create my corpus? If imported into a variable MyData, I would need to access MyData.words() and MyData.sents(). Also, the last line should not appear in the corpus (it merely contains a character count).
Please note that I will need to run operations on data from all the files at once.
Thanks in advance!
You don't need to input the files yourself or to provide words and sents methods. Read in your corpus with PlaintextCorpusReader, and it will provide those for you.
The corpus reader constructor accepts arguments for the path and filename pattern of the files, and for the input encoding (be sure to specify it).
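For instance, reading a directory of files might look like the following sketch. The directory name and filename pattern are assumptions to adapt to your own layout; a tiny throwaway corpus is created here only so the snippet runs on its own:

```python
import os
import tempfile

from nltk.corpus.reader.plaintext import PlaintextCorpusReader

# Build a small throwaway corpus so the example is self-contained;
# in practice, point corpus_root at the directory holding your files.
corpus_root = tempfile.mkdtemp()
with open(os.path.join(corpus_root, 'sample.txt'), 'w', encoding='utf-8') as f:
    f.write('ek dui tin char\n')

# Every file matching the pattern is read as part of one corpus,
# so operations run over the data from all the files at once.
MyData = PlaintextCorpusReader(corpus_root, r'.*\.txt', encoding='utf-8')

print(MyData.fileids())        # files that matched the pattern
print(list(MyData.words()))    # tokenized words from every matched file
```

Because the reader is lazy, the files are only opened as you iterate over words() or sents().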
The constructor also has optional arguments for the sentence and word tokenization functions, so you can pass it your own methods to break up the text into sentences and words. If word and sentence detection is really simple, i.e., if the '।' character has no other uses, you can configure a tokenization function from the nltk's RegexpTokenizer family, or you can write your own from scratch. (Before you write your own, study the docs and code, or write a stub to find out what kind of input it's called with.)
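A sketch of that RegexpTokenizer approach, splitting sentences on '।' (the regex patterns, the transliterated sample text, and the temp-directory setup are all illustrative assumptions):

```python
import os
import tempfile

from nltk.corpus.reader.plaintext import PlaintextCorpusReader
from nltk.tokenize import RegexpTokenizer

# gaps=True means the pattern marks the *separators*: sentences are
# whatever lies between occurrences of '।'.
sent_tok = RegexpTokenizer(r'।\s*', gaps=True)
# Words are runs of characters that are neither whitespace nor '।'.
word_tok = RegexpTokenizer(r'[^\s।]+')

root = tempfile.mkdtemp()
with open(os.path.join(root, 'demo.txt'), 'w', encoding='utf-8') as f:
    f.write('prothom bakyo। ditiyo bakyo।\n')

MyData = PlaintextCorpusReader(root, r'.*\.txt', encoding='utf-8',
                               word_tokenizer=word_tok,
                               sent_tokenizer=sent_tok)

print(list(MyData.sents()))   # [['prothom', 'bakyo'], ['ditiyo', 'bakyo']]
```

Note that with this word pattern the '।' character itself is dropped from words(); keep it in the pattern if you want the terminator to appear as a token, as the gutenberg corpus does with '.'.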
If recognizing sentence boundaries is non-trivial, you can later figure out how to train the nltk's PunktSentenceTokenizer, which uses an unsupervised statistical algorithm to learn which uses of the sentence terminator actually end a sentence.
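A rough sketch of that direction, assuming your NLTK version lets you override PunktLanguageVars.sent_end_chars (the training text below is only a placeholder; Punkt wants a large sample of your raw corpus):

```python
from nltk.tokenize.punkt import PunktLanguageVars, PunktSentenceTokenizer

# Tell Punkt that '।' can end a sentence, alongside '?' and '!'.
class DandaLanguageVars(PunktLanguageVars):
    sent_end_chars = ('।', '?', '!')

# Placeholder training text; use a large slice of your real corpus.
train_text = 'prothom bakyo। ditiyo bakyo। tritiyo bakyo।'
tokenizer = PunktSentenceTokenizer(train_text, lang_vars=DandaLanguageVars())

print(tokenizer.tokenize('ek bakyo। arek bakyo।'))
```

Punkt's statistics mainly disambiguate the period-like terminator (abbreviations, initials, etc.); if '।' is never ambiguous in your text, the simple RegexpTokenizer is probably enough.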
If the configuration of your corpus reader is fairly complex, you may find it useful to create a class that specializes PlaintextCorpusReader. But much of the time that's not necessary. Take a look at the NLTK code to see how the gutenberg corpus is implemented: it's just a PlaintextCorpusReader instance with appropriate arguments for the constructor.
1) Getting rid of the last line is rather straightforward.
f = open('corpus.txt', 'r', encoding='utf-8')
for l in f.readlines()[:-1]:
    ...  # process the line
The [:-1] in the for loop will skip the last line for you.
2) The built-in readlines() method of a file object breaks the content of the file into lines, using the newline character as a delimiter. So you need to write some code to cache the lines until a '।' is seen. When a '।' is encountered, treat the cached lines as one single sentence and put it in your MyData class.
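A sketch of that caching idea in plain Python (iter_sentences is a hypothetical helper name, and the transliterated sample lines are stand-ins for the real file content):

```python
def iter_sentences(lines):
    # Cache incoming lines; a sentence may span several of them.
    buffer = ''
    for line in lines:
        buffer += line
        # Emit every complete sentence currently in the cache.
        while '।' in buffer:
            sent, _, buffer = buffer.partition('।')
            sent = ' '.join(sent.split())   # collapse internal newlines
            if sent:
                yield sent

lines = ['prothom bakyo। ditiyo\n', 'bakyo। Total characters: 378\n']
print(list(iter_sentences(lines)))   # ['prothom bakyo', 'ditiyo bakyo']
```

Conveniently, the trailing character-count line never completes a '।'-terminated sentence, so it is left in the buffer and dropped on its own.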
for l, _ in itertools.pairwise(f):  # itertools.islice(f, -1) would raise ValueError; pairwise stops one line before the end without reading the whole file
aaronasterling 2012-04-07 22:34