Python: How to determine the language?

I want to get this:

Input text: "ру́сский язы́к"
Output text: "Russian" 

Input text: "中文"
Output text: "Chinese" 

Input text: "にほんご"
Output text: "Japanese" 

Input text: "العَرَبِيَّة"
Output text: "Arabic"

How can I do it in Python? Thanks.

2016-08-25 10:26
by Rita

What did you try? - Raskayu 2016-08-25 10:27

This may help: http://stackoverflow.com/questions/4545977/python-can-i-detect-unicode-string-language-cod - Sardorbek Imomaliev 2016-08-25 10:34

Have you had a look at langdetect?

from langdetect import detect

lang = detect("Ein, zwei, drei, vier")

print(lang)
# output: de
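
The call above returns a language code such as de, while the question asks for a name such as "Russian". A minimal sketch of one way to bridge that gap, assuming a hand-written lookup table: LANGUAGE_NAMES and language_name are hypothetical names, not part of langdetect, and only a few codes are listed.

```python
# Hypothetical lookup table (not part of langdetect); extend it as needed.
LANGUAGE_NAMES = {
    "ar": "Arabic",
    "de": "German",
    "en": "English",
    "ja": "Japanese",
    "ru": "Russian",
    "zh-cn": "Chinese",
    "zh-tw": "Chinese",
}


def language_name(code):
    """Return an English name for a language code, or the code itself if unknown."""
    return LANGUAGE_NAMES.get(code.lower(), code)


print(language_name("de"))  # German
```

Unknown codes are passed through unchanged, so the helper never raises on a language missing from the table.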

2016-08-25 10:38
by dheiberg

Not very accurate: it detects the language of the text 'anatomical structure' as ro (Romanian). Output of multiple candidate languages is required for such cases. polyglot performs much better - Yuriy Petrovskiy 2018-06-20 10:41

Interesting, for the same example langdetect can return different languages - Denis Kuzin 2018-06-27 10:12

TextBlob. Requires the NLTK package and uses the Google Translate API.

from textblob import TextBlob
b = TextBlob("bonjour")
b.detect_language()

pip install textblob

Polyglot. Requires numpy and some arcane libraries, ~~unlikely to get it to work on Windows~~. (For Windows, get the appropriate versions of PyICU, Morfessor and PyCLD2 from here, then just pip install downloaded_wheel.whl.) It can detect the languages in mixed-language text.

from polyglot.detect import Detector

mixed_text = u"""
China (simplified Chinese: 中国; traditional Chinese: 中國),
officially the People's Republic of China (PRC), is a sovereign state
located in East Asia.
"""
for language in Detector(mixed_text).languages:
    print(language)

# name: English     code: en       confidence:  87.0 read bytes:  1154
# name: Chinese     code: zh_Hant  confidence:   5.0 read bytes:  1755
# name: un          code: un       confidence:   0.0 read bytes:     0

pip install polyglot

To install the dependencies, run: sudo apt-get install python-numpy libicu-dev

chardet can also detect the language if the input contains bytes in the range (127-255]:

>>> import chardet
>>> chardet.detect("Я люблю вкусные пампушки".encode('cp1251'))
{'encoding': 'windows-1251', 'confidence': 0.9637267119204621, 'language': 'Russian'}

pip install chardet
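
Since the language guess depends on bytes above 127, it can help to check for them before calling chardet. A minimal standard-library sketch of that check: has_high_bytes is a hypothetical helper, not a chardet function, and the English sample string is made up for illustration.

```python
def has_high_bytes(data):
    """Return True if any byte in `data` is above 127, i.e. outside ASCII."""
    return any(byte > 127 for byte in data)


cyrillic = "Я люблю вкусные пампушки".encode("cp1251")
ascii_only = "I love tasty dumplings".encode("cp1251")

print(has_high_bytes(cyrillic))    # True
print(has_high_bytes(ascii_only))  # False
```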

langdetect. Requires large portions of text. It uses a non-deterministic approach under the hood, which means you can get different results for the same text sample. The docs say to use the following code to make it deterministic:
```
from langdetect import detect, DetectorFactory
DetectorFactory.seed = 0
detect('今一はお前さん')
```

pip install langdetect

guess_language. Can detect very short samples by using this spell checker with dictionaries.

pip install guess_language-spirit

langid provides both a module

import langid
langid.classify("This is a test")
# ('en', -54.41310358047485)

and a command-line tool:

    $ langid < README.md

pip install langid

2017-11-04 02:32
by Rabash

detectlang is way faster than TextBlob - Anwarvic 2018-04-24 14:18

@Anwarvic TextBlob uses the Google API (https://github.com/sloria/TextBlob/blob/dev/textblob/translate.py#L33)! That's why it's slow - Thomas Decaux 2019-01-14 17:59

polyglot ended up being the most performant for my use case. langid came in second - jamescampbell 2019-02-23 13:19

You can try determining the Unicode script of the characters in the input string to narrow down the language (Cyrillic for Russian, for example), and then search the text for language-specific symbols.
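
A rough sketch of the first step, using only the standard library. It relies on the fact that Unicode character names start with the script name. SCRIPT_TO_LANGUAGE and guess_language_by_script are hypothetical names, and the script-to-language mapping is an assumption: Cyrillic is also used for Ukrainian, Bulgarian and others, so a real solution would still need the second step.

```python
import unicodedata
from collections import Counter

# Assumed mapping: a script only narrows the choice, it does not settle it.
SCRIPT_TO_LANGUAGE = {
    "CYRILLIC": "Russian",
    "CJK": "Chinese",
    "HIRAGANA": "Japanese",
    "KATAKANA": "Japanese",
    "ARABIC": "Arabic",
}


def guess_language_by_script(text):
    """Guess a language from the dominant Unicode script of the letters in `text`."""
    scripts = Counter()
    for char in text:
        if not char.isalpha():
            continue  # skip spaces, digits, punctuation and combining marks
        # Character names start with the script, e.g. "CYRILLIC SMALL LETTER A".
        name = unicodedata.name(char, "")
        if name:
            scripts[name.split()[0]] += 1
    # Kana is taken as a sign of Japanese, even when kanji outnumber it.
    if scripts["HIRAGANA"] or scripts["KATAKANA"]:
        return "Japanese"
    if not scripts:
        return "Unknown"
    dominant = scripts.most_common(1)[0][0]
    return SCRIPT_TO_LANGUAGE.get(dominant, "Unknown")


for sample in ["ру́сский язы́к", "中文", "にほんご", "العَرَبِيَّة"]:
    print(guess_language_by_script(sample))
# Russian
# Chinese
# Japanese
# Arabic
```

Japanese written only in kanji would be reported as Chinese here, which is one reason the language-specific symbol search is still needed.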

2016-08-25 11:10
by Kerbiter