I want to get this:
Input text: "ру́сский язы́к"
Output text: "Russian"
Input text: "中文"
Output text: "Chinese"
Input text: "にほんご"
Output text: "Japanese"
Input text: "العَرَبِيَّة"
Output text: "Arabic"
How can I do this in Python? Thanks.
Have you had a look at langdetect?
from langdetect import detect
lang = detect("Ein, zwei, drei, vier")
print(lang)
# output: de
Returned only ro (Romanian) in my case. Multiple-language output is required for such cases; polyglot performs much better. - Yuriy Petrovskiy 2018-06-20 10:41
langdetect can determine different languages :- - Denis Kuzin 2018-06-27 10:12
TextBlob. Requires the NLTK package and uses Google's translation service under the hood.
from textblob import TextBlob
b = TextBlob("bonjour")
b.detect_language()
pip install textblob
Polyglot. Requires numpy and some arcane libraries; unlikely to get it working on Windows. (For Windows, get appropriate versions of PyICU, Morfessor and PyCLD2 from here, then just pip install downloaded_wheel.whl.) Able to detect texts with mixed languages.
from polyglot.detect import Detector
mixed_text = u"""
China (simplified Chinese: 中国; traditional Chinese: 中國),
officially the People's Republic of China (PRC), is a sovereign state
located in East Asia.
"""
for language in Detector(mixed_text).languages:
    print(language)
# name: English code: en confidence: 87.0 read bytes: 1154
# name: Chinese code: zh_Hant confidence: 5.0 read bytes: 1755
# name: un code: un confidence: 0.0 read bytes: 0
pip install polyglot
To install the dependencies, run:
sudo apt-get install python-numpy libicu-dev
chardet also has a feature of detecting the language if there are character bytes in the range (127-255]:
>>> import chardet
>>> chardet.detect("Я люблю вкусные пампушки".encode('cp1251'))
{'encoding': 'windows-1251', 'confidence': 0.9637267119204621, 'language': 'Russian'}
pip install chardet
langdetect. Requires large portions of text. It uses a non-deterministic approach under the hood, so you can get different results for the same text sample. The docs say you have to use the following code to make it deterministic:
from langdetect import detect, DetectorFactory
DetectorFactory.seed = 0
detect('今一はお前さん')
pip install langdetect
guess_language.
pip install guess_language-spirit
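Usage is then a one-liner. This is only a sketch, assuming guess_language-spirit exposes a guess_language() function in a guess_language module (check the package docs for your version):
# Assumption: the guess_language-spirit package provides this module and function.
from guess_language import guess_language

# Expected to return a short language code, e.g. 'fr' for French text.
print(guess_language("Ces eaux regorgent de créatures."))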
langid provides both a module
import langid
langid.classify("This is a test")
# ('en', -54.41310358047485)
and a command-line tool:
$ langid < README.md
pip install langid
detectlang is way faster than TextBlob - Anwarvic 2018-04-24 14:18
polyglot ended up being the most performant for my use case. langid came in second. - jamescampbell 2019-02-23 13:19
You can try determining the Unicode group of the characters in the input string to identify the type of language (Cyrillic for Russian, for example), and then search for language-specific symbols in the text; a rough sketch of the first step is below.
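A minimal sketch of that idea using only the standard library. The dominant_script helper name and the "first word of the Unicode character name" heuristic are illustrative assumptions, and this identifies scripts (CYRILLIC, CJK, HIRAGANA, ...) rather than languages:
import unicodedata

def dominant_script(text):
    """Return the most frequent Unicode script prefix among alphabetic characters."""
    counts = {}
    for ch in text:
        if ch.isalpha():  # skips digits, punctuation and combining accent marks
            # The first word of a character's Unicode name is usually its script,
            # e.g. "CYRILLIC SMALL LETTER ER" -> "CYRILLIC".
            script = unicodedata.name(ch, "UNKNOWN").split(" ")[0]
            counts[script] = counts.get(script, 0) + 1
    return max(counts, key=counts.get) if counts else None

print(dominant_script("ру́сский язы́к"))  # CYRILLIC
print(dominant_script("中文"))           # CJK
print(dominant_script("にほんご"))        # HIRAGANA
print(dominant_script("العَرَبِيَّة"))        # ARABIC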