Detect whether text is non-English

Mon*_*lal -5 php python text nlp language-detection

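What is the most accurate way to detect that a piece of text (specifically, an Instagram comment) is not in English? I am happy to use any high-level language, such as Python or PHP.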

$ sudo pip2 install guess_language
>>> from guess_language import guessLanguage
>>> guessLanguage('la vita e bella')
'UNKNOWN'
>>> guessLanguage('today is a good day')
'UNKNOWN'
>>> guessLanguage('????????????????(???)')
'ja'

$ sudo apt-get install php5.6-mbstring

if (strlen($comment->text) == mb_strlen($comment->text, 'utf-8')) {
    echo '- '.$comment->text."\n";
}

I get many comments that are written in English (Latin) characters but are not English. Examples:

- Khoda be khanevadehashon sabr bede tahamol konan
- Akhey...
- Eshghi
- K
- :-)
- Ey khodaa
- ...
- @samaneaghazamani1990 vaaaaay khoda chejoori payam dadan?
- :(
- Elahiiiii
- May Allah please with them and grant higher rank of jannah salutes to the  bravehearts @taraneh_alidoosti @fanpagemostafazamani
- Elaaaahiii
- Roohetoon shad.
- :'(
- Roheshon shad!! Yadeshon gerami!!
- .:'(
- :-(
- Oooo
- Awli
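
The PHP check above only compares byte length with character length, so it passes any comment made up entirely of single-byte (ASCII) characters, whatever language it is in. A minimal Python 3 sketch of the same test shows why romanized text slips through. The function name `is_single_byte` and the Japanese sample string are illustrative and not taken from the original code.

```python
def is_single_byte(text):
    """Mirror of the PHP strlen == mb_strlen check: True when every
    character encodes to exactly one byte in UTF-8 (plain ASCII)."""
    return len(text.encode("utf-8")) == len(text)


comments = [
    "Ey khodaa",            # from the output above: non-English, ASCII only
    "today is a good day",  # English, ASCII only
    "こんにちは",            # illustrative multi-byte sample
]

for comment in comments:
    print(is_single_byte(comment), comment)
```

The first two comments both pass the check, so a byte-length comparison can filter out non-Latin scripts but cannot separate English from other languages written in Latin letters.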

I don't want to use something like Google Translate, because I am processing a large amount of data.

Update:

$ sudo pip2 install langdetect
>>> from langdetect import detect
>>> detect("War doesn't show who's right, just who's left.")
'en'
>>> detect("today is a good day.")
'so'
>>> detect("la vita e bella!")
'it'
>>> detect("khoobi? khoshi?")
'so'
>>> detect("wow")
'pl'
>>> detect("what a day")
'en'
>>> detect("yay!")
'so'

Does 'so' mean unknown? I expected "today is a good day" to be detected as 'en'!
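
For context, `so` and `pl` are ISO 639-1 codes (Somali and Polish), not an "unknown" marker, so the detector is guessing a real language from very little evidence. One possible way to avoid trusting such guesses is to accept a result only when the text is long enough and the top probability is high. The sketch below is an illustration under assumptions: `pick_language`, `min_letters`, and `min_prob` are hypothetical names and thresholds, and the candidates are passed in as plain `(code, probability)` pairs so that the logic does not depend on any particular library.

```python
def pick_language(text, candidates, min_letters=10, min_prob=0.9):
    """Return the most probable language code, or 'unknown' when the
    evidence is weak.

    candidates: list of (code, probability) pairs from some detector.
    min_letters and min_prob are arbitrary illustrative thresholds.
    """
    letters = sum(1 for ch in text if ch.isalpha())
    if letters < min_letters or not candidates:
        return "unknown"
    code, prob = max(candidates, key=lambda pair: pair[1])
    return code if prob >= min_prob else "unknown"


# Hand-written candidate lists for illustration, not real detector output.
print(pick_language("wow", [("pl", 0.99)]))
print(pick_language("today is a good day.", [("so", 0.57), ("en", 0.43)]))
print(pick_language("War doesn't show who's right, just who's left.", [("en", 0.99)]))
```

With these thresholds, very short comments such as "wow" or "yay!" are reported as unknown instead of being assigned to an arbitrary language. The thresholds would need tuning on real comments.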

Roh*_*nil 5

You can use the language detection feature of the polyglot package.

>>> from polyglot.detect import Detector
>>> print(Detector('today is a good day.').language)
name: English     code: en       confidence:  95.0 read bytes:  1792

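  • I don't think either of them detects the language perfectly. To determine which one is better, you would need to run some tests on both. Polyglot has some extra features, such as detecting languages in mixed text. I'm not sure whether langid can do that. (3 upvotes)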
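
The comparison suggested in this comment could be set up as a small accuracy test over hand-labelled comments. The sketch below assumes each detector is wrapped as a function that takes a string and returns a language code. `accuracy` and `always_english` are illustrative placeholders, and the labels on the samples (taken from the question) are assumptions about the intended language.

```python
def accuracy(detect, labelled_samples):
    """Fraction of samples for which detect(text) returns the expected code."""
    if not labelled_samples:
        raise ValueError("need at least one labelled sample")
    correct = 0
    for text, expected in labelled_samples:
        try:
            predicted = detect(text)
        except Exception:
            # A detector that raises on a sample counts as a miss.
            predicted = None
        if predicted == expected:
            correct += 1
    return correct / len(labelled_samples)


# Sample texts from the question, with assumed labels.
samples = [
    ("today is a good day.", "en"),
    ("what a day", "en"),
    ("War doesn't show who's right, just who's left.", "en"),
    ("la vita e bella!", "it"),
]


def always_english(text):
    """Trivial stand-in baseline; swap in a wrapper around a real detector."""
    return "en"


print(accuracy(always_english, samples))
```

Running the same `samples` list through a wrapper for each library gives directly comparable scores. A useful test set would need many more comments, including the short and romanized ones that caused problems in the question.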