oce*_*800 10 python nlp nltk scikit-learn
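I am using NLTK and scikit-learn to do some text processing. However, within my list of documents I have some that are not in English. For example, the following could be true: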
[ "this is some text written in English",
"this is some more text written in English",
"Ce n'est pas en anglais" ]
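For the purposes of my analysis, I want to remove all non-English sentences as part of preprocessing. Is there a good way to do this? I have been Googling, but cannot find anything specific that lets me recognize whether a string is in English or not. Is this something that is offered as functionality in either NLTK or scikit-learn?

EDIT: I have seen two similar questions, but both deal with individual words, not a "document". Would I have to loop through every word in a sentence to check whether the whole sentence is in English?

I am using Python, so Python libraries would be preferable, but I can switch languages if needed; I just thought Python would be the best fit for this.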
There is a library called langdetect. It is ported from Google's language-detection:
https://pypi.python.org/pypi/langdetect
It supports 55 languages out of the box.
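As a rough sketch of how this could be applied to the list in the question: langdetect exposes a `detect` function that returns a language code such as `en`, and `DetectorFactory.seed` can be set to make its results repeatable. The helper names `keep_english` and `langdetect_detector` are made up for this illustration, and the detector is passed in as a parameter so the filtering logic can be tried without the library installed.

```python
from typing import Callable, Iterable, List


def keep_english(documents: Iterable[str],
                 detect_language: Callable[[str], str]) -> List[str]:
    """Return only the documents whose detected language code is 'en'."""
    english = []
    for text in documents:
        try:
            code = detect_language(text)
        except Exception:
            # Detectors may raise on empty or feature-less input
            # (langdetect raises LangDetectException); treat as non-English.
            continue
        if code == "en":
            english.append(text)
    return english


def langdetect_detector() -> Callable[[str], str]:
    """Build a detector backed by langdetect (needs `pip install langdetect`)."""
    from langdetect import DetectorFactory, detect

    DetectorFactory.seed = 0  # langdetect is non-deterministic unless seeded
    return detect
```

With the library installed, `keep_english(docs, langdetect_detector())` would be the call to try on the example list. Detection on very short strings can be unreliable, so it is worth spot-checking the results.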
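You might be interested in my paper, The WiLI benchmark dataset for written language identification. I also benchmarked a couple of tools.
TL;DR:
You can install lidtk and classify languages: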
$ lidtk cld2 predict --text "this is some text written in English"
eng
$ lidtk cld2 predict --text "this is some more text written in English"
eng
$ lidtk cld2 predict --text "Ce n'est pas en anglais"
fra
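I came across your question with a very similar need. I appreciated Martin Thoma's answer. However, I found the most help in part 7 of Rabash's answer.
After experimenting to find what worked best for my needs (making sure that the text files in a set of 60,000+ were in English), I found that fasttext was an excellent tool.
With a little work, I had a tool that ran quickly over many files. Below is the code with comments. I believe that you and others will be able to modify it for your more specific needs.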
import fasttext


class English_Check:
    def __init__(self):
        # Don't need to train a model to detect languages. A model exists
        # that is very good. Let's use it.
        pretrained_model_path = 'location of your lid.176.ftz file from fasttext'
        self.model = fasttext.load_model(pretrained_model_path)

    def predict_languages(self, text_file):
        this_D = {}
        with open(text_file, 'r') as f:
            fla = f.readlines()  # fla = file line array.
        # fasttext doesn't like newline characters, but it can take
        # an array of lines from a file. The two list comprehensions
        # below just clean up the lines in fla.
        fla = [line.rstrip('\n').strip(' ') for line in fla]
        fla = [line for line in fla if len(line) > 0]

        for line in fla:  # Language predict each line of the file
            language_tuple = self.model.predict(line)
            # The next two lines simply get at the top language prediction
            # string AND the confidence value for that prediction.
            prediction = language_tuple[0][0].replace('__label__', '')
            value = language_tuple[1][0]
            # Each top language prediction for the lines in the file
            # becomes a unique key for the this_D dictionary.
            # Every time that language is found, add the confidence
            # score to the running tally for that language.
            if prediction not in this_D.keys():
                this_D[prediction] = 0
            this_D[prediction] += value

        self.this_D = this_D

    def determine_if_file_is_english(self, text_file):
        self.predict_languages(text_file)
        # A file with no non-empty lines produces no predictions;
        # treat it as not English instead of failing on max() below.
        if not self.this_D:
            return False

        # Find the max tallied confidence and the sum of all confidences.
        max_value = max(self.this_D.values())
        sum_of_values = sum(self.this_D.values())
        # Calculate a relative confidence of the max confidence to all
        # confidence scores (not used in the return value below).
        # Then find the key with the max confidence.
        confidence = max_value / sum_of_values
        max_key = [key for key in self.this_D.keys()
                   if self.this_D[key] == max_value][0]

        # Only want to know if this is english or not.
        return max_key == 'en'
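The tallying step in the class above can be tried on its own, without a model file. The following is a minimal sketch, assuming per-line predictions are already available as `(label, confidence)` pairs; the function names and the `min_share` threshold are hypothetical additions, and the sample values in the usage note are invented, not real fasttext output.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def tally_confidences(predictions: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Sum the confidence scores per language, stripping fasttext's label prefix."""
    totals = defaultdict(float)
    for label, confidence in predictions:
        totals[label.replace("__label__", "")] += confidence
    return dict(totals)


def dominant_language(totals: Dict[str, float]) -> Tuple[str, float]:
    """Return the language with the largest tally and its share of all tallies."""
    if not totals:
        raise ValueError("no predictions to tally")
    top = max(totals, key=totals.get)
    share = totals[top] / sum(totals.values())
    return top, share


def is_english(predictions: Iterable[Tuple[str, float]],
               min_share: float = 0.5) -> bool:
    """True if 'en' has the largest tally and at least `min_share` of the total.

    `min_share` is an assumed threshold, not part of the class above.
    """
    totals = tally_confidences(predictions)
    if not totals:
        return False
    top, share = dominant_language(totals)
    return top == "en" and share >= min_share
```

For example, `is_english([("__label__en", 0.9), ("__label__en", 0.8), ("__label__fr", 0.6)])` tallies the two English lines together and compares them against the single French line. Using the relative share as a cutoff is one way the otherwise unused `confidence` value could be put to work.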
Below is how I instantiate and use the above class for my needs.
file_list = []  # fill in with your own list of files to check for English

en_checker = English_Check()
for file in file_list:
    check = en_checker.determine_if_file_is_english(file)
    if not check:
        print(file)
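How the file list gets built is left open above. One possible sketch, assuming the files are `.txt` files somewhere under a single directory: the function name `find_non_english`, the `pattern` default, and the idea of passing the checker in as a callable are all illustrative choices, not part of the original tool.

```python
from pathlib import Path
from typing import Callable, List


def find_non_english(root: str,
                     is_english: Callable[[str], bool],
                     pattern: str = "*.txt") -> List[str]:
    """Return the paths under `root` matching `pattern` that fail the check."""
    flagged = []
    # rglob walks the directory tree recursively; sorting keeps output stable.
    for path in sorted(Path(root).rglob(pattern)):
        if path.is_file() and not is_english(str(path)):
            flagged.append(str(path))
    return flagged
```

With the class above, the call would look like `find_non_english("some/directory", English_Check().determine_if_file_is_english)`, where the directory name is a placeholder.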