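How can I simplify the part-of-speech tags returned by Stanford's French POS tagger?
It is fairly easy to read an English sentence into NLTK, find the part of speech of each word, and then use map_tag() to simplify the tag set: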
#!/usr/bin/python
# -*- coding: utf-8 -*-
import os
from nltk.tag.stanford import POSTagger
from nltk.tokenize import word_tokenize
from nltk.tag import map_tag
#set java_home path from within script. Run os.getenv("JAVA_HOME") to test java_home
os.environ["JAVA_HOME"] = "C:\\Program Files\\Java\\jdk1.7.0_25\\bin"
english = u"the whole earth swarms with living beings, every plant, every grain and leaf, supports the life of thousands."
path_to_english_model = "C:\\Text\\Professional\\Digital Humanities\\Packages and Tools\\Stanford Packages\\stanford-postagger-full-2014-08-27\\stanford-postagger-full-2014-08-27\\models\\english-bidirectional-distsim.tagger"
path_to_jar = "C:\\Text\\Professional\\Digital Humanities\\Packages and Tools\\Stanford Packages\\stanford-postagger-full-2014-08-27\\stanford-postagger-full-2014-08-27\\stanford-postagger.jar"
#define english and french taggers
english_tagger = POSTagger(path_to_english_model, path_to_jar, encoding="utf-8") …
I am using NLTK and TextBlob to find nouns and noun phrases in a text:
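from textblob import TextBlob
import nltk
# is_noun is not defined in this excerpt; assumed here to match the
# Penn Treebank noun tags (NN, NNS, NNP, NNPS).
is_noun = lambda pos: pos.startswith("NN")
# noun phrases found by TextBlob
blob = TextBlob(text)
print(blob.noun_phrases)
# single-word nouns found by NLTK's default (English) tagger
tokenized = nltk.word_tokenize(text)
nouns = [word for (word, pos) in nltk.pos_tag(tokenized) if is_noun(pos)]
print(nouns)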
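This works fine if my text is in English, but not if it is in French.
I can't find out how to adapt this code to French. How do I do that?
Is there a list of all the languages that can be parsed?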
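One possible direction, as a minimal sketch rather than a definitive answer: once a French tagger has produced (word, tag) pairs, the noun filter above can be reused by first collapsing the tags into a coarse tag set. The sketch below assumes the tagger emits French Treebank-style tags such as NC, NPP and VINF. The mapping table `FRENCH_TO_UNIVERSAL`, the helper names `simplify_tag` and `extract_nouns`, and the hand-tagged sample sentence are all hypothetical and are not part of NLTK or TextBlob.

```python
# Hand-written, partial mapping from French Treebank-style tags to a coarse
# 12-tag "universal" style tag set. ASSUMPTION: the French tagger emits tags
# like these; check the tag set of the model you actually use.
FRENCH_TO_UNIVERSAL = {
    "NC": "NOUN", "NPP": "NOUN",
    "V": "VERB", "VINF": "VERB", "VPP": "VERB",
    "VPR": "VERB", "VS": "VERB", "VIMP": "VERB",
    "ADJ": "ADJ", "ADV": "ADV", "DET": "DET",
    "P": "ADP", "P+D": "ADP",
    "CC": "CONJ", "CS": "CONJ",
    "PRO": "PRON", "CLS": "PRON", "CLO": "PRON", "CLR": "PRON",
    "PONCT": ".",
}


def simplify_tag(tag):
    """Return the coarse tag for a fine-grained French tag, or "X" if unknown."""
    return FRENCH_TO_UNIVERSAL.get(tag, "X")


def extract_nouns(tagged_words):
    """Keep only the words whose simplified tag is NOUN."""
    return [word for word, tag in tagged_words if simplify_tag(tag) == "NOUN"]


# Illustrative, hand-tagged sample (not real tagger output).
sample = [
    ("La", "DET"), ("terre", "NC"), ("fourmille", "V"),
    ("de", "P"), ("vie", "NC"), (".", "PONCT"),
]

print([(word, simplify_tag(tag)) for word, tag in sample])
print(extract_nouns(sample))  # ['terre', 'vie']
```

A plain dictionary keeps the mapping easy to inspect and extend. Tags missing from the table fall back to "X" instead of raising an error, so gaps in the mapping show up in the output.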
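I am trying to use the Hugging Face Transformers library to POS-tag French text. In English, I was able to do so given a sentence such as:
The weather is really great. So let us go for a walk.
The result is: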
token feature
0 The DET
1 weather NOUN
2 is AUX
3 really ADV
4 great ADJ
5 . PUNCT
6 So ADV
7 let VERB
8 us PRON
9 go VERB
10 for ADP
11 a DET
12 walk NOUN
13 . PUNCT
Does anyone know how to achieve something similar for French?
This is the code I used for the English version in a Jupyter notebook:
!git clone https://github.com/bhoov/spacyface.git
!python -m spacy download en_core_web_sm
from transformers import pipeline
import numpy as np
import pandas as pd
nlp = pipeline('feature-extraction') …