How can I POS-tag French text using the Hugging Face Transformers library?

Question

gil*_*des 2 python nlp bert-language-model huggingface-transformers

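I am trying to use the Hugging Face Transformers library to POS-tag French text. For English, I can do this for a sentence such as:

The weather is really great. So let us go for a walk.

The result is: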

    token   feature
0   The     DET
1   weather NOUN
2   is      AUX
3   really  ADV
4   great   ADJ
5   .       PUNCT
6   So      ADV
7   let     VERB
8   us      PRON
9   go      VERB
10  for     ADP
11  a       DET
12  walk    NOUN
13  .       PUNCT

Does anyone know how to achieve something similar for French?

This is the code I used for the English version in a Jupyter notebook:

!git clone https://github.com/bhoov/spacyface.git
!python -m spacy download en_core_web_sm

from transformers import pipeline
import numpy as np
import pandas as pd

nlp = pipeline('feature-extraction')
sequence = "The weather is really great. So let us go for a walk."
result = nlp(sequence)
# Display the shape of the embeddings.
# For this sequence there are 16 tokens and the embedding size is 768.
np.array(result).shape

import sys
sys.path.append('spacyface')

from spacyface.aligner import BertAligner

alnr = BertAligner.from_pretrained("bert-base-cased")
tokens = alnr.meta_tokenize(sequence)
token_data = [{'token': tok.token, 'feature': tok.pos} for tok in tokens]
pd.DataFrame(token_data)

The output of this notebook is shown above.

Answer 1

gil*_*des 7

We ended up training a POS tagging (part-of-speech tagging) model with the Hugging Face Transformers library. The resulting model is available here:

https://huggingface.co/gilf/french-postag-model?text=En+Turquie%2C+Recep+Tayyip+Erdogan+ordonne+la+reconversion+de+Sainte-Sophie+en+mosqu%C3%A9e

You can see how it assigns POS tags on the page linked above. If you have the Hugging Face Transformers library installed, you can try it in a Jupyter notebook with the following code:

from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline

tokenizer = AutoTokenizer.from_pretrained("gilf/french-postag-model")
model = AutoModelForTokenClassification.from_pretrained("gilf/french-postag-model")

nlp_token_class = pipeline('ner', model=model, tokenizer=tokenizer, grouped_entities=True)
nlp_token_class('En Turquie, Recep Tayyip Erdogan ordonne la reconversion de Sainte-Sophie en mosquée')

This is the result on the console:

[{'entity_group': 'PONCT', 'score': 0.11994100362062454, 'word': '[CLS]'},
 {'entity_group': 'P', 'score': 0.9999570250511169, 'word': 'En'},
 {'entity_group': 'NPP', 'score': 0.9998692870140076, 'word': 'Turquie'},
 {'entity_group': 'PONCT', 'score': 0.9999769330024719, 'word': ','},
 {'entity_group': 'NPP', 'score': 0.9996993020176888, 'word': 'Recep Tayyip Erdogan'},
 {'entity_group': 'V', 'score': 0.9997997283935547, 'word': 'ordonne'},
 {'entity_group': 'DET', 'score': 0.9999586343765259, 'word': 'la'},
 {'entity_group': 'NC', 'score': 0.9999251365661621, 'word': 'reconversion'},
 {'entity_group': 'P', 'score': 0.9999709129333496, 'word': 'de'},
 {'entity_group': 'NPP', 'score': 0.9985082149505615, 'word': 'Sainte'},
 {'entity_group': 'PONCT', 'score': 0.9999614357948303, 'word': '-'},
 {'entity_group': 'NPP', 'score': 0.9461128115653992, 'word': 'Sophie'},
 {'entity_group': 'P', 'score': 0.9999079704284668, 'word': 'en'},
 {'entity_group': 'NC', 'score': 0.8998225331306458, 'word': 'mosquée [SEP]'}]
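The console output still contains the special tokens `[CLS]` and `[SEP]`, and the tags (`P`, `NPP`, `NC`, ...) differ from the `DET`/`NOUN` style labels in the question. As a rough sketch, the output can be post-processed into a token/tag table. The helper `to_rows` and the `ASSUMED_UPOS` mapping are hypothetical names introduced here for illustration. The mapping is only a guess that covers the six tags visible above, on the assumption that they follow a French Treebank style tagset; it is not something documented for this model.

```python
# Special tokens that the grouped pipeline output merges into the results.
SPECIAL_TOKENS = ("[CLS]", "[SEP]")

# ASSUMPTION: partial, hand-written mapping from the tags seen in the console
# output to Universal POS tags. Extend or correct it for your own data.
ASSUMED_UPOS = {
    "P": "ADP",
    "NPP": "PROPN",
    "NC": "NOUN",
    "V": "VERB",
    "DET": "DET",
    "PONCT": "PUNCT",
}


def to_rows(results):
    """Turn pipeline output dicts into (token, tag, upos) tuples."""
    rows = []
    for item in results:
        word = item["word"]
        # Strip special tokens, which may stand alone or be glued to a word.
        for special in SPECIAL_TOKENS:
            word = word.replace(special, "")
        word = word.strip()
        if not word:  # the entry was only a special token, e.g. '[CLS]'
            continue
        tag = item["entity_group"]
        # Tags missing from the assumed mapping are passed through unchanged.
        rows.append((word, tag, ASSUMED_UPOS.get(tag, tag)))
    return rows


# A few entries copied from the console output above.
sample = [
    {"entity_group": "PONCT", "score": 0.11994100362062454, "word": "[CLS]"},
    {"entity_group": "P", "score": 0.9999570250511169, "word": "En"},
    {"entity_group": "NPP", "score": 0.9998692870140076, "word": "Turquie"},
    {"entity_group": "NC", "score": 0.8998225331306458, "word": "mosquée [SEP]"},
]

for token, tag, upos in to_rows(sample):
    print(f"{token}\t{tag}\t{upos}")
```

In a notebook, the same rows could be passed to `pd.DataFrame(to_rows(nlp_token_class(text)), columns=["token", "tag", "upos"])` to get a table similar to the one in the question.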
