gil*_*des · Tags: python, nlp, bert-language-model, huggingface-transformers
I am trying to POS-tag French with the Hugging Face Transformers library. In English I can do this with a sentence such as:
The weather is really great. So let us go for a walk.
The result is:
token feature
0 The DET
1 weather NOUN
2 is AUX
3 really ADV
4 great ADJ
5 . PUNCT
6 So ADV
7 let VERB
8 us PRON
9 go VERB
10 for ADP
11 a DET
12 walk NOUN
13 . PUNCT
Does anyone know how to achieve something similar for French?
Here is the code I used for the English version in a Jupyter notebook:
!git clone https://github.com/bhoov/spacyface.git
!python -m spacy download en_core_web_sm
from transformers import pipeline
import numpy as np
import pandas as pd
nlp = pipeline('feature-extraction')
sequence = "The weather is really great. So let us go for a walk."
result = nlp(sequence)
# Just displays the shape of the embeddings for the sequence.
# In this case there are 16 tokens and the embedding size is 768.
np.array(result).shape
import sys
sys.path.append('spacyface')
from spacyface.aligner import BertAligner
alnr = BertAligner.from_pretrained("bert-base-cased")
tokens = alnr.meta_tokenize(sequence)
token_data = [{'token': tok.token, 'feature': tok.pos} for tok in tokens]
pd.DataFrame(token_data)
The output of that notebook is shown above.

We ended up training a POS-tagging (part-of-speech) model with the Hugging Face Transformers library. The resulting model (gilf/french-postag-model) is available on the Hugging Face model hub.

You can basically see how it assigns POS tags on the model page mentioned above. If you have the Hugging Face Transformers library installed, you can try it out in a Jupyter notebook with the following code:
from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline

tokenizer = AutoTokenizer.from_pretrained("gilf/french-postag-model")
model = AutoModelForTokenClassification.from_pretrained("gilf/french-postag-model")

nlp_token_class = pipeline('ner', model=model, tokenizer=tokenizer, grouped_entities=True)
nlp_token_class('En Turquie, Recep Tayyip Erdogan ordonne la reconversion de Sainte-Sophie en mosquée')

Here is the result on the console:
[{'entity_group': 'PONCT', 'score': 0.11994100362062454, 'word': '[CLS]'},
 {'entity_group': 'P', 'score': 0.9999570250511169, 'word': 'En'},
 {'entity_group': 'NPP', 'score': 0.9998692870140076, 'word': 'Turquie'},
 {'entity_group': 'PONCT', 'score': 0.9999769330024719, 'word': ','},
 {'entity_group': 'NPP', 'score': 0.9996993020176888, 'word': 'Recep Tayyip Erdogan'},
 {'entity_group': 'V', 'score': 0.9997997283935547, 'word': 'ordonne'},
 {'entity_group': 'DET', 'score': 0.9999586343765259, 'word': 'la'},
 {'entity_group': 'NC', 'score': 0.9999251365661621, 'word': 'reconversion'},
 {'entity_group': 'P', 'score': 0.9999709129333496, 'word': 'de'},
 {'entity_group': 'NPP', 'score': 0.9985082149505615, 'word': 'Sainte'},
 {'entity_group': 'PONCT', 'score': 0.9999614357948303, 'word': '-'},
 {'entity_group': 'NPP', 'score': 0.9461128115653992, 'word': 'Sophie'},
 {'entity_group': 'P', 'score': 0.9999079704284668, 'word': 'en'},
 {'entity_group': 'NC', 'score': 0.8998225331306458, 'word': 'mosquée [SEP]'}]
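If you want a token/feature table like the one in the English example, the grouped pipeline output can be cleaned up and loaded into a pandas DataFrame. This is only a sketch: it works directly on a copy of the output shown above (scores truncated for brevity) rather than re-running the model, and the column names `token`/`feature` are chosen here to mirror the English table, not anything the model itself produces.

```python
import pandas as pd

# Grouped output of the gilf/french-postag-model pipeline, copied from above
# (scores truncated; the [CLS]/[SEP] special tokens are still attached).
result = [
    {'entity_group': 'PONCT', 'score': 0.1199, 'word': '[CLS]'},
    {'entity_group': 'P', 'score': 0.9999, 'word': 'En'},
    {'entity_group': 'NPP', 'score': 0.9998, 'word': 'Turquie'},
    {'entity_group': 'PONCT', 'score': 0.9999, 'word': ','},
    {'entity_group': 'NPP', 'score': 0.9996, 'word': 'Recep Tayyip Erdogan'},
    {'entity_group': 'V', 'score': 0.9997, 'word': 'ordonne'},
    {'entity_group': 'DET', 'score': 0.9999, 'word': 'la'},
    {'entity_group': 'NC', 'score': 0.9999, 'word': 'reconversion'},
    {'entity_group': 'P', 'score': 0.9999, 'word': 'de'},
    {'entity_group': 'NPP', 'score': 0.9985, 'word': 'Sainte'},
    {'entity_group': 'PONCT', 'score': 0.9999, 'word': '-'},
    {'entity_group': 'NPP', 'score': 0.9461, 'word': 'Sophie'},
    {'entity_group': 'P', 'score': 0.9999, 'word': 'en'},
    {'entity_group': 'NC', 'score': 0.8998, 'word': 'mosquée [SEP]'},
]

# Strip the BERT special tokens and rename the columns so the table
# matches the token/feature layout of the English example.
rows = []
for entry in result:
    word = entry['word'].replace('[CLS]', '').replace('[SEP]', '').strip()
    if word:  # drop entries that were only a special token
        rows.append({'token': word, 'feature': entry['entity_group']})

df = pd.DataFrame(rows)
print(df)
```

Note that the tagset here (P, NPP, NC, V, …) comes from the French treebank the model was trained on, so it will not match the Universal POS tags (DET, NOUN, AUX, …) shown in the English output.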