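I'm writing an inference script for an already-trained NER model, but I'm having trouble converting the encoded tokens (their ids) back into the original words.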
import pandas as pd
from datasets import Dataset

# example input
df = pd.DataFrame({'_id': [1], 'body': ['Amazon and Tesla are currently the best picks out there!']})
# calling the method that handles inference:
ner_model = NER()
ner_model.recognize_from_df(df, 'body')
# here is only part of the larger NER class that handles the inference:
def recognize_from_df(self, df: pd.DataFrame, input_col: str):
    predictions = []
    df = df[['_id', input_col]].copy()
    dataset = Dataset.from_pandas(df)
    # tokenization, padding, truncation:
    encoded_dataset = dataset.map(
        lambda examples: self.bert_tokenizer(examples[input_col],
                                             padding='max_length', truncation=True, max_length=512),
        batched=True)
    encoded_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask'], device=device)
    dataloader …
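For what it's worth, here is a minimal sketch of the id-to-word step, assuming a BERT-style WordPiece tokenizer where continuation pieces start with `##`. In the real script the token strings would presumably come from `self.bert_tokenizer.convert_ids_to_tokens(input_ids)`; here they are hard-coded so the snippet runs on its own. The helper name `merge_wordpieces`, the `tes`/`##la` split, and the `B-ORG`/`I-ORG` labels are all made up for illustration and are not taken from the trained model.

```python
from typing import List, Tuple

# Special tokens a BERT-style tokenizer adds; they carry no word content.
SPECIAL_TOKENS = {"[CLS]", "[SEP]", "[PAD]"}


def merge_wordpieces(tokens: List[str], labels: List[str]) -> List[Tuple[str, str]]:
    """Merge WordPiece sub-tokens into words, keeping the first piece's label."""
    words: List[Tuple[str, str]] = []
    for token, label in zip(tokens, labels):
        if token in SPECIAL_TOKENS:
            continue  # skip padding and sentence markers
        if token.startswith("##") and words:
            # Continuation piece: glue it onto the previous word.
            prev_word, prev_label = words[-1]
            words[-1] = (prev_word + token[2:], prev_label)
        else:
            words.append((token, label))
    return words


# Hypothetical tokens and per-token predictions for a shortened sentence.
tokens = ["[CLS]", "amazon", "and", "tes", "##la", "are", "picks", "[SEP]", "[PAD]"]
labels = ["O", "B-ORG", "O", "B-ORG", "I-ORG", "O", "O", "O", "O"]

result = merge_wordpieces(tokens, labels)
print(result)
# [('amazon', 'B-ORG'), ('and', 'O'), ('tesla', 'B-ORG'), ('are', 'O'), ('picks', 'O')]
```

Note that this only recovers the tokenizer's normalized words (lowercased with an uncased vocabulary). If the exact original spelling is needed, a fast tokenizer can be called with `return_offsets_mapping=True` and the character spans used to slice the input string instead.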