Mapping text data through a Hugging Face tokenizer

Question

sac*_*ruk 5 tensorflow tensorflow-datasets huggingface-transformers

My encode function looks like this:

from transformers import BertTokenizer, BertModel

MODEL = 'bert-base-multilingual-uncased'
tokenizer = BertTokenizer.from_pretrained(MODEL)

def encode(texts, tokenizer=tokenizer, maxlen=10):
#     import pdb; pdb.set_trace()
    inputs = tokenizer.encode_plus(
        texts,
        return_tensors='tf',
        return_attention_masks=True, 
        return_token_type_ids=True,
        pad_to_max_length=True,
        max_length=maxlen
    )

    return inputs['input_ids'], inputs["token_type_ids"], inputs["attention_mask"]

I want to encode my data on the fly by doing this:

x_train = (tf.data.Dataset.from_tensor_slices(df_train.comment_text.astype(str).values)
           .map(encode))

However, this throws the error:

ValueError: Input is not valid. Should be a string, a list/tuple of strings or a list/tuple of integers.

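From what I understood when I set a breakpoint inside encode, this happens because I am passing in something that is not a NumPy array. How do I get the Hugging Face transformers tokenizer to play nicely with TensorFlow strings as inputs?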

If you need a dummy dataframe, here it is:

df_train = pd.DataFrame({'comment_text': ['Today was a good day']*5})

What I tried

I tried using from_generator so that I could pass plain Python strings to the encode_plus function. However, this does not work with TPUs.

ValueError: Input is not valid. Should be a string, a list/tuple of strings or a list/tuple of integers.

Version info:

transformers.__version__, tf.__version__=>('2.7.0', '2.1.0')

Answer 1

小智 6

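BERT's tokenizer works on a string, a list/tuple of strings, or a list/tuple of integers, so check that your data really is converted to strings. To apply the tokenizer to the whole dataset I used Dataset.map, but that runs in graph mode, so I needed to wrap the call in tf.py_function. tf.py_function passes regular tensors (with a value and a .numpy() method to access it) to the wrapped Python function. After going through py_function my data arrived as bytes, so I applied tf.compat.as_str to convert the bytes to strings.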

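import tensorflow as tf
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

def encode(lang1, lang2):
    # Inside tf.py_function the inputs are eager tensors holding bytes.
    lang1 = tokenizer.encode(tf.compat.as_str(lang1.numpy()), add_special_tokens=True)
    lang2 = tokenizer.encode(tf.compat.as_str(lang2.numpy()), add_special_tokens=True)
    return lang1, lang2

def tf_encode(pt, en):
    # Wrap the Python function so that Dataset.map can call it from graph mode.
    result_pt, result_en = tf.py_function(func=encode, inp=[pt, en], Tout=[tf.int64, tf.int64])
    # py_function drops static shape information, so restore the rank here.
    result_pt.set_shape([None])
    result_en.set_shape([None])
    return result_pt, result_en

# dataset3 is the answerer's own dataset (presumably pairs of strings); it is not defined here.
train_dataset = dataset3.map(tf_encode)
BUFFER_SIZE = 200
BATCH_SIZE = 64

train_dataset = train_dataset.shuffle(BUFFER_SIZE).padded_batch(BATCH_SIZE,
                                                           padded_shapes=(60, 60))
a, p = next(iter(train_dataset))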
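
The same tf.py_function pattern can be adapted to the question's case of one text column and three outputs. The following is only a minimal sketch under stated assumptions: toy_encode is a hypothetical stand-in for tokenizer.encode_plus (it maps each word to its length) so that the example runs without downloading a model, and MAXLEN, py_encode and tf_encode are illustrative names, not part of either library.

```python
import numpy as np
import tensorflow as tf

MAXLEN = 10  # assumed fixed length, mirroring maxlen=10 in the question


def toy_encode(text, maxlen=MAXLEN):
    """Hypothetical stand-in for tokenizer.encode_plus: str -> three int lists."""
    ids = [len(word) for word in text.split()][:maxlen]
    pad = maxlen - len(ids)
    input_ids = ids + [0] * pad
    token_type_ids = [0] * maxlen
    attention_mask = [1] * len(ids) + [0] * pad
    return input_ids, token_type_ids, attention_mask


def py_encode(text):
    # Inside py_function, `text` is an eager tensor whose value is bytes.
    decoded = tf.compat.as_str(text.numpy())
    input_ids, token_type_ids, attention_mask = toy_encode(decoded)
    return (np.array(input_ids, dtype=np.int32),
            np.array(token_type_ids, dtype=np.int32),
            np.array(attention_mask, dtype=np.int32))


def tf_encode(text):
    # Graph-mode wrapper: hands the element over to plain Python code.
    outputs = tf.py_function(func=py_encode, inp=[text],
                             Tout=[tf.int32, tf.int32, tf.int32])
    for tensor in outputs:
        # Static shapes are lost across py_function, so set them again.
        tensor.set_shape([MAXLEN])
    return tuple(outputs)


texts = ['Today was a good day'] * 5
dataset = tf.data.Dataset.from_tensor_slices(texts).map(tf_encode)
first_ids, first_types, first_mask = next(iter(dataset))
```

To use a real tokenizer, the body of toy_encode would be replaced by a call to it on the decoded string. Note that the TensorFlow documentation describes tf.py_function as running in the Python process that created it, so this approach may hit the same TPU limitation the question mentions for from_generator.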

Answer 2

cro*_*oik 2

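When you create a TensorFlow dataset with tf.data.Dataset.from_tensor_slices(df_train.comment_text.astype(str).values), TensorFlow converts your strings into tensors of string type, which is not an accepted input for tokenizer.encode_plus. As the error message says, it only accepts a string, a list/tuple of strings or a list/tuple of integers. You can verify this by adding print(type(texts)) inside your encode function (output: <class 'tensorflow.python.framework.ops.Tensor'>).

I am not sure what your follow-up plan is or why you need a tf.data.Dataset, but you have to encode your input before you turn it into a tf.data.Dataset: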
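
As a rough illustration of that check, the sketch below records what a mapped function actually receives. The names inspect_element and seen are hypothetical helpers introduced only for this example; no tokenizer is involved.

```python
import tensorflow as tf

seen = []  # (type name, is it a Python str?) for each traced call


def inspect_element(text):
    # Dataset.map traces this function in graph mode, so `text` is a
    # string-dtype tensor rather than a Python str.
    seen.append((type(text).__name__, isinstance(text, str)))
    return text


texts = ['Today was a good day'] * 5
dataset = tf.data.Dataset.from_tensor_slices(texts).map(inspect_element)
element_dtype = dataset.element_spec.dtype
```

After this runs, seen holds a tensor type name paired with False, and element_dtype is tf.string, which is why a tokenizer expecting a Python str rejects the input.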

import tensorflow as tf
from transformers import BertTokenizer, BertModel

MODEL = 'bert-base-multilingual-uncased'
tokenizer = BertTokenizer.from_pretrained(MODEL)

texts = ['Today was a good day', 'Today was a bad day',
       'Today was a rainy day', 'Today was a sunny day',
       'Today was a cloudy day']


#inputs['input_ids'], inputs["token_type_ids"], inputs["attention_mask"]
inputs = tokenizer.batch_encode_plus(
        texts,
        return_tensors='tf',
        return_attention_masks=True, 
        return_token_type_ids=True,
        pad_to_max_length=True,
        max_length=10
    )

dataset = tf.data.Dataset.from_tensor_slices((inputs['input_ids'],
                                              inputs['attention_mask'],
                                              inputs['token_type_ids']))
print(type(dataset))

Output:

<class 'tensorflow.python.data.ops.dataset_ops.TensorSliceDataset'>
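
One possible next step, sketched here under assumptions, is to key the pre-encoded arrays by name and batch them. The NumPy arrays below are stand-ins with the shape batch_encode_plus would produce for 5 texts at max_length=10; they are not real token IDs, and the batch size of 2 is arbitrary.

```python
import numpy as np
import tensorflow as tf

num_texts, maxlen = 5, 10

# Stand-in arrays in place of inputs['input_ids'] etc. from the tokenizer.
input_ids = np.ones((num_texts, maxlen), dtype=np.int32)
attention_mask = np.ones((num_texts, maxlen), dtype=np.int32)
token_type_ids = np.zeros((num_texts, maxlen), dtype=np.int32)

# A dict keeps each feature addressable by name after batching.
dataset = tf.data.Dataset.from_tensor_slices({
    'input_ids': input_ids,
    'attention_mask': attention_mask,
    'token_type_ids': token_type_ids,
})

batched = dataset.shuffle(num_texts).batch(2)
first_batch = next(iter(batched))
```

Because the encoding happens before the dataset is built, nothing in this pipeline needs to call Python code at iteration time.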

Archived:	5 years, 8 months ago
Views:	9226
Last activity:	4 years, 3 months ago