How do I return a dictionary of tensors from tf.py_function?

Question

Cel*_*nça 4 python-3.x tensorflow2.0 huggingface-transformers

Typically, transformers tokenizers encode the input as a dictionary.

{"input_ids": tf.int32, "attention_mask": tf.int32, "token_type_ids": tf.int32}

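For better performance on large datasets, it is best to implement a pipeline that uses Dataset.map to apply the tokenizer function to each element of the input dataset, exactly as is done in the TensorFlow tutorial "Load text".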

However, tf.py_function (used to wrap the Python function passed to map) does not support returning a dictionary of tensors like the one shown above.

For example, suppose the tokenizer (encoder) from the "Load text" tutorial returns the following dictionary:

{
    "input_ids": [ 101, 13366,  2131,  1035,  6819,  2094,  1035,  102 ],
    "attention_mask": [ 1, 1, 1, 1, 1, 1, 1, 1 ]
}

How do I set the Tout parameter of tf.py_function to get the desired dictionary of tensors:

{
    'input_ids': <tf.Tensor: shape=(8,), dtype=int32, numpy=array(
    [ 101, 13366,  2131,  1035,  6819,  2094,  1035,  102 ], dtype=int32)>,
    'attention_mask': <tf.Tensor: shape=(8,), dtype=int32, numpy=array(
    [ 1, 1, 1, 1, 1, 1, 1, 1 ], dtype=int32)>
}

?

Answer 1

小智 8

tf.py_function does not allow a Python dict as its return type. See https://github.com/tensorflow/tensorflow/issues/36276

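As a workaround for your case, you can do the data transformation inside py_function and return the results as a tuple, then call a second Dataset.map, without py_function, that builds and returns the dictionary.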

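import tensorflow as tf

def gen():
  # Dummy generator that yields a single placeholder element.
  yield 1

def process_data(x):
  # Runs as plain Python inside tf.py_function; returns a tuple, not a dict.
  return ([ 101, 13366,  2131,  1035,  6819,  2094,  1035,  102 ],
          [ 1, 1, 1, 1, 1, 1, 1, 1 ])

def create_dict(input_ids, attention_mask):
  # Ordinary map function (no py_function), so returning a dict is allowed.
  return {"input_ids": tf.convert_to_tensor(input_ids),
          "attention_mask": tf.convert_to_tensor(attention_mask)}

ds = (tf.data.Dataset
      .from_generator(gen, output_types=tf.int32)
      .map(lambda x: tf.py_function(process_data, inp=[x],
                                    Tout=(tf.int32, tf.int32)))
      .map(create_dict)
      .repeat())

for x in ds:
  print(x)
  break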

Output:

{'input_ids': <tf.Tensor: shape=(8,), dtype=int32, numpy=
array([  101, 13366,  2131,  1035,  6819,  2094,  1035,   102],
      dtype=int32)>, 'attention_mask': <tf.Tensor: shape=(8,), dtype=int32, numpy=array([1, 1, 1, 1, 1, 1, 1, 1], dtype=int32)>}
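
The same idea can be extended to a tokenizer that actually returns a dict. The following is only a minimal sketch: toy_tokenize is a hypothetical stand-in for a real tokenizer, and KEYS, encode, and to_dict are names made up for illustration. It assumes every value in the dict is a 1-D list of ints. The dict is flattened into a fixed key order inside py_function and rebuilt afterwards in the same map call.

```python
import tensorflow as tf

# Fixed key order shared by both steps (an assumption; match your tokenizer).
KEYS = ("input_ids", "attention_mask")

def toy_tokenize(text):
    # Hypothetical stand-in for a real tokenizer: one fake id per character.
    ids = [101] + [ord(c) for c in text] + [102]
    return {"input_ids": ids, "attention_mask": [1] * len(ids)}

def encode(text_tensor):
    # Runs eagerly inside tf.py_function, so .numpy() is available here.
    encoded = toy_tokenize(text_tensor.numpy().decode("utf-8"))
    # Return a flat list in KEYS order, because a dict is not allowed here.
    return [encoded[k] for k in KEYS]

def to_dict(text):
    values = tf.py_function(encode, inp=[text], Tout=[tf.int32] * len(KEYS))
    for v in values:
        # py_function drops static shape information; restore the rank.
        v.set_shape([None])
    # Outside py_function, returning a dict from the map function is fine.
    return dict(zip(KEYS, values))

ds = tf.data.Dataset.from_tensor_slices(["ab", "cde"]).map(to_dict)
first = next(iter(ds))
```

Restoring the shape with set_shape is optional for simple iteration, but later steps such as padded_batch generally need to know the rank of each component.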
