tf.SequenceExample with multidimensional array

Tor*_*oal 20 python protocol-buffers multidimensional-array tensorflow

In TensorFlow, I want to save a multidimensional array to a TFRecord. For example:

[[1, 2, 3], [1, 2], [3, 2, 1]]

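Since the task I am trying to solve is sequential, I am trying to use TensorFlow's tf.train.SequenceExample(), and I can successfully write the data to a TFRecord file. However, when I try to load the data from the TFRecord file using tf.parse_single_sequence_example, I get a large number of cryptic errors: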

W tensorflow/core/framework/op_kernel.cc:936] Invalid argument: Name: , Key: input_characters, Index: 1.  Number of int64 values != expected.  values size: 6 but output shape: []
E tensorflow/core/client/tensor_c_api.cc:485] Name: , Key: input_characters, Index: 1.  Number of int64 values != expected.  values size: 6 but output shape: []

The function I use to try to load the data is below:

def read_and_decode_single_example(filename):

    filename_queue = tf.train.string_input_producer([filename],
                                                num_epochs=None)

    reader = tf.TFRecordReader()
    _, serialized_example = reader.read(filename_queue)

    context_features = {
         "length": tf.FixedLenFeature([], dtype=tf.int64)
    }

    sequence_features = {
         "input_characters": tf.FixedLenSequenceFeature([], dtype=tf.int64),
         "output_characters": tf.FixedLenSequenceFeature([], dtype=tf.int64)
    }

    context_parsed, sequence_parsed = tf.parse_single_sequence_example(
        serialized=serialized_example,
        context_features=context_features,
        sequence_features=sequence_features
    )

    context = tf.contrib.learn.run_n(context_parsed, n=1, feed_dict=None)
    print(context)

The function I use to save the data is here:

# http://www.wildml.com/2016/08/rnns-in-tensorflow-a-practical-guide-and-undocumented-features/
def make_example(input_sequence, output_sequence):
    """
    Makes a single example from Python lists that follows the
    format of tf.train.SequenceExample.
    """

    example_sequence = tf.train.SequenceExample()

    # 3D length
    sequence_length = sum([len(word) for word in input_sequence])
    example_sequence.context.feature["length"].int64_list.value.append(sequence_length)

    input_characters = example_sequence.feature_lists.feature_list["input_characters"]
    output_characters = example_sequence.feature_lists.feature_list["output_characters"]

    for input_character, output_character in izip_longest(input_sequence,
                                                          output_sequence):

        # Extend seems to work, therefore it replaces append.
        # izip_longest pads the shorter sequence with None, so check each element.
        if input_character is not None:
            input_characters.feature.add().int64_list.value.extend(input_character)

        if output_character is not None:
            output_characters.feature.add().int64_list.value.extend(output_character)

    return example_sequence

Any help would be welcome.

Mul*_*ter 7

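I had the same problem. I think it is entirely solvable, but you have to decide on the output format and then figure out how you are going to use it.

First, what is your error?

The error message tells you that what you are trying to read does not fit the feature size you specified. So where did you specify it? Right here: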

sequence_features = {
    "input_characters": tf.FixedLenSequenceFeature([], dtype=tf.int64),
    "output_characters": tf.FixedLenSequenceFeature([], dtype=tf.int64)
}

This says "my input_characters is a sequence of single values", but that is not true; what you have is a sequence of sequences of single values, hence the error.
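To make the mismatch concrete, here is a plain-Python sketch (no TensorFlow required) that counts the values in each step of the sequence and compares that with what a given per-step shape expects. The helper names `expected_values_per_step` and `find_mismatches` are made up for illustration and are not part of the TensorFlow API; the sketch only mimics the size check the parser appears to perform.

```python
def expected_values_per_step(shape):
    """Number of values one step must hold for a given per-step shape.

    An empty shape [] means a scalar, i.e. exactly one value per step.
    """
    count = 1
    for dim in shape:
        count *= dim
    return count


def find_mismatches(sequence, shape):
    """Return (step_index, actual_count) for every step of the wrong size."""
    expected = expected_values_per_step(shape)
    return [(i, len(step)) for i, step in enumerate(sequence)
            if len(step) != expected]


data = [[1, 2, 3], [1, 2], [3, 2, 1]]

# Shape [] (as in the question): every step holds more than one value.
scalar_mismatches = find_mismatches(data, [])

# Shape [3]: only the step of length 2 does not fit.
triple_mismatches = find_mismatches(data, [3])

print(scalar_mismatches)
print(triple_mismatches)
```

With shape [] every step is flagged, and with shape [3] only the ragged step is flagged, so a fixed per-step shape only works when all inner lists have the same length.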

Second, what can you do?

If you instead use:

a = [[1,2,3], [2,3,1], [3,2,1]] 
sequence_features = {
    "input_characters": tf.FixedLenSequenceFeature([3], dtype=tf.int64),
    "output_characters": tf.FixedLenSequenceFeature([3], dtype=tf.int64)
}

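Your code will not raise the error, because you have specified that each element of the top-level sequence is 3 elements long.

Alternatively, if your sequences do not have a fixed length, you will have to use a different type of feature.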

sequence_features = {
    "input_characters": tf.VarLenFeature(tf.int64),
    "output_characters": tf.VarLenFeature(tf.int64)
}

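VarLenFeature tells the parser that the length is unknown before reading. Unfortunately, this means your input_characters can no longer be read as a dense vector in one step. Instead, it is returned as a SparseTensor by default. You can convert it to a dense tensor with tf.sparse_tensor_to_dense, e.g.: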

input_densified = tf.sparse_tensor_to_dense(sequence_parsed['input_characters'])

As mentioned in the article you have been looking at, if your data does not always have the same length, your vocabulary has to contain a "not_really_a_word" word that you use as the default index. For example, say you map index 0 to the "not_really_a_word" word; then your

a = [[1,2,3],  [2,3],  [3,2,1]]

Python list will end up as an

array((1,2,3),  (2,3,0),  (3,2,1))

tensor.
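The padding step can be sketched in plain Python. This is only an illustration of the idea, not what tf.sparse_tensor_to_dense does internally; `PAD_INDEX` and `pad_sequences` are names assumed here, and index 0 is assumed to be the "not_really_a_word" entry as in the example above.

```python
PAD_INDEX = 0  # assumed index of the "not_really_a_word" vocabulary entry


def pad_sequences(sequences, pad_value=PAD_INDEX):
    """Right-pad every inner list to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences) if sequences else 0
    return [list(seq) + [pad_value] * (max_len - len(seq))
            for seq in sequences]


a = [[1, 2, 3], [2, 3], [3, 2, 1]]
padded = pad_sequences(a)
print(padded)
```

After padding, every row has the same length, so the data can be treated as a regular 2-D array.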

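Be warned: I am not certain that back-propagation "just works" for SparseTensors the way it does for dense tensors. The wildml article talks about padding each sequence with 0s and masking the loss for the "not_actually_a_word" word (see "Side note: Be careful with 0's in your vocabulary/classes" in their article). This seems to suggest that the first method will be easier to implement.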
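As a rough sketch of what masking the loss means, the plain-Python function below averages per-step losses while skipping padded positions. It masks by the true sequence length rather than by checking for the value 0, on the assumption that 0 might also be a legitimate class; the function name and the sample numbers are invented for illustration and are not taken from the article.

```python
def masked_mean_loss(per_step_losses, lengths):
    """Average the losses over real (non-padded) time steps only.

    per_step_losses: one list of per-step losses per padded sequence.
    lengths: the true, unpadded length of each sequence.
    """
    total = 0.0
    count = 0
    for losses, length in zip(per_step_losses, lengths):
        for t, loss in enumerate(losses):
            if t < length:  # positions at or beyond `length` are padding
                total += loss
                count += 1
    return total / count if count else 0.0


# The second sequence has true length 2, so its third loss (9.0) is ignored.
losses = [[0.5, 0.25, 0.25], [1.0, 1.0, 9.0], [0.5, 0.25, 0.25]]
lengths = [3, 2, 3]
mean_loss = masked_mean_loss(losses, lengths)
print(mean_loss)
```

Without the mask, the large loss on the padded position would dominate the average even though it carries no information.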

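Note that this is different from the case described here, where each example is a sequence of sequences. As I understand it, the reason this approach is not well supported is that it is an abuse of the case this feature is meant to support: loading fixed-size embeddings directly.

I will assume that the next thing you want to do is turn those numbers into word embeddings. You can turn a list of indices into a list of embeddings with tf.nn.embedding_lookup.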
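Conceptually, an embedding lookup gathers rows of an embedding matrix by index. The plain-Python sketch below illustrates that idea on padded index lists; the table values, its size, and the helper name `embedding_lookup` are assumptions for illustration and this is not the TensorFlow implementation.

```python
# Toy embedding table: one row per vocabulary index, 2 dimensions each.
embedding_table = [
    [0.0, 0.0],  # index 0: assumed padding / "not_really_a_word" row
    [0.1, 0.2],  # index 1
    [0.3, 0.4],  # index 2
    [0.5, 0.6],  # index 3
]


def embedding_lookup(table, indices):
    """Replace every index in a 2-D list of indices by its table row."""
    return [[list(table[i]) for i in row] for row in indices]


padded_indices = [[1, 2, 3], [2, 3, 0]]
embedded = embedding_lookup(embedding_table, padded_indices)
print(embedded)
```

The result has one more dimension than the input: each index has been replaced by its embedding vector, including the padding index.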


Max*_*ers 5

With the provided code I could not reproduce your error, but making some educated guesses gave the following working code.

import tensorflow as tf

tmp_filename = 'tf.tmp'

sequences = [[1, 2, 3], [1, 2], [3, 2, 1]]
label_sequences = [[0, 1, 0], [1, 0], [1, 1, 1]]

def make_example(input_sequence, output_sequence):
    """
    Makes a single example from Python lists that follows the
    format of tf.train.SequenceExample.
    """

    example_sequence = tf.train.SequenceExample()

    # 3D length
    sequence_length = len(input_sequence)

    example_sequence.context.feature["length"].int64_list.value.append(sequence_length)

    input_characters = example_sequence.feature_lists.feature_list["input_characters"]
    output_characters = example_sequence.feature_lists.feature_list["output_characters"]

    for input_character, output_character in zip(input_sequence,
                                                          output_sequence):

        if input_character is not None:
            input_characters.feature.add().int64_list.value.append(input_character)

        if output_character is not None:
            output_characters.feature.add().int64_list.value.append(output_character)

    return example_sequence

# Write all examples into a TFRecords file
def save_tf(filename):
    # TFRecordWriter creates the file itself, so no separate open() is needed.
    writer = tf.python_io.TFRecordWriter(filename)
    for sequence, label_sequence in zip(sequences, label_sequences):
        ex = make_example(sequence, label_sequence)
        writer.write(ex.SerializeToString())
    writer.close()

def read_and_decode_single_example(filename):

    filename_queue = tf.train.string_input_producer([filename],
                                                num_epochs=None)

    reader = tf.TFRecordReader()
    _, serialized_example = reader.read(filename_queue)

    context_features = {
         "length": tf.FixedLenFeature([], dtype=tf.int64)
    }

    sequence_features = {
         "input_characters": tf.FixedLenSequenceFeature([], dtype=tf.int64),
         "output_characters": tf.FixedLenSequenceFeature([], dtype=tf.int64)
    }


    return serialized_example, context_features, sequence_features

save_tf(tmp_filename)
ex,context_features,sequence_features = read_and_decode_single_example(tmp_filename)
context_parsed, sequence_parsed = tf.parse_single_sequence_example(
    serialized=ex,
    context_features=context_features,
    sequence_features=sequence_features
)

sequence = tf.contrib.learn.run_n(sequence_parsed, n=1, feed_dict=None)
#check if the saved data matches the input data
print(sequences[0] in sequence[0]['input_characters'])

The required changes were:

  1. sequence_length = sum([len(word) for word in input_sequence]) was changed to sequence_length = len(input_sequence)

Otherwise it does not work with your example data.

  2. extend was changed to append
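To illustrate why the length computation matters, the short sketch below compares the two expressions on a nested list and on a flat list of integers, which is what each example looks like in the working code above. The variable names are chosen here for illustration only.

```python
nested = [[1, 2, 3], [1, 2], [3, 2, 1]]
total_values = sum(len(word) for word in nested)  # counts every inner value
num_steps = len(nested)                            # counts top-level steps

# In the working code each example is a flat list of ints, so len(word)
# is called on an int and fails.
flat = [1, 2, 3]
try:
    sum(len(word) for word in flat)
    flat_sum_failed = False
except TypeError:
    flat_sum_failed = True

print(total_values, num_steps, flat_sum_failed)
```

The same reasoning applies to extend versus append: extend expects an iterable of values, while append takes a single value, which is what each step holds once the examples are flat lists.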