Feeding multiple files into a TensorFlow Dataset

Question


use*_*109 4 tensorflow tensorflow-serving tensorflow-datasets tensorflow-estimator

I have the following input_fn:

def input_fn(filenames, batch_size):
    # Create a dataset containing the text lines.
    dataset = tf.data.TextLineDataset(filenames).skip(1)

    # Parse each line.
    dataset = dataset.map(_parse_line)

    # Shuffle, repeat, and batch the examples.
    dataset = dataset.shuffle(10000).repeat().batch(batch_size)

    # Return the dataset.
    return dataset


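It works fine with filenames=['file1.csv'] or with filenames=['file2.csv'], but it gives me an error with filenames=['file1.csv', 'file2.csv']. The TensorFlow documentation says that filenames is a tf.string tensor containing one or more filenames. How should I import multiple files?

Here is the error. It looks as if the .skip(1) in the input_fn above is being ignored: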

InvalidArgumentError: Field 0 in record 0 is not a valid int32: row_id
 [[Node: DecodeCSV = DecodeCSV[OUT_TYPE=[DT_INT32, DT_INT32, DT_FLOAT, DT_INT32, DT_FLOAT, ..., DT_INT32, DT_INT32, DT_INT32, DT_INT32, DT_INT32], field_delim=",", na_value="", use_quote_delim=true](arg0, DecodeCSV/record_defaults_0, DecodeCSV/record_defaults_1, DecodeCSV/record_defaults_2, DecodeCSV/record_defaults_3, DecodeCSV/record_defaults_4, DecodeCSV/record_defaults_5, DecodeCSV/record_defaults_6, DecodeCSV/record_defaults_7, DecodeCSV/record_defaults_8, DecodeCSV/record_defaults_9, DecodeCSV/record_defaults_10, DecodeCSV/record_defaults_11, DecodeCSV/record_defaults_12, DecodeCSV/record_defaults_13, DecodeCSV/record_defaults_14, DecodeCSV/record_defaults_15, DecodeCSV/record_defaults_16, DecodeCSV/record_defaults_17, DecodeCSV/record_defaults_18)]]
 [[Node: IteratorGetNext = IteratorGetNext[output_shapes=[[?], [?], [?], [?], [?], ..., [?], [?], [?], [?], [?]], output_types=[DT_FLOAT, DT_INT32, DT_INT32, DT_STRING, DT_STRING, ..., DT_INT32, DT_FLOAT, DT_INT32, DT_INT32, DT_INT32], _device="/job:localhost/replica:0/task:0/device:CPU:0"](Iterator)]]


Answer 1

mik*_*ola 7

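You have the right idea in using tf.data.TextLineDataset. However, your current implementation yields every line of every file in the input tensor of filenames, except the first line of the first file. The way you skip the first line only affects the first line of the first file; the first line of the second file is not skipped, so its header row reaches _parse_line and fails to decode.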
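The behavior described above can be illustrated with a pure-Python analogy. This is only a sketch and not TensorFlow code: the two lists are hypothetical stand-ins for the contents of file1.csv and file2.csv, chain plays the role of TextLineDataset reading several files as one stream, and islice plays the role of .skip(1).

```python
from itertools import chain, islice

# Hypothetical file contents standing in for file1.csv and file2.csv.
file1 = ["row_id,value", "1,0.5", "2,0.7"]
file2 = ["row_id,value", "3,0.9"]

# Analogous to TextLineDataset([file1, file2]): one continuous stream of lines.
all_lines = chain(file1, file2)

# Analogous to .skip(1): drops only the first element of the whole stream.
remaining = list(islice(all_lines, 1, None))

# The header of the second file is still present in the stream.
print(remaining)
```

The second header line survives, which is consistent with the "not a valid int32: row_id" error in the question.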

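Following the example in the Datasets guide, you should adapt your code to first create a regular Dataset from the filenames, and then run flat_map over each filename to read it with TextLineDataset, skipping the first line of each file: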

d = tf.data.Dataset.from_tensor_slices(filenames) 
# get dataset from each file, skipping first line of each file
d = d.flat_map(lambda filename: tf.data.TextLineDataset(filename).skip(1))
d = d.map(_parse_line) # And whatever else you need to do


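Here, flat_map creates a new dataset from each element of the original dataset by reading the file's contents and skipping its first line, and then concatenates those per-file datasets into a single dataset of lines.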
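Putting the pieces together, the input_fn from the question might look like the sketch below. It assumes TensorFlow 2.x with eager execution (the original question predates it), and the _parse_line shown here is a hypothetical parser for a two-column CSV, since the real one is not given in the question. The temporary files exist only to make the example runnable.

```python
import os
import tempfile

import tensorflow as tf


def _parse_line(line):
    # Hypothetical parser: assumes two columns, an int32 row_id and a float32 value.
    row_id, value = tf.io.decode_csv(line, record_defaults=[[0], [0.0]])
    return row_id, value


def input_fn(filenames, batch_size):
    # One dataset element per filename.
    dataset = tf.data.Dataset.from_tensor_slices(filenames)
    # Read each file separately and skip its own header line.
    dataset = dataset.flat_map(
        lambda filename: tf.data.TextLineDataset(filename).skip(1))
    # Parse each line, then shuffle, repeat, and batch as in the question.
    dataset = dataset.map(_parse_line)
    return dataset.shuffle(10000).repeat().batch(batch_size)


# Write two small CSV files, each with its own header row (illustrative data).
tmp_dir = tempfile.mkdtemp()
paths = []
for name, rows in [("file1.csv", ["1,0.5", "2,0.7"]), ("file2.csv", ["3,0.9"])]:
    path = os.path.join(tmp_dir, name)
    with open(path, "w") as f:
        f.write("row_id,value\n" + "\n".join(rows) + "\n")
    paths.append(path)

# Take one batch that covers all three data rows.
row_ids, values = next(iter(input_fn(paths, batch_size=3)))
print(sorted(row_ids.numpy().tolist()))
```

Because the header is skipped inside flat_map, every file loses its own first line before any parsing happens, regardless of how many filenames are passed in.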

Archived:	7 years, 11 months ago
Views:	2,676
Last activity:	7 years, 11 months ago