TFX StatisticsGen for image data

Question

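Hi, I'm trying to get a TFX pipeline working as an exercise. I'm using ImportExampleGen to load TFRecords from disk. Each Example in the TFRecord contains a JPG as a byte string, plus height, width, depth, steering, and throttle labels.

I'm trying to use StatisticsGen, but I get this warning: WARNING:root:Feature "image_raw" has bytes value "None" which cannot be decoded as a UTF-8 string. It then crashes my Colab notebook. As far as I can tell, none of the byte-string images in the TFRecord are corrupted.

I can't find any concrete examples of StatisticsGen handling image data. According to the documentation, TensorFlow Data Validation can handle image data:

In addition to computing a default set of data statistics, TFDV can also compute statistics for semantic domains (e.g., images, text). To enable computation of semantic domain statistics, pass a tfdv.StatsOptions object with enable_semantic_domain_stats set to True to tfdv.generate_statistics_from_tfrecord.

But I'm not sure how this fits in with StatisticsGen.

Here is the code that instantiates ImportExampleGen and then StatisticsGen: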

from tfx.utils.dsl_utils import tfrecord_input
from tfx.components import StatisticsGen
from tfx.components.example_gen.import_example_gen.component import ImportExampleGen
from tfx.proto import example_gen_pb2

examples = tfrecord_input(_tf_record_dir)
# https://www.tensorflow.org/tfx/guide/examplegen#custom_inputoutput_split
# has a good explanation of splitting the data via the 'output_config' param.

# Input: a single split covering _tf_record_dir/*
# Output: 2 splits, train:eval = 8:2.
train_ratio = 8
eval_ratio  = 10-train_ratio
output = example_gen_pb2.Output(
             split_config=example_gen_pb2.SplitConfig(splits=[
                 example_gen_pb2.SplitConfig.Split(name='train',
                                                   hash_buckets=train_ratio),
                 example_gen_pb2.SplitConfig.Split(name='eval',
                                                   hash_buckets=eval_ratio)
             ]))
example_gen = ImportExampleGen(input=examples,
                               output_config=output)
context.run(example_gen)

statistics_gen = StatisticsGen(
    examples=example_gen.outputs['examples'])
context.run(statistics_gen)
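
As a rough illustration of what the 8:2 hash_buckets setting in the code above means, here is a minimal sketch of hash-bucket splitting in plain Python. It is not TFX's implementation: the use of SHA-256, the assign_split helper, and the synthetic records are all assumptions made for the example. The idea is that each record is hashed into one of 10 buckets, with the first 8 going to train and the last 2 to eval.

```python
import hashlib

TRAIN_BUCKETS = 8
EVAL_BUCKETS = 2


def assign_split(record: bytes) -> str:
    """Map a serialized record to 'train' or 'eval' by hash bucket.

    Hypothetical helper: TFX uses its own fingerprint function, SHA-256 is
    only a stand-in here.
    """
    total = TRAIN_BUCKETS + EVAL_BUCKETS
    digest = hashlib.sha256(record).digest()
    bucket = int.from_bytes(digest[:8], "big") % total
    return "train" if bucket < TRAIN_BUCKETS else "eval"


# Synthetic stand-ins for serialized tf.Example records.
records = [f"example-{i}".encode() for i in range(10_000)]
counts = {"train": 0, "eval": 0}
for record in records:
    counts[assign_split(record)] += 1

print(counts)  # roughly 80% train, 20% eval
```

Because the assignment depends only on the record's bytes, the same record always lands in the same split across runs.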

Thanks in advance.

Answer 1

Jos*_*son 5

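From the response on the git issue, thanks to Evan Rosen:

Hi folks,

The warning you're seeing indicates that StatisticsGen is trying to treat your raw image feature as a categorical string feature. The image bytes are being decoded just fine. The problem is that when the statistics (including the top-K examples) are written, the output proto expects a valid UTF-8 string but gets the raw image bytes instead. As far as I can tell nothing is wrong with your setup; this is just an unintended side effect of a well-meaning warning for the case where you have a categorical string feature that can't be serialized. We'll look into a better default that handles image data more gracefully.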
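
The mismatch described above can be reproduced without TFX. The following is a minimal sketch in plain Python, not the code path StatisticsGen actually runs; is_valid_utf8 is a hypothetical helper and the four header bytes stand in for a real image. JPEG data begins with the byte 0xFF, which is never valid in UTF-8, so any attempt to store the raw bytes as a UTF-8 string fails even though the image itself is intact.

```python
def is_valid_utf8(value: bytes) -> bool:
    """Return True if the bytes can be decoded as a UTF-8 string."""
    try:
        value.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# A typical start of a JPEG file; a stand-in for the full image bytes.
jpeg_header = bytes([0xFF, 0xD8, 0xFF, 0xE0])

print(is_valid_utf8(b"left_turn"))  # True: an ordinary categorical string value
print(is_valid_utf8(jpeg_header))   # False: raw image bytes are not text
```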

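In the meantime, to tell StatisticsGen that this feature is really an opaque blob, you can pass in a user-modified schema as described in the StatsGen docs. To generate this schema, you can run StatisticsGen and SchemaGen once (on a sample of the data) and then modify the inferred schema to annotate that image feature. Here is a modified version of the colab from @tall-josh:

Open in Colab

The extra steps are a bit verbose, but having a curated schema is generally good practice for other reasons anyway. Here is the cell I added to the notebook: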

import os

from tfx.components import ImporterNode
from tfx.types import artifact_utils, standard_artifacts
from tfx.utils import io_utils

# Load autogenerated schema (using stats from small batch).
# `schema_gen` and `context` come from earlier notebook cells.
schema = io_utils.SchemaReader().read(
    io_utils.get_only_uri_in_dir(
        artifact_utils.get_single_uri(schema_gen.outputs['schema'].get())))

# Modify schema to indicate which string features are images.
# Ideally you would persist a golden version of this schema somewhere rather
# than regenerating it on every run.
# The name must match the feature in your data; the warning in the question
# reports it as "image_raw".
for feature in schema.feature:
  if feature.name == 'image_raw':
    feature.image_domain.SetInParent()

# Write modified schema to local file
user_schema_dir = '/tmp/user-schema/'
io_utils.write_pbtxt_file(
    os.path.join(user_schema_dir, 'schema.pbtxt'), schema)

# Create ImporterNode to make modified schema available to other components
user_schema_importer = ImporterNode(
    instance_name='import_user_schema',
    source_uri=user_schema_dir,
    artifact_type=standard_artifacts.Schema)

# Run the user schema ImporterNode
context.run(user_schema_importer)
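
The cell above stops once the edited schema has been imported. A minimal sketch of the remaining step, feeding that schema back into StatisticsGen, might look like the following. It assumes that the installed TFX version's StatisticsGen accepts a schema argument and that the importer exposes its artifact under the 'result' output key; check both against your version. build_statistics_gen is a hypothetical helper name, and the import is done lazily only so the definition loads without TFX installed.

```python
def build_statistics_gen(example_gen, user_schema_importer):
    """Create a StatisticsGen that uses the curated schema (requires tfx)."""
    # Imported inside the function so this definition can be loaded
    # even where TFX is not installed.
    from tfx.components import StatisticsGen

    return StatisticsGen(
        examples=example_gen.outputs['examples'],
        # Assumption: the importer's output channel is keyed 'result'.
        schema=user_schema_importer.outputs['result'])
```

In the notebook this would be used as statistics_gen = build_statistics_gen(example_gen, user_schema_importer), followed by context.run(statistics_gen).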

Hope you find this workaround useful. In the meantime, we'll look into a better default experience for image-valued features.

Archived:	6 years, 2 months ago
Views:	1033
Last activity:	6 years ago