How do I fit a tf.Dataset to a Keras autoencoder model when the dataset has been generated using TFX?

Question

JCh*_*ler 2 deep-learning keras tensorflow tfx

Issue

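As the title suggests, I have been trying to set up a pipeline that trains an autoencoder model using TFX. The problem I'm having is fitting the tf.Dataset returned by DataAccessor.tf_dataset_factory to the autoencoder.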

Below I summarise the steps I've taken through this project, and have some Questions at the bottom if you wish to skip the background information.

Intro

TFX Pipeline

The TFX components I have used so far have been:

CsvExampleGenerator (the dataset has 82 columns, all numeric, and the sample csv has 739 rows)
StatisticsGenerator / SchemaGenerator (the schema has been edited and is now loaded in using an Importer)
Transform
Trainer (this is the component I am currently having problems with)

Model

The model that I am attempting to train is based off of the example laid out here https://www.tensorflow.org/tutorials/generative/autoencoder. However, my model is being trained on tabular data, searching for anomalous results, as opposed to image data.

Because I have tried a couple of solutions, I have defined the model in two ways: by subclassing keras.Model and by composing keras.layers directly. Both are outlined below:

Subclassing Keras.Model

class Autoencoder(keras.models.Model):
    def __init__(self, features):
        super(Autoencoder, self).__init__()
        
        self.encoder = tf.keras.Sequential([
            keras.layers.Dense(82, activation = 'relu'),
            keras.layers.Dense(32, activation = 'relu'),
            keras.layers.Dense(16, activation = 'relu'),
            keras.layers.Dense(8, activation = 'relu')
        ])
        
        self.decoder = tf.keras.Sequential([
            keras.layers.Dense(16, activation = 'relu'),
            keras.layers.Dense(32, activation = 'relu'),
            keras.layers.Dense(len(features), activation = 'sigmoid')
        ])

    def call(self, x):
        inputs = [keras.layers.Input(shape = (1,), name = f) for f in features]
        dense = keras.layers.concatenate(inputs)
        
        encoded = self.encoder(dense)
        decoded = self.decoder(encoded)
    
        return decoded


Using Keras.Layers (functional API)

def _build_keras_model(features: List[str]) -> tf.keras.Model:
    inputs = [keras.layers.Input(shape = (1,), name = f) for f in features]
    dense = keras.layers.concatenate(inputs)

    dense = keras.layers.Dense(32, activation = 'relu')(dense)
    dense = keras.layers.Dense(16, activation = 'relu')(dense)
    dense = keras.layers.Dense(8, activation = 'relu')(dense)
    dense = keras.layers.Dense(16, activation = 'relu')(dense)
    dense = keras.layers.Dense(32, activation = 'relu')(dense)
    outputs = keras.layers.Dense(len(features), activation = 'sigmoid')(dense)
    
    model = keras.Model(inputs = inputs, outputs = outputs)
    model.compile(
        optimizer = 'adam',
        loss = 'mae'
    )

    return model


TFX Trainer Component

For creating the Trainer Component I have been mainly following the implementation details laid out here: https://www.tensorflow.org/tfx/guide/trainer

As well as following the default penguins example: https://www.tensorflow.org/tfx/tutorials/tfx/penguin_simple#write_model_training_code

run_fn definition

def run_fn(fn_args: tfx.components.FnArgs) -> None:
    tft_output = tft.TFTransformOutput(fn_args.transform_output)
    
    train_dataset = _input_fn(
        file_pattern = fn_args.train_files,
        data_accessor = fn_args.data_accessor,
        tf_transform_output = tft_output,
        batch_size = fn_args.train_steps
    )

    eval_dataset = _input_fn(
        file_pattern = fn_args.eval_files,
        data_accessor = fn_args.data_accessor,
        tf_transform_output = tft_output,
        batch_size = fn_args.custom_config['eval_batch_size']
    )

#   model = Autoencoder(
#       features = fn_args.custom_config['features']
#   )
    model = _build_keras_model(features = fn_args.custom_config['features'])
        
    model.compile(optimizer = 'adam', loss = 'mse')
    
    model.fit(
        train_dataset,
        steps_per_epoch = fn_args.train_steps,
        validation_data = eval_dataset,
        validation_steps = fn_args.eval_steps
    )
    
    ...


_input_fn definition

def _apply_preprocessing(raw_features, tft_layer):
    transformed_features = tft_layer(raw_features)
    return transformed_features

def _input_fn(
    file_pattern,
    data_accessor: tfx.components.DataAccessor,
    tf_transform_output: tft.TFTransformOutput,
    batch_size: int) -> tf.data.Dataset:
    """
    Generates features and label for tuning/training.
      Args:
        file_pattern: List of paths or patterns of input tfrecord files.
        data_accessor: DataAccessor for converting input to RecordBatch.
        tf_transform_output: A TFTransformOutput.
        batch_size: representing the number of consecutive elements of returned
          dataset to combine in a single batch
      Returns:
        A dataset that contains features where features is a
          dictionary of Tensors.
    """
    dataset = data_accessor.tf_dataset_factory(
        file_pattern,
        tfxio.TensorFlowDatasetOptions(batch_size = batch_size),
        tf_transform_output.transformed_metadata.schema
     )
    
    transform_layer = tf_transform_output.transform_features_layer()
    def apply_transform(raw_features):
        return _apply_preprocessing(raw_features, transform_layer)
    
    return dataset.map(apply_transform).repeat()


This differs from the _input_fn example given above as I was following the example in the next tfx tutorial found here: https://www.tensorflow.org/tfx/tutorials/tfx/penguin_tft#run_fn

Also for reference, there is no Target within the example data so there is no label_key to be passed to the tfxio.TensorFlowDatasetOptions object.

Error

When trying to run the Trainer component using a TFX InteractiveContext object I receive the following error.

ValueError: No gradients provided for any variable: ['dense_460/kernel:0', 'dense_460/bias:0', 'dense_461/kernel:0', 'dense_461/bias:0', 'dense_462/kernel:0', 'dense_462/bias:0', 'dense_463/kernel:0', 'dense_463/bias:0', 'dense_464/kernel:0', 'dense_464/bias:0', 'dense_465/kernel:0', 'dense_465/bias:0'].


From my own attempts to solve this I believe the problem lies in the way that an Autoencoder is trained. From the Autoencoder example linked here https://www.tensorflow.org/tutorials/generative/autoencoder the data is fitted like so:

autoencoder.fit(x_train, x_train,
                epochs=10,
                shuffle=True,
                validation_data=(x_test, x_test))


It therefore stands to reason that the tf.Dataset should mimic this behaviour. When testing with plain Tensor objects, I was able to recreate the error above, and then resolve it by also passing the training data as the target in the .fit() call.

Things I've Tried So Far

Duplicating Train Dataset

    model.fit(
        train_dataset,
        train_dataset,
        steps_per_epoch = fn_args.train_steps,
        validation_data = eval_dataset,
        validation_steps = fn_args.eval_steps
    )


Raises error due to Keras not accepting a 'y' value when a dataset is passed.

ValueError: `y` argument is not supported when using dataset as input.


Returning a dataset that is a tuple with itself

def _input_fn(...


    dataset = data_accessor.tf_dataset_factory(
        file_pattern,
        tfxio.TensorFlowDatasetOptions(batch_size = batch_size),
        tf_transform_output.transformed_metadata.schema
     )
    
    transform_layer = tf_transform_output.transform_features_layer()
    def apply_transform(raw_features):
        return _apply_preprocessing(raw_features, transform_layer)
    
    dataset = dataset.map(apply_transform)
    
    return dataset.map(lambda x: (x, x))


This raises an error where the keys from the features dictionary don't match the output of the model.

ValueError: Found unexpected keys that do not correspond to any Model output: dict_keys(['feature_string', ...]). Expected: ['dense_477']


At this point I switched to using the keras.model Autoencoder subclass and tried to add output keys to the Model using an output which I tried to create dynamically in the same way as the inputs.

    def call(self, x):
        inputs = [keras.layers.Input(shape = (1,), name = f) for f in x]
        dense = keras.layers.concatenate(inputs)
        
        encoded = self.encoder(dense)
        decoded = self.decoder(encoded)
    
        outputs = {}
        for feature_name in x:
            outputs[feature_name] = keras.layers.Dense(1, activation = 'sigmoid')(decoded)

        return outputs


This raises the following error:

TypeError: Cannot convert a symbolic Keras input/output to a numpy array. This error may indicate that you're trying to pass a symbolic value to a NumPy call, which is not supported. Or, you may be trying to pass Keras symbolic inputs/outputs to a TF API that does not register dispatching, preventing Keras from automatically converting the API call to a lambda layer in the Functional Model.

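One possible reading of this TypeError is that keras.layers.Input creates symbolic placeholders intended for building a functional model, whereas call() in a subclassed model receives the actual batch. The sketch below assumes the batch arrives as a dictionary of (batch, 1) float tensors keyed by feature name. The feature names, layer sizes and the feature_names attribute are hypothetical, and this is an illustration rather than a verified fix for the pipeline above.

```python
import tensorflow as tf


class Autoencoder(tf.keras.Model):
    def __init__(self, features):
        super().__init__()
        # Fixed column order used when stacking the per-feature tensors.
        self.feature_names = tuple(features)
        self.encoder = tf.keras.Sequential([
            tf.keras.layers.Dense(4, activation="relu"),
            tf.keras.layers.Dense(2, activation="relu"),
        ])
        self.decoder = tf.keras.Sequential([
            tf.keras.layers.Dense(4, activation="relu"),
            tf.keras.layers.Dense(len(self.feature_names), activation="sigmoid"),
        ])

    def call(self, inputs):
        # `inputs` holds real tensors, so no Input layers are created here.
        dense = tf.concat([inputs[name] for name in self.feature_names], axis=1)
        return self.decoder(self.encoder(dense))


features = ["a", "b", "c"]  # hypothetical feature names
batch = {name: tf.random.uniform(shape=(5, 1)) for name in features}

model = Autoencoder(features)
decoded = model(batch)  # one reconstructed column per feature
```

Because the model returns a single tensor, the training target would also need to be a single tensor with the same column order, not a dictionary.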

I've been looking into solving this issue but am no longer sure if the data is being passed correctly and am beginning to think I'm getting side-tracked from the actual problem.

Questions

Has anyone managed to get an Autoencoder working when connected via TFX examples?
Did you alter the tf.Dataset or handle the examples in a different way to the _input_fn demonstrated?

Answer 1

JCh*_*ler 5

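I managed to find the answer to this and want to leave what I found here in case anyone else runs into a similar problem.

It turns out my feeling about the error was correct: the solution did lie in how the tf.Dataset object was presented.

This can be demonstrated with some code that simulates the incoming data using randomly generated tensors.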

tensors = [tf.random.uniform(shape = (1, 82)) for i in range(739)]
# This gives us a list of 739 tensors which hold 1 value for 82 'features' simulating the dataset I had

dataset = tf.data.Dataset.from_tensor_slices(tensors)
dataset = dataset.map(lambda x : (x, x))
# This returns a dataset which marks the training set and target as the same
# which is what the Autoencoder model is looking for

model.fit(dataset ...)

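For anyone who wants to run the experiment above end to end, here is a minimal sketch. The small Sequential model, the reduced sizes (8 features, 32 rows) and the names NUM_FEATURES and NUM_ROWS are stand-ins chosen for illustration; the real data had 82 features and 739 rows.

```python
import tensorflow as tf

NUM_FEATURES = 8  # stand-in for the 82 columns in the real data
NUM_ROWS = 32     # stand-in for the 739 rows in the real data

# Each slice has shape (1, NUM_FEATURES), i.e. one row acting as a batch of one.
data = tf.random.uniform(shape=(NUM_ROWS, 1, NUM_FEATURES))

dataset = tf.data.Dataset.from_tensor_slices(data)
# Turn every element into an (inputs, targets) pair with identical halves.
dataset = dataset.map(lambda x: (x, x))

model = tf.keras.Sequential([
    tf.keras.layers.Dense(4, activation="relu"),
    tf.keras.layers.Dense(NUM_FEATURES, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="mse")

# No `y` argument: Keras reads the targets from the dataset elements.
history = model.fit(dataset, epochs=1, verbose=0)

inputs_spec, targets_spec = dataset.element_spec
```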

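Next, I moved on to doing the same with the dataset returned by _input_fn. The TFX DataAccessor object returns a features_dict, so I needed to combine the tensors in that dictionary into a single tensor.

This is what my _input_fn looks like now: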

def create_target_values(features_dict: Dict[str, tf.Tensor]) -> tuple:
    value_tensor = tf.concat(list(features_dict.values()), axis = 1)
    return (features_dict, value_tensor)

def _input_fn(
    file_pattern,
    data_accessor: tfx.components.DataAccessor,
    tf_transform_output: tft.TFTransformOutput,
    batch_size: int) -> tf.data.Dataset:
    """
    Generates features and label for tuning/training.
      Args:
        file_pattern: List of paths or patterns of input tfrecord files.
        data_accessor: DataAccessor for converting input to RecordBatch.
        tf_transform_output: A TFTransformOutput.
        batch_size: representing the number of consecutive elements of returned
          dataset to combine in a single batch
      Returns:
        A dataset that contains (features, target_tensor) tuple where features is a
          dictionary of Tensors, and target_tensor is a single Tensor that is a concatenated tensor of all the
          feature values.
    """
    dataset = data_accessor.tf_dataset_factory(
        file_pattern,
        tfxio.TensorFlowDatasetOptions(batch_size = batch_size),
        tf_transform_output.transformed_metadata.schema
    )
    
    dataset = dataset.map(lambda x: create_target_values(features_dict = x))
    return dataset.repeat()

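One thing worth checking with this approach: list(features_dict.values()) follows the dictionary's own key order, which may not match the order in which the model concatenates its inputs. The sketch below is a variant that takes an explicit feature_names list (a hypothetical extra argument, not part of the function above) so that target column i lines up with model output i. It assumes every feature is a dense (batch, 1) tensor, and it casts to float32 on the assumption that all columns are numeric.

```python
from typing import Dict, List, Tuple

import tensorflow as tf


def create_target_values(
    features_dict: Dict[str, tf.Tensor],
    feature_names: List[str],
) -> Tuple[Dict[str, tf.Tensor], tf.Tensor]:
    # Stack the columns in the caller's order, not the dict's own order.
    columns = [tf.cast(features_dict[name], tf.float32) for name in feature_names]
    return features_dict, tf.concat(columns, axis=1)


feature_names = ["a", "b", "c"]  # hypothetical feature names
batch = {  # keys deliberately out of order
    "c": tf.constant([[3.0], [30.0]]),
    "a": tf.constant([[1.0], [10.0]]),
    "b": tf.constant([[2.0], [20.0]]),
}

dataset = tf.data.Dataset.from_tensors(batch).map(
    lambda x: create_target_values(x, feature_names)
)
features, target = next(iter(dataset))
```

Inside _input_fn the same function could be applied with dataset.map(lambda x: create_target_values(x, feature_names)), passing the same list that was used to build the model's inputs.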
