Tags: nlp, deep-learning, keras, tensorflow, tensorflow-datasets
Honestly, I am trying to figure out how to convert a dataset (format: pandas DataFrame or numpy array) into a form that a simple text-classification tensorflow model can train on for sentiment analysis. The dataset I am using resembles IMDB: it contains text and labels (positive or negative). Every tutorial I have looked at either prepares the data differently or ignores data preparation entirely and leaves it to your imagination. (For example, all the IMDB tutorials import a preprocessed BatchDataset from tensorflow_datasets, which is no help when I am using my own dataset.) My own attempts to convert a pandas DataFrame to a Tensorflow Dataset have resulted in ValueErrors or a negative loss during training. Any help would be appreciated.
I originally prepared my data as follows, where training and validation are already-shuffled pandas DataFrames with text and label columns:
# IMPORT STUFF
from __future__ import absolute_import, division, print_function, unicode_literals
import tensorflow as tf # (I'm using tensorflow 2.0)
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.preprocessing.text import Tokenizer
import pandas as pd
import numpy as np
# ... [code for importing and preparing the pandas dataframe omitted]
# TOKENIZE
train_text = training['text'].to_numpy()
tok = Tokenizer(oov_token='<unk>')
tok.fit_on_texts(train_text)
tok.word_index['<pad>'] = 0
tok.index_word[0] = '<pad>'
train_seqs = tok.texts_to_sequences(train_text)
train_seqs = tf.keras.preprocessing.sequence.pad_sequences(train_seqs, padding='post')
train_labels = training['label'].to_numpy().flatten()
valid_text = validation['text'].to_numpy()
valid_seqs = tok.texts_to_sequences(valid_text)
valid_seqs = tf.keras.preprocessing.sequence.pad_sequences(valid_seqs, padding='post')
valid_labels = validation['label'].to_numpy().flatten()
# CONVERT TO TF DATASETS
train_ds = tf.data.Dataset.from_tensor_slices((train_seqs, train_labels))
valid_ds = tf.data.Dataset.from_tensor_slices((valid_seqs, valid_labels))
# BUFFER_SIZE and BATCH_SIZE are defined in the setup code omitted above
train_ds = train_ds.shuffle(BUFFER_SIZE).batch(BATCH_SIZE)
valid_ds = valid_ds.batch(BATCH_SIZE)
# PREFETCH
train_ds = train_ds.prefetch(buffer_size=tf.data.experimental.AUTOTUNE)
valid_ds = valid_ds.prefetch(buffer_size=tf.data.experimental.AUTOTUNE)
This leaves train_ds and valid_ds tokenized and of type PrefetchDataset, specifically <PrefetchDataset shapes: ((None, None, None, 118), (None, None, None)), types: (tf.int32, tf.int64)>.
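For reference, this is how I inspect what the pipeline actually yields (a minimal check; element_spec is available on any tf.data.Dataset in TF 2.0):
print(train_ds.element_spec)  # the (features, labels) spec pair reported above
# pull one concrete batch to see the real shapes and dtypes
for seqs, labels in train_ds.take(1):
    print(seqs.shape, seqs.dtype)      # I would expect (BATCH_SIZE, 118), tf.int32
    print(labels.shape, labels.dtype)  # I would expect (BATCH_SIZE,), tf.int64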
I then trained as follows, but got a large negative loss and an accuracy of 0:
model = keras.Sequential([
    layers.Embedding(vocab_size, embedding_dim),
    layers.GlobalAveragePooling1D(),
    layers.Dense(1, activation='sigmoid')  # also tried activation='softmax'
])

model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

history = model.fit(
    train_ds,
    epochs=1,
    validation_data=valid_ds,
    validation_steps=1,
    steps_per_epoch=BUFFER_SIZE)
If I skip the fancy prefetch step, train_ds is instead of type BatchDataset, specifically <BatchDataset shapes: ((None, 118), (None,)), types: (tf.int32, tf.int64)>, but that also gives me a negative loss and an accuracy of 0.
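Side note: as far as I understand it, binary_crossentropy is non-negative whenever the targets lie in [0, 1], so a large negative loss usually points at the label values themselves. A minimal sanity check on my labels:
# binary_crossentropy assumes targets in [0, 1]; values outside that range
# (e.g. labels encoded as 1-5 or as -1/1) can drive the loss negative
print(np.unique(train_labels))  # should be exactly [0 1] for binary sentiment
print(np.unique(valid_labels))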
If I simply do the following:
x, y = training['text'].to_numpy(), training['label'].to_numpy()
x, y = tf.convert_to_tensor(x), tf.convert_to_tensor(y)
then x and y are of type EagerTensor, but I cannot figure out how to batch an EagerTensor.
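The only workaround I know of is to wrap the tensors back into a tf.data pipeline and batch there; a sketch (note that x still holds raw strings here, not token ids, so this would only suit a model that does its own text preprocessing):
# EagerTensors can be re-wrapped as a dataset and batched from there
xy_ds = tf.data.Dataset.from_tensor_slices((x, y)).batch(32)
print(xy_ds.element_spec)  # ((None,) tf.string, (None,) tf.int64)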
What type and shape does train_ds actually need to be? What am I missing or doing wrong?
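For what it's worth, my understanding is that fit() wants a dataset yielding (features, labels) pairs shaped (batch, seq_len) and (batch,). Here is a dummy-data smoke test against the model above (assuming vocab_size is at least 1000):
# synthetic int sequences of shape (batch, seq_len) plus 0/1 labels of
# shape (batch,) should produce a positive, finite loss if the model is sound
dummy_x = np.random.randint(1, 1000, size=(64, 118))            # fake token ids
dummy_y = np.random.randint(0, 2, size=(64,)).astype(np.int64)  # 0/1 labels
dummy_ds = tf.data.Dataset.from_tensor_slices((dummy_x, dummy_y)).batch(8)
model.fit(dummy_ds, epochs=1)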
The text_classification_with_hub tutorial trains on an already-prepared IMDB dataset as follows:
model = tf.keras.Sequential()
model.add(hub_layer)
model.add(tf.keras.layers.Dense(16, activation='relu'))
model.add(tf.keras.layers.Dense(1, activation='sigmoid'))

model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

history = model.fit(train_data.shuffle(10000).batch(512),
                    epochs=20,
                    validation_data=validation_data.batch(512),
                    verbose=1)
In this example, train_data is of type tensorflow.python.data.ops.dataset_ops._OptionsDataset, and train_data.shuffle(10000).batch(512) is a tensorflow.python.data.ops.dataset_ops.BatchDataset (specifically <BatchDataset shapes: ((None,), (None,)), types: (tf.string, tf.int64)>).
They clearly don't bother tokenizing this dataset, yet I suspect tokenization is my problem. Why does their train_data.shuffle(10000).batch(512) train fine while my train_ds does not?
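One difference I can see is the element types: their pipeline yields raw strings that hub_layer vectorizes itself, while mine yields already-tokenized int ids for the Embedding layer. Comparing the specs side by side (assuming both datasets are in scope):
# theirs: raw strings in, hub_layer tokenizes internally
print(train_data.batch(512).element_spec)
# mine: pre-tokenized int ids meant for the Embedding layer
print(train_ds.element_spec)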
The problem may lie in the model setup, the Embedding layer, or the tokenization, but I'm not sure which. I have looked at the following tutorials for inspiration:
https://www.tensorflow.org/tutorials/keras/text_classification_with_hub
https://www.kaggle.com/drscarlat/imdb-sentiment-analysis-keras-and-tensorflow
https://www.tensorflow.org/tutorials/text/image_captioning
https://www.tensorflow.org/tutorials/text/word_embeddings#learning_embeddings_from_scratch