Creating a TensorFlow dataset for multiple time series

Myk*_*tko 7 python time-series deep-learning tensorflow tensorflow-datasets

I have data for multiple time series, as shown below:

df = pd.DataFrame({'Time': np.tile(np.arange(5), 2),
                   'Object': np.concatenate([[i] * 5 for i in [1, 2]]),
                   'Feature1': np.random.randint(10, size=10),
                   'Feature2': np.random.randint(10, size=10)})

   Time  Object  Feature1  Feature2
0     0       1         3         3
1     1       1         9         2
2     2       1         6         6
3     3       1         4         0
4     4       1         7         7
5     0       2         4         8
6     1       2         3         7
7     2       2         1         1
8     3       2         7         5
9     4       2         1         7

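Each object (1 and 2) has its own data (the real data contains about 2,000 objects). I want to feed this data to an RNN/LSTM in chunks, using tf.data.Dataset.window in such a way that data from different objects never ends up in the same window, as in the example below: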

dataset = tf.data.Dataset.from_tensor_slices(df)

for w in dataset.window(3, shift=1, drop_remainder=True):
  print(list(w.as_numpy_iterator()))

Output:

[array([0, 1, 3, 3]), array([1, 1, 9, 2]), array([2, 1, 6, 6])]
[array([1, 1, 9, 2]), array([2, 1, 6, 6]), array([3, 1, 4, 0])]
[array([2, 1, 6, 6]), array([3, 1, 4, 0]), array([4, 1, 7, 7])]
[array([3, 1, 4, 0]), array([4, 1, 7, 7]), array([0, 2, 4, 8])] # Mixed data from both objects
[array([4, 1, 7, 7]), array([0, 2, 4, 8]), array([1, 2, 3, 7])] # Mixed data from both objects
[array([0, 2, 4, 8]), array([1, 2, 3, 7]), array([2, 2, 1, 1])]
[array([1, 2, 3, 7]), array([2, 2, 1, 1]), array([3, 2, 7, 5])]
[array([2, 2, 1, 1]), array([3, 2, 7, 5]), array([4, 2, 1, 7])]

Expected output:

[array([0, 1, 3, 3]), array([1, 1, 9, 2]), array([2, 1, 6, 6])]
[array([1, 1, 9, 2]), array([2, 1, 6, 6]), array([3, 1, 4, 0])]
[array([2, 1, 6, 6]), array([3, 1, 4, 0]), array([4, 1, 7, 7])]
[array([0, 2, 4, 8]), array([1, 2, 3, 7]), array([2, 2, 1, 1])]
[array([1, 2, 3, 7]), array([2, 2, 1, 1]), array([3, 2, 7, 5])]
[array([2, 2, 1, 1]), array([3, 2, 7, 5]), array([4, 2, 1, 7])]

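Maybe there is another way to do this. The main requirement is that my model should see that the non-mixed chunks of data come from different objects (perhaps via an embedding).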

Alo*_*her 3

Hmm, maybe just create two separate dataframes and concatenate them after windowing. That way, you won't have any overlap:

import tensorflow as tf
import pandas as pd
import numpy as np


df = pd.DataFrame({'Time': np.tile(np.arange(5), 2),
                   'Object': np.concatenate([[i] * 5 for i in [1, 2]]),
                   'Feature1': np.random.randint(10, size=10),
                   'Feature2': np.random.randint(10, size=10)})

df1 = df[df['Object'] == 1]
df2 = df[df['Object'] == 2]

# Window each object's rows separately, then join the two windowed datasets.
ds1 = tf.data.Dataset.from_tensor_slices(df1).window(3, shift=1, drop_remainder=True)
ds2 = tf.data.Dataset.from_tensor_slices(df2).window(3, shift=1, drop_remainder=True)
dataset = ds1.concatenate(ds2)

for w in dataset:
  print(list(w.as_numpy_iterator()))

Output:

[array([0, 1, 3, 3]), array([1, 1, 9, 2]), array([2, 1, 6, 6])]
[array([1, 1, 9, 2]), array([2, 1, 6, 6]), array([3, 1, 4, 0])]
[array([2, 1, 6, 6]), array([3, 1, 4, 0]), array([4, 1, 7, 7])]
[array([0, 2, 4, 8]), array([1, 2, 3, 7]), array([2, 2, 1, 1])]
[array([1, 2, 3, 7]), array([2, 2, 1, 1]), array([3, 2, 7, 5])]
[array([2, 2, 1, 1]), array([3, 2, 7, 5]), array([4, 2, 1, 7])]
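
With about 2,000 objects, writing one dataframe per object by hand does not scale. As a rough sketch of the same idea, the windows can be built per group with pandas and NumPy before anything is handed to TensorFlow. The helper name `windows_per_object` and the fixed feature values are made up for illustration; the values simply copy the table from the question.

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view

# Fixed copy of the example table from the question (values are illustrative).
df = pd.DataFrame({
    'Time': np.tile(np.arange(5), 2),
    'Object': np.repeat([1, 2], 5),
    'Feature1': [3, 9, 6, 4, 7, 4, 3, 1, 7, 1],
    'Feature2': [3, 2, 6, 0, 7, 8, 7, 1, 5, 7],
})


def windows_per_object(frame, size, key='Object'):
    """Hypothetical helper: stack sliding windows built separately per object.

    Returns an array of shape (n_windows, size, n_columns).
    Assumes at least one object has `size` or more rows.
    """
    chunks = []
    for _, group in frame.groupby(key, sort=False):
        values = group.to_numpy()
        if len(values) < size:
            continue  # too short to form a single full window
        # sliding_window_view appends the window axis last: (n, n_columns, size)
        view = sliding_window_view(values, size, axis=0)
        # reorder to (n, size, n_columns) so each window is a block of rows
        chunks.append(view.transpose(0, 2, 1))
    return np.concatenate(chunks, axis=0)


windows = windows_per_object(df, size=3)
print(windows.shape)
```

The stacked array could then be passed to tf.data.Dataset.from_tensor_slices(windows), which yields one (3, 4) window per element. Note that this materializes all windows in memory, which may or may not be acceptable for the real data.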

Update 1

Another approach is to use tf.data.Dataset.filter like this:

import tensorflow as tf
import pandas as pd
import numpy as np

df = pd.DataFrame({'Time': np.tile(np.arange(5), 2),
                   'Object': np.concatenate([[i] * 5 for i in [1, 2]]),
                   'Feature1': np.random.randint(10, size=10),
                   'Feature2': np.random.randint(10, size=10)})

objects = df['Object'].unique()
dataset = tf.data.Dataset.from_tensor_slices(df)
new_dataset = None

for o in objects:
  temp_dataset = dataset.filter(lambda x: tf.math.equal(x[1], tf.constant(o))).window(3, shift=1, drop_remainder=True)
  if new_dataset is not None:
    new_dataset = new_dataset.concatenate(temp_dataset)
  else:
    new_dataset = temp_dataset

for w in new_dataset:
  print(list(w.as_numpy_iterator()))
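
If every object has the same number of time steps, the Python loop can probably be avoided altogether by stacking the series into one array of shape (objects, time steps, columns) and windowing each series inside flat_map. This is only a sketch under that equal-length assumption; the array values copy the question's table, and `make_windows` and `WINDOW` are names chosen here for illustration. It also batches each window, so the elements are plain tensors rather than nested datasets.

```python
import numpy as np
import tensorflow as tf

# Illustrative data: 2 objects x 5 time steps x 4 columns
# (Time, Object, Feature1, Feature2), copied from the question's table.
series = np.array([
    [[0, 1, 3, 3], [1, 1, 9, 2], [2, 1, 6, 6], [3, 1, 4, 0], [4, 1, 7, 7]],
    [[0, 2, 4, 8], [1, 2, 3, 7], [2, 2, 1, 1], [3, 2, 7, 5], [4, 2, 1, 7]],
])

WINDOW = 3  # window length, as in the question


def make_windows(one_series):
    """Window a single object's series; windows cannot cross into another object."""
    ds = tf.data.Dataset.from_tensor_slices(one_series)
    ds = ds.window(WINDOW, shift=1, drop_remainder=True)
    # each window is itself a small dataset; batch it into one (WINDOW, 4) tensor
    return ds.flat_map(lambda w: w.batch(WINDOW, drop_remainder=True))


# The outer dataset has one element per object; flat_map windows each in turn.
dataset = tf.data.Dataset.from_tensor_slices(series).flat_map(make_windows)

result = [w.tolist() for w in dataset.as_numpy_iterator()]
for w in result:
    print(w)
```

For series of different lengths this stacking does not work as written, and the filter-based loop above (or a generator-based dataset) would be the fallback.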

Update 2: Another option is to exclude/drop the overlapping sequences. This gives you the flexibility to decide how to handle the overlap:

import tensorflow as tf
import pandas as pd
import numpy as np


df = pd.DataFrame({'Time': np.tile(np.arange(5), 2),
                   'Object': np.concatenate([[i] * 5 for i in [1, 2]]),
                   'Feature1': np.random.randint(10, size=10),
                   'Feature2': np.random.randint(10, size=10)})

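dataset = (
    tf.data.Dataset.from_tensor_slices(df)
    .window(3, shift=1, drop_remainder=True)
    .flat_map(lambda x: x.batch(3))  # turn each window into a (3, 4) tensor
    # keep a window only if all of its rows share the same 'Object' value (column 1)
    .filter(lambda y: tf.reduce_all(tf.unique(y[..., 1])[1] == 0))
)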

for w in dataset:
  print(w)
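
To make the filter condition a bit easier to follow, here is the same predicate pulled out into a small function and applied to one clean window and one mixed window from the question's output. The function name `is_single_object` is just for illustration. tf.unique returns the unique values together with an index for each element, and that index is all zeros exactly when every element equals the first one.

```python
import tensorflow as tf


def is_single_object(window, object_col=1):
    """Return a scalar bool tensor: True when all rows share one object id."""
    ids = window[..., object_col]
    # idx[i] is the position of ids[i] in the list of unique values,
    # so it is 0 everywhere only if all ids equal the first id
    _, idx = tf.unique(ids)
    return tf.reduce_all(idx == 0)


# One window taken entirely from object 1, and one that straddles both objects.
pure = tf.constant([[2, 1, 6, 6], [3, 1, 4, 0], [4, 1, 7, 7]])
mixed = tf.constant([[3, 1, 4, 0], [4, 1, 7, 7], [0, 2, 4, 8]])

pure_ok = bool(is_single_object(pure))
mixed_ok = bool(is_single_object(mixed))
print(pure_ok, mixed_ok)
```

The mixed window is rejected and the clean one is kept, which is what removes the two overlapping windows from the dataset.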