eli*_*A92 6 python time-series dataframe pandas
I have to work with time-series data imported from several CSV files, as shown below:
import pandas as pd
csv_a = [["Sensor_1", '2019-05-25 10:00', 25, 60],
["Sensor_2", '2019-05-25 10:00', 30, 45],
["Sensor_1", '2019-05-25 10:05', 26, None],
["Sensor_2", '2019-05-25 10:05', 30, 46],
["Sensor_1", '2019-05-25 10:10', 27, 63],
["Sensor_1", '2019-05-25 10:20', 28, 62]]
df_a = pd.DataFrame(csv_a, columns=["Sensor", "Timestamp", "Temperature", "Humidity"])
df_a["Timestamp"] = (pd.to_datetime(df_a["Timestamp"]))
csv_b = [["Sensor_1", '2019-05-25 10:05', 1020],
["Sensor_2", '2019-05-25 10:05', 956],
["Sensor_3", '2019-05-25 10:05', 990],
["Sensor_1", '2019-05-25 10:10', 1021],
["Sensor_2", '2019-05-25 10:10', 957],
["Sensor_3", '2019-05-25 10:10', 992],
["Sensor_1", '2019-05-25 10:15', 1019]]
df_b = pd.DataFrame(csv_b, columns=["Sensor", "Timestamp", "Pressure"])
df_b["Timestamp"] = (pd.to_datetime(df_b["Timestamp"]))
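As you can see, we have 3 sensors. Each sensor has its own time series measuring temperature, humidity, and pressure. However, the data is split across two CSV files and may contain many gaps.
The goal is to merge all the data into a single ordered, regular DataFrame like this: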
              Timestamp    Sensor  Temperature  Humidity  Pressure
 0  2019-05-25 10:00:00  Sensor_1         25.0      60.0       NaN
 1  2019-05-25 10:00:00  Sensor_2         30.0      45.0       NaN
 2  2019-05-25 10:00:00  Sensor_3          NaN       NaN       NaN
 3  2019-05-25 10:05:00  Sensor_1         26.0       NaN    1020.0
 4  2019-05-25 10:05:00  Sensor_2         30.0      46.0     956.0
 5  2019-05-25 10:05:00  Sensor_3          NaN       NaN     990.0
 6  2019-05-25 10:10:00  Sensor_1         27.0      63.0    1021.0
 7  2019-05-25 10:10:00  Sensor_2          NaN       NaN     957.0
 8  2019-05-25 10:10:00  Sensor_3          NaN       NaN     992.0
 9  2019-05-25 10:15:00  Sensor_1          NaN       NaN    1019.0
10  2019-05-25 10:15:00  Sensor_2          NaN       NaN       NaN
11  2019-05-25 10:15:00  Sensor_3          NaN       NaN       NaN
12  2019-05-25 10:20:00  Sensor_1         28.0      62.0       NaN
13  2019-05-25 10:20:00  Sensor_2          NaN       NaN       NaN
14  2019-05-25 10:20:00  Sensor_3          NaN       NaN       NaN
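The logic is that, taken together, the data in the CSVs starts at 10:00 and ends at 10:20, and there are 3 possible variables for 3 different sensors. So I want the first two columns (Timestamp and Sensor) to be regular, ordered, and gap-free, while the remaining columns (Temperature, Humidity, and Pressure) are filled with CSV data wherever it is available.
I have tried several different approaches with the pandas merge functions, but I cannot get the result I want. I hope someone with more experience can help.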
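First, join both DataFrames with concat after calling DataFrame.set_index on each, so that Timestamp and Sensor form a MultiIndex. Because the same (Timestamp, Sensor) pair can appear in both files, aggregate with sum so that every pair becomes a single row.
Then use DataFrame.reindex with a MultiIndex.from_product built from a date_range between the minimum and maximum timestamps to add the missing rows: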
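# Stack both frames on a shared (Timestamp, Sensor) index.
# sum(level=...) was removed in pandas 2.0, so group by the index levels instead.
# min_count=1 keeps NaN where a group has no valid value (plain sum would give 0).
df = (pd.concat([df_a.set_index(['Timestamp','Sensor']),
                 df_b.set_index(['Timestamp','Sensor'])], sort=True)
        .groupby(level=[0,1])
        .sum(min_count=1))
d = df.index.get_level_values(0)
# Full grid: every 5-minute timestamp combined with every sensor.
mux = pd.MultiIndex.from_product([pd.date_range(d.min(), d.max(), freq='5min'),
                                  df.index.get_level_values(1).unique()],
                                 names=df.index.names)
df = df.reindex(mux).reset_index()
print(df)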
              Timestamp    Sensor  Humidity  Pressure  Temperature
 0  2019-05-25 10:00:00  Sensor_1      60.0       NaN         25.0
 1  2019-05-25 10:00:00  Sensor_2      45.0       NaN         30.0
 2  2019-05-25 10:00:00  Sensor_3       NaN       NaN          NaN
 3  2019-05-25 10:05:00  Sensor_1       NaN    1020.0         26.0
 4  2019-05-25 10:05:00  Sensor_2      46.0     956.0         30.0
 5  2019-05-25 10:05:00  Sensor_3       NaN     990.0          NaN
 6  2019-05-25 10:10:00  Sensor_1      63.0    1021.0         27.0
 7  2019-05-25 10:10:00  Sensor_2       NaN     957.0          NaN
 8  2019-05-25 10:10:00  Sensor_3       NaN     992.0          NaN
 9  2019-05-25 10:15:00  Sensor_1       NaN    1019.0          NaN
10  2019-05-25 10:15:00  Sensor_2       NaN       NaN          NaN
11  2019-05-25 10:15:00  Sensor_3       NaN       NaN          NaN
12  2019-05-25 10:20:00  Sensor_1      62.0       NaN         28.0
13  2019-05-25 10:20:00  Sensor_2       NaN       NaN          NaN
14  2019-05-25 10:20:00  Sensor_3       NaN       NaN          NaN
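The output above has its value columns in alphabetical order (a side effect of sort=True), while the question asks for Temperature, Humidity, Pressure. The following is a self-contained sketch of the same concat and reindex approach with the columns reordered at the end. It assumes a recent pandas version, so it uses groupby(level=...).sum(min_count=1) for the aggregation step; the names keys, stacked, combined, grid, and result are illustrative and not part of the original answer.

```python
import pandas as pd

csv_a = [["Sensor_1", "2019-05-25 10:00", 25, 60],
         ["Sensor_2", "2019-05-25 10:00", 30, 45],
         ["Sensor_1", "2019-05-25 10:05", 26, None],
         ["Sensor_2", "2019-05-25 10:05", 30, 46],
         ["Sensor_1", "2019-05-25 10:10", 27, 63],
         ["Sensor_1", "2019-05-25 10:20", 28, 62]]
csv_b = [["Sensor_1", "2019-05-25 10:05", 1020],
         ["Sensor_2", "2019-05-25 10:05", 956],
         ["Sensor_3", "2019-05-25 10:05", 990],
         ["Sensor_1", "2019-05-25 10:10", 1021],
         ["Sensor_2", "2019-05-25 10:10", 957],
         ["Sensor_3", "2019-05-25 10:10", 992],
         ["Sensor_1", "2019-05-25 10:15", 1019]]

df_a = pd.DataFrame(csv_a, columns=["Sensor", "Timestamp", "Temperature", "Humidity"])
df_b = pd.DataFrame(csv_b, columns=["Sensor", "Timestamp", "Pressure"])
for frame in (df_a, df_b):
    frame["Timestamp"] = pd.to_datetime(frame["Timestamp"])

keys = ["Timestamp", "Sensor"]  # illustrative name for the shared index columns

# Stack both frames on the shared index; columns missing from a frame become NaN.
stacked = pd.concat([df_a.set_index(keys), df_b.set_index(keys)], sort=True)

# Collapse duplicate (Timestamp, Sensor) pairs into one row each.
# min_count=1 keeps NaN when a group has no valid value for a column.
combined = stacked.groupby(level=keys).sum(min_count=1)

# Regular grid: every 5-minute step between the first and last timestamp,
# crossed with every sensor seen in either file.
times = combined.index.get_level_values("Timestamp")
grid = pd.MultiIndex.from_product(
    [pd.date_range(times.min(), times.max(), freq="5min"),
     combined.index.get_level_values("Sensor").unique()],
    names=keys,
)

# Reindex onto the grid, then restore the column order asked for in the question.
result = combined.reindex(grid).reset_index()
result = result[["Timestamp", "Sensor", "Temperature", "Humidity", "Pressure"]]
print(result)
```

The 5-minute frequency is taken from the sample data; if the real CSVs use a different sampling interval, the freq argument would need to change accordingly.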