rey*_*n64 3 python datetime apply pandas
Based on this sample data:
data = """value
"2020-03-02" 2
"2020-03-03" 4
"2020-03-01" 3
"2020-03-04" 0
"2020-03-08" 0
"2020-03-06" 0
"2020-03-07" 2"""
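I want to:
sort value by date, using the dates as a datetime index;
for each value i, compute a new cumulative column value_cum;
for each vc{i} (i from 0 to n) of value_cum, cut the series to the earlier values vc'{j} (j from 0 to i) and look for the dates where vc{i} / vc'{j} >= 2.
Finally, for each day, I get the delta between the actual date and the date that maximizes the predicate. For this data, I get: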
value value_cum computeValue delta
2020-03-01 3 3 NaN NaN
2020-03-02 2 5 NaN NaN
2020-03-03 4 9 3.0 2.0
2020-03-04 0 9 3.0 2.0
2020-03-06 0 9 3.0 2.0
2020-03-07 2 11 2.2 5.0
2020-03-08 0 11 2.2 5.0
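Edit: more context here
In fact, this is code to find the first doubling rate of cumulative Covid-19 deaths:
value is my daily deaths and value_cum is the cumulative deaths day after day. For each day, I search the existing series for the point where the cumulative deaths have been multiplied by 2. That is why I cut the series: to compute my ratio, I only need the n dates/rows before the actual date (the past days) that I want to test.
I found this calculation in the Our World in Data charts about COVID-19, but I want to compute this indicator for one country and for every day, not only for the last day as shown in the chart :)
For example, for the date 2020-03-04, I only need to compute the ratio between 2020-03-04 and 2020-03-01 / 02 / 03 to find the first date where the ratio is >= 2.
In this example, 2020-03-04 has no more deaths than 2020-03-03, so we don't want to compute a new delta (the number of days before deaths are multiplied by >= 2 is the same as for 2020-03-03!). I explain this in Edit 1/2, archived at the end of this post.
We use a dictionary to store the first occurrence of each cumulative value, so when I see a value_cum equal to the previous one, I look it up in the dictionary to get the correct date (9 returns 2020-03-03) for the ratio computation.
Here is my actual working code to do that: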
import io

import numpy as np
import pandas as pd
from dfply import *

data = """value
"2020-03-02" 2
"2020-03-03" 4
"2020-03-01" 3
"2020-03-04" 0
"2020-03-08" 0
"2020-03-06" 0
"2020-03-07" 2"""

# The header has one column fewer than the rows, so the dates become the index.
# sep=r"\s+" replaces the deprecated delim_whitespace=True.
df = pd.read_table(io.StringIO(data), sep=r"\s+")
df.index = pd.to_datetime(df.index)


def f(x, **kwargs):
    # get numerical index of row
    numericIndex = kwargs["df"].index.get_loc(x.name)
    dict_inverted = kwargs["dict"]

    # Skip the first line, returning NaN (np.NaN was removed in NumPy 2.0)
    if numericIndex == 0:
        return np.nan, np.nan

    # If value_cum is the same as in the previous row (nothing changed),
    # compute from the first date that reached this value to return the same data
    ilocvalue = kwargs["df"]["value_cum"].iloc[numericIndex - 1]
    if x['value_cum'] == ilocvalue:
        name = dict_inverted[x['value_cum']]
    else:
        name = x.name

    # Series to compare with actual row
    series = kwargs["value_cum"]
    # Cut this series by taking into account only the days before the actual date
    cutedSeries = series[series.index < name]
    rowValueToCompare = float(x['value_cum'])

    # Use query to filter rows
    # /sf/ask/2812004891/
    result = cutedSeries.to_frame().query(f'({rowValueToCompare} / value_cum) >= 2.0')

    # If empty return NaN
    if result.empty:
        return np.nan, np.nan

    # Get the last result
    oneResult = result.tail(1).iloc[:, 0]
    # Compute values to return
    value = (rowValueToCompare/oneResult.values[0])
    idx = oneResult.index[0]
    # Delta between the actual row day, and the >=2 day
    delta = name - idx
    # return columns
    return value, delta.days


df_cases = df >> arrange(X.index, ascending=True) \
    >> mutate(value_cum=cumsum(X.value))

# Map each cumulative value to the first date on which it was reached
df_map_value = df_cases.drop_duplicates(["value_cum"])
dict_value = df_map_value["value_cum"].to_dict()
dict_value_inverted = {v: k for k, v in dict_value.items()}
print(dict_value_inverted)

df_cases[["computeValue", "delta"]] = df_cases.apply(
    f,
    result_type="expand",
    dict=dict_value_inverted,
    df=df_cases,
    value_cum=df_cases['value_cum'],
    axis=1,
)
print(df_cases)
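I'm not very happy with this code; I find it strange to pass the whole DataFrame to my apply method.
I'm sure there is better code in pandas to do this in fewer lines and more elegantly, perhaps using nested apply methods, but I haven't found how.
The dictionary approach to store the first date of each repeated value is also strange. I don't know whether this can be done with apply (reusing a previously computed result during the apply) or whether the only way is to write a recursive function.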
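One possible way to avoid both the dictionary and passing the DataFrame into apply is sketched below. It is not the approach used in this post: it assumes value is never negative, so value_cum is non-decreasing and a binary search (np.searchsorted) can locate both the latest earlier date satisfying the ratio and the first date on which the current cumulative value was reached. The function name doubling_table and the factor parameter are hypothetical.

```python
import numpy as np
import pandas as pd


def doubling_table(value: pd.Series, factor: float = 2.0) -> pd.DataFrame:
    """Hypothetical helper: ratio and day delta to the latest earlier date
    whose cumulative value is at most value_cum / factor."""
    out = value.sort_index().to_frame("value")
    out["value_cum"] = out["value"].cumsum()
    cum = out["value_cum"].to_numpy(dtype=float)
    dates = out.index

    # value_cum is non-decreasing (assumption: value >= 0), so a binary search
    # finds the latest position j with cum[j] <= cum[i] / factor.
    j = np.searchsorted(cum, cum / factor, side="right") - 1
    # First position at which cum[i] was reached (handles repeated values).
    first = np.searchsorted(cum, cum, side="left")

    j_safe = np.clip(j, 0, None)
    # A match needs an existing earlier row with a positive cumulative value.
    found = (j >= 0) & (cum[j_safe] > 0)

    ratio = np.full(len(cum), np.nan)
    ratio[found] = cum[found] / cum[j_safe[found]]
    days = np.asarray((dates[first] - dates[j_safe]).days, dtype=float)

    out["computeValue"] = ratio
    out["delta"] = np.where(found, days, np.nan)
    return out


value = pd.Series(
    [2, 4, 3, 0, 0, 0, 2],
    index=pd.to_datetime([
        "2020-03-02", "2020-03-03", "2020-03-01", "2020-03-04",
        "2020-03-08", "2020-03-06", "2020-03-07",
    ]),
)
result = doubling_table(value)
print(result)
```

On the sample data this should reproduce the computeValue and delta columns of the table above; rows whose only candidates have a cumulative value of zero are left as NaN, which is a design choice of this sketch.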
Question updated (edits 1/2/3) to handle repeated values
Edit archive
Edit 1:
data = """value
"2020-03-02" 1
"2020-03-03" 0
"2020-03-01" 1
"2020-03-04" 0
"2020-03-05" 4"""
I see that my code does not handle the case where value equals zero.
value value_cum computeValue delta
2020-03-01 1 1 NaN NaN
2020-03-02 1 2 2.0 1.0
2020-03-03 0 2 2.0 2.0
2020-03-04 0 2 2.0 3.0
2020-03-05 4 6 3.0 1.0
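For 2020-03-03, computeValue equals 3.0 and not 2.0, and delta equals 2.0 days and not 1.0 day (as for 2020-03-02).
During the apply computation I cannot access previous values, so I looked for another way to do this.
Edit 2:
I found a way using a precomputed dictionary: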
df_map_value = df_cases.drop_duplicates(["value_cum"])
dict_value = df_map_value["value_cum"].to_dict()
dict_value_inverted = {v: k for k, v in dict_value.items()}
print(dict_value_inverted)
Now, when I detect that value_cum equals a value already seen, I return the index to use for the computation.
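As an illustration of the same lookup without a Python dictionary, the first date of each cumulative value can be kept in a Series and applied to every row with Series.map. This is only a sketch on the sample cumulative values; the names first_seen and reference_date are hypothetical.

```python
import pandas as pd

value_cum = pd.Series(
    [3, 5, 9, 9, 9, 11, 11],
    index=pd.to_datetime([
        "2020-03-01", "2020-03-02", "2020-03-03", "2020-03-04",
        "2020-03-06", "2020-03-07", "2020-03-08",
    ]),
)

# Index the dates by cumulative value and keep only the first date per value.
first_seen = pd.Series(value_cum.index, index=value_cum.to_numpy())
first_seen = first_seen[~first_seen.index.duplicated(keep="first")]

# For every row, the date on which its cumulative value was first reached.
reference_date = value_cum.map(first_seen)
print(reference_date)
```

Rows whose cumulative value did not change are mapped back to the earlier date, so 2020-03-04 and 2020-03-06 both refer to 2020-03-03.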
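A few points
The example you gave is a bit simple, and I think it is a little harder to reason about in the more general case. So I generated 30 days of random data with numpy.
Looking at the link you sent, I think what they show is "how many days before the current day is the latest day that current_day is double of".
To show this explicitly, I will use very verbose column names in pandas. Before computing the metric you want, I will build a reference list column in the dataframe called days_current_day_is_double_of, which for each row (day) holds the list of days whose deaths_cum the current deaths_cum is at least double of.
If you don't want to keep the reference list in the dataframe, this column can later be replaced by a simple np.where() operation each time you want to look this up for a row. I think keeping it is clearer.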
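To make the np.where() lookup concrete, here is a minimal sketch for a single row, using the cumulative values from the question's sample data rather than the real dataset. The variable names are illustrative only.

```python
import numpy as np

deaths_cum = np.array([3, 5, 9, 9, 9, 11, 11])
row_idx = 5  # the day we are inspecting

# Only the days strictly before the current one are candidates.
earlier = deaths_cum[:row_idx]
# Positions of the earlier days whose cumulative value is at most half of today's.
days_double_of = np.where(deaths_cum[row_idx] / earlier >= 2)[0]

# The latest such day, or NaN when there is none.
last_day = days_double_of[-1] if days_double_of.size else np.nan
print(days_double_of, last_day)
```

For row 5 (cumulative value 11) the matching positions are 0 and 1, so the latest day it is double of is position 1.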
Generating data
import pandas as pd
import numpy as np
import io
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
#n_of_days = 30
#random_data = np.random.randint(0,100,size=n_of_days)
#date_range = pd.date_range(start="2020-03-02",freq="D",periods=n_of_days)
#random_data = pd.DataFrame({"deaths":random_data})
#random_data.index = pd.to_datetime(date_range)
#df= random_data
import requests
import json
response = requests.get("https://api-covid.unthinkingdepths.fr/covid19/ecdc?type=cum")
data = json.loads(response.text)["data"]
deaths_cums = [x["deaths_cum"] for x in data]
dates = [x["dateRep"] for x in data]
df = pd.DataFrame({"deaths_cum":deaths_cums})
df.index = pd.to_datetime(dates)
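If the API above is not reachable, the commented-out lines suggest falling back to random data. A possible offline version is sketched here; the fixed seed and the name df_random are arbitrary choices, and the numbers are synthetic, not real Covid-19 figures.

```python
import numpy as np
import pandas as pd

n_of_days = 30
rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

# Synthetic daily deaths between 0 and 99 for 30 consecutive days.
deaths = rng.integers(0, 100, size=n_of_days)
date_range = pd.date_range(start="2020-03-02", freq="D", periods=n_of_days)

df_random = pd.DataFrame({"deaths": deaths}, index=date_range)
# Cumulative deaths, the column the helper functions work on.
df_random["deaths_cum"] = df_random["deaths"].cumsum()
print(df_random.head())
```

Assigning df = df_random would let the rest of the code run unchanged on this synthetic series.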
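Verbose solution in pandas
The key points here are:
use apply() to iterate over the rows
use np.where to explicitly do the backward search
I use np.where inside the helper function check_condition(row) to create the day references once, and then find_index(list_of_days, idx) to search them again whenever needed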
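One detail worth checking when picking an element out of the day references: an array of positions can be non-empty and still contain only position 0, in which case a truthiness test such as .any() reports False. The sketch below uses a hypothetical variant, find_index_by_size, that tests emptiness through .size instead.

```python
import numpy as np


def find_index_by_size(list_of_days, index):
    """Hypothetical variant of find_index: NaN only when the array is empty."""
    if list_of_days.size:
        return list_of_days[index]
    return np.nan


only_day_zero = np.array([0])          # the current day is double of day 0 only
no_days = np.array([], dtype=int)      # the current day is double of no day

print(only_day_zero.any())                     # False, although a match exists
print(find_index_by_size(only_day_zero, -1))   # 0
print(find_index_by_size(no_days, -1))         # nan
```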
Big picture of the code
# create helper functions
def check_condition(row):
    +--- 7 lines
def delta_fromlast_day_currDay_is_double_of(row):
    +--- 12 lines
def how_many_days_fromlast_day_currDay_is_double_of(row):
    +--- 11 lines
def find_index(list_of_days,index):
    +--- 4 lines
# use apply here with lambda functions
+--- 23 lines: df['deaths_cum'] = np.cumsum(df['deaths'])
print(df)
Full solution code
def check_condition(row):
    # positions of all earlier days whose deaths_cum the current day is at least double of
    row_idx = df.index.get_loc(row.name)
    currRow_deaths_cum = df.iloc[row_idx]['deaths_cum']
    rows_before_current_deaths_cum = df.iloc[:row_idx]['deaths_cum']
    currRow_is_more_thanDoubleOf = np.where((currRow_deaths_cum/rows_before_current_deaths_cum) >= 2)[0]
    return currRow_is_more_thanDoubleOf

def delta_fromlast_day_currDay_is_double_of(row):
    # difference in deaths_cum between the current day and the latest day it is double of
    row_idx = df.index.get_loc(row.name)
    currRow_deaths_cum = df.iloc[row_idx]['deaths_cum']
    list_of_days = df.iloc[row_idx]['days_current_day_is_double_of']
    last_day_currDay_is_double_of = find_index(list_of_days,-1)
    if pd.isna(last_day_currDay_is_double_of):
        delta = np.nan
    else:
        last_day_currDay_is_double_of_deaths_cum = df.iloc[last_day_currDay_is_double_of]["deaths_cum"]
        delta = currRow_deaths_cum - last_day_currDay_is_double_of_deaths_cum
    return delta

def how_many_days_fromlast_day_currDay_is_double_of(row):
    # number of rows between the current day and the latest day it is double of
    row_idx = df.index.get_loc(row.name)
    list_of_days = df.iloc[row_idx]['days_current_day_is_double_of']
    last_day_currDay_is_double_of = find_index(list_of_days,-1)
    if pd.isna(last_day_currDay_is_double_of):
        delta = np.nan
    else:
        delta = row_idx - last_day_currDay_is_double_of
    return delta

def find_index(list_of_days,index):
    # .size rather than .any(): an array holding only position 0 is not empty
    if list_of_days.size: return list_of_days[index]
    else: return np.nan

# use apply here with lambda functions
#df['deaths_cum'] = np.cumsum(df['deaths'])
df['deaths_cum_ratio_from_day0'] = df['deaths_cum'].apply(
    lambda cum_deaths: cum_deaths/df['deaths_cum'].iloc[0]
    if df['deaths_cum'].iloc[0] != 0
    else np.nan
)
#df['increase_in_deaths_cum'] = df['deaths_cum'].diff().cumsum() <- this might be interesting for you to use for other analyses
df['days_current_day_is_double_of'] = df.apply(
    lambda row:check_condition(row),
    axis=1
)
df['first_day_currDay_is_double_of'] = df['days_current_day_is_double_of'].apply(lambda list_of_days: find_index(list_of_days,0))
df['last_day_currDay_is_double_of'] = df['days_current_day_is_double_of'].apply(lambda list_of_days: find_index(list_of_days,-1))
df['delta_fromfirst_day'] = df['deaths_cum'] - df['deaths_cum'].iloc[0]
df['delta_fromlast_day_currDay_is_double_of'] = df.apply(
    lambda row: delta_fromlast_day_currDay_is_double_of(row),
    axis=1
)
df['how_many_days_fromlast_day_currDay_is_double_of'] = df.apply(
    lambda row: how_many_days_fromlast_day_currDay_is_double_of(row),
    axis=1
)
print(df[-30:])
Pandas solution output
deaths_cum deaths_cum_ratio_from_day0 \
2020-03-22 562 NaN
2020-03-23 674 NaN
2020-03-24 860 NaN
2020-03-25 1100 NaN
2020-03-26 1331 NaN
2020-03-27 1696 NaN
2020-03-28 1995 NaN
2020-03-29 2314 NaN
2020-03-30 2606 NaN
2020-03-31 3024 NaN
2020-04-01 3523 NaN
2020-04-02 4032 NaN
2020-04-03 4503 NaN
2020-04-04 6507 NaN
2020-04-05 7560 NaN
2020-04-06 8078 NaN
2020-04-07 8911 NaN
2020-04-08 10328 NaN
2020-04-09 10869 NaN
2020-04-10 12210 NaN
2020-04-11 13197 NaN
2020-04-12 13832 NaN
2020-04-13 14393 NaN
2020-04-14 14967 NaN
2020-04-15 15729 NaN
2020-04-16 17167 NaN
2020-04-17 17920 NaN
2020-04-18 18681 NaN
2020-04-19 19323 NaN
2020-04-20 19718 NaN
days_current_day_is_double_of \
2020-03-22 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...
2020-03-23 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...
2020-03-24 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...
2020-03-25 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...
2020-03-26 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...
2020-03-27 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...
2020-03-28 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...
2020-03-29 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...
2020-03-30 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...
2020-03-31 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...
2020-04-01 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...
2020-04-02 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...
2020-04-03 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...
2020-04-04 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...
2020-04-05 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...
2020-04-06 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...
2020-04-07 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...
2020-04-08 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...
2020-04-09 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...
2020-04-10 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...
2020-04-11 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...
2020-04-12 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...
2020-04-13 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...
2020-04-14 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...
2020-04-15 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...
2020-04-16 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...
2020-04-17 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...
2020-04-18 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...
2020-04-19 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...
2020-04-20 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...
first_day_currDay_is_double_of last_day_currDay_is_double_of \
2020-03-22 0.0 79.0
2020-03-23 0.0 79.0
2020-03-24 0.0 80.0
2020-03-25 0.0 81.0
2020-03-26 0.0 82.0
2020-03-27 0.0 83.0
2020-03-28 0.0 84.0
2020-03-29 0.0 85.0
2020-03-30 0.0 85.0
2020-03-31 0.0 86.0
2020-04-01 0.0 87.0
2020-04-02 0.0 88.0
2020-04-03 0.0 88.0
2020-04-04 0.0 91.0
2020-04-05 0.0 92.0
2020-04-06 0.0 93.0
2020-04-07 0.0 93.0
2020-04-08 0.0 94.0
2020-04-09 0.0 94.0
2020-04-10 0.0 94.0
2020-04-11 0.0 95.0
2020-04-12 0.0 95.0
2020-04-13 0.0 95.0
2020-04-14 0.0 95.0
2020-04-15 0.0 96.0
2020-04-16 0.0 97.0
2020-04-17 0.0 98.0
2020-04-18 0.0 98.0
2020-04-19 0.0 98.0
2020-04-20 0.0 98.0
delta_fromfirst_day delta_fromlast_day_currDay_is_double_of \
2020-03-22 562 318.0
2020-03-23 674 430.0
2020-03-24 860 488.0
2020-03-25 1100 650.0
2020-03-26 1331 769.0
2020-03-27 1696 1022.0
2020-03-28 1995 1135.0
2020-03-29 2314 1214.0
2020-03-30 2606 1506.0
2020-03-31 3024 1693.0
2020-04-01 3523 1827.0
2020-04-02 4032 2037.0
2020-04-03 4503 2508.0
2020-04-04 6507 3483.0
2020-04-05 7560 4037.0
2020-04-06 8078 4046.0
2020-04-07 8911 4879.0
2020-04-08 10328 5825.0
2020-04-09 10869 6366.0
2020-04-10 12210 7707.0
2020-04-11 13197 6690.0
2020-04-12 13832 7325.0
2020-04-13 14393 7886.0
2020-04-14 14967 8460.0
2020-04-15 15729 8169.0
2020-04-16 17167 9089.0
2020-04-17 17920 9009.0
2020-04-18 18681 9770.0
2020-04-19 19323 10412.0
2020-04-20 19718 10807.0
how_many_days_fromlast_day_currDay_is_double_of
2020-03-22 3.0
2020-03-23 4.0
2020-03-24 4.0
2020-03-25 4.0
2020-03-26 4.0
2020-03-27 4.0
2020-03-28 4.0
2020-03-29 4.0
2020-03-30 5.0
2020-03-31 5.0
2020-04-01 5.0
2020-04-02 5.0
2020-04-03 6.0
2020-04-04 4.0
2020-04-05 4.0
2020-04-06 4.0
2020-04-07 5.0
2020-04-08 5.0
2020-04-09 6.0
2020-04-10 7.0
2020-04-11 7.0
2020-04-12 8.0
2020-04-13 9.0
2020-04-14 10.0
2020-04-15 10.0
2020-04-16 10.0
2020-04-17 10.0
2020-04-18 11.0
2020-04-19 12.0
2020-04-20 13.0
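If you check, how_many_days_fromlast_day_currDay_is_double_of exactly matches the XDelta from the API :)
If you really want to generalize your code, there are many small suggestions. I don't think this is what you are looking for, but I'll list a few: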
def check_growth_condition(row, growth_factor):
    ....
    np.where((currRow_deaths_cum/rows_before_current_deaths_cum) >= growth_factor)[0] # <----- then just replace 2 with the growth factor
    ....
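To see what the growth factor changes, here is a standalone sketch that works on a plain sequence of cumulative values instead of the DataFrame. The function days_since_growth is hypothetical; it counts rows back to the latest earlier day whose cumulative value, multiplied by growth_factor, does not exceed the current one, and it skips earlier days with a cumulative value of zero (an assumption of this sketch, to avoid dividing by zero).

```python
import numpy as np


def days_since_growth(deaths_cum, growth_factor=2.0):
    """Hypothetical helper: rows back to the latest day the current day
    is at least growth_factor times of; NaN when there is no such day."""
    cum = np.asarray(deaths_cum, dtype=float)
    out = np.full(len(cum), np.nan)
    for i in range(1, len(cum)):
        earlier = cum[:i]
        # Same backward search as np.where(cum[i] / earlier >= growth_factor),
        # written as a product so that zeros do not trigger a division.
        hits = np.where((earlier > 0) & (earlier * growth_factor <= cum[i]))[0]
        if hits.size:
            out[i] = i - hits[-1]
    return out


cum = [3, 5, 9, 9, 9, 11, 11]
doubling = days_since_growth(cum, 2.0)
tripling = days_since_growth(cum, 3.0)
print(doubling)
print(tripling)
```

Like how_many_days_fromlast_day_currDay_is_double_of, this counts rows rather than calendar days, so gaps in the dates are not taken into account.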
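You don't need to keep the whole list of days current day is double of, only the latest day the current day is double of, because all the days before that latest day will also satisfy the condition. To show the "range of days", I will keep the first and the last.
def check_growth_condition(row, growth_factor):
    ...
    # doing backwards search with np.where
    currRow_is_more_thanDoubleOf = np.where((currRow_deaths_cum/rows_before_current_deaths_cum) >= growth_factor)[0]
    # .size rather than .any(): an array holding only position 0 is not empty
    if currRow_is_more_thanDoubleOf.size:
        return np.array([currRow_is_more_thanDoubleOf[0],currRow_is_more_thanDoubleOf[-1]]) # <------ return just first and last
    else:
        return currRow_is_more_thanDoubleOf # empty array
    ...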
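Also note that if you want to get rid of the reference column, you can simply use np.where((currRow_deaths_cum/rows_before_current_deaths_cum) >= growth_factor)[0] wherever I use the check_growth_condition function. Again, np.where is always what does the search.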
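Use a delta_from_any_day function: instead of only subtracting, you pass a function as input, for example np.divide to compute a ratio or np.subtract to compute a delta, as I did in the example.
def delta_from_any_day(row, day_idx,
                       column_name='deaths_cum',func=np.subtract):
    row_idx = df.index.get_loc(row.name)
    currRow_deaths_cum = df.iloc[row_idx][column_name]
    # pd.isna also catches NaN values read back from a DataFrame column
    if pd.isna(day_idx):
        delta = np.nan
    else:
        # positions stored next to NaN come back as floats, so cast before iloc
        day_idx_deaths_cum = df.iloc[int(day_idx)][column_name]
        delta = func(currRow_deaths_cum, day_idx_deaths_cum)
    return delta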
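The effect of passing the comparison function can be shown on a plain array, independently of the DataFrame. The helper compare_with_day below is a hypothetical stand-in for delta_from_any_day, and the values are the cumulative sums from the question's sample data.

```python
import numpy as np


def compare_with_day(cum, row_idx, day_idx, func=np.subtract):
    """Hypothetical stand-in: combine the value at row_idx with the value
    at day_idx using func; NaN when there is no reference day."""
    if day_idx is None or np.isnan(day_idx):
        return np.nan
    return func(cum[row_idx], cum[int(day_idx)])


cum = np.array([3, 5, 9, 9, 9, 11, 11], dtype=float)

delta = compare_with_day(cum, 5, 1)                  # 11 - 5
ratio = compare_with_day(cum, 5, 1, func=np.divide)  # 11 / 5
missing = compare_with_day(cum, 1, np.nan)           # no reference day
print(delta, ratio, missing)
```

Any two-argument NumPy function can be passed the same way, which keeps the lookup logic in one place while the metric itself varies.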
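Clean pandas solution
Note that we only reuse check_growth_condition and find_index to do the backward search, and delta_from_any_day to compute the deltas. All the other helper functions simply reuse these three to compute things.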
def check_growth_condition(row, growth_factor):
    row_idx = df.index.get_loc(row.name)
    currRow_deaths_cum = df.iloc[row_idx]['deaths_cum']
    rows_before_current_deaths_cum = df.iloc[:row_idx]['deaths_cum']
    currRow_is_more_thanDoubleOf = np.where((currRow_deaths_cum/rows_before_current_deaths_cum) >= growth_factor)[0]
    # .size rather than .any(): an array holding only position 0 is not empty
    if currRow_is_more_thanDoubleOf.size:
        return np.array([currRow_is_more_thanDoubleOf[0], currRow_is_more_thanDoubleOf[-1]])
    else:
        return currRow_is_more_thanDoubleOf # empty array

def find_index(list_of_days,index):
    if list_of_days.size: return list_of_days[index]
    else: return np.nan

def delta_from_any_day(row, day_idx, column_name='deaths_cum',func=np.subtract):
    row_idx = df.index.get_loc(row.name)
    currRow_deaths_cum = df.iloc[row_idx][column_name]
    # pd.isna also catches NaN values read back from a DataFrame column
    if pd.isna(day_idx):
        delta = np.nan
    else:
        # positions stored next to NaN come back as floats, so cast before iloc
        day_idx_deaths_cum = df.iloc[int(day_idx)][column_name]
        delta = func(currRow_deaths_cum, day_idx_deaths_cum)
    return delta

def delta_fromlast_day_currDay_is_double_of(row):
    row_idx = df.index.get_loc(row.name)
    currRow_deaths_cum = df.iloc[row_idx]['deaths_cum']
    list_of_days = df.iloc[row_idx]['rangeOf_days_current_day_is_double_of']
    last_day_currDay_is_double_of = find_index(list_of_days,-1)
    delta = delta