Sat*_*tya (4) — tags: python, traversal, nodes, dataframe, pandas
I have a dataframe, and I need to filter it according to the following conditions:
CITY == 'Mumbai' & LANGUAGE == 'English' & GENRE == 'ACTION' & count_GENRE >= 1
CITY == 'Mumbai' & LANGUAGE == 'English' & GENRE == 'ROMANCE' & count_GENRE >= 1
CITY == 'Mumbai' & LANGUAGE == 'Hindi' & count_LANGUAGE >= 1 & GENRE == 'ACTION'
When I try to do this:
df1 = df.query(condition1)
df2 = df.query(condition2)
I get a memory error (because my dataframe is huge).
So I plan to filter on the main condition first and then on the sub-conditions, so that each step works on less data and performance is better.
By parsing the above conditions, I somehow managed to get:
main_filter = "CITY == 'Mumbai'"
sub_cond1 = "LANGUAGE == 'English'"
sub_cond1_cond1 = "GENRE == 'ACTION' & count_GENRE >= 1"
sub_cond1_cond2 = "GENRE == 'ROMANCE' & count_GENRE >= 1"
sub_cond2 = "LANGUAGE == 'Hindi' & count_LANGUGE >= 1"
sub_cond2_cond1 = "GENRE == 'COMEDY'"
So think of it as a tree structure (not binary, of course; in fact it isn't really a tree at all).
Now I want to follow a multiprocessing approach (sub-processes spawned under sub-processes, level by level).
So now I want something like this:
# level 1
df = df_main.query(main_filter)

# level 2
df1 = df.query(sub_cond1)
df2 = df.query(sub_cond2)

# level 3
df11 = df1.query(sub_cond1_cond1)
df12 = df1.query(sub_cond1_cond2)
df21 = df2.query(sub_cond2_cond1)  # ...and so on
So the question is how to pass the conditions correctly down to each level (perhaps by storing all the conditions in a list, though I honestly haven't thought that far).
Note: the result of each filter should be exported to its own separate CSV.
For example:
df11.to_csv("CITY == 'Mumbai' & LANGUAGE == 'English' & GENRE == 'ACTION' & count_GENRE >= 1.csv")
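To make the intent concrete, the sketch below is roughly the shape I am imagining; the nested condition_tree dict and the walk helper are only my own illustration of the idea, not tested code:

import pandas as pd

# A dict value means the node has sub-conditions below it;
# None marks a leaf whose cumulative filter should be exported.
condition_tree = {
    "CITY == 'Mumbai'": {
        "LANGUAGE == 'English'": {
            "GENRE == 'ACTION' & count_GENRE >= 1": None,
            "GENRE == 'ROMANCE' & count_GENRE >= 1": None,
        },
        "LANGUAGE == 'Hindi' & count_LANGUAGE >= 1": {
            "GENRE == 'ACTION'": None,
        },
    },
}

def walk(df, tree, path=""):
    # Filter once per level and reuse the smaller frame for all children,
    # so each condition is only ever evaluated on already-reduced data.
    for cond, children in tree.items():
        sub = df.query(cond)
        full = path + " & " + cond if path else cond
        if children is None:
            sub.to_csv(full + ".csv", index=False)  # one CSV per leaf
        else:
            walk(sub, children, full)

walk(df_main, condition_tree)  # df_main is the full dataframe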
As a beginner I don't know how to go about multiprocessing (its syntax, how execution works, and so on, especially for this). But unfortunately I was handed this task, so I cannot post any real attempt at that part.
So could anyone give a few example lines of code to achieve this?
If you have a better idea (class objects, or node traversal), please suggest it.
Ped*_*rte (16)
This looks like a problem well suited to dask, the Python module that helps you work with larger-than-memory data.
I will show how to solve this problem using dask.dataframe. Let's start by creating some data:
import random
from collections import namedtuple

import pandas as pd

Record = namedtuple('Record', "CITY LANGUAGE GENRE count_GENRE count_LANGUAGE")

cities = ['Mumbai', 'Chennai', 'Bengalaru', 'Kolkata']
languages = ['English', 'Hindi', 'Spanish', 'French']
genres = ['Action', 'Romance', 'Comedy', 'Drama']

# Build 4 million random records and persist them to CSV.
df = pd.DataFrame([Record(random.choice(cities),
                          random.choice(languages),
                          random.choice(genres),
                          random.choice([1, 2, 3]),
                          random.choice([1, 2, 3])) for i in range(4000000)])
df.to_csv('temp.csv', index=False)
print(df.head())
        CITY LANGUAGE    GENRE  count_GENRE  count_LANGUAGE
0    Chennai  Spanish   Action            2               1
1  Bengalaru  English    Drama            2               3
2    Kolkata  Spanish   Action            2               1
3     Mumbai   French  Romance            1               2
4    Chennai   French   Action            2               3
The data created above has 4 million rows and takes up about 107 MB. It is not larger than memory, but it is big enough to use in this example.
Below I show a transcript of a Python session in which I filter the data according to the criteria in the question:
>>> import dask.dataframe as dd
>>> dask_df = dd.read_csv('temp.csv', header=0)
>>> dask_df.npartitions
4
# We see above that dask.dataframe has decided to split the
# data into 4 partitions
# We now execute the query:
>>> result = dask_df[(dask_df['CITY'] == 'Mumbai') &
... (dask_df['LANGUAGE'] == 'English') &
... (dask_df['GENRE'] == 'Action') &
... (dask_df['count_GENRE'] > 1)]
>>>
# The line above takes very little time to execute. In fact, nothing has
# really been computed yet. Behind the scenes dask has created a plan to
# execute the query, but has not yet pulled the trigger.
# The result object is a dask dataframe:
>>> type(result)
<class 'dask.dataframe.core.DataFrame'>
>>> result
dd.DataFrame<series-slice-read-csv-temp.csv-fc62a8c019c213f4cd106801b9e45b29[elemwise-cea80b0dd8dd29ae325a9db1896b027c], divisions=(None, None, None, None, None)>
# We now pull the trigger by calling the compute() method on the dask
# dataframe. The execution of the line below takes a few seconds:
>>> dfout = result.compute()
# The result is a regular pandas dataframe:
>>> type(dfout)
<class 'pandas.core.frame.DataFrame'>
# Of our 4 million records, only ~40k match the query:
>>> len(dfout)
41842
>>> dfout.head()
       CITY LANGUAGE   GENRE  count_GENRE  count_LANGUAGE
225  Mumbai  English  Action            2               3
237  Mumbai  English  Action            3               2
306  Mumbai  English  Action            3               3
335  Mumbai  English  Action            2               2
482  Mumbai  English  Action            2               3
I hope this gets you started on solving your problem. For more information on dask, see its tutorials and examples.
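To also cover the per-filter CSV export from the question: dask dataframes provide the same pandas-style query() method used in the question (applied per partition), so a minimal sketch could simply loop over the conditions. The conditions dict and the output file names below are my own invention:

import dask.dataframe as dd

dask_df = dd.read_csv('temp.csv', header=0)

# One entry per filter from the question; the keys double as file names.
# Note the genres in the generated data are title case ('Action', not 'ACTION').
conditions = {
    'mumbai_english_action': "CITY == 'Mumbai' & LANGUAGE == 'English' & GENRE == 'Action' & count_GENRE >= 1",
    'mumbai_english_romance': "CITY == 'Mumbai' & LANGUAGE == 'English' & GENRE == 'Romance' & count_GENRE >= 1",
    'mumbai_hindi_action': "CITY == 'Mumbai' & LANGUAGE == 'Hindi' & count_LANGUAGE >= 1 & GENRE == 'Action'",
}

for name, cond in conditions.items():
    result = dask_df.query(cond).compute()  # compute() returns a regular pandas dataframe
    result.to_csv(name + '.csv', index=False)

Because dask processes each partition independently, this also gives you parallelism across the data without writing any multiprocessing code yourself.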