Sat*_*tya (4) — tags: python, traversal, nodes, dataframe, pandas
I have a dataframe, and I need to filter it according to the following conditions:
CITY == 'Mumbai' & LANGUAGE == 'English' & GENRE == 'ACTION' & count_GENRE >= 1
CITY == 'Mumbai' & LANGUAGE == 'English' & GENRE == 'ROMANCE' & count_GENRE >= 1
CITY == 'Mumbai' & LANGUAGE == 'Hindi' & count_LANGUAGE >= 1 & GENRE == 'ACTION'
When I try to do this:
df1 = df.query(condition1)
df2 = df.query(condition2)
I get a memory error (because my dataframe is huge).
So I plan to filter on the main condition first and then on the sub-conditions, so that each step works on less data and performance is better.
By parsing the above conditions, I somehow managed to get:
main_filter = "CITY == 'Mumbai'"
sub_cond1 = "LANGUAGE == 'English'"
sub_cond1_cond1 = "GENRE == 'ACTION' & count_GENRE >= 1"
sub_cond1_cond2 = "GENRE == 'ROMANCE' & count_GENRE >= 1"
sub_cond2 = "LANGUAGE == 'Hindi' & count_LANGUGE >= 1"
sub_cond2_cond1 = "GENRE == 'COMEDY'"
So think of it as a tree structure (not binary, of course; in fact it isn't really a tree at all).
Now I want to follow a multiprocessing approach (sub-processes spawned under sub-processes, level by level).
So now I want something like this:
# level 1
df = df_main.query(main_filter)

# level 2
df1 = df.query(sub_cond1)
df2 = df.query(sub_cond2)

# level 3
df11 = df1.query(sub_cond1_cond1)
df12 = df1.query(sub_cond1_cond2)
df21 = df2.query(sub_cond2_cond1)  # ...and so on
So the question is how to pass the conditions correctly down to each level (perhaps by storing all the conditions in a list, though I honestly haven't thought that far).
Note: the result of each filter should be exported to its own separate CSV.
For example:
df11.to_csv("CITY == 'Mumbai' & LANGUAGE == 'English' & GENRE == 'ACTION' & count_GENRE >= 1.csv")
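To make the intent concrete, the sketch below is roughly the shape I am imagining; the nested condition_tree dict and the walk helper are only my own illustration of the idea, not tested code:

import pandas as pd

# A dict value means the node has sub-conditions below it;
# None marks a leaf whose cumulative filter should be exported.
condition_tree = {
    "CITY == 'Mumbai'": {
        "LANGUAGE == 'English'": {
            "GENRE == 'ACTION' & count_GENRE >= 1": None,
            "GENRE == 'ROMANCE' & count_GENRE >= 1": None,
        },
        "LANGUAGE == 'Hindi' & count_LANGUAGE >= 1": {
            "GENRE == 'ACTION'": None,
        },
    },
}

def walk(df, tree, path=""):
    # Filter once per level and reuse the smaller frame for all children,
    # so each condition is only ever evaluated on already-reduced data.
    for cond, children in tree.items():
        sub = df.query(cond)
        full = path + " & " + cond if path else cond
        if children is None:
            sub.to_csv(full + ".csv", index=False)  # one CSV per leaf
        else:
            walk(sub, children, full)

walk(df_main, condition_tree)  # df_main is the full dataframe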
As a beginner I don't know how to go about multiprocessing (its syntax, how execution works, and so on, especially for this). But unfortunately I was handed this task, so I cannot post any real attempt at that part.
So could anyone give a few example lines of code to achieve this?
If you have a better idea (class objects, or node traversal), please suggest it.
Ped*_*rte (16)
This looks like a problem well suited to dask, the Python module that helps you work with larger-than-memory data.
I will show how to solve this problem using dask.dataframe. Let's start by creating some data:
import random
from collections import namedtuple

import pandas as pd

Record = namedtuple('Record', "CITY LANGUAGE GENRE count_GENRE count_LANGUAGE")

cities = ['Mumbai', 'Chennai', 'Bengalaru', 'Kolkata']
languages = ['English', 'Hindi', 'Spanish', 'French']
genres = ['Action', 'Romance', 'Comedy', 'Drama']

# Build 4 million random records and persist them to CSV.
df = pd.DataFrame([Record(random.choice(cities),
                          random.choice(languages),
                          random.choice(genres),
                          random.choice([1, 2, 3]),
                          random.choice([1, 2, 3])) for i in range(4000000)])
df.to_csv('temp.csv', index=False)
print(df.head())
        CITY LANGUAGE    GENRE  count_GENRE  count_LANGUAGE
0    Chennai  Spanish   Action            2               1
1  Bengalaru  English    Drama            2               3
2    Kolkata  Spanish   Action            2               1
3     Mumbai   French  Romance            1               2
4    Chennai   French   Action            2               3
The data created above has 4 million rows and takes up about 107 MB. It is not larger than memory, but it is big enough to use in this example.
Below I show a transcript of a Python session in which I filter the data according to the criteria in the question:
>>> import dask.dataframe as dd
>>> dask_df = dd.read_csv('temp.csv', header=0)
>>> dask_df.npartitions
4
# We see above that dask.dataframe has decided to split the
# data into 4 partitions
# We now execute the query:
>>> result = dask_df[(dask_df['CITY'] == 'Mumbai') &
... (dask_df['LANGUAGE'] == 'English') &
... (dask_df['GENRE'] == 'Action') &
... (dask_df['count_GENRE'] > 1)]
>>>
# The line above takes very little time to execute. In fact, nothing has
# really been computed yet. Behind the scenes dask has created a plan to
# execute the query, but has not yet pulled the trigger.
# The result object is a dask dataframe:
>>> type(result)
<class 'dask.dataframe.core.DataFrame'>
>>> result
dd.DataFrame<series-slice-read-csv-temp.csv-fc62a8c019c213f4cd106801b9e45b29[elemwise-cea80b0dd8dd29ae325a9db1896b027c], divisions=(None, None, None, None, None)>
# We now pull the trigger by calling the compute() method on the dask
# dataframe. The execution of the line below takes a few seconds:
>>> dfout = result.compute()
# The result is a regular pandas dataframe:
>>> type(dfout)
<class 'pandas.core.frame.DataFrame'>
# Of our 4 million records, only ~40k match the query:
>>> len(dfout)
41842
>>> dfout.head()
       CITY LANGUAGE   GENRE  count_GENRE  count_LANGUAGE
225  Mumbai  English  Action            2               3
237  Mumbai  English  Action            3               2
306  Mumbai  English  Action            3               3
335  Mumbai  English  Action            2               2
482  Mumbai  English  Action            2               3
I hope this gets you started on solving your problem. For more information on dask, see its tutorials and examples.
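To also cover the per-filter CSV export from the question: dask dataframes provide the same pandas-style query() method used in the question (applied per partition), so a minimal sketch could simply loop over the conditions. The conditions dict and the output file names below are my own invention:

import dask.dataframe as dd

dask_df = dd.read_csv('temp.csv', header=0)

# One entry per filter from the question; the keys double as file names.
# Note the genres in the generated data are title case ('Action', not 'ACTION').
conditions = {
    'mumbai_english_action': "CITY == 'Mumbai' & LANGUAGE == 'English' & GENRE == 'Action' & count_GENRE >= 1",
    'mumbai_english_romance': "CITY == 'Mumbai' & LANGUAGE == 'English' & GENRE == 'Romance' & count_GENRE >= 1",
    'mumbai_hindi_action': "CITY == 'Mumbai' & LANGUAGE == 'Hindi' & count_LANGUAGE >= 1 & GENRE == 'Action'",
}

for name, cond in conditions.items():
    result = dask_df.query(cond).compute()  # compute() returns a regular pandas dataframe
    result.to_csv(name + '.csv', index=False)

Because dask processes each partition independently, this also gives you parallelism across the data without writing any multiprocessing code yourself.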