Rus*_*ord 5 python python-3.x concurrent.futures
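I have a list of XML documents and a for loop that flattens each one into a pandas DataFrame.
The for loop works fine, but flattening takes a long time, and the XML documents keep getting larger over time.
How can I wrap the for loop below with executor.map so that the workload is spread across multiple cores? I am following this article: https://medium.com/@ageitgey/quick-tip-speed-up-your-python-data-processing-scripts-with-process-pools-cf275350163a
The for loop that flattens the XML: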
from bs4 import BeautifulSoup
import pandas as pd

df1 = pd.DataFrame()
for i in lst:  # lst is the list of XML strings
    print('i am working')
    soup = BeautifulSoup(i, "xml")
    # Get attributes from all nodes
    attrs = []
    for elm in soup():  # soup() is equivalent to soup.find_all()
        attrs.append(elm.attrs)
    # Since you want the data in a dataframe, it makes sense for each field to be a new row consisting of all the other node attributes
    fields_attribute_list = [x for x in attrs if 'Id' in x.keys()]
    other_attribute_list = [x for x in attrs if 'Id' not in x.keys() and x != {}]
    # Make a single dictionary with the attributes of all nodes except for the `Field` nodes.
    attribute_dict = {}
    for d in other_attribute_list:
        for k, v in d.items():
            attribute_dict.setdefault(k, v)
    # Update each field row with attributes from all other nodes.
    full_list = []
    for field in fields_attribute_list:
        field.update(attribute_dict)
        full_list.append(field)
    # Make Dataframe
    df = pd.DataFrame(full_list)
    # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
    df1 = pd.concat([df1, df])
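Does the for loop need to be converted into a function?
Yes, you do need to convert the loop into a function. The function must be callable with a single argument. That argument can be anything: a list, a tuple, a dictionary, or any other object. Passing a function that takes several arguments to the concurrent.futures.*Executor methods is a little more complicated.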
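If the worker does need more than one argument, there are a couple of common ways around it. The sketch below is not part of the original answer: `describe` is a hypothetical worker, and its `parser` and `prefix` parameters and the sample inputs are made up for illustration. It shows two options, freezing the constant arguments with `functools.partial`, or passing `executor.map` one iterable per parameter, which it combines the way the built-in `map` does.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import partial
from itertools import repeat


def describe(xml, parser, prefix):
    """Hypothetical worker that takes three arguments."""
    return f"{prefix}{parser}:{len(xml)}"


docs = ["<a/>", "<bb/>", "<ccc/>"]  # hypothetical inputs

# Option 1: freeze the constant arguments, leaving a one-argument callable.
worker = partial(describe, parser="xml", prefix="doc-")

with ThreadPoolExecutor() as executor:
    fixed = list(executor.map(worker, docs))

    # Option 2: one iterable per parameter; map stops at the shortest one,
    # so an endless repeat("xml") is safe next to the finite lists.
    zipped = list(executor.map(describe, docs, repeat("xml"), ["a-", "b-", "c-"]))

print(fixed)
print(zipped)
```

`partial` tends to read better when the extra arguments are the same for every item; separate iterables fit when each item needs its own value.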
The example below should work for you.
from bs4 import BeautifulSoup
import pandas as pd
from concurrent import futures


def create_dataframe(xml):
    soup = BeautifulSoup(xml, "xml")
    # Get attributes from all nodes
    attrs = []
    for elm in soup():  # soup() is equivalent to soup.find_all()
        attrs.append(elm.attrs)

    # Since you want the data in a dataframe, it makes sense for each field to be a new row consisting of all the other node attributes
    # (the question filters on 'Id'; adjust the key to match your XML)
    fields_attribute_list = [x for x in attrs if 'FieldId' in x.keys()]
    other_attribute_list = [x for x in attrs if 'FieldId' not in x.keys() and x != {}]

    # Make a single dictionary with the attributes of all nodes except for the `Field` nodes.
    attribute_dict = {}
    for d in other_attribute_list:
        for k, v in d.items():
            attribute_dict.setdefault(k, v)

    # Update each field row with attributes from all other nodes.
    full_list = []
    for field in fields_attribute_list:
        field.update(attribute_dict)
        full_list.append(field)
    print(len(full_list))

    # Make Dataframe
    df = pd.DataFrame(full_list)
    # print(df)
    return df


# lst is the list of XML strings from the question
with futures.ThreadPoolExecutor() as executor:  # Or use ProcessPoolExecutor
    df_list = executor.map(create_dataframe, lst)

# executor.map returns a lazy iterator; materialize it once before concatenating
df_list = list(df_list)
full_df = pd.concat(df_list)
print(full_df)
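Since the question is about spreading the work over several cores, here is a rough sketch of the ProcessPoolExecutor variant that the comment in the code above mentions. Threads in standard CPython share the global interpreter lock, so CPU-bound parsing usually gains little from ThreadPoolExecutor. To stay dependency-free, the sketch swaps in `xml.etree.ElementTree` and plain dicts for BeautifulSoup and pandas. `flatten_xml`, `flatten_all`, and the sample documents are hypothetical names for illustration, not part of the original answer.

```python
from concurrent.futures import ProcessPoolExecutor
import xml.etree.ElementTree as ET


def flatten_xml(xml):
    """Stand-in for create_dataframe: one dict per element carrying a FieldId."""
    root = ET.fromstring(xml)
    # Attribute dicts of every element that has at least one attribute.
    attrs = [dict(elm.attrib) for elm in root.iter() if elm.attrib]
    fields = [a for a in attrs if "FieldId" in a]
    others = [a for a in attrs if "FieldId" not in a]
    # First value seen for a key wins, as with setdefault in the original loop.
    shared = {}
    for d in others:
        for k, v in d.items():
            shared.setdefault(k, v)
    # Shared attributes overwrite field attributes, like field.update(...).
    return [{**field, **shared} for field in fields]


def flatten_all(xml_list, executor_cls=ProcessPoolExecutor, chunksize=1):
    """Map flatten_xml over xml_list in a pool and merge rows in input order."""
    with executor_cls() as executor:
        results = executor.map(flatten_xml, xml_list, chunksize=chunksize)
        return [row for rows in results for row in rows]


if __name__ == "__main__":
    # Hypothetical sample documents, for illustration only.
    docs = [
        '<Doc Name="a"><Field FieldId="1"/><Field FieldId="2"/></Doc>',
        '<Doc Name="b"><Field FieldId="3"/></Doc>',
    ]
    for row in flatten_all(docs):
        print(row)
```

The `if __name__ == "__main__":` guard matters with process pools: on platforms that start workers by spawning a fresh interpreter (Windows, and macOS by default), each worker re-imports the main module, and unguarded pool creation would recurse. The worker function also has to live at module top level so it can be pickled. With many small documents, raising `chunksize` may cut the per-task overhead.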