How do I reindex malformed columns from pandas read_html?

Question

tum*_*eed 5 python multiprocessing dataframe python-3.x pandas

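I'm retrieving content from a website that has several tables with the same number of columns, using pandas read_html. When I read a single link that contains several tables with the same number of columns, pandas effectively reads all of them as one (something like a flat/normalized table). However, I want to do the same for a list of links from the website (i.e. a single flat table covering several links), so I tried the following: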

In:

import multiprocessing

import pandas as pd

def process(url):
    df_url = pd.read_html(url)
    df = pd.concat(df_url, ignore_index=False) 
    return df_url

links = ['link1.com','link2.com','link3.com',...,'linkN.com']

pool = multiprocessing.Pool(processes=6)
df = pool.map(process, links)
df

However, I don't think I'm specifying the columns correctly for read_html(), so I get this malformed list of lists:

Out:

[[                Form     Disponibility  \
  0  290090 01780-500-01)  Unavailable - no product available for release.   

                             Relation  \

     Relation drawbacks  
  0                  NaN                        Removed 
  1                  NaN                        Removed ],
 [                                        Form  \

                                   Relation  \
  0  American Regent is currently releasing the 0.4...   
  1  American Regent is currently releasing the 1mg...   

     drawbacks  
  0  Demand increase for the drug  
  1                         Removed ,
                                          Form  \
  0  0.1 mg/mL; 10 mL Luer-Jet Prefilled Syringe (N...   

    Disponibility  Relation  \
  0                            Product available                  NaN   
  2                        Removed 
  3                        Removed ]]

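So my question is: which parameter should I change to get a flat pandas DataFrame from the nested list above? I tried header=0, index_col=0 and match='"columns"', but none of them worked. Or do I need to do the flattening when I create the DataFrame with pd.DataFrame()? My main goal is to end up with a pandas DataFrame with columns like these: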

form, Disponibility, Relation, drawbacks
1 
2
...
n

Answer 1

Max*_*axU 3

IIUC, you can do it this way:

First, you want to return the concatenated DF rather than the list of DFs (read_html returns a list of DFs):

def process(url):
    return pd.concat(pd.read_html(url), ignore_index=False)
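
To make the difference visible without hitting a real website, here is a minimal sketch. It assumes a hypothetical helper, `fake_read_html`, standing in for `pd.read_html`; the column names are taken from the question and the cell values are made up.

```python
import pandas as pd

COLS = ["Form", "Disponibility", "Relation", "drawbacks"]

def fake_read_html(url):
    # Hypothetical stand-in for pd.read_html(url): like the real function,
    # it returns a *list* of DataFrames, one per table found on the page.
    return [
        pd.DataFrame([["a", "yes", "r1", "d1"]], columns=COLS),
        pd.DataFrame([["b", "no", "r2", "d2"],
                      ["c", "no", "r3", "d3"]], columns=COLS),
    ]

def process(url):
    # Collapse the per-page list of tables into a single DataFrame.
    return pd.concat(fake_read_html(url), ignore_index=False)

page_df = process("link1.com")
print(type(fake_read_html("link1.com")))  # a plain list, not a DataFrame
print(page_df.shape)                      # (3, 4)
print(list(page_df.index))                # [0, 0, 1]
```

With ignore_index=False each table keeps its own row labels, which is why the index repeats inside a single page's result.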

Then concatenate the results across all URLs:

df = pd.concat(pool.map(process, links), ignore_index=True)
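
As a rough illustration of this second step, the sketch below uses a hand-built list, `per_url`, in place of what `pool.map(process, links)` would return (one DataFrame per link). The name and the data are assumptions for the example only; the column names come from the question.

```python
import pandas as pd

COLS = ["Form", "Disponibility", "Relation", "drawbacks"]

# Hypothetical per-URL results: one DataFrame per link, each possibly
# carrying repeated row labels from its own page.
per_url = [
    pd.DataFrame([["a", "yes", "r1", "d1"],
                  ["b", "no", "r2", "d2"]], columns=COLS, index=[0, 0]),
    pd.DataFrame([["c", "no", "r3", "d3"]], columns=COLS, index=[0]),
]

# ignore_index=True discards the per-page labels and numbers rows 0..n-1.
df = pd.concat(per_url, ignore_index=True)
print(df.index.tolist())    # [0, 1, 2]
print(df.columns.tolist())  # ['Form', 'Disponibility', 'Relation', 'drawbacks']
```

Passing ignore_index=True at this outer level is what produces the single flat 0..n-1 index asked for in the question.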

Archived:	8 years, 12 months ago
Views:	77
Last activity:	8 years, 11 months ago