Har*_*ley 6 python web-scraping pandas
I am scraping some data from several websites and modifying it with Pandas.
It works fine on the first few chunks of data, but later I get this error message:
Traceback (most recent call last):
  File "data.py", line 394, in <module>
    df2[['STATUS_ID_1','STATUS_ID_2']] = df2['STATUS'].str.split(n=1, expand=True)
  File "/home/web/.local/lib/python2.7/site-packages/pandas/core/frame.py", line 2326, in __setitem__
    self._setitem_array(key,value)
  File "/home/web/.local/lib/python2.7/site-packages/pandas/core/frame.py", line 2350, in _setitem_array
    raise ValueError('Columns must be same length as key')
ValueError: Columns must be same length as key
Here is my code:
df2 = pd.DataFrame(datatable,columns = cols)
df2['FLIGHT_ID_1'] = df2['FLIGHT'].str[:3]
df2['FLIGHT_ID_2'] = df2['FLIGHT'].str[3:].str.zfill(4)
df2[['STATUS_ID_1','STATUS_ID_2']] = df2['STATUS'].str.split(n=1, expand=True)
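EDIT (for jezrael): I used your code and printed the result below. I hope this helps us find where the problem is, because the point at which the script hits this split problem seems to be random.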
0 1
2 Landed 8:33 AM
3 Landed 9:37 AM
4 Landed 9:10 AM
5 Landed 9:57 AM
6 Landed 9:36 AM
8 Landed 8:51 AM
9 Landed 9:18 AM
11 Landed 8:53 AM
12 Landed 7:59 AM
13 Landed 7:52 AM
14 Landed 8:56 AM
15 Landed 8:09 AM
18 Landed 8:42 AM
19 Landed 9:39 AM
20 Landed 9:45 AM
21 Landed 7:44 AM
23 Landed 8:36 AM
27 Landed 9:53 AM
29 Landed 9:26 AM
30 Landed 8:23 AM
35 Landed 9:59 AM
36 Landed 8:38 AM
37 Landed 9:38 AM
38 Landed 9:37 AM
40 Landed 9:27 AM
43 Landed 9:14 AM
44 Landed 9:22 AM
45 Landed 8:18 AM
46 Landed 10:01 AM
47 Landed 10:21 AM
.. ... ...
316 Delayed 5:00 PM
317 Delayed 4:34 PM
319 Estimated 2:58 PM
320 Estimated 3:02 PM
321 Delayed 4:47 PM
323 Estimated 3:08 PM
325 Delayed 3:52 PM
326 Estimated 3:09 PM
327 Estimated 2:37 PM
328 Estimated 3:17 PM
329 Estimated 3:20 PM
330 Estimated 2:39 PM
331 Delayed 4:04 PM
332 Delayed 4:36 PM
337 Estimated 3:47 PM
339 Estimated 3:37 PM
341 Delayed 4:32 PM
345 Estimated 3:34 PM
349 Estimated 3:24 PM
356 Delayed 4:56 PM
358 Estimated 3:45 PM
367 Estimated 4:09 PM
370 Estimated 4:04 PM
371 Estimated 4:11 PM
373 Delayed 5:21 PM
382 Estimated 3:56 PM
384 Delayed 4:28 PM
389 Delayed 4:41 PM
393 Estimated 4:02 PM
397 Delayed 5:23 PM
[240 rows x 2 columns]
You need a slightly modified solution, because the split sometimes returns two columns and sometimes only one:
df2 = pd.DataFrame({'STATUS':['Estimated 3:17 PM','Delayed 3:00 PM']})
df3 = df2['STATUS'].str.split(n=1, expand=True)
df3.columns = ['STATUS_ID{}'.format(x+1) for x in df3.columns]
print (df3)
STATUS_ID1 STATUS_ID2
0 Estimated 3:17 PM
1 Delayed 3:00 PM
df2 = df2.join(df3)
print (df2)
STATUS STATUS_ID1 STATUS_ID2
0 Estimated 3:17 PM Estimated 3:17 PM
1 Delayed 3:00 PM Delayed 3:00 PM
Another possible data set, in which no value contains whitespace; the solution works here as well:
df2 = pd.DataFrame({'STATUS':['Canceled','Canceled']})
and the solution returns:
print (df2)
STATUS STATUS_ID1
0 Canceled Canceled
1 Canceled Canceled
All together:
df3 = df2['STATUS'].str.split(n=1, expand=True)
df3.columns = ['STATUS_ID{}'.format(x+1) for x in df3.columns]
df2 = df2.join(df3)
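If downstream code expects both status columns to exist no matter what the scraped data looks like, one option is to pad the split result to a fixed width before joining. The following is only a sketch under that assumption: the helper name `split_status` and its `n_parts` parameter are hypothetical, and the `STATUS_ID_1`/`STATUS_ID_2` names follow the question rather than the answer above.

```python
import pandas as pd

def split_status(df, column="STATUS", n_parts=2):
    """Split a text column on whitespace and always return n_parts new columns."""
    parts = df[column].str.split(n=n_parts - 1, expand=True)
    # reindex adds any column the split did not produce and fills it with NaN
    parts = parts.reindex(columns=list(range(n_parts)))
    parts.columns = ["{}_ID_{}".format(column, i + 1) for i in range(n_parts)]
    return df.join(parts)

# one value has a time part, the other does not
mixed = split_status(pd.DataFrame({"STATUS": ["Estimated 3:17 PM", "Canceled"]}))
# no value contains whitespace, so the raw split yields a single column
only_one = split_status(pd.DataFrame({"STATUS": ["Canceled", "Canceled"]}))

print(mixed)
print(only_one)
```

Rows without a second part end up with a missing value in `STATUS_ID_2`, so later steps can test for it with `isna()` instead of failing on a missing column.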
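To resolve this error, check the shape of the object you are trying to assign to the df columns (using np.shape). Its second (or last) dimension must match the number of columns you are assigning to. For example, if you try to assign a 2-column numpy array to 3 columns, you will see this error.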
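The shape check described above can be sketched as follows for a 2-D numpy array. The helper `columns_match` is a hypothetical name introduced for illustration only; it is not part of pandas or numpy.

```python
import numpy as np
import pandas as pd

def columns_match(cols, vals):
    """Return True if the last dimension of vals equals the number of target columns."""
    shape = np.shape(vals)
    return len(shape) > 0 and shape[-1] == len(cols)

df = pd.DataFrame({"A": [0, 1]})
vals = np.array([[1, 2], [3, 4]])  # np.shape(vals) is (2, 2)

ok = columns_match(["B", "C"], vals)        # 2 target columns, last dimension 2
bad = columns_match(["B", "C", "D"], vals)  # 3 target columns, last dimension 2

if ok:
    df[["B", "C"]] = vals  # the widths agree, so the assignment goes through
print(ok, bad)
```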
A general workaround (for case 1 and case 2 below) is to cast the object you are trying to assign to a DataFrame and join() it to df, i.e. use (2) instead of (1).
df[cols] = vals # (1)
df = df.join(vals) if isinstance(vals, pd.DataFrame) else df.join(pd.DataFrame(vals)) # (2)
If you are trying to replace values in an existing column and hit this error (case 3(a) below), convert the object to a list and assign it.
df[cols] = vals.values.tolist()
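If you have duplicate columns (case 3(b) below), there is no easy fix. You have to make the dimensions match manually.
This error occurs in 3 cases:
Case 1: When you try to assign a list-like object (e.g. a list, tuple, set, numpy array, or pandas Series) to a list of DataFrame columns as new arrays (see note 1), but the number of columns does not match the second (or last) dimension of the list-like object (found using np.shape). So the following reproduces this error: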
df = pd.DataFrame({'A': [0, 1]})
cols, vals = ['B'], [[2], [4, 5]]
df[cols] = vals # number of columns is 1 but the list has shape (2,)
Note that this error does not occur if the columns are not given as a list, pandas Series, numpy array, or pandas Index. So the following does not reproduce the error:
df[('B',)] = vals # the column is given as a tuple
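An interesting edge case arises when the list-like object is multi-dimensional (but not a numpy array). In that case, under the hood, the object is first cast to a pandas DataFrame, and its last dimension is then checked against the number of columns. This produces the following interesting behavior: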
# the error occurs below because pd.DataFrame(vals1) has shape (2, 2) and len(['B']) != 2
vals1 = [[[2], [3]], [[4], [5]]]
df[cols] = vals1
# no error below because pd.DataFrame(vals2) has shape (2, 1) and len(['B']) == 1
vals2 = [[[[2], [3]]], [[[4], [5]]]]
df[cols] = vals2
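Case 2: When you try to assign a DataFrame to a list (or pandas Series, numpy array, or pandas Index) of columns but the respective numbers of columns do not match. This is the case that caused the error in the OP. The following reproduce the error: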
df = pd.DataFrame({'A': [0, 1]})
df[['B']] = pd.DataFrame([[2, 3], [4]]) # a 2-column df is trying to be assigned to a single column
df[['B', 'C']] = pd.DataFrame([[2], [4]]) # a single column df is trying to be assigned to 2 columns
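As a rough illustration of case 2 together with the join() workaround, the sketch below catches the error and then joins the frame instead of assigning it. The variable names (`vals`, `message`) are illustrative, and naming the joined column "B" is an assumption made for the example.

```python
import pandas as pd

df = pd.DataFrame({"A": [0, 1]})
vals = pd.DataFrame([[2], [4]])  # a single-column DataFrame

try:
    df[["B", "C"]] = vals  # two target columns, but only one source column
    message = None
except ValueError as exc:
    message = str(exc)
print(message)

# Workaround: join the frame instead of assigning it to a fixed list of columns.
vals.columns = ["B"]  # give the joined column a name (assumed for illustration)
df = df.join(vals)
print(df)
```

With join(), the result simply has as many new columns as `vals` provides, so a split that yields fewer columns than expected no longer raises.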
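Case 3: When you try to replace the values of an existing column with a DataFrame (or a list-like object) whose number of columns does not match the number of columns it is replacing. So the following reproduce the error: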
# case 3(a)
df1 = pd.DataFrame({'A': [0, 1]})
df1['A'] = pd.DataFrame([[2, 3], [4, 5]]) # df1 has a single column named 'A' but a 2-column-df is trying to be assigned
# case 3(b): duplicate column names matter too
df2 = pd.DataFrame([[0, 1], [2, 3]], columns=['A','A'])
df2['A'] = pd.DataFrame([[2], [4]]) # df2 has 2 columns named 'A' but a single column df is being assigned
Note 1: df.loc[:, cols] = vals may overwrite data in place, so it does not raise the error but creates columns of NaN values.