pandas read_csv does not read the file correctly: the data is not split into the proper columns

Nik*_*hil 5 python csv pandas

I am trying to read this dataset from Kaggle:

https://www.kaggle.com/gmadevs/atp-matches-dataset#atp_matches_2016.csv

I am using pandas' read_csv function, but it does not split the columns correctly. This is the code I tried:

import pandas as pd

df_2016 = pd.read_csv("Path/to/file/atp_matches_2016.csv")

However, the printed DataFrame looks like this:

                                                                                                                                         tourney_id  ... l_bpFaced
2016-M020 Brisbane Hard 32.0 A 20160104.0 300.0 105683.0 4.0 NaN Milos Raonic  R 196.0 CAN 25.021218 14.0 2170.0 103819.0 1.0  NaN    Roger Federer  ...       NaN
                                          299.0 103819.0 1.0 NaN Roger Federer R 185.0 SUI 34.406571 3.0  8265.0 106233.0 8.0  NaN    Dominic Thiem  ...       NaN
                                          298.0 105683.0 4.0 NaN Milos Raonic  R 196.0 CAN 25.021218 14.0 2170.0 106071.0 7.0  NaN    Bernard Tomic  ...       NaN
                                          297.0 103819.0 1.0 NaN Roger Federer R 185.0 SUI 34.406571 3.0  8265.0 105777.0 NaN  NaN  Grigor Dimitrov  ...       NaN
                                          296.0 106233.0 8.0 NaN Dominic Thiem R NaN   AUT 22.335387 20.0 1600.0 105227.0 3.0  NaN      Marin Cilic  ...       NaN


Why is it failing to split the columns?

I expected output like the following, which, for some reason, is what I get for every year except 2016 and 2017:

  tourney_id tourney_name surface  ...  l_SvGms l_bpSaved  l_bpFaced
0   2015-329        Tokyo    Hard  ...     10.0       2.0        5.0
1   2015-329        Tokyo    Hard  ...     13.0      12.0       19.0
2   2015-329        Tokyo    Hard  ...     18.0       9.0       11.0
3   2015-329        Tokyo    Hard  ...     13.0       4.0        8.0
4   2015-329        Tokyo    Hard  ...     10.0       1.0        5.0

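The actual CSV file looks fine and has the same format as the other years. I also tried specifying the columns with the columns parameter of read_csv, but that gave me the same output.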
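A file can look fine in a viewer and still have data rows with more fields than the header, for example because of trailing delimiters. A minimal sketch for checking this uses only the standard-library csv module. The helper name `field_counts` and the three-column sample are hypothetical stand-ins, not the real ATP data:

```python
import csv
import io
from collections import Counter


def field_counts(lines):
    """Map 'number of fields in a row' to 'number of rows with that count'."""
    return Counter(len(row) for row in csv.reader(lines))


# Hypothetical stand-in for the real file: the header has 3 fields,
# while every data row ends with two extra empty fields.
sample = io.StringIO("a,b,c\n1,x,2.5,,\n3,y,4.5,,\n")

counts = field_counts(sample)
print(counts)
```

For a real file, the same helper could be called on an open file handle, e.g. `with open(path, newline="") as f: field_counts(f)`. More than one distinct field count means the header and the data rows disagree.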

Chr*_*ris 3

The safest approach I can think of is to read the CSV twice:

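import pandas as pd

# First pass: skip the header line and read the data rows with default integer column labels
rows = pd.read_csv('path/to/atp_matches_2016.csv', skiprows=[0], header=None)
# Drop columns that only contain NaNs
rows = rows.dropna(axis=1, how='all')

# Second pass: read only the header line and use it as the column names
rows.columns = pd.read_csv('path/to/atp_matches_2016.csv', nrows=0).columns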
print(rows.head(5))

Output:

  tourney_id tourney_name surface  draw_size tourney_level  tourney_date  \
0  2016-M020     Brisbane    Hard       32.0             A    20160104.0   
1  2016-M020     Brisbane    Hard       32.0             A    20160104.0   
2  2016-M020     Brisbane    Hard       32.0             A    20160104.0   
3  2016-M020     Brisbane    Hard       32.0             A    20160104.0   
4  2016-M020     Brisbane    Hard       32.0             A    20160104.0 



   match_num  winner_id  winner_seed winner_entry  ... w_bpFaced l_ace  l_df  \
0      300.0   105683.0          4.0          NaN  ...       1.0   7.0   3.0   
1      299.0   103819.0          1.0          NaN  ...       1.0   2.0   4.0   
2      298.0   105683.0          4.0          NaN  ...       4.0  10.0   3.0   
3      297.0   103819.0          1.0          NaN  ...       1.0   8.0   2.0   
4      296.0   106233.0          8.0          NaN  ...       2.0  11.0   2.0   

  l_svpt  l_1stIn  l_1stWon  l_2ndWon  l_SvGms  l_bpSaved l_bpFaced  
0   61.0     34.0      25.0      14.0     10.0        3.0       5.0  
1   55.0     31.0      18.0       9.0      8.0        2.0       6.0  
2   84.0     54.0      41.0      16.0     12.0        2.0       2.0  
3  104.0     62.0      46.0      21.0     16.0        8.0      11.0  
4   98.0     52.0      41.0      27.0     15.0        7.0       8.0  
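Dropping the all-NaN columns lines the data up with the header, which suggests that the data rows in the 2016 file carry extra empty fields beyond the header. When data rows have more fields than the header, read_csv treats the surplus leading columns as the index. That would explain the output shown in the question. The sketch below reproduces this on a small in-memory CSV and applies the same two-pass approach. The three-column sample is a hypothetical stand-in, not the real ATP data:

```python
import io

import pandas as pd

# Hypothetical miniature of the problem file:
# 3 header fields, but 5 fields in every data row (two trailing empties).
text = "a,b,c\n1,x,2.5,,\n3,y,4.5,,\n"

# Default parsing: the two surplus leading columns are absorbed into the index.
broken = pd.read_csv(io.StringIO(text))

# Two-pass approach: read the data rows without the header,
# drop the columns that are entirely NaN, then attach the header names.
rows = pd.read_csv(io.StringIO(text), skiprows=[0], header=None)
rows = rows.dropna(axis=1, how="all")
rows.columns = pd.read_csv(io.StringIO(text), nrows=0).columns

print(broken.index.nlevels)
print(rows)
```

One caveat: `dropna(axis=1, how='all')` would also remove a genuine column that happens to be empty in every row. In that case the number of remaining columns would no longer match the header and the assignment to `rows.columns` would fail.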