I am not able to read my dataset.csv file due to the following parser error:
Error tokenizing data. C error: Expected 1 fields in line 8, saw 4
The CSV file is generated by another program. Basically, I want to skip the text rows that repeat at regular intervals and keep only the integer and float values in my dataset. I tried this:
df = pd.read_csv('Dataset.csv')
I also tried the following, but I am only getting the bad lines reported as output. I want to skip all these bad lines and keep only the remaining values in my dataset.
df = pd.read_csv('Dataset.csv', error_bad_lines=False, engine='python')
Dataset:
The pch2csv utility program
This file contains the pch2csv
$TITLE =
$SUBTITLE=
$LABEL = FX
1,0.000000E+00,3.792830E-06,-1.063093E-06
2,0.000000E+00,-1.441319E-06,4.711234E-06
3,0.000000E+00,2.950290E-06,-5.669502E-07
4,0.000000E+00,3.706791E-06,-1.094726E-06
5,0.000000E+00,3.689831E-06,-1.107476E-06
$TITLE =
$SUBTITLE=
$LABEL = FY
1,0.000000E+00,-5.878803E-06,1.127179E-06
2,0.000000E+00,2.782207E-06,-8.840886E-06
3,0.000000E+00,-1.574296E-06,3.867732E-07
4,0.000000E+00,-6.227912E-06,1.864081E-06
5,0.000000E+00,-3.113227E-05,9.339538E-06
Expected dataset:
The 1st column should be set as the index, and the final dataset must contain only the 1st and 3rd columns, as shown. The column label must be set to '1'. Even the blank rows may be deleted, if possible.
You can pass the names parameter to read_csv to set new column names - this produces some rows with missing values, so remove them with DataFrame.dropna:
import pandas as pd
from io import StringIO
temp="""The pch2csv utility program
This file contains the pch2csv
$TITLE =
$SUBTITLE=
$LABEL = FX
1,0.000000E+00,3.792830E-06,-1.063093E-06
2,0.000000E+00,-1.441319E-06,4.711234E-06
3,0.000000E+00,2.950290E-06,-5.669502E-07
4,0.000000E+00,3.706791E-06,-1.094726E-06
5,0.000000E+00,3.689831E-06,-1.107476E-06
$TITLE =
$SUBTITLE=
$LABEL = FY
1,0.000000E+00,-5.878803E-06,1.127179E-06
2,0.000000E+00,2.782207E-06,-8.840886E-06
3,0.000000E+00,-1.574296E-06,3.867732E-07
4,0.000000E+00,-6.227912E-06,1.864081E-06
5,0.000000E+00,-3.113227E-05,9.339538E-06"""
#after testing, replace 'StringIO(temp)' with 'Dataset.csv'
df = pd.read_csv(StringIO(temp),
                 error_bad_lines=False,
                 engine='python',
                 names=['a','b','c','d'])
df = df.dropna(subset=['b','c','d'])
print (df)
a b c d
0 1 0.0 0.000004 -1.063093e-06
1 2 0.0 -0.000001 4.711234e-06
2 3 0.0 0.000003 -5.669502e-07
3 4 0.0 0.000004 -1.094726e-06
4 5 0.0 0.000004 -1.107476e-06
8 1 0.0 -0.000006 1.127179e-06
9 2 0.0 0.000003 -8.840886e-06
10 3 0.0 -0.000002 3.867732e-07
11 4 0.0 -0.000006 1.864081e-06
12 5 0.0 -0.000031 9.339538e-06
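Note that error_bad_lines is deprecated since pandas 1.3 and removed in pandas 2.0; on newer versions the equivalent is on_bad_lines='skip'. A minimal sketch of the same read with the newer parameter:
#same idea as above, but with the replacement for error_bad_lines (pandas >= 1.3)
df = pd.read_csv(StringIO(temp),
                 on_bad_lines='skip',
                 engine='python',
                 names=['a','b','c','d'])
df = df.dropna(subset=['b','c','d'])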
EDIT:
To set the first column as the index and use different column names:
#after testing, replace 'StringIO(temp)' with 'Dataset.csv'
df = pd.read_csv(StringIO(temp),
                 error_bad_lines=False,
                 engine='python',
                 index_col=[0],
                 names=['idx','col1','col2','col3'])
#checks all columns; the first column is already the index, so it is not tested
df = df.dropna()
#if you need to drop only rows where all values are NaN
#df = df.dropna(how='all')
print (df)
col1 col2 col3
idx
1 0.0 0.000004 -1.063093e-06
2 0.0 -0.000001 4.711234e-06
3 0.0 0.000003 -5.669502e-07
4 0.0 0.000004 -1.094726e-06
5 0.0 0.000004 -1.107476e-06
1 0.0 -0.000006 1.127179e-06
2 0.0 0.000003 -8.840886e-06
3 0.0 -0.000002 3.867732e-07
4 0.0 -0.000006 1.864081e-06
5 0.0 -0.000031 9.339538e-06
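As an alternative to dropping the unwanted columns afterwards (see EDIT1 below), only the 1st and 3rd columns from the file could be selected while reading by passing usecols - a minimal sketch, assuming the same column names as in the EDIT above:
df = pd.read_csv(StringIO(temp),
                 error_bad_lines=False,
                 engine='python',
                 names=['idx','col1','col2','col3'],
                 usecols=['idx','col2'])
#drop the text/blank rows (NaN in col2), then set the index
df = df.dropna().set_index('idx')
#the index may still be of object dtype here, so convert it to int
df.index = df.index.astype(int)
print (df)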
EDIT1:
If you need to remove all columns that are filled with 0 only:
df = df.loc[:, df.ne(0).any()]
print (df)
col2 col3
idx
1 0.000004 -1.063093e-06
2 -0.000001 4.711234e-06
3 0.000003 -5.669502e-07
4 0.000004 -1.094726e-06
5 0.000004 -1.107476e-06
1 -0.000006 1.127179e-06
2 0.000003 -8.840886e-06
3 -0.000002 3.867732e-07
4 -0.000006 1.864081e-06
5 -0.000031 9.339538e-06
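The question also asks for the column label to be set to '1'; if that is still needed, DataFrame.rename covers it - a short sketch, assuming the 'col2' name used above:
#rename the value column to the label '1' requested in the question
df = df.rename(columns={'col2': '1'})
print (df)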