Pandas: read_csv indicating "space-delimited"

Mic*_*due 4 python io pandas

I have the following file.txt (abridged):

SICcode        Catcode        Category                              SICname        MultSIC
0111        A1500        Wheat, corn, soybeans and cash grain        Wheat        X
0112        A1600        Other commodities (incl rice, peanuts)      Rice        X
0115        A1500        Wheat, corn, soybeans and cash grain        Corn        X
0116        A1500        Wheat, corn, soybeans and cash grain        Soybeans        X
0119        A1500        Wheat, corn, soybeans and cash grain        Cash grains, NEC        X
0131        A1100        Cotton        Cotton        X
0132        A1300        Tobacco & Tobacco products                  Tobacco        X

I am having some trouble reading it into a pandas DataFrame. I tried pd.read_csv with the options engine='python', sep='Tab', but it returned the file as a single column:

    ?SICcode Catcode Category SICname MultSIC
0   0111 A1500 Wheat, corn, soybeans...
1   0112 A1600 Other commodities (in...
2   0115 A1500 Wheat, corn, soybeans...
3   0116 A1500 Wheat, corn, soybeans...

I then tried loading it into Gnumeric with "tab" as the delimiter, but it also read the file as a single column. Does anyone have an idea what is going on?

unu*_*tbu 5

If df = pd.read_csv('file.txt', sep='\t') returns a DataFrame with a single column, then file.txt evidently does not use tabs as separators. Your data may simply be separated by spaces. In that case, you could try

df = pd.read_csv('data', sep=r'\s{2,}', engine='python')

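This uses the regular expression \s{2,} as the separator. The pattern matches two or more consecutive whitespace characters, so the single spaces inside a field such as "Wheat, corn, soybeans and cash grain" do not split it.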
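To make the behaviour easy to reproduce without the original file, here is a minimal sketch that parses a few rows of the sample from an in-memory string. The `sample` text and the use of `io.StringIO` are assumptions made for illustration, and `engine="python"` is spelled out because a multi-character regex separator is handled by pandas' Python parsing engine.

```python
import io
import pandas as pd

# A few rows copied from the question; fields are separated by runs of spaces.
sample = (
    "SICcode        Catcode        Category                              SICname        MultSIC\n"
    "0111        A1500        Wheat, corn, soybeans and cash grain        Wheat        X\n"
    "0119        A1500        Wheat, corn, soybeans and cash grain        Cash grains, NEC        X\n"
    "0131        A1100        Cotton        Cotton        X\n"
)

# Split only on two or more whitespace characters, so single spaces
# inside a field (e.g. "Cash grains, NEC") stay part of that field.
df = pd.read_csv(io.StringIO(sample), sep=r"\s{2,}", engine="python")

print(df.columns.tolist())
print(df.loc[1, "SICname"])
```

Note that this approach relies on every pair of adjacent fields being separated by at least two spaces; a row where two fields are one space apart would be merged into one column.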

In [8]: df
Out[8]: 
   SICcode Catcode                                Category           SICname  \
0      111   A1500    Wheat, corn, soybeans and cash grain             Wheat   
1      112   A1600  Other commodities (incl rice, peanuts)              Rice   
2      115   A1500    Wheat, corn, soybeans and cash grain              Corn   
3      116   A1500    Wheat, corn, soybeans and cash grain          Soybeans   
4      119   A1500    Wheat, corn, soybeans and cash grain  Cash grains, NEC   
5      131   A1100                                  Cotton            Cotton   
6      132   A1300              Tobacco & Tobacco products           Tobacco   

  MultSIC  
0       X  
1       X  
2       X  
3       X  
4       X  
5       X  
6       X  
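One thing the output above shows is that SICcode was parsed as an integer, so 0111 became 111. If the leading zeros matter, a possible adjustment (a sketch, not part of the original answer) is to ask pandas to keep that column as text via the `dtype` parameter. The `sample` string below is an illustrative stand-in for the file.

```python
import io
import pandas as pd

# Illustrative stand-in for file.txt, using two rows from the question.
sample = (
    "SICcode        Catcode        Category        SICname        MultSIC\n"
    "0111        A1500        Wheat, corn, soybeans and cash grain        Wheat        X\n"
    "0131        A1100        Cotton        Cotton        X\n"
)

# Reading SICcode as str preserves the leading zero in each code.
codes = pd.read_csv(
    io.StringIO(sample),
    sep=r"\s{2,}",
    engine="python",
    dtype={"SICcode": str},
)

print(codes["SICcode"].tolist())
```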

If that doesn't work, please post the output of print(repr(open('file.txt', 'rb').read(100))). This will show us an unambiguous representation of the first 100 bytes of file.txt.
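As a rough sketch of what to look for in those bytes, the hypothetical helper below (`describe_head` is not a pandas function) reports whether the data starts with a UTF-8 byte order mark and how many tabs and double-space runs it contains. The `head` value is made-up data standing in for the real file contents. The stray "?" before SICcode in the question's output might hint at a byte order mark, but that is only a guess until the actual bytes are shown.

```python
import codecs

def describe_head(raw: bytes) -> dict:
    """Summarize delimiter-relevant features of the first bytes of a file."""
    return {
        "utf8_bom": raw.startswith(codecs.BOM_UTF8),
        "tabs": raw.count(b"\t"),
        # Non-overlapping occurrences of two consecutive spaces.
        "space_runs": raw.count(b"  "),
    }

# Made-up stand-in for open('file.txt', 'rb').read(100).
head = codecs.BOM_UTF8 + b"SICcode" + b" " * 8 + b"Catcode" + b" " * 8 + b"Category"

print(repr(head))
print(describe_head(head))
```

On the real file, `head` would instead come from `with open('file.txt', 'rb') as f: head = f.read(100)`. Zero tabs together with many space runs would support the whitespace-separator approach, and a detected byte order mark would suggest also passing encoding='utf-8-sig' to pd.read_csv.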