pandas to_csv: 'ascii' codec can't encode character

ale*_*e19 8 python unicode utf-8 pandas

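I'm trying to read and write a dataframe to a pipe-delimited file. Some of the characters are non-Roman letters (`, ç, ñ, etc.), and it breaks when I try to write the accented characters out as ASCII.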

import pandas as pd

df = pd.read_csv('filename.txt', sep='|', encoding='utf-8')
# <do stuff that produces newdf>
newdf.to_csv('output.txt', sep='|', index=False, encoding='ascii')

-------

  File "<ipython-input-63-ae528ab37b8f>", line 21, in <module>
    newdf.to_csv(filename,sep='|',index=False, encoding='ascii')

  File "C:\Users\aliceell\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\core\frame.py", line 1344, in to_csv
    formatter.save()

  File "C:\Users\aliceell\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\formats\format.py", line 1551, in save
    self._save()

  File "C:\Users\aliceell\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\formats\format.py", line 1652, in _save
    self._save_chunk(start_i, end_i)

  File "C:\Users\aliceell\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\formats\format.py", line 1678, in _save_chunk
    lib.write_csv_rows(self.data, ix, self.nlevels, self.cols, self.writer)

  File "pandas\lib.pyx", line 1075, in pandas.lib.write_csv_rows (pandas\lib.c:19767)

UnicodeEncodeError: 'ascii' codec can't encode character '\xb4' in position 7: ordinal not in range(128)

If I change to_csv to use utf-8 encoding, then I can't read the file back in properly:

newdf.to_csv('output.txt',sep='|',index=False,encoding='utf-8')
pd.read_csv('output.txt', sep='|')

> UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb4 in position 2: invalid start byte

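My goal is to end up with a pipe-delimited file that retains the accents and special characters.

Also, is there an easy way to figure out which line read_csv is breaking on? Right now I don't know how to get it to show me the bad character(s).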

Oha*_*dok 37

See the answer here.

It is a much simpler solution:

newdf.to_csv("C:/tweetDF", sep='\t', encoding = 'utf-8')
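Adapted to the question's pipe-delimited setup, the same idea can be sketched as a round trip. This is only an illustration: the sample data and the temporary file path are made up, and the key assumption is that the encoding is stated explicitly on both the write and the read.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample data with accented characters (not from the question).
df = pd.DataFrame({"name": ["façade", "niño", "plain"], "value": [1, 2, 3]})

# Write a pipe-delimited file into a temporary directory, naming the encoding.
path = os.path.join(tempfile.mkdtemp(), "output.txt")
df.to_csv(path, sep="|", index=False, encoding="utf-8")

# Read it back with the same separator and the same encoding.
roundtrip = pd.read_csv(path, sep="|", encoding="utf-8")
print(roundtrip)
```

If the read still fails on a file produced elsewhere, that file was probably not written as UTF-8 in the first place, so the encoding passed to read_csv has to match whatever produced it.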


Ale*_*exG 7

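You have some characters that are not ASCII, so they cannot be encoded the way you are trying to. I would just use utf-8, as suggested in the comments.

To check which rows are causing the problem, you can try something like this: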
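Both error messages in the question can be reproduced without pandas, using the character '\xb4' named in the traceback. The sketch below is only an illustration. The Latin-1 step is an assumption about where a lone 0xb4 byte could come from, not something the question confirms.

```python
char = "\xb4"  # U+00B4 ACUTE ACCENT, the character named in the traceback

# ASCII only covers code points 0-127, so encoding this character fails.
try:
    char.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# UTF-8 represents the same character as two bytes.
utf8_bytes = char.encode("utf-8")

# A lone 0xb4 byte is not valid UTF-8 ...
try:
    b"\xb4".decode("utf-8")
    lone_byte_ok = True
except UnicodeDecodeError:
    lone_byte_ok = False

# ... but it is what a single-byte encoding such as Latin-1 uses for this character.
latin1_char = b"\xb4".decode("latin-1")

print(ascii_ok, utf8_bytes, lone_byte_ok, latin1_char == char)
```

A file that holds a bare 0xb4 byte was therefore written with some single-byte encoding rather than UTF-8, and reading it requires that same encoding.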

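def is_not_ascii(string):
    # True only for str values that contain at least one non-ASCII character;
    # NaN and other non-string cells are skipped instead of raising TypeError.
    return isinstance(string, str) and any(ord(s) >= 128 for s in string)

# col is the name of the column to test
df[df[col].apply(is_not_ascii)]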

You need to set col to the column you want to test.
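For the other part of the question (finding the line that read_csv fails on), one possible approach is to skip pandas, read the file as raw bytes, and try to decode each line. This is a rough sketch: the helper name find_undecodable_lines and the sample bytes are hypothetical, and it assumes an encoding such as UTF-8 in which lines end with a plain b"\n" byte.

```python
import os
import tempfile


def find_undecodable_lines(path, encoding="utf-8"):
    """Return (line_number, raw_bytes, error_message) for lines that fail to decode."""
    bad = []
    with open(path, "rb") as handle:
        # Iterating a binary file yields one bytes object per b"\n"-terminated line.
        for number, raw in enumerate(handle, start=1):
            try:
                raw.decode(encoding)
            except UnicodeDecodeError as err:
                bad.append((number, raw, str(err)))
    return bad


# Hypothetical sample file: line 2 is valid UTF-8, line 3 holds a lone 0xb4 byte.
sample_path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(sample_path, "wb") as handle:
    handle.write(b"id|text\n1|caf\xc3\xa9\n2|\xb4bad\n")

bad_lines = find_undecodable_lines(sample_path)
for number, raw, message in bad_lines:
    print(number, raw, message)
```

Line numbers are 1-based and count the header row, so they correspond to lines in the file rather than to dataframe index values.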