How to remove non-UTF-8 characters and save as a CSV file in Python

Jas*_*ine 1 python encoding utf-8

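I have some Amazon review data that I successfully converted from text format to CSV. The problem is that when I try to read it into a DataFrame with pandas, I get this error message: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf8 in position 13: invalid start byte

I understand that there must be some non-UTF-8 bytes in the raw review data. How can I remove them and save the result to another CSV file?

Thanks!

EDIT1: Here is the code I use to convert the text to CSV: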

import csv
import string
INPUT_FILE_NAME = "small-movies.txt"
OUTPUT_FILE_NAME = "small-movies1.csv"
header = [
    "product/productId",
    "review/userId",
    "review/profileName",
    "review/helpfulness",
    "review/score",
    "review/time",
    "review/summary",
    "review/text"]
f = open(INPUT_FILE_NAME,encoding="utf-8")

outfile = open(OUTPUT_FILE_NAME,"w")

outfile.write(",".join(header) + "\n")
currentLine = []
for line in f:

   line = line.strip()  
   # need to remove the commas so that the review text won't be split across many columns
   line = line.replace(',','')

   if line == "":
      outfile.write(",".join(currentLine))
      outfile.write("\n")
      currentLine = []
      continue
   parts = line.split(":",1)
   currentLine.append(parts[1])

if currentLine != []:
    outfile.write(",".join(currentLine))
f.close()
outfile.close()

EDIT2:

Thanks to everyone who tried to help. I solved the problem by changing how the output file is opened in my code:

 outfile = open(OUTPUT_FILE_NAME,"w",encoding="utf-8")
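
A likely reason this helps: without an explicit `encoding`, `open()` uses the platform's locale encoding, which may be a legacy code page rather than UTF-8, so the CSV can end up containing bytes such as 0xf8 that pandas then rejects. Below is a minimal sketch of the same conversion with explicit encodings on both files. It uses `csv.writer`, so commas in the review text are quoted instead of stripped. The function name `convert_reviews` and the `errors="replace"` choice are assumptions for illustration, not part of the original script.

```python
import csv

HEADER = [
    "product/productId", "review/userId", "review/profileName",
    "review/helpfulness", "review/score", "review/time",
    "review/summary", "review/text",
]

def convert_reviews(input_path, output_path):
    """Convert blank-line-separated 'key: value' records to a UTF-8 CSV.

    Hypothetical helper; errors="replace" is an assumption that turns
    undecodable input bytes into U+FFFD instead of raising.
    """
    with open(input_path, encoding="utf-8", errors="replace") as src, \
         open(output_path, "w", encoding="utf-8", newline="") as dst:
        writer = csv.writer(dst)  # quotes fields that contain commas
        writer.writerow(HEADER)
        record = []
        for line in src:
            line = line.strip()
            if not line:
                # a blank line ends the current record
                if record:
                    writer.writerow(record)
                    record = []
                continue
            # split on the first colon only; keep the whole line if there is none
            _, sep, value = line.partition(":")
            record.append(value.strip() if sep else line)
        if record:
            writer.writerow(record)
```

Calling `convert_reviews(INPUT_FILE_NAME, OUTPUT_FILE_NAME)` would then stand in for the loop above.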

Ser*_*sta 5

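If the input file is not UTF-8 encoded, trying to read it as UTF-8 is probably not a good idea...

You basically have two ways to deal with decoding errors:

  • use a charset that accepts any byte, such as iso-8859-15, also known as latin9
  • if the input should be UTF-8 but contains errors, use errors='ignore' -> silently drops the non-UTF-8 bytes, or errors='replace' -> replaces them with a replacement marker (U+FFFD, "�", when decoding)

For example: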

f = open(INPUT_FILE_NAME,encoding="latin9")

or

f = open(INPUT_FILE_NAME,encoding="utf-8", errors='replace')
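
To make the difference between these options concrete, here is a small sketch that decodes the same bytes three ways. The sample string is made up for illustration; 0xf8 is the byte from the error message in the question.

```python
# "søk" encoded as Latin-1 gives b"s\xf8k"; 0xf8 is not a valid UTF-8 start byte
raw = "søk".encode("latin-1")

ignored = raw.decode("utf-8", errors="ignore")    # invalid byte is dropped
replaced = raw.decode("utf-8", errors="replace")  # invalid byte becomes U+FFFD
as_latin9 = raw.decode("iso-8859-15")             # every byte maps to a character

print(repr(ignored), repr(replaced), repr(as_latin9))
```

Note that latin9 never fails, but it only yields the right characters if the file really was written in that (or a compatible) encoding.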
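
For the literal question (strip the non-UTF-8 bytes and save to another CSV file), the same `errors` argument can be used to copy the file. This is only a sketch under the assumption that the file is mostly valid UTF-8; the function name `rewrite_as_utf8` and the file names in the usage note are hypothetical.

```python
def rewrite_as_utf8(src_path, dst_path, errors="ignore"):
    """Copy src_path to dst_path as valid UTF-8.

    errors="ignore" drops bytes that are not valid UTF-8;
    errors="replace" would substitute U+FFFD instead.
    """
    # newline="" on both sides keeps the original line endings untouched
    with open(src_path, encoding="utf-8", errors=errors, newline="") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        for line in src:
            dst.write(line)
```

After something like `rewrite_as_utf8("small-movies1.csv", "small-movies1-clean.csv")`, the cleaned file should load with a plain `pandas.read_csv` call. Alternatively, `pandas.read_csv` accepts an `encoding` argument (for example `encoding="latin9"`), which avoids the intermediate file.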