Using Pandas read_csv() twice on an open file

Question

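While experimenting with pandas, I noticed some odd behavior from pandas.read_csv and was wondering if someone with more experience could explain what might be causing it.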

To start, here is my basic class definition for creating a new pandas.dataframe from a .csv file:

import pandas as pd

class dataMatrix:
    def __init__(self, filepath):
        self.path = filepath  # File path to the target .csv file.
        self.csvfile = open(filepath)  # Open file.
        self.csvdataframe = pd.read_csv(self.csvfile)

This works fine, and calling the class from my __main__.py successfully creates a pandas dataframe:

from dataMatrix import dataMatrix

testObject = dataMatrix('/path/to/csv/file')

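However, I noticed that this process automatically uses the first row of the .csv as the pandas.dataframe.columns index. I decided to number the columns instead. Since I didn't want to assume I knew the number of columns beforehand, I took the approach of opening the file, loading it into a dataframe, counting the columns, and then reloading the dataframe with the proper number of columns using range().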

import pandas as pd

class dataMatrix:
    def __init__(self, filepath):
        self.path = filepath
        self.csvfile = open(filepath)

        # Load the .csv file to count the columns.
        self.csvdataframe = pd.read_csv(self.csvfile)
        # Count the columns.
        self.numcolumns = len(self.csvdataframe.columns)
        # Re-load the .csv file, manually setting the column names to their 
        # number.
        self.csvdataframe = pd.read_csv(self.csvfile, 
                                        names=range(self.numcolumns))

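Keeping my code in __main__.py the same, I got back a dataframe with the correct number of columns (500 in this case) and the proper names (0...499), but it was otherwise empty (no row data).

Scratching my head, I decided to close self.csvfile and reload it like so: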

import pandas as pd

class dataMatrix:
    def __init__(self, filepath):
        self.path = filepath
        self.csvfile = open(filepath)

        # Load the .csv file to count the columns.
        self.csvdataframe = pd.read_csv(self.csvfile)
        # Count the columns.
        self.numcolumns = len(self.csvdataframe.columns)

        # Close the .csv file.         #<---- +++++++
        self.csvfile.close()           #<----  Added
        # Re-open file.                #<----  Block
        self.csvfile = open(filepath)  #<---- +++++++

        # Re-load the .csv file, manually setting the column names to their
        # number.
        self.csvdataframe = pd.read_csv(self.csvfile, 
                                        names=range(self.numcolumns))

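Closing the file and re-opening it correctly returned a pandas.dataframe with columns numbered 0...499 and all 255 rows of data.

My question is: why does closing the file and re-opening it make a difference?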

Answer 1

unu*_*tbu 7

When you open a file with

open(filepath)

a file handle iterator is returned. An iterator is good for one pass through its contents. So

self.csvdataframe = pd.read_csv(self.csvfile)

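reads the contents and exhausts the iterator. Any subsequent call to pd.read_csv on the same handle finds the iterator empty, which is why the second dataframe has its columns but no rows.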
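
The one-pass behavior can be seen without pandas at all. The following is a minimal sketch using only the standard library; the temporary file and its three-line contents are made up for illustration and are not from the question.

```python
import os
import tempfile

# Write a tiny throwaway CSV so the example is self-contained
# (hypothetical sample data).
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    tmp.write("a,b,c\n1,2,3\n4,5,6\n")
    path = tmp.name

with open(path) as handle:
    first_pass = handle.read()    # consumes everything up to the end of the file
    second_pass = handle.read()   # nothing left: the position is already at the end
    handle.seek(0)                # rewind to the start of the file
    third_pass = handle.read()    # the full contents are readable again

os.remove(path)

print(repr(second_pass))          # ''
print(first_pass == third_pass)   # True
```

Closing and re-opening the file, as in the question, has the same effect as the rewind here: the new handle starts at the beginning.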

Note that you could simply pass the file path to pd.read_csv:

import pandas as pd

class dataMatrix:
    def __init__(self, filepath):
        self.path = filepath

        # Load the .csv file to count the columns.
        self.csvdataframe = pd.read_csv(filepath)
        # Count the columns.
        self.numcolumns = len(self.csvdataframe.columns)


        # Re-load the .csv file, manually setting the column names to their
        # number.
        self.csvdataframe = pd.read_csv(filepath, 
                                        names=range(self.numcolumns))

pd.read_csv will then open (and close) the file for you.

PS. Another option is to reset the file handle to the beginning of the file by calling self.csvfile.seek(0), but using pd.read_csv(filepath, ...) is still easier.
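
For completeness, here is a rough sketch of the seek(0) route. It uses an in-memory io.StringIO as a stand-in for the open file, and the sample data is hypothetical; a handle returned by open(filepath) should behave the same way.

```python
import io

import pandas as pd

# In-memory stand-in for an open .csv file (hypothetical sample data).
csvfile = io.StringIO("a,b,c\n1,2,3\n4,5,6\n")

# First read: the first row is taken as the header.
first = pd.read_csv(csvfile)
numcolumns = len(first.columns)

# Rewind the handle, otherwise the second read sees no data.
csvfile.seek(0)

# Second read: columns are numbered, so the first row is kept as data.
second = pd.read_csv(csvfile, names=list(range(numcolumns)))

print(first.shape)            # (2, 3)
print(second.shape)           # (3, 3)
print(list(second.columns))   # [0, 1, 2]
```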

Even better, instead of calling pd.read_csv twice (which is inefficient), you could rename the columns like this:

import pandas as pd

class dataMatrix:
    def __init__(self, filepath):
        self.path = filepath

        # Load the .csv file once, then relabel the columns by position.
        self.csvdataframe = pd.read_csv(filepath)
        self.numcolumns = len(self.csvdataframe.columns)
        self.csvdataframe.columns = range(self.numcolumns)
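
One caveat with renaming after the fact: the first row of the file has already been consumed as the header, so it does not appear among the data rows. If the file has no header row and the first row should be kept as data, a possible single-read alternative (not part of the answer above) is to pass header=None, which makes pandas number the columns itself. A minimal sketch, with made-up sample data:

```python
import io

import pandas as pd

# Hypothetical header-less data standing in for the .csv file.
csv_text = "1,2,3\n4,5,6\n"

# header=None: no row is treated as a header; columns are labeled 0, 1, 2, ...
df = pd.read_csv(io.StringIO(csv_text), header=None)

print(list(df.columns))   # [0, 1, 2]
print(len(df))            # 2
```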
