Read multiple CSV files and add the file name as a new column in pandas

Question

amw*_*de2 8 python csv operating-system glob pandas

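I have several CSV files in one folder, and I want to open them in a single DataFrame and insert a new column holding the name of the file each row came from. So far I have written the following code: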

import pandas as pd
import glob, os
df = pd.concat(map(pd.read_csv, glob.glob(os.path.join('path/*.csv'))))
df['filename']= os.path.basename(csv)
df

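This gives me the DataFrame I want, but the new 'filename' column shows only the last file name in the folder on every row. I want each row to be filled with the name of the CSV file it came from, not just the last file in the folder.

Any help for this beginner is much appreciated.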

Answer 1

jez*_*ael 12

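I think you need assign to add the new column inside the loop; the parameter ignore_index=True was also added to concat to remove duplicates in the index:

The files used for testing are a.csv, b.csv, and c.csv.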

import pandas as pd
import glob, os

files = glob.glob('files/*.csv')
print (files)
['files\\a.csv', 'files\\b.csv', 'files\\c.csv']

df = pd.concat([pd.read_csv(fp).assign(New=os.path.basename(fp)) for fp in files], ignore_index=True)
print (df)
   a  b  c  d    New
0  0  1  2  5  a.csv
1  1  5  8  3  a.csv
2  0  9  6  5  b.csv
3  1  6  4  2  b.csv
4  0  7  1  7  c.csv
5  1  3  2  6  c.csv

# Same approach, but with the extension stripped from the file name
files = glob.glob('files/*.csv')
df = pd.concat([pd.read_csv(fp).assign(New=os.path.basename(fp).split('.')[0]) for fp in files], ignore_index=True)
print (df)
   a  b  c  d New
0  0  1  2  5   a
1  1  5  8  3   a
2  0  9  6  5   b
3  1  6  4  2   b
4  0  7  1  7   c
5  1  3  2  6   c
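
A variant of the approach above, sketched with pathlib on the assumption that Python 3 is in use. Path.stem drops only the final extension, so a file named b.2020.csv is tagged b.2020, whereas split('.')[0] would give b. The helper name read_with_source, the column name source, and the temporary test files are illustrative and do not come from the original answer.

```python
import tempfile
from pathlib import Path

import pandas as pd


def read_with_source(folder, column="source"):
    """Read every CSV in `folder` and tag each row with its file stem."""
    # sorted() makes the order deterministic; glob order is not guaranteed.
    paths = sorted(Path(folder).glob("*.csv"))
    frames = [pd.read_csv(p).assign(**{column: p.stem}) for p in paths]
    # ignore_index=True renumbers the rows 0..n-1 across all files.
    return pd.concat(frames, ignore_index=True)


# Illustrative data only: two small files in a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.csv").write_text("x,y\n1,2\n3,4\n")
    Path(tmp, "b.2020.csv").write_text("x,y\n5,6\n")
    combined = read_with_source(tmp)

print(combined)
```

Sorting the paths is a design choice rather than a requirement; it only keeps the row order stable between runs.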

Answer 2

Abi*_*san 1

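First of all, you never defined the csv variable.

In any case, this behavior makes sense: you use csv only at the very end, so it would be set to the last file. Ideally, you could use glob again to get all the file names and then set them as the new column.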

#this is a Python list containing filenames
csvs = glob.glob(os.path.join('path/*.csv'))

#now set the csv into a pd series
csv_paths = pd.Series(csvs)

df['file_name'] = csv_paths.values

I get "ValueError: Length of values does not match length of index", because each file has multiple rows of data. (7 upvotes)
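
The error reported in the comment arises because the series holds one entry per file while df has one row per data line. One possible repair, sketched here and not taken from the original answer, is to repeat each file name once per row of its file with numpy.repeat; this requires reading the files individually so that their lengths are known. The in-memory CSV text and the names sources, frames, and row_counts are hypothetical stand-ins for real files on disk.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical stand-ins for files on disk: file name -> CSV text.
sources = {
    "a.csv": "x,y\n1,2\n3,4\n",
    "b.csv": "x,y\n5,6\n7,8\n9,10\n",
}

names = list(sources)
frames = [pd.read_csv(io.StringIO(sources[name])) for name in names]
df = pd.concat(frames, ignore_index=True)

# One name per file would have length 2, but df has 5 rows, hence the
# ValueError. Repeating each name by its file's row count gives exactly
# one name per row.
row_counts = [len(frame) for frame in frames]
df["file_name"] = np.repeat(names, row_counts)

print(df)
```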

Archived:	8 years, 11 months ago
Views:	7077
Last activity:	7 years ago