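The general use case behind this question is reading multiple CSV log files from a target directory into a single Python Pandas DataFrame for quick-turnaround statistical analysis and charting. The idea of using Pandas rather than MySQL is to run this data import (or append) plus the stat analysis periodically throughout the day.
The script below attempts to read all of the CSV files (same file layout) into a single Pandas DataFrame and adds a year column associated with each file read.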
# Assemble all of the data files into a single DataFrame & add a year field
# 2010 is the last available year
years = range(1880, 2011)
for year in years:
    path = 'C:\\Documents and Settings\\Foo\\My Documents\\pydata-book\\pydata-book-master`\\ch02\\names\\yob%d.txt' % year
    frame = pd.read_csv(path, names=columns)
    frame['year'] = year
    pieces.append(frame)
# Concatenates everything into a single Dataframe
names = pd.concat(pieces, ignore_index=True)
# Expected row total should be 1690784
names
<class 'pandas.core.frame.DataFrame'>
Int64Index: 33838 entries, 0 to 33837
Data columns:
name 33838 non-null values
sex 33838 non-null values
births 33838 non-null values
year 33838 non-null values
dtypes: int64(2), object(2)
# Start aggregating the data at the year & gender level using groupby or pivot
total_births = names.pivot_table('births', rows='year', cols='sex', aggfunc=sum)
# Prints pivot table
total_births.tail()
Out[35]:
sex         F        M
year
2010  1759010  1898382
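In current pandas releases, `pivot_table` no longer accepts the `rows` and `cols` keywords used above; `index` and `columns` take their place. A minimal sketch of the same aggregation with the newer keywords, run on a tiny made-up frame (the values are illustrative only, not taken from the names dataset):

```python
import pandas as pd

# Tiny made-up sample with the same columns as the frame above
# (illustrative values only, not real data)
names = pd.DataFrame({
    "name": ["Mary", "John", "Anna", "James"],
    "sex": ["F", "M", "F", "M"],
    "births": [10, 20, 30, 40],
    "year": [1880, 1880, 1881, 1881],
})

# index/columns replace the older rows/cols keywords
total_births = names.pivot_table(
    values="births", index="year", columns="sex", aggfunc="sum"
)
print(total_births)
```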
Gre*_*eda 13
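The append method on a DataFrame instance does not work the same way as the append method on a list instance. DataFrame.append() does not operate in place; it returns a new object instead.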
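A small sketch of the difference, using made-up frames. It uses `pd.concat` as a stand-in for the DataFrame side, because `DataFrame.append` was removed in pandas 2.0; both return a new object rather than modifying an existing one.

```python
import pandas as pd

# list.append mutates the list in place and returns None
pieces = []
returned = pieces.append("a")

# Combining DataFrames leaves the inputs untouched and returns a new object
left = pd.DataFrame({"births": [1, 2]})
right = pd.DataFrame({"births": [3]})
combined = pd.concat([left, right], ignore_index=True)

print(returned, len(pieces))     # the list itself was changed
print(len(left), len(combined))  # left is unchanged; combined is a new frame
```

Applied to the loop from the question, that means the result has to be assigned back to `names`: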
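import pandas as pd

columns = ['name', 'sex', 'births']  # column names as shown in the question's output
years = range(1880, 2011)
names = pd.DataFrame()
for year in years:
    path = 'C:\\Documents and Settings\\Foo\\My Documents\\pydata-book\\pydata-book-master`\\ch02\\names\\yob%d.txt' % year
    frame = pd.read_csv(path, names=columns)
    frame['year'] = year
    # DataFrame.append returns a new object, so assign the result back to names.
    # (DataFrame.append was removed in pandas 2.0; on newer versions use pd.concat.)
    names = names.append(frame, ignore_index=True)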
Alternatively, you can use concat:
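import pandas as pd

columns = ['name', 'sex', 'births']  # column names as shown in the question's output
years = range(1880, 2011)
names = pd.DataFrame()
for year in years:
    path = 'C:\\Documents and Settings\\Foo\\My Documents\\pydata-book\\pydata-book-master`\\ch02\\names\\yob%d.txt' % year
    frame = pd.read_csv(path, names=columns)
    frame['year'] = year
    # pd.concat takes a list (or other iterable) of frames as its first argument
    names = pd.concat([names, frame], ignore_index=True)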
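Calling `pd.concat` inside the loop copies the accumulated rows on every iteration, so with many files it is usually cheaper to collect the frames in a list and concatenate once at the end. A minimal sketch, assuming the `yob<year>.txt` naming from the question; the helper `read_yearly_files` and the sample rows are hypothetical and only there to make the example self-contained:

```python
import tempfile
from pathlib import Path

import pandas as pd

columns = ["name", "sex", "births"]


def read_yearly_files(directory, years):
    """Read yob<year>.txt files from `directory` into one DataFrame."""
    pieces = []
    for year in years:
        frame = pd.read_csv(Path(directory) / f"yob{year}.txt", names=columns)
        frame["year"] = year
        pieces.append(frame)  # list.append works in place
    # A single concat at the end instead of one per file
    return pd.concat(pieces, ignore_index=True)


# Demonstration with two tiny made-up files in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "yob1880.txt").write_text("Mary,F,10\nJohn,M,20\n")
    Path(tmp, "yob1881.txt").write_text("Anna,F,30\n")
    names = read_yearly_files(tmp, range(1880, 1882))

print(names)
```

For the real data, the call would take the directory holding the files and `range(1880, 2011)` as arguments.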