yos*_*rry 241 python list pandas
我有一个excel文档,看起来像这样..
cluster load_date budget actual fixed_price
A 1/1/2014 1000 4000 Y
A 2/1/2014 12000 10000 Y
A 3/1/2014 36000 2000 Y
B 4/1/2014 15000 10000 N
B 4/1/2014 12000 11500 N
B 4/1/2014 90000 11000 N
C 7/1/2014 22000 18000 N
C 8/1/2014 30000 28960 N
C 9/1/2014 53000 51200 N
Run Code Online (Sandbox Code Playgroud)
我希望能够返回第1列的内容 - 群集作为列表,因此我可以在其上运行for循环,并为每个群集创建一个excel工作表.
是否也可以将整行的内容返回到列表中?例如
list = [], list[column1] or list[df.ix(row1)]
Run Code Online (Sandbox Code Playgroud)
Ben*_*Ben 395
Pandas DataFrame列是Pandas系列,当你将它们拉出来时,你可以调用x.tolist()它们将它们变成Python列表.或者你用它投射list(x).
import pandas as pd
d = {'one' : pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
'two' : pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
print("Starting with this dataframe\n", df)
print("The first column is a", type(df['one']), "\nconsisting of\n", df['one'])
dfToList = df['one'].tolist()
dfList = list(df['one'])
dfValues = df['one'].values
print("dfToList is", dfToList, "and it's a", type(dfToList))
print("dfList is ", dfList, "and it's a", type(dfList))
print("dfValues is", dfValues, "and it's a", type(dfValues))
Run Code Online (Sandbox Code Playgroud)
最后一行返回:
dfToList is [1.0, 2.0, 3.0, nan] and it's a <class 'list'>
dfList is [1.0, 2.0, 3.0, nan] and it's a <class 'list'>
dfValues is [ 1. 2. 3. nan] and it's a <class 'numpy.ndarray'>
Run Code Online (Sandbox Code Playgroud)
这个问题可能会有所帮助.一旦你了解他们的风格,熊猫文档实际上是相当不错的.
所以在你的情况下你可以:
my_list = df["cluster"].tolist()
然后从那里开始
小智 44
这将返回一个numpy数组:
my_list = df["cluster"].values
Run Code Online (Sandbox Code Playgroud)
这将返回唯一值的numpy数组:
my_list = df["cluster"].values
uniqueVals = np.unique(my_list)
Run Code Online (Sandbox Code Playgroud)
或者:
uniqueVals = df["cluster"].unique()
Run Code Online (Sandbox Code Playgroud)
Dav*_*han 13
这是一个简单的衬里:
list(df['load_date'])
更新: toList() 不起作用。它应该全部小写 .tolist()
Har*_*vey 11
Numpy Array -> Panda Data Frame -> 来自一个 Panda 列的列表
Numpy 数组
data = np.array([[10,20,30], [20,30,60], [30,60,90]])
Run Code Online (Sandbox Code Playgroud)
将 numpy 数组转换为 Panda 数据框
dataPd = pd.DataFrame(data = data)
print(dataPd)
0 1 2
0 10 20 30
1 20 30 60
2 30 60 90
Run Code Online (Sandbox Code Playgroud)
pdToList = list(dataPd['2'])
cot*_*ail 10
.tolist(). 不要使用list()如果我们查看 的源代码.tolist(),在幕后,list()正在对数据帧中的基础数据调用函数,因此两者应该产生相同的输出。
然而,它看起来tolist()针对 Python 标量列进行了优化,因为我发现调用list()列比调用 慢 10 倍tolist()。根据记录,我试图将一个非常大的数据帧中的一列 json 字符串转换为一个列表,并且list()正在享受它的甜蜜时光。这激发了我测试这两种方法的运行时间。
仅供参考,无需调用.to_numpy()或获取.values属性,因为数据框列/系列对象已经实现了.tolist()方法。另外,由于 numpy 数组的存储方式,list()并且tolist()会为数字列提供不同类型的标量(至少)。例如,
type(list(df['budget'].values)[0]) # numpy.int64
type(df['budget'].values.tolist()[0]) # int
Run Code Online (Sandbox Code Playgroud)
以下 perfplot 显示了两种方法在各种 pandas dtype Series 对象上的运行时差异。基本上,它显示了以下两种方法之间的运行时差异:
list(df['some_col']) # list()
df['some_col'].tolist() # .tolist()
Run Code Online (Sandbox Code Playgroud)
正如您所看到的,无论列/系列的大小,对于数字和对象数据类型列/系列,.tolist()方法都比list(). float此处未包含,但和dtype 列的图表bool与此处显示的 dtype 列的图表非常相似int。此外,包含列表的对象数据类型列的图表与此处显示的字符串列的图表非常相似。诸如'Int64Dtype'、'StringDtype'等扩展数据类型'Float64Dtype'也表现出类似的模式。
另一方面,对于datetime、timedelta和Categoricaldtype 列的两种方法实际上没有区别。
用于生成上述绘图的代码:
from perfplot import plot
kernels = [lambda s: list(s), lambda s: s.tolist()]
labels = ['list()', '.tolist()']
n_range = [2**k for k in range(4, 20)]
xlabel = 'Rows in DataFrame'
eq_chk = lambda x,y: all([x,y])
numeric = lambda n: pd.Series(range(5)).repeat(n)
string = lambda n: pd.Series(['some word', 'another word', 'a word']).repeat(n)
datetime = lambda n: pd.to_datetime(pd.Series(['2012-05-14', '2046-12-31'])).repeat(n)
timedelta = lambda n: pd.to_timedelta(pd.Series([1,2]), unit='D').repeat(n)
categorical = lambda n: pd.Series(pd.Categorical([1, 2, 3, 1, 2, 3])).repeat(n)
for n, f in [('Numeric', numeric), ('Object dtype', string),
('Datetime', datetime), ('Timedelta', timedelta),
('Categorical', categorical)]:
plot(setup=f, kernels=kernels, labels=labels, n_range=n_range,
xlabel=xlabel, title=f'{n} column', equality_check=eq_chk);
Run Code Online (Sandbox Code Playgroud)
由于这个问题引起了很多关注,并且有多种方法可以完成您的任务,让我提供几种选择。
顺便说一下,这些都是单行的;)
从...开始:
df
cluster load_date budget actual fixed_price
0 A 1/1/2014 1000 4000 Y
1 A 2/1/2014 12000 10000 Y
2 A 3/1/2014 36000 2000 Y
3 B 4/1/2014 15000 10000 N
4 B 4/1/2014 12000 11500 N
5 B 4/1/2014 90000 11000 N
6 C 7/1/2014 22000 18000 N
7 C 8/1/2014 30000 28960 N
8 C 9/1/2014 53000 51200 N
Run Code Online (Sandbox Code Playgroud)
潜在操作概述:
ser_aggCol (collapse each column to a list)
cluster [A, A, A, B, B, B, C, C, C]
load_date [1/1/2014, 2/1/2014, 3/1/2...
budget [1000, 12000, 36000, 15000...
actual [4000, 10000, 2000, 10000,...
fixed_price [Y, Y, Y, N, N, N, N, N, N]
dtype: object
ser_aggRows (collapse each row to a list)
0 [A, 1/1/2014, 1000, 4000, Y]
1 [A, 2/1/2014, 12000, 10000...
2 [A, 3/1/2014, 36000, 2000, Y]
3 [B, 4/1/2014, 15000, 10000...
4 [B, 4/1/2014, 12000, 11500...
5 [B, 4/1/2014, 90000, 11000...
6 [C, 7/1/2014, 22000, 18000...
7 [C, 8/1/2014, 30000, 28960...
8 [C, 9/1/2014, 53000, 51200...
dtype: object
df_gr (here you get lists for each cluster)
load_date budget actual fixed_price
cluster
A [1/1/2014, 2/1/2014, 3/1/2... [1000, 12000, 36000] [4000, 10000, 2000] [Y, Y, Y]
B [4/1/2014, 4/1/2014, 4/1/2... [15000, 12000, 90000] [10000, 11500, 11000] [N, N, N]
C [7/1/2014, 8/1/2014, 9/1/2... [22000, 30000, 53000] [18000, 28960, 51200] [N, N, N]
a list of separate dataframes for each cluster
df for cluster A
cluster load_date budget actual fixed_price
0 A 1/1/2014 1000 4000 Y
1 A 2/1/2014 12000 10000 Y
2 A 3/1/2014 36000 2000 Y
df for cluster B
cluster load_date budget actual fixed_price
3 B 4/1/2014 15000 10000 N
4 B 4/1/2014 12000 11500 N
5 B 4/1/2014 90000 11000 N
df for cluster C
cluster load_date budget actual fixed_price
6 C 7/1/2014 22000 18000 N
7 C 8/1/2014 30000 28960 N
8 C 9/1/2014 53000 51200 N
just the values of column load_date
0 1/1/2014
1 2/1/2014
2 3/1/2014
3 4/1/2014
4 4/1/2014
5 4/1/2014
6 7/1/2014
7 8/1/2014
8 9/1/2014
Name: load_date, dtype: object
just the values of column number 2
0 1000
1 12000
2 36000
3 15000
4 12000
5 90000
6 22000
7 30000
8 53000
Name: budget, dtype: object
just the values of row number 7
cluster C
load_date 8/1/2014
budget 30000
actual 28960
fixed_price N
Name: 7, dtype: object
============================== JUST FOR COMPLETENESS ==============================
you can convert a series to a list
['C', '8/1/2014', '30000', '28960', 'N']
<class 'list'>
you can convert a dataframe to a nested list
[['A', '1/1/2014', '1000', '4000', 'Y'], ['A', '2/1/2014', '12000', '10000', 'Y'], ['A', '3/1/2014', '36000', '2000', 'Y'], ['B', '4/1/2014', '15000', '10000', 'N'], ['B', '4/1/2014', '12000', '11500', 'N'], ['B', '4/1/2014', '90000', '11000', 'N'], ['C', '7/1/2014', '22000', '18000', 'N'], ['C', '8/1/2014', '30000', '28960', 'N'], ['C', '9/1/2014', '53000', '51200', 'N']]
<class 'list'>
the content of a dataframe can be accessed as a numpy.ndarray
[['A' '1/1/2014' '1000' '4000' 'Y']
['A' '2/1/2014' '12000' '10000' 'Y']
['A' '3/1/2014' '36000' '2000' 'Y']
['B' '4/1/2014' '15000' '10000' 'N']
['B' '4/1/2014' '12000' '11500' 'N']
['B' '4/1/2014' '90000' '11000' 'N']
['C' '7/1/2014' '22000' '18000' 'N']
['C' '8/1/2014' '30000' '28960' 'N']
['C' '9/1/2014' '53000' '51200' 'N']]
<class 'numpy.ndarray'>
Run Code Online (Sandbox Code Playgroud)
代码:
# prefix ser refers to pd.Series object
# prefix df refers to pd.DataFrame object
# prefix lst refers to list object
import pandas as pd
import numpy as np
df=pd.DataFrame([
['A', '1/1/2014', '1000', '4000', 'Y'],
['A', '2/1/2014', '12000', '10000', 'Y'],
['A', '3/1/2014', '36000', '2000', 'Y'],
['B', '4/1/2014', '15000', '10000', 'N'],
['B', '4/1/2014', '12000', '11500', 'N'],
['B', '4/1/2014', '90000', '11000', 'N'],
['C', '7/1/2014', '22000', '18000', 'N'],
['C', '8/1/2014', '30000', '28960', 'N'],
['C', '9/1/2014', '53000', '51200', 'N']
], columns=['cluster', 'load_date', 'budget', 'actual', 'fixed_price'])
print('df',df, sep='\n', end='\n\n')
ser_aggCol=df.aggregate(lambda x: [x.tolist()], axis=0).map(lambda x:x[0])
print('ser_aggCol (collapse each column to a list)',ser_aggCol, sep='\n', end='\n\n\n')
ser_aggRows=pd.Series(df.values.tolist())
print('ser_aggRows (collapse each row to a list)',ser_aggRows, sep='\n', end='\n\n\n')
df_gr=df.groupby('cluster').agg(lambda x: list(x))
print('df_gr (here you get lists for each cluster)',df_gr, sep='\n', end='\n\n\n')
lst_dfFiltGr=[ df.loc[df['cluster']==val,:] for val in df['cluster'].unique() ]
print('a list of separate dataframes for each cluster', sep='\n', end='\n\n')
for dfTmp in lst_dfFiltGr:
print('df for cluster '+str(dfTmp.loc[dfTmp.index[0],'cluster']),dfTmp, sep='\n', end='\n\n')
ser_singleColLD=df.loc[:,'load_date']
print('just the values of column load_date',ser_singleColLD, sep='\n', end='\n\n\n')
ser_singleCol2=df.iloc[:,2]
print('just the values of column number 2',ser_singleCol2, sep='\n', end='\n\n\n')
ser_singleRow7=df.iloc[7,:]
print('just the values of row number 7',ser_singleRow7, sep='\n', end='\n\n\n')
print('='*30+' JUST FOR COMPLETENESS '+'='*30, end='\n\n\n')
lst_fromSer=ser_singleRow7.tolist()
print('you can convert a series to a list',lst_fromSer, type(lst_fromSer), sep='\n', end='\n\n\n')
lst_fromDf=df.values.tolist()
print('you can convert a dataframe to a nested list',lst_fromDf, type(lst_fromDf), sep='\n', end='\n\n')
arr_fromDf=df.values
print('the content of a dataframe can be accessed as a numpy.ndarray',arr_fromDf, type(arr_fromDf), sep='\n', end='\n\n')
Run Code Online (Sandbox Code Playgroud)
正如cs95所指出的,其他方法应该优先于Pandas.values 0.24 版中的Pandas属性,请参见此处。我在这里使用它,因为大多数人(到 2019 年)仍然拥有不支持新建议的旧版本。你可以检查你的版本print(pd.__version__)
如果您的列只有一个值,pd.series.tolist()则会产生错误。为了保证它适用于所有情况,请使用以下代码:
(
df
.filter(['column_name'])
.values
.reshape(1, -1)
.ravel()
.tolist()
)
Run Code Online (Sandbox Code Playgroud)