Python Pandas - What is the difference between 'loc' and 'where'?

Sco*_*tEU 8 python pandas

Just curious about the behaviour of 'where' and why you would use it over 'loc'.

If I create a dataframe:

import pandas as pd

df = pd.DataFrame({'ID':[1,2,3,4,5,6,7,8,9,10], 
                   'Run Distance':[234,35,77,787,243,5435,775,123,355,123],
                   'Goals':[12,23,56,7,8,0,4,2,1,34],
                   'Gender':['m','m','m','f','f','m','f','m','f','m']})

Then apply the 'where' function:

df2 = df.where(df['Goals']>10)

I get the following, which keeps the rows where Goals > 10 but leaves every other row as NaN:

  Gender  Goals    ID  Run Distance
0      m   12.0   1.0         234.0
1      m   23.0   2.0          35.0
2      m   56.0   3.0          77.0
3    NaN    NaN   NaN           NaN
4    NaN    NaN   NaN           NaN
5    NaN    NaN   NaN           NaN
6    NaN    NaN   NaN           NaN
7    NaN    NaN   NaN           NaN
8    NaN    NaN   NaN           NaN
9      m   34.0  10.0         123.0

However, if I use the 'loc' function:

df2 = df.loc[df['Goals']>10]

It returns a dataframe containing only the matching subset, with no NaN values:

  Gender  Goals  ID  Run Distance
0      m     12   1           234
1      m     23   2            35
2      m     56   3            77
9      m     34  10           123

So basically I am curious why you would use 'where' over 'loc/iloc', and why it returns NaN values?

Jos*_*der 8

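Think of loc as a filter: it gives you only the parts of the df that meet the condition.

where originally comes from numpy. It runs over an array and checks whether each element meets the condition, so it returns the entire array, holding either the original value or NaN. A nice feature of where is that you can also get back something different, e.g. df2 = df.where(df['Goals']>10, other='0'), which replaces the values that do not meet the condition with 0 (here the string '0'; pass other=0 for an integer zero).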

ID  Run Distance Goals Gender
0   1   234      12     m
1   2   35       23     m
2   3   77       56     m
3   0   0        0      0
4   0   0        0      0
5   0   0        0      0
6   0   0        0      0
7   0   0        0      0
8   0   0        0      0
9   10  123      34     m
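
To make the "whole array versus filtered part" distinction concrete, here is a minimal sketch on a cut-down version of the question's dataframe (only the ID and Goals columns are kept, which is a simplification for illustration). It compares the shapes of the two results and uses an integer `other=0` rather than the string `'0'`.

```python
import pandas as pd

# Cut-down version of the question's data (assumption: two columns are enough here).
df = pd.DataFrame({'ID': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                   'Goals': [12, 23, 56, 7, 8, 0, 4, 2, 1, 34]})

cond = df['Goals'] > 10

masked = df.where(cond, other=0)  # same shape as df; non-matching rows become 0
filtered = df.loc[cond]           # only the rows where cond is True

print(masked.shape)    # (10, 2)
print(filtered.shape)  # (4, 2)
```

The row labels survive in both cases: `filtered` keeps the original index values 0, 1, 2 and 9 rather than renumbering them.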

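Also, while where is only for conditional filtering, loc is the standard way of selecting in Pandas, along with iloc. loc uses row and column names, while iloc uses their integer positions. So with loc you could choose to return, say, df.loc[0:1, ['Gender', 'Goals']]: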

  Gender  Goals
0      m     12
1      m     23
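
The label-versus-position difference is easier to see when the row labels are not integers. The sketch below uses a small hypothetical frame with string labels 'a', 'b', 'c' (not the question's data) and also shows why `df.loc[0:1]` above returns two rows: a label slice includes its end label, while a positional slice excludes its end.

```python
import pandas as pd

# Hypothetical frame with string row labels, chosen only to separate labels from positions.
df = pd.DataFrame({'Goals': [12, 23, 56], 'Gender': ['m', 'm', 'f']},
                  index=['a', 'b', 'c'])

by_label = df.loc['a':'b', ['Gender', 'Goals']]  # label slice: end label 'b' is included
by_position = df.iloc[0:1, [1, 0]]               # positional slice: end position 1 is excluded

print(by_label)     # rows 'a' and 'b'
print(by_position)  # row 'a' only
```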


jez*_*ael 6

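If you check the docs for DataFrame.where, it replaces the values where the condition is False; the replacement is NaN by default, but you can specify a value: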

df2 = df.where(df['Goals']>10)
print (df2)
     ID  Run Distance  Goals Gender
0   1.0         234.0   12.0      m
1   2.0          35.0   23.0      m
2   3.0          77.0   56.0      m
3   NaN           NaN    NaN    NaN
4   NaN           NaN    NaN    NaN
5   NaN           NaN    NaN    NaN
6   NaN           NaN    NaN    NaN
7   NaN           NaN    NaN    NaN
8   NaN           NaN    NaN    NaN
9  10.0         123.0   34.0      m

df2 = df.where(df['Goals']>10, 100)
print (df2)
    ID  Run Distance  Goals Gender
0    1           234     12      m
1    2            35     23      m
2    3            77     56      m
3  100           100    100    100
4  100           100    100    100
5  100           100    100    100
6  100           100    100    100
7  100           100    100    100
8  100           100    100    100
9   10           123     34      m
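
The two outputs above also differ in dtype: the numbers print as 12.0 in the first and 12 in the second. A minimal sketch of why, on a small hypothetical frame: NaN is a float, so the integer columns are upcast when it is used as the fill, whereas an integer fill value leaves them as integers.

```python
import pandas as pd

# Small hypothetical frame; the values are illustrative only.
df = pd.DataFrame({'ID': [1, 2, 3, 4], 'Goals': [12, 7, 56, 0]})
cond = df['Goals'] > 10

with_nan = df.where(cond)        # NaN fill forces the integer columns to float
with_fill = df.where(cond, 100)  # integer fill keeps the integer dtype

print(with_nan.dtypes)
print(with_fill.dtypes)

# Dropping the NaN rows leaves the same row labels that loc would select.
print(with_nan.dropna().index.equals(df.loc[cond].index))
```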

The other syntax is called boolean indexing and is used for filtering rows: it keeps the rows that match the condition and removes the rest.

df2 = df.loc[df['Goals']>10]
#alternative
df2 = df[df['Goals']>10]

print (df2)
   ID  Run Distance  Goals Gender
0   1           234     12      m
1   2            35     23      m
2   3            77     56      m
9  10           123     34      m
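
Boolean indexing also accepts combined conditions. The sketch below, on a small hypothetical frame, joins two conditions with `&`; each comparison is wrapped in parentheses because `&` binds more tightly than `>` and `==` in Python.

```python
import pandas as pd

# Small hypothetical frame; the values are illustrative only.
df = pd.DataFrame({'Goals': [12, 23, 56, 7, 34],
                   'Gender': ['m', 'm', 'f', 'f', 'm']})

# Parenthesise each comparison before combining them with &.
both = df[(df['Goals'] > 10) & (df['Gender'] == 'm')]
same = df.loc[(df['Goals'] > 10) & (df['Gender'] == 'm')]

print(both)
print(both.equals(same))  # True: the two spellings select the same rows
```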

With loc you can also filter rows by condition and columns by name at the same time:

s = df.loc[df['Goals']>10, 'ID']
print (s)
0     1
1     2
2     3
9    10
Name: ID, dtype: int64

df2 = df.loc[df['Goals']>10, ['ID','Gender']]
print (df2)
   ID Gender
0   1      m
1   2      m
2   3      m
9  10      m


Anonymous user 5

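  • loc retrieves only the rows that match the condition.
  • where returns the whole dataframe, replacing the rows that do not match the condition (with NaN by default).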
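
One practical consequence of the two points above, sketched on a small hypothetical frame (the column name `Goals_over_10` is made up for illustration): because where keeps every row, its result lines up with the original frame and can be stored as a new column, while the result of loc is a shorter frame.

```python
import pandas as pd

# Small hypothetical frame; the values are illustrative only.
df = pd.DataFrame({'ID': [1, 2, 3, 4], 'Goals': [12, 7, 56, 0]})

# where on a single column: one value per original row, so it fits back into df.
df['Goals_over_10'] = df['Goals'].where(df['Goals'] > 10, 0)

# loc: only the matching rows, restricted to the named columns.
top = df.loc[df['Goals'] > 10, ['ID', 'Goals']]

print(df)
print(top)
```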