Comparing string lengths across different columns of a DataFrame

Question

Ber*_*rdL 5 python min string-length dataframe pandas

I am trying to get the string lengths of different columns. It seems simple enough:

df['a'].str.len()

But I need to apply it to several columns and then get the minimum.

Something like:

df[['a','b','c']].str.len().min

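I know the above does not work, but hopefully you get the idea. Columns a, b and c all contain names, and I want to retrieve the shortest name.

Also, because the data is large, I am avoiding creating additional columns to save memory.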

Answer 1

jez*_*ael 4

I think you need a list comprehension, because the string functions work only on a Series (a single column):

print ([df[col].str.len().min() for col in ['a','b','c']])

Another solution uses apply:

print ([df[col].apply(len).min() for col in ['a','b','c']])

Sample:

import pandas as pd

df = pd.DataFrame({'a':['h','gg','yyy'],
                   'b':['st','dsws','sw'],
                   'c':['fffff','','rr'],
                   'd':[1,3,5]})

print (df)

     a     b      c  d
0    h    st  fffff  1
1   gg  dsws         3
2  yyy    sw     rr  5

print ([df[col].str.len().min() for col in ['a','b','c']])
[1, 2, 0]
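
The question asks for the shortest name itself, not only its length. A minimal sketch of one way to get it, assuming the sample frame above and a unique index; the name shortest is introduced here only for illustration:

```python
import pandas as pd

df = pd.DataFrame({'a': ['h', 'gg', 'yyy'],
                   'b': ['st', 'dsws', 'sw'],
                   'c': ['fffff', '', 'rr'],
                   'd': [1, 3, 5]})

# For each column, idxmin() returns the index label of the first shortest
# string, and .loc looks that value up. No helper columns are added to df.
# Assumption: the index labels are unique, so .loc returns a single value.
shortest = {col: df.loc[df[col].str.len().idxmin(), col]
            for col in ['a', 'b', 'c']}
print(shortest)
# {'a': 'h', 'b': 'st', 'c': ''}
```

When several strings tie for the minimum length, idxmin() picks the first one, which is why column b yields 'st' rather than 'sw'.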

Timings:

#[3000 rows x 4 columns]
df = pd.concat([df]*1000).reset_index(drop=True)

In [17]: %timeit ([df[col].apply(len).min() for col in ['a','b','c']])
100 loops, best of 3: 2.63 ms per loop

In [18]: %timeit ([df[col].str.len().min() for col in ['a','b','c']])
The slowest run took 4.12 times longer than the fastest. This could mean that an intermediate result is being cached.
100 loops, best of 3: 2.88 ms per loop

Conclusion:

In these timings apply is slightly faster, but it does not work with None.

df = pd.DataFrame({'a':['h','gg','yyy'],
                   'b':[None,'dsws','sw'],
                   'c':['fffff','','rr'],
                   'd':[1,3,5]})


print (df)
     a     b      c  d
0    h  None  fffff  1
1   gg  dsws         3
2  yyy    sw     rr  5

print ([df[col].apply(len).min() for col in ['a','b','c']])

TypeError: object of type 'NoneType' has no len()

print ([df[col].str.len().min() for col in ['a','b','c']])
[1, 2.0, 0]
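
The 2.0 for column b appears because .str.len() returns NaN for the missing value, which makes that column's lengths floating point; min() then skips the NaN. If apply is still preferred, one possible workaround is to drop missing values first. This is a sketch under that assumption, not part of the original answer:

```python
import pandas as pd

df = pd.DataFrame({'a': ['h', 'gg', 'yyy'],
                   'b': [None, 'dsws', 'sw'],
                   'c': ['fffff', '', 'rr'],
                   'd': [1, 3, 5]})

# dropna() removes None/NaN before len is applied, so no TypeError is raised.
# int() converts the NumPy integer to a plain Python int for clean printing.
mins = [int(df[col].dropna().apply(len).min()) for col in ['a', 'b', 'c']]
print(mins)
# [1, 2, 0]
```

The extra dropna() pass adds some work per column, so the small speed advantage of apply measured above may not carry over.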

Edit, in response to a comment:

# fails with None
print (df[['a','b','c']].applymap(len).min(axis=1))
0    1
1    0
2    2
dtype: int64

# works with None
print (df[['a','b','c']].apply(lambda x: x.str.len().min(), axis=1))
0    1
1    0
2    2
dtype: int64
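
Row-wise apply with axis=1 calls the lambda once per row, which can be slow on a large frame. The sketch below shows a column-wise variant that computes all lengths first and then reduces; it builds a temporary frame of lengths but adds no columns to df. The names lengths, row_min and overall_min are introduced here for illustration. Note also that applymap is deprecated since pandas 2.1 in favour of DataFrame.map.

```python
import pandas as pd

df = pd.DataFrame({'a': ['h', 'gg', 'yyy'],
                   'b': [None, 'dsws', 'sw'],
                   'c': ['fffff', '', 'rr'],
                   'd': [1, 3, 5]})

# One vectorised .str.len() call per column; None becomes NaN.
lengths = df[['a', 'b', 'c']].apply(lambda s: s.str.len())

row_min = lengths.min(axis=1)      # shortest length in each row, NaN skipped
overall_min = lengths.min().min()  # shortest length across all three columns
print(row_min)
print(overall_min)
```

Because column b contains a missing value, the results here are floating point rather than integer.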

Archived: 9 years, 1 month ago
Viewed: 590 times
Last activity: 9 years, 1 month ago