Output all rows with word count in a column greater than 3

Question

I have this dummy df:

import pandas as pd

columns = ['answer', 'some_number']
data = [['hello how are you doing', '1.0'],
        ['hello', '1.0'],
        ['bye bye bye bye', '0.0'],
        ['no', '0.0'],
        ['yes', '1.0'],
        ['Who let the dogs out', '0.0'],
        ['1 + 1 + 1 + 2', '1.0']]
df = pd.DataFrame(columns=columns, data=data)

I want to output the rows with a word count greater than 3. Here that would be the rows 'hello how are you doing', 'bye bye bye bye', 'Who let the dogs out', and '1 + 1 + 1 + 2'.

My approach doesn't work: df[len(df.answer) > 3]

Output: KeyError: True

Answer 1

ank*_*_91 10

If the separator is ' ', you can try series.str.count; otherwise, replace the separator with your own:

n=3
df[df['answer'].str.count(' ').gt(n-1)]

To handle multiple spaces between words (credit to @piRSquared), count runs of whitespace instead:

df[df['answer'].str.count(r'\s+').gt(n-1)]
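
To see how the two counting patterns differ, here is a minimal sketch on a made-up Series (the strings are hypothetical, not from the question). It compares the literal-space count, the whitespace-run count, and the word count from `str.split()`:

```python
import pandas as pd

# Hypothetical sample: double spaces in the second string,
# leading/trailing spaces in the third.
s = pd.Series(["a b c d", "a  b  c", "  a b c d  "])

literal = s.str.count(" ")       # counts every single space character
runs = s.str.count(r"\s+")       # counts runs of consecutive whitespace
words = s.str.split().str.len()  # number of whitespace-separated words

print(pd.DataFrame({"literal": literal, "runs": runs, "words": words}))
```

Note that both count-based patterns overcount when a string has leading or trailing whitespace, so stripping first with `s.str.strip()` may be worth considering if the data is not clean.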

Or using list comprehension:

n = 3
df[[len(i.split()) > n for i in df['answer']]]  # should be faster than the above

                    answer some_number
0  hello how are you doing         1.0
2          bye bye bye bye         0.0
5     Who let the dogs out         0.0
6            1 + 1 + 1 + 2         1.0

I voted for `count` because it doesn't waste resources creating lists. However, to include possible multiple spaces: `df['answer'].str.count('\s+').gt(2)` (2 upvotes)

Answer 2

tdy*_*tdy 7

A couple more options using str.split():

Combine with str.len():
```
df[df.answer.str.split().str.len().gt(n)]
```
Or combine with apply(len):
```
df[df.answer.str.split().apply(len).gt(n)]
```
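
One practical difference between the two variants, sketched on a made-up Series with a missing entry (the data here is hypothetical): `str.len()` should pass the missing value through as NaN, while `apply(len)` calls `len()` on it and fails.

```python
import pandas as pd

# Hypothetical data (not from the question): the second entry is missing.
answers = pd.Series(["hello how are you doing", None, "no"])
n = 3

# str.len() propagates the missing value as NaN, and NaN > n compares as False.
mask = answers.str.split().str.len().gt(n)
print(mask.tolist())

# apply(len) calls len() on the missing value, which raises TypeError.
try:
    answers.str.split().apply(len)
except TypeError as exc:
    print("apply(len) failed:", exc)
```

If the column can contain missing values, the `str.len()` form is probably the safer of the two.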

What's fastest?

Fastest overall (BENY's list comprehension):
```
df[[x.count(' ') >= n for x in df.answer]]
```
Fastest pandas-based (anky's first answer):
```
df[df.answer.str.count(' ').ge(n)]
```

Timed with ~20 words per sentence:
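
A rough sketch for rerunning a comparison like this yourself with `timeit`. The 10,000-row frame and the repeated 20-word sentence are assumptions, not the original benchmark setup, and the ranking may differ by machine, data, and pandas version:

```python
import timeit

import pandas as pd

# Hypothetical benchmark data: 10,000 identical sentences of 20 words each.
sentence = " ".join(["word"] * 20)
df = pd.DataFrame({"answer": [sentence] * 10_000})
n = 3

# The four approaches discussed in the answers.
candidates = {
    "list comprehension + count": lambda: df[[x.count(" ") >= n for x in df.answer]],
    "str.count": lambda: df[df.answer.str.count(" ").ge(n)],
    "str.split + str.len": lambda: df[df.answer.str.split().str.len().gt(n)],
    "str.split + apply(len)": lambda: df[df.answer.str.split().apply(len).gt(n)],
}

timings = {}
for name, func in candidates.items():
    # Best of 3 repeats with 5 calls each, stored as seconds per call.
    timings[name] = min(timeit.repeat(func, number=5, repeat=3)) / 5

# Print the fastest approach first.
for name, seconds in sorted(timings.items(), key=lambda item: item[1]):
    print(f"{name:28s} {seconds * 1e3:8.2f} ms")
```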

Why doesn't `df[len(df.answer) > 3]` work?

len(df.answer) returns the length of the answer column itself (7), not the number of words per answer (5, 1, 4, 1, 1, 5, 7).

That means the final expression evaluates to df[7 > 3] or df[True], which breaks because there is no column True:

>>> len(df.answer)
7

>>> len(df.answer) > 3     # 7 > 3
True

>>> df[len(df.answer) > 3] # df[True] doesn't exist
KeyError: True
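
For contrast, here is a minimal sketch of the same check done per row, using the question's data. The names `word_counts` and `mask` are just illustrative:

```python
import pandas as pd

columns = ['answer', 'some_number']
data = [['hello how are you doing', '1.0'],
        ['hello', '1.0'],
        ['bye bye bye bye', '0.0'],
        ['no', '0.0'],
        ['yes', '1.0'],
        ['Who let the dogs out', '0.0'],
        ['1 + 1 + 1 + 2', '1.0']]
df = pd.DataFrame(columns=columns, data=data)

# len() on the column gives one number: the row count.
print(len(df.answer))        # 7

# The .str accessor works element-wise, giving one word count per row.
word_counts = df.answer.str.split().str.len()
print(word_counts.tolist())  # [5, 1, 4, 1, 1, 5, 7]

# Comparing a Series yields a boolean Series, which df[...] treats as a row mask.
mask = word_counts > 3
print(df[mask])
```

The key point is that `df[...]` needs a boolean Series (one value per row) to filter rows; a single `True` is looked up as a column label instead.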

Answer 3

tim*_*geb 6

If I understand this correctly, here's one way:

>>> df.loc[df['answer'].str.split().apply(len) > 3, 'answer']
0    hello how are you doing
2            bye bye bye bye
5       Who let the dogs out
6              1 + 1 + 1 + 2

Archived:	4 years, 9 months ago
Views:	263
Last recorded:	4 years, 4 months ago