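I have a pandas DataFrame with a column, df.Strings, that holds text strings. I want each word of those strings on its own row, with the values of the other columns repeated. For example, if I have 3 strings (and an unrelated column, Time):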
    Strings Time
0   The dog  4Pm
1  lazy dog  2Pm
2   The fox  1Pm
I want new rows that contain the individual words from each string, with the other columns left unchanged:
Strings    --- Words  --- Time
"The dog"  --- "The"  --- 4Pm
"The dog"  --- "dog"  --- 4Pm
"lazy dog" --- "lazy" --- 2Pm
"lazy dog" --- "dog"  --- 2Pm
"The fox"  --- "The"  --- 1Pm
"The fox"  --- "fox"  --- 1Pm
I know how to split the words out of the strings:
import re
string_list = '\n'.join(df.Strings.map(str))
# [A-Za-z]+ so that capitalised words such as "The" are kept whole
word_list = re.findall('[A-Za-z]+', string_list)
But how do I put these words into a DataFrame while keeping the index and the other variables? I am using Python 2.7 and pandas 0.10.1.
Edit: I now understand how to expand rows with groupby, from this question:
import re
from pandas import DataFrame

def f(group):
    row = group.irow(0)  # pandas 0.10.1 API; later versions use group.iloc[0]
    return DataFrame({'words': re.findall('[A-Za-z]+', row['Strings'])})

df.groupby('class', group_keys=False).apply(f)
I would still like to keep the other columns. Is that possible?
HYR*_*YRY 13
Here is my code, which does not use groupby(); I think it is faster.
import pandas as pd
import numpy as np
import itertools

df = pd.DataFrame({
    "strings": ["the dog", "lazy dog", "The fox jump"],
    "value": ["a", "b", "c"]})

w = df.strings.str.split()           # list of words for each row
c = w.map(len)                       # number of words in each row
idx = np.repeat(c.index, c.values)   # repeat each row label once per word
#words = np.concatenate(w.values)
words = list(itertools.chain.from_iterable(w.values))  # flatten the word lists
s = pd.Series(words, index=idx)
s.name = "words"
print(df.join(s))                    # join on the index, so the other columns repeat
Here is the result:
        strings value words
0       the dog     a   the
0       the dog     a   dog
1      lazy dog     b  lazy
1      lazy dog     b   dog
2  The fox jump     c   The
2  The fox jump     c   fox
2  The fox jump     c  jump
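
For readers on a newer pandas, the same reshaping can probably be sketched more briefly with DataFrame.explode. This is an assumption about the environment, not part of the answer above: explode was added in pandas 0.25, so it is not available on the 0.10.1 release mentioned in the question. The column names below follow the question's example, and the Words column name is chosen only for illustration.

```python
import pandas as pd

# Example data taken from the question.
df = pd.DataFrame({
    "Strings": ["The dog", "lazy dog", "The fox"],
    "Time": ["4Pm", "2Pm", "1Pm"],
})

# Split each string into a list of words, then give every list element
# its own row. explode() repeats the index label and the other columns
# for each word. Requires pandas 0.25 or later (assumption).
result = df.assign(Words=df["Strings"].str.split()).explode("Words")
print(result)
```

As in the join-based answer, the original index labels are repeated once per word; calling reset_index(drop=True) on the result would give a fresh 0..n-1 index if that is preferred.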