kab*_*ame 6 python numpy dataframe pandas
This question is similar to "Split (explode) pandas dataframe string entry to separate rows", but it also asks how to expand ranges.
I have a DataFrame:
+------+---------+----------------+
| Name | Options | Email |
+------+---------+----------------+
| Bob | 1,2,4-6 | bob@email.com |
+------+---------+----------------+
| John | NaN | john@email.com |
+------+---------+----------------+
| Mary | 1,2 | mary@email.com |
+------+---------+----------------+
| Jane | 1,3-5 | jane@email.com |
+------+---------+----------------+
I want the Options column split on commas, with extra rows added for each value in a range.
+------+---------+----------------+
| Name | Options | Email |
+------+---------+----------------+
| Bob | 1 | bob@email.com |
+------+---------+----------------+
| Bob | 2 | bob@email.com |
+------+---------+----------------+
| Bob | 4 | bob@email.com |
+------+---------+----------------+
| Bob | 5 | bob@email.com |
+------+---------+----------------+
| Bob | 6 | bob@email.com |
+------+---------+----------------+
| John | NaN | john@email.com |
+------+---------+----------------+
| Mary | 1 | mary@email.com |
+------+---------+----------------+
| Mary | 2 | mary@email.com |
+------+---------+----------------+
| Jane | 1 | jane@email.com |
+------+---------+----------------+
| Jane | 3 | jane@email.com |
+------+---------+----------------+
| Jane | 4 | jane@email.com |
+------+---------+----------------+
| Jane | 5 | jane@email.com |
+------+---------+----------------+
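How can I go beyond the concat and split approach described in the referenced SO post to achieve this? I need a way to expand the ranges.
That post uses the following code to split comma-delimited values (1,2,3):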
In [7]: a
Out[7]:
var1 var2
0 a,b,c 1
1 d,e,f 2
In [55]: pd.concat([Series(row['var2'], row['var1'].split(','))
for _, row in a.iterrows()]).reset_index()
Out[55]:
index 0
0 a 1
1 b 1
2 c 1
3 d 2
4 e 2
5 f 2
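Thanks in advance for your suggestions!
Update 2/14: the sample data has been updated to match my current situation.
If I understand what you need: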
def yourfunc(s):
    ranges = (x.split("-") for x in s.split(","))
    return [i for r in ranges for i in range(int(r[0]), int(r[-1]) + 1)]
df.Options = df.Options.apply(yourfunc)
df
Out[114]:
Name Options Email
0 Bob [1, 2, 4, 5, 6] bob@email.com
1 Jane [1, 3, 4, 5] jane@email.com
df.set_index(['Name','Email']).Options.apply(pd.Series).stack().reset_index().drop(columns='level_2')
Out[116]:
Name Email 0
0 Bob bob@email.com 1.0
1 Bob bob@email.com 2.0
2 Bob bob@email.com 4.0
3 Bob bob@email.com 5.0
4 Bob bob@email.com 6.0
5 Jane jane@email.com 1.0
6 Jane jane@email.com 3.0
7 Jane jane@email.com 4.0
8 Jane jane@email.com 5.0
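The answer above was written against the sample data before the update, so it does not cover the NaN row for John. As a minimal sketch, assuming pandas 0.25 or later (where DataFrame.explode is available), the same idea can be written so that missing values pass through. The helper name expand_options is hypothetical and not part of the original answer.

```python
import numpy as np
import pandas as pd

def expand_options(s):
    """Expand a string such as '1,2,4-6' into [1, 2, 4, 5, 6]; keep NaN as [NaN]."""
    if pd.isna(s):
        return [np.nan]
    out = []
    for part in str(s).split(","):
        # "4-6" -> ("4", "-", "6"); "1" -> ("1", "", "")
        lo, _, hi = part.strip().partition("-")
        out.extend(range(int(lo), int(hi or lo) + 1))
    return out

df = pd.DataFrame({
    "Name": ["Bob", "John", "Mary", "Jane"],
    "Options": ["1,2,4-6", np.nan, "1,2", "1,3-5"],
    "Email": ["bob@email.com", "john@email.com",
              "mary@email.com", "jane@email.com"],
})

# Turn each cell into a list, then give every list element its own row.
result = (
    df.assign(Options=df["Options"].apply(expand_options))
      .explode("Options")
      .reset_index(drop=True)
)
print(result)
```

The Options column comes back with object dtype because it mixes integers and NaN; convert it afterwards if a numeric dtype is needed.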
Start with a custom replacement function:
def replace(x):
    i, j = map(int, x.groups())
    return ','.join(map(str, range(i, j + 1)))
Store the column names somewhere; we will use them later:
c = df.columns
Next, replace the ranges in df.Options, then split on commas:
v = df.Options.str.replace(r'(\d+)-(\d+)', replace, regex=True).str.split(',')
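To see what the replacement callback does to a single value, here is a minimal sketch using the standard library's re.sub, which calls the function once per match in the same way. The name expand_range is a stand-in for the replace function above, chosen only for this illustration.

```python
import re

def expand_range(match):
    # match.groups() holds the two captured numbers, e.g. ("4", "6")
    start, stop = map(int, match.groups())
    return ",".join(map(str, range(start, stop + 1)))

# Each "a-b" range is rewritten in place; plain numbers are left alone.
expanded = re.sub(r"(\d+)-(\d+)", expand_range, "1,2,4-6")
print(expanded)             # 1,2,4,5,6
print(expanded.split(","))  # ['1', '2', '4', '5', '6']
```

Note that the split values are still strings at this point.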
Next, reshape the data and finally load it into a new DataFrame:
df = pd.DataFrame(
    df.drop(columns='Options').values.repeat(v.str.len(), axis=0)
)
df.insert(c.get_loc('Options'), len(c) - 1, np.concatenate(v))
df.columns = c
df
Name Options Email
0 Bob 1 bob@email.com
1 Bob 2 bob@email.com
2 Bob 4 bob@email.com
3 Bob 5 bob@email.com
4 Bob 6 bob@email.com
5 Jane 1 jane@email.com
6 Jane 3 jane@email.com
7 Jane 4 jane@email.com
8 Jane 5 jane@email.com
I like using np.r_ and slice.
I know it looks like a mess, but beauty is in the eye of the beholder.
def parse(o):
    mm = lambda i: slice(min(i), max(i) + 1)
    return np.r_.__getitem__(tuple(
        mm(list(map(int, s.strip().split('-')))) for s in o.split(',')
    ))
r = df.Options.apply(parse)
new = np.concatenate(r.values)
lens = r.str.len()
df.loc[df.index.repeat(lens)].assign(Options=new)
Name Options Email
0 Bob 1 bob@email.com
0 Bob 2 bob@email.com
0 Bob 4 bob@email.com
0 Bob 5 bob@email.com
0 Bob 6 bob@email.com
2 Mary 1 mary@email.com
2 Mary 2 mary@email.com
3 Jane 1 jane@email.com
3 Jane 3 jane@email.com
3 Jane 4 jane@email.com
3 Jane 5 jane@email.com
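The row-duplication step in the code above (df.index.repeat combined with loc and assign) may be easier to follow in isolation. This is a small sketch on made-up data with no ranges, so only the repeat-and-overwrite mechanics are shown; the values are illustrative and not from the question.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Name": ["Bob", "Mary"], "Options": ["1,2,4", "7"]})

parsed = df["Options"].str.split(",")   # one list of strings per row
lens = parsed.str.len()                 # how many rows each original row becomes
flat = np.concatenate(parsed.values).astype(int)  # all values in one flat array

# Repeat each index label lens[i] times, select those rows,
# then overwrite Options with the flattened values.
out = df.loc[df.index.repeat(lens)].assign(Options=flat)
print(out)
```

The original index labels are repeated in the result, which is why the output above shows 0, 0, 0, ... rather than a fresh range; call reset_index(drop=True) if that is not wanted.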
Explanation
np.r_ takes different slicers and indexers and returns a combined array.
np.r_[1, 4:7]
array([1, 4, 5, 6])
Or
np.r_[slice(1, 2), slice(4, 7)]
array([1, 4, 5, 6])
However, if I need to pass an arbitrary number of them, I need to pass a tuple to np.r_'s __getitem__ method.
np.r_.__getitem__((slice(1, 2), slice(4, 7), slice(10, 14)))
array([ 1, 4, 5, 6, 10, 11, 12, 13])
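As a side note, Python's indexing syntax already hands a tuple to __getitem__, so indexing np.r_ with a prebuilt tuple in square brackets should give the same result as the explicit method call. A minimal sketch of that equivalence, with variable names chosen only for illustration:

```python
import numpy as np

slices = (slice(1, 2), slice(4, 7), slice(10, 14))

via_getitem = np.r_.__getitem__(slices)  # explicit call, as in the answer
via_brackets = np.r_[slices]             # obj[t] passes the tuple t unchanged

print(via_getitem)
print(via_brackets)
```

Which spelling to use is a matter of taste; the explicit call makes it obvious that a tuple is being passed.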
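So I iterate, parse, make slices, and pass them to np.r_.__getitem__.
After applying my cool parser, I use a combination of loc, pd.Index.repeat, and pd.Series.str.len, then pd.DataFrame.assign to overwrite the existing column.
NOTE
If your Options column has bad characters in it, I would try filtering like this.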
df = df.dropna(subset=['Options']).astype(dict(Options=str)) \
    .replace(dict(Options={r'[^0-9,\-]': ''}), regex=True) \
    .query('Options != ""')
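To show what this filter keeps and drops, here is a small sketch on made-up input (the messy values below are assumptions for illustration, not data from the question). It also shows a caveat: stripping every unexpected character can merge neighbouring numbers when the bad character was acting as a separator.

```python
import numpy as np
import pandas as pd

raw = pd.DataFrame({
    "Name": ["Bob", "John", "Mary", "Ann"],
    "Options": ["1, 2, 4-6", np.nan, "1;2", "n/a"],
})

clean = (
    raw.dropna(subset=["Options"])                  # John's NaN row is dropped
       .astype({"Options": str})
       .replace({"Options": {r"[^0-9,\-]": ""}}, regex=True)  # strip anything but digits, commas, hyphens
       .query('Options != ""')                      # "n/a" became "" and is dropped
)
print(clean)
# "1, 2, 4-6" becomes "1,2,4-6", but "1;2" becomes "12",
# so check for alternative separators before stripping.
```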