Split (explode) ranges in a DataFrame into multiple rows

kab*_*ame 6 python numpy dataframe pandas

This question is similar to "Split (explode) pandas dataframe string entry to separate rows", but also asks how to expand ranges.

I have a DataFrame:

+------+---------+----------------+
| Name | Options | Email          |
+------+---------+----------------+
| Bob  | 1,2,4-6 | bob@email.com  |
+------+---------+----------------+
| John |   NaN   | john@email.com |
+------+---------+----------------+
| Mary |   1,2   | mary@email.com |
+------+---------+----------------+
| Jane | 1,3-5   | jane@email.com |
+------+---------+----------------+

I want the Options column split on commas, with additional rows added for each range:

+------+---------+----------------+
| Name | Options | Email          |
+------+---------+----------------+
| Bob  | 1       | bob@email.com  |
+------+---------+----------------+
| Bob  | 2       | bob@email.com  |
+------+---------+----------------+
| Bob  | 4       | bob@email.com  |
+------+---------+----------------+
| Bob  | 5       | bob@email.com  |
+------+---------+----------------+
| Bob  | 6       | bob@email.com  |
+------+---------+----------------+
| John | NaN     | john@email.com |
+------+---------+----------------+
| Mary | 1       | mary@email.com |
+------+---------+----------------+
| Mary | 2       | mary@email.com |
+------+---------+----------------+
| Jane | 1       | jane@email.com |
+------+---------+----------------+
| Jane | 3       | jane@email.com |
+------+---------+----------------+
| Jane | 4       | jane@email.com |
+------+---------+----------------+
| Jane | 5       | jane@email.com |
+------+---------+----------------+

How can I go beyond the concat and split approach described in the referenced SO post to achieve this? I need a way to expand the ranges.

That post uses the following code to split comma-separated values (1,2,3):

In [7]: a
Out[7]: 
    var1  var2
0  a,b,c     1
1  d,e,f     2

In [55]: pd.concat([Series(row['var2'], row['var1'].split(','))              
                    for _, row in a.iterrows()]).reset_index()
Out[55]: 
  index  0
0     a  1
1     b  1
2     c  1
3     d  2
4     e  2
5     f  2

Thanks in advance for your suggestions!

Update 2/14: the sample data has been updated to match my current situation.

WeN*_*Ben 6

If I understand what you need:

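def yourfunc(s):
    # Split "1,2,4-6" into [["1"], ["2"], ["4", "6"]]
    ranges = (x.split("-") for x in s.split(","))
    # A single number has r[0] == r[-1], so it expands to itself
    return [i for r in ranges for i in range(int(r[0]), int(r[-1]) + 1)]


df.Options = df.Options.apply(yourfunc)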

df
Out[114]: 
   Name          Options           Email
0   Bob  [1, 2, 4, 5, 6]   bob@email.com
1  Jane     [1, 3, 4, 5]  jane@email.com


df.set_index(['Name','Email']).Options.apply(pd.Series).stack().reset_index().drop(columns='level_2')
Out[116]: 
   Name           Email    0
0   Bob   bob@email.com  1.0
1   Bob   bob@email.com  2.0
2   Bob   bob@email.com  4.0
3   Bob   bob@email.com  5.0
4   Bob   bob@email.com  6.0
5  Jane  jane@email.com  1.0
6  Jane  jane@email.com  3.0
7  Jane  jane@email.com  4.0
8  Jane  jane@email.com  5.0
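One thing to watch with the sample data in the question: the John row has NaN in Options, and yourfunc does not handle that, because a float has no split method. Below is a minimal sketch of one way around it. It assumes pandas 0.25 or later for DataFrame.explode, which is a different technique from the apply(pd.Series).stack() chain above, and expand_options is a name made up for this example.

```python
import numpy as np
import pandas as pd


def expand_options(s):
    """Expand '1,2,4-6' into [1, 2, 4, 5, 6]; pass missing values through."""
    if pd.isna(s):
        return s  # keep NaN as a scalar so explode preserves the row
    out = []
    for part in str(s).split(","):
        bounds = part.strip().split("-")
        lo, hi = int(bounds[0]), int(bounds[-1])
        out.extend(range(lo, hi + 1))
    return out


# Sample data copied from the question
df = pd.DataFrame({
    "Name": ["Bob", "John", "Mary", "Jane"],
    "Options": ["1,2,4-6", np.nan, "1,2", "1,3-5"],
    "Email": ["bob@email.com", "john@email.com",
              "mary@email.com", "jane@email.com"],
})

result = (
    df.assign(Options=df["Options"].apply(expand_options))
      .explode("Options")          # one row per list element
      .reset_index(drop=True)
)
print(result)
```

The Options column comes back with object dtype because it mixes integers and NaN; convert it afterwards if a numeric dtype is needed.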


cs9*_*s95 5

Start with a custom replacement function:

def replace(x):
    i, j = map(int, x.groups())
    return ','.join(map(str, range(i, j + 1)))

Store the column names somewhere; we will use them later:

c = df.columns

Next, replace the ranges in df.Options, then split on commas:

v = df.Options.str.replace(r'(\d+)-(\d+)', replace, regex=True).str.split(',')
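To see what this step does to a single value, here is a small sketch that uses the standard library's re.sub on one string instead of the pandas string accessor. The sample string and the name expanded are only for illustration.

```python
import re


def replace(x):
    # x is a regex match object; its two groups are the range endpoints
    i, j = map(int, x.groups())
    return ','.join(map(str, range(i, j + 1)))


# Each "a-b" range is rewritten as "a,a+1,...,b"; plain numbers are untouched
expanded = re.sub(r'(\d+)-(\d+)', replace, '1,2,4-6')
print(expanded)             # 1,2,4,5,6
print(expanded.split(','))  # ['1', '2', '4', '5', '6']
```

Note that the split pieces are strings, not integers, so the resulting Options column holds strings unless it is converted afterwards.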

Next, reshape the data and load it into a new DataFrame:

df = pd.DataFrame(
       df.drop(columns='Options').values.repeat(v.str.len(), axis=0)
)
df.insert(c.get_loc('Options'), len(c) - 1, np.concatenate(v))
df.columns = c

df

   Name Options           Email
0   Bob       1   bob@email.com
1   Bob       2   bob@email.com
2   Bob       4   bob@email.com
3   Bob       5   bob@email.com
4   Bob       6   bob@email.com
5  Jane       1  jane@email.com
6  Jane       3  jane@email.com
7  Jane       4  jane@email.com
8  Jane       5  jane@email.com


piR*_*red 5

I like using np.r_ and slice for this.
I know it looks like a mess, but beauty is in the eye of the beholder.

def parse(o):
    mm = lambda i: slice(min(i), max(i) + 1)
    return np.r_.__getitem__(tuple(
        mm(list(map(int, s.strip().split('-')))) for s in o.split(',')
    ))

r = df.Options.apply(parse)
new = np.concatenate(r.values)
lens = r.str.len()

df.loc[df.index.repeat(lens)].assign(Options=new)

   Name  Options           Email
0   Bob        1   bob@email.com
0   Bob        2   bob@email.com
0   Bob        4   bob@email.com
0   Bob        5   bob@email.com
0   Bob        6   bob@email.com
2  Mary        1  mary@email.com
2  Mary        2  mary@email.com
3  Jane        1  jane@email.com
3  Jane        3  jane@email.com
3  Jane        4  jane@email.com
3  Jane        5  jane@email.com

Explanation

  • np.r_ takes different slicers and indexers and returns the combined array.

    np.r_[1, 4:7]
    array([1, 4, 5, 6])
    

    or

    np.r_[slice(1, 2), slice(4, 7)]
    array([1, 4, 5, 6])
    

    However, if I need to pass an arbitrary number of them, I need to pass a tuple to np.r_'s __getitem__ method.

    np.r_.__getitem__((slice(1, 2), slice(4, 7), slice(10, 14)))
    array([ 1,  4,  5,  6, 10, 11, 12, 13])
    

    So I iterate, parse, build the slices, and pass them to np.r_.__getitem__

  • After applying my cool parser, use a combination of loc, pd.Index.repeat, and pd.Series.str.len

  • Use pd.DataFrame.assign to overwrite the existing column
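
The last two bullets can be seen on a small frame. This is only a sketch: the two-row frame and the pre-parsed parsed Series are made up for illustration and stand in for df.Options.apply(parse).

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Name": ["Bob", "Mary"], "Options": ["4-6", "1,2"]})

# Hypothetical pre-parsed values, standing in for df.Options.apply(parse)
parsed = pd.Series([[4, 5, 6], [1, 2]], index=df.index)

lens = parsed.str.len()               # number of values per original row
new = np.concatenate(parsed.values)   # flat array of all values, in row order

# Repeat each index label once per value, then overwrite Options
out = df.loc[df.index.repeat(lens)].assign(Options=new)
print(out)
```

The repeated rows keep their original index labels (0, 0, 0, 1, 1 here), which is why the output above shows duplicate index values; chain reset_index(drop=True) if a fresh index is preferred.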

__NOTE__
If your Options column has bad characters in it, I would try filtering like this.

df = df.dropna(subset=['Options']).astype(dict(Options=str)) \
       .replace(dict(Options={r'[^0-9,\-]': ''}), regex=True) \
       .query('Options != ""')
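
To get a feel for what that character class keeps and drops, here is a small sketch using the standard library's re module on plain strings. The sample strings are invented for illustration.

```python
import re

# Same character class as in the filter above: drop anything that is not
# a digit, a comma, or a hyphen.
pattern = re.compile(r'[^0-9,\-]')

samples = [' 1, 2 ,4-6', '1;3-5', 'n/a']
cleaned = [pattern.sub('', s) for s in samples]
print(cleaned)  # ['1,2,4-6', '13-5', '']
```

The second sample shows a limitation: a stray separator is removed rather than translated, so '1;3-5' becomes '13-5'. The third becomes an empty string, which is what the final query('Options != ""') step filters out.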