Combine the numbers in two columns to create an array

Ash*_*y O 6 python dataframe pandas

Code to create the sample DataFrame:

import pandas as pd

Sample = [{'account': 'Jones LLC', 'Jan': 150, 'Feb': 200, 'Mar': [[.332, .326], [.058, .138]]},
          {'account': 'Alpha Co',  'Jan': 200, 'Feb': 210, 'Mar': [[.234, .246], [.234, .395]]},
          {'account': 'Blue Inc',  'Jan': 50,  'Feb': 90,  'Mar': [[.084, .23], [.745, .923]]}]
df = pd.DataFrame(Sample)

The sample DataFrame, visualized:

 df:
  account        Jan      Feb          Mar
Jones LLC  |     150   |   200    | [[.332, .326], [.058, .138]]
Alpha Co   |     200   |   210    | [[.234, .246], [.234, .395]]
Blue Inc   |     50    |   90     | [[.084, .23], [.745, .923]]

I am looking for a way to combine the Jan and Feb columns into a single array per row and output that array in a New column.

Expected output:

 df:
  account        Jan      Feb          Mar                              New
Jones LLC  |     150   |   200    | [[.332, .326], [.058, .138]]  |    [150, 200]
Alpha Co   |     200   |   210    | [[.234, .246], [.234, .395]]  |    [200, 210]
Blue Inc   |     50    |   90     | [[.084, .23], [.745, .923]]   |    [50, 90]

piR*_*red 8

Use values.tolist

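# columns are listed as Feb, Jan here; use df[['Jan', 'Feb']] for the [Jan, Feb] order in the question
df.assign(New=df[['Feb', 'Jan']].values.tolist())
# assign returns a new DataFrame; to modify df in place, use this instead:
# df['New'] = df[['Feb', 'Jan']].values.tolist()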

   Feb  Jan                               Mar    account         New
0  200  150  [[0.332, 0.326], [0.058, 0.138]]  Jones LLC  [200, 150]
1  210  200  [[0.234, 0.246], [0.234, 0.395]]   Alpha Co  [210, 200]
2   90   50   [[0.084, 0.23], [0.745, 0.923]]   Blue Inc    [90, 50]
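A minimal, self-contained sketch of the same idea, with the columns listed in the order the question asks for (Jan, then Feb). It assumes a reasonably recent pandas, rebuilds the sample frame without the Mar column for brevity, and uses to_numpy() as the newer spelling of .values; none of that is from the original answer.

```python
import pandas as pd

# Trimmed copy of the sample frame (Mar omitted for brevity).
df = pd.DataFrame({
    'account': ['Jones LLC', 'Alpha Co', 'Blue Inc'],
    'Jan': [150, 200, 50],
    'Feb': [200, 210, 90],
})

# Select the two columns in the desired order, convert them to a
# 2-D NumPy array, then to one Python list per row.
df['New'] = df[['Jan', 'Feb']].to_numpy().tolist()

print(df['New'].tolist())
```

The order of the names in the column selection is what decides the order inside each list.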

Timing with larger data
On a 3,000-row DataFrame, avoiding apply is more than 60 times faster.

df = pd.concat([df] * 1000, ignore_index=True)

%timeit df.assign(New=df[['Feb', 'Jan']].values.tolist())
%timeit df.assign(New=df.apply(lambda x: [x['Jan'], x['Feb']], axis=1))

1000 loops, best of 3: 947 µs per loop
10 loops, best of 3: 61.7 ms per loop

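On a 30,000-row DataFrame, avoiding apply is roughly 160 times faster.

# assumes df is the original 3-row frame again (3 rows * 10,000 = 30,000 rows)
df = pd.concat([df] * 10000, ignore_index=True)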

100 loops, best of 3: 3.58 ms per loop
1 loop, best of 3: 586 ms per loop
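The %timeit lines need IPython. As a rough sketch for a plain Python script, a similar comparison could be run with the standard timeit module. The helper names with_tolist and with_apply are made up for this example, the frame is a trimmed two-column version of the sample, and absolute numbers will differ by machine and pandas version.

```python
import timeit

import pandas as pd

small = pd.DataFrame({'Jan': [150, 200, 50], 'Feb': [200, 210, 90]})
# 3 rows repeated 1,000 times gives a 3,000-row frame.
df = pd.concat([small] * 1000, ignore_index=True)


def with_tolist(frame):
    # Vectorized: one array conversion for the whole frame.
    return frame[['Jan', 'Feb']].to_numpy().tolist()


def with_apply(frame):
    # Row-wise: the lambda is called once per row.
    return frame.apply(lambda row: [row['Jan'], row['Feb']], axis=1).tolist()


# Both approaches should build the same list of pairs.
assert with_tolist(df) == with_apply(df)

for func in (with_tolist, with_apply):
    # Average over 5 calls; lambda defers the call so timeit can repeat it.
    seconds = timeit.timeit(lambda: func(df), number=5) / 5
    print(f'{func.__name__}: {seconds * 1e3:.2f} ms per call')
```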


cs9*_*s95 7

List comprehension

If you are looking for speed, this is the way to go.

df['New'] = [[x, y] for x, y in zip(df.Jan, df.Feb)]
df

   Feb  Jan                               Mar    account         New
0  200  150  [[0.332, 0.326], [0.058, 0.138]]  Jones LLC  [150, 200]
1  210  200  [[0.234, 0.246], [0.234, 0.395]]   Alpha Co  [200, 210]
2   90   50   [[0.084, 0.23], [0.745, 0.923]]   Blue Inc    [50, 90]

If you want to drop the original columns, you can use

df.drop(['Jan', 'Feb'], axis=1, inplace=True)

df.apply with axis=1

This is included for completeness; I no longer condone using apply.

df['New'] = df.apply(lambda x: [x['Jan'], x['Feb']], axis=1)    
df

   Feb  Jan                               Mar    account         New
0  200  150  [[0.332, 0.326], [0.058, 0.138]]  Jones LLC  [150, 200]
1  210  200  [[0.234, 0.246], [0.234, 0.395]]   Alpha Co  [200, 210]
2   90   50   [[0.084, 0.23], [0.745, 0.923]]   Blue Inc    [50, 90]

Performance
Repeating piR's tests on small data (3,000 rows), with the list comprehension method included, we have:

%timeit df.assign(New=df[['Feb', 'Jan']].values.tolist())
%timeit df.assign(New=df.apply(lambda x: [x['Jan'], x['Feb']], axis=1))
%timeit df.assign(New=[[x, y] for x, y in zip(df.Jan, df.Feb)])

2.76 ms ± 596 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
152 ms ± 9.47 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
1.59 ms ± 13.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

For larger data (30,000 rows):

5.95 ms ± 527 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
1.53 s ± 165 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
8.79 ms ± 793 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

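The list comprehension and .tolist() methods are competitive with each other: the list comprehension is faster on the 3,000-row data, and .tolist() is faster on the 30,000-row data. Which one you use is a matter of taste. Do not use apply!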
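One behavioral difference may matter when choosing between the two. As a hedged sketch, assuming the two columns have different dtypes (the float Feb values below are hypothetical, not from the question): .values builds a single NumPy array, so integers are upcast to floats, while the list comprehension keeps each column's own type.

```python
import pandas as pd

# Hypothetical frame: Jan is an integer column, Feb is a float column.
df = pd.DataFrame({'Jan': [150, 200], 'Feb': [200.5, 210.0]})

# One shared array for both columns, so Jan is upcast to float.
via_tolist = df[['Jan', 'Feb']].values.tolist()

# Each column is iterated separately, so Jan stays an integer.
via_zip = [[x, y] for x, y in zip(df.Jan, df.Feb)]

print(via_tolist[0])
print(via_zip[0])
```

With two integer columns, as in the question, the two methods produce the same lists.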


WeN*_*Ben 5

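You can also try df['New'] = list(zip(df.Feb, df.Jan)), which stores tuples rather than lists.

Or use tolist: df['New'] = df.iloc[:, 0:2].values.tolist() (written with .iloc because .ix has been removed from pandas; this assumes Feb and Jan are the first two columns of df)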
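A small sketch of what the zip variant stores, assuming a trimmed two-column version of the sample frame; the column names AsTuple and AsList are made up for this illustration. Plain zip yields tuples, and wrapping each pair in list() gives list cells like the ones in the question.

```python
import pandas as pd

df = pd.DataFrame({'Jan': [150, 200, 50], 'Feb': [200, 210, 90]})

# zip yields tuples, so each cell holds a (Jan, Feb) tuple.
df['AsTuple'] = list(zip(df.Jan, df.Feb))

# Converting each pair with list() stores list cells instead.
df['AsList'] = [list(pair) for pair in zip(df.Jan, df.Feb)]

print(df[['AsTuple', 'AsList']])
```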