Why is a list comprehension faster than apply in pandas?

Question


Tus*_*eth 12 python list-comprehension pandas

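Using a list comprehension is much faster than a plain for loop. The reason usually given is that a list comprehension does not need to call append, which makes sense. But I have read in several places that a list comprehension is also faster than apply, and I have seen this myself. What I cannot understand is the internal mechanism that makes it so much faster than apply.

I know this has something to do with vectorization in NumPy, which is the underlying implementation of pandas DataFrames. But I do not understand why a list comprehension beats apply: in a list comprehension we write the for loop ourselves inside the list, whereas with apply we do not write any for loop at all (I assumed vectorization happens there too).

Edit:
Adding code:
This works on the Titanic dataset, where the title is extracted from the name:
https://www.kaggle.com/c/titanic/data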

    %timeit train['NameTitle'] = train['Name'].apply(lambda x: 'Mrs.' if 'Mrs' in x else \
                                             ('Mr' if 'Mr' in x else ('Miss' if 'Miss' in x else \
                                                    ('Master' if 'Master' in x else 'None'))))

    %timeit train['NameTitle'] = ['Mrs.' if 'Mrs' in x else 'Mr' if 'Mr' in x else ('Miss' if 'Miss' in x else ('Master' if 'Master' in x else 'None')) for x in train['Name']]


Results:
782 µs ± 6.36 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

499 µs ± 5.76 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Edit 2:
To add code for SO, I put together a simple example and, surprisingly, for the code below the results are reversed:

    import pandas as pd
    import timeit
    df_test = pd.DataFrame()
    tlist = []
    tlist2 = []
    for i in range(0, 5000000):
      tlist.append(i)
      tlist2.append(i+5)
    df_test['A'] = tlist
    df_test['B'] = tlist2

    display(df_test.head(5))


    %timeit df_test['C'] = df_test['B'].apply(lambda x: x*2 if x%5==0 else x)
    display(df_test.head(5))
    %timeit df_test['C'] = [x*2 if x%5==0 else x for x in df_test['B']]

    display(df_test.head(5))


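1 loop, best of 3: 2.14 s per loop

1 loop, best of 3: 2.24 s per loop

Edit 3:
Some have suggested that apply is essentially a for loop. That does not match what I see: when I run this code with a for loop, it almost never finishes. I had to stop it manually after 3-4 minutes, and it had not completed by then: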

    for row in df_test.itertuples():
      x = row.B
      if x%5==0:
        df_test.at[row.Index,'B'] = x*2


Running the code above takes about 23 seconds, but apply takes only 1.8 seconds. So what is the difference between the physical loops in itertuples and in apply?

Answer 1

Ale*_*lex 4

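There are a few reasons for the performance difference between a list comprehension and apply.

First, the list comprehension in your code does not make a function call on each iteration, whereas apply does. This makes a huge difference: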

    map_function = lambda x: 'Mrs.' if 'Mrs' in x else \
                     ('Mr' if 'Mr' in x else ('Miss' if 'Miss' in x else \
                     ('Master' if 'Master' in x else 'None')))

    %timeit train['NameTitle'] = [map_function(x) for x in train['Name']]
    # 581 µs ± 21.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
    %timeit train['NameTitle'] = ['Mrs.' if 'Mrs' in x else \
                     ('Mr' if 'Mr' in x else ('Miss' if 'Miss' in x else \
                     ('Master' if 'Master' in x else 'None'))) for x in train['Name']]
    # 482 µs ± 14.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
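If you do not have the Titanic data at hand, the function-call overhead can be reproduced with plain Python. The following is only a rough sketch: the `names` list is a made-up stand-in for `train['Name']`, and `title_of`, `with_call` and `inline` are hypothetical helper names, not part of the original code. Absolute timings will differ from machine to machine.

```python
import timeit

# Made-up stand-in for the Titanic "Name" column (assumption, not real data).
names = [
    "Braund, Mr. Owen",
    "Cumings, Mrs. John",
    "Heikkinen, Miss. Laina",
    "Palsson, Master. Gosta",
    "Unknown, Dr. X",
] * 200


def title_of(x):
    """Same branching logic as the inline expression in inline()."""
    return ("Mrs." if "Mrs" in x else
            "Mr" if "Mr" in x else
            "Miss" if "Miss" in x else
            "Master" if "Master" in x else "None")


def with_call():
    # One Python function call per element.
    return [title_of(x) for x in names]


def inline():
    # The expression is evaluated directly inside the comprehension.
    return [("Mrs." if "Mrs" in x else
             "Mr" if "Mr" in x else
             "Miss" if "Miss" in x else
             "Master" if "Master" in x else "None") for x in names]


# Both variants must produce identical results; only the cost differs.
assert with_call() == inline()

t_call = timeit.timeit(with_call, number=200)
t_inline = timeit.timeit(inline, number=200)
print(f"with function call: {t_call:.4f} s, inline: {t_inline:.4f} s")
```

On most interpreters the inline version should come out somewhat faster, because each call to `title_of` has to set up a new frame, but the size of the gap depends on the Python version.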


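Second, apply does much more than a list comprehension. For example, it tries to find an appropriate dtype for the result. By disabling that behaviour, you can see how much of an impact it has: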

    %timeit train['NameTitle'] = train['Name'].apply(map_function)
    # 660 µs ± 2.57 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
    %timeit train['NameTitle'] = train['Name'].apply(map_function, convert_dtype=False)
    # 626 µs ± 4.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
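To make the extra work visible, here is a small sketch of what each approach hands back. The two-element `names` Series is a made-up example, and the exact internal steps apply takes may vary between pandas versions, so treat this as an illustration rather than a description of the implementation.

```python
import pandas as pd

# Made-up two-row stand-in for train['Name'] (assumption, not real data).
names = pd.Series(["Braund, Mr. Owen", "Cumings, Mrs. John"])

# apply builds a new Series: it keeps the index and infers a dtype for the results.
via_apply = names.apply(len)

# A list comprehension only collects the Python objects; no index, no dtype inference.
via_listcomp = [len(x) for x in names]

print(type(via_apply).__name__, via_apply.dtype)
print(type(via_listcomp).__name__)
```

The list is only converted to a column when it is assigned to the DataFrame, whereas apply has already done the Series construction and dtype inference before the assignment.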


There is a lot of other stuff happening in apply as well, so in this example you would want to use map:

    %timeit train['NameTitle'] = train['Name'].map(map_function)
    # 545 µs ± 4.02 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


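This performs better than the list comprehension with a function call in it.

Then why use apply at all, you might ask? I know at least one case where it outperforms everything else: when the operation you want to apply is a vectorized universal function (ufunc). That is because apply, unlike map and a list comprehension, allows the function to run on the whole Series instead of on the individual objects in it. Let's see an example: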

    %timeit train['AgeExp'] = train['Age'].apply(lambda x: np.exp(x))
    # 1.44 ms ± 41.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
    %timeit train['AgeExp'] = train['Age'].apply(np.exp)
    # 256 µs ± 12.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
    %timeit train['AgeExp'] = train['Age'].map(np.exp)
    # 1.01 ms ± 8.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
    %timeit train['AgeExp'] = [np.exp(x) for x in train['Age']]
    # 1.21 ms ± 28.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
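The difference between the first two lines can be checked on a tiny Series. This is a sketch under assumptions: `ages` is a made-up stand-in for `train['Age']`, `exp_wrapper` and `call_log` are hypothetical names used only to record Python-level calls, and it assumes that a plain Python function passed to `Series.apply` is invoked element by element.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for train['Age'] (assumption, not real data).
ages = pd.Series([22.0, 38.0, 26.0, 35.0])

call_log = []  # records every argument the wrapper receives


def exp_wrapper(x):
    """Plain Python function around np.exp, so pandas cannot treat it as a ufunc."""
    call_log.append(x)
    return np.exp(x)


elementwise = ages.apply(exp_wrapper)  # Python-level calls, one scalar at a time
via_apply = ages.apply(np.exp)         # the ufunc itself is passed to apply
whole_series = np.exp(ages)            # one vectorized call on the whole Series

print("Python-level calls made through the wrapper:", len(call_log))
print("All three results agree:",
      np.allclose(elementwise, whole_series) and np.allclose(via_apply, whole_series))
```

All three give the same numbers; what changes is how many times control passes through the Python interpreter. If the operation is already a ufunc, calling it directly on the Series, as in `np.exp(ages)`, expresses the same vectorized computation without going through apply at all.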

