Vectorized Pandas method for set-like comparison against multiple boolean columns/Series

Question

Cra*_*cky 5 python boolean set dataframe pandas

Sample data to illustrate:

import pandas as pd

animals = pd.DataFrame({'name': ['ostrich', 'parrot', 'platypus'],
                        'legs': [2, 2, 4],
                        'flight': [False, True, False],
                        'beak': [True, True, True],
                        'feathers': [True, True, False]})

Name	Legs	Flight	Beak	Feathers
ostrich	2		✔	✔
parrot	2	✔	✔	✔
platypus	4		✔

What already works

Pandas makes it easy to check an entire column (which is a Series) against a condition, and the result (a Series of booleans) can be used to filter the DataFrame with boolean indexing:

bipeds = (animals.legs == 2)
print(animals[bipeds])

          name  legs  flight  beak  feathers
0      ostrich     2   False  True      True
1       parrot     2    True  True      True

In my use case, each such condition is parsed from a term in a text search string, so I need to construct them programmatically. (I'm aware of Pandas' query, but I need different functionality.) Writing a function to do this is pretty straightforward:

def comp_search(df, column_name, comp, value):
    return getattr(df[column_name], f'__{comp}__')(value)

bipeds = comp_search(animals, 'legs', 'eq', 2)
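As a possible variation on building conditions programmatically, the comparison name could be looked up in a table of functions from the standard `operator` module instead of assembling a dunder name. This is only a sketch: `COMPARATORS` and `comp_search_op` are hypothetical names, not part of the question or of pandas.

```python
import operator

import pandas as pd

# Hypothetical lookup table from parsed comparison names to functions
COMPARATORS = {
    'eq': operator.eq, 'ne': operator.ne,
    'lt': operator.lt, 'le': operator.le,
    'gt': operator.gt, 'ge': operator.ge,
}

def comp_search_op(df, column_name, comp, value):
    """Return a boolean Series for `df[column_name] <comp> value`."""
    try:
        compare = COMPARATORS[comp]
    except KeyError:
        # Reject unknown names instead of calling an arbitrary attribute
        raise ValueError(f'unsupported comparison: {comp!r}') from None
    return compare(df[column_name], value)

animals = pd.DataFrame({'name': ['ostrich', 'parrot', 'platypus'],
                        'legs': [2, 2, 4],
                        'flight': [False, True, False],
                        'beak': [True, True, True],
                        'feathers': [True, True, False]})

bipeds = comp_search_op(animals, 'legs', 'eq', 2)
print(animals[bipeds])
```

An explicit table restricts the search syntax to known comparisons, which may matter when the terms come from user-supplied text.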

Checking any given boolean column is just as simple, e.g. animals[animals.feathers].

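What I'd like to do

I'd like to perform set comparisons against a collection of boolean columns: for example, finding all animals that have at least a certain set of features, or fewer than a certain set, and so on. Extrapolating from the earlier example, I can picture a condition like this:

set(df[features]) <= set(values)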

And supposing such a condition could be built like so:

def set_comp_search(df, column_names, comp, values):
    return getattr(set(df[column_names]), f'__{comp}__')(set(values))

Of course, neither of these works, because calling set() on a DataFrame creates an ordinary set of its column names.
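To make the failure concrete, here is a small sketch using the sample data from the question (the `features` list is an assumption introduced for illustration). Iterating a DataFrame yields its column labels, so the comparison produces a single bool about the labels rather than one result per row.

```python
import pandas as pd

animals = pd.DataFrame({'name': ['ostrich', 'parrot', 'platypus'],
                        'legs': [2, 2, 4],
                        'flight': [False, True, False],
                        'beak': [True, True, True],
                        'feathers': [True, True, False]})
features = ['flight', 'beak', 'feathers']

# set() iterates the DataFrame, which yields column labels, not row values
labels = set(animals[features])
print(labels == {'flight', 'beak', 'feathers'})  # True

# So the comparison is one plain bool, unrelated to any animal's features
result = labels <= {'beak', 'feathers'}
print(result)  # False
```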

What works, but is woefully inefficient

The above can be achieved by using apply to convert each row of booleans into a set, then comparing against the resulting Series of sets:

def row_to_set(row):
    return set(label for label, value
               in zip(row.index, row)
               if value)

def set_comp_search(df, column_names, comp, values):
    series_of_sets = df[column_names].apply(row_to_set, axis=1)
    return getattr(series_of_sets, f'__{comp}__')(set(values))

Nice and concise! Unfortunately, iterating with apply becomes exceedingly slow once the source DataFrame grows to thousands of rows.
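A rough way to observe the slowdown is to time the row-wise route against a column-wise equivalent on synthetic data. The sketch below assumes 20,000 random boolean rows and a single superset test against {'beak', 'feathers'}; the data, the row count, and the variable names are illustrative, and the timings will vary by machine and pandas version.

```python
import time

import numpy as np
import pandas as pd

features = ['flight', 'beak', 'feathers']
rng = np.random.default_rng(0)
# Synthetic data (an assumption for illustration): 20,000 random boolean rows
big = pd.DataFrame(rng.random((20_000, len(features))) < 0.5, columns=features)

def row_to_set(row):
    return {label for label, value in zip(row.index, row) if value}

start = time.perf_counter()
# Python-level loop: one function call and one set object per row
slow = big.apply(row_to_set, axis=1).map(lambda s: s >= {'beak', 'feathers'})
apply_seconds = time.perf_counter() - start

start = time.perf_counter()
# Column-wise equivalent of "row set is a superset of {'beak', 'feathers'}"
fast = big[['beak', 'feathers']].all(axis=1)
vector_seconds = time.perf_counter() - start

print(f'apply: {apply_seconds:.3f}s, vectorized: {vector_seconds:.3f}s')
print((slow == fast).all())  # both routes should agree on every row
```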

What works, but seems like reimplementation

If I hardcode an equivalent boolean expression for each individual set comparison, like this, the resulting comparisons are vectorized (performed on entire columns rather than iterated at the Python level).

def set_comp_search(df, column_names, comp, values):
    # Index with lists: recent pandas versions reject set indexers
    other_column_names = sorted(set(column_names) - set(values))
    value_columns = df[sorted(values)]
    other_columns = df[other_column_names]

    if comp == 'gt':
        # All the searched features, and at least one other
        return value_columns.all(axis=1) & other_columns.any(axis=1)

    if comp == 'ge':
        # All the searched features
        return value_columns.all(axis=1)

    if comp == 'eq':
        # All the searched features, and none other
        return value_columns.all(axis=1) & ~other_columns.any(axis=1)

    if comp == 'le':
        # No other features
        return ~other_columns.any(axis=1)

    if comp == 'lt':
        # Not all of the searched features, and none other
        return ~value_columns.all(axis=1) & ~other_columns.any(axis=1)

    raise ValueError(f'unsupported comparison: {comp!r}')

So if I want a condition to represent set(animals[features]) > {'beak'}:

more_than_beak = set_comp_search(animals, {'flight', 'beak', 'feathers'},
                                 'gt', {'beak'})
# Converts to: (animals.beak) & (animals.flight | animals.feathers)
print(animals[more_than_beak])

          name  legs  flight  beak  feathers
0      ostrich     2   False  True      True
1       parrot     2    True  True      True

# Correctly omits the platypus

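Clunkiness aside, this runs sufficiently quickly. But I feel like I must be reinventing a wheel. This seems roughly similar to what the Series.str methods are for, although it would need to operate on a DataFrame, a sequence of Series, or a numpy array rather than on a single Series. (Sadly, there is no DataFrame.set module.)

So my question is: does Pandas provide a vectorized method for set-like comparison against collections of boolean columns?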

(I've also looked at this question, since it sounds similar, but it isn't applicable to set-like behavior.)
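For reference, the five hardcoded branches can be folded into two boolean-matrix tests, subset and superset, from which the other comparisons follow. The sketch below is not a pandas API: `set_comp_mask` is a hypothetical helper, and it assumes every name in `values` is also one of `column_names`.

```python
import numpy as np
import pandas as pd

def set_comp_mask(df, column_names, comp, values):
    """Compare each row's set of True columns with `values`, column-wise.

    Hypothetical helper; `comp` is one of 'lt', 'le', 'eq', 'ge', 'gt'.
    """
    columns = list(column_names)
    missing = set(values) - set(columns)
    if missing:
        raise ValueError(f'values not among the columns: {sorted(missing)}')
    rows = df[columns].to_numpy(dtype=bool)      # shape (n_rows, n_cols)
    target = np.isin(columns, list(values))      # shape (n_cols,)
    is_subset = (~rows | target).all(axis=1)     # row set <= values
    is_superset = (rows | ~target).all(axis=1)   # row set >= values
    masks = {
        'le': is_subset,
        'ge': is_superset,
        'eq': is_subset & is_superset,
        'lt': is_subset & ~is_superset,
        'gt': is_superset & ~is_subset,
    }
    return pd.Series(masks[comp], index=df.index)

animals = pd.DataFrame({'name': ['ostrich', 'parrot', 'platypus'],
                        'legs': [2, 2, 4],
                        'flight': [False, True, False],
                        'beak': [True, True, True],
                        'feathers': [True, True, False]})
features = ['flight', 'beak', 'feathers']

more_than_beak = set_comp_mask(animals, features, 'gt', {'beak'})
print(animals.loc[more_than_beak, 'name'].tolist())  # ['ostrich', 'parrot']
```

The design choice here is to express "subset" as "no True column outside the target" and "superset" as "no target column that is False", so each comparison is a couple of whole-array operations.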

Answer 1

小智 0

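It seems to me that you might benefit from a function vectorized with numpy. Here is an example of such a function, its vectorization, and its application: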

import numpy as np

def analyze_birds(name: str, legs: int, feathers: bool):
    if feathers and legs == 2:
        return name + "-Feathered Biped"
    if legs > 2:
        return name + "-Quadruped"
    # Any other combination falls through and returns None

# np.vectorize calls analyze_birds once per row
vector_analyze_birds = np.vectorize(analyze_birds)

animals['Analysis'] = vector_analyze_birds(animals['name'], animals['legs'], animals['feathers'])
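One caveat worth keeping in mind: the NumPy documentation describes np.vectorize as a convenience wrapper whose implementation is essentially a for loop, so it does not avoid Python-level iteration. As a sketch of how the same labelling might be expressed with whole-column operations instead, np.select can pick between precomputed choices; the `conditions` and `choices` names below are illustrative.

```python
import numpy as np
import pandas as pd

animals = pd.DataFrame({'name': ['ostrich', 'parrot', 'platypus'],
                        'legs': [2, 2, 4],
                        'flight': [False, True, False],
                        'beak': [True, True, True],
                        'feathers': [True, True, False]})

# Each condition is a boolean Series computed on whole columns
conditions = [
    animals['feathers'] & (animals['legs'] == 2),
    animals['legs'] > 2,
]
choices = [
    animals['name'] + '-Feathered Biped',
    animals['name'] + '-Quadruped',
]
# The first matching condition wins per row; rows matching neither get None
animals['Analysis'] = np.select(conditions, choices, default=None)
print(animals['Analysis'].tolist())
# ['ostrich-Feathered Biped', 'parrot-Feathered Biped', 'platypus-Quadruped']
```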

Archived:	5 years, 5 months ago
Views:	856
Last recorded:	5 years, 2 months ago