分层数据:有效地为每个节点构建每个后代的列表

tak*_*ky2 15 python tree recursion numpy pandas

我有一个两列数据集,描述了形成一棵大树的多个子父关系.我想用它来构建每个节点的每个后代的更新列表.

原始输入:

   child  parent
1   2010    1000
7   2100    1000
5   2110    1000
3   3000    2110
2   3011    2010
4   3033    2100
0   3102    2010
6   3111    2110
Run Code Online (Sandbox Code Playgroud)

关系的图形描述:

示例数据关系树

预期产量:

    descendant  ancestor
0         2010      1000
1         2100      1000
2         2110      1000
3         3000      1000
4         3011      1000
5         3033      1000
6         3102      1000
7         3111      1000
8         3011      2010
9         3102      2010
10        3033      2100
11        3000      2110
12        3111      2110
Run Code Online (Sandbox Code Playgroud)

最初我决定使用DataFrames的递归解决方案.它按预期工作,但熊猫的效率非常低.我的研究让我相信使用NumPy数组(或其他简单数据结构)的实现在大型数据集(数千个记录中的10个)上要快得多.

解决方案使用数据框:

import pandas as pd

df = pd.DataFrame(
    {
        'child':     [3102, 2010, 3011, 3000, 3033, 2110, 3111, 2100],
        'parent':    [2010, 1000, 2010, 2110, 2100, 1000, 2110, 1000]
    },  columns=['child', 'parent']
)


def get_ancestry_dataframe_flat(df):

    def get_child_list(parent_id):

        list_of_children = list()
        list_of_children.append(df[df['parent'] == parent_id]['child'].values)

        for i, r in df[df['parent'] == parent_id].iterrows():
            if r['child'] != parent_id:
                list_of_children.append(get_child_list(r['child']))

        # flatten list
        list_of_children = [item for sublist in list_of_children for item in sublist]
        return list_of_children

    new_df = pd.DataFrame(columns=['descendant', 'ancestor']).astype(int)
    for index, row in df.iterrows():
        temp_df = pd.DataFrame(columns=['descendant', 'ancestor'])
        temp_df['descendant'] = pd.Series(get_child_list(row['parent']))
        temp_df['ancestor'] = row['parent']
        new_df = new_df.append(temp_df)

    new_df = new_df\
        .drop_duplicates()\
        .sort_values(['ancestor', 'descendant'])\
        .reset_index(drop=True)

    return new_df
Run Code Online (Sandbox Code Playgroud)

因为以这种方式使用pandas DataFrames对大型数据集效率非常低,所以我需要提高此操作的性能.我的理解是,这可以通过使用更适合循环和递归的更有效的数据结构来完成.我想以最有效的方式执行相同的操作.

具体来说,我要求优化速度.

Ste*_*uch 7

这是一个构建dict的方法,以便更轻松地导航树.然后运行一次树,并将孩子添加到他们的祖父母及以上.最后将新数据添加到数据框中.

码:

def add_children_of_children(dataframe, root_node):
    # build a dict of lists to allow easy tree descent
    tree = {}
    for idx, (child, parent) in dataframe.iterrows():
        tree.setdefault(parent, []).append(child)

    data = []

    def descend_tree(parent):
        # get list of children of this parent
        children = tree[parent]

        # reverse order so that we can modify the list while looping
        for child in reversed(children):
            if child in tree:

                # descend tree and find children which need to be added
                lower_children = descend_tree(child)

                # add children from below to parent at this level
                data.extend([(c, parent) for c in lower_children])

                # return lower children to parents above
                children.extend(lower_children)

        return children

    descend_tree(root_node)

    return dataframe.append(
        pd.DataFrame(data, columns=dataframe.columns))
Run Code Online (Sandbox Code Playgroud)

时序:

测试代码中有三种测试方法,即时间运行的秒数:

  • 0.073 - add_children_of_children()从上面.
  • 0.153 - add_children_of_children()输出已排序.
  • 3.385 - 原始的get_ancestry_dataframe_flat()熊猫实施.

因此,本机数据结构方法比原始实现快得多.

测试代码:

import pandas as pd

df = pd.DataFrame(
    {
        'child': [3102, 2010, 3011, 3000, 3033, 2110, 3111, 2100],
        'parent': [2010, 1000, 2010, 2110, 2100, 1000, 2110, 1000]
    }, columns=['child', 'parent']
)

def method1():
    # the root node is the node which is not a child
    root = set(df.parent) - set(df.child)
    assert len(root) == 1, "Number of roots != 1 '{}'".format(root)
    return add_children_of_children(df, root.pop())

def method2():
    dataframe = method1()
    names = ['ancestor', 'descendant']
    rename = {o: n for o, n in zip(dataframe.columns, reversed(names))}
    return dataframe.rename(columns=rename) \
        .sort_values(names).reset_index(drop=True)

def method3():
    return get_ancestry_dataframe_flat(df)

def get_ancestry_dataframe_flat(df):

    def get_child_list(parent_id):

        list_of_children = list()
        list_of_children.append(
            df[df['parent'] == parent_id]['child'].values)

        for i, r in df[df['parent'] == parent_id].iterrows():
            if r['child'] != parent_id:
                list_of_children.append(get_child_list(r['child']))

        # flatten list
        list_of_children = [
            item for sublist in list_of_children for item in sublist]
        return list_of_children

    new_df = pd.DataFrame(columns=['descendant', 'ancestor']).astype(int)
    for index, row in df.iterrows():
        temp_df = pd.DataFrame(columns=['descendant', 'ancestor'])
        temp_df['descendant'] = pd.Series(get_child_list(row['parent']))
        temp_df['ancestor'] = row['parent']
        new_df = new_df.append(temp_df)

    new_df = new_df\
        .drop_duplicates()\
        .sort_values(['ancestor', 'descendant'])\
        .reset_index(drop=True)

    return new_df

print(method2())
print(method3())

from timeit import timeit
print(timeit(method1, number=50))
print(timeit(method2, number=50))
print(timeit(method3, number=50))
Run Code Online (Sandbox Code Playgroud)

检测结果:

    descendant  ancestor
0         2010      1000
1         2100      1000
2         2110      1000
3         3000      1000
4         3011      1000
5         3033      1000
6         3102      1000
7         3111      1000
8         3011      2010
9         3102      2010
10        3033      2100
11        3000      2110
12        3111      2110

    descendant  ancestor
0         2010      1000
1         2100      1000
2         2110      1000
3         3000      1000
4         3011      1000
5         3033      1000
6         3102      1000
7         3111      1000
8         3011      2010
9         3102      2010
10        3033      2100
11        3000      2110
12        3111      2110

0.0737142168563
0.153700592966
3.38558308083
Run Code Online (Sandbox Code Playgroud)


EFT*_*EFT 5

这是一种使用numpy一次遍历树的方法。

码:

import numpy as np
import pandas as pd  # only used to return a dataframe


def list_ancestors(edges):
    """
    Take edge list of a rooted tree as a numpy array with shape (E, 2),
    child nodes in edges[:, 0], parent nodes in edges[:, 1]
    Return pandas dataframe of all descendant/ancestor node pairs

    Ex:
        df = pd.DataFrame({'child': [200, 201, 300, 301, 302, 400],
                           'parent': [100, 100, 200, 200, 201, 300]})

        df
           child  parent
        0    200     100
        1    201     100
        2    300     200
        3    301     200
        4    302     201
        5    400     300

        list_ancestors(df.values)

        returns

            descendant  ancestor
        0          200       100
        1          201       100
        2          300       200
        3          300       100
        4          301       200
        5          301       100
        6          302       201
        7          302       100
        8          400       300
        9          400       200
        10         400       100
    """
    ancestors = []
    for ar in trace_nodes(edges):
        ancestors.append(np.c_[np.repeat(ar[:, 0], ar.shape[1]-1),
                               ar[:, 1:].flatten()])
    return pd.DataFrame(np.concatenate(ancestors),
                        columns=['descendant', 'ancestor'])


def trace_nodes(edges):
    """
    Take edge list of a rooted tree as a numpy array with shape (E, 2),
    child nodes in edges[:, 0], parent nodes in edges[:, 1]
    Yield numpy array with cross-section of tree and associated
    ancestor nodes

    Ex:
        df = pd.DataFrame({'child': [200, 201, 300, 301, 302, 400],
                           'parent': [100, 100, 200, 200, 201, 300]})

        df
           child  parent
        0    200     100
        1    201     100
        2    300     200
        3    301     200
        4    302     201
        5    400     300

        trace_nodes(df.values)

        yields

        array([[200, 100],
               [201, 100]])

        array([[300, 200, 100],
               [301, 200, 100],
               [302, 201, 100]])

        array([[400, 300, 200, 100]])
    """
    mask = np.in1d(edges[:, 1], edges[:, 0])
    gen_branches = edges[~mask]
    edges = edges[mask]
    yield gen_branches
    while edges.size != 0:
        mask = np.in1d(edges[:, 1], edges[:, 0])
        next_gen = edges[~mask]
        gen_branches = numpy_col_inner_many_to_one_join(next_gen, gen_branches)
        edges = edges[mask]
        yield gen_branches


def numpy_col_inner_many_to_one_join(ar1, ar2):
    """
    Take two 2-d numpy arrays ar1 and ar2,
    with no duplicate values in first column of ar2
    Return inner join of ar1 and ar2 on
    last column of ar1, first column of ar2

    Ex:

        ar1 = np.array([[1,  2,  3],
                        [4,  5,  3],
                        [6,  7,  8],
                        [9, 10, 11]])

        ar2 = np.array([[ 1,  2],
                        [ 3,  4],
                        [ 5,  6],
                        [ 7,  8],
                        [ 9, 10],
                        [11, 12]])

        numpy_col_inner_many_to_one_join(ar1, ar2)

        returns

        array([[ 1,  2,  3,  4],
               [ 4,  5,  3,  4],
               [ 9, 10, 11, 12]])
    """
    ar1 = ar1[np.in1d(ar1[:, -1], ar2[:, 0])]
    ar2 = ar2[np.in1d(ar2[:, 0], ar1[:, -1])]
    if 'int' in ar1.dtype.name and ar1[:, -1].min() >= 0:
        bins = np.bincount(ar1[:, -1])
        counts = bins[bins.nonzero()[0]]
    else:
        counts = np.unique(ar1[:, -1], False, False, True)[1]
    left = ar1[ar1[:, -1].argsort()]
    right = ar2[ar2[:, 0].argsort()]
    return np.concatenate([left[:, :-1],
                           right[np.repeat(np.arange(right.shape[0]),
                                           counts)]], 1)
Run Code Online (Sandbox Code Playgroud)

时序比较:

@ taky2提供的测试用例1和2,分别比较高大的树木结构和宽大的树形结构性能的测试用例3和4 –大多数用例可能在中间位置。

df = pd.DataFrame(
    {
        'child': [3102, 2010, 3011, 3000, 3033, 2110, 3111, 2100],
        'parent': [2010, 1000, 2010, 2110, 2100, 1000, 2110, 1000]
    }
)

df2 = pd.DataFrame(
    {
        'child': [4321, 3102, 4023, 2010, 5321, 4200, 4113, 6525, 4010, 4001,
                  3011, 5010, 3000, 3033, 2110, 6100, 3111, 2100, 6016, 4311],
        'parent': [3111, 2010, 3000, 1000, 4023, 3011, 3033, 5010, 3011, 3102,
                   2010, 4023, 2110, 2100, 1000, 5010, 2110, 1000, 5010, 3033]
    }
)

df3 = pd.DataFrame(np.r_[np.c_[np.arange(1, 501), np.arange(500)],
                         np.c_[np.arange(501, 1001), np.arange(500)]],
                   columns=['child', 'parent'])

df4 = pd.DataFrame(np.r_[np.c_[np.arange(1, 101), np.repeat(0, 100)],
                         np.c_[np.arange(1001, 11001),
                               np.repeat(np.arange(1, 101), 100)]],
                   columns=['child', 'parent'])

%timeit get_ancestry_dataframe_flat(df)
10 loops, best of 3: 53.4 ms per loop

%timeit add_children_of_children(df)
1000 loops, best of 3: 1.13 ms per loop

%timeit all_descendants_nx(df)
1000 loops, best of 3: 675 µs per loop

%timeit list_ancestors(df.values)
1000 loops, best of 3: 391 µs per loop

%timeit get_ancestry_dataframe_flat(df2)
10 loops, best of 3: 168 ms per loop

%timeit add_children_of_children(df2)
1000 loops, best of 3: 1.8 ms per loop

%timeit all_descendants_nx(df2)
1000 loops, best of 3: 1.06 ms per loop

%timeit list_ancestors(df2.values)
1000 loops, best of 3: 933 µs per loop

%timeit add_children_of_children(df3)
10 loops, best of 3: 156 ms per loop

%timeit all_descendants_nx(df3)
1 loop, best of 3: 952 ms per loop

%timeit list_ancestors(df3.values)
10 loops, best of 3: 104 ms per loop

%timeit add_children_of_children(df4)
1 loop, best of 3: 503 ms per loop

%timeit all_descendants_nx(df4)
1 loop, best of 3: 238 ms per loop

%timeit list_ancestors(df4.values)
100 loops, best of 3: 2.96 ms per loop
Run Code Online (Sandbox Code Playgroud)

笔记:

get_ancestry_dataframe_flat 由于时间和记忆的原因,未对情况3和4进行计时。

add_children_of_children修改为在内部标识根节点,但允许采用唯一的根。root_node = (set(dataframe.parent) - set(dataframe.child)).pop()添加第一行。

all_descendants_nx 修改为接受数据框作为参数,而不是从外部名称空间提取。

演示正确行为的示例:

np.all(get_ancestry_dataframe_flat(df2).sort_values(['descendant', 'ancestor'])\
                                       .reset_index(drop=True) ==\
       list_ancestors(df2.values).sort_values(['descendant', 'ancestor'])\
                                 .reset_index(drop=True))
Out[20]: True
Run Code Online (Sandbox Code Playgroud)