Summing transaction chains in a DataFrame, where rows are linked by column values

Ben*_*let 5 python pandas

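I am trying to link multiple rows of a DataFrame in order to get every path that can be formed by connecting a receiver ID to a sender ID.

Here is a sample of my DataFrame: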

   transaction_id sender_id receiver_id  amount
0          213234       002         125      10
1          223322       017         354      90
2          343443       125         689      70
3          324433       689         233       5
4          328909       354         456      10

Created with:

import pandas as pd

df = pd.DataFrame(
    {'transaction_id': {0: '213234', 1: '223322', 2: '343443', 3: '324433', 4: '328909'},
     'sender_id': {0: '002', 1: '017', 2: '125', 3: '689', 4: '354'},
     'receiver_id': {0: '125', 1: '354', 2: '689', 3: '233', 4: '456'},
     'amount': {0: 10, 1: 90, 2: 70, 3: 5, 4: 10}}
)

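The result of my code should be a list of the IDs in each chain together with the total amount of that transaction chain. For the first two rows of the example above, something like: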

[('002', '125', '689', '233'), 85]
[('017', '354', '456'), 100]

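I have tried iterating over the rows and converting each row into an instance of a Node class, then using a method that traverses a linked list, but I don't know what the next step is: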

class Node:
    def __init__(self, transaction_id, sender_id, receiver_id, amount):
        self.transac = transaction_id
        self.val = sender_id
        self.next = receiver_id
        self.amount = amount
    def traverse(self):
        node = self # start from the head node
        while node != None:
            print (node.val) # access the node value
            node = node.next # move on to the next node

for index, row in customerTransactionSqlDf3.iterrows():
    index = Node( 
        row["transaction_id"],
        row["sender_id"],
        row["receiver_id"],
        row["amount"]
    )

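Additional information:

  • sender_id values are unique, and for each sender ID there is only one possible transaction chain.
  • There are no cycles: no chain has a receiver ID that points back to a sender ID in the same path.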
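
The two assumptions above can be checked before building any chains. This is only a minimal sketch: the helper has_cycle and the next_id mapping are names made up for illustration, and the sample frame repeats just the two ID columns from the example.

```python
import pandas as pd

df = pd.DataFrame({
    "sender_id": ["002", "017", "125", "689", "354"],
    "receiver_id": ["125", "354", "689", "233", "456"],
})

# Assumption 1: each sender_id appears at most once.
senders_unique = df["sender_id"].is_unique

# Map every sender to its receiver (only meaningful if senders are unique,
# because duplicate senders would overwrite each other here).
next_id = dict(zip(df["sender_id"], df["receiver_id"]))

def has_cycle(start, next_id):
    """Follow sender -> receiver links from start; True if an ID repeats."""
    seen = {start}
    current = start
    while current in next_id:
        current = next_id[current]
        if current in seen:
            return True
        seen.add(current)
    return False

# Assumption 2: no path ever returns to an ID it has already passed through.
any_cycle = any(has_cycle(s, next_id) for s in next_id)

print(senders_unique, any_cycle)
```

If either check fails on real data, a plain linked-list traversal would need extra handling (branching chains or a loop guard).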

dee*_*cue 1

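I don't know what the next step is

With your current implementation of Node, you can connect two objects by iterating over every node. You can also add a visited attribute to the Node class so that, while traversing, you can identify the unique chains, i.e. no chain is a sub-chain of another. However, if you want the chain for every sender_id, that may not be necessary.

Edit: I noticed you mentioned that the expected-result example is for the first two rows. That implies every sender_id should have its own chain. I modified the traverse method so that it can be used once all the nodes are connected.

Edit: re-implemented the visited attribute to get unique chains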

import pandas as pd

df = pd.DataFrame(
    {'transaction_id': {0: '213234', 1: '223322', 2: '343443', 3: '324433', 4: '328909'},
     'sender_id': {0: '002', 1: '017', 2: '125', 3: '689', 4: '354'},
     'receiver_id': {0: '125', 1: '354', 2: '689', 3: '233', 4: '456'},
     'amount': {0: 10, 1: 90, 2: 70, 3: 5, 4: 10}}
)

class Node:
    def __init__(self,transaction_id,sender_id,receiver_id,amount):
        self.transac = transaction_id
        self.sender = sender_id
        self.receiver = receiver_id
        self.next = None
        self.amount = amount
        self.visited = False
    def traverse(self, chain=None, total=0):
        if self.visited:  # skip nodes already consumed by an earlier chain
            return
        self.visited = True
        if chain is None: # this is the beginning of the traversal
            chain = [self.sender]
        chain += [self.receiver]
        total += self.amount
        if self.next is not None:
            return self.next.traverse(chain, total)
        return chain, total

transc = [Node( 
        row["transaction_id"],
        row["sender_id"],
        row["receiver_id"],
        row["amount"]
    ) for i, row in df.iterrows()]

# connect the nodes
for v in transc:
    for k in transc:
        # link v to k when v's receiver is k's sender
        if v.receiver == k.sender:
            v.next = k


summary = [i.traverse() for i in transc]
summary = [i for i in summary if i is not None]  # drop the None returned for already-visited nodes

print(summary)

Output:

[
    (['002', '125', '689', '233'], 85), 
    (['017', '354', '456'], 100)
]
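
One caveat with the visited approach, as far as I can tell: it depends on the first row of each chain appearing before the other rows of that chain. If a middle row comes first, it is traversed on its own and the real head later returns None. A possible alternative, sketched below under the question's stated assumptions (unique sender_id, no cycles), is to start only from senders that never appear as receivers and follow a dictionary lookup. The names summarize_chains, step, and heads are hypothetical, and the sample rows are deliberately shuffled.

```python
import pandas as pd

def summarize_chains(df):
    """Return [(chain_ids, total_amount), ...], one entry per chain head."""
    # Map each sender to (receiver, amount); relies on sender_id being unique.
    step = {
        s: (r, a)
        for s, r, a in zip(df["sender_id"], df["receiver_id"], df["amount"])
    }
    receivers = set(df["receiver_id"])
    # A head is a sender that never appears as a receiver.
    heads = [s for s in df["sender_id"] if s not in receivers]

    summary = []
    for head in heads:
        chain, total, current = [head], 0, head
        while current in step:  # terminates only because there are no cycles
            current, amount = step[current]
            chain.append(current)
            total += amount
        summary.append((tuple(chain), total))
    return summary

# Same transactions as the example, but with the rows out of chain order.
df = pd.DataFrame({
    "sender_id": ["125", "002", "689", "017", "354"],
    "receiver_id": ["689", "125", "233", "354", "456"],
    "amount": [70, 10, 5, 90, 10],
})

print(summarize_chains(df))
```

This also replaces the nested linking loop with a single dictionary lookup per step, which should matter on larger frames.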