分析大图 - 检索聚类和计算最强路径

Nik*_*lin 6 sql graph-theory social-networking

我尝试使用广度优先算法以下列方式检索所选用户所属的整个连接群集:

CREATE PROCEDURE Breadth_First (@StartNode varchar(50), @LinkStrength decimal(10,7) = 0.1, @EndNode varchar(50) = NULL)
    AS
    BEGIN
        -- Automatically rollback the transaction if something goes wrong.   
        SET XACT_ABORT ON   
        BEGIN TRAN

        -- Increase performance and do not intefere with the results.
        SET NOCOUNT ON;

        -- Create a temporary table for storing the discovered nodes as the algorithm runs
        CREATE TABLE #Discovered
        (
             DiscoveredUser varchar(50) NOT NULL,    -- The Node Id
            Predecessor varchar(50) NULL,    -- The node we came from to get to this node.
            LinkStrength decimal(10,7) NULL, -- positive - from predecessor to  DiscoveredUser, negative - from  DiscoveredUser to predecessor
            OrderDiscovered int -- The order in which the nodes were discovered.
        )

        -- Initially, only the start node is discovered.
        INSERT INTO #Discovered ( DiscoveredUser, Predecessor, LinkStrength, OrderDiscovered)
        VALUES (@StartNode, NULL, NULL, 0)

        -- Add all nodes that we can get to from the current set of nodes,
        -- that are not already discovered. Run until no more nodes are discovered.
        WHILE @@ROWCOUNT > 0
        BEGIN
            -- If we have found the node we were looking for, abort now.
            IF @EndNode IS NOT NULL
                IF EXISTS (SELECT TOP 1 1 FROM #Discovered WHERE  DiscoveredUser = @EndNode)
                    BREAK   

            -- We need to group by ToNode and select one FromNode since multiple
            -- edges could lead us to new same node, and we only want to insert it once.
            INSERT INTO #Discovered ( DiscoveredUser, Predecessor, LinkStrength, OrderDiscovered)
            (SELECT mc.called_party, mc.calling_party, mc.link_strength, d.OrderDiscovered + 1
            FROM #Discovered d JOIN monthly_connections mc ON d. DiscoveredUser = mc.calling_party
            WHERE mc.called_party NOT IN (SELECT  DiscoveredUser From #Discovered) AND mc.link_strength > @LinkStrength
            UNION
            SELECT mc.calling_party, mc.called_party, mc.link_strength * (-1), d.OrderDiscovered + 1
            FROM #Discovered d JOIN monthly_connections mc ON d. DiscoveredUser = mc.called_party
            WHERE mc.calling_party NOT IN (SELECT  DiscoveredUser FROM #Discovered) AND mc.link_strength > @LinkStrength
            )
        END;

        -- Select the results. We use a recursive common table expression to
        -- get the full path from the start node to the current node.
        WITH BacktraceCTE(Predecessor,  DiscoveredUser, LinkStrength, OrderDiscovered, Path)
        AS
        (
            -- Anchor/base member of the recursion, this selects the start node.
            SELECT d.Predecessor, n. DiscoveredUser, d.LinkStrength, d.OrderDiscovered, 
                CAST(n. DiscoveredUser AS varchar(MAX))
            FROM #Discovered d JOIN users n ON d. DiscoveredUser = n. DiscoveredUser
            WHERE d. DiscoveredUser = @StartNode

            UNION ALL

            -- Recursive member, select all the nodes which have the previous
            -- one as their predecessor. Concat the paths together.
            SELECT d.Predecessor, n. DiscoveredUser, d.LinkStrength, d.OrderDiscovered,
                CAST(cte.Path + ',' + CAST(n. DiscoveredUser as varchar(30)) as varchar(MAX))
            FROM #Discovered d JOIN BacktraceCTE cte ON d.Predecessor = cte. DiscoveredUser
            JOIN users n ON d. DiscoveredUser = n. DiscoveredUser
        )

        SELECT Predecessor,  DiscoveredUser, LinkStrength, OrderDiscovered, Path FROM BacktraceCTE
        WHERE  DiscoveredUser = @EndNode OR @EndNode IS NULL -- This kind of where clause can potentially produce
        ORDER BY OrderDiscovered                -- a bad execution plan, but I use it for simplicity here.

        DROP TABLE #Discovered
        COMMIT TRAN
        RETURN 0
    END
Run Code Online (Sandbox Code Playgroud)

我目前正在分析的图形(社交网络)具有28M连接并且没有忽略弱连接(使用@LinkStrength设置阈值)执行运行很长时间(到目前为止我还没有得到任何结果并且会尝试让它运行过夜).

无论如何,下一步是计算两个用户之间的最短(最强)链接(大约有3M用户).我正在考虑使用Djikstra算法,但不确定是否有可能在我目前正在使用的PC上分析这样的网络(四核CPU 2.66 GHz,4GB RAM)并且数据存储在MS SQL Server 2008数据库中.

总结一下,我希望得到以下问题的一些答案/建议:

  1. 对于所述机器(2.66 GHz,4GB RAM)上描述的图形(28M连接,3M用户),是否可以执行如上所述的复杂查询?
  2. 如果不可能有任何其他可能的方法可以减少执行时间(例如创建具有部分结果的表)?
  3. 您是否建议使用任何其他算法来检测群集并计算所描述图形的最短路径?

谢谢!

aso*_*ove 0

如果你想要一个确切的答案,无论你追求广度优先还是深度优先并不重要。确切的答案需要详尽的搜索,这会很慢。

正如 fmark 所建议的,启发式方法可以帮助您找到具有合理确定性的潜在最大解决方案。它会节省你很多时间,但并不准确。

您必须选择速度或精确度,不能真正两者兼得。这就像照片(或声音或视频)的图像压缩:对于大多数自然场景照片,png 是无损的,但压缩程度不高,jpeg 压缩得很好,但有一些损失。

编辑1:我能想到的唯一可以帮助您进行精确搜索的是稀疏矩阵的数学理论。您的问题类似于将社会关系强度矩阵提升到一系列不同的幂(A 和 B 之间的幂 n = n 步)并找出每个 (A, B) 对的哪些单元格具有最高值。这就是您对查询所做的事情,仅数据库查询可能不是实现此目的的最快方法。

不过,我无法为你提供更多帮助。您可能想查看维基百科的稀疏矩阵

编辑2:我刚刚又想到了一件事。我不明白你如何剔除你知道使用 SQL 查询肯定会很弱的分支,而使用定制的算法来处理你的稀疏矩阵,应该很容易剔除你知道可以消除的分支,基于在你的力量模型上。