Is there a way to calculate correlation in T-SQL using the OVER clause instead of CTEs?

bpe*_*kes 7 sql t-sql sql-server correlation

Suppose you have a table with the columns Date, GroupID, X, and Y.

CREATE TABLE #sample
  (
     [Date]  DATETIME,
     GroupID INT,
     X       FLOAT,
     Y       FLOAT
  )

DECLARE @date DATETIME = getdate()

INSERT INTO #sample VALUES(@date, 1, 1,3)
INSERT INTO #sample VALUES(DATEADD(d, 1, @date), 1, 1,1)
INSERT INTO #sample VALUES(DATEADD(d, 2, @date), 1, 4,2)
INSERT INTO #sample VALUES(DATEADD(d, 3, @date), 1, 3,3)
INSERT INTO #sample VALUES(DATEADD(d, 4, @date), 1, 6,4)
INSERT INTO #sample VALUES(DATEADD(d, 5, @date), 1, 7,5)
INSERT INTO #sample VALUES(DATEADD(d, 6, @date), 1, 1,6)

And you want to calculate the correlation of X and Y for each group. Currently I use CTEs, which get a little messy:

;WITH DataAvgStd
     AS (SELECT GroupID,
                AVG(X)   AS XAvg,
                AVG(Y)   AS YAvg,
                STDEV(X) AS XStdev,
                STDEV(Y) AS YSTDev,
                COUNT(*) AS SampleSize
         FROM   #sample
         GROUP  BY GroupID),
     ExpectedVal
     AS (SELECT s.GroupID,
                SUM(( X - XAvg ) * ( Y - YAvg )) AS ExpectedValue
         FROM   #sample s
                JOIN DataAvgStd das
                  ON s.GroupID = das.GroupID
         GROUP  BY s.GroupID)
SELECT das.GroupID,
       ev.ExpectedValue / ( das.SampleSize - 1 ) / ( das.XStdev * das.YSTDev )
       AS
       Correlation
FROM   DataAvgStd das
       JOIN ExpectedVal ev
         ON das.GroupID = ev.GroupID

DROP TABLE #sample  

It seems like there should be a way to do this in one pass using OVER and PARTITION BY, without any subqueries. Ideally T-SQL would have a function so you could write:

SELECT GroupID, CORR(X, Y) OVER(PARTITION BY GroupID)
FROM #sample
GROUP BY GroupID

Ath*_*oud 9

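With this correlation formula you cannot avoid all of the nested queries, even if you use over(). The problem is that you cannot use both GROUP BY and over() at the same level of the query for this, and you cannot nest aggregate functions, e.g. sum(x - avg(x)). So in the best case, depending on your data, you will still need to keep at least the WITH.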

Your code would then look like this:

;WITH DataAvgStd
     AS (SELECT GroupID,
                STDEV(X) over(partition by GroupID) AS XStdev,
                STDEV(Y) over(partition by GroupID) AS YSTDev,
                COUNT(*) over(partition by GroupID) AS SampleSize,
                ( X - AVG(X) over(partition by GroupID)) * ( Y - AVG(Y) over(partition by GroupID)) AS ExpectedValue
         FROM   #sample s)         
SELECT distinct GroupID,
       SUM(ExpectedValue) over(partition by GroupID) / (SampleSize - 1 ) / ( XStdev * YSTDev ) AS Correlation
FROM DataAvgStd 
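As a possible variation on the query above, the outer DISTINCT plus windowed SUM can be swapped for a plain GROUP BY, because the window functions have already been evaluated inside the CTE by the time the outer query aggregates. This is only a sketch: it assumes #sample has been created and populated as in the question (and not yet dropped), that every group has at least two rows, and that X and Y are not constant within a group. The names Deviations and CrossDev are made up for illustration.

```sql
-- Sketch: window functions inside the CTE, ordinary aggregates outside.
-- Assumes #sample from the question still exists.
;WITH Deviations
     AS (SELECT GroupID,
                X,
                Y,
                -- per-row product of deviations from the group means
                ( X - AVG(X) OVER (PARTITION BY GroupID) )
                * ( Y - AVG(Y) OVER (PARTITION BY GroupID) ) AS CrossDev
         FROM   #sample)
SELECT GroupID,
       -- sample covariance divided by the product of sample standard deviations
       SUM(CrossDev) / ( COUNT(*) - 1 ) / ( STDEV(X) * STDEV(Y) ) AS Correlation
FROM   Deviations
GROUP  BY GroupID;
```

This still needs one CTE, so it does not remove the nesting; it only returns one row per group without relying on DISTINCT.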

An alternative is to use an equivalent formula for the correlation coefficient, as described on Wikipedia.
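For reference, here is a short derivation of why that form agrees with the STDEV-based calculation in the question. It assumes n rows per group, sample standard deviations s_x and s_y (which is what STDEV returns), and group means written with a bar; the notation is mine, not taken from the question.

```latex
\begin{align*}
r &= \frac{\sum_i (x_i-\bar{x})(y_i-\bar{y})}{(n-1)\,s_x\,s_y},
  \qquad s_x = \sqrt{\frac{\sum_i (x_i-\bar{x})^2}{n-1}},
  \quad s_y = \sqrt{\frac{\sum_i (y_i-\bar{y})^2}{n-1}} \\[4pt]
\sum_i (x_i-\bar{x})(y_i-\bar{y})
  &= \sum_i x_i y_i - n\,\bar{x}\,\bar{y}
   = \frac{n\sum_i x_i y_i - \sum_i x_i \sum_i y_i}{n} \\[4pt]
\sum_i (x_i-\bar{x})^2
  &= \frac{n\sum_i x_i^2 - \bigl(\sum_i x_i\bigr)^2}{n} \\[4pt]
r &= \frac{n\sum_i x_i y_i - \sum_i x_i \sum_i y_i}
          {\sqrt{n\sum_i x_i^2 - \bigl(\sum_i x_i\bigr)^2}\;
           \sqrt{n\sum_i y_i^2 - \bigl(\sum_i y_i\bigr)^2}}
\end{align*}
```

For the question's sample data (n = 7, sum of X = 23, sum of Y = 24, sum of XY = 86, sum of X squared = 113, sum of Y squared = 100) this gives r = 50 / sqrt(262 * 124), roughly 0.277. Only plain sums and a count appear in the final expression, so no aggregate has to be nested inside another.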

This can be written as:

SELECT GroupID,
       Correlation=(COUNT(*) * SUM(X * Y) - SUM(X) * SUM(Y)) / 
                   (SQRT(COUNT(*) * SUM(X * X) - SUM(X) * SUM(X))
                    * SQRT(COUNT(*) * SUM(Y * Y) - SUM(Y) * SUM(Y)))
FROM #sample s
GROUP BY GroupID;
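One thing to watch with the single-pass query is a group in which X or Y is constant (or which has only one row): the denominator is then zero and the division raises an error. A minimal sketch of one way to guard against that with NULLIF follows. The second group (GroupID = 2, constant X) is hypothetical data added purely to show the NULL result; it is not part of the question.

```sql
-- Sketch only: rebuilds the question's sample table and adds a
-- hypothetical second group whose X values are all the same.
IF OBJECT_ID('tempdb..#sample') IS NOT NULL
    DROP TABLE #sample;

CREATE TABLE #sample
  (
     [Date]  DATETIME,
     GroupID INT,
     X       FLOAT,
     Y       FLOAT
  );

DECLARE @date DATETIME = GETDATE();

INSERT INTO #sample ([Date], GroupID, X, Y)
VALUES (@date,                1, 1, 3),
       (DATEADD(d, 1, @date), 1, 1, 1),
       (DATEADD(d, 2, @date), 1, 4, 2),
       (DATEADD(d, 3, @date), 1, 3, 3),
       (DATEADD(d, 4, @date), 1, 6, 4),
       (DATEADD(d, 5, @date), 1, 7, 5),
       (DATEADD(d, 6, @date), 1, 1, 6),
       -- hypothetical group with constant X (zero variance)
       (@date,                2, 5, 1),
       (DATEADD(d, 1, @date), 2, 5, 2);

-- NULLIF turns a zero denominator into NULL, so the division
-- yields NULL for that group instead of a divide-by-zero error.
SELECT GroupID,
       Correlation = (COUNT(*) * SUM(X * Y) - SUM(X) * SUM(Y)) /
                     NULLIF(SQRT(COUNT(*) * SUM(X * X) - SUM(X) * SUM(X))
                            * SQRT(COUNT(*) * SUM(Y * Y) - SUM(Y) * SUM(Y)), 0)
FROM   #sample
GROUP  BY GroupID;

DROP TABLE #sample;
```

With this data, group 1 should come out at roughly 0.2774 and group 2 as NULL. The guard does not help if floating-point rounding makes one of the expressions under SQRT slightly negative, which can happen with nearly constant non-integer values; that case would need separate handling.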