是否可以在没有 UNION 的情况下组合 7 个源?

Mat*_*don 6 normalization sql-server ssis sql-server-2014

我有一个源表,它看起来基本上是这样的:

  • 员工代码
  • 周开始日期
  • 工作时间Day1
  • 工作时间Day2
  • 工作时间Day3
  • 工作时间Day4
  • 工作时间Day5
  • 工作时间Day6
  • 工作时间Day7

实际的表有类似于 500 个编号的列(并没有真正计算它们 - 有各种各样的编号为 1-7 的字段,然后是另一个编号为 1-25,乘以 7 的字段)每个工作日(不,这不是我的设计) ,目前大约有 38,600 行(每周增加)。

我有一个 SSIS 包,它试图标准化这些数据......目前看起来像这样:

联合所有

每个“源”从同一个源表中选择一组编号的列,UNION ALL 组件将 7 个源合并为一个,从而产生大约 258,900 行。

工作流的其余部分添加一些计算列,查找代理键(例如EmployeeCode用于查找EmployeeId,然后计算日期并用于查找 a TimeId),然后“修改”的行得到更新和“新的”被插入到规范化表中;未更改的行最终无处可去。

有没有更好的方法(例如减轻内存压力)来规范化源数据?

Han*_*non 11

没有完整的表定义,很难提供完美的答案。但是,为了在有限的重现中显示差异,使用非常少量的数据,我创建了以下测试台:

IF OBJECT_ID('tempdb..#src') IS NOT NULL
DROP TABLE #src;
CREATE TABLE #src
(
    EmployeeCode INT NOT NULL
    , WeekStartDate DATE NOT NULL
    , HoursDay1 INT NOT NULL
    , HoursDay2 INT NOT NULL
    , HoursDay3 INT NOT NULL
    , HoursDay4 INT NOT NULL
    , HoursDay5 INT NOT NULL
    , HoursDay6 INT NOT NULL
    , HoursDay7 INT NOT NULL
    , Widget1Day1 INT NOT NULL
    , Widget1Day2 INT NOT NULL
    , Widget1Day3 INT NOT NULL
    , Widget1Day4 INT NOT NULL
    , Widget1Day5 INT NOT NULL
    , Widget1Day6 INT NOT NULL
    , Widget1Day7 INT NOT NULL
    , Widget2Day1 INT NOT NULL
    , Widget2Day2 INT NOT NULL
    , Widget2Day3 INT NOT NULL
    , Widget2Day4 INT NOT NULL
    , Widget2Day5 INT NOT NULL
    , Widget2Day6 INT NOT NULL
    , Widget2Day7 INT NOT NULL
    , PRIMARY KEY CLUSTERED (WeekStartDate, EmployeeCode)
);
GO
INSERT INTO #src (EmployeeCode, WeekStartDate
    , HoursDay1, HoursDay2, HoursDay3, HoursDay4, HoursDay5, HoursDay6, HoursDay7
    , Widget1Day1, Widget1Day2, Widget1Day3, Widget1Day4, Widget1Day5, Widget1Day6, Widget1Day7
    , Widget2Day1, Widget2Day2, Widget2Day3, Widget2Day4, Widget2Day5, Widget2Day6, Widget2Day7
    )
VALUES (1, '2016-09-18'
    , 0, 8, 8, 8, 8, 8, 0
    , ABS(CHECKSUM(NEWID())), ABS(CHECKSUM(NEWID())), ABS(CHECKSUM(NEWID())), ABS(CHECKSUM(NEWID())), ABS(CHECKSUM(NEWID())), ABS(CHECKSUM(NEWID())), ABS(CHECKSUM(NEWID()))
    , ABS(CHECKSUM(NEWID())), ABS(CHECKSUM(NEWID())), ABS(CHECKSUM(NEWID())), ABS(CHECKSUM(NEWID())), ABS(CHECKSUM(NEWID())), ABS(CHECKSUM(NEWID())), ABS(CHECKSUM(NEWID()))
    );
Run Code Online (Sandbox Code Playgroud)

下面我们比较两个查询;第一个使用CROSS APPLY我在SQLServerScience.com 上详细介绍的方法,第二个使用该UNION ALL方法。

SELECT s.WeekStartDate
    , s.EmployeeCode
    , ItemsByDay.DayOfWeekName
    , ItemsByDay.HoursWorked
    , ItemsByDay.Widget1
    , ItemsByDay.Widget2
FROM #src s
CROSS APPLY (VALUES ('Sunday', HoursDay1, Widget1Day1, Widget2Day1)
    , ('Monday', HoursDay2, Widget1Day2, Widget2Day2)
    , ('Tuesday', HoursDay3, Widget1Day3, Widget2Day3)
    , ('Wednesday', HoursDay4, Widget1Day4, Widget2Day4)
    , ('Thursday', HoursDay5, Widget1Day5, Widget2Day5)
    , ('Friday', HoursDay6, Widget1Day6, Widget2Day6)
    , ('Saturday', HoursDay7, Widget1Day7, Widget2Day7)
    ) ItemsByDay(DayOfWeekName, HoursWorked, Widget1, Widget2);


SELECT s.EmployeeCode
    , s.WeekStartDate
    , 'Sunday'
    , s.HoursDay1
    , s.Widget1Day1
    , s.Widget2Day1
FROM #src s
UNION ALL
SELECT s.EmployeeCode
    , s.WeekStartDate
    , 'Monday'
    , s.HoursDay2
    , s.Widget1Day2
    , s.Widget2Day2
FROM #src s
UNION ALL
SELECT s.EmployeeCode
    , s.WeekStartDate
    , 'Tuesday'
    , s.HoursDay3
    , s.Widget1Day3
    , s.Widget2Day3
FROM #src s
UNION ALL
SELECT s.EmployeeCode
    , s.WeekStartDate
    , 'Wednesday'
    , s.HoursDay4
    , s.Widget1Day4
    , s.Widget2Day4
FROM #src s
UNION ALL
SELECT s.EmployeeCode
    , s.WeekStartDate
    , 'Thursday'
    , s.HoursDay5
    , s.Widget1Day5
    , s.Widget2Day5
FROM #src s
UNION ALL
SELECT s.EmployeeCode
    , s.WeekStartDate
    , 'Friday'
    , s.HoursDay6
    , s.Widget1Day6
    , s.Widget2Day6
FROM #src s
UNION ALL
SELECT s.EmployeeCode
    , s.WeekStartDate
    , 'Saturday'
    , s.HoursDay7
    , s.Widget1Day7
    , s.Widget2Day7
FROM #src s;
Run Code Online (Sandbox Code Playgroud)

首先要注意的CROSS APPLY是, 更容易查看。这已经让我很高兴了。

让我们检查两个变体的执行计划:

在此处输入图片说明

UNION ALL变体扫描源表 7 次,而CROSS APPLY使用单表扫描。通过使用交叉申请,我们就是#Winning。

让我们添加更多数据:

/* create a table with 2 years worth of week start dates */
IF OBJECT_ID('tempdb..#Weeks') IS NULL
BEGIN
    CREATE TABLE #Weeks
    (
        WeekStart DATE NOT NULL
        PRIMARY KEY CLUSTERED
    );

    INSERT INTO #Weeks (WeekStart)
    SELECT TOP(104) DATEADD(DAY, (ROW_NUMBER() OVER (ORDER BY o1.name) - 1) * 7, '2016-01-03')
    FROM sys.objects o1
        CROSS JOIN sys.objects o2;
END

/* remove the single row from the source table we inserted above */
TRUNCATE TABLE #src;

/* insert a load of rows into the #src table */
INSERT INTO #src (EmployeeCode, WeekStartDate, HoursDay1, HoursDay2, HoursDay3, HoursDay4, HoursDay5, HoursDay6, HoursDay7)
SELECT ABS(CHECKSUM(NEWID()))
    , w.WeekStart
    , 0, 8, 8, 8, 8, 8, 0
FROM #Weeks w
    CROSS JOIN sys.objects o1;
Run Code Online (Sandbox Code Playgroud)

在我的系统上,上面的代码生成了大约 85,000 行。这两个查询的计划现在是:

在此处输入图片说明

SQL Sentry Plan Explorer显示以下摘要信息,这些信息非常宝贵:

在此处输入图片说明

这表示 CPU 被 更密集地使用CROSS APPLY,但是UNION ALL变体使用的 I/O 是 7 倍。

  • 太棒了,从未见过像这样使用“CROSS APPLY”!FWIW 这是我在这里合并的 118x7 列(我知道,这完全不雅!);单表扫描在 48 秒内发出了 270,179 行 - 我正在将该查询推入一个视图并现在将其用作我的 SSIS 源! (2认同)