What I need: I am incrementally loading 25 million records from Oracle into SQL Server 2012. The package needs to handle updates, deletes, and new records. The Oracle data source is constantly changing.
What I have: I have done this many times before, but never with more than 10 million records. First, I have an [Execute SQL Task] that is set up to return a result set containing the [Max Modified Date]. I then have a query that pulls only the data from the [ORACLE SOURCE] where the modified date is greater than the [Max Modified Date], and performs a lookup against my destination table.
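Roughly, the two queries look something like this (the object names, column names, and the parameter binding are only illustrative placeholders, not my actual objects):

-- [Execute SQL Task]: capture the high-water mark into an SSIS variable
SELECT MAX(LAST_MODIFIED_DATE) AS MaxModifiedDate
FROM dbo.DestinationTable;

-- [ORACLE Source] query, filtered by that variable bound as a parameter
SELECT ID, CASE_NUMBER, LAST_MODIFIED_DATE
FROM SOURCE_TABLE
WHERE LAST_MODIFIED_DATE > :MaxModifiedDate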
The [ORACLE Source] connects to a [Lookup - Destination table] with the lookup set to NO CACHE mode; I get errors if I use partial or full cache mode, which I assume is because the [ORACLE Source] is always changing. The [Lookup] then connects to a [Conditional Split], where I enter expressions like the ones below.
(REPLACENULL(ORACLE.ID,"") != REPLACENULL(Lookup.ID,""))
|| (REPLACENULL(ORACLE.CASE_NUMBER,"")
    != REPLACENULL(Lookup.CASE_NUMBER,""))
I then write the rows from the [Conditional Split] output to a staging table. After that, I add an [Execute SQL Task] that updates the DESTINATION-TABLE with the following query:
UPDATE SD
SET SD.CASE_NUMBER = UP.CASE_NUMBER,
    SD.ID = UP.ID
FROM Destination SD
JOIN STAGING.TABLE UP
    ON UP.ID = SD.ID
The problem: This gets very slow and takes forever, and it just keeps running. How can I improve the run time and make this work? Should I use a Cache Transform? Should I use a MERGE statement?
Also, how do I use the REPLACENULL expression in the Conditional Split when the column is a date column? Would I use something like this:
(REPLACENULL(ORACLE.LAST_MODIFIED_DATE,"01-01-1900 00:00:00.000")
!= REPLACENULL(Lookup.LAST_MODIFIED_DATE,"01-01-1900 00:00:00.000"))
(A screenshot of the package was attached here but is not reproduced.)
For larger data sets, a pattern that is usually faster is to load the source data into a local staging table and then use a query like the one below to identify the new records:
SELECT column1, column2
FROM StagingTable SRC
WHERE NOT EXISTS (
    SELECT * FROM TargetTable TGT
    WHERE TGT.MatchKey = SRC.MatchKey
)
Then you simply feed that data set into an insert:
INSERT INTO TargetTable (column1, column2)
SELECT column1, column2
FROM StagingTable SRC
WHERE NOT EXISTS (
    SELECT * FROM TargetTable TGT
    WHERE TGT.MatchKey = SRC.MatchKey
)
Updates look like this:
UPDATE TGT
SET
    column1 = SRC.column1,
    column2 = SRC.column2,
    DTUpdated = GETDATE()
FROM TargetTable TGT
JOIN StagingTable SRC
    ON TGT.MatchKey = SRC.MatchKey
Note the additional column DTUpdated. You should always have a "last updated" column in your tables to help with auditing and debugging.
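If the table does not already have such a column, adding one can be as simple as the following (the table, column, and constraint names here are just placeholders):

-- Add a "last updated" audit column with a default that also fills existing rows
ALTER TABLE TargetTable
    ADD DTUpdated DATETIME NOT NULL
        CONSTRAINT DF_TargetTable_DTUpdated DEFAULT (GETDATE());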
This is an insert/update approach. There are other data-loading approaches, such as windowing (picking a trailing window of data to delete and reload in full), but that approach depends on how your system works and whether you can make assumptions about the data (i.e. that data posted in the source is never changed later).
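As a rough illustration of the windowing idea, assuming a LoadDate column and a seven-day trailing window (both of which are assumptions, not details from your system):

-- Delete the trailing window in the target...
DELETE FROM TargetTable
WHERE LoadDate >= DATEADD(DAY, -7, CAST(GETDATE() AS DATE));

-- ...then reload that same window from staging
INSERT INTO TargetTable (column1, column2, LoadDate)
SELECT column1, column2, LoadDate
FROM StagingTable
WHERE LoadDate >= DATEADD(DAY, -7, CAST(GETDATE() AS DATE));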
You can collapse the separate INSERT and UPDATE statements into a single MERGE statement, although it gets very unwieldy, I have run into performance issues with it, and there are other documented problems with MERGE.
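For completeness, a minimal MERGE over the same StagingTable/TargetTable/MatchKey sketch would look roughly like this; treat it as a starting point and test it carefully given the caveats above:

MERGE TargetTable AS TGT
USING StagingTable AS SRC
    ON TGT.MatchKey = SRC.MatchKey
WHEN MATCHED THEN
    UPDATE SET column1 = SRC.column1,
               column2 = SRC.column2,
               DTUpdated = GETDATE()
WHEN NOT MATCHED BY TARGET THEN
    INSERT (column1, column2, DTUpdated)
    VALUES (SRC.column1, SRC.column2, GETDATE());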
Unfortunately, there's not a good way to do what you're trying to do. SSIS has some controls and documented ways to do this, but as you have found they don't work as well when you start dealing with large amounts of data.
At a previous job, we had something similar that we needed to do. We needed to update medical claims from a source system to another system, similar to your setup. For a very long time, we just truncated everything in the destination and rebuilt every night. I think we were doing this daily with more than 25M rows. If you're able to transfer all the rows from Oracle to SQL in a decent amount of time, then truncating and reloading may be an option.
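If you do go that route, the SQL side is just a truncate followed by a straight reload of everything from the source (the table name below is a placeholder):

TRUNCATE TABLE dbo.DestinationTable;
-- ...then bulk-load every row from the Oracle source in the data flow,
-- ideally using the OLE DB Destination's fast-load options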
We eventually had to get away from this as our volumes grew, however. We tried to do something along the lines of what you're attempting, but never got anything we were satisfied with. We ended up with a sort of non-conventional process. First, each medical claim had a unique numeric identifier. Second, whenever the medical claim was updated in the source system, there was an incremental ID on the individual claim that was also incremented.
Step one of our process was to bring over any new medical claims, or claims that had changed. We could determine this quite easily, since the unique ID and the "change ID" column were both indexed in source and destination. These records would be inserted directly into the destination table.
The second step was our "deletes", which we handled with a logical flag on the records. For actual deletes, where records existed in destination but were no longer in source, I believe it was actually fastest to do this by selecting the DISTINCT claim numbers from the source system and placing them in a temporary table on the SQL side. Then, we simply did a LEFT JOIN update to set the missing claims to logically deleted.
We did something similar with our updates: if a newer version of the claim was brought over by our original Lookup, we would logically delete the old one. Every so often we would clean up the logical deletes and actually delete them, but since the logical delete indicator was indexed, this didn't need to be done too frequently. We never saw much of a performance hit, even when the logically deleted records numbered in the tens of millions.
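A rough sketch of that logical-delete step, with assumed names (StagingClaims, DestinationClaims, ClaimNumber, IsDeleted) since ours were different:

-- Stage the distinct claim numbers currently present in the source
SELECT DISTINCT ClaimNumber
INTO #SourceClaims
FROM StagingClaims;

-- Logically delete destination claims that no longer exist in the source
UPDATE TGT
SET TGT.IsDeleted = 1
FROM DestinationClaims AS TGT
LEFT JOIN #SourceClaims AS SRC
    ON SRC.ClaimNumber = TGT.ClaimNumber
WHERE SRC.ClaimNumber IS NULL
  AND TGT.IsDeleted = 0;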
This process was always evolving as our server loads and data source volumes changed, and I suspect the same may be true for your process. Because every system and setup is different, some of the things that worked well for us may not work for you, and vice versa. I know our data center was relatively good and we were on some stupid fast flash storage, so truncating and reloading worked for us for a very, very long time. This may not be true on conventional storage, where your data interconnects are not as fast, or where your servers are not colocated.
When designing your process, keep in mind that deletes are one of the more expensive operations you can perform, followed by updates and by non-bulk inserts, respectively.