Chr*_*s L 4 performance sql-server sql-server-2012 functions
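This applies to SQL Server 2012.
We have some import processes for FTP files: the files are picked up and read into a staging table, and from there we massage and check the data before moving it into production. One of the areas causing problems is dates. Some are valid, some are typos, and some are just plain gibberish.
I have the following example table: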
Create Table RawData
(
InsertID int not null,
MangledDateTime1 varchar(10) null,
MangledDateTime2 varchar(10) null,
MangledDateTime3 varchar(10) null
)
I also have a destination table (say, in production):
Create Table FinalData
(
PrimaryKeyID int not null, -- PK constraint here, ident
ForeighKeyID int not null, -- points to InsertID of RawData
ValidDateTime1 SmallDateTime null,
ValidDateTime2 SmallDateTime null,
ValidDateTime3 SmallDateTime null
)
I insert the following into the RawData table:
Insert Into RawData(InsertID, MangledDateTime1, MangledDateTime2, MangledDateTime3)
Values(1, '20001010', '20800630', '00000000') -- First is legit, second two are not
Insert Into RawData(InsertID, MangledDateTime1, MangledDateTime2, MangledDateTime3)
Values(1, '20800630', '20130630', '20000000') -- middle is legit, first/third are not
Insert Into RawData(InsertID, MangledDateTime1, MangledDateTime2, MangledDateTime3)
Values(1, '00001010', '00800630', '20130630') -- Last is legit, first two are not
I wrote a function, dbo.CreateDate, to address this. We try to clean up the data as best we can (using NULL if we can't), then convert it to the correct data type (in this case smalldatetime).
Insert Into FinalData(ForeighKeyID , ValidDateTime1, ValidDateTime2, ValidDateTime3)
Select
InsertID
,dbo.CreateDate(MangledDateTime1)
,dbo.CreateDate(MangledDateTime2)
,dbo.CreateDate(MangledDateTime3)
From RawData
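We're running into performance problems with some of these functions. I'm wondering if, and how, they work in parallel.
I'm assuming here that the function CreateDate runs in parallel as each row is inserted, so that each column/value gets its "own" function call and those calls run at the same time as the insert.
But I could be wrong: is it run serially over each column in each row as it inserts?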
Alter Function dbo.CreateDate
(
@UnformattedString varchar(12)
)
Returns smalldatetime
As
Begin
Declare @FormattedDate smalldatetime
If(@UnformattedString Is Not Null)
Begin
Declare @MaxSmallDate varchar(8) = '20790606'
-- We got gibberish
If Len(@UnformattedString) = 1
Begin
return null
End
-- To account for date and time
If Len(@UnformattedString) = 12
Begin
Select @UnformattedString = Substring(@UnformattedString, 0,9)
End
If @UnformattedString = '20000000'
Begin
Select @UnformattedSTring = @MaxSmallDate
End
-- Some people are sending us two digit years, won't parse right
If Substring(@UnformattedString,0,3) = '00'
Begin
Select @UnformattedString = Replace(@UnformattedString, '00','20')
End
-- Some people are fat fingering in people born in 18??, so change to 19??
If Substring(@UnformattedString,0,3) in ('18')
Begin
-- We only want to change the year '18', not day 18
SELECT @UnformattedString = STUFF(@UnformattedString,
CHARINDEX('18', @UnformattedString), 2, '19')
End
-- We're getting gibberish
If Substring(@UnformattedString,0,3) not in ('19','20')
And Len(@UnformattedString) != 6
Begin
Select @UnformattedString = Replace(@UnformattedString,
Substring(@UnformattedString,0,3),'20')
End
-- If the 4 digit year is greater than current year, set to max date
If Convert(int, Substring(@UnformattedString,0,5)) > Year(getdate())
Begin
Set @FormattedDate = CONVERT(smalldatetime,@MaxSmallDate,1)
End
-- If the 4 digit year is 100 or more years ago, set to max date
Else If Year(getdate()) - Convert(int, Substring(@UnformattedString,0,5)) >= 100
Begin
Set @FormattedDate = CONVERT(smalldatetime,@MaxSmallDate,1)
End
Else -- valid date(we hope)
Begin
Set @FormattedDate = CONVERT(smalldatetime,@UnformattedString,1)
End
End
Return @FormattedDate
End
Go
SQL*_*Fox 10
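Using T-SQL scalar functions frequently leads to performance problems* because SQL Server makes a separate function call (using a whole new T-SQL context) for each row. In addition, parallel execution is not allowed for the whole query.
T-SQL scalar functions can also make it difficult to troubleshoot performance problems (whether or not those problems are caused by the function). The function is a "black box" to the query optimizer: it is assigned a fixed, low estimated cost, regardless of what the function actually does.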
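One way to see this for yourself is to capture the actual execution plan for the scalar-function query and check whether any parallelism operators appear. The following is only a sketch: it reuses dbo.CreateDate and dbo.RawData from the question, and the column aliases are invented for illustration. On a table as small as the sample, the plan would be serial regardless of the function, so this is only informative against a realistically sized staging table.

```sql
-- Sketch: return the actual plan (as showplan XML) alongside the results.
SET STATISTICS XML ON;

SELECT
    RD.InsertID,
    d1 = dbo.CreateDate(RD.MangledDateTime1),  -- illustrative aliases
    d2 = dbo.CreateDate(RD.MangledDateTime2),
    d3 = dbo.CreateDate(RD.MangledDateTime3)
FROM dbo.RawData AS RD;

SET STATISTICS XML OFF;

-- In the returned plan, look for the absence of Parallelism operators.
-- Assumption: depending on version and build, the QueryPlan element may
-- also carry a NonParallelPlanReason attribute; its exact value varies.
```

Running the same capture against a query that avoids the scalar function gives a plan to compare against.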
You might do better with the new TRY_CONVERT function in SQL Server 2012:
SELECT
InsertID,
dt1 = TRY_CONVERT(smalldatetime, MangledDateTime1),
dt2 = TRY_CONVERT(smalldatetime, MangledDateTime2),
dt3 = TRY_CONVERT(smalldatetime, MangledDateTime3)
FROM dbo.RawData;
╔══════════╦═════════════════════╦═════════════════════╦═════════════════════╗
║ InsertID ║         dt1         ║         dt2         ║         dt3         ║
╠══════════╬═════════════════════╬═════════════════════╬═════════════════════╣
║        1 ║ 2000-10-10 00:00:00 ║ NULL                ║ NULL                ║
║        1 ║ NULL                ║ 2013-06-30 00:00:00 ║ NULL                ║
║        1 ║ NULL                ║ NULL                ║ 2013-06-30 00:00:00 ║
╚══════════╩═════════════════════╩═════════════════════╩═════════════════════╝
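I see the function contains some specific logic. You could still use TRY_CONVERT as part of that, but you should definitely convert the scalar function to an in-line function. In-line functions (RETURNS TABLE) use a single SELECT statement, are expanded into the calling query, and are fully optimized in much the same way as views. It can be helpful to think of in-line functions as parameterized views.
For example, an approximate translation of the scalar function to an in-line version is: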
CREATE FUNCTION dbo.CleanDate
(@UnformattedString varchar(12))
RETURNS TABLE
AS RETURN
SELECT Result =
-- Successful conversion or NULL after
-- workarounds applied in CROSS APPLY
-- clauses below
TRY_CONVERT(smalldatetime, ca3.string)
FROM
(
-- Logic starts here
SELECT
CASE
WHEN @UnformattedString IS NULL
THEN NULL
WHEN LEN(@UnformattedString) <= 1
THEN NULL
WHEN LEN(@UnformattedString) = 12
THEN LEFT(@UnformattedString, 8)
ELSE @UnformattedString
END
) AS Input (string)
CROSS APPLY
(
-- Next stage using result so far
SELECT
CASE
WHEN Input.string = '20000000'
THEN '20790606'
ELSE Input.string
END
) AS ca1 (string)
CROSS APPLY
(
-- Next stage using result so far
SELECT CASE
WHEN LEFT(ca1.string, 2) = '00' THEN '20' + RIGHT(ca1.string, 6)
WHEN LEFT(ca1.string, 2) = '18' THEN '19' + RIGHT(ca1.string, 6)
WHEN LEFT(ca1.string, 2) = '19' THEN ca1.string
WHEN LEFT(ca1.string, 2) = '20' THEN ca1.string
WHEN LEN(ca1.string) <> 6 THEN '20' + RIGHT(ca1.string, 6)
ELSE ca1.string
END
) AS ca2 (string)
CROSS APPLY
(
-- Next stage using result so far
SELECT
CASE
WHEN TRY_CONVERT(integer, LEFT(ca2.string, 4)) > YEAR(GETDATE())
THEN '20790606'
WHEN YEAR(GETDATE()) - TRY_CONVERT(integer, LEFT(ca2.string, 4)) >= 100
THEN '20790606'
ELSE ca2.string
END
) AS ca3 (string);
Using the function on the sample data:
SELECT
InsertID,
Result1 = CD1.Result,
Result2 = CD2.Result,
Result3 = CD3.Result
FROM dbo.RawData AS RD
CROSS APPLY dbo.CleanDate(RD.MangledDateTime1) AS CD1
CROSS APPLY dbo.CleanDate(RD.MangledDateTime2) AS CD2
CROSS APPLY dbo.CleanDate(RD.MangledDateTime3) AS CD3;
Output:
╔══════════╦═════════════════════╦═════════════════════╦═════════════════════╗
║ InsertID ║       Result1       ║       Result2       ║       Result3       ║
╠══════════╬═════════════════════╬═════════════════════╬═════════════════════╣
║        1 ║ 2000-10-10 00:00:00 ║ 2079-06-06 00:00:00 ║ NULL                ║
║        1 ║ 2079-06-06 00:00:00 ║ 2013-06-30 00:00:00 ║ 2079-06-06 00:00:00 ║
║        1 ║ 2000-10-10 00:00:00 ║ 2079-06-06 00:00:00 ║ 2013-06-30 00:00:00 ║
╚══════════╩═════════════════════╩═════════════════════╩═════════════════════╝
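The same pattern should carry over to the original load into FinalData. The sketch below assumes that PrimaryKeyID is an IDENTITY column (as the comment in the question's table definition suggests, though the DDL shown does not declare it), so it is left out of the column list. The column name ForeighKeyID is spelled as in the question.

```sql
-- Sketch: load the cleaned dates using the in-line function
-- instead of the scalar dbo.CreateDate.
-- Assumes PrimaryKeyID is populated automatically (IDENTITY).
INSERT INTO dbo.FinalData
    (ForeighKeyID, ValidDateTime1, ValidDateTime2, ValidDateTime3)
SELECT
    RD.InsertID,
    CD1.Result,
    CD2.Result,
    CD3.Result
FROM dbo.RawData AS RD
CROSS APPLY dbo.CleanDate(RD.MangledDateTime1) AS CD1
CROSS APPLY dbo.CleanDate(RD.MangledDateTime2) AS CD2
CROSS APPLY dbo.CleanDate(RD.MangledDateTime3) AS CD3;
```

Because dbo.CleanDate always returns exactly one row (with Result set to NULL when cleaning fails), CROSS APPLY does not drop any rows from RawData here.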
* CLR scalar functions have a much faster invocation path than T-SQL scalar functions and do not prevent parallelism.