Running functions in parallel

Chr*_*s L 4 performance sql-server sql-server-2012 functions

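This is for SQL Server 2012.

We have some import processes for FTP files: the files are picked up and read into a staging table, and from there we massage/check the data before it goes into production. One of the areas causing issues is dates. Some are valid, some are typos, and some are just plain gibberish.

I have the following example table: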

Create Table RawData
(
 InsertID int not null,
 MangledDateTime1 varchar(10) null,
 MangledDateTime2 varchar(10) null,
 MangledDateTime3 varchar(10) null
)

I also have a destination table (say, in production):

Create Table FinalData
(
  PrimaryKeyID int not null, -- PK constraint here, ident
  ForeighKeyID int not null, -- points to InsertID of RawData
  ValidDateTime1 SmallDateTime null,
  ValidDateTime2 SmallDateTime null,
  ValidDateTime3 SmallDateTime null
)

I insert the following into the RawData table:

 Insert Into RawData(InsertID, MangledDateTime1, MangledDateTime2, MangledDateTime3)
 Values(1, '20001010', '20800630', '00000000') -- First is legit, second two are not
 Insert Into RawData(InsertID, MangledDateTime1, MangledDateTime2, MangledDateTime3)
 Values(1, '20800630', '20130630', '20000000') -- middle is legit, first/third are not
 Insert Into RawData(InsertID, MangledDateTime1, MangledDateTime2, MangledDateTime3)
 Values(1, '00001010', '00800630', '20130630') -- Last is legit, first two are not

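I wrote a function, dbo.CreateDate, to address these issues. We try to clean up the data as best we can (using NULL if we can't), and then convert it to the correct data type (in this case smalldatetime).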

Insert Into FinalData(ForeighKeyID , ValidDateTime1, ValidDateTime2, ValidDateTime3)
Select 
 InsertID
 ,dbo.CreateDate(MangledDateTime1)
 ,dbo.CreateDate(MangledDateTime2)
 ,dbo.CreateDate(MangledDateTime3)
From RawData

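We're running into performance problems with some of our functions. I'm wondering if/how they work in parallel.

I'm assuming here that the function CreateDate runs in parallel as each row is inserted, so that each column/value has its "own" function call, running at the same time as the insert.

But I could be wrong: does it run serially over each column of each row as it inserts?

CreateDate() code: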

Alter Function dbo.CreateDate
(
@UnformattedString  varchar(12)
)
Returns smalldatetime
As
Begin
Declare @FormattedDate smalldatetime

If(@UnformattedString Is Not Null)
Begin
    Declare @MaxSmallDate varchar(8) = '20790606'


    -- We got gibberish
    If Len(@UnformattedString) = 1
    Begin
        return null
    End

    -- To account for date and time
    If Len(@UnformattedString) = 12
    Begin
        Select @UnformattedString = Substring(@UnformattedString, 0,9)
    End

    If @UnformattedString = '20000000'
    Begin
        Select @UnformattedSTring = @MaxSmallDate
    End

    -- Some people are sending us two digit years, won't parse right
    If Substring(@UnformattedString,0,3) = '00'
    Begin
        Select @UnformattedString = Replace(@UnformattedString, '00','20')
    End

    -- Some people are fat fingering in people born in 18??, so change to 19??
    If Substring(@UnformattedString,0,3) in ('18')
    Begin
        -- We only want to change the year '18', not day 18 
        SELECT @UnformattedString = STUFF(@UnformattedString, 
                           CHARINDEX('18', @UnformattedString), 2, '19')
    End

    -- We're getting gibberish
    If Substring(@UnformattedString,0,3) not in ('19','20') 
               And Len(@UnformattedString) != 6
    Begin
        Select @UnformattedString = Replace(@UnformattedString, 
                       Substring(@UnformattedString,0,3),'20')
    End

    -- If the 4 digit year is greater than current year, set to max date
    If Convert(int, Substring(@UnformattedString,0,5)) > Year(getdate())
    Begin
        Set @FormattedDate = CONVERT(smalldatetime,@MaxSmallDate,1)
    End
    -- If the 4 digit year is 100 or more years ago, set to max date
    Else If Year(getdate()) - Convert(int, Substring(@UnformattedString,0,5)) >= 100
    Begin
        Set @FormattedDate = CONVERT(smalldatetime,@MaxSmallDate,1)
    End
    Else -- valid date(we hope)
    Begin
        Set @FormattedDate = CONVERT(smalldatetime,@UnformattedString,1) 
    End

    
    
End

Return @FormattedDate
End
Go

SQL*_*Fox 10

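Using T-SQL scalar functions frequently leads to performance problems* because SQL Server makes a separate function call (using a whole new T-SQL context) for each row. In addition, parallel execution is disallowed for the whole query.

T-SQL scalar functions can also make performance problems hard to troubleshoot (whether or not those problems are caused by the function). The function is a "black box" to the query optimizer: it is assigned a fixed, low estimated cost, regardless of what the function actually contains.

For more on the pitfalls of scalar functions, see this and this.

You will probably be better off using the new TRY_CONVERT function in SQL Server 2012: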
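
As a starting point for finding affected code, the following sketch lists the functions in the current database together with how they are implemented. It relies only on the sys.objects catalog view; the column aliases are illustrative and not part of the original answer.

```sql
-- Sketch: list functions in the current database by implementation type.
-- FN = T-SQL scalar (called once per row, blocks parallelism on SQL Server 2012)
-- IF = inline table-valued (expanded into the calling query)
-- TF = multi-statement table-valued
-- FS = CLR scalar, FT = CLR table-valued
SELECT
    SchemaName   = SCHEMA_NAME(o.schema_id),
    FunctionName = o.name,
    o.type,
    o.type_desc
FROM sys.objects AS o
WHERE o.type IN ('FN', 'IF', 'TF', 'FS', 'FT')
ORDER BY o.type, SchemaName, FunctionName;
```

Rows with type FN that are called from large set-based queries, such as dbo.CreateDate in the question, are the likeliest candidates for a rewrite.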

SELECT
    InsertID,
    dt1 = TRY_CONVERT(smalldatetime, MangledDateTime1),
    dt2 = TRY_CONVERT(smalldatetime, MangledDateTime2),
    dt3 = TRY_CONVERT(smalldatetime, MangledDateTime3)
FROM dbo.RawData;

╔══════════╦═════════════════════╦═════════════════════╦═════════════════════╗
║ InsertID ║         dt1         ║         dt2         ║         dt3         ║
╠══════════╬═════════════════════╬═════════════════════╬═════════════════════╣
║        1 ║ 2000-10-10 00:00:00 ║ NULL                ║ NULL                ║
║        1 ║ NULL                ║ 2013-06-30 00:00:00 ║ NULL                ║
║        1 ║ NULL                ║ NULL                ║ 2013-06-30 00:00:00 ║
╚══════════╩═════════════════════╩═════════════════════╩═════════════════════╝
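
To see why TRY_CONVERT suits dirty input, here is a minimal standalone sketch that needs no tables. The column aliases and the literal 'xx13zz30' are made up for illustration; the other two literals come from the question's sample data.

```sql
-- Sketch: TRY_CONVERT returns NULL where CONVERT would raise an error.
SELECT
    ValidDate  = TRY_CONVERT(smalldatetime, '20130630'),  -- converts to 2013-06-30 00:00:00
    OutOfRange = TRY_CONVERT(smalldatetime, '20800630'),  -- NULL: beyond the smalldatetime maximum (2079-06-06)
    Gibberish  = TRY_CONVERT(smalldatetime, 'xx13zz30');  -- NULL: not a date at all

-- By contrast, this statement would fail with a conversion error:
-- SELECT CONVERT(smalldatetime, '20800630');
```

Because a failed conversion yields NULL rather than aborting the statement, one bad value no longer stops the whole import batch.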

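After the question edit

I see the function contains some specific logic. You could still use TRY_CONVERT as part of that, but you should definitely convert the scalar function to an in-line function. In-line functions (RETURNS TABLE) use a single SELECT statement, are expanded into the calling query, and are fully optimized in much the same way as views. It can be helpful to think of in-line functions as parameterized views.

For example, an approximate translation of the scalar function to an in-line version is: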

CREATE FUNCTION dbo.CleanDate
    (@UnformattedString  varchar(12))
RETURNS TABLE
AS RETURN
SELECT Result =
    -- Successful conversion or NULL after
    -- workarounds applied in CROSS APPLY
    -- clauses below
    TRY_CONVERT(smalldatetime, ca3.string)
FROM
(
    -- Logic starts here
    SELECT        
        CASE
            WHEN @UnformattedString IS NULL
                THEN NULL
            WHEN LEN(@UnformattedString) <= 1
                THEN NULL
            WHEN LEN(@UnformattedString) = 12
                THEN LEFT(@UnformattedString, 8)
            ELSE @UnformattedString
        END
) AS Input (string)
CROSS APPLY
(
    -- Next stage using result so far
    SELECT 
        CASE 
            WHEN @UnformattedString = '20000000' 
            THEN '20790606' 
            ELSE Input.string
        END
) AS ca1 (string)
CROSS APPLY 
(
    -- Next stage using result so far
    SELECT CASE
        WHEN LEFT(ca1.string, 2) = '00' THEN '20' + RIGHT(ca1.string, 6)
        WHEN LEFT(ca1.string, 2) = '18' THEN '19' + RIGHT(ca1.string, 6)
        WHEN LEFT(ca1.string, 2) = '19' THEN ca1.string
        WHEN LEFT(ca1.string, 2) = '20' THEN ca1.string
        WHEN LEN(ca1.string) <> 6 THEN '20' + RIGHT(ca1.string, 6)
        ELSE ca1.string
    END
) AS ca2 (string)
CROSS APPLY
(
    -- Next stage using result so far
    SELECT
        CASE 
            WHEN TRY_CONVERT(integer, LEFT(ca2.string, 4)) > YEAR(GETDATE())
                THEN '20790606'
            WHEN YEAR(GETDATE()) - TRY_CONVERT(integer, LEFT(ca2.string, 4)) >= 100
                THEN '20790606'
            ELSE ca2.string
        END
) AS ca3 (string);

The function used on the sample data:

SELECT
    InsertID,
    Result1 = CD1.Result,
    Result2 = CD2.Result,
    Result3 = CD3.Result
FROM dbo.RawData AS RD
CROSS APPLY dbo.CleanDate(RD.MangledDateTime1) AS CD1
CROSS APPLY dbo.CleanDate(RD.MangledDateTime2) AS CD2
CROSS APPLY dbo.CleanDate(RD.MangledDateTime3) AS CD3;

Output:

╔══════════╦═════════════════════╦═════════════════════╦═════════════════════╗
║ InsertID ║       Result1       ║       Result2       ║       Result3       ║
╠══════════╬═════════════════════╬═════════════════════╬═════════════════════╣
║        1 ║ 2000-10-10 00:00:00 ║ 2079-06-06 00:00:00 ║ NULL                ║
║        1 ║ 2079-06-06 00:00:00 ║ 2013-06-30 00:00:00 ║ 2079-06-06 00:00:00 ║
║        1 ║ 2000-10-10 00:00:00 ║ 2079-06-06 00:00:00 ║ 2013-06-30 00:00:00 ║
╚══════════╩═════════════════════╩═════════════════════╩═════════════════════╝
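
To tie this back to the original load, the insert from the question could be rewritten along these lines. This is a sketch that assumes dbo.CleanDate has been created as above and that PrimaryKeyID is an identity column, as the comment in the table definition suggests; the column name ForeighKeyID is spelled as in the question's schema.

```sql
-- Sketch: the question's INSERT, using the in-line function instead of
-- the scalar dbo.CreateDate. Each CROSS APPLY returns exactly one row,
-- so the row count of RawData is preserved.
INSERT INTO dbo.FinalData
    (ForeighKeyID, ValidDateTime1, ValidDateTime2, ValidDateTime3)
SELECT
    RD.InsertID,
    CD1.Result,
    CD2.Result,
    CD3.Result
FROM dbo.RawData AS RD
CROSS APPLY dbo.CleanDate(RD.MangledDateTime1) AS CD1
CROSS APPLY dbo.CleanDate(RD.MangledDateTime2) AS CD2
CROSS APPLY dbo.CleanDate(RD.MangledDateTime3) AS CD3;
```

Since the in-line function is expanded into the statement, the optimizer sees the CASE expressions directly and there is no per-row function call.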

* CLR scalar functions have a much faster invocation path than T-SQL scalar functions and do not prevent parallelism.