Replace a sequence of consecutive digits with a special character

Lev*_*evi 3 sql-server pattern-matching sql-server-2014 string-manipulation query-performance

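I have a varchar(200) column that contains entries such as:

ABC123124_A12312 ABC123_A1212 ABC123124_B12312 AC123124_AD12312 A12312_123 and so on.

I want to replace each sequence of digits with a single * so that I can group the distinct non-numeric patterns in the table.

For this set the result would be: ABC*_A* ABC*_B* AC*_AD* A*_*

I wrote the crude query below. It works, but it takes a very long time to run on a large table.

I need help rewriting or editing it to improve its performance. I am on SQL Server 2014.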

-- 1. replace all numeric characters with '*'
-- 2. replace multiple consecutive '*' with just a single '*'
SELECT REPLACE
        (REPLACE
             (REPLACE
                  (REPLACE
                       (REPLACE
                            (REPLACE
                                 (REPLACE
                                      (REPLACE
                                           (REPLACE
                                                (REPLACE
                                                     (REPLACE
                                                          (REPLACE
                                                               (REPLACE(SampleID, '0', '*'),
                                                                '1', '*'),
                                                           '2', '*'),
                                                      '3', '*'),
                                                 '4', '*'),
                                            '5', '*'),
                                       '6', '*'),
                                  '7', '*'),
                             '8', '*'),
                        '9', '*')
                  , '*', '~*') -- replace each occurrence of '*' with '~*' (token plus asterisk)
             , '*~', '') -- replace in the result of the previous step each occurrence of '*~' (asterisk plus token) with '' (an empty string)
        , '~*', '*') -- replace in the result of the previous step each occurrence of '~*' (token plus asterisk) with '*' (asterisk)
        AS Pattern
FROM TABLE_X

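Data

The column contains letters and digits [A-Za-z0-9] and may also contain the special characters / and _. I want to replace every sequence of digits with *, but I do not know whether an entry contains special characters or, if it does, how many.

I also do not know how many digit sequences an entry contains. I only know that each entry must have at least one.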

Pau*_*ite 12

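Two factors matter for performance:

  1. Reduce the number of string operations.

    You may find you can implement what you need using, for example, CHARINDEX and PATINDEX to find the start and end of each group, rather than performing many REPLACE operations on the whole string each time.

  2. Use the cheapest collation that gives correct results.

    Binary collations are the cheapest. SQL collations (for non-Unicode data only) are somewhat more expensive. Windows collations are much more expensive.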
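To make point 1 concrete, here is a minimal sketch of locating the start and end of each digit group with PATINDEX and replacing the group with STUFF. It is an illustration only, not the approach used in the rest of this answer: the variable names @s, @start and @len are hypothetical, it processes a single value in a loop, and it has not been benchmarked against the nested REPLACE query.

```sql
DECLARE @s varchar(200) = 'ABC123124_A12312';
DECLARE @start integer, @len integer;

-- Position of the first digit, compared under a binary collation
SET @start = PATINDEX('%[0123456789]%', @s COLLATE Latin1_General_100_BIN2);

WHILE @start > 0
BEGIN
    -- Length of the digit run: offset of the first non-digit after @start, minus 1.
    -- The appended 'X' is a sentinel so a run at the end of the string is still terminated.
    SET @len = PATINDEX(
        '%[^0123456789]%',
        SUBSTRING(@s, @start, 200) COLLATE Latin1_General_100_BIN2 + 'X') - 1;

    -- Replace the whole run with a single asterisk
    SET @s = STUFF(@s, @start, @len, '*');

    -- Look for the next digit group
    SET @start = PATINDEX('%[0123456789]%', @s COLLATE Latin1_General_100_BIN2);
END;

SELECT @s AS Pattern; -- ABC*_A*
```

Each pass touches one group rather than rewriting the whole string ten or more times. A row-by-row loop like this would normally be wrapped in a function, which brings its own overhead.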

For example, a set-based query that applies both points:

DECLARE @T table
(
    SampleID varchar(200) NOT NULL UNIQUE
);

INSERT @T
    (SampleID)
VALUES
    ('ABC123124_A12312'),
    ('ABC123_A1212'),
    ('ABC123124_B12312'),
    ('AC123124_AD12312'),
    ('A12312_123'),
    ('999ABC888DEF');
SELECT
    T.SampleID,
    Pattern =
    (
        SELECT
            CASE
                WHEN Chars.this NOT LIKE '[0123456789]' THEN Chars.this
                WHEN Chars.prev NOT LIKE '[0123456789]' THEN '*'
                ELSE ''
            END
        FROM dbo.Numbers AS N
        OUTER APPLY
        (
            SELECT 
                SUBSTRING(Bin.string, N.n, 1),
                SUBSTRING(Bin.string, N.n + 1, 1)
        ) AS Chars (prev, this)
        WHERE
            N.n BETWEEN 1 AND LEN(Bin.string)
        ORDER BY N.n
        FOR XML PATH ('')
    )
FROM @T AS T
OUTER APPLY (VALUES('$' + T.SampleID COLLATE Latin1_General_100_BIN2)) AS Bin (string);

db<>fiddle demo

The example relies on a permanent numbers table (dbo.Numbers). If you need one, a table large enough for varchar(200) is:

-- Create a numbers table 1-200 using Itzik Ben-Gan's row generator
WITH
  L0   AS (SELECT 1 AS c UNION ALL SELECT 1),
  L1   AS (SELECT 1 AS c FROM L0 AS A CROSS JOIN L0 AS B),
  L2   AS (SELECT 1 AS c FROM L1 AS A CROSS JOIN L1 AS B),
  L3   AS (SELECT 1 AS c FROM L2 AS A CROSS JOIN L2 AS B),
  L4   AS (SELECT 1 AS c FROM L3 AS A CROSS JOIN L3 AS B),
  L5   AS (SELECT 1 AS c FROM L4 AS A CROSS JOIN L4 AS B),
  Nums AS (SELECT ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS n FROM L5)
SELECT
    -- Destination column type integer NOT NULL
    ISNULL(CONVERT(integer, N.n), 0) AS n
INTO dbo.Numbers
FROM Nums AS N
WHERE N.n >= 1
AND N.n <= 200
OPTION (MAXDOP 1);

-- Add clustered primary key
ALTER TABLE dbo.Numbers
ADD CONSTRAINT PK_Numbers_n
PRIMARY KEY CLUSTERED (n)
WITH (SORT_IN_TEMPDB = ON, MAXDOP = 1, FILLFACTOR = 100);

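If this is not faster, you may find that using a binary collation alone speeds up your existing implementation enough. To do that, change one line of your code to: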

(REPLACE(SampleID COLLATE Latin1_General_100_BIN2, '0', '*'),

Users of SQL Server 2017 or later can take advantage of the built-in TRANSLATE function, which may perform better than nested REPLACE calls.
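As a rough sketch of how that might look, TRANSLATE can map all ten digits to asterisks in one call, leaving only the three REPLACE calls that collapse runs of asterisks. This assumes SQL Server 2017 or later (it will not run on SQL Server 2014) and, like the original query, assumes the data contains no * or ~ characters. The table variable is a cut-down copy of the sample data above.

```sql
DECLARE @T table
(
    SampleID varchar(200) NOT NULL UNIQUE
);

INSERT @T
    (SampleID)
VALUES
    ('ABC123124_A12312'),
    ('ABC123_A1212'),
    ('A12312_123');

SELECT
    T.SampleID,
    Pattern =
        REPLACE(
            REPLACE(
                REPLACE(
                    -- One call maps every digit to '*'
                    TRANSLATE(
                        T.SampleID COLLATE Latin1_General_100_BIN2,
                        '0123456789',
                        REPLICATE('*', 10)),
                    '*', '~*'),  -- prefix each asterisk with a token
                '*~', ''),       -- remove the joins between adjacent asterisks
            '~*', '*')           -- turn the remaining token plus asterisk back into '*'
FROM @T AS T;
```

For the three sample rows this returns ABC*_A*, ABC*_A* and A*_*. Whether it is faster than the nested REPLACE version on a given table would need to be measured.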

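You could also use a general-purpose regular expression CLR function, or implement custom functionality for this specific task in SQLCLR. See, for example, SQL Server: Replace with wildcards?

Using the SQL# library, a complete solution is: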

SELECT 
    T.SampleID,
    SQL#.RegEx_Replace4k(T.SampleID, '\d+', '*', -1, 1, 'CultureInvariant')
FROM @T AS T;

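Full regular expression support is overkill for this task, so if you are able to use SQLCLR, writing a function specific to your needs is likely to be the best-performing solution of all.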