Microsoft SQL Server中的自然(人类字母数字)排序

Kai*_*ran 6 sql sql-server sorting natural-sort

感谢您花时间阅读所有这些,很多!感谢所有爱好者!

如何自然排序?

即.命令一组字母数字数据显示为:

Season 1, Season 2, Season 10, Season 20
Run Code Online (Sandbox Code Playgroud)

代替

Season 1, Season 10, Season 2, Season 20
Run Code Online (Sandbox Code Playgroud)

我以非常实用的格式使用一个非常实用的电视季节示例作为案例.

我希望完成以下任务:

  1. 与他人分享我的工作解决方案
  2. 请求您的帮助,以确定如何缩短它(或找到更好的解决方案)到我的解决方案
  3. 你能解决下面的标准7吗?

我花了大约2个小时在线研究,另外3个小时建立了这个解决方案.一些参考资料来自:

在SO和其他站点上找到的一些解决方案仅适用于90%的案例.但是,如果文本中有多个数值,则大多数/全部都不起作用,或者如果文本中根本没有找到数字,则会导致SQL错误.

我创建了这个SQLFiddle链接来玩(包括以下所有代码).

这是create语句:

create table tvseason
(
    title varchar(100)
);

insert into tvseason (title)
values ('100 Season 03'), ('100 Season 1'),
       ('100 Season 10'), ('100 Season 2'),
       ('100 Season 4'), ('Show Season 1 (2008)'),
       ('Show Season 2 (2008)'), ('Show Season 10 (2008)'),
       ('Another Season 01'), ('Another Season 02'),
       ('Another 1st Anniversary Season 01'),
       ('Another 2nd Anniversary Season 01'),
       ('Another 10th Anniversary Season 01'),
       ('Some Show Another No Season Number'),
       ('Some Show No Season Number'),
       ('Show 2 Season 1'),
       ('Some Show With Season Number 1'),
       ('Some Show With Season Number 2'),
       ('Some Show With Season Number 10');
Run Code Online (Sandbox Code Playgroud)

这是我的工作解决方案(只能解决下面的标准#7):

select 
    title, "index", titleLeft,
    convert(int, coalesce(nullif(titleRightTrim2, ''), titleRight)) titleRight
from
    (select 
         title, "index", titleLeft, titleRight, titleRightTrim1,
         case 
            when PATINDEX('%[^0-9]%', titleRightTrim2) = 0 
               then titleRightTrim2
               else left(titleRightTrim2, PATINDEX('%[^0-9]%', titleRightTrim2) - 1)
         end as titleRightTrim2
     from
         (select
              title, 
              len(title) - PATINDEX('%[0-9] %', reverse(title)) 'index',
              left(title, len(title) - PATINDEX('%[0-9] %', reverse(title))) titleLeft,
              ltrim(right(title, PATINDEX('%[0-9] %', reverse(title)))) titleRight,
              ltrim(right(title, PATINDEX('%[0-9] %', reverse(title)))) titleRightTrim1,
              left(ltrim(right(title, PATINDEX('%[0-9] %', reverse(title)))), PATINDEX('% %', ltrim(right(title, PATINDEX('%[0-9] %', reverse(title)))))) titleRightTrim2
          from
              tvseason) x) y
order by 
    titleLeft, titleRight
Run Code Online (Sandbox Code Playgroud)

要考虑的标准:

  1. 文字不包含数字
  2. 文本包含开头和结尾的数字
  3. 文本仅包含开头的数字
  4. 文本仅包含末尾的数字
  5. 文本最后可能包含(YYYY)
  6. 文本可以以单个数字或双位数字结尾(例如1或01)
  7. 可选:上述任意组合,加上文本中间的数字

这是输出:

title
100 Season 1
100 Season 2
100 Season 03
100 Season 4
100 Season 10
**Case 7 here**
Another 10th Anniversary Season 01
Another 1st Anniversary Season 01
Another 2nd Anniversary Season 01
Another Season 01
Another Season 02
Show (2008) Season 1
Show (2008) Season 2
Show 2 The 75th Anniversary Season 1
Show Season 1 (2008)
Show Season 2 (2008)
Show Season 10 (2008)
Some Show Another No Season Number
Some Show No Season Number
Some Show With Season Number 1
Some Show With Season Number 2
Some Show With Season Number 10
Run Code Online (Sandbox Code Playgroud)

Jas*_*ong 5

我认为这可以解决问题......我只是识别从非数字到数字的变化。我没有做过任何大规模的测试,但它应该相当快。

SET QUOTED_IDENTIFIER ON;
GO
SET ANSI_NULLS ON;
GO

ALTER FUNCTION dbo.tfn_SplitForSort
/* ===================================================================
11/11/2018 JL, Created: Comments    
=================================================================== */
--===== Define I/O parameters
(
    @string VARCHAR(8000)
)
RETURNS TABLE WITH SCHEMABINDING AS
RETURN 
    WITH 
        cte_n1 (n) AS (SELECT 1 FROM (VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) n (n)), 
        cte_n2 (n) AS (SELECT 1 FROM cte_n1 a CROSS JOIN cte_n1 b),
        cte_Tally (n) AS (
            SELECT TOP (LEN(@string))
                ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
            FROM
                cte_n2 a CROSS JOIN cte_n2 b
            ),
        cte_split_string AS (
            SELECT 
                col_num = ROW_NUMBER() OVER (ORDER BY t.n) + CASE WHEN LEFT(@string, 1) LIKE '[0-9]' THEN 0 ELSE 1 END,
                string_part = SUBSTRING(@string, t.n, LEAD(t.n, 1, 8000) OVER (ORDER BY t.n) - t.n)
            FROM
                cte_Tally t
                CROSS APPLY ( VALUES (SUBSTRING(@string, t.n, 2)) ) s (str2)
            WHERE 
                t.n = 1
                OR SUBSTRING(@string, t.n - 1, 2) LIKE '[0-9][^0-9]'
                OR SUBSTRING(@string, t.n - 1, 2) LIKE '[^0-9][0-9]'
            )

    SELECT 
        so_01 = ISNULL(MAX(CASE WHEN ss.col_num = 1 THEN CONVERT(FLOAT, ss.string_part) END), 99999999),
        so_02 = MAX(CASE WHEN ss.col_num = 2 THEN ss.string_part END),
        so_03 = MAX(CASE WHEN ss.col_num = 3 THEN CONVERT(FLOAT, ss.string_part) END),
        so_04 = MAX(CASE WHEN ss.col_num = 4 THEN ss.string_part END),
        so_05 = MAX(CASE WHEN ss.col_num = 5 THEN CONVERT(FLOAT, ss.string_part) END),
        so_06 = MAX(CASE WHEN ss.col_num = 6 THEN ss.string_part END),
        so_07 = MAX(CASE WHEN ss.col_num = 7 THEN CONVERT(FLOAT, ss.string_part) END),
        so_08 = MAX(CASE WHEN ss.col_num = 8 THEN ss.string_part END),
        so_09 = MAX(CASE WHEN ss.col_num = 9 THEN CONVERT(FLOAT, ss.string_part) END),
        so_10 = MAX(CASE WHEN ss.col_num = 10 THEN ss.string_part END)
    FROM
        cte_split_string ss;
GO
Run Code Online (Sandbox Code Playgroud)

正在使用的函数...

SELECT 
    ts.*
FROM
    #tvseason ts
    CROSS APPLY dbo.tfn_SplitForSort (ts.title) sfs
ORDER BY
    sfs.so_01,
    sfs.so_02,
    sfs.so_03,
    sfs.so_04,
    sfs.so_05,
    sfs.so_06,
    sfs.so_07,
    sfs.so_08,
    sfs.so_09,
    sfs.so_10;
Run Code Online (Sandbox Code Playgroud)

结果:

id          title
----------- ------------------------------------------
2           100 Season 1
4           100 Season 2
1           100 Season 03
5           100 Season 4
3           100 Season 10
11          Another 1st Anniversary Season 01
12          Another 2nd Anniversary Season 01
13          Another 10th Anniversary Season 01
9           Another Season 01
10          Another Season 02
16          Show 2 Season 1
6           Show Season 1 (2008)
7           Show Season 2 (2008)
8           Show Season 10 (2008)
14          Some Show Another No Season Number
15          Some Show No Season Number
17          Some Show With Season Number 1
18          Some Show With Season Number 2
19          Some Show With Season Number 10
Run Code Online (Sandbox Code Playgroud)

--================================================ ======================

[编辑 2020-09-23] 我正在回顾我的一些旧帖子,当我遇到这个帖子时,想看看我是否可以开始使用单值输出。将 10 列添加到 ORDER BY 只是笨拙的......经过一番思考后,我想到将 FLOAT 转换为 BINARY 并将 BINARY 转换回 VARCHAR,我可以使用 STRING_AGG() 函数重新组装字符串。最终结果将是产生所需排序的字符串。

CREATE FUNCTION dbo.human_sort_string
/* ===================================================================
09/23/2020 JL, Created: Just a test 
=================================================================== */
--===== Define I/O parameters
(
    @string varchar(8000)
)
RETURNS TABLE WITH SCHEMABINDING AS
RETURN
    WITH 
        cte_n1 (n) AS (SELECT 1 FROM (VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) n (n)),   -- 10
        cte_n2 (n) AS (SELECT 1 FROM cte_n1 a CROSS JOIN cte_n1 b),                             -- 100
        cte_Tally (n) AS (
            SELECT TOP (LEN(@string))
                ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
            FROM
                cte_n2 a CROSS JOIN cte_n2 b                                                    -- 10,000
            ),
        cte_Parsed AS (
            SELECT 
                t.n,
                parsed_val = SUBSTRING(@string, ISNULL(NULLIF(t.n, 1), 0) + 1, LEAD(t.n, 1, 8000) OVER (ORDER BY t.n) - ISNULL(NULLIF(t.n, 1), 0))
            FROM 
                cte_Tally t
                CROSS APPLY ( VALUES (SUBSTRING(@string, t.n, 2)) ) sv (sub_val)
            WHERE 
                t.n = 1
                OR 
                sv.sub_val LIKE '[0-9][^0-9]'
                OR 
                sv.sub_val LIKE '[^0-9][0-9]'
            )
    SELECT 
        sort_string = STRING_AGG(ISNULL(CONVERT(varchar(8000), CONVERT(binary(8), TRY_CONVERT(float, p.parsed_val)), 2), p.parsed_val), '') WITHIN GROUP (ORDER BY p.n)
    FROM
        cte_Parsed p;
GO
Run Code Online (Sandbox Code Playgroud)

现在,外部查询看起来像这样......

SELECT 
    ts.id,
    td.title
FROM
    #tvseason ts
    CROSS APPLY dbo.human_sort_string(ts.title) hss
ORDER BY
    hss.sort_string;
Run Code Online (Sandbox Code Playgroud)

实际结果与之前的函数相同。


Xed*_*dni 2

就我个人而言,我会尽量避免在 SQL 中进行复杂的字符串操作。我可能会将其转储到文本文件中,并使用 C# 或 Python 等正则表达式对其进行处理。然后将其写回数据库中的单独列中。众所周知,SQL 不擅长字符串操作。

然而,这是我对 SQL 方法的尝试。Season [number]这个想法基本上是首先消除其中没有字符串的任何行。这处理了没有季节可供解析的情况。我选择将它们包含在 null 中,但是您可以轻松地在 where 子句中省略它们,或者为它们提供一些默认值。我使用该stuff()函数来截断字符串之前的所有内容Season [number],因此更容易使用。

现在我们的字符串以季号开头,并可能以一些垃圾结尾。我使用 case 语句来查看是否有垃圾(任何非数字的内容),如果有,我会取出最左边的数字字符并丢弃其余的。如果一开始只有数字,我就保持原样。

最后,将其转换为 int,并按它排序。

if object_id('tempdb.dbo.#titles') is not null drop table #titles
create table #titles (Title varchar(100))
insert into #titles (TItle)
select title = '100 Season 1'
union all select '100 Season 2'
union all select '100 Season 03'
union all select '100 Season 4'
union all select '100 Season 10'
union all select 'Another 10th Anniversary Season 01'
union all select 'Another 1st Anniversary Season 01'
union all select 'Another 2nd Anniversary Season 01'
union all select 'Another Season 01'
union all select 'Another Season 02'
union all select 'Show (2008) Season 1'
union all select 'Show (2008) Season 2'
union all select 'Show 2 The 75th Anniversary Season 1'
union all select 'Show Season 1 (2008)'
union all select 'Show Season 2 (2008)'
union all select 'Show Season 10 (2008)'
union all select 'Some Show Another No Season Number'
union all select 'Some Show No Season Number'
union all select 'Some Show With Season Number 1'
union all select 'Some Show With Season Number 2'
union all select 'Some Show With Season Number 10'

;with src as
(
    select 
        Title, 
        Trimmed = case when Title like '%Season [0-9]%' 
                       then stuff(title, 1, patindex('%season [0-9]%', title) + 6, '')
                       else null
                  end
    from #titles
)
select 
    Season = cast(case when Trimmed like '%[^0-9]%' then left(Trimmed, patindex('%[^0-9]%', Trimmed))
         else Trimmed
    end as int),
    Title
from src
order by Season 
Run Code Online (Sandbox Code Playgroud)