Performance gap between WHERE IN (1,2,3,4) and IN (select * from STRING_SPLIT('1,2,3,4',','))

rae*_*dor 3 performance sql-server execution-plan string-splitting query-performance

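I seem to have a huge performance gap between a SELECT that uses IN with hard-coded values and one that uses STRING_SPLIT. The query plans are identical, except that in the final stage the STRING_SPLIT version performs the index seek multiple times. The result is roughly 90000 versus roughly 15000 of CPU time (according to dm_exec_query_stats), so the difference is huge. I have posted both plans here...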

  1. Hard-coded plan
  2. String split plan

Interestingly, the query plans show almost identical costs, but when I check the cost in dm_exec_query_stats (last_worker_time), the two are very different.

Here are the two outputs for the query plans...

0x79DEAD79D1F149CD 16199 
select * 
from fn_get_samples(1) s 
where s.sample_id in 
    (2495,2496,2497,2498,2499,2500,2501,2502,2503,2504)

0x4A073840486B252C 86689 
select * 
from fn_get_samples(1) s 
where s.sample_id in 
    (select value as id 
     from 
     STRING_SPLIT('2495,2496,2497,2498,2499,2500,2501,2502,2503,2504',','))

The function code is...

CREATE FUNCTION [dbo].[fn_get_samples]
(
    @user_id int
)
RETURNS TABLE
AS
RETURN (
    -- get samples
    select s.sample_id,language_id,native_language_id,s.source_sentence,s.markup_sentence,s.latin_sentence,
        s.translation_source_sentence,s.translation_markup_sentence,s.translation_latin_sentence,
        isnull(sample_vkl.knowledge_level_id,1) as vocab_knowledge_level_id,
        isnull(sample_gkl.grammar_knowledge_level_id,0) as grammar_knowledge_level_id,
        s.polite_level_id,
        case when isnull(tr1.leitner_deck_index,0)=0 then 0 else cast((tr1.leitner_deck_index-1)  as float)/cast((max_leitner_deck_index-1) as float) end as progress_percentage,
        case when isnull(tr2.leitner_deck_index,0)=0 then 0 else cast((tr2.leitner_deck_index-1)  as float)/cast((max_leitner_deck_index-1) as float) end as listening_progress_percentage,
        case when f.object_id is null then 0 else 1 end as is_favorite,
        case when st.object_id is null then 0 else 1 end as is_studied,
        s.has_error,
        s.is_deleted,
        f.create_datetime as favorite_datetime,
        st.create_datetime as studied_datetime,
        s.create_user_id,
        s.create_datetime,
        isnull(s.modify_user_id,s.create_user_id) as modify_user_id,
        isnull(s.modify_datetime,s.create_datetime) as modify_datetime,
        s.display_order
        from samples s
        left outer join sample_knowledge_level_votes klv on klv.sample_id=s.sample_id and klv.user_id=@user_id
        left outer join favorites f on f.user_id=@user_id and f.object_type_id=(select object_type_id from object_types ot where ot.object_type_name='Pattern Sample') and f.object_id=s.sample_id
        left outer join studied st on st.user_id=@user_id and st.object_type_id=(select object_type_id from object_types ot where ot.object_type_name='Pattern Sample') and st.object_id=s.sample_id
        left outer join leitner_tracking tr1 on tr1.user_id=@user_id and tr1.object_type_id=(select object_type_id from object_types ot where ot.object_type_name='Pattern Sample') and tr1.object_id=s.sample_id and tr1.skill_type_id=(select skill_type_id from skill_types where skill_type_name=N'Guess Pronunciation from Meaning')
        left outer join leitner_tracking tr2 on tr2.user_id=@user_id and tr2.object_type_id=(select object_type_id from object_types ot where ot.object_type_name='Pattern Sample') and tr2.object_id=s.sample_id and tr2.skill_type_id=(select skill_type_id from skill_types where skill_type_name=N'Guess Meaning from Pronunciation')
        cross join (select max(leitner_deck_index) as max_leitner_deck_index from leitner_decks) dm
        left outer join vw_sample_user_grammar_kl sample_gkl on sample_gkl.user_id=@user_id and sample_gkl.sample_id=s.sample_id
        left outer join vw_sample_avg_kl sample_vkl on sample_vkl.sample_id=s.sample_id
        where is_deleted=0
)

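This seems to be related to the join to 'vw_sample_avg_kl'. If I comment out that join and the computed column 'vocab_knowledge_level_id', the two query times become very similar. As soon as I add it back, they differ widely again. Here is the code for that view...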

CREATE VIEW [dbo].[vw_sample_avg_kl]
    AS 
    select sample_id,knowledge_level_id from (
        select sample_id,knowledge_level_id,count(*) as frequency,RANK() over (partition by sample_id order by count(*) desc,knowledge_level_id) as myrank
        from sample_knowledge_level_votes
        group by sample_id,knowledge_level_id
    ) sample_kl_ranking
    where myrank=1

The id field is an INT. My two queries look like this when shown in dm_exec_query_stats...

0x4A073840486B252C  41096   select * from fn_get_samples(@user_id) s    where s.sample_id in (select * from STRING_SPLIT(@sample_id_list,','))

0x79DEAD79D1F149CD  7849    select * from fn_get_samples(1) s where s.sample_id in (2495,2496,2497,2498,2499,2500,2501,2502,2503,2504)

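(This is a slightly different sample data set from the one above, so the timings differ slightly, but you can still see the large gap in performance.)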
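
For readers who want to reproduce this kind of comparison: the post does not show the query used to read dm_exec_query_stats, so the following is only a sketch of one way to pull similar figures. It assumes the first column shown above is query_hash and filters on the function name in the statement text; both are assumptions, not details from the post.

```sql
-- Sketch only: lists recent cached statements that reference fn_get_samples,
-- with the CPU time of their last execution (last_worker_time, in microseconds).
-- Assumption: the hex value shown in the post is query_hash.
SELECT
    qs.query_hash,
    qs.last_worker_time,
    st.text
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
WHERE st.text LIKE N'%fn_get_samples%'
  AND st.text NOT LIKE N'%dm_exec_query_stats%'  -- exclude this query itself
ORDER BY qs.last_execution_time DESC;
```

Reading these views requires the VIEW SERVER STATE permission (or its equivalent on the edition in use).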

Pau*_*ite 5

The biggest differences between a hard-coded IN list and values generated by STRING_SPLIT are:

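  1. The optimizer automatically sorts a literal IN list and removes duplicates from it at compile time. This gives the optimizer accurate information about the number and distribution of values.
  2. Literal values can be embedded in a data-access operator (for example, an index seek). The results of a string split can be used to drive data access, but doing so requires a driving nested loops operator.
  3. The return type of the STRING_SPLIT function's value column is varchar or nvarchar, depending on the input argument. The length of the string value column is the same as that of the input string. Literal IN values are converted to a compatible type at compile time if necessary.
  4. The number of rows returned by the STRING_SPLIT function is a guess, and the distribution of the values is unknown.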
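
Point 3 can be checked directly. The sketch below asks SQL Server to describe the result set of a STRING_SPLIT call using sys.dm_exec_describe_first_result_set; the two-value input string is just an illustrative sample, not data from the question.

```sql
-- Sketch: report the declared type of the STRING_SPLIT value column.
-- The input here is a varchar literal; passing an N'...' literal instead
-- should report an nvarchar column.
SELECT
    r.name,
    r.system_type_name
FROM sys.dm_exec_describe_first_result_set(
         N'SELECT value FROM STRING_SPLIT(''2495,2496'', '','')',
         NULL,   -- no parameters
         0       -- no browse information
     ) AS r;
```

Because the column is a string, comparing it with the int sample_id column needs a type conversion, which a literal IN list resolves once at compile time.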

In short, using a literal IN list gives the optimizer better information.
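
When the list has to arrive as a delimited string, one common way to hand the optimizer some of that information back is to materialize the split values into a typed temporary table first. This is a minimal sketch rather than something proposed in the answer above: the #sample_ids table and the varchar(8000) declaration are assumptions made for illustration, and whether it closes the gap for this particular function would need to be measured.

```sql
-- Assumption: @sample_id_list holds a comma-separated list of integer ids.
DECLARE @sample_id_list varchar(8000) =
    '2495,2496,2497,2498,2499,2500,2501,2502,2503,2504';

-- Hypothetical temp table: int-typed, unique, and eligible for statistics,
-- so the row count and value distribution are known when the main query compiles.
CREATE TABLE #sample_ids (sample_id int NOT NULL PRIMARY KEY);

INSERT INTO #sample_ids (sample_id)
SELECT DISTINCT TRY_CAST(ss.value AS int)      -- convert once, remove duplicates
FROM STRING_SPLIT(@sample_id_list, ',') AS ss
WHERE TRY_CAST(ss.value AS int) IS NOT NULL;   -- skip anything that is not an int

SELECT s.*
FROM dbo.fn_get_samples(1) AS s
WHERE s.sample_id IN (SELECT i.sample_id FROM #sample_ids AS i);

DROP TABLE #sample_ids;
```

The primary key addresses the duplicate-removal and typing points, and the temp table's statistics address the row-count guess. The values still cannot be embedded in the seek operator the way literals can, so the plan may still use a nested loops join.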