Trying to optimize a query that selects the "approximate closest record"

And*_*ker 5 t-sql sql-server sql-server-2005

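I have a table with a large amount of data, and the column we care about most is the date field. The data volume has grown roughly 30-fold, and the old approach will soon fall apart. I'm hoping you can help me optimize a query that needs to: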

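  • Take a list of dates (generated by a CTE-based table-valued function)
  • Retrieve a single record for each date
    • based on some definition of "closest"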

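For example, the current table contains data at 5-second intervals (give or take a little). I need to sample that table and get the record closest to each 30-second interval.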

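What I have right now works just fine. I'm simply curious whether there is a way to optimize it further. If I could do it in Linq To SQL, that would be good too. Given the number of date values (about 2 million rows at a minimum), I'm also interested in suggestions on indexes.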

declare @st  datetime ; set @st  = '2012-01-31 05:05:00';
declare @end datetime ; set @end = '2012-01-31 05:10:00';

select distinct
    log.*   -- id, 
from 
    dbo.fn_GenerateDateSteps(@st, @end, 30) as d
        inner join lotsOfLogData log on log.Id = (
            select top 1 e.[Id]
            from 
                lotsOfLogData as e  -- contains data in 5 second intervals
            where
                e.stationId = 1000 
                -- search for dates in a certain range
                AND e.utcTime between DateAdd(s, -10, d.dt) AND DateAdd(s, 5, d.dt)
            order by
                -- get the 'closest'. this can change a little, but will always 
                -- be based on a difference between the dates
                abs(datediff(s, d.dt, e.UtcTime)) 
        )
    -- updated the query to be correct. stationId should be inside the subquery

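The table structure of lotsOfLogData is below. There are relatively few station IDs (perhaps 50), but each station has a lot of records. We know the station ID at query time.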

create table ##lotsOfLogData (
    Id          bigint      identity(1,1) not null
,   StationId   int         not null
,   UtcTime     datetime    not null
    -- 20 other fields, used for other calculations
)

For the parameters given above, fn_GenerateDateSteps returns a data set like this:

[DT]
2012-01-31 05:05:00.000
2012-01-31 05:05:30.000
2012-01-31 05:06:00.000
2012-01-31 05:06:30.000  (and so on, every 30 seconds)
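The body of fn_GenerateDateSteps is not shown in the question. For readers who want to reproduce the setup, here is a minimal sketch of what a CTE-based step generator could look like. The name fn_GenerateDateSteps_Sketch, the parameter names, the 10,000-step cap, and the choice to include the end point are all assumptions for illustration, not the author's actual function (which may well use a recursive CTE instead).

```sql
-- Hypothetical stand-in for dbo.fn_GenerateDateSteps (the real body is not shown).
-- Inline table-valued function built on non-recursive CTEs, so no MAXRECURSION hint is needed.
CREATE FUNCTION dbo.fn_GenerateDateSteps_Sketch
(
    @st          datetime,
    @end         datetime,
    @stepSeconds int
)
RETURNS TABLE
AS
RETURN
    WITH e1(n) AS (   -- 10 rows
        SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
        UNION ALL
        SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
    ),
    e2(n) AS (SELECT 1 FROM e1 AS a CROSS JOIN e1 AS b),   -- 100 rows
    e4(n) AS (SELECT 1 FROM e2 AS a CROSS JOIN e2 AS b),   -- 10,000 rows (assumed cap)
    nums(i) AS (
        -- zero-based step number
        SELECT ROW_NUMBER() OVER (ORDER BY n) - 1 FROM e4
    )
    SELECT DATEADD(s, i * @stepSeconds, @st) AS dt
    FROM nums
    -- assumption: the end point itself is included when it falls on a step
    WHERE i <= DATEDIFF(s, @st, @end) / @stepSeconds;
```

Calling `select dt from dbo.fn_GenerateDateSteps_Sketch(@st, @end, 30)` with the parameters from the question should yield one row per 30-second step between the two timestamps.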

I also tried it with a temporary table, along these lines, but it came out only slightly more expensive.

declare @dates table ( dt datetime, ClosestId bigint); 
insert into @dates (dt) select dt from dbo.fn_GenerateDateSteps(@st, @end, 30)
update @dates set closestId = (
    -- same subquery as above
)
select * from lotsOfLogData inner join @dates on Id = ClosestId

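Edit: Resolved

I now have 200K+ rows to work with. I tried both approaches, and cross apply with an appropriate index (id/time + includes(..all columns...)) works fine. However, I ended up with the query I started with, using a simpler (and already existing) index on [id + time]. I chose that one because the query is easier to understand. Maybe there is a better way to do it, but I can't see it :D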

-- subtree cost (crossapply) : .0808
-- subtree cost (id based)   : .0797

-- see above query for what i ended up with
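The two indexes mentioned in the edit are not spelled out. A sketch of what they might look like follows, assuming that "id" refers to StationId, that the table lives in the dbo schema, and using made-up index names.

```sql
-- Assumption: "id" in the edit means StationId; index names are hypothetical.

-- The simpler index on [id + time]: seek on the station, then range-scan the time window.
CREATE NONCLUSTERED INDEX IX_lotsOfLogData_StationId_UtcTime
    ON dbo.lotsOfLogData (StationId, UtcTime);

-- The covering variant: same key, plus the returned columns carried at the leaf level
-- so the query does not need a lookup into the base table.
CREATE NONCLUSTERED INDEX IX_lotsOfLogData_StationId_UtcTime_Covering
    ON dbo.lotsOfLogData (StationId, UtcTime)
    INCLUDE (Id /* , plus whichever other columns the query selects */);
```

The covering variant trades extra storage and write cost for fewer lookups, which may explain why the two plans ended up with nearly identical subtree costs on a small window.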

Lie*_*ers 1

You could try

  • changing the inner join to a cross apply.
  • moving the where log.stationid into the subselect.

SQL Statement

SELECT  DISTINCT log.*   -- id, 
FROM    dbo.fn_GenerateDateSteps(@st, @end, 30) AS d
        CROSS APPLY (
            SELECT  TOP 1 log.*
            FROM    lotsOfLogData AS log  -- contains data in 5 second intervals
            WHERE   -- search for dates in a certain range
                    utcTime between DATEADD(s, -10, d.dt) AND DATEADD(s, 5, d.dt)
                    AND log.stationid = 1000
            ORDER BY
                    -- get the 'closest'. this can change a little, but will always 
                    -- be based on a difference between the date
                    ABS(DATEDIFF(s, d.dt, UtcTime)) 
        ) log
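To see how the "closest within a window" rule behaves, the same cross apply pattern can be run against a handful of rows. This is only an illustrative sketch: the table variables @log and @steps and their sample values are made up, standing in for lotsOfLogData and the output of fn_GenerateDateSteps.

```sql
-- Hypothetical sample data standing in for lotsOfLogData.
DECLARE @log TABLE (
    Id        bigint identity(1,1) NOT NULL,
    StationId int      NOT NULL,
    UtcTime   datetime NOT NULL
);
INSERT INTO @log (StationId, UtcTime)
SELECT 1000, '2012-01-31 05:04:56' UNION ALL
SELECT 1000, '2012-01-31 05:05:01' UNION ALL
SELECT 1000, '2012-01-31 05:05:06' UNION ALL
SELECT 1000, '2012-01-31 05:05:26' UNION ALL
SELECT 1000, '2012-01-31 05:05:32';

-- Hypothetical stand-in for the function output: two 30-second steps.
DECLARE @steps TABLE (dt datetime NOT NULL);
INSERT INTO @steps (dt)
SELECT '2012-01-31 05:05:00' UNION ALL
SELECT '2012-01-31 05:05:30';

SELECT d.dt, c.Id, c.UtcTime
FROM @steps AS d
    CROSS APPLY (
        SELECT TOP 1 e.Id, e.UtcTime
        FROM @log AS e
        WHERE e.StationId = 1000
          -- only rows from 10 seconds before to 5 seconds after the step qualify
          AND e.UtcTime BETWEEN DATEADD(s, -10, d.dt) AND DATEADD(s, 5, d.dt)
        -- among those, take the row with the smallest absolute distance in seconds
        ORDER BY ABS(DATEDIFF(s, d.dt, e.UtcTime))
    ) AS c;
```

With this data the 05:05:00 step should pair with the 05:05:01 row and the 05:05:30 step with the 05:05:32 row. Because cross apply drops steps that have no row inside the window, a step with no nearby data produces no output row at all.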