为什么使用序列比在此示例中使用列表慢得多

Question

为什么使用序列比在此示例中使用列表慢得多

Tre*_*rog 20 performance f# yield lazy-evaluation seq.unfold

背景:我有一系列连续的,带时间戳的数据.数据序列中有孔,有些大,有些只有一个缺失值.
每当孔只是一个缺失值时,我想使用虚拟值修补孔(较大的孔将被忽略).

我想使用延迟生成修补序列,因此我使用Seq.unfold.

我已经制作了两个版本的方法来修补数据中的漏洞.

第一个消耗带有孔的数据序列并产生修补序列.这就是我想要的,但是当输入序列中的元素数量超过1000时,方法运行得非常慢,并且输入序列包含的元素越多,它就会越来越差.

第二种方法使用带有孔的数据列表并生成修补序列并且运行速度很快.然而,这不是我想要的,因为这会强制整个输入列表在内存中的实例化.

我想使用(sequence - > sequence)方法而不是(list - > sequence)方法,以避免在内存中同时存在整个输入列表.

问题:

1)为什么第一种方法如此缓慢(使用较大的输入列表逐渐变得更糟)(我怀疑它与使用Seq.skip 1重复创建新序列有关,但我不确定)

2)如何使用输入序列而不是输入列表快速修补数据中的空洞？

代码:

open System

// Method 1 (Slow)
let insertDummyValuesWhereASingleValueIsMissing1 (timeBetweenContiguousValues : TimeSpan) (values : seq<(DateTime * float)>) =
    let sizeOfHolesToPatch = timeBetweenContiguousValues.Add timeBetweenContiguousValues // Only insert dummy-values when the gap is twice the normal
    (None, values) |> Seq.unfold (fun (prevValue, restOfValues) ->  
        if restOfValues |> Seq.isEmpty then
            None // Reached the end of the input seq
        else
            let currentValue = Seq.hd restOfValues
            if prevValue.IsNone then
                Some(currentValue, (Some(currentValue), Seq.skip 1 restOfValues  )) // Only happens to the first item in the seq
            else
                let currentTime = fst currentValue
                let prevTime = fst prevValue.Value
                let timeDiffBetweenPrevAndCurrentValue = currentTime.Subtract(prevTime)
                if timeDiffBetweenPrevAndCurrentValue = sizeOfHolesToPatch then
                    let dummyValue = (prevTime.Add timeBetweenContiguousValues, 42.0) // 42 is chosen here for obvious reasons, making this comment superfluous
                    Some(dummyValue, (Some(dummyValue), restOfValues))
                else
                    Some(currentValue, (Some(currentValue), Seq.skip 1 restOfValues))) // Either the two values were contiguous, or the gap between them was too large to patch

// Method 2 (Fast)
let insertDummyValuesWhereASingleValueIsMissing2 (timeBetweenContiguousValues : TimeSpan) (values : (DateTime * float) list) =
    let sizeOfHolesToPatch = timeBetweenContiguousValues.Add timeBetweenContiguousValues // Only insert dummy-values when the gap is twice the normal
    (None, values) |> Seq.unfold (fun (prevValue, restOfValues) ->  
        match restOfValues with
        | [] -> None // Reached the end of the input list
        | currentValue::restOfValues -> 
            if prevValue.IsNone then
                Some(currentValue, (Some(currentValue), restOfValues  )) // Only happens to the first item in the list
            else
                let currentTime = fst currentValue
                let prevTime = fst prevValue.Value
                let timeDiffBetweenPrevAndCurrentValue = currentTime.Subtract(prevTime)
                if timeDiffBetweenPrevAndCurrentValue = sizeOfHolesToPatch then
                    let dummyValue = (prevTime.Add timeBetweenContiguousValues, 42.0) 
                    Some(dummyValue, (Some(dummyValue), currentValue::restOfValues))
                else
                    Some(currentValue, (Some(currentValue), restOfValues))) // Either the two values were contiguous, or the gap between them was too large to patch

// Test data
let numbers = {1.0..10000.0}
let contiguousTimeStamps = seq { for n in numbers -> DateTime.Now.AddMinutes(n)}

let dataWithOccationalHoles = Seq.zip contiguousTimeStamps numbers |> Seq.filter (fun (dateTime, num) -> num % 77.0 <> 0.0) // Has a gap in the data every 77 items

let timeBetweenContiguousValues = (new TimeSpan(0,1,0))

// The fast sequence-patching (method 2)
dataWithOccationalHoles |> List.of_seq |> insertDummyValuesWhereASingleValueIsMissing2 timeBetweenContiguousValues |> Seq.iter (fun pair -> printfn "%f %s" (snd pair) ((fst pair).ToString()))

// The SLOOOOOOW sequence-patching (method 1)
dataWithOccationalHoles |> insertDummyValuesWhereASingleValueIsMissing1 timeBetweenContiguousValues |> Seq.iter (fun pair -> printfn "%f %s" (snd pair) ((fst pair).ToString()))

Run Code Online (Sandbox Code Playgroud)

Answer 1

Bri*_*ian 33

任何时候你拆分seq使用Seq.hd,Seq.skip 1你几乎肯定陷入了O(N ^ 2)的陷阱. IEnumerable<T>对于递归算法(包括例如Seq.unfold)来说是一种糟糕的类型,因为这些算法几乎总是具有"第一个元素"和"元素的剩余部分"的结构,并且没有有效的方法来创建IEnumerable表示"元素的剩余部分"的新元素.(IEnumerator<T>是可行的,但它的API编程模型不那么有趣/易于使用.)

如果您需要原始数据"保持懒惰",那么您应该使用LazyList(在F#PowerPack中).如果你不需要懒惰,那么你应该使用像'list'这样的具体数据类型,你可以在O(1)中'尾随'.

(您还应该检查避免堆栈溢出(使用F#无限序列序列)作为FYI,尽管它只是切向适用于此问题.)

好吧,构建它实际上很快:O(1).但是然后评估它的第一个元素意味着为原始序列创建一个枚举器,计算它的第一个值,扔掉它,计算下一个值,然后产生它.因此,两个"Seq.skip 1"产生一个seq,当评估时将:在内部创建枚举器,它创建枚举器超过orig,计算第一个值,抛弃,产生下一个值到内部,然后内部扔掉,计算一个更多的价值和收益.每个嵌套的Seq.skip 1都会增加更多的工作量,从而产生O(N)时间和空间. (14认同)

Answer 2

Jas*_*son 15

Seq.skip构造一个新序列.我认为这就是你原来的方法很慢的原因.

我的第一个倾向是使用序列表达式和Seq.pairwise.这是快速且易于阅读的.

let insertDummyValuesWhereASingleValueIsMissingSeq (timeBetweenContiguousValues : TimeSpan) (values : seq<(DateTime * float)>) =
  let sizeOfHolesToPatch = timeBetweenContiguousValues.Add timeBetweenContiguousValues // Only insert dummy-values when the gap is twice the normal
  seq {
    yield Seq.hd values
    for ((prevTime, _), ((currentTime, _) as next)) in Seq.pairwise values do
      let timeDiffBetweenPrevAndCurrentValue = currentTime.Subtract(prevTime)
      if timeDiffBetweenPrevAndCurrentValue = sizeOfHolesToPatch then
        let dummyValue = (prevTime.Add timeBetweenContiguousValues, 42.0) // 42 is chosen here for obvious reasons, making this comment superfluous
        yield dummyValue
      yield next
  }

Run Code Online (Sandbox Code Playgroud)

+1:当我学习F#时,我通过消除所有命令性构造进入了函数式编程槽.我看到我的代码的可读性使用Seq.unfold而不是无限简单的"循环和参考"方法. (4认同)

归档时间：	16 年，6 月前
查看次数：	4017 次
最近记录：	8 年，5 月前