FSharp.Data CsvProvider performance

kli*_*ron 7 mono f# f#-data

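I have a CSV file with 6 columns and 678,552 rows. Unfortunately I cannot share a data sample, but the types are straightforward: int64, int64, date, date, string, string, and there are no missing values.

Time to load this data into a data frame in R using read.table: ~3 seconds.

Time to load this data using CsvFile.Load in F#: ~3 seconds.

Time to load this data into a Deedle data frame in F#: ~7 seconds.

Adding inferTypes=false and providing a schema to Deedle's Frame.ReadCsv reduces the time to ~3 seconds.

Time to load this data using CsvProvider in F#: ~5 minutes.

And these 5 minutes are measured even after I define the types in the Schema parameter, which should eliminate the time F# spends inferring them.

I understand that the type provider needs to do more than R or CsvFile.Load in order to parse the data into the correct data types, but I am surprised by the 100x speed penalty. Even more confusing is the time Deedle takes to load the data, since it also needs to infer types and convert appropriately, organize the data into series, and so on. I actually expected Deedle to take longer than CsvProvider.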

In another question, the poor performance of CsvProvider was caused by a large number of columns, which is not my case.

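I am wondering whether I am doing something wrong or whether there is any way to speed things up.

Just to clarify: creating the provider is almost instantaneous. It is when I force the generated sequence to be realized with Seq.length df.Rows that the fsharpi prompt takes ~5 minutes to return.

I am on a Linux system, running F# v4.1 on Mono v4.6.1.

Here is the code for the CsvProvider: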

let [<Literal>] SEP = "|"
let [<Literal>] CULTURE = "sv-SE"
let [<Literal>] DATAFILE = dataroot + "all_diagnoses.csv"

type DiagnosesProvider = CsvProvider<DATAFILE, Separators=SEP, Culture=CULTURE>
let diagnoses = DiagnosesProvider()

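EDIT1: I added the time Deedle takes to load the data into a frame.

EDIT2: Added the time Deedle takes with inferTypes=false and a schema provided.

Also, supplying CacheRows=false to CsvProvider, as suggested in the comments, has no perceptible effect on the load time.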

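EDIT3: OK, we are getting somewhere. For some peculiar reason, Culture seems to be the culprit. If I omit this argument, CsvProvider loads the data in ~7 seconds. I am not sure what could be causing this. My system's locale is en_US. However, the data comes from a SQL Server in a Swedish locale, where the decimal separator is "," rather than ".". This particular data set has no decimals, so I can omit Culture altogether. Another set, however, has 2 decimal columns and more than 1,000,000 rows. My next task is to test this on a Windows system, which I do not have available at the moment.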
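To illustrate why the culture matters for the decimal columns mentioned above, here is a minimal sketch using only the .NET base class library. The helper name tryParseDecimal is made up for this illustration, and the choice of NumberStyles.Float is an assumption made here to keep group separators out of the picture; it is not necessarily the style FSharp.Data uses internally.

```fsharp
open System
open System.Globalization

// Hypothetical helper for illustration only.
// NumberStyles.Float allows a decimal separator but no group (thousands) separators.
let tryParseDecimal (culture: CultureInfo) (text: string) =
  match Decimal.TryParse(text, NumberStyles.Float, culture) with
  | true, v  -> Some v
  | false, _ -> None

let swedish = CultureInfo "sv-SE"

// Same text, parsed under the Swedish and the invariant culture.
printfn "%A" (tryParseDecimal swedish "3,14")
printfn "%A" (tryParseDecimal CultureInfo.InvariantCulture "3,14")
printfn "%A" (tryParseDecimal CultureInfo.InvariantCulture "3.14")
```

Under these assumptions, the Swedish culture should accept "3,14" as a decimal, while the invariant culture should reject it and accept "3.14" instead.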

EDIT4: The problem seems to be solved, but I still do not understand what causes it. If I change the culture "globally" by doing:

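// Assignment in F# uses <- (a bare = is a comparison).
System.Globalization.CultureInfo.DefaultThreadCurrentCulture <- System.Globalization.CultureInfo("sv-SE")
System.Threading.Thread.CurrentThread.CurrentCulture <- System.Globalization.CultureInfo("sv-SE")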

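and then remove the Culture="sv-SE" argument from the CsvProvider, the load time drops to ~6 seconds and the decimals are parsed correctly. I am leaving this open in case anyone can explain this behaviour.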

Jus*_*mer 6

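I tried to reproduce the problem you are seeing. Since you cannot share the data, I generated some test data instead. However, on my machine (.NET 4.6.2, F# 4.1) loading does not take minutes; it takes seconds.

Perhaps you could see how my sample application performs in your setup, and we can work from there?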

open System
open System.Diagnostics
open System.IO

let clock =
  let sw = Stopwatch ()
  sw.Start ()
  fun () ->
    sw.ElapsedMilliseconds

let time a =
  let before  = clock ()
  let v       = a ()
  let after   = clock ()
  after - before, v

let generateDataSet () =
  let random            = Random 19740531

  let firstDate         = DateTime(1970, 1, 1)

  let randomInt     ()  = random.Next () |> int64 |> (+) 10000000000L |> string
  let randomDate    ()  = (firstDate + (random.Next () |> float |> TimeSpan.FromSeconds)).ToString("s")
  let randomString  ()  = 
    let inline valid ch =
      match ch with
      | '"'
      | '\\'  -> ' '
      | _     -> ch
    let c   = random.Next () % 16
    let g i =
      if i = 0 || i = c + 1 then '"'
      else 32 + random.Next() % (127 - 32) |> char |> valid
    Array.init (c + 2) g |> String

  let columns =
    [|
      "Id"          , randomInt
      "ForeignId"   , randomInt
      "BirthDate"   , randomDate
      "OtherDate"   , randomDate
      "FirstName"   , randomString
      "LastName"    , randomString
    |]

  use sw      = new StreamWriter ("perf.csv")
  let headers = columns |> Array.map fst |> String.concat ";"
  sw.WriteLine headers
  for i = 0 to 700000 do
    let values = columns |> Array.map (fun (_, f) -> f ()) |> String.concat ";"
    sw.WriteLine values

open FSharp.Data

[<Literal>]
let sample = """Id;ForeignId;BirthDate;OtherDate;FirstName;LastName
11795679844;10287417237;2028-09-14T20:33:17;1993-07-21T17:03:25;",xS@ %aY)N*})Z";"ZP~;"
11127366946;11466785219;2028-02-22T08:39:57;2026-01-24T05:07:53;"H-/QA(";"g8}J?k~"
"""

type PerfFile = CsvProvider<sample, ";">

let readDataWithTp () =
  use streamReader  = new StreamReader ("perf.csv")
  let csvFile       = PerfFile.Load streamReader
  let length        = csvFile.Rows |> Seq.length
  printfn "%A" length

[<EntryPoint>]
let main argv = 
  Environment.CurrentDirectory <- AppDomain.CurrentDomain.BaseDirectory

  printfn "Generating dataset..."
  let ms, _ = time generateDataSet
  printfn "  took %d ms" ms

  printfn "Reading dataset..."
  let ms, _ = time readDataWithTp
  printfn "  took %d ms" ms

  0

Performance numbers (.NET 4.6.2 on my desktop):

Generating dataset...
  took 2162 ms
Reading dataset...
  took 6156 ms

Performance numbers (Mono 4.6.2 on a MacBook Pro):

Generating dataset...
  took 4432 ms
Reading dataset...
  took 8304 ms

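Update

It turns out that explicitly specifying Culture for the CsvProvider seems to degrade performance considerably. It can be any culture, not just sv-SE. But why?

If one checks the code the provider generates for the fast and the slow cases, one notices a difference:

Fast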

internal sealed class csvFile@78
{
  internal System.Tuple<long, long, System.DateTime, System.DateTime, string, string> Invoke(object arg1, string[] arg2)
  {
    Microsoft.FSharp.Core.FSharpOption<string> fSharpOption = TextConversions.AsString(arg2[0]);
    long arg_C9_0 = TextRuntime.GetNonOptionalValue<long>("Id", TextRuntime.ConvertInteger64("", fSharpOption), fSharpOption);
    fSharpOption = TextConversions.AsString(arg2[1]);
    long arg_C9_1 = TextRuntime.GetNonOptionalValue<long>("ForeignId", TextRuntime.ConvertInteger64("", fSharpOption), fSharpOption);
    fSharpOption = TextConversions.AsString(arg2[2]);
    System.DateTime arg_C9_2 = TextRuntime.GetNonOptionalValue<System.DateTime>("BirthDate", TextRuntime.ConvertDateTime("", fSharpOption), fSharpOption);
    fSharpOption = TextConversions.AsString(arg2[3]);
    System.DateTime arg_C9_3 = TextRuntime.GetNonOptionalValue<System.DateTime>("OtherDate", TextRuntime.ConvertDateTime("", fSharpOption), fSharpOption);
    fSharpOption = TextConversions.AsString(arg2[4]);
    string arg_C9_4 = TextRuntime.GetNonOptionalValue<string>("FirstName", TextRuntime.ConvertString(fSharpOption), fSharpOption);
    fSharpOption = TextConversions.AsString(arg2[5]);
    return new System.Tuple<long, long, System.DateTime, System.DateTime, string, string>(arg_C9_0, arg_C9_1, arg_C9_2, arg_C9_3, arg_C9_4, TextRuntime.GetNonOptionalValue<string>("LastName", TextRuntime.ConvertString(fSharpOption), fSharpOption));
  }
}

Slow

internal sealed class csvFile@78
{
  internal System.Tuple<long, long, System.DateTime, System.DateTime, string, string> Invoke(object arg1, string[] arg2)
  {
    Microsoft.FSharp.Core.FSharpOption<string> fSharpOption = TextConversions.AsString(arg2[0]);
    long arg_C9_0 = TextRuntime.GetNonOptionalValue<long>("Id", TextRuntime.ConvertInteger64("sv-SE", fSharpOption), fSharpOption);
    fSharpOption = TextConversions.AsString(arg2[1]);
    long arg_C9_1 = TextRuntime.GetNonOptionalValue<long>("ForeignId", TextRuntime.ConvertInteger64("sv-SE", fSharpOption), fSharpOption);
    fSharpOption = TextConversions.AsString(arg2[2]);
    System.DateTime arg_C9_2 = TextRuntime.GetNonOptionalValue<System.DateTime>("BirthDate", TextRuntime.ConvertDateTime("sv-SE", fSharpOption), fSharpOption);
    fSharpOption = TextConversions.AsString(arg2[3]);
    System.DateTime arg_C9_3 = TextRuntime.GetNonOptionalValue<System.DateTime>("OtherDate", TextRuntime.ConvertDateTime("sv-SE", fSharpOption), fSharpOption);
    fSharpOption = TextConversions.AsString(arg2[4]);
    string arg_C9_4 = TextRuntime.GetNonOptionalValue<string>("FirstName", TextRuntime.ConvertString(fSharpOption), fSharpOption);
    fSharpOption = TextConversions.AsString(arg2[5]);
    return new System.Tuple<long, long, System.DateTime, System.DateTime, string, string>(arg_C9_0, arg_C9_1, arg_C9_2, arg_C9_3, arg_C9_4, TextRuntime.GetNonOptionalValue<string>("LastName", TextRuntime.ConvertString(fSharpOption), fSharpOption));
  }
}

More specifically, this is the difference:

// Fast
TextRuntime.ConvertDateTime("", fSharpOption), fSharpOption)
// Slow
TextRuntime.ConvertDateTime("sv-SE", fSharpOption), fSharpOption)

When we specify a culture, it is passed to ConvertDateTime, which forwards it to GetCulture:

static member GetCulture(cultureStr) =
  if String.IsNullOrWhiteSpace cultureStr 
  then CultureInfo.InvariantCulture 
  else CultureInfo cultureStr

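This means that in the default case we use CultureInfo.InvariantCulture, but in any other case we create a new CultureInfo object for every field of every row. Caching could be done, but it isn't. The creation itself does not seem to take much time, but something happens each time we parse with a fresh CultureInfo object.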
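As a rough sketch of what such caching could look like, the function below keeps one CultureInfo per culture name and reuses it on later calls. This is not FSharp.Data's actual code: the names cultureCache and getCachedCulture are hypothetical, and the use of ConcurrentDictionary is an assumption made here so that the cache is safe to share between threads.

```fsharp
open System
open System.Collections.Concurrent
open System.Globalization

// Hypothetical cache for illustration: culture name -> CultureInfo instance.
let cultureCache = ConcurrentDictionary<string, CultureInfo>()

// Same contract as GetCulture above, but each named culture is created
// only once and then reused for every subsequent field and row.
let getCachedCulture (cultureStr: string) =
  if String.IsNullOrWhiteSpace cultureStr then
    CultureInfo.InvariantCulture
  else
    cultureCache.GetOrAdd(
      cultureStr,
      Func<string, CultureInfo>(fun name -> CultureInfo(name)))

// Two lookups for the same name should return the very same object.
let first  = getCachedCulture "sv-SE"
let second = getCachedCulture "sv-SE"
printfn "%b" (obj.ReferenceEquals(first, second))
```

Reusing the same instance presumably helps because the formatting data a CultureInfo needs for parsing is then built once instead of once per parsed field; that explanation is a guess and is not confirmed by the measurements in this answer.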

Parsing a DateTime in FSharp.Data is essentially this:

let dateTimeStyles = DateTimeStyles.AllowWhiteSpaces ||| DateTimeStyles.RoundtripKind
match DateTime.TryParse(text, cultureInfo, dateTimeStyles) with

So let's run a performance test in which one case uses a cached CultureInfo object and the other creates a fresh one for every parse.

open System
open System.Diagnostics
open System.Globalization

let clock =
  let sw = Stopwatch ()
  sw.Start ()
  fun () ->
    sw.ElapsedMilliseconds

let time a =
  let before  = clock ()
  let v       = a ()
  let after   = clock ()
  after - before, v

let perfTest c cf () =
  let dateTimeStyles = DateTimeStyles.AllowWhiteSpaces ||| DateTimeStyles.RoundtripKind
  let text = DateTime.Now.ToString ("", cf ())
  for i = 1 to c do
    let culture = cf ()
    DateTime.TryParse(text, culture, dateTimeStyles) |> ignore

[<EntryPoint>]
let main argv = 
  Environment.CurrentDirectory <- AppDomain.CurrentDomain.BaseDirectory

  let ct    = "sv-SE"
  let cct   = CultureInfo ct
  let count = 10000

  printfn "Using cached CultureInfo object..."
  let ms, _ = time (perfTest count (fun () -> cct))
  printfn "  took %d ms" ms

  printfn "Using fresh CultureInfo object..."
  let ms, _ = time (perfTest count (fun () -> CultureInfo ct))
  printfn "  took %d ms" ms

  0

Performance numbers on .NET 4.6.2, F# 4.1:

Using cached CultureInfo object...
  took 16 ms
Using fresh CultureInfo object...
  took 5328 ms

So it seems that caching the CultureInfo object in FSharp.Data should significantly improve CsvProvider performance when a culture is specified.


kli*_*ron 2

The problem was caused by CsvProvider not memoizing the explicitly set Culture. A pull request resolves the issue.