Why is SSIS slow when enumerating many files in a directory and importing them?

Question

Chr*_*gna 7 sql-server ssis ssis-2012

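I have an SSIS package that is very slow. Processing a single file is reasonably fast, and so is processing 100 files or fewer (about one second per file).

However, if my directory holds thousands of (very small) files, the process slows to a crawl. My preference is to run this process only after business hours, but by then the number of flat files to import has reached the thousands.

The package is very simple:

The outer loop is a Foreach Loop (file enumerator, reading the file path into a variable)
Inside it, a plain import with no transformations on the data

That's it.

With thousands of files, performance drops to 15 seconds or more per file. The UI (status) paints and scrolls so slowly that I can't even see where it is: an execution started 18 hours ago shows more than 15 hours of elapsed time.

Version: MSSQL 2012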

Answer 1

bil*_*nkc 9

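I think you are running into a limitation of the UI/debugger.

I created two packages: MakeAllTheFiles and ReadAllTheFiles

MakeAllTheFiles accepts the number of files to create as input. It uses a pseudo-random function to distribute the data across a number (7) of subfolders.

MakeAllTheFiles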

    public void Main()
    {
        // Package variables: how many files to create and where to put them
        int NumberOfFilesToGenerate = (Int32)Dts.Variables["User::FilesToGenerate"].Value;
        string baseFolder = Dts.Variables["User::FolderInput"].Value.ToString();
        System.Random rand = null;
        DateTime current = DateTime.Now;
        int currentRandom = -1;
        int seed = 0;
        string folder = string.Empty;
        string currentFile = string.Empty;

        for (int i = 0; i < NumberOfFilesToGenerate; i++)
        {
            // Derive a per-file seed from the loop counter and the start time.
            // Caveat: if Hour, Minute or Second is 0, the seed is 0 for every i,
            // so every iteration targets the same file name.
            seed = i * current.Month * current.Day * current.Hour * current.Minute * current.Second;
            rand = new Random(seed);
            currentRandom = rand.Next();

            // Create files in sub folders f_0 .. f_6
            folder = System.IO.Path.Combine(baseFolder, string.Format("f_{0}", currentRandom % 7));

            // Create the folder if it does not exist
            if (!System.IO.Directory.Exists(folder))
            {
                System.IO.Directory.CreateDirectory(folder);
            }

            currentFile = System.IO.Path.Combine(folder, string.Format("input_{0}.txt", currentRandom));

            System.IO.FileInfo f = new System.IO.FileInfo(currentFile);
            using (System.IO.StreamWriter writer = f.CreateText())
            {
                // Each file gets 0 to 49 pipe-delimited data rows
                int upperBound = rand.Next(50);
                for (int row = 0; row < upperBound; row++)
                {
                    // Header line goes ahead of the first data row
                    if (row == 0)
                    {
                        writer.WriteLine(string.Format("{0}|{1}", "Col1", "Col2"));
                    }

                    writer.WriteLine(string.Format("{0}|{1}", row, seed));
                }
            }
        }
        Dts.TaskResult = (int)ScriptResults.Success;
    }

ReadAllTheFiles

The general look of the package is this:

[Image: Read all the files!]

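I have two connection managers defined: one for my database, and one for a flat file with an expression on the ConnectionString property so that it uses my variable @[User::CurrentFileName]

Variables: I like lots of variables, so there are plenty.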

My Execute SQL Task just stands up a table for me to write to, dropping it first if it already exists.

IF EXISTS
(
    SELECT * FROM sys.tables AS T WHERE T.name = 'dbase_54462' AND T.schema_id = SCHEMA_ID('dbo')
)
BEGIN
    DROP TABLE dbo.dbase_54462;
END

CREATE TABLE
    dbo.dbase_54462
(
    CurrentFile varchar(256) NOT NULL
,   Col1 int NOT NULL
,   Col2 varchar(50) NOT NULL
,   InsertDate datetime NOT NULL DEFAULT(CURRENT_TIMESTAMP)
);

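My Foreach Enumerator simply looks at everything in the Input folder that matches the file mask *.txt and traverses subfolders. The current file name is assigned to my variable @[User::CurrentFileName]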

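The Data Flow is bog standard. The Derived Column transformation there just adds the Current File Name variable to the data flow so that I can record it in my table.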

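Analysis

I'm lazy and didn't want to do anything special to record processing times, so I deployed my packages to the SSISDB catalog and ran them from there.

This query looks at the catalog data to find out how long the package ran and how many files it processed, and then generates a running average of files processed per second for each file count. Run 10047 was bad and was excluded from the analysis.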

SELECT
    E.execution_id
,   DATEDIFF(s, E.start_time, E.end_time) As duration_s
,   ES.rc AS FilesProcessed
,   AVG(ES.rc / (1.0 * DATEDIFF(s, E.start_time, E.end_time))) OVER (PARTITION BY ES.rc ORDER BY E.execution_id) AS running_average
FROM
    catalog.executions As E
    INNER JOIN
    (
        SELECT
            MIN(ES.start_time) As start_time
        ,   MAX(ES.end_time) AS end_time
        ,   count(1) As rc
        ,   ES.execution_id
        FROM
            catalog.executable_statistics AS ES
        GROUP BY
            ES.execution_id
    ) AS ES 
    ON ES.execution_id = E.execution_id
WHERE
    E.package_name = 'ReadAllTheFiles.dtsx'
    AND E.execution_id <> 10047
ORDER BY 1,2

Resulting data (gratuitous SQLFiddle)

execution_id  duration_s  FilesProcessed  running_average
10043         15          104             6.93333333333333
10044         13          104             7.46666666666666
10045         13          104             7.64444444444444
10050         102         1004            9.84313725490196
10051         101         1004            9.89186565715395
10052         102         1004            9.87562285640328
10053         106         1004            9.77464167060435
10055         1103        10004           9.06980961015412
10056         1065        10004           9.23161842010053
10057         1033        10004           9.38255038913446
10058         957         10004           9.65028792246735
10059         945         10004           9.83747901522255

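Based on this sample size, I see no appreciable difference between processing 100, 1000, or 10,000 files with SSIS as described here: throughput stays at roughly 7 to 10 files per second at every size.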
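To make that comparison explicit, the figures from the result table can be collapsed to one row per file count. This is only a sketch: the inline VALUES list is copied by hand from the table above, and the output column names (runs, avg_files_per_second, avg_seconds_per_file) are made up for illustration.

```sql
-- Collapse the measured runs to one row per file count.
-- The rows below are copied from the result table; nothing here reads SSISDB.
SELECT
    R.FilesProcessed
,   COUNT(*) AS runs
,   AVG(R.FilesProcessed / (1.0 * R.duration_s)) AS avg_files_per_second
,   AVG((1.0 * R.duration_s) / R.FilesProcessed) AS avg_seconds_per_file
FROM
(
    VALUES
        (10043, 15, 104)
    ,   (10044, 13, 104)
    ,   (10045, 13, 104)
    ,   (10050, 102, 1004)
    ,   (10051, 101, 1004)
    ,   (10052, 102, 1004)
    ,   (10053, 106, 1004)
    ,   (10055, 1103, 10004)
    ,   (10056, 1065, 10004)
    ,   (10057, 1033, 10004)
    ,   (10058, 957, 10004)
    ,   (10059, 945, 10004)
) AS R (execution_id, duration_s, FilesProcessed)
GROUP BY
    R.FilesProcessed
ORDER BY
    R.FilesProcessed;
```

Each group works out to somewhere around a tenth of a second per file, which is nowhere near the 15 seconds per file reported in the question.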

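Root cause hypothesis

Based on the comment referencing DTExecUI.exe, you are running the package from within Visual Studio (BIDS/SSDT/name-of-the-week). To get the pretty color changes and the debugging capability, the native execution (dtexec.exe) is wrapped in a debugging process. That puts an appreciable drag on execution.

Use the design environment to create packages and to run them against smaller data sets. Larger ones are best handled through the non-graphical, non-debugger execution interfaces (Ctrl+F5 in VS, deploy to the SSIS catalog and execute from there, or shell out to the command-line interface and use dtexec.exe)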
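For the catalog route, an execution can be started from T-SQL with the SSISDB catalog stored procedures. A minimal sketch, assuming the project has already been deployed; the folder and project names below are hypothetical placeholders, and only the package name comes from this answer.

```sql
DECLARE @execution_id bigint;

-- Create an execution instance for the deployed package.
EXEC SSISDB.catalog.create_execution
    @folder_name = N'MyFolder'          -- hypothetical catalog folder
,   @project_name = N'MyProject'        -- hypothetical project name
,   @package_name = N'ReadAllTheFiles.dtsx'
,   @use32bitruntime = 0
,   @reference_id = NULL
,   @execution_id = @execution_id OUTPUT;

-- Optional: make start_execution block until the package finishes
-- (object_type 50 marks a system parameter).
EXEC SSISDB.catalog.set_execution_parameter_value
    @execution_id = @execution_id
,   @object_type = 50
,   @parameter_name = N'SYNCHRONIZED'
,   @parameter_value = 1;

EXEC SSISDB.catalog.start_execution
    @execution_id = @execution_id;

-- The id can be used afterwards to look the run up in catalog.executions.
SELECT @execution_id AS execution_id;
```

Runs started this way are recorded in catalog.executions and catalog.executable_statistics, so the timing query from the Analysis section applies to them unchanged.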
