SQL Server Bulk Insert of a CSV file with inconsistent quotes

Question


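Is it possible to BULK INSERT (SQL Server) a CSV file in which the fields are only OCCASIONALLY surrounded by quotes? Specifically, quotes only surround those fields that contain a ",".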

In other words, I have data that looks like this (the first row contains headers):

id, company, rep, employees
729216,INGRAM MICRO INC.,"Stuart, Becky",523
729235,"GREAT PLAINS ENERGY, INC.","Nelson, Beena",114
721177,GEORGE WESTON BAKERIES INC,"Hogan, Meg",253


Because the quotes are not consistent, I can't use '","' as a delimiter, and I don't know how to create a format file that accounts for this.

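I tried using ',' as a delimiter and loading the file into a temporary table where every column is a varchar, then using some kludgy processing to strip out the quotes, but that doesn't work either, because the fields that contain ',' are split into multiple columns.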
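To make the failure concrete, here is a minimal Python sketch (Python is only used for illustration; it is not part of the original question). It splits one of the sample rows on every comma, the way a plain ',' field terminator would:

```python
# Sample row taken from the question above.
row = '729235,"GREAT PLAINS ENERGY, INC.","Nelson, Beena",114'

# Splitting on every comma ignores the quotes entirely.
naive_fields = row.split(",")

# The row has 4 logical columns, but the naive split yields 6 pieces.
print(len(naive_fields))
print(naive_fields)
```

The quoted company and rep values each break in two, which is why the staging-table approach cannot be repaired after the fact by stripping quotes.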

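Unfortunately, I don't have the ability to manipulate the CSV file beforehand.

Is this hopeless?

Many thanks in advance for any advice.

By the way, I saw the post "SQL bulk import from csv", but in that case EVERY field was consistently wrapped in quotes. So in that case he could use ',' as a delimiter and then strip out the quotes afterwards.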

Answer 1

Mac*_*ros 19

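According to MSDN, a bulk insert is not possible for this file:

To be usable as a data file for bulk import, a CSV file must comply with the following restrictions:

Data fields never contain the field terminator.
Either none or all of the values in a data field are enclosed in quotation marks ("").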

(http://msdn.microsoft.com/en-us/library/ms188609.aspx)

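Some simple text processing should be all that's required to get the file ready for import. Alternatively, your users could be required either to format the file according to these guidelines or to use something other than a comma as a delimiter (e.g. |).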
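As one possible form of that text processing, here is a rough sketch in Python using the standard `csv` module. The function name `redelimit` and the choice of `|` are my own assumptions for illustration, not something from MSDN or the question. It parses each row with a quote-aware reader and writes the fields back out joined by the new delimiter, refusing any field that would make the output ambiguous:

```python
import csv
import io

def redelimit(src, dst, new_delimiter="|"):
    """Copy CSV rows from src to dst, joining fields with new_delimiter.

    Returns the number of rows written. (Hypothetical helper.)
    """
    count = 0
    for row in csv.reader(src):  # csv.reader honours optional quotes
        for field in row:
            # The output is unquoted, so these characters would corrupt it.
            if new_delimiter in field or "\n" in field:
                raise ValueError(f"field {field!r} cannot be written unquoted")
        dst.write(new_delimiter.join(row) + "\n")
        count += 1
    return count

sample = io.StringIO(
    'id, company, rep, employees\n'
    '729216,INGRAM MICRO INC.,"Stuart, Becky",523\n'
    '729235,"GREAT PLAINS ENERGY, INC.","Nelson, Beena",114\n'
)
out = io.StringIO()
rows_written = redelimit(sample, out)
print(out.getvalue())
```

With real files, open both with `newline=""` as the `csv` module documentation recommends, then point BULK INSERT at the output with `FIELDTERMINATOR = '|'`.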

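You'd think that the CSV files Excel can create would be a valid format for bulk insert into SQL Server, or alternatively, you'd hope that the bulk inserter would be able to take Excel-created files and import them. Wouldn't you? Well, I would. Maybe that's just me. (2 upvotes)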

Answer 2

Chr*_*ark 18

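You are going to need to preprocess the file, period.

If you really, really need to do this, here is the code. I wrote it because I had absolutely no choice. It is utility code and I'm not proud of it, but it works. The approach is not to get SQL to understand quoted fields, but instead to manipulate the file so that it uses an entirely different delimiter.

EDIT: Here is the code in a github repo. It has been improved and now comes with unit tests! https://github.com/chrisclark/Redelim-it

This function takes an input file and replaces all field-delimiting commas (NOT commas inside quoted text fields, just the actual delimiting ones) with a new delimiter. You can then tell SQL Server to use the new field delimiter instead of a comma. In the version of the function here, the placeholder is <*TMP*> (I feel confident this will not appear in the original csv; if it does, brace for explosions).

Therefore, after running this function you can import into SQL Server by doing something like: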

BULK INSERT MyTable
FROM 'C:\FileCreatedFromThisFunction.csv'
WITH
(
FIELDTERMINATOR = '<*TMP*>',
ROWTERMINATOR = '\n'
)


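And without further ado, here is the terrible, awful function that I apologize in advance for inflicting on you (edit: I have posted a working program that does this, instead of just the function, on my blog):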

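'Requires Imports System.IO. UpdateStatus and SaveTextToFile are helper routines that are not defined in this snippet.
'hasHeaders is referenced below but was never declared; it is added here as an optional parameter (an assumption).
Private Function CsvToOtherDelimiter(ByVal InputFile As String, ByVal OutputFile As String, Optional ByVal hasHeaders As Boolean = True) As Integer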

        Dim PH1 As String = "<*TMP*>"

        Dim objReader As StreamReader = Nothing
        Dim count As Integer = 0 'This will also serve as a primary key'
        Dim sb As New System.Text.StringBuilder

        Try
            objReader = New StreamReader(File.OpenRead(InputFile), System.Text.Encoding.Default)
        Catch ex As Exception
            UpdateStatus(ex.Message)
        End Try

        If objReader Is Nothing Then
            UpdateStatus("Invalid file: " & InputFile)
            Return -1 'signal failure; a bare Exit Function would have returned 0
        End If

        'grab the first line
        Dim line = objReader.ReadLine()
        'and advance to the next line b/c the first line is column headings
        If hasHeaders Then
            line = Trim(objReader.ReadLine())
        End If

    While Not String.IsNullOrEmpty(line) 'loop through each line

        count += 1

        'Replace commas with our custom-made delimiter
        line = line.Replace(",", ph1)

        'Find a quoted part of the line, which could legitimately contain commas.
        'In that case we will need to identify the quoted section and swap commas back in for our custom placeholder.
        Dim starti = line.IndexOf(ph1 & """", 0)
        If line.IndexOf("""",0) = 0 then starti=0

        While starti > -1 'loop through quoted fields

            Dim FieldTerminatorFound As Boolean = False

            'Find end quote token (originally  a ",)
            Dim endi As Integer = line.IndexOf("""" & ph1, starti)

            If endi < 0 Then
                FieldTerminatorFound = True
                If endi < 0 Then endi = line.Length - 1
            End If

            While Not FieldTerminatorFound

                'Find any more quotes that are part of that sequence, if any
                Dim backChar As String = """" 'thats one quote
                Dim quoteCount = 0
                While backChar = """"
                    quoteCount += 1
                    backChar = line.Chars(endi - quoteCount)
                End While

                If quoteCount Mod 2 = 1 Then 'odd number of quotes. real field terminator
                    FieldTerminatorFound = True
                Else 'keep looking
                    endi = line.IndexOf("""" & ph1, endi + 1)
                End If
            End While

            'Grab the quoted field from the line, now that we have the start and ending indices
            Dim source = line.Substring(starti + ph1.Length, endi - starti - ph1.Length + 1)

            'And swap the commas back in
            line = line.Replace(source, source.Replace(ph1, ","))

            'Find the next quoted field
            '                If endi >= line.Length - 1 Then endi = line.Length 'During the swap, the length of line shrinks so an endi value at the end of the line will fail
            starti = line.IndexOf(ph1 & """", starti + ph1.Length)

        End While

            'keep the converted line so that it ends up in the output file
            sb.AppendLine(line)
            line = objReader.ReadLine

        End While

        objReader.Close()

        SaveTextToFile(sb.ToString, OutputFile)

        Return count

    End Function
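
For comparison, here is a compact Python sketch of the same goal. Note that it uses a different technique from the function above: instead of replacing every comma and then swapping commas back inside quoted sections, it walks the line once and tracks whether it is inside quotes. The function name `swap_delimiting_commas` is hypothetical; the `<*TMP*>` placeholder is the one used above. Like the original function, it leaves the quote characters in place.

```python
def swap_delimiting_commas(line, placeholder="<*TMP*>"):
    """Replace commas outside double-quoted sections with placeholder."""
    out = []
    in_quotes = False
    for ch in line:
        if ch == '"':
            # An escaped quote ("") toggles the flag twice, so the
            # in-quotes state is unchanged after the pair.
            in_quotes = not in_quotes
            out.append(ch)
        elif ch == "," and not in_quotes:
            out.append(placeholder)  # a real field separator
        else:
            out.append(ch)  # ordinary character or a comma inside quotes
    return "".join(out)

converted = swap_delimiting_commas('729216,INGRAM MICRO INC.,"Stuart, Becky",523')
print(converted)
```

This sketch assumes each record fits on one line and that quotes are balanced; a quoted field containing a line break would need the whole file to be scanned rather than one line at a time.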


Answer 3

小智 8

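I found the answer from Chris very helpful, but I wanted to run it from within SQL Server using T-SQL (and not using CLR), so I converted his code to T-SQL. But then I took it one step further by wrapping everything up in a stored procedure that does the following:

uses bulk insert to initially import the CSV file
cleans up the lines using Chris's code
returns the results in a table format

For my needs, I further cleaned up the lines by removing the quotes around values and converting two double quotes to one double quote (I think that's the correct method).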
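
The quote clean-up rule described in the last paragraph can be sketched on its own in a few lines of Python (for illustration only; `unquote_field` is a hypothetical name). One difference from the stored procedure below: this sketch only strips the quotes when the value has both an opening and a closing quote, whereas the procedure checks each end independently.

```python
def unquote_field(value):
    """Strip one pair of surrounding quotes, then collapse "" into "."""
    if len(value) >= 2 and value[0] == '"' and value[-1] == '"':
        value = value[1:-1]
    return value.replace('""', '"')

print(unquote_field('"Stuart, Becky"'))
print(unquote_field('"say ""hi"""'))
print(unquote_field('523'))
```

The stored procedure that applies the same idea in T-SQL follows.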

CREATE PROCEDURE SSP_CSVToTable

-- Add the parameters for the stored procedure here
@InputFile nvarchar(4000)
, @FirstLine int

AS

BEGIN

-- SET NOCOUNT ON added to prevent extra result sets from
-- interfering with SELECT statements.
SET NOCOUNT ON;

--convert the CSV file to a table
--clean up the lines so that commas are handled correctly

DECLARE @sql nvarchar(4000)
DECLARE @PH1 nvarchar(50)
DECLARE @LINECOUNT int -- This will also serve as a primary key
DECLARE @CURLINE int
DECLARE @Line nvarchar(4000)
DECLARE @starti int
DECLARE @endi int
DECLARE @FieldTerminatorFound bit
DECLARE @backChar nvarchar(4000)
DECLARE @quoteCount int
DECLARE @source nvarchar(4000)
DECLARE @COLCOUNT int
DECLARE @CURCOL int
DECLARE @ColVal nvarchar(4000)

-- new delimiter
SET @PH1 = '†'

-- create single column table to hold each line of file
CREATE TABLE [#CSVLine]([line] nvarchar(4000))

-- bulk insert into temp table
-- cannot use variable path with bulk insert
-- so we must run using dynamic sql
SET @Sql = 'BULK INSERT #CSVLine
FROM ''' + @InputFile + '''
WITH
(
FIRSTROW=' + CAST(@FirstLine as varchar) + ',
FIELDTERMINATOR = ''\n'',
ROWTERMINATOR = ''\n''
)'

-- run dynamic statement to populate temp table
EXEC(@sql)

-- get number of lines in table
SET @LINECOUNT = @@ROWCOUNT

-- add identity column to table so that we can loop through it
ALTER TABLE [#CSVLine] ADD [RowId] [int] IDENTITY(1,1) NOT NULL

IF @LINECOUNT > 0
BEGIN
    -- cycle through each line, cleaning each line
    SET @CURLINE = 1
    WHILE @CURLINE <= @LINECOUNT
    BEGIN
        -- get current line
        SELECT @line = line
          FROM #CSVLine
         WHERE [RowId] = @CURLINE

        -- Replace commas with our custom-made delimiter
        SET @Line = REPLACE(@Line, ',', @PH1)

        -- Find a quoted part of the line, which could legitimately contain commas.
        -- In that case we will need to identify the quoted section and swap commas back in for our custom placeholder.
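        SET @starti = CHARINDEX(@PH1 + '"' ,@Line, 0)
        -- a line that begins with a quote opens a quoted field at position 1 (CHARINDEX is 1-based, unlike IndexOf)
        If CHARINDEX('"', @Line, 0) = 1 SET @starti = 1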

        -- loop through quoted fields
        WHILE @starti > 0 
        BEGIN
            SET @FieldTerminatorFound = 0

            -- Find end quote token (originally  a ",)
            SET @endi = CHARINDEX('"' + @PH1, @Line, @starti)  -- sLine.IndexOf("""" & PH1, starti)

            IF @endi < 1
            BEGIN
                SET @FieldTerminatorFound = 1
                If @endi < 1 SET @endi = LEN(@Line) - 1
            END

            WHILE @FieldTerminatorFound = 0
            BEGIN
                -- Find any more quotes that are part of that sequence, if any
                SET @backChar = '"' -- thats one quote
                SET @quoteCount = 0

                WHILE @backChar = '"'
                BEGIN
                    SET @quoteCount = @quoteCount + 1
                    SET @backChar = SUBSTRING(@Line, @endi-@quoteCount, 1) -- sLine.Chars(endi - quoteCount)
                END

                IF (@quoteCount % 2) = 1
                BEGIN
                    -- odd number of quotes. real field terminator
                    SET @FieldTerminatorFound = 1
                END
                ELSE 
                BEGIN
                    -- keep looking
                    SET @endi = CHARINDEX('"' + @PH1, @Line, @endi + 1) -- sLine.IndexOf("""" & PH1, endi + 1)
                END

            END

            -- Grab the quoted field from the line, now that we have the start and ending indices
            SET @source = SUBSTRING(@Line, @starti + LEN(@PH1), @endi - @starti - LEN(@PH1) + 1) 
            -- sLine.Substring(starti + PH1.Length, endi - starti - PH1.Length + 1)

            -- And swap the commas back in
            SET @Line = REPLACE(@Line, @source, REPLACE(@source, @PH1, ','))
            --sLine.Replace(source, source.Replace(PH1, ","))

            -- Find the next quoted field
            -- If endi >= line.Length - 1 Then endi = line.Length 'During the swap, the length of line shrinks so an endi value at the end of the line will fail
            SET @starti = CHARINDEX(@PH1 + '"', @Line, @starti + LEN(@PH1))
            --sLine.IndexOf(PH1 & """", starti + PH1.Length)

        END

        -- get table based on current line
        IF OBJECT_ID('tempdb..#Line') IS NOT NULL
            DROP TABLE #Line

        -- converts a delimited list into a table
        SELECT *
        INTO #Line
        FROM dbo.iter_charlist_to_table(@Line,@PH1)

        -- get number of columns in line
        SET @COLCOUNT = @@ROWCOUNT

        -- dynamically create CSV temp table to hold CSV columns and lines
        -- only need to create once
        IF OBJECT_ID('tempdb..#CSV') IS NULL
        BEGIN
            -- create initial structure of CSV table
            CREATE TABLE [#CSV]([Col1] nvarchar(100))

            -- dynamically add a column for each column found in the first line
            SET @CURCOL = 1
            WHILE @CURCOL <= @COLCOUNT
            BEGIN
                -- first column already exists, don't need to add
                IF @CURCOL > 1 
                BEGIN
                    -- add field
                    SET @sql = 'ALTER TABLE [#CSV] ADD [Col' + Cast(@CURCOL as varchar) + '] nvarchar(100)'

                    --print @sql

                    -- this adds the fields to the temp table
                    EXEC(@sql)
                END

                -- go to next column
                SET @CURCOL = @CURCOL + 1
            END
        END

        -- build dynamic sql to insert current line into CSV table
        SET @sql = 'INSERT INTO [#CSV] VALUES('

        -- loop through line table, dynamically adding each column value
        SET @CURCOL = 1
        WHILE @CURCOL <= @COLCOUNT
        BEGIN
            -- get current column
            Select @ColVal = str 
              From #Line 
             Where listpos = @CURCOL

            IF LEN(@ColVal) > 0
            BEGIN
                -- remove quotes from beginning if exist
                IF LEFT(@ColVal,1) = '"'
                    SET @ColVal = RIGHT(@ColVal, LEN(@ColVal) - 1)

                -- remove quotes from end if exist
                IF RIGHT(@ColVal,1) = '"'
                    SET @ColVal = LEFT(@ColVal, LEN(@ColVal) - 1)
            END

            -- write column value
            -- make value sql safe by replacing single quotes with two single quotes
            -- also, replace two double quotes with a single double quote
            SET @sql = @sql + '''' + REPLACE(REPLACE(@ColVal, '''',''''''), '""', '"') + ''''

            -- add comma separator except for the last record
            IF @CURCOL <> @COLCOUNT
                SET @sql = @sql + ','

            -- go to next column
            SET @CURCOL = @CURCOL + 1
        END

        -- close sql statement
        SET @sql = @sql + ')'

        --print @sql

        -- run sql to add line to table
        EXEC(@sql)

        -- move to next line
        SET @CURLINE = @CURLINE + 1

    END

END

-- return CSV table
SELECT * FROM [#CSV]

END

GO


The stored procedure uses this helper function to parse a string into a table (thanks Erland Sommarskog!):

CREATE FUNCTION [dbo].[iter_charlist_to_table]
                (@list      ntext,
                 @delimiter nchar(1) = N',')
     RETURNS @tbl TABLE (listpos int IDENTITY(1, 1) NOT NULL,
                         str     varchar(4000),
                         nstr    nvarchar(2000)) AS

BEGIN
  DECLARE @pos      int,
          @textpos  int,
          @chunklen smallint,
          @tmpstr   nvarchar(4000),
          @leftover nvarchar(4000),
          @tmpval   nvarchar(4000)

  SET @textpos = 1
  SET @leftover = ''
  WHILE @textpos <= datalength(@list) / 2
  BEGIN
     SET @chunklen = 4000 - datalength(@leftover) / 2
     SET @tmpstr = @leftover + substring(@list, @textpos, @chunklen)
     SET @textpos = @textpos + @chunklen

     SET @pos = charindex(@delimiter, @tmpstr)

     WHILE @pos > 0
     BEGIN
        SET @tmpval = ltrim(rtrim(left(@tmpstr, @pos - 1)))
        INSERT @tbl (str, nstr) VALUES(@tmpval, @tmpval)
        SET @tmpstr = substring(@tmpstr, @pos + 1, len(@tmpstr))
        SET @pos = charindex(@delimiter, @tmpstr)
     END

     SET @leftover = @tmpstr
  END

  INSERT @tbl(str, nstr) VALUES (ltrim(rtrim(@leftover)), ltrim(rtrim(@leftover)))

RETURN

END


Here's how I call it from T-SQL. In this case, I'm inserting the results into a temp table, so I create the temp table first:

    -- create temp table for file import
CREATE TABLE #temp
(
    CustomerCode nvarchar(100) NULL,
    Name nvarchar(100) NULL,
    [Address] nvarchar(100) NULL,
    City nvarchar(100) NULL,
    [State] nvarchar(100) NULL,
    Zip nvarchar(100) NULL,
    OrderNumber nvarchar(100) NULL,
    TimeWindow nvarchar(100) NULL,
    OrderType nvarchar(100) NULL,
    Duration nvarchar(100) NULL,
    [Weight] nvarchar(100) NULL,
    Volume nvarchar(100) NULL
)

-- convert the CSV file into a table
INSERT #temp
EXEC [dbo].[SSP_CSVToTable]
     @InputFile = @FileLocation
    ,@FirstLine = @FirstImportRow


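I haven't tested the performance much, but it works well for what I need: importing CSV files with fewer than 1000 rows. However, it might choke on really large files.

Hopefully someone else also finds it useful.

Cheers!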

Answer 4

Ven*_*nts 5

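I have also created a function to convert a CSV to a format usable for bulk insert. I used the answer posted by Chris Clark as a starting point to create the following C# function.

I ended up using a regular expression to find the fields. I then recreated the file line by line, writing it to a new file as I went, which avoids loading the entire file into memory.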

private void CsvToOtherDelimiter(string CSVFile, System.Data.Linq.Mapping.MetaTable tbl)
{
    char PH1 = '|';
    StringBuilder ln;

    //Confirm file exists. Else, throw exception
    if (File.Exists(CSVFile))
    {
        using (TextReader tr = new StreamReader(CSVFile))
        {
            //Use a temp file to store our conversion
            using (TextWriter tw = new StreamWriter(CSVFile + ".tmp"))
            {
                string line = tr.ReadLine();
                //If we have already converted, no need to reconvert.
                //NOTE: We make the assumption here that the input header file 
                //      doesn't have a PH1 value unless it's already been converted.
                if (line != null && line.IndexOf(PH1) >= 0) //null check guards against an empty input file
                {
                    tw.Close();
                    tr.Close();
                    File.Delete(CSVFile + ".tmp");
                    return;
                }
                //Loop through input file
                while (!string.IsNullOrEmpty(line))
                {
                    ln = new StringBuilder();

                    //1. Use Regex expression to find comma separated values 
                    //using quotes as optional text qualifiers 
                    //(what MS EXCEL does when you import a csv file)
                    //2. Remove text qualifier quotes from data
                    //3. Replace any values of PH1 found in column data 
                    //with an equivalent character
                    //Regex:  \A[^,]*(?=,)|(?:[^",]*"[^"]*"[^",]*)+|[^",]*"[^"]*\Z|(?<=,)[^,]*(?=,)|(?<=,)[^,]*\Z|\A[^,]*\Z
                    List<string> fieldList = Regex.Matches(line, @"\A[^,]*(?=,)|(?:[^"",]*""[^""]*""[^"",]*)+|[^"",]*""[^""]*\Z|(?<=,)[^,]*(?=,)|(?<=,)[^,]*\Z|\A[^,]*\Z")
                            .Cast<Match>()
                            .Select(m => RemoveCSVQuotes(m.Value).Replace(PH1, '¦'))
                            .ToList<string>();

                    //Add the list of fields to ln, separated by PH1
                    fieldList.ToList().ForEach(m => ln.Append(m + PH1));

                    //Write to file. Don't include trailing PH1 value.
                    tw.WriteLine(ln.ToString().Substring(0, ln.ToString().LastIndexOf(PH1)));

                    line = tr.ReadLine();
                }


                tw.Close();
            }
            tr.Close();

            //Optional:  replace input file with output file
            File.Delete(CSVFile);
            File.Move(CSVFile + ".tmp", CSVFile);
        }
    }
    else
    {
        throw new ArgumentException(string.Format("Source file {0} not found", CSVFile));
    }
}
//The output file no longer needs quotes as a text qualifier, so remove them
private string RemoveCSVQuotes(string value)
{
    //if is empty string, then remove double quotes
    if (value == @"""""") value = "";
    //remove any double quotes, then any quotes on ends
    value = value.Replace(@"""""", @"""");
    if (value.Length >= 2)
        if (value.Substring(0, 1) == @"""")
            value = value.Substring(1, value.Length - 2);
    return value;
}
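
For readers who want to try the regular expression outside of .NET, here is a rough Python port of the per-line logic (file handling omitted). The function names are my own; the pattern and the '|' and '¦' characters are the ones used above. One assumption to keep in mind: the line terminator must be stripped before calling `convert_line`, since Python's `\Z` matches only at the very end of the string.

```python
import re

# Same pattern as in the C# code above, split over two lines for readability.
FIELD_RE = re.compile(
    r'\A[^,]*(?=,)|(?:[^",]*"[^"]*"[^",]*)+|[^",]*"[^"]*\Z'
    r'|(?<=,)[^,]*(?=,)|(?<=,)[^,]*\Z|\A[^,]*\Z'
)

def remove_csv_quotes(value):
    """Mirror RemoveCSVQuotes: collapse "" and drop the surrounding quotes."""
    if value == '""':
        return ""
    value = value.replace('""', '"')
    if len(value) >= 2 and value.startswith('"'):
        value = value[1:-1]
    return value

def convert_line(line, delimiter="|", substitute="¦"):
    """Split one CSV line into fields and re-join them with delimiter."""
    fields = [
        remove_csv_quotes(m.group(0)).replace(delimiter, substitute)
        for m in FIELD_RE.finditer(line)
    ]
    return delimiter.join(fields)

print(convert_line('729235,"GREAT PLAINS ENERGY, INC.","Nelson, Beena",114'))
```

I have only checked this port against the sample rows from the question, so treat it as a starting point rather than a drop-in replacement.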


Archived:	17 years, 2 months ago
Views:	61346
Last activity:	6 years, 10 months ago