Can I import a CSV file and automatically infer the delimiter?

rom*_*rom 43 python csv import file delimiter

I want to import two kinds of CSV files: some use ";" as the delimiter and others use ",". So far I have been switching between the following two lines:

reader=csv.reader(f,delimiter=';')

or

reader=csv.reader(f,delimiter=',')

Is it possible to leave the delimiter unspecified and let the program detect the right one?

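The solutions below (from Blender and sharth) seem to work for comma-separated files (generated with LibreOffice) but not for semicolon-separated files (generated with MS Office). Here are the first lines of one semicolon-separated file: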

ReleveAnnee;ReleveMois;NoOrdre;TitreRMC;AdopCSRegleVote;AdopCSAbs;AdoptCSContre;NoCELEX;ProposAnnee;ProposChrono;ProposOrigine;NoUniqueAnnee;NoUniqueType;NoUniqueChrono;PropoSplittee;Suite2LecturePE;Council PATH;Notes
1999;1;1;1999/83/EC: Council Decision of 18 January 1999 authorising the Kingdom of Denmark to apply or to continue to apply reductions in, or exemptions from, excise duties on certain mineral oils used for specific purposes, in accordance with the procedure provided for in Article 8(4) of Directive 92/81/EEC;U;;;31999D0083;1998;577;COM;NULL;CS;NULL;;;;Propos* are missing on Celex document
1999;1;2;1999/81/EC: Council Decision of 18 January 1999 authorising the Kingdom of Spain to apply a measure derogating from Articles 2 and 28a(1) of the Sixth Directive (77/388/EEC) on the harmonisation of the laws of the Member States relating to turnover taxes;U;;;31999D0081;1998;184;COM;NULL;CS;NULL;;;;Propos* are missing on Celex document

Bil*_*nch 47

The csv module seems to recommend using the CSV sniffer for this problem.

They give the following example, which I have adapted for your case.

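import csv

# Python 2 shown here; on Python 3 use open('example.csv', 'r', newline='')
with open('example.csv', 'rb') as csvfile:
    # Sniff the first 1024 characters, considering only ';' and ',' as candidates
    dialect = csv.Sniffer().sniff(csvfile.read(1024), delimiters=";,")
    csvfile.seek(0)  # rewind so the reader starts from the first row
    reader = csv.reader(csvfile, dialect)
    # ... process CSV file contents here ...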
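To make the effect of the `delimiters` argument easier to see, here is a minimal Python 3 sketch that sniffs two in-memory samples instead of files. The helper name `sniff_delimiter` and the sample strings are made up for illustration; only `csv.Sniffer().sniff()` and `csv.reader()` come from the standard library.

```python
import csv
import io

# Two in-memory samples standing in for the two kinds of files (illustrative data).
comma_sample = "test,box,foo\nround,the,bend\nwho,are,you\n"
semicolon_sample = "test;box;foo\nround;the;bend\nwho;are;you\n"


def sniff_delimiter(text, candidates=";,"):
    """Return the delimiter csv.Sniffer picks among `candidates` (hypothetical helper)."""
    dialect = csv.Sniffer().sniff(text, delimiters=candidates)
    return dialect.delimiter


for sample in (comma_sample, semicolon_sample):
    delimiter = sniff_delimiter(sample)
    # Parse the same text with the detected delimiter.
    rows = list(csv.reader(io.StringIO(sample), delimiter=delimiter))
    print(repr(delimiter), rows[0])
```

Restricting the candidates to `";,"` keeps the sniffer from settling on some other character that happens to occur the same number of times on every line.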

Let's try it out.

[9:13am][wlynch@watermelon /tmp] cat example 
#!/usr/bin/env python
import csv

def parse(filename):
    with open(filename, 'rb') as csvfile:
        dialect = csv.Sniffer().sniff(csvfile.read(), delimiters=';,')
        csvfile.seek(0)
        reader = csv.reader(csvfile, dialect)

        for line in reader:
            print line

def main():
    print 'Comma Version:'
    parse('comma_separated.csv')

    print
    print 'Semicolon Version:'
    parse('semicolon_separated.csv')

    print
    print 'An example from the question (kingdom.csv)'
    parse('kingdom.csv')

if __name__ == '__main__':
    main()

Our sample inputs:

[9:13am][wlynch@watermelon /tmp] cat comma_separated.csv 
test,box,foo
round,the,bend

[9:13am][wlynch@watermelon /tmp] cat semicolon_separated.csv 
round;the;bend
who;are;you

[9:22am][wlynch@watermelon /tmp] cat kingdom.csv 
ReleveAnnee;ReleveMois;NoOrdre;TitreRMC;AdopCSRegleVote;AdopCSAbs;AdoptCSContre;NoCELEX;ProposAnnee;ProposChrono;ProposOrigine;NoUniqueAnnee;NoUniqueType;NoUniqueChrono;PropoSplittee;Suite2LecturePE;Council PATH;Notes
1999;1;1;1999/83/EC: Council Decision of 18 January 1999 authorising the Kingdom of Denmark to apply or to continue to apply reductions in, or exemptions from, excise duties on certain mineral oils used for specific purposes, in accordance with the procedure provided for in Article 8(4) of Directive 92/81/EEC;U;;;31999D0083;1998;577;COM;NULL;CS;NULL;;;;Propos* are missing on Celex document
1999;1;2;1999/81/EC: Council Decision of 18 January 1999 authorising the Kingdom of Spain to apply a measure derogating from Articles 2 and 28a(1) of the Sixth Directive (77/388/EEC) on the harmonisation of the laws of the Member States relating to turnover taxes;U;;;31999D0081;1998;184;COM;NULL;CS;NULL;;;;Propos* are missing on Celex document

If we run the example program:

[9:14am][wlynch@watermelon /tmp] ./example 
Comma Version:
['test', 'box', 'foo']
['round', 'the', 'bend']

Semicolon Version:
['round', 'the', 'bend']
['who', 'are', 'you']

An example from the question (kingdom.csv)
['ReleveAnnee', 'ReleveMois', 'NoOrdre', 'TitreRMC', 'AdopCSRegleVote', 'AdopCSAbs', 'AdoptCSContre', 'NoCELEX', 'ProposAnnee', 'ProposChrono', 'ProposOrigine', 'NoUniqueAnnee', 'NoUniqueType', 'NoUniqueChrono', 'PropoSplittee', 'Suite2LecturePE', 'Council PATH', 'Notes']
['1999', '1', '1', '1999/83/EC: Council Decision of 18 January 1999 authorising the Kingdom of Denmark to apply or to continue to apply reductions in, or exemptions from, excise duties on certain mineral oils used for specific purposes, in accordance with the procedure provided for in Article 8(4) of Directive 92/81/EEC', 'U', '', '', '31999D0083', '1998', '577', 'COM', 'NULL', 'CS', 'NULL', '', '', '', 'Propos* are missing on Celex document']
['1999', '1', '2', '1999/81/EC: Council Decision of 18 January 1999 authorising the Kingdom of Spain to apply a measure derogating from Articles 2 and 28a(1) of the Sixth Directive (77/388/EEC) on the harmonisation of the laws of the Member States relating to turnover taxes', 'U', '', '', '31999D0081', '1998', '184', 'COM', 'NULL', 'CS', 'NULL', '', '', '', 'Propos* are missing on Celex document']

It is probably also worth noting which version of Python I am using.

[9:20am][wlynch@watermelon /tmp] python -V
Python 2.7.2
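The run above was made with Python 2. As a hedged sketch of how the same `parse` function might look on Python 3, the file is opened in text mode with `newline=""` (as the csv documentation recommends) and the rows are returned rather than printed. The `candidates` parameter is an addition for illustration and is not part of the original script.

```python
import csv


def parse(filename, candidates=";,"):
    """Read a CSV file whose delimiter is one of `candidates` and return its rows."""
    # Text mode with newline="" is what the csv module expects on Python 3.
    with open(filename, "r", newline="") as csvfile:
        # Sniff the whole file, as the original script does.
        dialect = csv.Sniffer().sniff(csvfile.read(), delimiters=candidates)
        csvfile.seek(0)  # rewind before handing the file to the reader
        return list(csv.reader(csvfile, dialect))
```

Reading the whole file for sniffing is fine for small inputs; for large files a bounded sample such as `csvfile.read(1024)` may be preferable.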


And*_*ile 8

Given a project that deals with both , (comma) and | (vertical bar) delimited CSV files, all well formed, I tried the following (as shown at https://docs.python.org/2/library/csv.html#csv.Sniffer):

dialect = csv.Sniffer().sniff(csvfile.read(1024), delimiters=',|')

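However, on a |-delimited file this raised the "Could not determine delimiter" exception. It seems reasonable to speculate that the sniffing heuristic works best when every line has the same number of delimiters (not counting any that appear inside quotes). So instead of reading the first 1024 bytes of the file, I tried reading the first two lines in full: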

temp_lines = csvfile.readline() + '\n' + csvfile.readline()
dialect = csv.Sniffer().sniff(temp_lines, delimiters=',|')

So far, this has worked well for me.

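  • This was very helpful for me! I was having trouble with data where one of the values was a number containing commas, so sniffing kept failing. Limiting it to the first two lines really helped. (2 upvotes)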
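The two-line idea can be wrapped in a small function. The following is only a sketch under a few assumptions: the name `sniff_first_lines`, the `max_lines` parameter and the `default` fallback are invented here, and falling back to a default when `csv.Error` is raised is one possible policy, not something the answer prescribes.

```python
import csv
from itertools import islice


def sniff_first_lines(path, candidates=",|", max_lines=2, default=","):
    """Guess the delimiter from the first `max_lines` lines of the file at `path`."""
    with open(path, "r", newline="") as csvfile:
        # Each line read from the file already ends with its own newline.
        sample = "".join(islice(csvfile, max_lines))
    try:
        return csv.Sniffer().sniff(sample, delimiters=candidates).delimiter
    except csv.Error:
        # Raised with "Could not determine delimiter" when sniffing fails.
        return default
```

Because only whole lines are sampled, a row cut off in the middle of a 1024-byte window cannot skew the per-line delimiter counts.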

rom*_*rom 7

To solve this problem, I wrote a function that reads the first line of the file (the header) and detects the delimiter.

def detectDelimiter(csvFile):
    with open(csvFile, 'r') as myCsvfile:
        header=myCsvfile.readline()
        if header.find(";")!=-1:
            return ";"
        if header.find(",")!=-1:
            return ","
    #default delimiter (MS Office export)
    return ";"

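  • Your function will not work if the delimiter is part of a value, even if it is escaped or quoted. For example, a line like "Hi Peter;","How are you?","Bye John!" will return `;` as the delimiter, which is wrong. (8 upvotes)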
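One way to address the objection in the comment above is to let `csv.reader` parse the header once per candidate and keep the candidate that yields the most fields, so delimiters inside a quoted value are not counted. This is a heuristic sketch, not the answer's method; the function name `detect_delimiter` and its parameters are hypothetical.

```python
import csv


def detect_delimiter(header, candidates=(";", ","), default=";"):
    """Pick the candidate that splits `header` into the most fields (quote-aware)."""
    best, best_fields = default, 1
    for candidate in candidates:
        # csv.reader honours double quotes, so a candidate inside a quoted
        # value does not produce an extra field.
        fields = next(csv.reader([header], delimiter=candidate), [])
        if len(fields) > best_fields:
            best, best_fields = candidate, len(fields)
    return best


print(detect_delimiter('"Hi Peter;","How are you?","Bye John!"'))
print(detect_delimiter("ReleveAnnee;ReleveMois;NoOrdre"))
```

If no candidate splits the header into more than one field, the function returns `default`; when two candidates tie, the one listed first in `candidates` wins.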

Vla*_*ruz 6

If you are using DictReader, you can do this:

#!/usr/bin/env python
import csv

def parse(filename):
    with open(filename, 'r', newline='') as csvfile:  # text mode: on Python 3 the Sniffer needs str, not bytes
        dialect = csv.Sniffer().sniff(csvfile.read(), delimiters=';,')
        csvfile.seek(0)
        reader = csv.DictReader(csvfile, dialect=dialect)

        for line in reader:
            print(line['ReleveAnnee'])

I used this with Python 3.5 and it works this way.