Select data from groups in a csv and append data to a text file

SMN*_*LLY 0 python csv python-itertools python-2.7

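I have a problem that I currently don't know how to solve. I have a csv in the format shown below. What I need to do is run some matching scenarios and append some text strings to a file.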

x,classA,uniqueclassindicator1,1,125,21.8,1,5.22,
x,classc,uniqueclassindicator1,3,125,21.8,2,5.22,
x,classd,uniqueclassindicator2,1,125,21.8,,,
x,classe,uniqueclassindicator2,2,125,21.8,,,
x,classBa,uniqueclassindicator2,3,125,21.8,,,
x,classBc,uniqueclassindicator2,4,125,21.8,,,
x,classAd,uniqueclassindicator3,1,125,21.8,2,2.56,
x,classc,uniqueclassindicator3,2,125,21.8,1,2.56,
x,classD,uniqueclassindicator3,3,125,21.8,,,
x,classa,uniqueclassindicator3,4,125,21.8,,,
x,classn,uniqueclassindicator4,1,125,21.8,,,
x,classm,uniqueclassindicator4,2,125,21.8,,,
x,classt,uniqueclassindicator4,3,125,21.8,,,
x,classd,uniqueclassindicator4,4,125,30.8,,,
x,classa,uniqueclassindicator4,5,125,31.8,,,
x,classn,uniqueclassindicator4,6,125,30.8,,,
x,classq,uniqueclassindicator5,1,125,35.8,1,3.31,3.1
x,classqe,uniqueclassindicator5,2,125,21.8,2,3.31,3.1 
x,classS,uniqueclassindicator5,3,125,21.8,3.31,3.1
x,classK,uniqueclassindicator5,4,125,21.8,,,
x,classL,uniqueclassindicator5,5,125,21.8,,,
x,classG,uniqueclassindicator5,6,125,21.8,,,
x,classH,uniqueclassindicator6,1,125,35.8,1,2.89,2.25   
x,classF,uniqueclassindicator6,2,125,21.8,3,2.89,2.25
x,classP,uniqueclassindicator6,3,125,21.8,2,2.89,2.25
x,classY,uniqueclassindicator6,4,125,21.8,,,
x,classU,uniqueclassindicator6,5,125,21.8,,,
x,classR,uniqueclassindicator6,6,125,21.8,,,

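Assume zero-based indexing throughout the example.

You will notice that in the csv, column 2 is the uniqueclassindicator. I need to do the following for each class.

1.

If columns 3 and 6 are both 1 in one row, and columns 3 and 6 are both 2 in a different row of the same unique class, generate the string: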

   "text data text" (column [1]) #where row = 1# "text data" column [1] #where row =2# "text" (column[17])`

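For example, in row 15 we have exactly this scenario, so the text string needs to read: text data text classq text data classqe text 3.31

In the text string above, "classq" is pulled from column 1 of row 15, "classqe" from column 1 of row 16, and "3.31" from column 8 of row 15.

To reiterate, the match that produces this string is within uniqueclassindicator5, where columns 3 and 6 match (1-1 and 2-2).

2.

Basically the same as 1, but for when columns 3 and 6 are 1,2 and 2,1. This happens in uniqueclassindicator3; see row 7 for an example. So we need to append the string: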

text data text classc text data classAd text 2.56 #Note I have listed the class which had a 1 in column 16 first.

3.

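This scenario applies when, for a given class, the values 1,2,3 in column 3 line up with 1,2,3 in column 6. Fortunately, we only need to return the column 8 value in the string, for example: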

test data test data (column[8]) test data test

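4.

Like scenario 2, this is when the same thing happens but not in the right order. So if, for a given uniqueclassindicator, column 3 = 1-3 and column 6 = 1-3 (in any arrangement other than the one described in scenario 3), then create the following string.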

data data (column[8]) data data.

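I know the code required to do this is not the simplest, but if anyone could help me achieve this I would be greatly indebted. If anything is unclear, please don't hesitate to contact me. Many thanks, SMNALLY

Edit - running the code provided by Martijn Pieters

I tried running the following code to match objectives 1, 2 and 3. I can get objectives 1 and 2 to work easily enough, but I cannot get objective 3 to work.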

from collections import defaultdict
import csv

# you probably can think up better names
fields = ('x', 'class', 'indicator', 'col3', 'col4', 'col5', 'col6', 'col7', 'col8')

entries = defaultdict(list)

with open('test.csv', 'rb') as fd:
    reader = csv.DictReader(fd, fields)

    for row in reader:
        # each row is now a dictionary
        # make your numbers, numbers
        for field in fields[3:]:
            row[field] = row[field] and float(row[field])

        previous = entries[row['indicator']]
        for p in previous:

            ##Objective 1
            if (row['col3'], row['col6']) == (2, 2) and (p['col3'], p['col6']) == (1, 1):
                print 'text {p[class]} text {r[class]} text {r[col7]}'.format(p=p, r=row)
            # etc, testing against previous rows with the same indicator
            ##Objective 2
            if (row['col3'], row['col6']) == (2, 1) and (p['col3'], p['col6']) == (1, 2):
                print 'data {p[class]} & {r[class]} data {r[col7]}'.format(p=p, r=row)
            ##Objective 3
            if (row['col3'], row['col6']) == (3, 3) and (p['col3'], p['col6']) == (2, 2) and (p['col3'], p['col6']) == (1, 1):
                print 'text data text data {r[col8]}'.format(p=p, r=row)     

        # remember this row for later rows to match against.
        previous.append(row)

Can you tell me what I am doing wrong with objective 3? I get no traceback, but I also get no text string.

Mar*_*ers 5

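You can store the information keyed by column 2 in a dictionary for easy lookup; for each unique column value, keep a list of entries to match against later.

A collections.defaultdict() object makes the first part easy. I used csv.DictReader() to give each column a meaningful name; instead of having to mentally map each column number to a meaning, the columns have names, which is easier to keep track of: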

from collections import defaultdict
import csv

# you probably can think up better names
fields = ('x', 'class', 'indicator', 'col3', 'col4', 'col5', 'col6', 'col7', 'col8')

entries = defaultdict(list)

with open(filename, 'rb') as fd:
    reader = csv.DictReader(fd, fields)

    for row in reader:
        # each row is now a dictionary
        # make your numbers, numbers
        for field in fields[3:]:
            row[field] = row[field] and float(row[field])

        previous = entries[row['indicator']]
        for p in previous:
            if (row['col3'], row['col6']) == (2, 2) and (p['col3'], p['col6']) == (1, 1):
                print 'text data text {p[class]} text data {r[class]} text {r[col8]}'.format(p=p, r=row)
            # etc, testing against previous rows with the same indicator

        # remember this row for later rows to match against.
        previous.append(row)

This matches exactly your first scenario, but the other scenarios are just as easy to match.
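A scenario that depends on two earlier rows, such as scenario 3, cannot be tested against a single `p` inside the loop: one row's `(col3, col6)` pair can never equal both `(1, 1)` and `(2, 2)` at once, so such a test silently never fires. The following is a minimal sketch of one way around this in the list-based version, assuming the rows have already been parsed into dictionaries with numeric values. The helper name `matches_scenario_3` and the sample rows are hypothetical, not part of the original code.

```python
def matches_scenario_3(row, previous):
    """True when row is (3, 3) and the earlier rows for the same
    indicator include both a (1, 1) pair and a (2, 2) pair."""
    # Collect every (col3, col6) pair seen so far for this indicator
    seen = set((p['col3'], p['col6']) for p in previous)
    return ((row['col3'], row['col6']) == (3, 3)
            and (1, 1) in seen and (2, 2) in seen)


# Hypothetical parsed rows for one indicator, in file order
previous = [
    {'class': 'a', 'col3': 1.0, 'col6': 1.0, 'col8': 2.25},
    {'class': 'b', 'col3': 2.0, 'col6': 2.0, 'col8': 2.25},
]
row = {'class': 'c', 'col3': 3.0, 'col6': 3.0, 'col8': 2.25}

message = None
if matches_scenario_3(row, previous):
    message = 'text data text data {r[col8]}'.format(r=row)
```

Like the loop it would sit in, this check relies on the `(1, 1)` and `(2, 2)` rows appearing in the file before the `(3, 3)` row.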

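If there are only a few entries per unique class indicator, this should be efficient enough. If you encounter hundreds (or worse) of rows per indicator, you will need to start looking at efficient matching structures for each scenario (since they match in different ways) to speed up lookups. That will likely need more memory, trading memory for speed.

Testing the above against your input dataset prints: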

text data text classq text data classqe text 3.1

Adjusting the code to support unique (col3, col6) tuples:

from collections import defaultdict
import csv

# you probably can think up better names
fields = ('x', 'class', 'indicator', 'col3', 'col4', 'col5', 'col6', 'col7', 'col8')

entries = defaultdict(dict)

with open(filename, 'rb') as fd:
    reader = csv.DictReader(fd, fields)

    for row in reader:
        # each row is now a dictionary
        # make your numbers, numbers
        for field in fields[3:]:
            row[field] = row[field] and float(row[field])

        key = (row['col3'], row['col6'])
        previous = entries[row['indicator']]

        # scenario 1
        if key == (2, 2) and (1, 1) in previous:
            p = previous[(1, 1)]
            print 'text data text {p[class]} text data {r[class]} text {r[col8]}'.format(p=p, r=row)

        # scenario 3
        if key == (3, 3) and (1, 1) in previous and (2, 2) in previous:
            print 'text data text data {r[col8]}'.format(r=row)

        # remember this row for later rows to match against.
        previous[key] = row
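Scenarios 2 and 4 are not covered by the code above. As a rough sketch under stated assumptions, they could be handled by grouping all rows first and then inspecting each group, which avoids depending on row order within a group. The function names `load_groups`, `scenario_2` and `scenario_4` are hypothetical. The use of `col7` in scenario 2 is an assumption chosen to reproduce the `2.56` given in the question, and the output strings are taken from the question's examples.

```python
import csv
from collections import defaultdict

FIELDS = ('x', 'class', 'indicator', 'col3', 'col4', 'col5', 'col6', 'col7', 'col8')


def load_groups(lines):
    """Group rows by indicator; within a group, key each row by (col3, col6)."""
    groups = defaultdict(dict)
    for row in csv.DictReader(lines, FIELDS):
        for field in FIELDS[3:]:
            value = (row[field] or '').strip()
            # Blank or missing fields become None instead of a number
            row[field] = float(value) if value else None
        groups[row['indicator']][(row['col3'], row['col6'])] = row
    return groups


def scenario_2(group):
    """Rows (1, 2) and (2, 1): the row with col6 == 1 is listed first."""
    if (1, 2) in group and (2, 1) in group:
        first, second = group[(2, 1)], group[(1, 2)]
        return 'text data text {a[class]} text data {b[class]} text {a[col7]}'.format(
            a=first, b=second)
    return None


def scenario_4(group):
    """col3 values 1-3 paired with col6 values 1-3, in any arrangement
    other than (1, 1), (2, 2), (3, 3)."""
    pairing = dict((c3, c6) for (c3, c6) in group
                   if c3 in (1, 2, 3) and c6 in (1, 2, 3))
    is_permutation = (sorted(pairing) == [1, 2, 3]
                      and sorted(pairing.values()) == [1, 2, 3])
    in_order = all(c3 == c6 for c3, c6 in pairing.items())
    if is_permutation and not in_order:
        row = group[(1, pairing[1])]
        return 'data data {r[col8]} data data.'.format(r=row)
    return None


# A few rows copied from the question's sample data
sample = [
    'x,classAd,uniqueclassindicator3,1,125,21.8,2,2.56,',
    'x,classc,uniqueclassindicator3,2,125,21.8,1,2.56,',
    'x,classH,uniqueclassindicator6,1,125,35.8,1,2.89,2.25',
    'x,classF,uniqueclassindicator6,2,125,21.8,3,2.89,2.25',
    'x,classP,uniqueclassindicator6,3,125,21.8,2,2.89,2.25',
]
groups = load_groups(sample)
line_2 = scenario_2(groups['uniqueclassindicator3'])
line_4 = scenario_4(groups['uniqueclassindicator6'])
```

Because each group is a dictionary keyed by `(col3, col6)`, a later row with the same pair replaces an earlier one, just as in the answer's adjusted code. If a group can contain duplicate pairs, a list per key would be needed instead.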