How to pull X amounts of previous data into a row in a CSV

Question

SMN*_*LLY 4 python csv datetime date python-2.7

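I have a very large CSV of data. For each name in column 2, I need to append that name's earlier data to every row, taken only from dates before the date of the current row. I think the easiest way to describe the problem is a detailed example that resembles my real data but is scaled down significantly: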

Datatitle,Date,Name,Score,Parameter
data,01/09/13,george,219,dataa,text
data,01/09/13,fred,219,datab,text
data,01/09/13,tom,219,datac,text
data,02/09/13,george,229,datad,text
data,02/09/13,fred,239,datae,text
data,02/09/13,tom,219,dataf,text
data,03/09/13,george,209,datag,text
data,03/09/13,fred,217,datah,text
data,03/09/13,tom,213,datai,text
data,04/09/13,george,219,dataj,text
data,04/09/13,fred,212,datak,text
data,04/09/13,tom,222,datal,text
data,05/09/13,george,319,datam,text
data,05/09/13,fred,225,datan,text
data,05/09/13,tom,220,datao,text
data,06/09/13,george,202,datap,text
data,06/09/13,fred,226,dataq,text
data,06/09/13,tom,223,datar,text
data,06/09/13,george,219,dataae,text

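For the first three rows of this CSV there is no previous data. So if we say we want to pull columns 3 and 4 for the last 3 occurrences of george (row 1) on dates before the current date, the result would be: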

data,01/09/13,george,219,dataa,text,x,y,x,y,x,y

However, once previous data becomes available, we want to produce a CSV like this:

Datatitle,Date,Name,Score,Parameter,LTscore,LTParameter,LTscore+1,LTParameter+1,LTscore+2,LTParameter+3,
data,01/09/13,george,219,dataa,text,x,y,x,y,x,y
data,01/09/13,fred,219,datab,text,x,y,x,y,x,y
data,01/09/13,tom,219,datac,text,x,y,x,y,x,y
data,02/09/13,george,229,datad,text,219,dataa,x,y,x,y
data,02/09/13,fred,239,datae,text,219,datab,x,y,x,y
data,02/09/13,tom,219,dataf,text,219,datac,x,y,x,y
data,03/09/13,george,209,datag,text,229,datad,219,dataa,x,y
data,03/09/13,fred,217,datah,text,239,datae,219,datab,x,y
data,03/09/13,tom,213,datai,text,219,dataf,219,datac,x,y
data,04/09/13,george,219,dataj,text,209,datag,229,datad,219,dataa
data,04/09/13,fred,212,datak,text,217,datah,239,datae,219,datab
data,04/09/13,tom,222,datal,text,213,datai,219,dataf,219,datac
data,05/09/13,george,319,datam,text,219,dataj,209,datag,229,datad
data,05/09/13,fred,225,datan,text,212,datak,217,datah,239,datae
data,05/09/13,tom,220,datao,text,222,datal,213,datai,219,dataf
data,06/09/13,george,202,datap,text,319,datam,219,dataj,209,datag
data,06/09/13,fred,226,dataq,text,225,datan,212,datak,217,datah
data,06/09/13,tom,223,datar,text,220,datao,222,datal,213,datai
data,06/09/13,george,219,datas,text,319,datam,219,dataj,209,datag

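You will notice that george occurs twice on 06/09/13, and both times the same string 319,datam,219,dataj,209,datag is appended to his row. The second time george appears he gets the same string, because the george row 3 lines above falls on the same date. (This just emphasises "on dates before the current date".)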

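As you can see from the column headers, we are collecting the last 3 scores and their 3 associated parameters and appending them to each row. Note that this is a very simplified example. In reality each date will contain a few thousand rows, and there is no pattern to the names in the real data, so we would not expect to see fred, tom, george next to each other in a repeating pattern. If anyone can help me work out how best (most efficiently) to achieve this, I would be very grateful. If anything is unclear, please let me know and I will add more detail. Any constructive comments are appreciated. Thanks SMNALLY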

Answer 1

Jon*_*nts 11

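Your file appears to be in date order. If we take the last entry for each name within each date, and add it to a fixed-size deque per name as we write out each row, then that should do the trick: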

import csv
from collections import deque, defaultdict
from itertools import chain, islice, groupby
from operator import itemgetter

# defaultdict whose first access of a key creates a deque of maximum length 3,
# pre-filled with [['x', 'y'], ['x', 'y'], ['x', 'y']].
# Deques are efficient at head/tail manipulation, so inserting at the start is
# cheap, and because the size is fixed, extra elements "fall off" the end.
names_previous = defaultdict(lambda: deque([['x', 'y']] * 3, 3))
with open('sample.csv', 'rb') as fin, open('sample_new.csv', 'wb') as fout:
    csvin = csv.reader(fin)
    csvout = csv.writer(fout)
    # Use groupby to detect changes in the date column. Since the data is always
    # ascending, the rows for the same date are contiguous in the file. We use
    # this to identify the rows within the *same* date.
    # date=date we're looking at, rows=an iterable of rows that are in that date...
    for date, rows in groupby(islice(csvin, 1, None), itemgetter(1)):
        # After we've processed entries in this date, we need to know what items of data should
        # be considered for the names we've seen inside this date. Currently the data
        # is taken from the last occurring row for the name.
        to_add = {}
        for row in rows:
            # Output the row present in the file with a *flattened* version of the extra data
            # (previous items) that we wish to apply. eg:
            # [['x', 'y'], ['x', 'y'], ['x', 'y']] becomes ['x', 'y', 'x', 'y', 'x', 'y']
            # So we're easily able to store 3 pairs of data, but flatten it into one long
            # list of 6 items...
            # If the name (row[2]) doesn't exist yet, then by trying to do this, defaultdict
            # will automatically create the default key as above.
            csvout.writerow(row + list(chain.from_iterable(names_previous[row[2]])))
            # Here, we store for the name any additional data that should be included for the name
            # on the next date group. In this instance we store the information seen for the last
            # occurrence of that name in this date. eg: If we've seen it more than once, then
            # we only include data from the last occurrence. 
            # NB: If you wanted to include more than one item of data for the name, then you could
            # utilise a deque again by building it within this date group
            to_add[row[2]] = row[3:5]            
        for key, val in to_add.iteritems():
            # We've finished the date, so before processing the next one, update the previous data
            # for the names. In this case, we push a single item of data to the front of the deque.
            # If we were storing multiple items in the data loop, then we could .extendleft() instead
            # to insert > 1 set of data from above.
            names_previous[key].appendleft(val)

This way, only the names and their last 3 values are kept in memory during the run.
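
To illustrate why memory stays bounded, here is a minimal sketch of the fixed-size deque on its own. The variable names `history` and `latest_three` are made up for this illustration; the pairs are george's first four (score, parameter) values from the question's sample.

```python
from collections import defaultdict, deque

# Each name starts with three placeholder pairs, as in the code above.
history = defaultdict(lambda: deque([['x', 'y']] * 3, maxlen=3))

# Push george's (score, parameter) pairs in date order; the newest goes to the front.
for pair in (['219', 'dataa'], ['229', 'datad'], ['209', 'datag'], ['219', 'dataj']):
    history['george'].appendleft(pair)

# Only the three most recent pairs survive; the oldest one fell off the end.
latest_three = list(history['george'])
print(latest_three)
```

After the fourth `appendleft`, the pair from 01/09/13 has been discarded, which matches what george's 05/09/13 row shows in the expected output.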

You may want to adjust this to write out a proper new header, rather than just skipping the header on the input.
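
One possible way to handle the header is sketched below. Several things here are assumptions and not part of the answer: it is written for Python 3 (the question is tagged python-2.7), the helper name `add_previous` and its `depth` parameter are hypothetical, and the generated column names simply follow the `LTscore`, `LTParameter`, `LTscore+1`, ... pattern from the question. The input is still assumed to be sorted by date.

```python
import csv
import io
from collections import defaultdict, deque
from itertools import chain, groupby
from operator import itemgetter


def add_previous(rows, depth=3):
    """Yield an extended header, then each row with `depth` previous pairs appended.

    Assumes rows are in date order, with the date in column 1, the name in
    column 2 and the (score, parameter) pair in columns 3 and 4.
    """
    rows = iter(rows)
    header = next(rows)
    extra = []
    for i in range(depth):
        suffix = '' if i == 0 else '+%d' % i
        extra += ['LTscore' + suffix, 'LTParameter' + suffix]
    yield header + extra

    previous = defaultdict(lambda: deque([['x', 'y']] * depth, maxlen=depth))
    for _date, group in groupby(rows, key=itemgetter(1)):
        to_add = {}
        for row in group:
            yield row + list(chain.from_iterable(previous[row[2]]))
            to_add[row[2]] = row[3:5]  # the last occurrence within the date wins
        # Only publish this date's values once the whole date has been written.
        for name, pair in to_add.items():
            previous[name].appendleft(pair)


# Three rows taken from the question's sample, read from an in-memory file.
sample = """Datatitle,Date,Name,Score,Parameter
data,05/09/13,george,319,datam,text
data,06/09/13,george,202,datap,text
data,06/09/13,george,219,dataae,text
"""
result = list(add_previous(csv.reader(io.StringIO(sample))))
for line in result:
    print(','.join(line))
```

When working with real files in Python 3, open both in text mode with `newline=''` instead of `'rb'`/`'wb'`, and pass the generator to `csv.writer(fout).writerows(...)`.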
