I have some code that parses an Apache log file (start_search and end_search are date strings in the format found in the Apache log):
import csv
import re
from itertools import takewhile, dropwhile

with open("/var/log/apache2/access.log", 'r') as log:
    # Skip lines until start_search appears, then stop at end_search.
    s_log = dropwhile(lambda L: start_search not in L, log)
    e_log = takewhile(lambda L: end_search not in L, s_log)
    query = [line for line in e_log if re.search(r'GET /(.+veggies|.+fruits)', line)]

query_dict = csv.DictReader(query, fieldnames=('ip','na-1','na-2','time', 'zone', 'url', 'refer', 'client'), quotechar='"', delimiter=" ")

veggies = [ x for x in query_dict if re.search('veggies',x['url']) ]
fruits = [ x for x in query_dict if re.search('fruits',x['url']) ]
The second list comprehension always comes out empty; that is, if I swap the order of the last two lines:
fruits = [ x for x in query_dict if re.search('fruits',x['url']) ]
veggies = [ x for x in query_dict if re.search('veggies',x['url']) ]
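the second list is still always empty.
Why? (And how can I populate both the fruits and veggies lists?)
You can only loop over an iterator once; query_dict is an iterator, and once it has been scanned for veggies it cannot be iterated a second time to search for fruits.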
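The exhaustion is easy to reproduce on a tiny in-memory sample. The sketch below is only an illustration: the two-column rows and the names sample, first_pass and second_pass are made up and are not part of the original log format.

```python
import csv

# Hypothetical two-column rows standing in for the filtered log lines.
sample = ["1.2.3.4 /veggies", "5.6.7.8 /fruits"]
reader = csv.DictReader(sample, fieldnames=("ip", "url"), delimiter=" ")

# The first pass consumes every row the reader can produce.
first_pass = [row["url"] for row in reader]
# The second pass finds nothing left: the reader does not rewind.
second_pass = [row["url"] for row in reader]

print(first_pass)   # ['/veggies', '/fruits']
print(second_pass)  # []
```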
Don't use list comprehensions here. Loop over query_dict once and test each entry for both veggies and fruits:
veggies = []
fruits = []
for x in query_dict:
    if re.search('veggies', x['url']):
        veggies.append(x)
    if re.search('fruits', x['url']):
        fruits.append(x)
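For a version that can be run without a real log file, the single-pass loop can be wrapped in a function and fed a couple of lines. This is a rough sketch: split_rows is a hypothetical helper name, and the sample lines are invented and simplified (they omit the status and size fields a real access log would carry) so that they match the eight field names used in the question.

```python
import csv
import re

FIELDNAMES = ("ip", "na-1", "na-2", "time", "zone", "url", "refer", "client")

def split_rows(lines):
    """Return (veggies, fruits) after a single pass over the parsed rows."""
    reader = csv.DictReader(lines, fieldnames=FIELDNAMES,
                            quotechar='"', delimiter=" ")
    veggies, fruits = [], []
    for row in reader:
        # Both checks run on every row, so a row may land in both lists.
        if re.search("veggies", row["url"]):
            veggies.append(row)
        if re.search("fruits", row["url"]):
            fruits.append(row)
    return veggies, fruits

# Invented, simplified log lines used only for illustration.
sample = [
    '1.2.3.4 - - [10/Oct/2012:13:55:36 +0000] "GET /shop/veggies HTTP/1.1" "-" "curl"',
    '5.6.7.8 - - [10/Oct/2012:13:56:01 +0000] "GET /shop/fruits HTTP/1.1" "-" "curl"',
]

veggies, fruits = split_rows(sample)
print(len(veggies), len(fruits))
```

Because each row is examined exactly once, nothing has to be buffered or re-parsed.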
The alternatives are:
Recreate the csv.DictReader() object for the fruits list:
query_dict = csv.DictReader(query,fieldnames=('ip','na-1','na-2','time', 'zone', 'url', 'refer', 'client'),quotechar='"',delimiter=" ")
veggies = [ x for x in query_dict if re.search('veggies',x['url']) ]
query_dict = csv.DictReader(query,fieldnames=('ip','na-1','na-2','time', 'zone', 'url', 'refer', 'client'),quotechar='"',delimiter=" ")
fruits = [ x for x in query_dict if re.search('fruits',x['url']) ]
This does double the work; you parse and loop over the whole data set twice.
Use itertools.tee() to 'clone' the iterator:
from itertools import tee
veggies_query_dict, fruits_query_dict = tee(query_dict)
veggies = [ x for x in veggies_query_dict if re.search('veggies',x['url']) ]
fruits = [ x for x in fruits_query_dict if re.search('fruits',x['url']) ]
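This ends up caching all of query_dict in the tee buffer while the veggies list is built, so the same task needs twice the memory until the fruits comprehension has drained the buffer again.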
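Since one clone is fully consumed before the other starts, the tee buffer ends up holding every row anyway. The itertools documentation notes that in this situation list() is generally faster than tee(). A minimal sketch of that variant follows, again with invented two-column sample rows; rows and sample are illustrative names, not part of the original code.

```python
import csv
import re

# Hypothetical two-column rows standing in for the filtered log lines.
sample = ["1.2.3.4 /veggies", "5.6.7.8 /fruits", "9.9.9.9 /veggies"]
reader = csv.DictReader(sample, fieldnames=("ip", "url"), delimiter=" ")

# Materialise the rows once; unlike the reader, a list can be iterated repeatedly.
rows = list(reader)

veggies = [x for x in rows if re.search("veggies", x["url"])]
fruits = [x for x in rows if re.search("fruits", x["url"])]

print(len(veggies), len(fruits))  # 2 1
```

The memory cost is comparable to the tee() version, since all rows are held at once, so the single loop remains the leaner choice for large logs.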