leo*_*_21 10 python regex sorting dictionary python-3.x
I'm new to Python and have run into the problem below. I have an Apache log file that looks like this:
[01/Aug/1995:00:54:59 -0400] "GET /images/opf-logo.gif HTTP/1.0" 200 32511
[01/Aug/1995:00:55:04 -0400] "GET /images/ksclogosmall.gif HTTP/1.0" 200 3635
[01/Aug/1995:00:55:06 -0400] "GET /images/ksclogosmall.gif HTTP/1.0" 403 298
[01/Aug/1995:00:55:09 -0400] "GET /images/ksclogosmall.gif HTTP/1.0" 200 3635
[01/Aug/1995:00:55:18 -0400] "GET /images/opf-logo.gif HTTP/1.0" 200 32511
[01/Aug/1995:00:56:52 -0400] "GET /images/ksclogosmall.gif HTTP/1.0" 200 3635
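I need to return the 10 most requested objects along with their cumulative bytes transferred. Only GET requests with a successful (HTTP 2xx) response should be included.
So the log above would result in: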
/images/ksclogosmall.gif 10905
/images/opf-logo.gif 65022
So far I have the following code:
import re
import sys

log_file = "web.log"
pattern = re.compile(
    r'\[(?P<date>[^\[\]:]+):(?P<time>\d+:\d+:\d+) (?P<timezone>[\-+]?\d\d\d\d)\] '
    + r'"(?P<method>\w+) (?P<path>[\S]+) (?P<protocol>[^"]+)" (?P<status>\d+) (?P<bytes_xfd>-|\d+)')

dict_list = []
# The with block closes the file on exit, so no explicit f.close() is needed.
with open(log_file, "r") as f:
    for line in f:
        # Keep only GET requests that got a 2xx response.
        if re.search("GET", line) and re.search(r'HTTP/[\d.]+"\s[2]\d{2}', line):
            try:
                log_line_data = pattern.match(line)
                path = log_line_data["path"]
                bytes_transferred = int(log_line_data["bytes_xfd"])
                dict_list.append({path: bytes_transferred})
            except Exception:
                print("Unexpected Error: ", sys.exc_info()[0])
                raise
print(dict_list)
This code prints the following list of dictionaries.
[{'/images/opf-logo.gif': 32511},
{'/images/ksclogosmall.gif': 3635},
{'/images/ksclogosmall.gif': 3635},
{'/images/opf-logo.gif': 32511},
{'/images/ksclogosmall.gif': 3635}]
I don't know how to get from here to this result:
/images/ksclogosmall.gif 10905
/images/opf-logo.gif 65022
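The result is essentially the sum of the values that share the same key, ordered by the number of times each key occurs, most frequent first.
Note: I tried using collections.Counter to no avail; what I want here is to sort by the number of times each key occurs.
Any help would be appreciated.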
You can use collections.Counter and update it to add up the bytes transferred for each object:
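from collections import Counter

# Sum the bytes per path; Counter.update adds to the value of an existing key.
c = Counter()
for d in dict_list:
    c.update(d)
# Count how many times each path was requested.
occurrences = Counter(list(x.keys())[0] for x in dict_list)
# Order the byte totals by request count, most requested first.
sorted(c.items(), key=lambda x: occurrences[x[0]], reverse=True)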
Output:
[('/images/ksclogosmall.gif', 10905), ('/images/opf-logo.gif', 65022)]
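This works because Counter.update adds to the value of a key that already exists, whereas dict.update overwrites it. A minimal sketch of the difference, using the sample values from the question (the names plain and totals are only for illustration):

```python
from collections import Counter

dict_list = [
    {'/images/opf-logo.gif': 32511},
    {'/images/ksclogosmall.gif': 3635},
    {'/images/ksclogosmall.gif': 3635},
    {'/images/opf-logo.gif': 32511},
    {'/images/ksclogosmall.gif': 3635},
]

plain = {}          # illustrative name: a regular dict for comparison
totals = Counter()  # illustrative name: running byte totals per path
for d in dict_list:
    plain.update(d)   # dict.update replaces the value of an existing key
    totals.update(d)  # Counter.update adds to the value of an existing key

print(plain)         # each path keeps only its last value
print(dict(totals))  # each path holds the sum of all its values
```

With a plain dict the totals would be lost, which is why the Counter is needed for the byte sums.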
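First, a list of dictionaries doesn't make sense for this kind of data. Since each dictionary holds only a single key-value pair, just build a list of tuples instead (or a list of namedtuples if you want more readability).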
tuple_list.append((path, bytes_transferred))
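If the namedtuple variant is preferred, it could look roughly like the sketch below. The type name Request, its field names, and the list name request_list are hypothetical choices for illustration, not something from the question:

```python
from collections import namedtuple

# Hypothetical record type; the name and fields are illustrative only.
Request = namedtuple("Request", ["path", "bytes_transferred"])

request_list = []
request_list.append(Request("/images/opf-logo.gif", 32511))
request_list.append(Request("/images/ksclogosmall.gif", 3635))

# Fields can be read by name...
first = request_list[0]
print(first.path, first.bytes_transferred)

# ...and each record still unpacks like a plain tuple.
for path, bytes_transferred in request_list:
    print(path, bytes_transferred)
```

Because a namedtuple unpacks like a regular tuple, any loop written for a list of plain tuples keeps working unchanged.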
Now getting the result you want is much more straightforward. Personally, I would use a defaultdict.
from collections import defaultdict

# Collect every transfer size under its path.
tracker = defaultdict(list)
for path, bytes_transferred in tuple_list:
    tracker[path].append(bytes_transferred)
# {'/images/opf-logo.gif': [32511, 32511], '/images/ksclogosmall.gif': [3635, 3635, 3635]}
# Sort by number of requests (longest list first), then sum the bytes per path.
print([(p, sum(b)) for p, b in sorted(tracker.items(), key=lambda i: -len(i[1]))])
# [('/images/ksclogosmall.gif', 10905), ('/images/opf-logo.gif', 65022)]
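Putting the pieces together, parsing, filtering, and ranking could be done in one pass, limited to the 10 most requested objects as the question asks. This is only a sketch under a few assumptions: the function name top_requested and its parameter n are hypothetical, the regex is the question's pattern with the status narrowed to three digits, a byte count of "-" is treated as 0, and lines that do not match the pattern are skipped rather than raising.

```python
import re
from collections import defaultdict

# The question's pattern, with the status narrowed to exactly three digits.
LOG_PATTERN = re.compile(
    r'\[(?P<date>[^\[\]:]+):(?P<time>\d+:\d+:\d+) (?P<timezone>[\-+]?\d{4})\] '
    r'"(?P<method>\w+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<bytes_xfd>-|\d+)'
)


def top_requested(lines, n=10):
    """Return up to n (path, total_bytes) pairs, most requested first.

    Hypothetical helper: only successful (2xx) GET requests are counted.
    """
    stats = defaultdict(lambda: [0, 0])  # path -> [request count, total bytes]
    for line in lines:
        # search() tolerates a prefix (such as a host field) before the date.
        m = LOG_PATTERN.search(line)
        if m is None:
            continue  # assumption: silently skip lines that do not parse
        if m["method"] != "GET" or not m["status"].startswith("2"):
            continue
        size = m["bytes_xfd"]
        entry = stats[m["path"]]
        entry[0] += 1
        entry[1] += 0 if size == "-" else int(size)  # assumption: "-" means 0 bytes
    # sorted() is stable, so paths with equal counts keep first-seen order.
    ranked = sorted(stats.items(), key=lambda item: item[1][0], reverse=True)
    return [(path, total) for path, (count, total) in ranked[:n]]


sample = [
    '[01/Aug/1995:00:54:59 -0400] "GET /images/opf-logo.gif HTTP/1.0" 200 32511',
    '[01/Aug/1995:00:55:04 -0400] "GET /images/ksclogosmall.gif HTTP/1.0" 200 3635',
    '[01/Aug/1995:00:55:06 -0400] "GET /images/ksclogosmall.gif HTTP/1.0" 403 298',
    '[01/Aug/1995:00:55:09 -0400] "GET /images/ksclogosmall.gif HTTP/1.0" 200 3635',
    '[01/Aug/1995:00:55:18 -0400] "GET /images/opf-logo.gif HTTP/1.0" 200 32511',
    '[01/Aug/1995:00:56:52 -0400] "GET /images/ksclogosmall.gif HTTP/1.0" 200 3635',
]

for path, total in top_requested(sample):
    print(path, total)
```

For a real file, the open file object can be passed directly, for example top_requested(f) inside a with open("web.log") as f: block, since the function only iterates over lines.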