从 syslog 日志文件中快速提取时间范围?

mik*_*ike 12 linux log-files grep

我有一个标准 syslog 格式的日志文件。它看起来像这样,除了每秒数百行:

Jan 11 07:48:46 blahblahblah...
Jan 11 07:49:00 blahblahblah...
Jan 11 07:50:13 blahblahblah...
Jan 11 07:51:22 blahblahblah...
Jan 11 07:58:04 blahblahblah...
Run Code Online (Sandbox Code Playgroud)

它不会恰好在午夜滚动,但它永远不会超过两天。

我经常需要从这个文件中提取一个时间片。我想为此编写一个通用脚本,我可以调用它:

$ timegrep 22:30-02:00 /logs/something.log
Run Code Online (Sandbox Code Playgroud)

...并让它从 22:30 开始拉出线路,越过午夜边界,直到第二天凌晨 2 点。

有几个注意事项:

  • 我不想费心在命令行上输入日期,只是时间。该程序应该足够聪明以解决它们。
  • 日志日期格式不包括年份,因此它应该根据当前年份进行猜测,但仍然在元旦前后做正确的事情。
  • 我希望它很快——它应该使用这样的事实,即行是为了在文件中寻找并使用二进制搜索。

在我花大量时间写这篇文章之前,它是否已经存在?

Den*_*son 9

更新:我用更新版本替换了原始代码,并进行了大量改进。让我们称之为(实际?)alpha 质量。

此版本包括:

  • 命令行选项处理
  • 命令行日期格式验证
  • 一些try街区
  • 行阅读移动到一个函数中

原文:

好吧,你知道什么?“寻找”你就会找到!这是一个 Python 程序,它在文件中四处寻找并使用或多或少的二进制搜索。这大大高于AWK脚本更快,其他人写道。

它是(预?)alpha 质量。它应该有try块和输入验证以及大量的测试,并且毫无疑问可以更加 Pythonic。但这里是为了你的消遣。哦,它是为 Python 2.6 编写的。

新代码:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# timegrep.py by Dennis Williamson 20100113
# in response to http://serverfault.com/questions/101744/fast-extraction-of-a-time-range-from-syslog-logfile

# thanks to serverfault user http://serverfault.com/users/1545/mike
# for the inspiration

# Perform a binary search through a log file to find a range of times
# and print the corresponding lines

# tested with Python 2.6

# TODO: Make sure that it works if the seek falls in the middle of
#       the first or last line
# TODO: Make sure it's not blind to a line where the sync read falls
#       exactly at the beginning of the line being searched for and
#       then gets skipped by the second read
# TODO: accept arbitrary date

# done: add -l long and -s short options
# done: test time format

version = "0.01a"

import os, sys
from stat import *
from datetime import date, datetime
import re
from optparse import OptionParser

# Function to read lines from file and extract the date and time
def getdata():
    """Read a line from a file

    Return a tuple containing:
        the date/time in a format such as 'Jan 15 20:14:01'
        the line itself

    The last colon and seconds are optional and
    not handled specially

    """
    try:
        line = handle.readline(bufsize)
    except:
        print("File I/O Error")
        exit(1)
    if line == '':
        print("EOF reached")
        exit(1)
    if line[-1] == '\n':
        line = line.rstrip('\n')
    else:
        if len(line) >= bufsize:
            print("Line length exceeds buffer size")
        else:
            print("Missing newline")
        exit(1)
    words = line.split(' ')
    if len(words) >= 3:
        linedate = words[0] + " " + words[1] + " " + words[2]
    else:
        linedate = ''
    return (linedate, line)
# End function getdata()

# Set up option handling
parser = OptionParser(version = "%prog " + version)

parser.usage = "\n\t%prog [options] start-time end-time filename\n\n\
\twhere times are in the form hh:mm[:ss]"

parser.description = "Search a log file for a range of times occurring yesterday \
and/or today using the current time to intelligently select the start and end. \
A date may be specified instead. Seconds are optional in time arguments."

parser.add_option("-d", "--date", action = "store", dest = "date",
                default = "",
                help = "NOT YET IMPLEMENTED. Use the supplied date instead of today.")

parser.add_option("-l", "--long", action = "store_true", dest = "longout",
                default = False,
                help = "Span the longest possible time range.")

parser.add_option("-s", "--short", action = "store_true", dest = "shortout",
                default = False,
                help = "Span the shortest possible time range.")

parser.add_option("-D", "--debug", action = "store", dest = "debug",
                default = 0, type = "int",
                help = "Output debugging information.\t\t\t\t\tNone (default) = %default, Some = 1, More = 2")

(options, args) = parser.parse_args()

if not 0 <= options.debug <= 2:
    parser.error("debug level out of range")
else:
    debug = options.debug    # 1 = print some debug output, 2 = print a little more, 0 = none

if options.longout and options.shortout:
    parser.error("options -l and -s are mutually exclusive")

if options.date:
    parser.error("date option not yet implemented")

if len(args) != 3:
    parser.error("invalid number of arguments")

start = args[0]
end   = args[1]
file  = args[2]

# test for times to be properly formatted, allow hh:mm or hh:mm:ss
p = re.compile(r'(^[2][0-3]|[0-1][0-9]):[0-5][0-9](:[0-5][0-9])?$')

if not p.match(start) or not p.match(end):
    print("Invalid time specification")
    exit(1)

# Determine Time Range
yesterday = date.fromordinal(date.today().toordinal()-1).strftime("%b %d")
today     = datetime.now().strftime("%b %d")
now       = datetime.now().strftime("%R")

if start > now or start > end or options.longout or options.shortout:
    searchstart = yesterday
else:
    searchstart = today

if (end > start > now and not options.longout) or options.shortout:
    searchend = yesterday
else:
    searchend = today

searchstart = searchstart + " " + start
searchend = searchend + " " + end

try:
    handle = open(file,'r')
except:
    print("File Open Error")
    exit(1)

# Set some initial values
bufsize = 4096  # handle long lines, but put a limit them
rewind  =  100  # arbitrary, the optimal value is highly dependent on the structure of the file
limit   =   75  # arbitrary, allow for a VERY large file, but stop it if it runs away
count   =    0
size    =    os.stat(file)[ST_SIZE]
beginrange   = 0
midrange     = size / 2
oldmidrange  = midrange
endrange     = size
linedate     = ''

pos1 = pos2  = 0

if debug > 0: print("File: '{0}' Size: {1} Today: '{2}' Now: {3} Start: '{4}' End: '{5}'".format(file, size, today, now, searchstart, searchend))

# Seek using binary search
while pos1 != endrange and oldmidrange != 0 and linedate != searchstart:
    handle.seek(midrange)
    linedate, line = getdata()    # sync to line ending
    pos1 = handle.tell()
    if midrange > 0:             # if not BOF, discard first read
        if debug > 1: print("...partial: (len: {0}) '{1}'".format((len(line)), line))
        linedate, line = getdata()

    pos2 = handle.tell()
    count += 1
    if debug > 0: print("#{0} Beg: {1} Mid: {2} End: {3} P1: {4} P2: {5} Timestamp: '{6}'".format(count, beginrange, midrange, endrange, pos1, pos2, linedate))
    if  searchstart > linedate:
        beginrange = midrange
    else:
        endrange = midrange
    oldmidrange = midrange
    midrange = (beginrange + endrange) / 2
    if count > limit:
        print("ERROR: ITERATION LIMIT EXCEEDED")
        exit(1)

if debug > 0: print("...stopping: '{0}'".format(line))

# Rewind a bit to make sure we didn't miss any
seek = oldmidrange
while linedate >= searchstart and seek > 0:
    if seek < rewind:
        seek = 0
    else:
        seek = seek - rewind
    if debug > 0: print("...rewinding")
    handle.seek(seek)

    linedate, line = getdata()    # sync to line ending
    if debug > 1: print("...junk: '{0}'".format(line))

    linedate, line = getdata()
    if debug > 0: print("...comparing: '{0}'".format(linedate))

# Scan forward
while linedate < searchstart:
    if debug > 0: print("...skipping: '{0}'".format(linedate))
    linedate, line = getdata()

if debug > 0: print("...found: '{0}'".format(line))

if debug > 0: print("Beg: {0} Mid: {1} End: {2} P1: {3} P2: {4} Timestamp: '{5}'".format(beginrange, midrange, endrange, pos1, pos2, linedate))

# Now that the preliminaries are out of the way, we just loop,
#     reading lines and printing them until they are
#     beyond the end of the range we want

while linedate <= searchend:
    print line
    linedate, line = getdata()

if debug > 0: print("Start: '{0}' End: '{1}'".format(searchstart, searchend))
handle.close()
Run Code Online (Sandbox Code Playgroud)


Mic*_*aff 0

从网上的快速搜索来看,有些东西是根据关键字提取的(例如 FIRE 等:),但没有任何东西可以从文件中提取日期范围。

做你建议的事情似乎并不难:

  1. 搜索开始时间。
  2. 打印出该行。
  3. 如果结束时间<开始时间,并且一行的日期>结束且<开始,则停止。
  4. 如果结束时间 > 开始时间,并且一行的日期 > 结束,则停止。

看起来很简单,如果你不介意 Ruby,我可以为你写它:)