Split files into time-period files based on unix timestamp

swh*_*mwo 1 — tags: unix, bash, awk, timestamp

I have several thousand log (.txt) files (their names and order do not matter, nor does the order of entries in the final output files), each consisting of lines with a unix timestamp and a value, such as:

infile1.txt:
1361775157 a
1361775315 b            
1379007707 c
1379014884 d

infile2.txt:
1360483293 e
1361384920 f
1372948120 g
1373201928 h

My goal is to split them into arbitrarily defined time intervals (e.g. in this case with 1360000000, 1370000000 and 1380000000 as the bounds), so that I get one output file per interval:

1360000000-1370000000.txt:
1361775157 a 
1361775315 b    
1360483293 e
1361384920 f        

1370000000-1380000000.txt:
1379007707 c
1379014884 d
1372948120 g
1373201928 h

My current approach is to run a script once per time period (with the period's start and end as the first and second arguments), which filters the matching entries out of every file and appends them to that period's output file:

#!/bin/bash

for i in *txt; do
    awk -v t1="$1" -v t2="$2" '$1 >= t1 && $1 < t2' "$i" >> "elsewhere/$1-$2.txt"
done
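To make the cost concrete, here is a self-contained sketch of that per-interval approach, with the loop over intervals inlined in place of the $1/$2 arguments. The sample data, bounds, and the elsewhere/ directory are taken from the question; every input file is re-read once per interval:

```shell
#!/bin/bash
# Sketch of the per-interval approach: each pass over the intervals
# re-reads every input file. Data and bounds come from the question.
mkdir -p elsewhere
printf '1361775157 a\n1379007707 c\n' > infile1.txt
printf '1360483293 e\n1372948120 g\n' > infile2.txt
for bounds in "1360000000 1370000000" "1370000000 1380000000"; do
    set -- $bounds      # deliberate word splitting: $1 = start, $2 = end
    for i in *txt; do
        awk -v t1="$1" -v t2="$2" '$1 >= t1 && $1 < t2' "$i" >> "elsewhere/$1-$2.txt"
    done
done
```

With N intervals, this reads every input file N times, which is exactly the inefficiency the question is about.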

However, this means every file is read once per time period, which seems very inefficient to me. Is there a way to read each file only once and append each line to the file corresponding to its time period?

Ed *_*ton 5

I would use an approach like this:

$ cat tst.awk
{
    bucket = int($1/inc)
    print $0 " > " ( (inc*bucket) "-" (inc*(bucket+1)-1) ".txt" )
}

$ awk -v inc='10000000' -f tst.awk file1 file2
1361775157 a > 1360000000-1369999999.txt
1361775315 b > 1360000000-1369999999.txt
1379007707 c > 1370000000-1379999999.txt
1379014884 d > 1370000000-1379999999.txt
1360483293 e > 1360000000-1369999999.txt
1361384920 f > 1360000000-1369999999.txt
1372948120 g > 1370000000-1379999999.txt
1373201928 h > 1370000000-1379999999.txt
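The bucket arithmetic can be checked in isolation: with inc=10000000, int(1361775157/10000000) is 136, so the target interval runs from 136*inc = 1360000000 to 137*inc - 1 = 1369999999. Inlining tst.awk for a single record confirms this:

```shell
# tst.awk inlined for one timestamp; inc matches the example above.
echo '1361775157 a' | awk -v inc=10000000 '{
    bucket = int($1/inc)
    print $0 " > " ( (inc*bucket) "-" (inc*(bucket+1)-1) ".txt" )
}'
# prints: 1361775157 a > 1360000000-1369999999.txt
```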

If you are using GNU awk (which handles closing/reopening files for you as needed), then once you are done testing simply change $0 " > " to > so the output is redirected to the file instead of printed; otherwise, do the following:

{
    bucket = int($1/inc)
    if ( bucket != prev ) {
        close(out)
        out = (inc*bucket) "-" (inc*(bucket+1)-1) ".txt"
        prev = bucket
    }
    print >> out
}

This works in any awk.
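For completeness, here is that portable version run end to end on the question's sample data. The script filename split.awk and the input names file1/file2 are chosen for this sketch:

```shell
# Save the portable awk script to a file and run it over the sample data.
cat > split.awk <<'EOF'
{
    bucket = int($1/inc)
    if ( bucket != prev ) {
        close(out)                 # close the previously opened output file
        out = (inc*bucket) "-" (inc*(bucket+1)-1) ".txt"
        prev = bucket
    }
    print >> out                   # append, so a reopened bucket keeps earlier lines
}
EOF
printf '1361775157 a\n1379007707 c\n' > file1
printf '1360483293 e\n1372948120 g\n' > file2
rm -f 1360000000-1369999999.txt 1370000000-1379999999.txt   # idempotent re-runs
awk -v inc=10000000 -f split.awk file1 file2
# Produces 1360000000-1369999999.txt (lines a and e)
# and 1370000000-1379999999.txt (lines c and g).
```

Because the output redirection uses >>, a bucket that is closed and later reopened (as happens here when the second file revisits earlier timestamps) keeps the lines written before it was closed.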