alv*_*vas 9 bash awk split sed file
How can I split a file by percentage of its number of lines?
Say I want to split a file into 3 parts (60%/20%/20%). I could do this by hand, -_-:
$ wc -l brown.txt
57339 brown.txt
$ bc <<< "57339 / 10 * 6"
34398
$ bc <<< "57339 / 10 * 2"
11466
$ bc <<< "34398 + 11466"
45864
$ bc <<< "34398 + 11466 + 11475"
57339
$ head -n 34398 brown.txt > part1.txt
$ sed -n 34399,45864p brown.txt > part2.txt
$ sed -n 45865,57339p brown.txt > part3.txt
$ wc -l part*.txt
34398 part1.txt
11466 part2.txt
11475 part3.txt
57339 total
But I'm sure there is a better way!
$ cat file
a
b
c
d
e
$ cat tst.awk
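BEGIN {
    split(pcts, p)                      # percentages, e.g. "60 20 20"
    nrs[1]                              # the first part always starts at line 1
    for (i = 1; i in p; i++) {
        pct += p[i]                     # cumulative percentage so far
        nrs[int(size * pct / 100) + 1]  # line number where the next part starts
    }
}
# At each starting line, close the previous output and switch to the next file name
NR in nrs { close(out); out = "part" ++fileNr ".txt" }
{ print $0 " > " out }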
$ awk -v size=$(wc -l < file) -v pcts="60 20 20" -f tst.awk file
a > part1.txt
b > part1.txt
c > part1.txt
d > part2.txt
e > part3.txt
Change the " > " string in the print statement to a bare > redirection (print > out) to actually write the output files.
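As a rough sketch of that change, the script below is the same tst.awk with only the last rule altered so that it writes to the files instead of printing their names. The scratch directory and the five-line sample file are assumptions made for the demo, not part of the original answer.

```shell
# Work in a scratch directory so the demo does not touch existing files.
cd "$(mktemp -d)" || exit 1

# Five-line sample input, same content as the file shown above.
printf '%s\n' a b c d e > file

# tst.awk with the single change: the last rule redirects to the file named in "out".
cat > tst.awk <<'EOF'
BEGIN {
    split(pcts, p)
    nrs[1]
    for (i = 1; i in p; i++) {
        pct += p[i]
        nrs[int(size * pct / 100) + 1]
    }
}
NR in nrs { close(out); out = "part" ++fileNr ".txt" }
{ print > out }
EOF

awk -v size="$(wc -l < file)" -v pcts="60 20 20" -f tst.awk file

# part1.txt should now hold a, b, c; part2.txt holds d; part3.txt holds e.
wc -l part*.txt
```

Quoting the $(wc -l < file) substitution keeps the assignment intact on systems where wc pads its output with spaces.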
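There is a utility that takes line numbers as arguments, each of which becomes the first line of the corresponding new file: csplit. Here is a wrapper around its POSIX version: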
#!/bin/bash

usage () {
    printf '%s\n' "${0##*/} [-ks] [-f prefix] [-n number] file arg1..." >&2
}

# Collect csplit options
while getopts "ksf:n:" opt; do
    case "$opt" in
        k|s) args+=(-"$opt") ;;           # k: no remove on error, s: silent
        f|n) args+=(-"$opt" "$OPTARG") ;; # f: filename prefix, n: digits in number
        *) usage; exit 1 ;;
    esac
done
shift $(( OPTIND - 1 ))

fname=$1
shift
ratios=("$@")

len=$(wc -l < "$fname")

# Sum of ratios and array of cumulative ratios
for ratio in "${ratios[@]}"; do
    (( total += ratio ))
    cumsums+=("$total")
done

# Don't need the last element (negative subscripts need Bash 4.3 or newer;
# the quotes keep the shell from treating the brackets as a glob pattern)
unset 'cumsums[-1]'

# Array of numbers of first line in each split file
for sum in "${cumsums[@]}"; do
    linenums+=( $(( sum * len / total + 1 )) )
done

csplit "${args[@]}" "$fname" "${linenums[@]}"
After the name of the file to split, it takes the ratios of the sizes of the split files relative to their sum, i.e.,
percsplit brown.txt 60 20 20
percsplit brown.txt 6 2 2
percsplit brown.txt 3 1 1
are all equivalent.
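To see why these are equivalent, here is a minimal sketch that repeats the wrapper's integer arithmetic in a standalone function. The name first_lines is hypothetical and only used for illustration; it prints the first line number of every part after the first.

```shell
# first_lines LEN RATIO... (hypothetical helper, not part of the wrapper)
# Prints the line numbers that would be handed to csplit.
first_lines () {
    len=$1
    shift
    total=0
    for ratio in "$@"; do
        total=$(( total + ratio ))
    done
    n=$# i=0 sum=0 out=
    for ratio in "$@"; do
        i=$(( i + 1 ))
        # The last cumulative sum is skipped: the last part runs to end of file.
        [ "$i" -lt "$n" ] || break
        sum=$(( sum + ratio ))
        out="$out${out:+ }$(( sum * len / total + 1 ))"
    done
    printf '%s\n' "$out"
}

first_lines 57339 60 20 20
first_lines 57339 6 2 2
first_lines 57339 3 1 1
```

All three calls print 34404 45872, because scaling every ratio by the same factor scales the running sum and the total alike, so the integer quotient does not change.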
Usage similar to the case in the question looks like this:
$ percsplit -s -f part -n 1 brown.txt 60 20 20
$ wc -l part*
34403 part0
11468 part1
11468 part2
57339 total
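The numbering starts at zero, though, and there is no txt extension. The GNU version supports a --suffix-format option that would allow a .txt extension, and it could be added to the accepted arguments, but that would require something more elaborate than getopts to parse them.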
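For illustration, calling GNU csplit directly with that option might look like the sketch below. It assumes GNU coreutils (the -b/--suffix-format option is not in POSIX csplit), and the ten-line sample file and scratch directory are made up for the demo.

```shell
# Scratch directory and a ten-line stand-in input file (assumptions for the demo).
cd "$(mktemp -d)" || exit 1
printf '%s\n' 1 2 3 4 5 6 7 8 9 10 > sample.txt

# -s: silent, -f: file name prefix, -b: printf-style suffix format (GNU only).
# 7 and 9 are the first lines of parts two and three for a 60/20/20 split of 10 lines.
csplit -s -f part -b '%d.txt' sample.txt 7 9

# Produces part0.txt (6 lines), part1.txt (2 lines) and part2.txt (2 lines).
wc -l part*.txt
```

The numbering still starts at zero; the format string only changes how the counter is rendered.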
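This solution plays nicely with very short files (for example, splitting a two-line file into two parts), and the heavy lifting is done by csplit itself.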
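The two-line case can be checked by hand with a small sketch. It applies the wrapper's formula to ratios 1 1 and then calls csplit directly; the file name two.txt and the scratch directory are assumptions for the demo.

```shell
# Scratch directory and a two-line input file (assumptions for the demo).
cd "$(mktemp -d)" || exit 1
printf '%s\n' first second > two.txt

len=$(wc -l < two.txt)

# Ratios 1 1: cumulative sum 1 out of total 2, so part two starts at 1 * 2 / 2 + 1 = 2.
start=$(( 1 * len / 2 + 1 ))

# With no -f option csplit uses its default prefix, giving xx00 and xx01.
csplit -s two.txt "$start"

cat xx00
cat xx01
```

Each output file ends up with exactly one line.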