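I have a large file to parse and reformat, preferably with sed (under bash). The file contains repeating sequences that begin with `PATTERN_START` and end with `PATTERN_END`. These sequences are mixed with other text that I must leave unchanged. Each sequence holds several records, numbered 1 to n, where n can be from 1 to 12. A record is a group of lines that starts with a line of the form `Record i`, where i is an integer between 1 and n, and ends just before either another such line (`Record (i+1)`) or a `PATTERN_END` line. A record can be 1 to 30 lines long.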
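Here is a generic representation of the input file:
Irrelevant data (possibly many lines)
PATTERN_START
Record 1
Data for record 1 (up to 30 lines)
Record 2
Data for record 2
(up to 12 records)
PATTERN_END
Irrelevant data (possibly many lines)
(the whole layout is repeated many times)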
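So, only for the records located between `PATTERN_START` and `PATTERN_END`, I want all the data lines of each record gathered onto that record's `Record` line.
Can anybody help?
Below is a sample of the file I have to parse, followed by the kind of result I would like: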
Blabla
Blabla
PATTERN_OTHER
Record 1 <- record not between PATTERN_START and PATTERN_END tags => do not touch it
Data
Data
PATTERN_END
Blabla
PATTERN_START
Record 1 <- record between PATTERN_START and PATTERN_END tags => to put in one line
Data
Data
Data
Record 2 <- record between PATTERN_START and PATTERN_END tags => to put in one line
Data
Data
Record 3 <- record between PATTERN_START and PATTERN_END tags => to put in one line
Data
Data
Data
Data
PATTERN_END
Blabla
Blabla
Blabla
Blabla
PATTERN_START
Record 1 <- record between PATTERN_START and PATTERN_END tags => to put in one line
Data
Data
Data
PATTERN_END
Blabla
Blabla
PATTERN_OTHER
Record 1 <- record not between PATTERN_START and PATTERN_END tags => do not touch it
Data
Data
Record 2 <- record not between PATTERN_START and PATTERN_END tags => do not touch it
Data
PATTERN_END
Blabla
Blabla
PATTERN_START
Record 1 <- record between PATTERN_START and PATTERN_END tags => to put in one line
Data
Record 2 <- record between PATTERN_START and PATTERN_END tags => to put in one line
Data
Data
Data
PATTERN_END
Blabla
Blabla
Desired result:
Blabla
Blabla
PATTERN_OTHER
Record 1 <- was not between PATTERN_START and PATTERN_END tags => not modified
Data
Data
PATTERN_END
Blabla
PATTERN_START
Record 1 Data Data Data <- record data grouped in one line
Record 2 Data Data <- record data grouped in one line
Record 3 Data Data Data Data <- record data grouped in one line
PATTERN_END
Blabla
Blabla
Blabla
Blabla
PATTERN_START
Record 1 Data Data Data <- record data grouped in one line
PATTERN_END
Blabla
Blabla
PATTERN_OTHER
Record 1 <- was not between PATTERN_START and PATTERN_END tags => not modified
Data
Data
Record 2 <- was not between PATTERN_START and PATTERN_END tags => not modified
Data
PATTERN_END
Blabla
Blabla
PATTERN_START
Record 1 Data <- record data grouped in one line
Record 2 Data Data Data <- record data grouped in one line
PATTERN_END
Blabla
Blabla
I think this is what you want, using GNU sed:
sed -n '/^PATTERN_START/,/^PATTERN_END/{
//!{H;/^Record/!{x;s/\n\([^\n]*\)$/ \1/;x}};
/^PATTERN_START/{h};/^PATTERN_END/{x;p;x;p};d
};p' file
sed -n '
# The -n option suppresses automatic printing.
/^PATTERN_START/,/^PATTERN_END/{
    # Run this block for every line in the range, both ends included.
    //!{
        # The empty regex reuses the last regex tested, here one of the
        # range addresses, so this block skips the PATTERN_START and
        # PATTERN_END lines themselves.
        H
        # Append the line to the hold space, which collects every line
        # between PATTERN_START and PATTERN_END.
        /^Record/!{
            # The line is a data line: swap in the hold space, turn its
            # last newline into a space so the data line joins the line
            # before it, then swap back.
            x;s/\n\([^\n]*\)$/ \1/;x
        }
    }
    /^PATTERN_START/{h}
    # On PATTERN_START, overwrite the hold space with this line,
    # discarding whatever an earlier range left there.
    /^PATTERN_END/{x;p;x;p}
    # On PATTERN_END, swap in the saved lines and print them, then swap
    # PATTERN_END back in and print it as well.
    d
    # Delete every line in the range; they are printed explicitly by
    # the PATTERN_END block above.
}
p' file
# Lines outside the range are printed as they are; file is the input file.
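To try the command above on something smaller than the full sample, a sketch like the following may help. It assumes GNU sed; the temporary file and the variable names `sample` and `result` are made up for the illustration, and the input is a shortened, hypothetical version of the question's data.

```shell
# Hypothetical sample input, written to a temporary file.
sample=$(mktemp)
cat > "$sample" <<'EOF'
Blabla
PATTERN_START
Record 1
Data
Data
Record 2
Data
PATTERN_END
Blabla
EOF

# The command from the answer, unchanged (GNU sed assumed).
result=$(sed -n '/^PATTERN_START/,/^PATTERN_END/{
//!{H;/^Record/!{x;s/\n\([^\n]*\)$/ \1/;x}};
/^PATTERN_START/{h};/^PATTERN_END/{x;p;x;p};d
};p' "$sample")

printf '%s\n' "$result"
rm -f "$sample"
```

With this input, each record between the two tags should come out on a single line (`Record 1 Data Data`, `Record 2 Data`), while both `Blabla` lines pass through untouched.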
This was the best I could come up with:
sed -n '/^PATTERN_START/, /^PATTERN_END/{
/^PATTERN_START/{x;s/^.*$//;x};
/^Record/{x;/^\n/{s/^\n//p;d};s/\n/ /gp};
/^PATTERN_END/{x;/^\n/{s/^\n//p;d};s/\n/ /gp;g;p};
/^Record/!H
};
/^PATTERN_START/, /^PATTERN_END/!p'
Explanation
I assume you are familiar with the idea of hold space and pattern space in sed. In this solution we will be doing a lot of manipulation in the pattern space, so the first step is to disable automatic printing with the `-n` option and print only where required.
The first task is to join all the lines that lie between `Record` lines.
Consider the following file:
a
b
Record 1
c
d
Record 2
e
f
Record 3
After joining lines, we want it to be
a
b
Record 1 c d
Record 2 e f
Record 3
So, here is the plan:
1. Read the lines one by one and append each line that does not start with `Record` to the hold space.
2. When a line starts with `Record`, it means that the previous record has finished and a new record has started. So we print out the hold space, flush it and start with point 1 again.
Point 1 is implemented by the code `/^Record/!H` (5th line in the command). It means "if the line doesn't start with `Record`, add a newline to the hold space and append this line to the hold space".
Point 2 can be implemented by the code
/^Record/{x;s/\n/ /gp;}
where `x` swaps the hold and pattern spaces, the `s` command replaces every `\n` with a space, and the `p` flag prints the pattern space (when a substitution was made). Using `x` also has the advantage that the hold space now contains the current `Record` line, so that we can begin another cycle of points 1 and 2.
But this has a problem. In the given example, there are two lines
a
b
before the first `Record` line. We don't want to replace `\n` with a space in these lines. Since they don't begin with `Record`, according to point 1 a `\n` is added to the hold space and then each line is appended. So, if the first character of the hold space is `\n`, no `Record` has been encountered yet and we should not replace `\n` with a space. This is done with the command
/^\n/{s/^\n//p;d}
So the entire command becomes
/^Record/{x;/^\n/{s/^\n//p;d};s/\n/ /gp};
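To see this rule working together with `/^Record/!H` on the small example file, here is a minimal sketch. It assumes GNU sed, and the variable names `input` and `joined` are only for illustration.

```shell
# The small example file from the explanation.
input='a
b
Record 1
c
d
Record 2
e
f
Record 3'

# Point 1 (append non-Record lines) plus point 2 with the leading-newline check.
joined=$(printf '%s\n' "$input" |
  sed -n '/^Record/{x;/^\n/{s/^\n//p;d};s/\n/ /gp};/^Record/!H')

printf '%s\n' "$joined"
```

This prints `a`, `b`, `Record 1 c d` and `Record 2 e f`. The last line, `Record 3`, is still sitting in the hold space when the input ends, because nothing after it triggers a flush; handling the terminating line is what the next step addresses.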
Now, the second complication: we want to join lines even if a record is terminated not by a `Record` line but by a `PATTERN_END` line. We want to do exactly the same things as in point 2 when the line starts with `PATTERN_END`. So the command becomes
/^PATTERN_END/{x;/^\n/{s/^\n//p;d};s/\n/ /gp}
But there is a problem with this. As in the case of `Record` lines, the `PATTERN_END` line now ends up in the hold space. But we know that no more lines will be joined after the `PATTERN_END` line, so we can print it out right away: we bring the `PATTERN_END` line into the pattern space with `g` and print it with `p`. So the final command becomes
/^PATTERN_END/{x;/^\n/{s/^\n//p;d};s/\n/ /gp;g;p}
The next issue is with the `PATTERN_START` lines. In the above explanation we assumed that the hold space is empty at the start. But after a `PATTERN_END`, there is something in the hold space (just the `PATTERN_END` line). When we start a new cycle with `PATTERN_START`, we want to clear the hold space.
So, when we encounter `PATTERN_START`, we swap the contents of the hold and pattern spaces, clear the pattern space and swap again. This leaves the hold space empty. This is exactly what the following command does:
/^PATTERN_START/{x;s/^.*$//;x}
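One way to check that the hold space really is empty afterwards is the sketch below. It assumes GNU sed; the `leftover` line, the `1h` seeding and the `(empty)` marker are made up for the demonstration and are not part of the answer's command.

```shell
# Seed the hold space with line 1, clear it on PATTERN_START, then
# show what the hold space contains when the last line is reached.
hold_after=$(printf '%s\n' 'leftover' 'PATTERN_START' |
  sed -n '1h;/^PATTERN_START/{x;s/^.*$//;x};${x;s/^$/(empty)/;p}')

printf '%s\n' "$hold_after"
```

This prints `(empty)`: the `leftover` text that was in the hold space is gone once the `PATTERN_START` line has been processed.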
The final touch is that we want to do all this fiddling only between `PATTERN_START` and `PATTERN_END` lines. All other lines are simply printed. This is done by the commands
/^PATTERN_START/, /^PATTERN_END/{
----above commands go here----
};
/^PATTERN_START/, /^PATTERN_END/!p
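The negated range on its own can be tried in isolation. In this small sketch (hypothetical input lines, variable name `outside` chosen for the example), only the lines outside the range are printed:

```shell
# With -n, the negated range address prints only lines outside START..END.
outside=$(printf '%s\n' 'x' 'PATTERN_START' 'y' 'PATTERN_END' 'z' |
  sed -n '/^PATTERN_START/,/^PATTERN_END/!p')

printf '%s\n' "$outside"
```

This prints `x` and `z`; the two tag lines and `y` are left to the block that handles the range.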
Put all these together and this gives the final command :)
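As a quick check of the assembled command, the sketch below runs it on a shortened, hypothetical input. It assumes GNU sed; the temporary file and the names `sample` and `result` are only for illustration.

```shell
# Hypothetical sample input, written to a temporary file.
sample=$(mktemp)
cat > "$sample" <<'EOF'
Blabla
PATTERN_START
Record 1
Data
Data
Record 2
Data
PATTERN_END
Blabla
EOF

# The final command from this answer, reading the sample file.
result=$(sed -n '/^PATTERN_START/, /^PATTERN_END/{
/^PATTERN_START/{x;s/^.*$//;x};
/^Record/{x;/^\n/{s/^\n//p;d};s/\n/ /gp};
/^PATTERN_END/{x;/^\n/{s/^\n//p;d};s/\n/ /gp;g;p};
/^Record/!H
};
/^PATTERN_START/, /^PATTERN_END/!p' "$sample")

printf '%s\n' "$result"
rm -f "$sample"
```

Here the two records come out as `Record 1 Data Data` and `Record 2 Data`, with the surrounding lines unchanged. One thing worth checking on real data: the `p` flag of `s` prints only when a substitution actually happened, so a record that has no data lines at all would not be printed by this command as written.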