How do I remove the newlines between the data lines of each record located between two patterns?

Syl*_*l33 5 sed

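I have a large file to parse and reformat, preferably with sed (under bash). The file contains repeating sequences that start with PATTERN_START and end with PATTERN_END. These sequences are mixed with other text that I must leave unchanged. Within a sequence there are several records (numbered from 1 to n, where n can be anywhere from 1 to 12). A record is a group of lines that begins with a line of the form Record i, where i is an integer between 1 and n, and ends at another such line (Record (i+1)) or at a PATTERN_END line. A record can be from 1 to 30 lines long.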

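Here is a generic representation of the input file:

unrelated data              (possibly many lines)
PATTERN_START               <- start of a sequence (sequences repeat many times)
Record 1
data for record 1           (up to 30 lines)
Record 2                    (up to 12 records per sequence)
data for record 2
PATTERN_END                 <- end of the sequence
unrelated data              (possibly many lines)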

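So, only for the records located between PATTERN_START and PATTERN_END, I would like all the data lines of each record to be gathered onto that record's Record line.

Can anyone help?

Below is a sample of the file I have to parse, and the kind of result I want:

Input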

Blabla
Blabla
PATTERN_OTHER
Record 1         <- record not between PATTERN_START and PATTERN_END tags => do not touch it
Data
Data
PATTERN_END
Blabla
PATTERN_START
Record 1         <- record between PATTERN_START and PATTERN_END tags => to put in one line
Data
Data
Data
Record 2         <- record between PATTERN_START and PATTERN_END tags => to put in one line
Data
Data
Record 3         <- record between PATTERN_START and PATTERN_END tags => to put in one line
Data
Data
Data
Data
PATTERN_END
Blabla
Blabla
Blabla
Blabla
PATTERN_START
Record 1         <- record between PATTERN_START and PATTERN_END tags => to put in one line
Data
Data
Data
PATTERN_END
Blabla
Blabla
PATTERN_OTHER
Record 1         <- record not between PATTERN_START and PATTERN_END tags => do not touch it
Data
Data
Record 2         <- record not between PATTERN_START and PATTERN_END tags => do not touch it
Data
PATTERN_END
Blabla
Blabla
PATTERN_START
Record 1         <- record between PATTERN_START and PATTERN_END tags => to put in one line
Data
Record 2         <- record between PATTERN_START and PATTERN_END tags => to put in one line
Data
Data
Data
PATTERN_END
Blabla
Blabla

Output

Blabla
Blabla
PATTERN_OTHER
Record 1         <- was not between PATTERN_START and PATTERN_END tags => not modified
Data
Data
PATTERN_END
Blabla
PATTERN_START
Record 1 Data Data Data        <- record data grouped in one line
Record 2 Data Data             <- record data grouped in one line
Record 3 Data Data Data Data   <- record data grouped in one line
PATTERN_END
Blabla
Blabla
Blabla
Blabla
PATTERN_START
Record 1 Data Data Data        <- record data grouped in one line
PATTERN_END
Blabla
Blabla
PATTERN_OTHER
Record 1         <- was not between PATTERN_START and PATTERN_END tags => not modified
Data
Data
Record 2         <- was not between PATTERN_START and PATTERN_END tags => not modified
Data
PATTERN_END
Blabla
Blabla
PATTERN_START
Record 1 Data                  <- record data grouped in one line
Record 2 Data Data Data        <- record data grouped in one line
PATTERN_END
Blabla
Blabla

123*_*123 8

I think this is what you want, using GNU sed:

 sed -n '/^PATTERN_START/,/^PATTERN_END/{
         //!{H;/^Record/!{x;s/\n\([^\n]*\)$/ \1/;x}};
         /^PATTERN_START/{h};/^PATTERN_END/{x;p;x;p};d
         };p' file

Explanation

sed -n '/^PATTERN_START/,/^PATTERN_END/{
# -n turns off automatic printing. The block below runs only for lines in the
# range from a PATTERN_START line to the next PATTERN_END line.

    //!{
    # The empty regex // reuses the last regex that was tested. It matches on
    # the PATTERN_START and PATTERN_END lines themselves, so this block runs
    # only for the lines strictly between them.

        H;
        # Append the line to the hold space.

        /^Record/!{
        # If the line does not begin with Record, run the next block.

            x;s/\n\([^\n]*\)$/ \1/;x
            # Swap in the hold space, replace its last newline with a space
            # (joining the data line just appended to the line before it),
            # then swap back.

        }
    };

    /^PATTERN_START/{h};
    # On a PATTERN_START line, overwrite the hold space with this line,
    # discarding anything left over from the previous range.

    /^PATTERN_END/{x;p;x;p}
    # On a PATTERN_END line, swap in the collected lines and print them, then
    # swap the PATTERN_END line back in and print it as well.

    ;d
    # Delete every line in the range; they are printed explicitly above.

};p' file
# Outside the range, print each line unchanged. `file` is the input file name.
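To try the command on something small, the sketch below wraps it in a shell function and feeds it a shortened sample. The function name `join_records` and the shortened sample are illustrative assumptions, not part of the answer; GNU sed is assumed, and the script reads standard input instead of a named file.

```shell
#!/bin/sh
# join_records is a hypothetical wrapper name; the sed script inside is the
# one from the answer above, unchanged apart from reading stdin.
join_records() {
    sed -n '/^PATTERN_START/,/^PATTERN_END/{
        //!{H;/^Record/!{x;s/\n\([^\n]*\)$/ \1/;x}};
        /^PATTERN_START/{h};/^PATTERN_END/{x;p;x;p};d
        };p'
}

# Shortened, made-up sample: one block that must stay untouched
# (PATTERN_OTHER) and one block whose records must be joined.
sample='Blabla
PATTERN_OTHER
Record 1
Data
PATTERN_END
PATTERN_START
Record 1
Data
Data
Record 2
Data
PATTERN_END
Blabla'

printf '%s\n' "$sample" | join_records
# Expected output:
#   Blabla
#   PATTERN_OTHER
#   Record 1
#   Data
#   PATTERN_END
#   PATTERN_START
#   Record 1 Data Data
#   Record 2 Data
#   PATTERN_END
#   Blabla
```

Note that the whole PATTERN_START block is buffered in the hold space and only printed when PATTERN_END is reached.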


nit*_*hch 6

This was the best I could come up with:

sed -n '/^PATTERN_START/, /^PATTERN_END/{
            /^PATTERN_START/{x;s/^.*$//;x};
            /^Record/{x;/^\n/{s/^\n//p;d};s/\n/ /gp};
            /^PATTERN_END/{x;/^\n/{s/^\n//p;d};s/\n/ /gp;g;p};
            /^Record/!H
        };   
        /^PATTERN_START/, /^PATTERN_END/!p'

Explanation

I assume you are familiar with the idea of hold space and pattern space in sed. In this solution we will be doing a lot of manipulation in the pattern space, so the first step is to disable automatic printing with the -n option and print explicitly wherever required.

First task is to join all the lines that are between Record lines.

Consider the following file:

a
b
Record 1
c
d
Record 2
e
f
Record 3

After joining lines, we want it to be

a
b
Record 1 c d
Record 2 e f
Record 3

So, here is the plan:

  1. We read a line, append it to the hold space.
  2. If the line starts with Record, it means that the previous record has finished and a new record has started. So we print out the hold space, flush it and start with point 1 again.

Point 1 is implemented by the code /^Record/!H (5th line in the command). It means "if the line doesn't start with Record, append a newline to the hold space and then append this line to the hold space".

Point 2 can be implemented by the code /^Record/{x;s/\n/ /gp;} where x swaps the hold and pattern spaces, the s command replaces every \n with a space, and the p flag prints the pattern space. Using x also has the advantage that the hold space now contains the current Record line, so we can begin another cycle of points 1 and 2.

But this has a problem. In the given example there are two lines, a and b, before the first Record line. We don't want to replace \n with a space in these lines. Since they don't begin with Record, according to point 1 a \n is added to the hold space and then these lines are appended. So if the first character of the hold space is \n, it means that no Record has been encountered yet and we should not replace \n with a space. This is done with the command

/^\n/{s/^\n//p;d}

So the entire command becomes

/^Record/{x;/^\n/{s/^\n//p;d};s/\n/ /gp};
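As a quick check of the steps so far, the sketch below runs this Record rule together with the /^Record/!H rule on the nine-line example file. The function name `join_simple` is a hypothetical wrapper, and GNU sed is assumed.

```shell
#!/bin/sh
# join_simple is a hypothetical name. It combines the two rules built so far:
# flush the collected record when a new Record line arrives, otherwise append
# the line to the hold space.
join_simple() {
    sed -n '/^Record/{x;/^\n/{s/^\n//p;d};s/\n/ /gp};/^Record/!H'
}

# The example file from the explanation, one argument per line.
printf '%s\n' a b 'Record 1' c d 'Record 2' e f 'Record 3' | join_simple
# Expected output:
#   a
#   b
#   Record 1 c d
#   Record 2 e f
```

Record 3 is still sitting in the hold space when the input ends, so this intermediate command never prints it. A record is only flushed when the next terminating line arrives, which is why the PATTERN_END line needs its own rule.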

Now, the second complication is, we want to join lines, even if a Record line is not terminated by a Record line but by a PATTERN_END line. We want to do the exact same things as in point 2, even when the line starts with PATTERN_END. So the command becomes

/^PATTERN_END/{x;/^\n/{s/^\n//p;d};s/\n/ /gp}

But, there is a problem with this. As in the case of Record lines, the PATTERN_END line now ends up in the hold space. But we know that there will be no more joining of lines after PATTERN_END line. So, we can print this out. So, we bring the PATTERN_END line to pattern space with g and print it with p. So the final command becomes

/^PATTERN_END/{x;/^\n/{s/^\n//p;d};s/\n/ /gp;g;p}

Next issue is with the PATTERN_START lines. In the above explanation we assumed that at the start, hold space is empty. But after a PATTERN_END, there is something in the hold space. (That something is just PATTERN_END line). When we start a new cycle with PATTERN_START, we want to clear the hold space.

So, what we do is when we encounter PATTERN_START, swap the contents of hold and pattern spaces, clear the pattern space and swap again. This makes hold space clean. This is exactly what the following command does:

/^PATTERN_START/{x;s/^.*$//;x}
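The effect of this swap, clear, swap sequence can be made visible with a small experiment. In the sketch below, the `stale` lines, the function name `show_hold_after_clear`, and the trailing `G;l` are all illustrative additions: `G` appends the hold space to the pattern space and `l` prints the result unambiguously, so an empty hold space shows up as a bare `\n` before the closing `$`. GNU sed is assumed.

```shell
#!/bin/sh
# show_hold_after_clear is a hypothetical name. Two made-up "stale" lines are
# collected in the hold space first; on PATTERN_START the hold space is
# cleared with x;s/^.*$//;x and then inspected with G;l.
show_hold_after_clear() {
    printf 'stale 1\nstale 2\nPATTERN_START\n' |
        sed -n '/^stale/H; /^PATTERN_START/{x;s/^.*$//;x;G;l}'
}

show_hold_after_clear
# Expected output (a literal backslash-n followed by a dollar sign):
#   PATTERN_START\n$
```

Without the clearing step, the same inspection would show the stale lines after PATTERN_START.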

The final touch is that we want to do all this fiddling only between PATTERN_START and PATTERN_END lines. All other lines are simply printed. This is done by the commands

/^PATTERN_START/, /^PATTERN_END/{
    ----above commands go here----
};
/^PATTERN_START/, /^PATTERN_END/!p

Put all these together and this gives the final command :)
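For a quick end-to-end check, the sketch below runs the assembled command on a shortened sample with two PATTERN_START blocks and one PATTERN_OTHER block. The function name `join_between` and the sample are illustrative assumptions; the sed script is the one from this answer with whitespace tidied, reading standard input, and GNU sed is assumed.

```shell
#!/bin/sh
# join_between is a hypothetical wrapper name around the assembled command.
join_between() {
    sed -n '/^PATTERN_START/,/^PATTERN_END/{
        /^PATTERN_START/{x;s/^.*$//;x};
        /^Record/{x;/^\n/{s/^\n//p;d};s/\n/ /gp};
        /^PATTERN_END/{x;/^\n/{s/^\n//p;d};s/\n/ /gp;g;p};
        /^Record/!H
    };
    /^PATTERN_START/,/^PATTERN_END/!p'
}

# Shortened, made-up sample with two blocks to join and one to leave alone.
sample2='Blabla
PATTERN_START
Record 1
Data
Data
Record 2
Data
PATTERN_END
Blabla
PATTERN_OTHER
Record 1
Data
PATTERN_END
PATTERN_START
Record 1
Data
PATTERN_END
Blabla'

printf '%s\n' "$sample2" | join_between
# Expected output:
#   Blabla
#   PATTERN_START
#   Record 1 Data Data
#   Record 2 Data
#   PATTERN_END
#   Blabla
#   PATTERN_OTHER
#   Record 1
#   Data
#   PATTERN_END
#   PATTERN_START
#   Record 1 Data
#   PATTERN_END
#   Blabla
```

The second PATTERN_START block exercises the hold-space clearing step, since the hold space still contains leftovers from the first block at that point.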