Extract fixed-width records with no delimiters from a single line

Question


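I need to extract text strings from a single file that contains one very long line of text with no delimiters. Using the example line below, these are the known facts: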

???????A1XXXXXXXXXX???????B1XXXX???????A1XXXXXXXXXX???????C1XXXXXXX

1.  The file contains 38 fixed-width record types.
2.  The record marker is 7 alphanumeric characters followed by a record type code, for example 'A1'.
3.  Each record type has a different width: for example, an A1 record has 10 characters following the type code, a B1 record has 4, and a C1 record has 7.
4.  The record types aren't grouped together and can appear in any order. In the example, the order is A1, B1, A1, C1.
5.  The example above has 4 records, and each record type needs to go to a separate file. In the real data there are 38 of them.


???????A1XXXXXXXXXX

???????B1XXXX

???????A1XXXXXXXXXX

???????C1XXXXXXX

6.  The record identifier, e.g. ???????A1, can appear in the body of a record, so I cannot simply use grep.
7.  With the last point in mind, I was considering 3 solutions, but I'm not sure how to script them and would greatly appreciate some help.
a. Traverse the file from the beginning and sequentially strip each record out to the appropriate output file. For example, strip the first record, of type A1 (which I know is 10 characters long), out to A1file; then re-interrogate the file, which will now start with a B1 record (which I know is 4 characters long), strip that out to B1file, and so on. (This seems painful.)
b. Traverse the file and append some obscure character to each record marker within the same file. This is much like the above, but without stripping records out. I understand it would still use the same logic, but it seems more elegant.
c. I did think of simply using the proposed grep -oE solution and then re-interrogating the output files to see whether any of the 38 record markers appear anywhere other than at the beginning of a line. But this might not always work.


Answer 1

iru*_*var 5

How about grep?

grep -oE 'A1.{10}|B1.{4}|C1.{7}' input.txt


This prints every record of each record type on its own line. To redirect the grep output into three files named A1, B1, and C1 respectively:

grep -oE 'A1.{10}|B1.{4}|C1.{7}' input.txt| 
awk -v OFS= -v FS= '{f=$1$2; $1=$2=""; print>f}'


Answer 2

rzy*_*mek 4

Here is a possible solution using gawk's FPAT:

BEGIN { 
    FPAT="A1.{10}|B1.{4}|C1.{7}" # define field contents
} 
{
    for(i=1;i<=NF;i++) 
        print $i >> substr($i,1,2) # append the field to the file named after its type: A1, B1, etc. (awk strings are 1-indexed)
}


As a one-liner:

gawk 'BEGIN{FPAT="A1.{10}|B1.{4}|C1.{7}"} {for(i=1;i<=NF;i++)print $i >> substr($i,1,2)}' < datafile
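
The patterns above do not include the 7-character marker that precedes the type code, and point 6 of the question worries about type codes showing up inside a record body. A minimal sketch of the question's option (a), walking the line from the start and jumping ahead by each record's known width, could look like the following. It assumes every record starts exactly where the previous one ends and that the type code sits right after the 7-character marker. The width table lists only the three types given in the question, and the sample input.txt and the A1file/B1file/C1file output names are illustrative choices, not something from the original posts.

```shell
#!/bin/sh
# Sample input taken from the example line in the question (one line, no delimiters).
printf '%s\n' '???????A1XXXXXXXXXX???????B1XXXX???????A1XXXXXXXXXX???????C1XXXXXXX' > input.txt

awk '
BEGIN {
    # Body width for each record type; only the three widths stated in the
    # question are listed, the other 35 types would need entries here too.
    width["A1"] = 10; width["B1"] = 4; width["C1"] = 7
    prefix = 7                      # length of the marker before the type code
}
{
    pos = 1
    while (pos <= length($0)) {
        # The type code is read at a fixed offset from the record start,
        # so record bodies are never searched for type codes.
        type = substr($0, pos + prefix, 2)
        if (!(type in width)) {
            printf "unknown record type \"%s\" at offset %d\n", type, pos > "/dev/stderr"
            exit 1
        }
        len = prefix + 2 + width[type]
        # One output file per record type (assumed naming: A1file, B1file, ...).
        print substr($0, pos, len) > (type "file")
        pos += len
    }
}' input.txt
```

Because the walker only looks at the two characters at each computed record start, a body that happens to contain A1 or B1 does not cause a false split. The trade-off is that one wrong width or a corrupted record shifts every later offset, which is why the sketch stops with an error as soon as it meets a type code it does not know. A truncated final record would still be written, just shorter than expected.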


Archived:	11 years, 11 months ago
Views:	4240
Last activity:	11 years, 10 months ago