I need to process a large data file made up of multi-line records. Example input:
1 Name Dan
1 Title Professor
1 Address aaa street
1 City xxx city
1 State yyy
1 Phone 123-456-7890
2 Name Luke
2 Title Professor
2 Address bbb street
2 City xxx city
3 Name Tom
3 Title Associate Professor
3 Like Golf
4 Name
4 Title Trainer
4 Likes Running
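Note that the first integer field is unique and identifies the whole record. So in the input above I have 4 records, although I don't know how many attribute lines each record may have. I need to:
- identify valid records (a record must have both the "Name" and "Title" fields)
- output the available attributes of each valid record, e.g. "Name", "Title" and "Address" are the required fields.
Sample output: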
1 Name Dan
1 Title Professor
1 Address aaa street
2 Name Luke
2 Title Professor
2 Address bbb street
3 Name Tom
3 Title Associate Professor
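So in the output file, record 4 is dropped because it has no "Name" field. Record 3 has no "Address" field but is still printed, because it is a valid record with both "Name" and "Title".
Can I do this with awk? If so, how do I identify a whole record using the first "id" field on each line?
Many thanks to the unix shell scripting experts for helping me! :)
This seems to work. There are many ways to do this, even within awk.
I've spread it out over several lines for easier reading.
Note that record 3 is not shown because it lacks the "Address" field, which you identified as required.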
#!/usr/bin/awk -f

BEGIN {
    # Set your required fields here...
    required["Name"] = 1
    required["Title"] = 1
    required["Address"] = 1

    # Count the required fields
    for (i in required) enough++
}

# Note that this will run on the first record, but only to initialize variables
$1 != last1 {
    if (hits >= enough) {
        printf("%s", output)
    }
    last1 = $1; output = ""; hits = 0
}

# This appends the current line to a buffer, followed by the record separator (RS)
{ output = output $0 RS }

# Count the required fields; used to determine whether to print the buffer
$2 in required { hits++ }

END {
    # Print the final buffer, since we only print on the next record
    if (hits >= enough) {
        printf("%s", output)
    }
}
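The sample output in the question keeps record 3 and prints only the Name, Title and Address lines, whereas the script above requires Address and prints every line of a record that passes. If the sample output is what is wanted, a variant along the following lines might do it. This is only a sketch under some assumptions: the wrapper name `filter_records` and the awk function name `emit` are made up for illustration, lines of one record are assumed to be contiguous, and a field is assumed to count as present only when it has a value (so the bare `4 Name` line does not qualify).

```shell
#!/bin/sh
# Hypothetical helper: reads records on stdin, writes the filtered lines to stdout.
filter_records() {
    awk '
    # Print the buffered record only if both Name and Title were seen with a value,
    # then reset the per-record state.
    function emit() {
        if (have["Name"] && have["Title"]) printf "%s", buf
        buf = ""
        split("", have)   # portable way to clear an array
    }

    # A change in the id field means the previous record is complete.
    NR > 1 && $1 != id { emit() }
    { id = $1 }

    # Assumption: a field counts only when it carries a value (NF >= 3).
    NF >= 3 && ($2 == "Name" || $2 == "Title") { have[$2] = 1 }

    # Buffer only the fields selected for output.
    $2 == "Name" || $2 == "Title" || $2 == "Address" { buf = buf $0 "\n" }

    # The last record is only complete at end of input.
    END { emit() }
    '
}
```

With the function defined in the current shell, it could be run as `filter_records < input.txt > output.txt`, where both file names are placeholders.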