Trying to remove all IDs from an HTML file using grep

Question

Dᴀʀ*_*ᴅᴇʀ 2 grep regular-expression html

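I am trying to remove every id=" " attribute from a .html file, but I am not sure where I am going wrong. I tried using a regular expression, but all I get is the contents of the .html file printed in my Ubuntu terminal.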

Code:

grep -Ev '^$id\="[a-zA-Z][0-9]"' *.html

I am running it with bash ex.sh.

Answer 1

Run*_*ium 8

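Much against my better judgement, I'll post this anyway (the sed part).

That said: if this is for a quick and dirty fix, go ahead. If it is something a bit more serious, or something you will do frequently, use something else, such as Python or Perl, where you do not rely on regular expressions but use a module that handles HTML documents.

A simpler approach than grep is to use, for example, sed.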

sed 's/\(<[^>]*\) \+id="[^"]*"\([^>]*>\)/\1\2/' sample.html > noid.html

Explanation:

            +--------------------------------- Match group 1
            |                      +---------- Match group 2
         ___|___                ___|___
        |       |              |       |  
sed 's/\(<[^>]*\) \+id="[^"]*"\([^>]*>\)/\1\2/' sample.html > noid.html
     |   |  | |   |  |    | ||    |  |      |
     |   |  | |   |  |    | ||    |  |      +- \1\2  Subst. with group 1 and 2
     |   |  | |   |  |    | ||    |  +-------- >     Closing bracket
     |   |  | |   |  |    | ||    +----------- [^>]* Same as below
     |   |  | |   |  |    | |+---------------- "     Followed by "
     |   |  | |   |  |    | +----------------- *     Zero or more times
     |   |  | |   |  |    +------------------- [^"]  Not double-quote
     |   |  | |   |  +------------------------ id="  Literal string
     |   |  | |   +---------------------------  \+   Space 1 or more times
     |   |  | +------------------------------- *     Zero or more times 
     |   |  +--------------------------------- [^>]  Not closing bracket
     |   +------------------------------------ <     Opening bracket
     +---------------------------------------- s     Substitute
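As a way to experiment with the pattern outside the shell, here is a minimal sketch that translates the sed expression into Python's `re` syntax (the escaped `\(`, `\)` and `\+` of sed's basic syntax become plain `(`, `)` and `+`). The function names and the sample line are made up for illustration. It also shows one thing the diagram does not spell out: without the `g` flag, sed replaces only the first match on each line.

```python
import re

# Python spelling of the sed pattern shown above.
ID_ATTR = re.compile(r'(<[^>]*) +id="[^"]*"([^>]*>)')


def strip_first_id(line):
    # Like sed 's/.../.../' without the g flag: replace only the first match.
    return ID_ATTR.sub(r'\1\2', line, count=1)


def strip_all_ids(line):
    # Like adding the g flag: replace every non-overlapping match on the line.
    return ID_ATTR.sub(r'\1\2', line)


if __name__ == "__main__":
    sample = '<p id="a">x</p><p id="b">y</p>'  # hypothetical input line
    print(strip_first_id(sample))
    print(strip_all_ids(sample))
```

If several tags with an id can share one line, appending `g` to the sed command (`s/.../\1\2/g`) corresponds to `strip_all_ids` here. Like the sed version, this handles only double-quoted id values that are preceded by at least one space.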

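Use sed -i to edit the file in place. (You may regret it, but you cannot undo it. With GNU sed, sed -i.bak keeps a backup copy of the original.)

Better: an example using Perl: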

#!/usr/bin/perl

use strict;
use warnings;

use HTML::TokeParser::Simple;
use HTML::Entities;
use utf8;

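# The HTML file to process is expected as the first argument.
die "$0 [file]\n" unless defined $ARGV[0];

my $parser = HTML::TokeParser::Simple->new(file => $ARGV[0]);

if (!$parser) {
    die "No HTML file found.\n";
}

# Walk the token stream, drop the id attribute, and print each token back out.
while (my $token = $parser->get_token) {
    $token->delete_attr('id');
    print $token->as_is;
}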
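For readers who prefer Python, which is also named above as an option, the same idea can be sketched with the standard library's `html.parser`. This is an illustration under assumptions, not a port of the Perl script: the class name `IdStripper` and the function `strip_ids` are invented here, tags that carry an id are rebuilt from their parsed attributes (so their original spacing and quoting are not preserved), and tags without an id are copied through verbatim.

```python
from html import escape
from html.parser import HTMLParser


class IdStripper(HTMLParser):
    """Re-emit an HTML document, omitting every id attribute."""

    def __init__(self):
        # Keep character references as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.out = []

    def _emit_tag(self, tag, attrs, closing):
        if all(name != "id" for name, _ in attrs):
            # No id present: copy the original tag text unchanged.
            self.out.append(self.get_starttag_text())
            return
        parts = [tag]
        for name, value in attrs:
            if name == "id":
                continue  # this is the attribute being removed
            if value is None:
                parts.append(name)  # attribute without a value
            else:
                parts.append(f'{name}="{escape(value, quote=True)}"')
        self.out.append("<" + " ".join(parts) + closing)

    def handle_starttag(self, tag, attrs):
        self._emit_tag(tag, attrs, ">")

    def handle_startendtag(self, tag, attrs):
        self._emit_tag(tag, attrs, " />")

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")

    def handle_pi(self, data):
        self.out.append(f"<?{data}>")


def strip_ids(html_text):
    parser = IdStripper()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)


if __name__ == "__main__":
    print(strip_ids('<div id="main" class="box"><p id="intro">Hi</p></div>'))
```

Unlike the regex approach, this does not care whether the id value is single-quoted, double-quoted or unquoted, because the parser has already split the tag into attributes.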

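Your grep command will not match anything. But because you use the invert option, -v, it prints every line that does not match, hence the entire file.

grep is not an in-place file modifier; it is a tool typically used to find things in files. Try, for example: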

grep -o '\(<[^>]*\)id="[^"]*"[^>]*>' sample.html

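-o means print only the part that matched the pattern (not the whole line).

sed, awk and the like are the tools commonly used to edit streams or files, as in the sed example above.

Your grep pattern also reflects a few misconceptions: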

 id\="[a-zA-Z][0-9]"

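This would match exactly:

id="
one character in the range a-z or A-Z
followed by exactly one digit, then the closing "

In other words, it would match: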

id="a0" id="a1" id="a2" ... id="Z9"
It would not match anything like id="foo99" or id="blah-gah".
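To make the difference concrete, the following sketch runs the character-class part of the question's pattern and the looser `[^"]*` form from the sed command against the same text, using Python's `re` module. The sample string and the names `NARROW` and `BROAD` are made up for illustration, and the backslash before `=` is left out because `=` needs no escaping.

```python
import re

# Character-class part of the question's pattern (without the leading ^$).
NARROW = re.compile(r'id="[a-zA-Z][0-9]"')
# Looser form used in the sed command: anything up to the closing quote.
BROAD = re.compile(r'id="[^"]*"')

# Hypothetical sample text containing four different id attributes.
sample = '<p id="a0"> <p id="Z9"> <p id="foo99"> <p id="blah-gah">'

narrow_hits = NARROW.findall(sample)
broad_hits = BROAD.findall(sample)

if __name__ == "__main__":
    print(narrow_hits)  # only the letter-plus-digit ids
    print(broad_hits)   # every double-quoted id
```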

Furthermore, the start of the pattern asks for:

^ <-- start of line (As it is first in pattern or group)
$ <-- end of line   (As you use the `-E` option)
# Else it would be:
^ <-- start of line (As it is first in pattern or group)
$ <-- dollar sign   (Does not mean end of line unless it is at end of pattern or group)

Nothing can come after the end of a line, so the pattern as a whole matches nothing.
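The effect can be reproduced in a small Python sketch. Python's `re` is not the same engine as `grep -E`, but it likewise treats `^` and `$` as anchors here, so it serves as a rough model. The helper `grep_v` and the sample lines are invented for this illustration.

```python
import re

# The question's pattern, anchors included (backslash before = omitted).
QUESTION_PATTERN = re.compile(r'^$id="[a-zA-Z][0-9]"')
# The same pattern with the ^$ prefix dropped.
WITHOUT_ANCHORS = re.compile(r'id="[a-zA-Z][0-9]"')


def grep_v(pattern, lines):
    # Rough analogue of grep -v: keep the lines where the pattern is NOT found.
    return [line for line in lines if not pattern.search(line)]


# Hypothetical file contents, one entry per line.
lines = ['<p id="a0">one</p>', 'id="a0"', '', '<p>two</p>']

if __name__ == "__main__":
    # The anchored pattern never matches, so every line survives the inversion.
    print(grep_v(QUESTION_PATTERN, lines))
    # Without ^$, lines containing a matching id are filtered out.
    print(grep_v(WITHOUT_ANCHORS, lines))
```

Even with the anchors removed, inverting the match only drops whole lines that contain such an id; it never edits a line, which is why sed or an HTML parser is the better fit for the original task.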
