How do I remove punctuation with awk?

Question

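I need a shell command line that, given a text file "novel", prints each word on its own line together with the number of the line it appears on, and writes the result to a file named "words". The catch is that the words must not contain punctuation. This is what I have so far: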

$ awk '{for(i=1; i<=NF; ++i) {printf $i "\t" NR "\n", $0 > "words"}}' novel

The file contains:

$ cat novel 
ver a don Quijote, y ellas le defendían la puerta:
-¿Qué quiere este mostrenco en esta casa?

Expected output:

ver 1
a 1
don 1
Quijote 1
...
puerta 1
Qué 2
...
casa 2

This is for a very simple academic exercise.

Answer 1

Joh*_*024 5

Using awk

Try this command:

awk '{gsub(/[[:punct:]]/, "")} 1' RS='[[:space:]]' novel >words

As an example, consider this file:

$ cat novel
It was a "dark" and stormy
night; the rain fell in torrents.

$ awk '{gsub(/[[:punct:]]/, "")} 1' RS='[[:space:]]' novel
It
was
a
dark
and
stormy
night
the
rain
fell
in
torrents

As before, to save the output in the file words, redirect it:

awk '{gsub(/[[:punct:]]/, "")} 1' RS='[[:space:]]' novel >words

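How it works:

gsub(/[[:punct:]]/, "")

This tells awk to find every punctuation character and replace it with an empty string.

[:punct:] is a character class that matches punctuation characters. In a UTF-8 locale, and with an awk that handles multibyte characters such as gawk, it also covers punctuation defined by Unicode. For example, Unicode defines many kinds of quotation marks, and this class includes all of them.

1

This is awk's shorthand for print-the-record.

RS='[[:space:]]'

This tells awk to treat each whitespace character as a record separator, so every word becomes its own record and awk reads and processes one word at a time. A regular expression as RS is an extension supported by gawk and mawk; POSIX only specifies a single-character RS. Because every single whitespace character ends a record, consecutive whitespace characters produce empty records.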
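With RS='[[:space:]]', a run of spaces or a blank line yields empty records, and a token made up only of punctuation becomes empty once gsub has run. The following is a minimal sketch, not part of the original answer, that matches runs of whitespace with [[:space:]]+ and prints only records that still contain text. It assumes an awk that accepts a regular expression as RS (gawk or a recent mawk); the function name split_words and the sample input are made up for illustration.

```shell
# split_words: hypothetical helper, reads text on stdin, prints one word per line.
# Assumes an awk that accepts a regular expression as RS (gawk, recent mawk).
split_words() {
    awk -v RS='[[:space:]]+' '
        { gsub(/[[:punct:]]/, "") }   # strip punctuation from the record
        $0 != ""                      # print only records that still contain text
    '
}

# Sample input with a double space, a blank line, and a lone dash.
printf 'It  was a "dark"\n\n - night;\n' | split_words
```

To process the file from the question, something like split_words < novel > words should work.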

Counting words

The usual way to count items in Unix is sort followed by uniq -c:

$ echo 'one two two three three three' | awk '{gsub(/^[[:punct:]]|[[:punct:]]$/, "")} 1' RS='[[:space:]]' | sort | uniq -c
      1 one
      3 three
      2 two

Alternatively, awk can do it all:

$ echo 'one two two three three three' | awk '{gsub(/^[[:punct:]]|[[:punct:]]$/, ""); a[$0]++} END{for (w in a) print w,a[w]}' RS='[[:space:]]'
three 3
two 2
one 1

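The order in which for (w in a) visits the array keys is unspecified in awk, so the same command may list the words in a different order on another system. As a hedged sketch, not part of the original answer, the counts can be piped through sort to get a stable order. The function name count_words is hypothetical, and a regex RS again assumes gawk or a recent mawk.

```shell
# count_words: hypothetical helper, reads text on stdin, prints "word count".
count_words() {
    awk -v RS='[[:space:]]+' '
        {
            gsub(/^[[:punct:]]+|[[:punct:]]+$/, "")  # trim punctuation at both ends
            if ($0 != "") count[$0]++                # ignore records left empty
        }
        END { for (w in count) print w, count[w] }
    ' | sort -k2,2nr -k1,1   # highest count first, ties broken alphabetically
}

echo 'one two two three three three' | count_words
```

For this input the sorted result is three 3, two 2, one 1, one pair per line.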

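Alternative awk approach

Andriy Makukha suggests that we may not want to remove punctuation inside words, such as the apostrophe in I've. Likewise, we may not want to remove the periods in a URL, so that google.com stays google.com. To remove only punctuation at the beginning or end of a word, replace the gsub command with: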

gsub(/^[[:punct:]]|[[:punct:]]$/, "")

For example:

awk '{gsub(/^[[:punct:]]|[[:punct:]]$/, "")} 1' RS='[[:space:]]' novel >words

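The question also asks for the line number next to each word. Splitting records on whitespace makes NR count words instead of lines, so one way to keep the line number is to leave RS at its default and loop over the fields, as the asker's attempt did. The sketch below is an illustration under stated assumptions, not the answer's own method: the function name words_with_lines and the sample text are invented, and the + after each bracket expression is my addition so that several punctuation characters in a row at either end are trimmed.

```shell
# words_with_lines: hypothetical helper, reads text on stdin and prints
# "word<TAB>line number" for every word, trimming punctuation at word edges.
words_with_lines() {
    awk '{
        for (i = 1; i <= NF; i++) {
            w = $i                                        # work on a copy of the field
            gsub(/^[[:punct:]]+|[[:punct:]]+$/, "", w)    # trim leading and trailing punctuation
            if (w != "") printf "%s\t%d\n", w, NR         # NR is still the input line number
        }
    }'
}

# Sample input, two lines.
printf '%s\n' 'It was a "dark" and stormy' "night; I've seen google.com." | words_with_lines
```

For the file in the question this could be run as words_with_lines < novel > words. Passing the word to printf as a %s argument, not as the format string, avoids trouble with words that contain a % sign.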

Using sed

This sed command removes all punctuation and puts each word on a separate line:

sed 's/[[:punct:]]//g; s/[[:space:]]/\n/g' novel

Running it on the sample file gives:

$ sed 's/[[:punct:]]//g; s/[[:space:]]/\n/g' novel
It
was
a
dark
and
stormy
night
the
rain
fell
in
torrents

If you want to save the words in the file words, try:

sed 's/[[:punct:]]//g; s/[[:space:]]/\n/g' novel >words

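How it works:

s/[[:punct:]]//g

This tells sed to find every punctuation character and replace it with nothing. Again we use [:punct:] because, locale permitting, it handles the punctuation defined by Unicode.

s/[[:space:]]/\n/g

This tells sed to replace each whitespace character with a newline. Consecutive whitespace characters therefore leave blank lines in the output.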
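Two caveats apply to the sed version: \n in the replacement is supported by GNU sed but is not guaranteed by POSIX, and each whitespace character becomes its own newline. As a hedged alternative that swaps sed for tr (a different tool, not what the answer uses), the sketch below deletes punctuation and squeezes whitespace runs into single newlines. The function name split_words_tr is hypothetical. GNU tr works on bytes, so non-ASCII punctuation such as ¿ may not be removed.

```shell
# split_words_tr: hypothetical helper, reads text on stdin, prints one word per line.
split_words_tr() {
    tr -d '[:punct:]' |          # delete every (single-byte) punctuation character
        tr -s '[:space:]' '\n'   # map whitespace to newlines and squeeze repeats
}

# Sample input with a double space between two words.
printf 'It was a "dark" and  stormy\nnight; the rain.\n' | split_words_tr
```

Leading whitespace in the input would still produce one empty first line.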

Asked:	8 years, 5 months ago
Viewed:	1784 times
Last active:	8 years, 2 months ago