Sun*_*her 17 postgresql regular-expression utf-8 regex unicode
Does the [:ascii:] class work in Postgres at all? It is not listed in their help, but I have seen examples on the web that use it.
I have a UTF-8 database where collation and c_type are en_US.UTF-8, and the Postgres version is 9.6.2. When I search for non-ASCII rows like this:
select title from wallabag_entry where title ~ '[^[:ascii:]]';
I get both Unicode and non-Unicode symbols (the full output is here):
?????????? ??????????????: ???? ????????? ??????? ?????
??????? ???????? ????????: ????? ?? ?????? ????????? ?? ???????
??? ?? ?????? ? ??????? ?? ????: ??? ? ????????????? ?????????? ???????????
??? ???????? ??????? ? 1740-? ???? ?? ??????? ??????? ??????
Have you heard of Saint Death? Don’t pray to her.
???????? ?????????? ????: ???????? ?? ????????
??????? ?? ??
China’s marriage rate is plummeting because women are choosing autonomy over
What is wrong with this query?
joa*_*olo 33
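To answer your question: [:ascii:] works. Your text probably contains some characters that you do not recognise as non-ASCII, but they are there. They could be, for instance, non-breaking spaces or any other Unicode space character.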
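It is not unusual for text copied and pasted from a web page to contain non-breaking spaces (&nbsp;) without you noticing that they are there.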
Here is an example:
WITH t(t) AS
(
VALUES
( '?????????? ??????????????: ???? ????????? ??????? ?????' ),
( '??????? ???????? ????????: ????? ?? ?????? ????????? ?? ???????' ),
( '??? ?? ?????? ? ??????? ?? ????: ??? ? ????????????? ?????????? ???????????' ),
( '??? ???????? ??????? ? 1740-? ???? ?? ??????? ??????? ??????' ),
( 'Have you heard of Saint Death? Don’t pray to her.' ),
( '???????? ?????????? ????: ???????? ?? ????????' ),
( '??????? ?? ??' ),
( 'China’s marriage rate is plummeting because women are choosing autonomy over ' )
)
SELECT
t, regexp_replace(t, '([^[:ascii:]])', '[\1]', 'g') AS t_marked
FROM
t
WHERE
t ~ '[^[:ascii:]]' ;
This is what you get:
t | t_marked
-------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
?????????? ??????????????: ???? ????????? ??????? ????? | [?][?][?][?][?][?][?][?][?][?] [?][?][?][?][?][?][?][?][?][?][?][?][?][?]: [?][?][?][?] [?][?][?][?][?][?][?][?][?] [?][?][?][?][?][?][?] [?][?][?][?][?]
??????? ???????? ????????: ????? ?? ?????? ????????? ?? ??????? | [?][?][?][?][?][?][?] [?][?][?][?][?][?][?][?] [?][?][?][?][?][?][?][?]: [?][?][?][?][?] [?][?] [?][?][?][?][?][?] [?][?][?][?][?][?][?][?][?] [?][?] [?][?][?][?][?][?]?
??? ?? ?????? ? ??????? ?? ????: ??? ? ????????????? ?????????? ??????????? | [?][?][?] [?][?] [?][?][?][?][?][?] [?] [?][?][?][?][?][?][?] [?][?] [?][?][?][?]: [?][?][?] [?] [?][?][?][?][?][?][?][?][?][?][?][?][?] [?][?][?][?][?][?][?][?][?][?] [?][?][?][?][?][?][?][?][?][?][?]
??? ???????? ??????? ? 1740-? ???? ?? ??????? ??????? ?????? | [?][?][?] [?][?][?][?][?][?][?][?] [?][?][?][?][?][?][?] [?] 1740-[?] [?][?][?][?] [?][?] [?][?][?][?][?][?][?] [?][?][?][?][?][?][?] [?][?][?][?][?][?]
Have you heard of Saint Death? Don’t pray to her. | Have you heard of Saint Death? Don[’]t pray to her.
???????? ?????????? ????: ???????? ?? ???????? | [?][?][?][?][?][?][?][?] [?][?][?][?][?][?][?][?][?][?] [?][?][?][?]: [?][?][?][?][?][?][?][?] [?][?] [?][?][?][?][?][?][?]?
??????? ?? ?? | [?][?][?][?][?][?][?] [?][?] [?][?]
China’s marriage rate is plummeting because women are choosing autonomy over | China[’]s marriage rate is plummeting because women are choosing autonomy over
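From this you can see that your problem is the ’ character (right apostrophe, U+2019). ASCII supports only ' (apostrophe, U+0027). The left and right apostrophes are typographically correct Unicode extensions.
dbfiddle here
You can also check it with earlier versions at http://rextester.com/UKIQ48014 (PostgreSQL 9.5) and http://sqlfiddle.com/#!15/4c563/1/0 (PostgreSQL 9.3).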
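The invisible case mentioned earlier, a non-breaking space, can be checked the same way. This is only a minimal sketch and assumes a UTF-8 database, where chr(160) yields U+00A0 (NO-BREAK SPACE); the column aliases are made up for illustration.

```sql
-- Assumption: the database encoding is UTF-8, so chr(160) is U+00A0.
-- The first string contains an ordinary space (U+0020), the second a
-- non-breaking space that looks identical on screen.
SELECT
    ( 'a b' ~ '[^[:ascii:]]' )                    AS plain_space_matches,
    ( ('a' || chr(160) || 'b') ~ '[^[:ascii:]]' ) AS nbsp_matches ;
```

If the class behaves as described in this answer, the first column should come back false and the second true, even though both strings look the same when printed.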
I guess the texts that you assumed were pure ASCII are not:
WITH t(t) AS
(
VALUES
('A fully ASCII text!'),
('Have you heard of Saint Death? Don’t pray to her.'),
('China’s marriage rate is plummeting because women are choosing autonomy over ')
)
SELECT
regexp_replace(t, '([^[:ascii:]])', '[\1]', 'g') AS t_marked
FROM
t
WHERE
t ~ '[^[:ascii:]]' ;
| t_marked |
| :----------------------------------------------------------------------------- |
| Have you heard of Saint Death? Don[’]t pray to her. |
| China[’]s marriage rate is plummeting because women are choosing autonomy over |
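dbfiddle here
These texts use ’ instead of ' to mark apostrophes.
Check Punctuation: Why is the preferred apostrophe character in Unicode the right single quotation mark (U+2019), rather than the semantically distinct apostrophe (U+0027)? ... to see that you are not the first one to run into this problem.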
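To see exactly which characters fall outside the class, each match can be listed together with its code point. The following is a minimal sketch, again assuming a UTF-8 database (where ascii() returns the Unicode code point of the first character of its argument); the sample CTE and the names title, hit, ch and code_point are illustrative and not part of the original table.

```sql
-- Assumption: UTF-8 database, so ascii() returns a Unicode code point.
WITH sample(title) AS
(
    VALUES
    ( 'Have you heard of Saint Death? Don’t pray to her.' )
)
SELECT
    m.hit[1] AS ch,                                   -- the offending character
    'U+' || upper(lpad(to_hex(ascii(m.hit[1])), 4, '0')) AS code_point
FROM
    sample,
    -- with the 'g' flag, one row is returned per non-ASCII character
    regexp_matches(sample.title, '[^[:ascii:]]', 'g') AS m(hit) ;
```

For the sample row this should report a single character, ’, as U+2019. Run against the real table (replacing the CTE with wallabag_entry), the same query would list every non-ASCII character per title, which makes stray typographic quotes and non-breaking spaces easy to spot.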