Query non-ASCII rows from Postgres

Sun*_*her 17 postgresql regular-expression utf-8 regex unicode

Does the [:ascii:] class work in Postgres at all? It is not listed in their help, but I see examples on the web that use it.

I have a UTF-8 database, where collation and c_type are en_US.UTF-8, and the Postgres version is 9.6.2. When I search for non-ASCII rows like this:

select title from wallabag_entry where title ~ '[^[:ascii:]]';

I get both Unicode and non-Unicode symbols (full output is here):

?????????? ??????????????: ???? ????????? ??????? ?????
??????? ???????? ????????: ????? ?? ?????? ????????? ?? ???????
??? ?? ?????? ? ??????? ?? ????: ??? ? ????????????? ?????????? ???????????
??? ???????? ??????? ? 1740-? ???? ?? ??????? ??????? ??????
Have you heard of Saint Death? Don’t pray to her.
???????? ?????????? ????: ???????? ?? ????????
??????? ?? ??
China’s marriage rate is plummeting because women are choosing autonomy over 

What is wrong with this query?

joa*_*olo 33

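To answer your question: [:ascii:] works. Your text probably contains some characters that you don't recognize as non-ASCII, but they are there. For instance, they may be non-breaking spaces or any other Unicode space character.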

It is not surprising that text copy-pasted from a web page contains non-breaking spaces (&nbsp;) without you noticing that they are there.

Here is an example:

WITH t(t) AS
(
    VALUES 
      ( '?????????? ??????????????: ???? ????????? ??????? ?????' ),
      ( '??????? ???????? ????????: ????? ?? ?????? ????????? ?? ???????' ),
      ( '??? ?? ?????? ? ??????? ?? ????: ??? ? ????????????? ?????????? ???????????' ),
      ( '??? ???????? ??????? ? 1740-? ???? ?? ??????? ??????? ??????' ),
      ( 'Have you heard of Saint Death? Don’t pray to her.' ),
      ( '???????? ?????????? ????: ???????? ?? ????????' ),
      ( '??????? ?? ??' ),
      ( 'China’s marriage rate is plummeting because women are choosing autonomy over ' )

)
SELECT 
    t,  regexp_replace(t, '([^[:ascii:]])', '[\1]', 'g') AS t_marked
FROM 
    t 
WHERE 
    t ~ '[^[:ascii:]]' ;

This is what you get:

                                       t                                       |                                                                                                 t_marked                                                                                                  
-------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 ?????????? ??????????????: ???? ????????? ??????? ?????                       | [?][?][?][?][?][?][?][?][?][?] [?][?][?][?][?][?][?][?][?][?][?][?][?][?]: [?][?][?][?] [?][?][?][?][?][?][?][?][?] [?][?][?][?][?][?][?] [?][?][?][?][?]
 ??????? ???????? ????????: ????? ?? ?????? ????????? ?? ???????               | [?][?][?][?][?][?][?] [?][?][?][?][?][?][?][?] [?][?][?][?][?][?][?][?]: [?][?][?][?][?] [?][?] [?][?][?][?][?][?] [?][?][?][?][?][?][?][?][?] [?][?] [?][?][?][?][?][?]?
 ??? ?? ?????? ? ??????? ?? ????: ??? ? ????????????? ?????????? ???????????   | [?][?][?] [?][?] [?][?][?][?][?][?] [?] [?][?][?][?][?][?][?] [?][?] [?][?][?][?]: [?][?][?] [?] [?][?][?][?][?][?][?][?][?][?][?][?][?] [?][?][?][?][?][?][?][?][?][?] [?][?][?][?][?][?][?][?][?][?][?]
 ??? ???????? ??????? ? 1740-? ???? ?? ??????? ??????? ??????                  | [?][?][?] [?][?][?][?][?][?][?][?] [?][?][?][?][?][?][?] [?] 1740-[?] [?][?][?][?] [?][?] [?][?][?][?][?][?][?] [?][?][?][?][?][?][?] [?][?][?][?][?][?]
 Have you heard of Saint Death? Don’t pray to her.                             | Have you heard of Saint Death? Don[’]t pray to her.
 ???????? ?????????? ????: ???????? ?? ????????                                | [?][?][?][?][?][?][?][?] [?][?][?][?][?][?][?][?][?][?] [?][?][?][?]: [?][?][?][?][?][?][?][?] [?][?] [?][?][?][?][?][?][?]?
 ??????? ?? ??                                                                 | [?][?][?][?][?][?][?] [?][?] [?][?]
 China’s marriage rate is plummeting because women are choosing autonomy over  | China[’]s marriage rate is plummeting because women are choosing autonomy over 

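From this you can see that your problem is the right single quotation mark ’ (U+2019). ASCII only has the straight apostrophe ' (U+0027). The left and right curly quotes are typographically correct characters that exist only in Unicode, outside the ASCII range.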
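To see exactly which characters are being flagged, it can help to list them with their code points. The following is a minimal sketch, not part of the original answer: the CTE name `sample`, its `title` column, and the second sample row (with a no-break space built via `chr(160)`) are made up for illustration. It assumes a UTF-8 database, where `ascii()` returns the Unicode code point of the first character.

```sql
-- Sketch: list each distinct non-ASCII character with its code point.
-- Assumes a UTF-8 database, where ascii() returns the Unicode code point.
WITH sample(title) AS
(
    VALUES
      ( 'Have you heard of Saint Death? Don’t pray to her.' ),
      ( 'A title with a no-break' || chr(160) || 'space' )  -- hypothetical row
)
SELECT DISTINCT
    m[1] AS ch,                                                      -- the matched character
    'U+' || upper(lpad(to_hex(ascii(m[1])), 4, '0')) AS code_point   -- e.g. U+2019
FROM
    sample,
    regexp_matches(title, '[^[:ascii:]]', 'g') AS m                  -- one row per match
ORDER BY
    code_point ;
```

With these two sample rows, the result should contain the no-break space (U+00A0) and the right single quotation mark (U+2019).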

dbfiddle here

You can also check it with previous versions at http://rextester.com/UKIQ48014 (PostgreSQL 9.5) and http://sqlfiddle.com/#!15/4c563/1/0 (PostgreSQL 9.3).
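If you would rather not rely on [:ascii:] alone, one possible cross-check is to compare byte length and character length. This sketch assumes a UTF-8 database, where every non-ASCII character occupies more than one byte; the CTE name `sample` and the column aliases are illustrative only.

```sql
-- Sketch: cross-check the regex against a byte-length comparison.
-- Assumes UTF-8 encoding: ASCII characters take 1 byte, all others 2 to 4.
WITH sample(title) AS
(
    VALUES
      ( 'A fully ASCII text!' ),
      ( 'Have you heard of Saint Death? Don’t pray to her.' )
)
SELECT
    title,
    title ~ '[^[:ascii:]]'                    AS regex_says_non_ascii,
    octet_length(title) <> char_length(title) AS length_says_non_ascii
FROM
    sample ;
```

Both boolean columns should agree for every row: false for the fully ASCII title and true for the one containing ’.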


I guess the texts you think are pure ASCII are not:

 WITH t(t) AS
 (
     VALUES 
       ('A fully ASCII text!'),
       ('Have you heard of Saint Death? Don’t pray to her.'),
       ('China’s marriage rate is plummeting because women are choosing autonomy over ')
 )
 SELECT 
    regexp_replace(t, '([^[:ascii:]])', '[\1]', 'g') AS t_marked
 FROM 
    t 
 WHERE 
    t ~ '[^[:ascii:]]' ;
| t_marked |
| :------- |
| Have you heard of Saint Death? Don[’]t pray to her. |
| China[’]s marriage rate is plummeting because women are choosing autonomy over  |
 

dbfiddle here

These texts use ’ instead of ' to mark apostrophes.

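Check "Punctuation: Why is the right single quotation mark (U+2019), rather than the semantically distinct apostrophe (U+0027), the preferred apostrophe character in Unicode?" ... to see that you are not the first to run into this problem.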
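If the goal is to find titles in other scripts and typographic quotes are just noise, one option is to add the curly quotes to the negated bracket expression. This is only a sketch under that assumption: the list of characters to tolerate (‘ ’ “ ”) is a guess at what your data contains, and the CTE name `sample` and its rows are made up.

```sql
-- Sketch: match rows that contain a character which is neither ASCII
-- nor one of the four curly quotes (an assumed list; extend as needed).
WITH sample(title) AS
(
    VALUES
      ( 'A fully ASCII text!' ),
      ( 'Have you heard of Saint Death? Don’t pray to her.' ),
      ( 'Crème brûlée' )  -- hypothetical row with accented letters
)
SELECT
    title
FROM
    sample
WHERE
    title ~ '[^[:ascii:]‘’“”]' ;
```

With these sample rows, only the title with accented letters should be returned. Against the table from the question, the same WHERE clause would apply to wallabag_entry.title.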

  • This is a really great answer because it shows you the non-ASCII characters. It is how I would have answered the question. (3 upvotes)