Dan*_*scu 17 javascript unicode diacritics combining-marks zalgo
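I have learned how Zalgo text works, and I am looking into how chat or forum software could prevent this kind of annoyance. More precisely, what is the complete set of Unicode combining characters that needs to be:
a) either stripped, assuming chat participants only use languages that do not require combining marks (i.e. you could write "fiancé" with a combining mark, but you would be a bit Zalgo'ed yourself if you insisted on doing so); or,
EDIT: In the meantime, I found a completely differently phrased question ("How to prevent ... diacritics?") that is essentially the same as this one. I made its title more explicit so that others will find it too.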
nwk*_*nwk 17
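Assuming you are serious about this and want a technical solution, you could proceed as follows:
This could be fun, but in practice it is better to go straight to step four.
EDIT: Here is a more practical, blunter solution in Python 2.7. Unicode characters classified as "Mark, nonspacing" and "Mark, enclosing" appear to be the main tools used to create the Zalgo effect. Unlike the idea above, this does not try to judge the "aesthetics" of the text; it simply removes all such characters. (Needless to say, this will wreck text in many languages. Read on for a better solution.) To filter out more character categories, add them to ZALGO_CHAR_CATEGORIES.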
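#!/usr/bin/env python
import unicodedata
import codecs

# General categories to strip: Mn = "Mark, nonspacing", Me = "Mark, enclosing".
ZALGO_CHAR_CATEGORIES = ['Mn', 'Me']

with codecs.open("zalgo", 'r', 'utf-8') as infile:
    for line in infile:
        # Decompose (NFD) so accents become separate marks, then drop every mark.
        print ''.join([c for c in unicodedata.normalize('NFD', line) if unicodedata.category(c) not in ZALGO_CHAR_CATEGORIES]),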
Sample input:
1
H??????????o??????????w?????????? ???????d??????????o??????????e??????????s?????????? ??????????Z??????????a?????????l?????????g?????o?????????? ??????????t?????????e??????????x??????????t?????????? ??????w??????????o??????????r???????k???????????????????
2
H??????????o??????????w?????????? ???????d??????????o??????????e??????????s?????????? ??????????Z??????????a?????????l?????????g?????o?????????? ??????????t?????????e??????????x??????????t?????????? ??????w??????????o??????????r???????k???????????????????
3
Output:
1
How does Zalgo text work?
2
How does Zalgo text work?
3
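As noted above, stripping every mark also destroys legitimate accents. One gentler variation is to keep a small number of marks per base character and recompose the result. The following is only a minimal sketch in Python 3 (not the answer's Python 2.7); the name `limit_marks` and the limit `MAX_MARKS_PER_BASE = 2` are assumptions made for illustration, and the right limit depends on the languages you need to support.

```python
import unicodedata

# Same categories as the script above: Mn = nonspacing, Me = enclosing marks.
ZALGO_CHAR_CATEGORIES = ("Mn", "Me")
# Hypothetical limit chosen for this sketch; it is not taken from the answer.
MAX_MARKS_PER_BASE = 2


def limit_marks(text, max_marks=MAX_MARKS_PER_BASE):
    """Keep at most max_marks combining marks after each base character."""
    kept = []
    run = 0  # combining marks seen since the last non-mark character
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.category(ch) in ZALGO_CHAR_CATEGORIES:
            run += 1
            if run > max_marks:
                continue  # drop the excess mark
        else:
            run = 0
        kept.append(ch)
    # Recompose so ordinary accented letters come back in precomposed form.
    return unicodedata.normalize("NFC", "".join(kept))


if __name__ == "__main__":
    print(limit_marks("fiance\u0301"))  # the accent is kept
    print(limit_marks("H\u0301\u0300\u0302\u0303ow", max_marks=0))  # all marks dropped
```

Calling `limit_marks(text, max_marks=0)` behaves like the unconditional filter above, so the limit can be tuned without changing the rest of the code.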
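Finally, if you want to detect Zalgo text rather than remove it unconditionally, you can perform a character frequency analysis. The program below does this for each line of the input file. The function is_zalgo computes a "Zalgo score" for each word of the string it is given (the score is the number of potential Zalgo characters divided by the total number of characters in the word). It then checks whether the third quartile of the word scores is greater than THRESHOLD. If THRESHOLD equals 0.5, this means we are trying to detect whether one in every four words consists of more than 50% Zalgo characters. (The THRESHOLD of 0.5 was a guess and may need tuning for real-world use.) This type of algorithm is probably the best in terms of payoff versus coding effort.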
#!/usr/bin/env python
from __future__ import division
import unicodedata
import codecs
import numpy

ZALGO_CHAR_CATEGORIES = ['Mn', 'Me']
THRESHOLD = 0.5
DEBUG = True

def is_zalgo(s):
    words = s.split()
    # Empty or whitespace-only input has no words to score.
    if len(words) == 0:
        return False
    word_scores = []
    for word in words:
        cats = [unicodedata.category(c) for c in word]
        # Fraction of the word's characters that are potential Zalgo marks.
        score = sum([cats.count(banned) for banned in ZALGO_CHAR_CATEGORIES]) / len(word)
        word_scores.append(score)
    # Third quartile of the per-word scores.
    total_score = numpy.percentile(word_scores, 75)
    if DEBUG:
        print total_score
    return total_score > THRESHOLD

with codecs.open("zalgo", 'r', 'utf-8') as infile:
    for line in infile:
        print is_zalgo(unicodedata.normalize('NFD', line)), "\t", line
Sample output:
0.911483990148
True Señor, could you or your fiancé explain, H??????????o??????????w?????????? ???????d??????????o??????????e??????????s?????????? ??????????Z??????????a?????????l?????????g?????o?????????? ??????????t?????????e??????????x??????????t?????????? ??????w??????????o??????????r???????k???????????????????
0.333333333333
False Příliš žluťoučký kůň úpěl ďábelské ódy.
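For readers on Python 3 without numpy, the same scoring idea can be sketched with the standard library only. This is an illustrative port, not the answer's code: the helpers `third_quartile` and `zalgo_score` are hypothetical names, and `third_quartile` uses linear interpolation between neighbouring values, which is intended to mirror the default behaviour of `numpy.percentile(scores, 75)`.

```python
import unicodedata

ZALGO_CHAR_CATEGORIES = ("Mn", "Me")
THRESHOLD = 0.5  # same guessed value as above; may need tuning


def third_quartile(values):
    """75th percentile, interpolating linearly between neighbouring values."""
    ordered = sorted(values)
    pos = 0.75 * (len(ordered) - 1)
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (pos - lower)


def zalgo_score(word):
    """Fraction of characters in word that are potential Zalgo marks."""
    marks = sum(unicodedata.category(c) in ZALGO_CHAR_CATEGORIES for c in word)
    return marks / len(word)


def is_zalgo(text, threshold=THRESHOLD):
    # Decompose first so precomposed accents are counted as marks too.
    words = unicodedata.normalize("NFD", text).split()
    if not words:
        return False
    return third_quartile([zalgo_score(w) for w in words]) > threshold


if __name__ == "__main__":
    heavy = "Z" + "\u0300" * 9  # one base letter carrying nine marks
    print(is_zalgo("Se\u00f1or, could you explain"))
    print(is_zalgo(" ".join([heavy] * 3)))
```

Scoring per word rather than per line keeps a single accented word from dominating the result, at the cost of being sensitive to how the text is split on whitespace.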
Give the box overflow: hidden. This does not actually disable Zalgo text, but it prevents it from damaging other comments.
.comment {
  /* the overflow: hidden is what prevents one comment's combining marks from affecting its siblings */
  overflow: hidden;
  /* the padding gives space for any legitimate combining marks */
  padding: 0.5em;
  /* the rest are just to visually divide the three comments */
  border: solid 1px #ccc;
  margin-top: -1px;
  margin-bottom: -1px;
}
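A related question has been asked before: https://stackoverflow.com/questions/5073191/how-is-zalgo-text-implemented but the prevention angle here is interesting.
In terms of preventing this, there are a few strategies you could choose from: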