Bin*_*ven 7 python algorithm text data-visualization word-cloud
I want to create a word cloud. When my string is in English, everything works fine:
from wordcloud import WordCloud
from matplotlib import pyplot as plt
text="""Softrock 40 - close to the 6 MHz that the P6D requires (6.062 according) - https://groups.yahoo.com/neo/groups/softrock40/conversations/messages
I want the USB model that has a controllable (not fixed) central frequency."""
wordcloud = WordCloud().generate(text)
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()
But when I do the same with Hebrew text, the font is not picked up and I only get empty rectangles:
text="""?????? ?? ???? ????? ?????, ????? ???? ?????? ??????? ?? ??????? ???? ???? ????? ???? ????? ???????? ????? ????? ???? ????"""
wordcloud = WordCloud().generate(text)
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()
Any ideas?
Wil*_*sem 11
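This has less to do with wordcloud itself and more to do with rendering: you are using (by default) a font that contains no glyph definitions for Hebrew characters at all, so it simply renders rectangles.
We can, however, use a font that supports Hebrew characters, such as FreeSansBold, and pass the path to that font to the WordCloud constructor: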
from wordcloud import WordCloud
from matplotlib import pyplot as plt
text="""?????? ?? ???? ????? ?????, ????? ???? ?????? ??????? ?? ??????? ???? ???? ????? ???? ????? ???????? ????? ????? ???? ????"""
wordcloud = WordCloud(font_path='/usr/share/fonts/truetype/freefont/FreeSansBold.ttf').generate(text)
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()
This then generates the following word cloud:
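The font path used above is specific to one Linux layout and may not exist on other machines. As a minimal sketch, assuming a hand-maintained list of candidate locations, the hypothetical helper `first_existing_font` below picks the first font file that is actually present. Only the FreeSansBold path comes from the answer; the other two paths are assumptions about common installs, and a file existing does not by itself prove that the font covers Hebrew.

```python
import os

# Candidate font files. Only the first path appears in the answer;
# the others are assumed locations and may differ on your system.
CANDIDATE_FONTS = [
    "/usr/share/fonts/truetype/freefont/FreeSansBold.ttf",
    "/usr/share/fonts/truetype/dejavu/DejaVuSans.ttf",  # assumed Debian/Ubuntu path
    "C:/Windows/Fonts/arial.ttf",                       # assumed Windows path
]


def first_existing_font(candidates=CANDIDATE_FONTS):
    """Return the first path in `candidates` that exists as a file, else None."""
    for path in candidates:
        if os.path.isfile(path):
            return path
    return None


font_path = first_existing_font()
if font_path is None:
    print("No candidate font found; pass font_path explicitly.")
else:
    print("Using font:", font_path)
```

The returned value can be passed as `WordCloud(font_path=font_path)` when it is not `None`.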
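I am not very familiar with Hebrew, but I have the impression that the words are drawn left to right rather than right to left. In any case, if that is a problem, we can use python-bidi to handle the direction of the text first, for example: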
from wordcloud import WordCloud
from matplotlib import pyplot as plt
from bidi.algorithm import get_display
text="""?????? ?? ???? ????? ?????, ????? ???? ?????? ??????? ?? ??????? ???? ???? ????? ???? ????? ???????? ????? ????? ???? ????"""
bidi_text = get_display(text)
wordcloud = WordCloud(font_path='/usr/share/fonts/truetype/freefont/FreeSansBold.ttf').generate(bidi_text)
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()
For the given text, we then obtain the following image:
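Whether the reordering step is needed depends on whether the text contains right-to-left characters at all. As a rough sketch using only the standard library, the hypothetical helper `contains_rtl` below checks each character's Unicode bidirectional class. It is not part of wordcloud or python-bidi, just one possible way to decide when to call `get_display`.

```python
import unicodedata


def contains_rtl(text):
    """Return True if any character has a right-to-left bidi class.

    "R" covers Hebrew letters, "AL" covers Arabic letters.
    """
    return any(unicodedata.bidirectional(ch) in ("R", "AL") for ch in text)


print(contains_rtl("hello world"))               # False
print(contains_rtl("\u05e9\u05dc\u05d5\u05dd"))  # True (Hebrew letters shin, lamed, vav, final mem)
```

With a check like this, `get_display(text)` would only be applied when `contains_rtl(text)` is true, and purely left-to-right input would be passed to `generate` unchanged.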