Bigram word cloud using Python

Question


I am generating a word cloud directly from a text file using the wordcloud package in Python. Here is the code, which I reused from Stack Overflow:

import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS


def random_color_func(word=None, font_size=None, position=None, orientation=None, font_path=None, random_state=None):
    h = int(360.0 * 45.0 / 255.0)
    s = int(100.0 * 255.0 / 255.0)
    l = int(100.0 * float(random_state.randint(60, 120)) / 255.0)

    return "hsl({}, {}%, {}%)".format(h, s, l)

file_content=open ("xyz.txt").read()

wordcloud = WordCloud(font_path = r'C:\Windows\Fonts\Verdana.ttf',
                            stopwords = STOPWORDS,
                            background_color = 'white',
                            width = 1200,
                            height = 1000,
                            color_func = random_color_func
                            ).generate(file_content)

plt.imshow(wordcloud,interpolation="bilinear")
plt.axis('off')
plt.show()


It gives me a word cloud of single words. Is there any parameter in WordCloud() that lets me pass n-grams without reformatting the text file?

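I want a word cloud of bigrams, or of words displayed joined by an underscore, for example machine_learning (where machine and learning are two separate words).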

Answer 1

Kav*_*yal 6

Thanks to Diego for his answer. This is just a continuation of Diego's answer with Python code.


import re
from string import digits

import matplotlib.pyplot as plt
import nltk
from sklearn.feature_extraction.text import CountVectorizer
from wordcloud import WordCloud, STOPWORDS

# Note: the NLTK tokenizer and WordNet data must be downloaded first (see nltk.download).
WNL = nltk.WordNetLemmatizer()
text = 'your input text goes here'
# Lowercase the text
text = text.lower()
# Remove single quotes early since they cause problems with the tokenizer.
text = text.replace("'", "")
# Remove numbers from text
remove_digits = str.maketrans('', '', digits)
text = text.translate(remove_digits)
tokens = nltk.word_tokenize(text)
text1 = nltk.Text(tokens)

# Strip punctuation and other extra characters from each token.
text_content = [''.join(re.split(r"[ .,;:!?‘’`'@#$%^_&*()<>{}~\n\t\\-]", word)) for word in text1]

# Set the stopwords list
stopwords_wc = set(STOPWORDS)
customised_words = ['xxx', 'yyy']  # Extra words to drop because they contribute little meaning

new_stopwords = stopwords_wc.union(customised_words)
text_content = [word for word in text_content if word not in new_stopwords]

# Removing the punctuation above can still leave empty entries in the list.
text_content = [s for s in text_content if len(s) != 0]

# Best to get the lemma of each word to reduce the number of similar words
text_content = [WNL.lemmatize(t) for t in text_content]

bigrams_list = list(nltk.bigrams(text_content))
print(bigrams_list)
dictionary2 = [' '.join(tup) for tup in bigrams_list]
print(dictionary2)

# Use CountVectorizer to view the frequency of bigrams
vectorizer = CountVectorizer(ngram_range=(2, 2))
bag_of_words = vectorizer.fit_transform(dictionary2)
sum_words = bag_of_words.sum(axis=0)
words_freq = [(word, sum_words[0, idx]) for word, idx in vectorizer.vocabulary_.items()]
words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
print(words_freq[:100])

# Generate the word cloud and save it as a JPG image
words_dict = dict(words_freq)
WC_height = 1000
WC_width = 1500
WC_max_words = 200
wordCloud = WordCloud(max_words=WC_max_words, height=WC_height, width=WC_width, stopwords=new_stopwords)
wordCloud.generate_from_frequencies(words_dict)
plt.title('Most frequently occurring bigrams connected by same colour and font size')
plt.imshow(wordCloud, interpolation='bilinear')
plt.axis("off")
plt.show()
wordCloud.to_file('wordcloud_bigram.jpg')


Answer 2

小智 6

You can easily generate a bigram word cloud by lowering the value of the collocation_threshold parameter in WordCloud.

Edit your WordCloud call:

wordcloud = WordCloud(font_path = r'C:\Windows\Fonts\Verdana.ttf',
                            stopwords = STOPWORDS,
                            background_color = 'white',
                            width = 1200,
                            height = 1000,
                            color_func = random_color_func,
                            collocation_threshold = 3  # added to the code from your question; try values between 1 and 50
                            ).generate(file_content)


For more information:

collocation_threshold: int, default=30. Bigrams must have a Dunning likelihood collocation score greater than this parameter to be counted as bigrams. The default of 30 is arbitrary.
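For intuition about what that threshold is compared against, here is a minimal sketch of Dunning's log-likelihood ratio for a bigram, following the usual textbook formulation. The function names (`_log_l`, `dunning_score`) and the counts in the example are made up for illustration, and wordcloud's internal implementation may differ in detail.

```python
from math import log


def _log_l(k, n, x):
    """Binomial log-likelihood (without the binomial coefficient) of
    k successes in n trials when each trial succeeds with probability x."""
    x = min(max(x, 1e-10), 1 - 1e-10)  # clamp to avoid log(0)
    return k * log(x) + (n - k) * log(1 - x)


def dunning_score(c12, c1, c2, n):
    """Dunning log-likelihood ratio score for a bigram "w1 w2".

    c12: how often the bigram occurs
    c1:  how often w1 occurs
    c2:  how often w2 occurs
    n:   total number of words
    """
    if n <= c1 or n <= c2:
        # Degenerate case: one word makes up the whole text.
        return 0.0
    p = c2 / n                    # P(w2) if w2 is independent of w1
    p1 = c12 / c1                 # P(w2 | previous word is w1)
    p2 = (c2 - c12) / (n - c1)    # P(w2 | previous word is not w1)
    log_lambda = (_log_l(c12, c1, p) + _log_l(c2 - c12, n - c1, p)
                  - _log_l(c12, c1, p1) - _log_l(c2 - c12, n - c1, p2))
    return -2 * log_lambda


# Hypothetical counts: w2 follows w1 exactly as often as chance predicts ...
independent = dunning_score(c12=1, c1=100, c2=100, n=10_000)
# ... versus two words that only ever appear together.
associated = dunning_score(c12=50, c1=50, c2=50, n=10_000)
print(independent, associated)
```

A pair that co-occurs no more often than chance scores close to zero, while a pair that always appears together scores far above the default of 30, so lowering the threshold lets weaker pairs through as bigrams.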

You can also find the source code of wordcloud.WordCloud here: https://amueller.github.io/word_cloud/_modules/wordcloud/wordcloud.html

Answer 3

Die*_*ego 3

You should use vectorizer = CountVectorizer(ngram_range=(2,2)) to get the frequencies, and then use the .generate_from_frequencies method from wordcloud.
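The same idea can be sketched with only the standard library. Note that this swaps CountVectorizer for collections.Counter, so it is an illustration of the approach rather than the exact method above. The helper name `bigram_frequencies`, the tiny stop-word set, and the sample sentence are all assumptions made for the example.

```python
import re
from collections import Counter

# Tiny illustrative stop-word set (an assumption); in practice you might
# use wordcloud.STOPWORDS instead.
STOP = {"the", "is", "a", "of", "and", "to", "in"}


def bigram_frequencies(text, joiner="_", stopwords=STOP):
    """Return {joined_bigram: count} for adjacent word pairs in text.

    Stop words are removed before pairing, so a pair may span a removed
    word or a sentence boundary.
    """
    tokens = [t for t in re.findall(r"[a-z]+", text.lower())
              if t not in stopwords]
    pairs = zip(tokens, tokens[1:])
    return dict(Counter(joiner.join(pair) for pair in pairs))


sample = ("Machine learning is fun. "
          "Machine learning is a branch of computer science.")
freqs = bigram_frequencies(sample)
print(freqs)

# The resulting dict can then be handed to wordcloud, for example:
#   WordCloud(width=1200, height=1000).generate_from_frequencies(freqs)
```

Using "_" as the joiner gives labels such as machine_learning, as asked in the question; pass joiner=" " to show the two words separated by a space instead.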

Archived:	8 years, 1 month ago
Views:	8076
Last activity:	5 years, 6 months ago