I am using the wordcloud package in Python to generate a word cloud directly from a text file. This is the code I reused from Stack Overflow:
import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS

def random_color_func(word=None, font_size=None, position=None, orientation=None, font_path=None, random_state=None):
    h = int(360.0 * 45.0 / 255.0)
    s = int(100.0 * 255.0 / 255.0)
    l = int(100.0 * float(random_state.randint(60, 120)) / 255.0)
    return "hsl({}, {}%, {}%)".format(h, s, l)

file_content = open("xyz.txt").read()
wordcloud = WordCloud(font_path = r'C:\Windows\Fonts\Verdana.ttf',
                      stopwords = STOPWORDS,
                      background_color = 'white',
                      width = 1200,
                      height = 1000,
                      color_func = random_color_func
                      ).generate(file_content)
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis('off')
plt.show()
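It gives me a word cloud of single words. Is there a parameter of WordCloud() that lets me pass n-grams without reformatting the text file?
I want a word cloud of bigrams, or of words joined by an underscore in the display, like machine_learning (where "machine" and "learning" are two different words).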
Thanks to Diego for his answer. This is just a continuation of Diego's answer, with Python code.
import re
from string import digits

import matplotlib.pyplot as plt
import nltk
from sklearn.feature_extraction.text import CountVectorizer
from wordcloud import WordCloud, STOPWORDS

# Requires the NLTK tokenizer and WordNet data (see nltk.download).
WNL = nltk.WordNetLemmatizer()
text = 'your input text goes here'
# Lowercase and tokenize
text = text.lower()
# Remove single quote early since it causes problems with the tokenizer.
text = text.replace("'", "")
# Remove numbers from text
remove_digits = str.maketrans('', '', digits)
text = text.translate(remove_digits)
tokens = nltk.word_tokenize(text)
text1 = nltk.Text(tokens)

# Remove extra chars and remove stop words.
text_content = [''.join(re.split(r"[ .,;:!?‘’``''@#$%^_&*()<>{}~\n\t\\\-]", word)) for word in text1]

# Set the stopwords list
stopwords_wc = set(STOPWORDS)
customized_words = ['xxx', 'yyy']  # Words to drop because they contribute little meaning

new_stopwords = stopwords_wc.union(customized_words)
text_content = [word for word in text_content if word not in new_stopwords]

# After the punctuation above is removed it still leaves empty entries in the list.
text_content = [s for s in text_content if len(s) != 0]

# Best to get the lemmas of each word to reduce the number of similar words
text_content = [WNL.lemmatize(t) for t in text_content]

bigrams_list = list(nltk.bigrams(text_content))
print(bigrams_list)
dictionary2 = [' '.join(tup) for tup in bigrams_list]
print(dictionary2)

# Using count vectoriser to view the frequency of bigrams
vectorizer = CountVectorizer(ngram_range=(2, 2))
bag_of_words = vectorizer.fit_transform(dictionary2)
sum_words = bag_of_words.sum(axis=0)
words_freq = [(word, sum_words[0, idx]) for word, idx in vectorizer.vocabulary_.items()]
words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
print(words_freq[:100])

# Generating wordcloud and saving as jpg image
words_dict = dict(words_freq)
WC_height = 1000
WC_width = 1500
WC_max_words = 200
wordCloud = WordCloud(max_words=WC_max_words, height=WC_height, width=WC_width, stopwords=new_stopwords)
wordCloud.generate_from_frequencies(words_dict)
plt.title('Most frequently occurring bigrams connected by same colour and font size')
plt.imshow(wordCloud, interpolation='bilinear')
plt.axis("off")
plt.show()
wordCloud.to_file('wordcloud_bigram.jpg')
小智 6
A bigram word cloud can be generated easily by lowering the value of the collocation_threshold parameter of WordCloud.
The edited WordCloud call:
wordcloud = WordCloud(font_path = r'C:\Windows\Fonts\Verdana.ttf',
                      stopwords = STOPWORDS,
                      background_color = 'white',
                      width = 1200,
                      height = 1000,
                      color_func = random_color_func,
                      collocation_threshold = 3  # added to the code from the question; try values between 1 and 50
                      ).generate(file_content)
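For more information:
collocation_threshold: int, default=30. Bigrams must have a Dunning likelihood collocation score greater than this parameter to be counted as bigrams. Default of 30 is arbitrary.
You can also find the source code of wordcloud.WordCloud here: https://amueller.github.io/word_cloud/_modules/wordcloud/wordcloud.html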
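To give a feel for what the threshold is compared against, here is a minimal sketch of a Dunning log-likelihood ratio for a word pair, using only the standard library. It follows the usual textbook formulation and is not copied from wordcloud, so the library's exact scores may differ. The function names and the example counts are hypothetical.

```python
from math import log


def _log_likelihood(k, n, x):
    """Binomial log-likelihood of k hits in n trials with probability x."""
    x = min(max(x, 1e-10), 1 - 1e-10)  # clamp so log() never sees 0
    return k * log(x) + (n - k) * log(1 - x)


def dunning_score(count_bigram, count_first, count_second, n_words):
    """Dunning log-likelihood ratio for the pair (first, second).

    A textbook-style sketch (assumption), not wordcloud's own code.
    """
    c12, c1, c2, n = count_bigram, count_first, count_second, n_words
    if c1 == 0 or n <= c1:
        return 0.0  # not enough data to compare the two hypotheses
    p = c2 / n                   # P(second), if independent of first
    p1 = c12 / c1                # P(second | previous word is first)
    p2 = (c2 - c12) / (n - c1)   # P(second | previous word is not first)
    log_lambda = (
        _log_likelihood(c12, c1, p) + _log_likelihood(c2 - c12, n - c1, p)
        - _log_likelihood(c12, c1, p1) - _log_likelihood(c2 - c12, n - c1, p2)
    )
    return -2 * log_lambda


# Hypothetical counts in a 1000-word text.
strong = dunning_score(18, 20, 20, 1000)  # the two words almost always appear together
weak = dunning_score(1, 50, 20, 1000)     # the pair occurs only as often as chance predicts
print(strong > 30, weak > 30)
```

A pair that co-occurs far more often than chance gets a large score and survives a high threshold, while a chance-level pair scores near zero. Lowering collocation_threshold therefore admits more, and weaker, bigrams.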
You should use vectorizer = CountVectorizer(ngram_range=(2,2)) to get the frequencies, and then use the .generate_from_frequencies method of wordcloud.
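The same idea can be sketched without scikit-learn. The snippet below counts adjacent word pairs with the standard library and joins each pair with an underscore, as the question asks. The function name, the sample text, and the simple letters-only tokenizer are assumptions for illustration. Note that pairs are formed across sentence boundaries and across removed stop words.

```python
import re
from collections import Counter


def bigram_frequencies(text, stopwords=frozenset(), joiner="_"):
    """Return {"first_second": count} for adjacent word pairs in text."""
    # Lowercase, keep runs of ASCII letters only, then drop stop words.
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in stopwords]
    pairs = zip(words, words[1:])
    return dict(Counter(joiner.join(pair) for pair in pairs))


sample = "Machine learning is fun. Machine learning is everywhere."
freqs = bigram_frequencies(sample, stopwords={"is"})
print(freqs)
```

The resulting dictionary has the shape that generate_from_frequencies expects, so it could be passed as WordCloud().generate_from_frequencies(freqs) in place of the vectorizer output.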