After running my stm successfully a few times, I now get this error message every time I try to run it:
UNRELIABLE VALUE: Future (‘<none>’) unexpectedly generated random numbers without specifying argument 'seed'. There is a risk that those random numbers are not statistically sound and the overall results might be invalid. To fix this, specify 'seed=TRUE'. This ensures that proper, parallel-safe random numbers are produced via the L'Ecuyer-CMRG method. To disable this check, use 'seed=NULL', or set option 'future.rng.onMisuse' to "ignore".

Here is the code I ran:
many_models <- data_frame(K = c(10, 20, 30, 40, 50, 60)) %>%
mutate(topic_model = …

Following the explanation in Albrecht, Jens, Sidharth Ramachandran, and Christian Winkler, Blueprints for Text Analytics Using Python: Machine Learning-Based Solutions for Common Real World (NLP) Applications. First edition. Sebastopol, CA: O'Reilly Media, Inc, 2020, p. 209 ff., I am trying to do topic modeling (with German stop words and German text).
# Load Data
import pandas as pd
# load the csv file via read_csv
xlsx = pd.ExcelFile("Priorisierung_der_Anforderungen.xlsx")
df = pd.read_excel(xlsx)

# convert Anforderungsbeschreibung to string
df = df.astype({'Anforderungsbeschreibung': 'string'})
df.info()

# "Ignore spaces after the stop..."
import re
df["paragraphs"] = df["Anforderungsbeschreibung"].map(lambda text: re.split('\\.\\s*\\n', text))
df["number_of_paragraphs"] = df["paragraphs"].map(len)

%matplotlib inline
df.groupby('Title').agg({'number_of_paragraphs': 'mean'}).plot.bar(figsize=(24, 12))

# Preparations
from sklearn.feature_extraction.text import TfidfVectorizer
from spacy.lang.de.stop_words import STOP_WORDS as stopwords

tfidf_text_vectorizer = TfidfVectorizer(stop_words=stopwords, min_df=5, max_df=0.7)
tfidf_text_vectors = tfidf_text_vectorizer.fit_transform(df['Anforderungsbeschreibung'])
tfidf_text_vectors.shape

I get this error message:
InvalidParameterError: The 'stop_words' parameter of TfidfVectorizer must be a str among {'english'}, …
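A likely fix, sketched below under the assumption that the spaCy STOP_WORDS frozenset is what trips this check: recent scikit-learn releases validate stop_words and only accept the string 'english', a list, or None, so converting the set to a list before passing it usually resolves the error (df is the DataFrame from the code above).

# Minimal sketch, not the book's original code: pass the German stop words as a list,
# since scikit-learn's stop_words parameter rejects sets/frozensets in newer versions.
from sklearn.feature_extraction.text import TfidfVectorizer
from spacy.lang.de.stop_words import STOP_WORDS

tfidf_text_vectorizer = TfidfVectorizer(stop_words=list(STOP_WORDS), min_df=5, max_df=0.7)
tfidf_text_vectors = tfidf_text_vectorizer.fit_transform(df['Anforderungsbeschreibung'])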
Is there a stop word list that people typically use for removing punctuation and closed-class words (e.g. he, she, it) when doing NLP or IR/IE related tasks? I have been trying topic modeling with Gibbs sampling for word sense disambiguation, and it keeps giving punctuation marks high probabilities because they occur so frequently in the corpus. https://github.com/christianscheible/BNB/blob/master/nb_gibbs.py
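There is no single canonical list, but a common starting point is NLTK's per-language stop word lists combined with Python's built-in punctuation set; a minimal sketch follows (token_list is a hypothetical stand-in for the corpus tokens, not data from the question):

import string
import nltk
nltk.download('stopwords')  # one-time download of the NLTK stop word corpora
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))  # closed-class words such as he, she, it
stop_words.update(string.punctuation)         # plus ASCII punctuation marks

token_list = ['he', 'said', ',', 'the', 'model', 'works', '.']  # hypothetical example tokens
filtered = [tok for tok in token_list if tok.lower() not in stop_words]
print(filtered)  # ['said', 'model', 'works']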
I am relatively new to MALLET and need to know: are the words within each topic that MALLET produces ranked in some way? If so, what is the ordering, i.e. is the first word in a topic's list the one with the highest distribution over the whole corpus?
Thanks!
Here is part of the dataset I will be using:
u'tff prep normalized clean water permability ncwp result outside operating range',
u'technician inadvertently omitted documenting initial room \u201c cleaned sanitized field form',
u'sunflower seed observed floor room 1',
Here is the code I am using:
tfidf_model = vectorizer.fit_transform(input_document_lower)
tfidf_feature_names = vectorizer.get_feature_names()
nmf = NMF(n_components=no_topics, random_state=1, alpha=.1, l1_ratio=.5, init='nndsvd').fit(tfidf_model)
As the title says, I am getting the following error:
IndexError: index 4 is out of bounds for axis 1 with size 4
Honestly, I am not sure how to even start debugging this. I built an LDA model with the same dataset without any problems. Any help would be much appreciated.
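A hedged debugging sketch, assuming the IndexError comes from an index that exceeds one of the matrix dimensions (for example when printing top words per topic) rather than from something deeper inside fit: checking the shapes and selecting word indices with argsort keeps every lookup inside the bounds of the components_ matrix. tfidf_model, tfidf_feature_names, and nmf are the objects from the code above.

import numpy as np

# Hypothetical helper, not from the original post: print the top words of each NMF topic
# while always staying inside the bounds of the components_ matrix.
def display_topics(model, feature_names, n_top_words=10):
    for topic_idx, topic in enumerate(model.components_):
        n = min(n_top_words, len(feature_names))
        top_indices = np.argsort(topic)[::-1][:n]
        print(f"Topic {topic_idx}: " + " ".join(feature_names[i] for i in top_indices))

# Sanity checks: the number of topics must not exceed the number of features.
print(tfidf_model.shape)      # (n_documents, n_features)
print(nmf.components_.shape)  # (no_topics, n_features)
display_topics(nmf, tfidf_feature_names)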
I am using gensim (in a Jupyter notebook) for topic modeling. I successfully created a model and visualized it. Here is the code:
import time
start_time = time.time()
import re
import spacy
import nltk
import pyLDAvis
import pyLDAvis.gensim
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.ERROR)
import warnings
warnings.filterwarnings("ignore",category=DeprecationWarning)
# nlp = spacy.load('en')
stop_word_list = nltk.corpus.stopwords.words('english')
stop_word_list.extend(['from', 'subject', 're', 'edu', 'use'])
df = pd.read_csv('Topic_modeling.csv')
data = df.Articles.values.tolist()
# Remove Emails …

When training a Top2Vec model in Python 3.9.2, I get the following error:
AttributeError Traceback (most recent call last)
<ipython-input-17-edc5d3cec713> in <module>
----> 1 model = Top2Vec(documents=data, speed="learn", workers=12)
~/opt/anaconda3/envs/py39/lib/python3.9/site-packages/top2vec/Top2Vec.py in __init__(self, documents, min_count, embedding_model, embedding_model_path, speed, use_corpus_file, document_ids, keep_documents, workers, tokenizer, use_embedding_model_tokenizer, umap_args, hdbscan_args, verbose)
353 'metric': 'cosine'}
354
--> 355 umap_model = umap.UMAP(**umap_args).fit(self._get_document_vectors(norm=False))
356
357 # find dense areas of document vectors
~/opt/anaconda3/envs/py39/lib/python3.9/site-packages/top2vec/Top2Vec.py in _get_document_vectors(self, norm)
545 return self.model.docvecs.vectors_docs_norm
546 else:
--> 547 return self.model.docvecs.vectors_docs
548 else:
549 return self.document_vectors
AttributeError: 'KeyedVectors' object has no …
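One plausible cause, offered as an assumption rather than a confirmed diagnosis: gensim 4.x removed the docvecs.vectors_docs attribute that older Top2Vec releases read, which produces exactly this kind of AttributeError. Checking the installed versions, and then either upgrading top2vec or pinning gensim below 4.0, is the usual remedy.

# Hedged check of the installed versions (Python 3.8+; both packages are assumed installed).
from importlib.metadata import version
print("gensim:", version("gensim"))
print("top2vec:", version("top2vec"))
# If gensim is >= 4.0 and top2vec predates gensim-4 support, try: pip install -U top2vec
# or pin the older API: pip install "gensim<4.0.0"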
I use the following function to convert topicmodels output into JSON output for use with LDAvis.

topicmodels_json_ldavis <- function(fitted, corpus, doc_term){
## Required packages
library(topicmodels)
library(dplyr)
library(stringi)
library(tm)
library(LDAvis)
## Find required quantities
phi <- posterior(fitted)$terms %>% as.matrix
theta <- posterior(fitted)$topics %>% as.matrix
vocab <- colnames(phi)
doc_length <- vector()
for (i in 1:length(corpus)) {
temp <- paste(corpus[[i]]$content, collapse = ' ')
doc_length <- c(doc_length, stri_count(temp, regex = '\\S+'))
}
temp_frequency <- inspect(doc_term)
freq_matrix <- data.frame(ST = colnames(temp_frequency),
Freq = colSums(temp_frequency))
rm(temp_frequency)
## Convert to json
json_lda <- LDAvis::createJSON(phi = phi, theta = theta,
vocab = …