Asked by Geo Y. (score: 3) | Tags: r, classification, tf-idf, tidytext, tidymodels
I am struggling with text classification on a large dataset of tweets, and I would appreciate it if someone could point me in the right direction.
Overall, I need to train a classifier to distinguish two classes on a huge dataset (up to 6 million texts). I have been doing this in the recipes framework and then running a glmnet lasso through tidymodels. The specific problem is that I run out of memory when computing tf-idf.
Which direction should I take to solve this? In principle I could compute all the tf-idf values manually in batches and then manually combine them into a sparse matrix object. That sounds tedious, and surely someone has hit this problem before and solved it? Another option is Spark, but it is well beyond my current abilities and probably overkill for a one-off task. Or maybe I am missing something, and existing tools can already do this?
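To make the "manual batching" idea concrete, here is roughly what I have in mind (an untested sketch; it assumes a tibble corpus with ID, text and Class columns as in the reproducible example further down, and the batch size is arbitrary). Token counts can be accumulated batch by batch in long format, which is already sparse, and tf-idf is computed once at the end because it needs global document frequencies:

library(dplyr)
library(tidytext)
library(Matrix)

# count tokens for one batch of documents (long format: ID, word, n)
tokenize_batch <- function(batch) {
  batch %>%
    unnest_tokens(word, text) %>%
    count(ID, word)
}

# process the corpus in chunks of 100k documents and row-bind the counts;
# the long (ID, word, n) format only stores non-zero entries, so it stays small
counts <- corpus %>%
  mutate(batch = (row_number() - 1) %/% 100000) %>%
  group_split(batch) %>%
  lapply(tokenize_batch) %>%
  bind_rows()

# tf-idf over the full corpus, then cast straight to a sparse dgCMatrix
dtm <- counts %>%
  bind_tf_idf(word, ID, n) %>%
  cast_sparse(ID, word, tf_idf)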
Specifically, I run into two kinds of problems when running the following (the variables should be self-explanatory, and I will provide full reproducible code further down):
recipe <-
  recipe(Class ~ text, data = corpus) %>% 
  step_tokenize(text) %>%
  step_stopwords(text) %>% 
  step_tokenfilter(text, max_tokens = m) %>% 
  step_tfidf(text) %>% 
  prep()

If corpus is too large or m is too big, RStudio simply crashes. If they are merely fairly large, it throws a warning:
In asMethod(object) :
  sparse->dense coercion: allocating vector of size 1.2 GiB

I could not find anything about this online, and I do not quite understand it. Why is it trying to coerce something from sparse to dense? That is surely going to cause trouble for any large dataset. Am I doing something wrong? If this is preventable, maybe I will have better luck with my full dataset?
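For what it is worth, the warning itself appears to come from the Matrix package, i.e. something downstream of the recipe is materialising a sparse matrix as a dense one. A small standalone illustration (the sizes are arbitrary, and I am not certain the warning threshold is the same in every Matrix version):

library(Matrix)

m <- rsparsematrix(nrow = 20000, ncol = 10000, density = 0.001)  # ~200k non-zero entries
format(object.size(m), units = "MB")   # only a few MB while sparse

dense <- as(m, "matrix")               # the same sparse->dense coercion; allocates ~1.5 GiB
format(object.size(dense), units = "GB")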
Or is there simply no hope of step_tfidf coping with 6 million observations and no cap on the maximum number of tokens?
P.S. tm, and even tidytext, cannot even begin to handle this.
Here is a reproducible example of what I am trying to do. This code sets up a corpus of 5M+ tweet-length texts made of random words:
library(tidymodels)
library(dplyr)
library(stringr)
library(textrecipes)
library(hardhat)

url <- "https://gutenberg.org/cache/epub/2701/pg2701-images.html"
words <- readLines(url, encoding = "UTF-8") %>% str_extract_all('\\w+\\b') %>% unlist()
x <- rnorm(n = 6000000, mean = 18, sd = 14)
x <- x[x > 0]

corpus <- 
  lapply(x, function(i) {
    c('text' = paste(sample(words, size = i, replace = TRUE), collapse = ' '))
  }) %>% 
  bind_rows() %>% 
  mutate(ID = 1:n(), Class = factor(sample(c(0, 1), n(), replace = TRUE)))

So corpus looks like this:
> corpus
# A tibble: 5,402,638 × 3
   text                                                                                                                                   ID Class
   <chr>                                                                                                                               <int> <fct>
 1 included Fast at can aghast me some as article and ship things is                                                                       1 1    
 2 him to quantity while became man was childhood it that Who in on his the is                                                             2 1    
 3 no There a pass are it in evangelical rather in direst the in a even reason to Yes and the this unconditional his clear other thou all…  3 0    
 4 this would against his You disappeared have summit the vagrant in fine inland is scrupulous signifies that come the the buoyed and of …  4 1    
 5 slippery the Judge ever life Moby But i will after sounding ship like p he Like                                                          5 1    
 6 at can hope running                                                                                                                      6 1    
 7 Jeroboam even there slow though thought though I flukes yarn swore called p oarsmen with sort who looked and sharks young Radney s       7 1    
 8 not if rocks ever lantern go last though at you white his that remains of primal Starbuck sans you steam up with against                 8 1    
 9 Nostril as p full the furnish are nor made towards except bivouacks p blast how never now are here of difference it whalemen s much th…  9 1    
10 and p multitudinously body Archive fifty was of Greenland                                                                               10 0    
# ℹ 5,402,628 more rows
# ℹ Use `print(n = ...)` to see more rows

By itself it takes up about 1 GB of RAM.
I follow a standard modeling workflow, which I will show here in full for completeness.
# prep
corpus_split <- initial_split(corpus, strata = Class) # split
corpus_train <- training(corpus_split)
corpus_test <- testing(corpus_split)
folds <- vfold_cv(corpus_train) # k-fold cv prep
sparse_bp <- hardhat::default_recipe_blueprint(composition = "dgCMatrix") # use sparse matrices
smaller_lambda <- grid_regular(penalty(range = c(-5, 0)), levels = 20) # hyperparameter calibration

# recipe
recipe <-
  recipe(Class ~ text, data = corpus_train) %>% 
  step_tokenize(text) %>%
  step_stopwords(text, custom_stopword_source = 'twclid') %>% 
  step_tokenfilter(text, max_tokens = 10000) %>% 
  step_tfidf(text)

# lasso model
lasso_spec <- logistic_reg(penalty = tune(), mixture = 1) %>% # tuning the penalty hyperparameter
  set_mode("classification") %>%
  set_engine("glmnet")

# workflow
sparse_wf <- workflow() %>%
  add_recipe(recipe, blueprint = sparse_bp) %>%
  add_model(lasso_spec)

# fit
sparse_rs <- tune_grid(
  sparse_wf,
  folds,
  grid = smaller_lambda
)
Answer (score: 5)
Sadly, there is not much you can do within tidymodels right now to solve your task. The {tidymodels} set of packages revolves around using {tibble}s as their common data container. This works well in many situations, but sparse data is the exception.
When a recipe is used inside a workflow, the data needs to be passed to parsnip as a tibble. This requires the data to be non-sparse, which in your case makes the data size explode! I.e. if you have 6,000,000 observations and 2,000 distinct tokens, you end up with 96 GB...
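For reference, the 96 GB figure is just the cost of a dense double matrix at 8 bytes per cell:

n_docs   <- 6e6
n_tokens <- 2000
n_docs * n_tokens * 8 / 1e9   # ~96 GB for a dense 6,000,000 x 2,000 tf-idf matrix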
This is something I would love to see happen at some point (I am the author of {textrecipes} and one of the developers on the tidymodels team), but it is currently outside my control, because we need to find a way to support sparse data in tibbles.