Asked by Geo Y. (score: 3) | Tags: r, classification, tf-idf, tidytext, tidymodels
I am struggling with text classification on a large dataset of tweets, and I would appreciate it if someone could point me in the right direction.
Overall, I need to train a classifier to distinguish two classes on a huge dataset (up to 6 million texts). I have been doing this in the recipes framework and then running a glmnet lasso through tidymodels. The specific problem is that I run out of memory when computing tf-idf.
Which direction should I take to solve this? In principle I could compute all the tf-idf values manually in batches and then manually combine them into a sparse matrix object. That sounds tedious, and surely someone has hit this problem before and solved it? Another option is Spark, but it is well beyond my current abilities and probably overkill for a one-off task. Or maybe I am missing something, and existing tools can already do this?
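To make the "manual batching" idea concrete, here is roughly what I have in mind (an untested sketch; it assumes a tibble corpus with ID, text and Class columns as in the reproducible example further down, and the batch size is arbitrary). Token counts can be accumulated batch by batch in long format, which is already sparse, and tf-idf is computed once at the end because it needs global document frequencies:

library(dplyr)
library(tidytext)
library(Matrix)

# count tokens for one batch of documents (long format: ID, word, n)
tokenize_batch <- function(batch) {
  batch %>%
    unnest_tokens(word, text) %>%
    count(ID, word)
}

# process the corpus in chunks of 100k documents and row-bind the counts;
# the long (ID, word, n) format only stores non-zero entries, so it stays small
counts <- corpus %>%
  mutate(batch = (row_number() - 1) %/% 100000) %>%
  group_split(batch) %>%
  lapply(tokenize_batch) %>%
  bind_rows()

# tf-idf over the full corpus, then cast straight to a sparse dgCMatrix
dtm <- counts %>%
  bind_tf_idf(word, ID, n) %>%
  cast_sparse(ID, word, tf_idf)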
Specifically, I run into two kinds of problems when running the following (the variables should be self-explanatory, and I will provide full reproducible code further down):
recipe <-
  recipe(Class ~ text, data = corpus) %>% 
  step_tokenize(text) %>%
  step_stopwords(text) %>% 
  step_tokenfilter(text, max_tokens = m) %>% 
  step_tfidf(text) %>% 
  prep()

If corpus is too large or m is too big, RStudio simply crashes. If they are merely fairly large, it throws a warning:
In asMethod(object) :
  sparse->dense coercion: allocating vector of size 1.2 GiB

I could not find anything about this online, and I do not quite understand it. Why is it trying to coerce something from sparse to dense? That is surely going to cause trouble for any large dataset. Am I doing something wrong? If this is preventable, maybe I will have better luck with my full dataset?
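For what it is worth, the warning itself appears to come from the Matrix package, i.e. something downstream of the recipe is materialising a sparse matrix as a dense one. A small standalone illustration (the sizes are arbitrary, and I am not certain the warning threshold is the same in every Matrix version):

library(Matrix)

m <- rsparsematrix(nrow = 20000, ncol = 10000, density = 0.001)  # ~200k non-zero entries
format(object.size(m), units = "MB")   # only a few MB while sparse

dense <- as(m, "matrix")               # the same sparse->dense coercion; allocates ~1.5 GiB
format(object.size(dense), units = "GB")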
Or is there simply no hope of step_tfidf coping with 6 million observations and no cap on the maximum number of tokens?
P.S. tm, and even tidytext, cannot even begin to handle this.
Here is a reproducible example of what I am trying to do. This code sets up a corpus of 5M+ tweet-length texts made of random words:
library(tidymodels)
library(dplyr)
library(stringr)
library(textrecipes)
library(hardhat)

url <- "https://gutenberg.org/cache/epub/2701/pg2701-images.html"
words <- readLines(url, encoding = "UTF-8") %>% str_extract_all('\\w+\\b') %>% unlist()
x <- rnorm(n = 6000000, mean = 18, sd = 14)
x <- x[x > 0]

corpus <- 
  lapply(x, function(i) {
    c('text' = paste(sample(words, size = i, replace = TRUE), collapse = ' '))
  }) %>% 
  bind_rows() %>% 
  mutate(ID = 1:n(), Class = factor(sample(c(0, 1), n(), replace = TRUE)))

So corpus looks like this:
> corpus
# A tibble: 5,402,638 × 3
   text                                                                                                                                   ID Class
   <chr>                                                                                                                               <int> <fct>
 1 included Fast at can aghast me some as article and ship things is                                                                       1 1    
 2 him to quantity while became man was childhood it that Who in on his the is                                                             2 1    
 3 no There a pass are it in evangelical rather in direst the in a even reason to Yes and the this unconditional his clear other thou all…  3 0    
 4 this would against his You disappeared have summit the vagrant in fine inland is scrupulous signifies that come the the buoyed and of …  4 1    
 5 slippery the Judge ever life Moby But i will after sounding ship like p he Like                                                          5 1    
 6 at can hope running                                                                                                                      6 1    
 7 Jeroboam even there slow though thought though I flukes yarn swore called p oarsmen with sort who looked and sharks young Radney s       7 1    
 8 not if rocks ever lantern go last though at you white his that remains of primal Starbuck sans you steam up with against                 8 1    
 9 Nostril as p full the furnish are nor made towards except bivouacks p blast how never now are here of difference it whalemen s much th…  9 1    
10 and p multitudinously body Archive fifty was of Greenland                                                                               10 0    
# ℹ 5,402,628 more rows
# ℹ Use `print(n = ...)` to see more rows

By itself it takes up about 1 GB of RAM.
I follow a standard modeling workflow, which I will show here in full for completeness.
# prep
corpus_split <- initial_split(corpus, strata = Class) # split
corpus_train <- training(corpus_split)
corpus_test <- testing(corpus_split)
folds <- vfold_cv(corpus_train) # k-fold cv prep
sparse_bp <- hardhat::default_recipe_blueprint(composition = "dgCMatrix") # use sparse matrices
smaller_lambda <- grid_regular(penalty(range = c(-5, 0)), levels = 20) # hyperparameter calibration

# recipe
recipe <-
  recipe(Class ~ text, data = corpus_train) %>% 
  step_tokenize(text) %>%
  step_stopwords(text, custom_stopword_source = 'twclid') %>% 
  step_tokenfilter(text, max_tokens = 10000) %>% 
  step_tfidf(text)

# lasso model
lasso_spec <- logistic_reg(penalty = tune(), mixture = 1) %>% # tuning the penalty hyperparameter
  set_mode("classification") %>%
  set_engine("glmnet")

# workflow
sparse_wf <- workflow() %>%
  add_recipe(recipe, blueprint = sparse_bp) %>%
  add_model(lasso_spec)

# fit
sparse_rs <- tune_grid(
  sparse_wf,
  folds,
  grid = smaller_lambda
)
Answer (score: 5)
Sadly, there is not much you can do within tidymodels right now to solve your task. The {tidymodels} set of packages revolves around using {tibble}s as their common data container. This works well in many situations, but sparse data is the exception.
When a recipe is used inside a workflow, the data needs to be passed to parsnip as a tibble. This requires the data to be non-sparse, which in your case makes the data size explode! I.e. if you have 6,000,000 observations and 2,000 distinct tokens, you end up with 96 GB...
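For reference, the 96 GB figure is just the cost of a dense double matrix at 8 bytes per cell:

n_docs   <- 6e6
n_tokens <- 2000
n_docs * n_tokens * 8 / 1e9   # ~96 GB for a dense 6,000,000 x 2,000 tf-idf matrix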
This is something I would love to see happen at some point (I am the author of {textrecipes} and one of the developers on the tidymodels team), but it is currently outside my control, because we need to find a way to support sparse data in tibbles.