标签: text-mining

from sklearn.feature_extraction.text import TfidfVectorizer

self.vocabulary = "a list of words I want to look for in the documents".split()
self.vect = TfidfVectorizer(sublinear_tf=True, max_df=0.5, analyzer='word', 
                 stop_words='english')
self.vect.fit_transform(self.vocabulary)

...

doc = "some string I want to get tf-idf vector for"
tfidf = self.vect.transform(doc)

Run Code Online (Sandbox Code Playgroud)

问题是这会返回一个包含n行的矩阵,其中n是我的doc字符串的大小.我希望它只返回一个代表整个字符串的tf-idf的向量.我怎样才能将字符串视为单个文档,而不是每个字符都是文档？另外,我对文本挖掘很新,所以如果我在概念上做错了,那就太棒了.任何帮助表示赞赏.

python document text-mining tf-idf

Ste*_*ing

2016 04-10

36
推荐指数

1
解决办法

5万
查看次数

R-Project没有适用于'meta'的方法应用于类"character"的对象

我正在尝试运行此代码(Ubuntu 12.04,R 3.1.1)

# Load requisite packages
library(tm)
library(ggplot2)
library(lsa)

# Place Enron email snippets into a single vector.
text <- c(
  "To Mr. Ken Lay, I’m writing to urge you to donate the millions of dollars you made from selling Enron stock before the company declared bankruptcy.",
  "while you netted well over a $100 million, many of Enron's employees were financially devastated when the company declared bankruptcy and their retirement plans were wiped out",
  "you sold $101 million worth …

Run Code Online (Sandbox Code Playgroud)

r text-mining tm

use*_*137

2016 08-12

32
推荐指数

2
解决办法

4万
查看次数

检测R中的文本语言

在RI有一个推文列表,我想只保留那些英文.

我想知道你是否有人知道一个R包提供了一种识别字符串语言的简单方法.

干杯,z

r text-mining

zol*_*oth

lucky-day

30
推荐指数

6
解决办法

2万
查看次数

'utf8towcs'中的r tm包无效输入

我正在尝试使用R中的tm包来执行一些文本分析.我绑了以下内容:

require(tm)
dataSet <- Corpus(DirSource('tmp/'))
dataSet <- tm_map(dataSet, tolower)
Error in FUN(X[[6L]], ...) : invalid input 'RT @noXforU Erneut riesiger (Alt-)?lteppich im Golf von Mexiko (#pics vom Freitag) http://bit.ly/bw1hvU http://bit.ly/9R7JCf #oilspill #bp' in 'utf8towcs'

Run Code Online (Sandbox Code Playgroud)

问题是某些字符无效.我想从R中或在导入文件进行处理之前从分析中排除无效字符.

我尝试使用iconv将所有文件转换为utf-8并排除任何无法转换为的内容,如下所示:

find . -type f -exec iconv -t utf-8 "{}" -c -o tmpConverted/"{}" \;

Run Code Online (Sandbox Code Playgroud)

正如在此指出的那样使用iconv将latin-1文件批量转换为utf-8

但我仍然得到同样的错误.

我很感激任何帮助.

r utf-8 text-mining iconv

mai*_*ini

2017 05-23

27
推荐指数

6
解决办法

4万
查看次数

大规模机器学习

我需要在一个大数据集(10-100亿条记录)上运行各种机器学习技术.问题主要是文本挖掘/信息提取,包括各种内核技术但不限于它们(我们使用一些贝叶斯方法,自举,渐变提升,回归树 - 许多不同的问题和解决方法)

什么是最好的实施？我在ML方面经验丰富,但是对于大型数据集没有多少经验.是否有任何可扩展和可定制的机器学习库利用MapReduce基础设施强烈偏好c ++,但Java和python是可以的亚马逊Azure或自己的数据中心(我们可以买得起)？

c++ java mapreduce machine-learning text-mining

use*_*263

lucky-day

26
推荐指数

2
解决办法

2908
查看次数

使用R TM包找到2和3个单词的短语

我试图找到一个实际上可以找到R文本挖掘包中最常用的两个和三个单词短语的代码(也许还有另一个我不知道的包).我一直在尝试使用标记器,但似乎没有运气.

如果您过去曾处理过类似情况,您是否可以发布经过测试且实际有效的代码？非常感谢!

r data-mining text-mining

app*_*ree

lucky-day

24
推荐指数

3
解决办法

3万
查看次数

使用多个核时,tm_map转换函数的行为不一致

这篇文章的另一个潜在标题可能是"当r中的并行处理时,核心数量,循环块大小和对象大小之间的比例是否重要？"

我有一个语料库我正在运行一些使用tm包的转换.由于语料库很大,我使用多路并行包进行并行处理.

有时转换完成任务,但有时它们不会.例如,tm::removeNumbers().语料库中的第一个文档的内容值为"n417".因此,如果预处理成功,那么此doc将转换为"n".

下面的样本语料库用于复制.这是代码块:

library(tidyverse)
library(qdap)
library(stringr)
library(tm)
library(textstem)
library(stringi)
library(foreach)
library(doParallel)
library(SnowballC)

  corpus <- (see below)
  n <- 100 # this is the size of each chunk in the loop

  # split the corpus into pieces for looping to get around memory issues with transformation
  nr <- length(corpus)
  pieces <- split(corpus, rep(1:ceiling(nr/n), each=n, length.out=nr))
  lenp <- length(pieces)

  rm(corpus) # save memory

  # save pieces to rds files since not enough RAM
  tmpfile <- tempfile() 
  for (i in …

Run Code Online (Sandbox Code Playgroud)

parallel-processing r text-mining tm doparallel

Dou*_*Fir

2017 08-31

24
推荐指数

1
解决办法

607
查看次数