Tag: text-mining

How do I "split" text documents or text strings in R so that each word gets its own row in a data frame?

documents <- c("This is document number one", "document two is the second element of the vector")

The data frame I am trying to create is:

idealdf <- c("this", "is", "document", "number", "one", "document", "two", "is", "the", "second", "element", "of", "the", "vector")

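I have been using the tm package to convert my documents into a corpus and to remove punctuation, convert to lowercase, and so on, with the following functions: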

library(tm)

#create a corpus:
myCorpus <- Corpus(VectorSource(documents))

#convert to lowercase:
myCorpus <- tm_map(myCorpus, content_transformer(tolower))

#remove punctuation:
myCorpus <- tm_map(myCorpus, removePunctuation)

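...but I'm having trouble getting it into a data frame where each word has its own row (I prefer each word to have its own row, even if the same word appears on multiple rows).

Thanks.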

r corpus text-mining tm

Rya*_*ase

lucky-day

0
votes

1
answer

2948
views

How to extract word frequencies from a document-term matrix?

I am doing LDA analysis in Python. I created a document-term matrix with the following code:

corpus = [dictionary.doc2bow(text) for text in texts]

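Is there a simple way to count word frequencies across the whole corpus? Since my dictionary is a list of term ids, I think I could match the word frequencies to the term ids.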
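One possible approach, sketched under assumptions: each document produced by `doc2bow` is a list of `(term_id, count)` pairs, so corpus-wide frequencies can be obtained by summing the counts per term id and then mapping ids back to tokens. The `id2token` dict and the tiny `corpus` below are hypothetical stand-ins for the asker's gensim dictionary and bag-of-words corpus; with a real gensim `Dictionary`, the lookup `dictionary[term_id]` would take the place of `id2token[term_id]`.

```python
from collections import Counter

# Hypothetical stand-ins for the gensim objects in the question:
# id2token plays the role of the dictionary, corpus the doc2bow output.
id2token = {0: "human", 1: "computer", 2: "interface"}
corpus = [
    [(0, 1), (1, 2)],   # document 1: (term_id, count) pairs
    [(1, 1), (2, 3)],   # document 2
]

def corpus_term_frequencies(bow_corpus):
    """Sum the per-document counts of every term id over the whole corpus."""
    totals = Counter()
    for doc in bow_corpus:
        for term_id, count in doc:
            totals[term_id] += count
    return totals

totals = corpus_term_frequencies(corpus)

# Map term ids back to their tokens.
word_freq = {id2token[term_id]: n for term_id, n in totals.items()}
print(word_freq)
```

Keeping the totals keyed by term id until the last step means the same function works regardless of how the id-to-token mapping is stored.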

python dictionary text-mining

Aeg*_* Wu

2016 06-17

0
votes

1
answer

5725
views

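How to get the most common phrases or words in Python or R

Given some text, how can I get the most common n-grams for n = 1 to 6? I have seen ways to get 3-grams or 2-grams, one n at a time, but is there a way to extract the most meaningful maximum-length phrases along with all the rest?

For example, in this text, for demonstration purposes only: fri evening commute can be long. some people avoid fri evening commute by choosing off-peak hours. there are much less traffic during off-peak.

The ideal result, n-grams with their counts, would be: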

fri evening commute: 3,
off-peak: 2,
rest of the words: 1

Any suggestions are appreciated. Thanks.
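A minimal sketch of one way to approach this in Python, not a definitive answer: count every n-gram for n = 1 to 6 within each sentence, then keep only the repeated n-grams that are not contained in a longer n-gram with the same count. The function names, the tokenizer regex (which keeps hyphenated words such as "off-peak" together), and the "same count" rule for picking maximal phrases are all assumptions made for illustration.

```python
import re
from collections import Counter

def ngram_counts(text, n_min=1, n_max=6):
    """Count all n-grams (n_min..n_max words) without crossing sentence ends."""
    counts = Counter()
    for sentence in re.split(r"[.!?]+", text.lower()):
        # Keep hyphenated words like "off-peak" as a single token.
        tokens = re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", sentence)
        for n in range(n_min, n_max + 1):
            for i in range(len(tokens) - n + 1):
                counts[" ".join(tokens[i:i + n])] += 1
    return counts

def maximal_repeated(counts):
    """Keep repeated n-grams not covered by a longer one with the same count."""
    repeated = {g: c for g, c in counts.items() if c > 1}
    return {
        g: c
        for g, c in repeated.items()
        if not any(
            h != g and repeated[h] == c and f" {g} " in f" {h} "
            for h in repeated
        )
    }

text = (
    "fri evening commute can be long. some people avoid fri evening "
    "commute by choosing off-peak hours. there are much less traffic "
    "during off-peak."
)

counts = ngram_counts(text)
phrases = maximal_repeated(counts)
print(phrases)
```

On the sample text, "fri evening commute" occurs twice, so this sketch reports a count of 2 for it; words that occur only once remain available in `counts`.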

python nlp r text-mining

san*_*oku

lucky-day

0
votes

1
answer

3366
views

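Extract dates in any format from text in R

I want to extract dates from a given text. The dates can be in any format, such as 10 April 2018, 10-04-2018, 10/04/2018, 2018/04/10, 04.10.2018, and other formats...

I have news data and want to extract the dates from the text.

For example: My friend is coming on 10 July 2018 or 10/07/2018

I want to extract the dates from the given text.

Please help.

Thanks in advance.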

datetime text-extraction r text-mining

rac*_*hit

2018 05-04

0
votes

1
answer

1761
views

How to split merged/glued words that have no delimiter using R

I used rvest in R to scrape the text keywords from this article page with the following code:

#install.packages("xml2") # required for rvest
library("rvest") # for web scraping
library("dplyr") # for data management

# start by reading the web page to be scraped
page <- read_html("https://www.sciencedirect.com/science/article/pii/S1877042810004568")
keyW <- page %>% html_nodes("div.Keywords.u-font-serif") %>% html_text() %>% paste(collapse = ",")

It gives me:

> keyW    
[1] "KeywordsPhysics curriculumTurkish education systemfinnish education systemPISAphysics achievement"

After removing the word "Keywords" and everything before it from the string with the following line of code:

keyW <- gsub(".*Keywords","", keyW)

The new keyW is:

[1] "Physics curriculumTurkish education systemfinnish education systemPISAphysics achievement"

However, the output I want is this list:

[1] "Physics curriculum" "Turkish education system" "finnish education …

r text-mining gsub strsplit rvest

Zaw*_*min

2021 01-29

0
votes

1
answer

93
views

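What are the basic algorithms for text mining?

I'm trying to build an application that mines some text from the Web, but I'm not sure what the best way to perform text mining is.

What I'm asking is which techniques/algorithms are most commonly used to perform text mining and to do some information retrieval within documents (rather than for indexing).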

nlp information-retrieval text-mining

Ren*_*ani

2011 11-05

-2
votes

1
answer

9337
views

How to remove consecutive uppercase characters from text in R?

For example, I have a text:

a <- "This IS A SAMple sentence TMP"

I want the output to be:

"This A ple sentence"

How can I do this? Is there a simpler way?

r text-mining

use*_*832

lucky-day

-3
votes

1
answer

489
views

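Text preprocessing in Spark-Scala

I want to apply preprocessing stages to a large amount of text data in Spark-Scala, such as lemmatization, stop-word removal (using Tf-Idf), and POS tagging. Is there any way to implement them in Spark-Scala?

For example, here is a sample of my data: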

The perfect fit for my iPod photo. Great sound for a great price. I use it everywhere. it is very usefulness for me.

After preprocessing:

perfect fit iPod photo great sound great price use everywhere very useful

The words also have POS tags, e.g. (iPod,NN) (photo,NN)

There is a POS tagger (sister.arizona); can it be used with Spark?

text preprocessor scala text-mining apache-spark

Esm*_*edi

2015 04-29

-3
votes

1
answer

5153
views

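Split a string at an uppercase letter, but only if it is followed by a lowercase letter, in Python

I am using pdfminer.six in Python to extract long text data. Unfortunately, the miner does not always work well, especially with paragraphs and text wrapping. For example, I get the following output: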

"2018Annual ReportInvesting for Growth and Market LeadershipOur CEO will provide you with all further details below."

--> "2018 Annual Report Investing for Growth and Market Leadership Our CEO will provide you with all further details below."

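Now I want to insert a space wherever a lowercase letter (or a digit) is followed by an uppercase letter and then a lowercase letter, so that "2018Annual" becomes "2018 Annual" and "ReportInvesting" becomes "Report Investing", while "...CEO..." stays "...CEO...".

I have only found solutions that split strings at uppercase letters, and /sf/answers/225134311/, but I could not rewrite them. Unfortunately, I am completely new to Python.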
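The rule described above can be sketched with a single regular expression, assuming ASCII letters only: a lookbehind requires a lowercase letter or digit before the split point, and a lookahead requires an uppercase letter followed by a lowercase letter after it. The function name `insert_spaces` is hypothetical, chosen for illustration.

```python
import re

# Zero-width position that is preceded by a lowercase letter or digit
# and followed by an uppercase letter plus a lowercase letter.
BOUNDARY = re.compile(r"(?<=[a-z0-9])(?=[A-Z][a-z])")

def insert_spaces(text):
    """Insert a space at every boundary matched by BOUNDARY."""
    return BOUNDARY.sub(" ", text)

raw = (
    "2018Annual ReportInvesting for Growth and Market LeadershipOur CEO "
    "will provide you with all further details below."
)
fixed = insert_spaces(raw)
print(fixed)
```

Because "CEO" is preceded by a space and its capitals are not followed by a lowercase letter, it is left untouched. Mixed-case names such as "iPod" would be split by this pattern, so it may need adjusting for such text.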

python split text-mining uppercase

Author

2020 11-15

-4
votes

1
answer

440
views