Tag: text-mining

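Clustering word2vec vectors with affinity propagation in Python (sklearn)

I want to cluster my word2vec vectors with affinity propagation and get the word at the center of each cluster.

My current code is as follows.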

from gensim.models import word2vec
from sklearn.cluster import AffinityPropagation

model = word2vec.Word2Vec.load("word2vec")
word_vectors = model.wv.syn0
affprop = AffinityPropagation(affinity="precomputed", damping=0.5)
af = affprop.fit(word_vectors)

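However, this raises the following error: ValueError: S must be a square array (shape=(77, 300))

As I understand it, 300 is the dimension of the word2vec hidden layer and 77 is my vocabulary size.

I just want to know how to use affinity propagation on word2vec vectors, which do not form a square matrix.

Please help me!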

python cluster-analysis text-mining scikit-learn word2vec
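
One way to read the error: `affinity="precomputed"` tells scikit-learn that the input is already an n x n similarity matrix, whereas the word vectors form an n x d feature matrix (77 x 300 here). The sketch below shows two possible workarounds. It assumes random stand-in vectors and a hypothetical `vocab` list in place of the trained model, and the choice of cosine similarity for the precomputed matrix is an assumption rather than something the question specifies.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation
from sklearn.metrics.pairwise import cosine_similarity

# Stand-in for model.wv.syn0: 77 words x 300 dimensions (random, illustration only).
rng = np.random.default_rng(0)
word_vectors = rng.normal(size=(77, 300))
vocab = [f"word{i}" for i in range(77)]  # hypothetical vocabulary, same order as the rows

# Option 1: pass the raw vectors and let scikit-learn derive the similarities
# (negative squared Euclidean distance).
af_raw = AffinityPropagation(affinity="euclidean", damping=0.5, random_state=0)
af_raw.fit(word_vectors)

# Option 2: build the square (77 x 77) similarity matrix yourself, then use "precomputed".
similarity = cosine_similarity(word_vectors)
af_pre = AffinityPropagation(affinity="precomputed", damping=0.5, random_state=0)
af_pre.fit(similarity)

# Exemplar indices map back to words through the vocabulary list.
center_words = [vocab[i] for i in af_pre.cluster_centers_indices_]
print(similarity.shape, center_words)
```

With real embeddings, `vocab` would have to be taken from the model in the same order as the rows of the vector matrix. If the algorithm does not converge, scikit-learn emits a warning and the list of exemplars may be empty, in which case a higher `damping` value is worth trying.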

Score: 4, answers: 1, views: 2102.

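Python - Matching and parsing strings that contain numbers/currency amounts

Suppose I have the following strings (inputs) in Python:

1) "$ 1,350,000" 2) "1.35 MM $" 3) "$ 1.35 M" 4) 1350000 (this one is a numeric value)

The number is clearly the same despite the different string representations. How can I match these strings, or in other words classify them as equal?

One approach would be to model the possible patterns with regular expressions. However, there may be cases I have not thought of.

Does anyone see an NLP solution to this problem?

Thanks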

python regex parsing currency text-mining
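
A minimal sketch of the regular-expression approach mentioned in the question: normalize every input to a single numeric value and compare the values instead of the strings. The helper name `parse_amount` and the suffix table are assumptions for illustration. The table covers only the suffixes that appear in the question, where both "M" and "MM" stand for millions.

```python
import re
from decimal import Decimal

# Assumed suffix meanings, taken from the question's examples only.
MULTIPLIERS = {"": 1, "M": 1_000_000, "MM": 1_000_000}

AMOUNT_RE = re.compile(r"(\d+(?:\.\d+)?)\s*([A-Za-z]*)")

def parse_amount(value):
    """Return the amount as a Decimal, whatever its textual form."""
    if isinstance(value, (int, float)):
        return Decimal(str(value))
    # Drop the currency symbol and thousands separators before matching.
    cleaned = value.replace("$", "").replace(",", "").strip()
    match = AMOUNT_RE.fullmatch(cleaned)
    if match is None or match.group(2).upper() not in MULTIPLIERS:
        raise ValueError(f"unrecognised amount: {value!r}")
    number, suffix = match.groups()
    return Decimal(number) * MULTIPLIERS[suffix.upper()]

inputs = ["$ 1,350,000", "1.35 MM $", "$ 1.35 M", 1350000]
amounts = [parse_amount(v) for v in inputs]
print(len(set(amounts)) == 1)  # True: all four inputs normalize to the same value
```

Using `Decimal` rather than `float` keeps the comparison exact. Formats outside the table raise `ValueError`, so unexpected patterns surface instead of being silently misparsed.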

Asked by mrt*_*mrt on 2018-01-21. Score: 4, answers: 1, views: 2265.

Extracting a list from text using regular expressions in Python

I want to extract a list of tuples from the following string:

text='''Consumer Price Index:
        +0.2% in Sep 2020

        Unemployment Rate:
        +7.9% in Sep 2020

        Producer Price Index:
        +0.4% in Sep 2020

        Employment Cost Index:
        +0.5% in 2nd Qtr of 2020

        Productivity:
        +10.1% in 2nd Qtr of 2020

        Import Price Index:
        +0.3% in Sep 2020

        Export Price Index:
        +0.6% in Sep 2020'''

I am using "import re" for this.

The output should look like: [('Consumer Price Index', '+0.2%', 'Sep 2020'), ...]

I want to use the re.findall function to produce the output above, and so far I have this:

re.findall(r"(:\Z)\s+(%\Z+)(\Ain )", text)

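I am trying to capture first the characters before ":", then the characters before "%", then the characters after "in".

I really don't know how to continue. Any help would be appreciated. Thanks!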

python regex text-mining
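
A minimal sketch of one pattern that produces the requested tuples, assuming every entry has the two-line shape shown in the question (a label ending in ":" followed by a line of the form "<percentage> in <period>"). Note that `\A` and `\Z` anchor the start and end of the whole string, so they cannot be used to mean "the characters before" or "the characters after". The sample text here is a shortened copy of the question's input.

```python
import re

text = '''Consumer Price Index:
        +0.2% in Sep 2020

        Employment Cost Index:
        +0.5% in 2nd Qtr of 2020'''

pattern = re.compile(
    r"^\s*(.+?):\s*\n"      # group 1: label up to the colon, then the line break
    r"\s*(\S+%) in (.+)$",  # group 2: percentage; group 3: everything after " in "
    re.MULTILINE,           # let ^ and $ match at every line boundary
)

rows = pattern.findall(text)
print(rows)
# [('Consumer Price Index', '+0.2%', 'Sep 2020'), ('Employment Cost Index', '+0.5%', '2nd Qtr of 2020')]
```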

Asked by bbu*_*net. Score: 4, answers: 1, views: 1060.

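How to count comma-separated values in a data frame?

I am trying to figure out how to get value_counts of the number of times each specific text value is listed in a column.

Example data: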

d = {'Title': ['Crash Landing on You', 'Memories of the Alhambra', 'The Heirs', 'While You Were Sleeping', 
'Something in the Rain', 'Uncontrollably Fond'], 
'Cast' : ['Hyun Bin,Son Ye Jin,Seo Ji Hye', 'Hyun Bin,Park Shin Hye,Park Hoon', 'Lee Min Ho,Park Shin Hye,Kim Woo Bin', 
'Bae Suzy,Lee Jong Suk,Jung Hae In', 'Son Ye Jin,Jung Hae In,Jang So Yeon', 'Kim Woo Bin,Bae Suzy,Im Joo Hwan']}

Title   Cast
0   Crash Landing on You    Hyun Bin,Son Ye Jin,Seo Ji Hye
1 …

python text-mining pandas
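
The question is cut off before it shows the desired output, so the following is only a sketch under the assumption that the goal is to count how often each name appears in the comma-separated "Cast" column. It splits each cell into a list, expands the lists to one row per name, and then applies `value_counts`.

```python
import pandas as pd

d = {'Title': ['Crash Landing on You', 'Memories of the Alhambra', 'The Heirs',
               'While You Were Sleeping', 'Something in the Rain', 'Uncontrollably Fond'],
     'Cast': ['Hyun Bin,Son Ye Jin,Seo Ji Hye', 'Hyun Bin,Park Shin Hye,Park Hoon',
              'Lee Min Ho,Park Shin Hye,Kim Woo Bin', 'Bae Suzy,Lee Jong Suk,Jung Hae In',
              'Son Ye Jin,Jung Hae In,Jang So Yeon', 'Kim Woo Bin,Bae Suzy,Im Joo Hwan']}
df = pd.DataFrame(d)

cast_counts = (
    df["Cast"]
    .str.split(",")   # each cell becomes a list of names
    .explode()        # one row per name
    .str.strip()      # tolerate stray spaces around the commas
    .value_counts()   # occurrences of each name across all titles
)
print(cast_counts)
```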

Asked by jik*_*lie on 2022-05-28. Score: 4, answers: 1, views: 2136.

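Tracking word proximity

I am working on a small project that involves dictionary-based text search over a collection of documents. My dictionary contains positive signal words (a.k.a. good words), but simply finding a word in the document collection does not guarantee a positive result, because negative words (for example: not, not significant) may appear near those positive words. I want to build a matrix that contains the document number, the positive word, and its proximity to negative words.

Can anyone suggest a way to do this? My project is at a very early stage, so I am giving a basic example of my text.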

No significant drug interactions have been reported in studies of candesartan cilexetil given with other drugs such as glyburide, nifedipine, digoxin, warfarin, hydrochlorothiazide.

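This is my sample document, where candesartan cilexetil, glyburide, nifedipine, digoxin, warfarin and hydrochlorothiazide are my positive words and "No significant" is my negative word. I want to build a (word-based) proximity mapping between my positive and negative words.

Can anyone provide some helpful pointers?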

r text-mining
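
A minimal Python sketch of the proximity matrix described above (the question itself is tagged R). It assumes single-token dictionary entries and measures proximity as the number of token positions between a positive word and the nearest negative word. The function name `proximity_rows` and the negative-word set are hypothetical.

```python
import re

def proximity_rows(doc_id, text, positive_words, negative_words):
    """Return (doc_id, positive word, nearest negative word, token distance) rows."""
    tokens = re.findall(r"[a-z]+", text.lower())
    negative_positions = [i for i, tok in enumerate(tokens) if tok in negative_words]
    rows = []
    for i, tok in enumerate(tokens):
        if tok in positive_words and negative_positions:
            # Pick the negative word with the smallest positional gap.
            nearest = min(negative_positions, key=lambda j: abs(i - j))
            rows.append((doc_id, tok, tokens[nearest], abs(i - nearest)))
    return rows

text = ("No significant drug interactions have been reported in studies of "
        "candesartan cilexetil given with other drugs such as glyburide, "
        "nifedipine, digoxin, warfarin, hydrochlorothiazide.")
positive = {"candesartan", "glyburide", "nifedipine", "digoxin", "warfarin",
            "hydrochlorothiazide"}
negative = {"no", "not"}  # assumed negation cues

for row in proximity_rows(1, text, positive, negative):
    print(row)
```

Each row can then be filtered by a distance threshold, for example to treat a positive word as negated only when a negative word occurs within a few tokens of it.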

Asked by Shr*_*nik. Score: 3, answers: 1, views: 844.

R: Merging text documents by index

I have a data frame that looks like this:

_________________id ________________text______
    1   | 7821             | "some text here"
    2   | 7821             |  "here as well"
    3   | 7821             |  "and here"
    4   | 567              |   "etcetera"
    5   | 567              |    "more text"
    6   | 231              |   "other text"

I want to group the text by ID so that I can run a clustering algorithm on it:

________________id___________________text______
    1   | 7821             | "some text here here as well and here"
    2   | 567              |   "etcetera more text"
    3   | 231              |   "other text"

Is there a way to do this? I am importing from a database table and have a lot of data, so I cannot do it by hand.

r text-mining
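
The grouping step can be sketched in Python as follows (the question itself is tagged R). The sketch assumes the rows arrive as (id, text) pairs, for example from a database cursor, and joins the texts of each id with a single space in their original order.

```python
rows = [
    (7821, "some text here"),
    (7821, "here as well"),
    (7821, "and here"),
    (567, "etcetera"),
    (567, "more text"),
    (231, "other text"),
]

# Collect the texts per id; dicts keep insertion order, so ids stay in first-seen order.
grouped = {}
for doc_id, text in rows:
    grouped.setdefault(doc_id, []).append(text)

merged = {doc_id: " ".join(parts) for doc_id, parts in grouped.items()}
print(merged)
# {7821: 'some text here here as well and here', 567: 'etcetera more text', 231: 'other text'}
```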

Asked by d12*_*12n. Score: 3, answers: 1, views: 181.

How to scrape web page content and then count word frequencies in R?

Here is my code:

library(XML)
library(RCurl)
url.link <- 'http://www.jamesaltucher.com/sitemap.xml'
blog <- getURL(url.link)
blog          <- htmlParse(blog, encoding = "UTF-8")
titles  <- xpathSApply (blog ,"//loc",xmlValue)             ## titles

traverse_each_page <- function(x){
  tmp <- htmlParse(x)
  xpathApply(tmp, '//div[@id="mainContent"]')
}
pages <- lapply(titles[2:3], traverse_each_page)

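Here is the pseudocode:

1. Take an XML document: http://www.jamesaltucher.com/sitemap.xml
2. Go to each link
3. Parse the HTML content of each link
4. Extract the text inside div id="mainContent"
5. Count the frequency of each word across all articles, case-insensitively.

I have managed to complete steps 1-4. I need some help with step 5.

Basically, if the word "the" appears twice in article 1 and 5 times in article 2, I want to know that "the" appears 7 times in total across the two articles.

Also, I don't know how to view the content I have extracted into pages. I would like to learn how to view it, which would make debugging easier.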

r text-mining web-scraping tm
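
The counting step (step 5) can be sketched in Python as follows (the question itself is tagged R). The sketch assumes the article texts have already been extracted into plain strings, and the two sample articles are made up so that "the" appears twice in the first and five times in the second.

```python
import re
from collections import Counter

def word_frequencies(articles):
    """Count words across all articles, ignoring case."""
    counts = Counter()
    for article in articles:
        # Lowercase first so that "The" and "the" are counted together.
        counts.update(re.findall(r"[a-z']+", article.lower()))
    return counts

articles = [
    "The cat sat on the mat.",
    "The dog chased the ball, the stick, the frisbee and the cat.",
]
totals = word_frequencies(articles)
print(totals["the"])          # 7
print(totals.most_common(3))  # the three most frequent words overall
```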

Asked by Kim*_*cks on 2013-11-08. Score: 3, answers: 1, views: 2189.

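How to convert a sparse matrix or simple_triplet_matrix to a tm-package document-term matrix in R without going through Corpus/VCorpus?

I have a sparseMatrix (library Matrix) or a simple_triplet_matrix (library slam) of docs x terms, for example: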

library(Matrix)
mat <- sparseMatrix(i = c(1,2,4,5,3), j = c(2,3,4,1,5), x = c(3,2,3,4,1))
rownames(mat) <- paste0("doc", 1:5)
colnames(mat) <- paste0("word", 1:5)

5 x 5 sparse Matrix of class "dgCMatrix"
     word1 word2 word3 word4 word5
doc1     .     3     .     .     .
doc2     .     .     2     .     .
doc3     .     .     .     .     1
doc4     .     .     .     3     .
doc5     4     .     .     .     .

or:

library(slam)
mat2 <- simple_triplet_matrix(c(1,2,4,5,3), j = c(2,3,4,1,5), v = c(3,2,3,4,1),
                          dimnames = list(paste0("doc", 1:5), …

r text-mining sparse-matrix tm

Asked by Gio*_*oni on 2017-05-23. Score: 3, answers: 1, views: 1315.

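Text mining: meaning of sparse/non-sparse

Can someone tell me what the code and output below mean? I have already created the corpus at this point.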

frequencies = DocumentTermMatrix(corpus)
frequencies

The output is

<<DocumentTermMatrix (documents: 299, terms: 1297)>>
Non-/sparse entries: 6242/381561
Sparsity           : 98%
Maximal term length: 19
Weighting          : term frequency (tf)

The code for removing sparse terms is here.

sparse = removeSparseTerms(frequencies, 0.97)
sparse

The output is

> sparse
<<DocumentTermMatrix (documents: 299, terms: 166)>>
Non-/sparse entries: 3773/45861
Sparsity           : 92%
Maximal term length: 10
Weighting          : term frequency (tf)

What is happening here, and what do non-sparse entries and sparse entries mean? Can someone help me understand this?

Thanks.

r text-mining
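
The numbers in both printouts can be checked with a little arithmetic. A document-term matrix has documents x terms cells; the "Non-/sparse entries" line gives the count of non-zero cells followed by the count of zero cells, and the sparsity is the share of zero cells. The Python sketch below reproduces the figures from the two outputs above. The function name `dtm_sparsity` is hypothetical.

```python
def dtm_sparsity(documents, terms, non_sparse_entries):
    """Return (sparse entries, sparsity in percent) for a document-term matrix."""
    total_cells = documents * terms
    sparse_entries = total_cells - non_sparse_entries  # cells whose count is zero
    sparsity_percent = round(100 * sparse_entries / total_cells)
    return sparse_entries, sparsity_percent

print(dtm_sparsity(299, 1297, 6242))  # (381561, 98)
print(dtm_sparsity(299, 166, 3773))   # (45861, 92)
```

In the first matrix, 6242 + 381561 = 387803 = 299 x 1297 cells. After `removeSparseTerms`, the matrix keeps 166 of the 1297 terms, so it has far fewer cells and a lower share of zeros.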

Asked by sub*_*bro. Score: 3, answers: 1, views: 7098.

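Making all words uppercase in a wordcloud in R

When creating wordclouds, the most common practice is to lowercase all words. However, I want the wordcloud to display the words in uppercase. After forcing the words to uppercase, the wordcloud still displays them in lowercase. Any ideas?

Reproducible code: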

    library(tm)
    library(wordcloud)

data <- data.frame(text = c("Creativity is the art of being ‘productive’ by using
          the available resources in a skillful manner. 
          Scientifically speaking, creativity is part of
          our consciousness and we can be creative –
          if we know – ’what goes on in our mind during
          the process of creation’.
          Let us now look at 6 examples of creativity which blows the mind."))

text <- paste(data$text, collapse = " ")

# I am using toupper() to force …

r text-mining word-cloud tm

Asked by Fil*_*ipW. Score: 3, answers: 1, views: 768.