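Tag: tf-idf

tf-idf and previously unseen terms

TF-IDF (term frequency - inverse document frequency) is a staple of information retrieval. It is not a proper model, though, and it seems to break down when new terms are introduced into the corpus. How do people handle it when queries or new documents contain new terms, especially high-frequency ones? Under traditional cosine matching, those terms would have no impact on the total match.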
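
One hedged way to keep unseen terms from breaking the weighting is to smooth the document frequency so that a term with df = 0 still gets a finite IDF. The sketch below assumes the add-one form idf(t) = ln((1 + N) / (1 + df(t))) + 1 and raw term counts; the toy corpus, the term "zipf", and all function names are illustrative and not taken from the question.

```python
import math
from collections import Counter

def document_frequencies(docs):
    """Count, for each term, how many tokenized documents contain it."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    return df

def smoothed_idf(term, df, n_docs):
    """Add-one smoothed IDF; finite even when the term was never seen (df = 0)."""
    return math.log((1 + n_docs) / (1 + df.get(term, 0))) + 1.0

def tfidf_vector(tokens, df, n_docs):
    """Sparse TF-IDF vector as a {term: weight} dict using raw term counts."""
    return {t: c * smoothed_idf(t, df, n_docs) for t, c in Counter(tokens).items()}

def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical toy corpus, already tokenized.
corpus = [["tf", "idf", "ranking"], ["cosine", "ranking"], ["idf", "smoothing"]]
df = document_frequencies(corpus)
n_docs = len(corpus)

doc_vec = tfidf_vector(corpus[0], df, n_docs)
seen_query = tfidf_vector(["idf", "ranking"], df, n_docs)
# "zipf" never occurs in the corpus, so its document frequency is 0.
unseen_query = tfidf_vector(["idf", "ranking", "zipf", "zipf"], df, n_docs)

print(cosine(seen_query, doc_vec))
print(cosine(unseen_query, doc_vec))
```

In this sketch the unseen term adds nothing to the dot product but enlarges the query norm, so it lowers every document's score by the same factor and leaves the ranking unchanged.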

algorithm statistics nlp tf-idf

Gre*_*ind

2009 11-03

7
score

1
answers

2480
views

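How to use tf-idf to select stop words? (non-English corpus)

I managed to evaluate the tf-idf function for a given corpus. How can I find the stop words and the best words for each document? I understand that a low tf-idf for a given word and document means it is not a good word for selecting that document.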
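
A low tf-idf in one document can come from a low term count as well as from a low IDF, so a possible approach is to look for stop-word candidates by document frequency across the whole corpus and to pick the best words per document by tf-idf. The sketch below assumes whitespace tokenization; the French toy corpus, the `max_doc_ratio` threshold, and the function names are hypothetical.

```python
import math
from collections import Counter

def doc_freq(tokenized_docs):
    """Number of documents in which each term occurs."""
    df = Counter()
    for tokens in tokenized_docs:
        df.update(set(tokens))
    return df

def stop_word_candidates(tokenized_docs, max_doc_ratio=0.8):
    """Terms that occur in more than max_doc_ratio of the documents."""
    n = len(tokenized_docs)
    df = doc_freq(tokenized_docs)
    return sorted(t for t, d in df.items() if d / n > max_doc_ratio)

def top_terms(tokens, df, n_docs, k=2):
    """The k terms of one document with the highest tf * idf."""
    tf = Counter(tokens)
    scores = {t: c * math.log(n_docs / df[t]) for t, c in tf.items()}
    # Ties are broken alphabetically to keep the result deterministic.
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]

docs = [
    "le chat dort sur le tapis".split(),
    "le chien dort dans le jardin".split(),
    "le chat chasse le chien".split(),
    "le jardin est grand".split(),
]
df = doc_freq(docs)
print(stop_word_candidates(docs))
print(top_terms(docs[0], df, len(docs)))
```

The threshold is a judgment call: a term present in nearly every document has an IDF near zero regardless of how often it occurs, which is why document frequency alone is used for the corpus-wide list.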

information-retrieval text-mining stop-words tf-idf

Dan*_*rns

2013 06-05

7
score

2
answers

10k
views

tf-idf algorithm for Python

I have this code for computing text similarity with tf-idf.

from sklearn.feature_extraction.text import TfidfVectorizer

documents = [doc1,doc2]
tfidf = TfidfVectorizer().fit_transform(documents)
pairwise_similarity = tfidf * tfidf.T
print pairwise_similarity.A

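The problem is that this code takes plain strings as input, and I want to prepare the documents by removing stop words, stemming, and tokenizing. So the input will be a list. If I call documents = [doc1,doc2] with the tokenized documents, the error is: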

Traceback (most recent call last):
  File "C:\Users\tasos\Desktop\my thesis\beta\similarity.py", line 18, in <module>
    tfidf = TfidfVectorizer().fit_transform(documents)
  File "C:\Python27\lib\site-packages\scikit_learn-0.14.1-py2.7-win32.egg\sklearn\feature_extraction\text.py", line 1219, in fit_transform
    X = super(TfidfVectorizer, self).fit_transform(raw_documents)
  File "C:\Python27\lib\site-packages\scikit_learn-0.14.1-py2.7-win32.egg\sklearn\feature_extraction\text.py", line 780, in fit_transform
    vocabulary, X = self._count_vocab(raw_documents, self.fixed_vocabulary)
  File "C:\Python27\lib\site-packages\scikit_learn-0.14.1-py2.7-win32.egg\sklearn\feature_extraction\text.py", line 715, in _count_vocab
    for feature in analyze(doc):
  File "C:\Python27\lib\site-packages\scikit_learn-0.14.1-py2.7-win32.egg\sklearn\feature_extraction\text.py", line 229, in <lambda>
    tokenize(preprocess(self.decode(doc))), stop_words)
  File "C:\Python27\lib\site-packages\scikit_learn-0.14.1-py2.7-win32.egg\sklearn\feature_extraction\text.py", line …
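
The traceback passes through the vectorizer's own preprocess and tokenize steps, which expect strings. One possible workaround in current scikit-learn releases is to pass a callable as `analyzer`, which then receives each document unchanged. This is a sketch under that assumption; the token lists below are hypothetical stand-ins for the already prepared documents.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical documents, already tokenized, stemmed, and stripped of stop words.
doc1 = ["tf", "idf", "weight", "term"]
doc2 = ["tf", "idf", "rank", "document"]
documents = [doc1, doc2]

# A callable analyzer receives each document as-is and returns its features,
# so the built-in decoding, lowercasing, and tokenizing steps are skipped.
vectorizer = TfidfVectorizer(analyzer=lambda tokens: tokens)
tfidf = vectorizer.fit_transform(documents)

# Rows are L2-normalized by default, so the product holds cosine similarities.
pairwise_similarity = (tfidf @ tfidf.T).toarray()
print(pairwise_similarity)
```

Because the analyzer does no work of its own here, any lowercasing, stemming, or stop-word removal has to happen before the lists are handed to the vectorizer.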

python tf-idf scikit-learn

Tas*_*sos

2013 08-27

7
score

1
answers

3913
views

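Lucene 4.4: how to get term frequency over the whole index?

I am trying to compute the tf-idf value of each term in a document. So I iterate over the terms in the document and want to find the frequency of each term in the whole corpus and the number of documents in which the term appears. Here is my code: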

//@param index path to index directory
//@param docNbr the document number in the index
public void readingIndex(String index, int docNbr) {
    IndexReader reader = DirectoryReader.open(FSDirectory.open(new File(index)));

    Document doc = reader.document(docNbr);         
    System.out.println("Processing file: "+doc.get("id"));

    Terms termVector = reader.getTermVector(docNbr, "contents");
    TermsEnum itr = termVector.iterator(null);
    BytesRef term = null;

    while ((term = itr.next()) != null) {               
        String termText = term.utf8ToString();                              
        long termFreq = itr.totalTermFreq();   //FIXME: this only return frequency in this doc
        long docCount = itr.docFreq();   //FIXME: docCount = 1 in all …
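
The two FIXME comments are consistent with a term vector behaving like a tiny index over a single document: seen from there, every term occurs in exactly one document. The Python sketch below only illustrates that distinction with hypothetical in-memory data; it is not Lucene code. In Lucene 4.x the index-wide figures are, to my knowledge, available through `IndexReader.docFreq(Term)` and `IndexReader.totalTermFreq(Term)`.

```python
from collections import Counter

# Hypothetical tokenized documents standing in for the "contents" field.
index = {
    0: ["lucene", "index", "term", "term"],
    1: ["term", "vector", "lucene"],
    2: ["index", "reader"],
}

def term_vector_stats(doc_id):
    """Statistics visible from one document's term vector only."""
    freqs = Counter(index[doc_id])
    # Seen from a single document, every term occurs in exactly one document.
    return {t: {"term_freq": f, "doc_freq": 1} for t, f in freqs.items()}

def index_wide_stats(term):
    """Statistics gathered across every document in the index."""
    doc_freq = sum(1 for tokens in index.values() if term in tokens)
    total_term_freq = sum(tokens.count(term) for tokens in index.values())
    return {"doc_freq": doc_freq, "total_term_freq": total_term_freq}

print(term_vector_stats(0)["term"])
print(index_wide_stats("term"))
```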

lucene indexing tf-idf frequency-analysis

che*_*kha

2016 05-24

7
score

1
answers

9843
views

How to create a TF-IDF for text classification using Spark?

I have a CSV file with the following format:

product_id1,product_title1
product_id2,product_title2
product_id3,product_title3
product_id4,product_title4
product_id5,product_title5
[...]

product_idX is an integer and product_titleX is a String, for example:

453478692, Apple iPhone 4 8Go

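I am trying to create the TF-IDF from my file so I can use it for a Naive Bayes classifier in MLlib.

So far I am using Spark for Scala, following the tutorials I found on the official page and in the Berkeley AmpCamp 3 and 4.

So I am reading the file: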

val file = sc.textFile("offers.csv")

Then I map it into tuples, RDD[Array[String]]:

val tuples = file.map(line => line.split(",")).cache

After that I transform the tuples into pairs, RDD[(Int, String)]:

val pairs = tuples.map(line => (line(0).toInt, line(1)))  // toInt so the pair type is (Int, String)

But I am stuck here, and I don't know how to create the Vector from it to turn it into TF-IDF.

Thanks
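
MLlib ships `HashingTF` and `IDF` transformers in `org.apache.spark.mllib.feature` for this step. The plain-Python sketch below only illustrates the idea behind them (hash each token into a fixed-size count vector, then rescale by a smoothed IDF); it is not Spark code. The bucket count, the `crc32` hash, the two extra rows, and the function names are assumptions made for the example.

```python
import math
import zlib
from collections import Counter

NUM_FEATURES = 16  # illustrative; real feature spaces are far larger

def hashing_tf(tokens, num_features=NUM_FEATURES):
    """Map tokens to term counts in a fixed-size vector via the hashing trick."""
    vec = [0.0] * num_features
    for token, count in Counter(tokens).items():
        # crc32 is a stand-in hash chosen for determinism, not Spark's own.
        vec[zlib.crc32(token.encode("utf-8")) % num_features] += count
    return vec

def fit_idf(tf_vectors):
    """Per-bucket IDF using the smoothed form log((m + 1) / (df + 1))."""
    m = len(tf_vectors)
    num_features = len(tf_vectors[0])
    df = [sum(1 for v in tf_vectors if v[i] > 0) for i in range(num_features)]
    return [math.log((m + 1) / (d + 1)) for d in df]

lines = [
    "453478692,Apple iPhone 4 8Go",
    "453478693,Apple iPad 2 16Go",      # hypothetical extra rows
    "453478694,Samsung Galaxy S 8Go",
]
# Split on the first comma only, so commas inside a title survive.
pairs = [(int(pid), title) for pid, title in (l.split(",", 1) for l in lines)]
tf = [hashing_tf(title.lower().split()) for _, title in pairs]
idf = fit_idf(tf)
tfidf = [[t * w for t, w in zip(vec, idf)] for vec in tf]
print(pairs[0][0], tfidf[0])
```

Hashing avoids building a vocabulary up front, at the cost of occasional collisions when two tokens land in the same bucket.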

scala tf-idf apache-spark apache-spark-mllib

eli*_*sah

2016 04-25

7
score

1
answers

10k
views

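Scikit Learn - calculating TF-IDF from a corpus of feature arrays rather than from a corpus of raw documents

Scikit-Learn's TfidfVectorizer converts a collection of raw documents to a matrix of TF-IDF features. Instead of raw documents, I would like to convert a matrix of feature names to TF-IDF features.

The corpus you feed to fit_transform() is supposed to be an array of raw documents, but I would like to be able to feed it (or a comparable function) an array of feature arrays, one per document. For example: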

corpus = [
    ['orange', 'red', 'blue'],
    ['orange', 'yellow', 'red'],
    ['orange', 'green', 'purple (if you believe in purple)'],
    ['orange', 'reddish orange', 'black and blue']
]

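...as opposed to a one-dimensional array of strings.

I know that I can define my own vocabulary for the TfidfVectorizer, so I could easily build a dict of the unique features in my corpus and their indices in the feature vectors. But the function still expects raw documents, and since my features are of varying lengths and occasionally overlap (for example, 'orange' and 'reddish orange'), I can't just concatenate my features into single strings and use ngrams.

Is there a different Scikit-Learn function I can use for this that I'm not finding? Is there a way to use the TfidfVectorizer that I'm not seeing? Or will I have to homebrew my own TF-IDF function?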
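
If a homebrewed version turns out to be necessary, it can stay small. The sketch below treats each feature string as one atomic vocabulary entry and builds an L2-normalized matrix over a shared index; the add-one smoothed IDF, ln((1 + n) / (1 + df)) + 1, is an assumed weighting and the helper name is hypothetical.

```python
import math
from collections import Counter

corpus = [
    ["orange", "red", "blue"],
    ["orange", "yellow", "red"],
    ["orange", "green", "purple (if you believe in purple)"],
    ["orange", "reddish orange", "black and blue"],
]

# Each feature string is one atomic vocabulary entry, so "orange" and
# "reddish orange" never overlap.
vocabulary = {f: i for i, f in enumerate(sorted({f for doc in corpus for f in doc}))}

n_docs = len(corpus)
df = Counter(f for doc in corpus for f in set(doc))
# Assumed weighting: add-one smoothed IDF, ln((1 + n) / (1 + df)) + 1.
idf = {f: math.log((1 + n_docs) / (1 + df[f])) + 1.0 for f in vocabulary}

def tfidf_row(features):
    """One L2-normalized TF-IDF row over the shared vocabulary."""
    row = [0.0] * len(vocabulary)
    for f, count in Counter(features).items():
        row[vocabulary[f]] = count * idf[f]
    norm = math.sqrt(sum(w * w for w in row))
    return [w / norm for w in row] if norm else row

matrix = [tfidf_row(doc) for doc in corpus]
for row in matrix:
    print([round(w, 3) for w in row])
```

Since 'orange' appears in every document it receives the lowest weight in each row, while features unique to one document receive the highest.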

python machine-learning tf-idf scikit-learn

And*_*ise

lucky-day

7
score

1
answers

2452
views

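How to make a TF-IDF matrix dense?

I am using TfidfVectorizer to convert a collection of raw documents to a matrix of TF-IDF features, which I then plan to feed into a k-means algorithm (which I will implement). In that algorithm I will have to compute distances between centroids (categories of articles) and data points (articles). I am going to use Euclidean distance, so I need these two entities to have the same dimension, in my case max_features. Here is what I have: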

tfidf = TfidfVectorizer(max_features=10, strip_accents='unicode', analyzer='word', stop_words=stop_words.extra_stopwords, lowercase=True, use_idf=True)
X = tfidf.fit_transform(data['Content']) # the matrix articles x max_features(=words)
for i, row in enumerate(X):
    print X[i]

However, X seems to be a sparse(?) matrix, since the output is:

  (0, 9)    0.723131915847
  (0, 8)    0.090245047798
  (0, 6)    0.117465276892
  (0, 4)    0.379981697363
  (0, 3)    0.235921470645
  (0, 2)    0.0968780456528
  (0, 1)    0.495689001273

  (0, 9)    0.624910843051
  (0, 8)    0.545911131362
  (0, 7)    0.160545991411
  (0, 5)    0.49900042174
  (0, 4)    0.191549050212

  ...

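I guess the (0, col) states the column index in the matrix, which is actually like an array where every cell points to a list.

How do I convert this matrix to a dense one (so that every row has the same number of columns)?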

>print type(X)
<class 'scipy.sparse.csr.csr_matrix'>
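
A `csr_matrix` already has a fixed shape (`X.shape`), so every row has the same number of columns; the printout simply lists the nonzero entries. SciPy sparse matrices provide `toarray()`, which returns a regular `numpy.ndarray` with the zeros filled in. The sketch below builds a small hypothetical matrix from a few of the values in the printout instead of using the question's data.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical 2 x 10 matrix built from a few values of the printout;
# the second document is placed in row 1 here.
rows = [0, 0, 0, 1, 1]
cols = [9, 4, 1, 9, 5]
vals = [0.723131915847, 0.379981697363, 0.495689001273, 0.624910843051, 0.49900042174]
X = csr_matrix((vals, (rows, cols)), shape=(2, 10))

dense = X.toarray()          # plain numpy.ndarray with explicit zeros
print(dense.shape)           # every row has the same number of columns

# Euclidean distance between a data point and a centroid of equal length.
centroid = dense.mean(axis=0)
print(np.linalg.norm(dense[0] - centroid))
```

For a large corpus the dense array can be much bigger than the sparse one, so it may be worth converting only the rows needed for a given distance computation.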

python cluster-analysis sparse-matrix tf-idf scikit-learn

gsa*_*ras

2016 01-31

7
score

1
answers

9880
views

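Python - tf-idf: predicting the similarity of a new document

Inspired by this answer, I am trying to find the cosine similarity between a trained tf-idf vectorizer and a new document, and return the similar documents.

The code below finds the cosine similarity for the first vector, not for a new query: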

>>> from sklearn.metrics.pairwise import linear_kernel
>>> cosine_similarities = linear_kernel(tfidf[0:1], tfidf).flatten()
>>> cosine_similarities
array([ 1.        ,  0.04405952,  0.11016969, ...,  0.04433602,
    0.04457106,  0.03293218])

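Since my training data is huge, looping through the entire trained vectorizer sounds like a bad idea. How can I infer the vector of a new document and find the related documents, the same as in the code below?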

>>> related_docs_indices = cosine_similarities.argsort()[:-5:-1]
>>> related_docs_indices
array([    0,   958, 10576,  3277])
>>> cosine_similarities[related_docs_indices]
array([ 1.        ,  0.54967926,  0.32902194,  0.2825788 ])
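
One way this could work is to keep the fitted vectorizer and call its `transform()` on the new document, which maps it into the same feature space without refitting, then compare that single row against the stored matrix. The sketch assumes a hypothetical four-document training corpus and query; the variable names mirror the snippets above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Hypothetical training corpus; the question's real data is not shown.
train_docs = [
    "tf idf weights terms by rarity",
    "cosine similarity compares document vectors",
    "k means clusters documents around centroids",
    "rare terms receive high idf weights",
]
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(train_docs)

# transform() reuses the fitted vocabulary and IDF; terms unseen in training are ignored.
new_doc = "which terms receive high tf idf weights"
query_vec = vectorizer.transform([new_doc])

# Rows are L2-normalized by default, so the linear kernel equals cosine similarity.
cosine_similarities = linear_kernel(query_vec, tfidf).flatten()
related_docs_indices = cosine_similarities.argsort()[:-3:-1]   # top 2 matches
print(related_docs_indices, cosine_similarities[related_docs_indices])
```

The comparison is a single sparse matrix product, so no explicit loop over the training documents is needed.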

python machine-learning tf-idf document-classification scikit-learn

Shl*_*rtz

2017 05-23

7
score

1
answers

1823
views