PySpark vs sklearn TFIDF

Question

PySpark vs sklearn TFIDF

lU5*_*5er 7 python scikit-learn apache-spark pyspark

我是PySpark的新手.我正在玩tfidf.只是想检查一下他们是否给出了相同的结果.但他们不一样.这就是我做的.

# create the PySpark dataframe
sentenceData = sqlContext.createDataFrame((
  (0.0, "Hi I heard about Spark"),
  (0.0, "I wish Java could use case classes"),
  (1.0, "Logistic regression models are neat")
)).toDF("label", "sentence")

# tokenize
tokenizer = Tokenizer().setInputCol("sentence").setOutputCol("words")
wordsData = tokenizer.transform(sentenceData)

# vectorize
vectorizer = CountVectorizer(inputCol='words', outputCol='vectorizer').fit(wordsData)
wordsData = vectorizer.transform(wordsData)

# calculate scores
idf = IDF(inputCol="vectorizer", outputCol="tfidf_features")
idf_model = idf.fit(wordsData)
wordsData = idf_model.transform(wordsData)

# dense the current response variable
def to_dense(in_vec):
    return DenseVector(in_vec.toArray())
to_dense_udf = udf(lambda x: to_dense(x), VectorUDT())

# create dense vector
wordsData = wordsData.withColumn("tfidf_features_dense", to_dense_udf('tfidf_features'))

Run Code Online (Sandbox Code Playgroud)

我将PySpark df转换为熊猫

wordsData_pandas = wordsData.toPandas()

Run Code Online (Sandbox Code Playgroud)

然后,使用sklearn的tfidf计算如下

def dummy_fun(doc):
    return doc

# create sklearn tfidf
tfidf = TfidfVectorizer(
    analyzer='word',
    tokenizer=dummy_fun,
    preprocessor=dummy_fun,
    token_pattern=None)  

# transform and get idf scores
feature_matrix = tfidf.fit_transform(wordsData_pandas.words)

# create sklearn dtm matrix
sklearn_tfifdf = pd.DataFrame(feature_matrix.toarray(), columns=tfidf.get_feature_names())

# create PySpark dtm matrix
spark_tfidf = pd.DataFrame([np.array(i) for i in wordsData_pandas.tfidf_features_dense], columns=vectorizer.vocabulary)

Run Code Online (Sandbox Code Playgroud)

但不幸的是,我为PySpark得到了这个

<table border="1" class="dataframe">  <thead>    <tr style="text-align: right;">      <th></th>      <th>i</th>      <th>are</th>      <th>logistic</th>      <th>case</th>      <th>spark</th>      <th>hi</th>      <th>about</th>      <th>neat</th>      <th>could</th>      <th>regression</th>      <th>wish</th>      <th>use</th>      <th>heard</th>      <th>classes</th>      <th>java</th>      <th>models</th>    </tr>  </thead>  <tbody>    <tr>      <th>0</th>      <td>0.287682</td>      <td>0.000000</td>      <td>0.000000</td>      <td>0.000000</td>      <td>0.693147</td>      <td>0.693147</td>      <td>0.693147</td>      <td>0.000000</td>      <td>0.000000</td>      <td>0.000000</td>      <td>0.000000</td>      <td>0.000000</td>      <td>0.693147</td>      <td>0.000000</td>      <td>0.000000</td>      <td>0.000000</td>    </tr>    <tr>      <th>1</th>      <td>0.287682</td>      <td>0.000000</td>      <td>0.000000</td>      <td>0.693147</td>      <td>0.000000</td>      <td>0.000000</td>      <td>0.000000</td>      <td>0.000000</td>      <td>0.693147</td>      <td>0.000000</td>      <td>0.693147</td>      <td>0.693147</td>      <td>0.000000</td>      <td>0.693147</td>      <td>0.693147</td>      <td>0.000000</td>    </tr>    <tr>      <th>2</th>      <td>0.000000</td>      <td>0.693147</td>      <td>0.693147</td>      <td>0.000000</td>      <td>0.000000</td>      <td>0.000000</td>      <td>0.000000</td>      <td>0.693147</td>      <td>0.000000</td>      <td>0.693147</td>      <td>0.000000</td>      <td>0.000000</td>      <td>0.000000</td>      <td>0.000000</td>      <td>0.000000</td>      <td>0.693147</td>    </tr>  </tbody></table>

Run Code Online (Sandbox Code Playgroud)

这对于sklearn来说,

<table border="1" class="dataframe">  <thead>    <tr style="text-align: right;">      <th></th>      <th>i</th>      <th>are</th>      <th>logistic</th>      <th>case</th>      <th>spark</th>      <th>hi</th>      <th>about</th>      <th>neat</th>      <th>could</th>      <th>regression</th>      <th>wish</th>      <th>use</th>      <th>heard</th>      <th>classes</th>      <th>java</th>      <th>models</th>    </tr>  </thead>  <tbody>    <tr>      <th>0</th>      <td>0.355432</td>      <td>0.000000</td>      <td>0.000000</td>      <td>0.000000</td>      <td>0.467351</td>      <td>0.467351</td>      <td>0.467351</td>      <td>0.000000</td>      <td>0.000000</td>      <td>0.000000</td>      <td>0.000000</td>      <td>0.000000</td>      <td>0.467351</td>      <td>0.000000</td>      <td>0.000000</td>      <td>0.000000</td>    </tr>    <tr>      <th>1</th>      <td>0.296520</td>      <td>0.000000</td>      <td>0.000000</td>      <td>0.389888</td>      <td>0.000000</td>      <td>0.000000</td>      <td>0.000000</td>      <td>0.000000</td>      <td>0.389888</td>      <td>0.000000</td>      <td>0.389888</td>      <td>0.389888</td>      <td>0.000000</td>      <td>0.389888</td>      <td>0.389888</td>      <td>0.000000</td>    </tr>    <tr>      <th>2</th>      <td>0.000000</td>      <td>0.447214</td>      <td>0.447214</td>      <td>0.000000</td>      <td>0.000000</td>      <td>0.000000</td>      <td>0.000000</td>      <td>0.447214</td>      <td>0.000000</td>      <td>0.447214</td>      <td>0.000000</td>      <td>0.000000</td>      <td>0.000000</td>      <td>0.000000</td>      <td>0.000000</td>      <td>0.447214</td>    </tr>  </tbody></table>

Run Code Online (Sandbox Code Playgroud)

我确实试过了use_idf,smooth_idf参数.但似乎没有一个相同.我错过了什么？任何帮助表示赞赏.提前致谢.

Answer 1

Hak*_*ata 9

那是因为IDF在两者之间的计算方式略有不同.

来自sklearn的文档:

与pyspark的文档比较:

除了在IDF中添加1之外,sklearn TF-IDF使用了l2范数而pyspark没有

TfidfTransformer(norm='l2', use_idf=True, smooth_idf=True, sublinear_tf=False)

Run Code Online (Sandbox Code Playgroud)

Answer 2

Ven*_*lam 6

tfidf得分的Python和Pyspark实现都是相同的。请参考相同的Sklearn文档，但在下一行，

它们之间的主要区别是Sklearn l2默认使用规范，而Pyspark则不是。如果将标准设置为“无”，则在sklearn中也将得到相同的结果。

from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
import pandas as pd

corpus = ["I heard about Spark","I wish Java could use case classes","Logistic regression models are neat"]
corpus = [sent.lower().split() for sent in corpus]

def dummy_fun(doc):
    return doc

tfidfVectorizer=TfidfVectorizer(norm=None,analyzer='word',
                                tokenizer=dummy_fun,preprocessor=dummy_fun,token_pattern=None)

tf=tfidfVectorizer.fit_transform(corpus)
tf_df=pd.DataFrame(tf.toarray(),columns= tfidfVectorizer.get_feature_names())
tf_df

Run Code Online (Sandbox Code Playgroud)

请在此处参考我的答案，以了解规范如何与tf-idf矢量化器一起使用。

归档时间：	7 年，1 月前
查看次数：	680 次
最近记录：	7 年，1 月前