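Posts by qqp*_*lot

Pyspark from_unixtime (unix_timestamp) does not convert to timestamp

I am using PySpark with Python 2.7. I have a date column stored as a string (with milliseconds) and would like to convert it to a timestamp.

This is what I have tried so far: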

df = df.withColumn('end_time', from_unixtime(unix_timestamp(df.end_time, '%Y-%M-%d %H:%m:%S.%f')) )


printSchema() shows end_time: string (nullable = true)

When I use timestamp as the variable type
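One likely cause is the format string. `unix_timestamp` takes Java-style pattern letters (its default is `yyyy-MM-dd HH:mm:ss`), not Python `strftime` directives. The sketch below is plain Python, not Spark code, and only illustrates how the two notations line up. The table `STRFTIME_TO_JAVA`, the helper `strftime_to_java`, and the sample value are hypothetical names and data introduced for illustration. Mapping `%f` to `SSS` assumes the fractional part holds milliseconds.

```python
import re
from datetime import datetime

# Hypothetical lookup table: strftime directive -> Java-style pattern letters.
STRFTIME_TO_JAVA = {
    "%Y": "yyyy",  # four-digit year
    "%m": "MM",    # month (upper case in the Java style)
    "%d": "dd",    # day of month
    "%H": "HH",    # hour, 24-hour clock
    "%M": "mm",    # minute (lower case in the Java style)
    "%S": "ss",    # second
    "%f": "SSS",   # assumption: the fraction holds milliseconds
}


def strftime_to_java(pattern):
    """Translate the directives listed above; raise on anything else."""
    def replace(match):
        directive = match.group(0)
        if directive not in STRFTIME_TO_JAVA:
            raise ValueError("unsupported directive: " + directive)
        return STRFTIME_TO_JAVA[directive]

    return re.sub(r"%.", replace, pattern)


# In strftime terms, %m is the month and %M is the minute.
python_pattern = "%Y-%m-%d %H:%M:%S.%f"
java_pattern = strftime_to_java(python_pattern)

# Made-up sample value with a millisecond fraction.
parsed = datetime.strptime("2019-01-24 10:15:30.123", python_pattern)
```

Two further points apply on the Spark side. `unix_timestamp` returns whole seconds, so the millisecond part is dropped, and `from_unixtime` returns a string, which matches the `printSchema()` output above. Casting the string column directly, for example `df.end_time.cast('timestamp')`, is one way to get a timestamp column, assuming the strings already look like `yyyy-MM-dd HH:mm:ss.SSS`.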

date pyspark

qqp*_*lot

2019-01-24

5 votes

2 answers

30k views

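How do I get the word corresponding to the highest tf-idf using PySpark?

I have seen similar posts, but none with a complete answer, so I am posting here.

I am using TF-IDF in Spark to get the word in each document with the maximum tf-idf value. I use the following piece of code.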

from pyspark.ml import Pipeline
from pyspark.ml.feature import HashingTF, IDF, Tokenizer, CountVectorizer, StopWordsRemover

# Split the cleaned document text into tokens.
tokenizer = Tokenizer(inputCol="doc_cln", outputCol="tokens")

# First pass: remove the default stop words.
remover1 = StopWordsRemover(inputCol="tokens",
                            outputCol="stopWordsRemovedTokens")

# Second pass: remove a custom stop word list.
stopwordList = ["word1", "word2", "word3"]
remover2 = StopWordsRemover(inputCol="stopWordsRemovedTokens",
                            outputCol="filtered", stopWords=stopwordList)

# Hash each token into one of 2000 buckets to get term frequencies.
hashingTF = HashingTF(inputCol="filtered", outputCol="rawFeatures", numFeatures=2000)

# Rescale by inverse document frequency; terms in fewer than 5 documents get weight 0.
idf = IDF(inputCol="rawFeatures", outputCol="features", minDocFreq=5)

pipeline = Pipeline(stages=[tokenizer, remover1, remover2, hashingTF, idf])
model = pipeline.fit(df)

results = model.transform(df)
results.cache()


The result I get is

|[a8g4i9g5y, hwcdn] |(2000,[905,1104],[7.34977707433047,7.076179741760428])


where

filtered: array (nullable = true)
features: vector (nullable = true)


How can I extract the array from "features"? Ideally, I would like to get the word corresponding to the highest tf-idf, like this:

|a8g4i9g5y|7.34977707433047


Thanks in advance!
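The lookup being asked for can be sketched in plain Python, outside Spark, once the sparse vector is split into its index and value arrays (a PySpark `SparseVector` exposes these as `.indices` and `.values`). The sketch assumes an index-to-word mapping is available. `HashingTF` does not provide one, because hashing is one-way and different words can share a bucket. One option is to swap in `CountVectorizer`, whose fitted model exposes a `vocabulary` list. The function name `top_tfidf_term` and the `vocabulary` mapping below are hypothetical and only serve the illustration.

```python
def top_tfidf_term(indices, values, vocabulary):
    """Return (word, score) for the largest tf-idf weight, or None if empty."""
    if len(indices) == 0:
        return None
    # Position of the largest weight within the sparse value array.
    best = max(range(len(values)), key=lambda i: values[i])
    return vocabulary[indices[best]], values[best]


# Sparse vector from the question: (2000,[905,1104],[7.349..., 7.076...])
indices = [905, 1104]
values = [7.34977707433047, 7.076179741760428]

# Hypothetical index -> word lookup, made up for this example. With
# CountVectorizer it would be the fitted model's vocabulary list instead.
vocabulary = {905: "a8g4i9g5y", 1104: "hwcdn"}

word, score = top_tfidf_term(indices, values, vocabulary)
print("|%s|%s" % (word, score))
```

In a Spark job the same function could be wrapped in a UDF that receives the `features` column, with the vocabulary read from the fitted `CountVectorizerModel` stage of the pipeline. That wiring is left out here because it depends on how the pipeline is changed.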

python tf-idf pyspark

qqp*_*lot

2018-10-11

3 votes

1 answer

2438 views
