I want to convert my vector column to an array, so I use:
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, DoubleType

get_array = udf(lambda x: x.toArray(), ArrayType(DoubleType()))
result3 = result2.withColumn('list', get_array('features'))
result3.show()
result3.show()
The `features` column has vector dtype, but Spark gives me:
net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict (for numpy.core.multiarray._reconstruct)
I know the cause must be the type I used in the UDF, so I also tried `get_array = udf(lambda x: x.toArray(), ArrayType(FloatType()))`, which fails the same way. I know the value is a numpy.ndarray after the conversion, but how can I get it returned correctly?
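For context, a commonly reported cause of this PickleException is returning a numpy.ndarray from a UDF declared as `ArrayType(DoubleType())`: Spark's Pyrolite pickler cannot reconstruct the numpy type. A minimal Spark-free sketch of the usual workaround, converting to a plain Python list before returning:

```python
import numpy as np

# toArray() on an ML vector yields a numpy.ndarray; returning it from a UDF
# declared as ArrayType(DoubleType()) is what trips the Pyrolite pickler.
vec = np.array([0.1, 0.2, 0.3])

# .tolist() converts to a plain Python list of builtin floats,
# which ArrayType(DoubleType()) can serialize without complaint.
as_list = vec.tolist()

assert isinstance(as_list, list)
assert all(type(v) is float for v in as_list)
```

In the UDF above this corresponds to `udf(lambda x: x.toArray().tolist(), ArrayType(DoubleType()))` — not verified against this exact pipeline, but it is the fix usually suggested for this error.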
Here is the code that produces the DataFrame `result2`:
df4 = indexed.groupBy('uuid').pivot('name').sum('fre')
df4 = df4.fillna(0)
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import BisectingKMeans

assembler = VectorAssembler(
    inputCols=df4.columns[1:],
    outputCol="features")
dataset = assembler.transform(df4)
bk = BisectingKMeans(k=8, seed=2, featuresCol="features")
result2 = bk.fit(dataset).transform(dataset)
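As a plain-Python illustration (no Spark needed) of what `groupBy('uuid').pivot('name').sum('fre')` followed by `fillna(0)` computes — one row per uuid, one column per name, summed frequencies, zeros for missing cells; the sample rows here are invented:

```python
from collections import defaultdict

# Rows of (uuid, name, fre), mimicking the DataFrame fed to groupBy/pivot/sum.
rows = [
    ("u1", "a2", 0.3), ("u1", "a2", 0.2), ("u1", "b1", 0.5),
    ("u2", "b1", 0.1),
]

# pivot('name'): one output column per distinct name.
names = sorted({name for _, name, _ in rows})

# fillna(0) is mirrored by defaulting every cell to 0.0.
pivot = defaultdict(lambda: {n: 0.0 for n in names})
for uuid, name, fre in rows:
    pivot[uuid][name] += fre  # sum('fre') per (uuid, name) cell

assert pivot["u1"]["a2"] == 0.5
assert pivot["u2"]["a2"] == 0.0
```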
Here is what `indexed` looks like:
+------------------+------------+---------+-------------+------------+----------+--------+----+
| uuid| category| code| servertime| cat| fre|catIndex|name|
+------------------+------------+---------+-------------+------------+----------+--------+----+
| 351667085527886| 398| null|1503084585000| 398|0.37951264| 2.0| a2|
| 352279079643619| 403| null|1503105476000| 403| 0.3938634| …

I am building a news recommendation system, and I need to build a table of users and the news they have read. My raw data looks like this:
001436800277225 ["9161492","9161787","9378531"]
009092130698762 ["9394697"]
010003000431538 ["9394697","9426473","9428530"]
010156461231357 ["9350394","9414181"]
010216216021063 ["9173862","9247870"]
010720006581483 ["9018786"]
011199797794333 ["9017977","9091134","9142852","9325464","9331913"]
011337201765123 ["9161294","9198693"]
011414545455156 ["9168185","9178348","9182782","9359776"]
011425002581540 ["9083446","9161294","9309432"]
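The `explode` step in the snippet below turns each (uuid, [news…]) row into one row per news id. As a plain-Python sketch of the same flattening, using two rows of the raw data above:

```python
raw = {
    "001436800277225": ["9161492", "9161787", "9378531"],
    "009092130698762": ["9394697"],
}

# Equivalent of df.select('uuid', explode('news').alias('news')):
# one (uuid, news) pair per element of each user's list.
rows = [(uuid, news) for uuid, items in raw.items() for news in items]

assert ("001436800277225", "9378531") in rows
assert len(rows) == 4
```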
I used Spark SQL's `explode` and then one-hot encoding:
from pyspark.sql.functions import explode
from pyspark.ml.feature import StringIndexer, OneHotEncoder

df = getdf()
df1 = df.select('uuid', explode('news').alias('news'))
stringIndexer = StringIndexer(inputCol="news", outputCol="newsIndex")
model = stringIndexer.fit(df1)
indexed = model.transform(df1)
encoder = OneHotEncoder(inputCol="newsIndex", outputCol="newsVec")
encoded = encoder.transform(indexed)
encoded.show(20, False)
After that, my data becomes:
+---------------+-------+---------+----------------------+
|uuid |news |newsIndex|newsVec |
+---------------+-------+---------+----------------------+
|014324000386050|9398253|10415.0 |(105721,[10415],[1.0])|
|014324000386050|9428530|70.0 |(105721,[70],[1.0]) |
|014324000631752|654112 |1717.0 |(105721,[1717],[1.0]) |
|014324000674240|730531 |2282.0 |(105721,[2282],[1.0]) |
|014324000674240|694306 |1268.0 |(105721,[1268],[1.0]) |
|014324000674240|712016 |4766.0 |(105721,[4766],[1.0]) |
|014324000674240|672307 |7318.0 |(105721,[7318],[1.0]) |
|014324000674240|698073 |1241.0   | …

python machine-learning apache-spark apache-spark-sql pyspark-sql
I want to convert a List column in PySpark to a Vector, and then use that column to train a machine-learning model. But my Spark version is 1.6.0, which has no VectorUDT(). So which type should I return from my udf?
from pyspark.sql import SQLContext
from pyspark import SparkContext, SparkConf
from pyspark.sql.functions import *
from pyspark.mllib.linalg import DenseVector
from pyspark.mllib.linalg import Vectors
from pyspark.sql.types import *
conf = SparkConf().setAppName('rank_test')
sc = SparkContext(conf=conf)
spark = SQLContext(sc)
df = spark.createDataFrame([[[0.1,0.2,0.3,0.4,0.5]]],['a'])
print '???'
df.show()
def list2vec(column):
    print '?????', column
    return Vectors.dense(column)

getVector = udf(lambda y: list2vec(y), DenseVector())
df.withColumn('b', getVector(col('a'))).show()
I have tried many types; this DenseVector() gives me the error:
Traceback (most recent call last):
File "t.py", line 21, in <module>
getVector = udf(lambda y: list2vec(y),DenseVector() )
TypeError: __init__() takes exactly …

python machine-learning apache-spark pyspark apache-spark-mllib
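For context, this TypeError happens before any Spark work starts: the second argument of `udf` must be an *instance* of a `pyspark.sql.types.DataType`, while `DenseVector` is a vector class whose constructor requires the vector's values, so `DenseVector()` with no arguments cannot even be constructed. A Spark-free stand-in showing the same failure mode (the class here is hypothetical, mirroring only the constructor signature):

```python
class FakeDenseVector:
    """Stand-in mirroring pyspark.mllib.linalg.DenseVector's constructor."""
    def __init__(self, values):
        self.values = list(values)

# Calling it with no arguments, as in udf(..., DenseVector()), fails the
# same way as the traceback above.
try:
    FakeDenseVector()
    raised = False
except TypeError:
    raised = True

assert raised
```

Answers to similar questions suggest that even in Spark 1.6, `VectorUDT` is importable via `from pyspark.mllib.linalg import VectorUDT`, making `udf(lambda y: Vectors.dense(y), VectorUDT())` the usual fix — I have not verified this against a 1.6.0 installation.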
I am trying to use the Scala API to compute AUC (area under the ROC curve) grouped by a key field, similar to this question: PySpark: Calculate grouped-by AUC.
Unfortunately, I cannot use sklearn. How should I proceed?
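One library-free option is to compute AUC per key directly from the Mann-Whitney rank statistic after grouping by the key. The sketch below is in Python for readability (function names and sample data are invented; tie handling is omitted), but the per-group logic ports directly to a Scala `groupBy`/`mapValues`:

```python
from collections import defaultdict

def auc(pairs):
    # pairs: list of (score, label) with label in {0, 1}, distinct scores assumed.
    # Mann-Whitney formulation: AUC = (R_pos - n_pos*(n_pos+1)/2) / (n_pos*n_neg),
    # where R_pos is the sum of 1-based ranks of the positives by ascending score.
    ranked = sorted(pairs, key=lambda p: p[0])
    r_pos = sum(i + 1 for i, (_, y) in enumerate(ranked) if y == 1)
    n_pos = sum(y for _, y in pairs)
    n_neg = len(pairs) - n_pos
    return (r_pos - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)

# (key, score, label) triples standing in for the grouped-by-key data.
data = [("k1", 0.9, 1), ("k1", 0.8, 1), ("k1", 0.3, 0),
        ("k2", 0.6, 0), ("k2", 0.7, 1)]

groups = defaultdict(list)
for key, score, label in data:
    groups[key].append((score, label))

per_key = {k: auc(v) for k, v in groups.items()}
assert per_key["k1"] == 1.0 and per_key["k2"] == 1.0
```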
apache-spark apache-spark-sql apache-spark-ml apache-spark-mllib