I have the following pandas DataFrame.
ex_one ex_two weight fake_date
0 228055 231908 1 2004-12-17
1 228056 228899 1 2000-02-26
2 228050 230029 1 2003-01-27
3 228055 230564 1 2001-07-25
4 228059 230548 1 2002-05-04
This is what I want:
Taking the column ex_one as an example: for the value 228055, get the maximum fake_date and the minimum fake_date, and count the number of occurrences of 228055.
ex_one ex_two weight fake_date max_date min_date frequency
0 228055 231908 1 2004-12-17 2004-12-17 2001-07-25 2
1 228056 228899 1 2000-02-26
2 228050 230029 1 2003-01-27
3 228055 230564 1 2001-07-25
4 228059 230548 1 2002-05-04
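One possible way to get these columns is a `groupby` on `ex_one` followed by `transform`, which returns a result aligned to the original rows. This is only a sketch built on the sample rows above; note that it fills `max_date`, `min_date`, and `frequency` on every row, not just on the first row of each group as in the desired output.

```python
import pandas as pd

# Sample rows copied from the question
df = pd.DataFrame({
    "ex_one": [228055, 228056, 228050, 228055, 228059],
    "ex_two": [231908, 228899, 230029, 230564, 230548],
    "weight": [1, 1, 1, 1, 1],
    "fake_date": pd.to_datetime(
        ["2004-12-17", "2000-02-26", "2003-01-27", "2001-07-25", "2002-05-04"]
    ),
})

# Group the dates by ex_one; transform() broadcasts each group's
# aggregate back onto every row of that group.
dates_by_key = df.groupby("ex_one")["fake_date"]
df["max_date"] = dates_by_key.transform("max")
df["min_date"] = dates_by_key.transform("min")
df["frequency"] = dates_by_key.transform("count")

print(df)
```

If the summary should appear only once per `ex_one` value, aggregating with `groupby(...).agg(...)` into a separate frame may be a better fit than adding columns.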
I am trying to append a fourth column to the following DataFrame, which has 465017 rows.
0 1 2
0 228055 231908 1
1 228056 228899 1
Running the following statement
x["Fake_date"]= fake.date(pattern="%Y-%m-%d", end_datetime=None)
returns
0 1 2 Fake_date
0 228055 231908 1 1980-10-12
1 228056 228899 1 1980-10-12
But I want a different random date on each of the 465017 rows, for instance:
0 1 2 Fake_date
0 228055 231908 1 1980-10-11
1 228056 228899 1 1980-09-12
How do I randomize this?
I am struggling to create an LDA model.
Here is what I have done so far: created unigrams and converted the DataFrame to an RDD, following this post.
Here is the code:
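The single repeated date appears because `fake.date(...)` is called once and the resulting scalar is broadcast to every row. With Faker, calling it once per row (for example in a list comprehension over `range(len(x))`) should give distinct values. The sketch below shows a vectorised alternative with NumPy instead of Faker; the date range and the seed are assumptions chosen for illustration, not values from the question.

```python
import numpy as np
import pandas as pd

# Two sample rows from the question; the real frame has 465017 rows
x = pd.DataFrame([[228055, 231908, 1], [228056, 228899, 1]])

# Assumed bounds for the random dates (hypothetical, adjust as needed)
start = pd.Timestamp("1970-01-01")
end = pd.Timestamp("2020-01-01")

rng = np.random.default_rng(seed=0)  # seed fixed only for reproducibility

# Draw one random day offset per row, then turn the offsets into dates
offsets = rng.integers(0, (end - start).days, size=len(x))
random_dates = start + pd.to_timedelta(offsets, unit="D")

# Store as "%Y-%m-%d" strings, matching the format used in the question
x["Fake_date"] = random_dates.strftime("%Y-%m-%d")

print(x)
```

Because all offsets are drawn in one call, this scales to a few hundred thousand rows without a Python-level loop.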
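from pyspark.ml.feature import CountVectorizer

# Keep a vocabulary of at most 3 terms, each appearing in at least 2 documents
countVectors = CountVectorizer(inputCol="unigrams", outputCol="features", vocabSize=3, minDF=2.0)
model = countVectors.fit(res)  # res: the DataFrame holding the "unigrams" column
result = model.transform(res)
result.show(5, truncate=False)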
Here is the dataset
+------------------------------------------------------------------------+---+-------------------+
|unigrams |id |features |
+------------------------------------------------------------------------+---+-------------------+
|[born, furyth, leaguenemesi, rise, (the, leaguenemesi, rise, seri, book]|0 |(3,[0,1],[1.0,1.0])|
|[hous, raven, (the, nightfal, chronicl, book] |1 |(3,[0,1],[1.0,1.0])|
|[law, 101everyth, need, know, american, law, fourth, edit] |2 |(3,[],[]) |
|[hot, summer, night] |3 |(3,[],[]) |
|[wet, bundlemega, collect, sex, stori, (30, book, box, set)] |4 |(3,[0],[1.0]) …
Is there a simple way to visualize pyspark's LDA (pyspark.ml.clustering.LDA)?
ldamodel.transform(result).show() produces
+--------------------+---+--------------------+--------------------+
| filtered| id| features| topicDistribution|
+--------------------+---+--------------------+--------------------+
| [problem, popul]| 0|(18054,[49,493],[...|[0.03282220322786...|
|[tyler, note, glo...| 1|(18054,[40,52,57,...|[0.00440868073429...|
|[mani, economist,...| 2|(18054,[12,17,25,...|[0.00404065731437...|
|[probabl, correct...| 3|(18054,[0,4,7,21,...|[0.00485107317270...|
|[even, popul, ass...| 4|(18054,[10,12,49,...|[0.00334279689625...|
|[sake, argument, ...| 5|(18054,[1,9,12,61...|[0.00285045818525...|
|[much, tougher, p...| 6|(18054,[27,32,49,...|[0.00485107690380...|
+--------------------+---+--------------------+--------------------+
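A simple first step toward a visualization is to list the top terms of each topic as words rather than vocabulary indices. The sketch below is plain Python and assumes the topic rows come from `ldamodel.describeTopics(...)` and the vocabulary from the fitted `CountVectorizerModel`; the helper name `top_terms_per_topic` and the sample values are hypothetical, made up so the example runs without a Spark session.

```python
def top_terms_per_topic(topic_rows, vocabulary):
    """Map each topic's term indices to words, paired with their weights."""
    summary = {}
    for topic, term_indices, term_weights in topic_rows:
        summary[topic] = [
            (vocabulary[i], round(w, 4))
            for i, w in zip(term_indices, term_weights)
        ]
    return summary


# With a real model, the inputs would be collected roughly like this:
#   topic_rows = [(r.topic, r.termIndices, r.termWeights)
#                 for r in ldamodel.describeTopics(5).collect()]
#   vocabulary = model.vocabulary  # the fitted CountVectorizerModel

# Hypothetical stand-in values, NOT taken from the question's data
vocabulary = ["popul", "problem", "economist", "argument"]
topic_rows = [
    (0, [1, 0], [0.41, 0.32]),
    (1, [2, 3], [0.38, 0.27]),
]

summary = top_terms_per_topic(topic_rows, vocabulary)
for topic, terms in summary.items():
    print(topic, terms)
```

The resulting word and weight pairs can then be passed to any plotting library, for instance as one bar chart per topic.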