Posts by ALE*_*HEW

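Using cosine similarity on a string vector to filter out similar strings

I have a vector of strings. Some of the strings in the vector (possibly more than two) are similar to each other in terms of the words they contain. I want to filter out strings that have a cosine similarity of more than 30% with any other string in the vector. Of two strings being compared, I want to keep the one with more words. That is, I only want strings that have less than 30% similarity with any other string of the original vector. My aim is to filter out similar strings and keep only the roughly distinct ones.

Ex. The vector is: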

x <- c("Dan is a good man and very smart", "A good man is rare", "Alex can be trusted with anything", "Dan likes to share his food", "Rare are man who can be trusted", "Please share food")

The result should be (assuming a similarity threshold of 30%):

c("Dan is a good man and very smart", "Dan likes to share his food", "Rare are man who can be trusted")

The above result has not been verified.

The cosine code I am using:

CSString_vector <- c("String One","String Two")
    corp <- tm::VCorpus(VectorSource(CSString_vector))
    controlForMatrix <- list(removePunctuation = TRUE,wordLengths = c(1, Inf),
    weighting …

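The filtering rule described above can be sketched in Python as follows. This is only one possible reading of the question, not the asker's R code: it assumes raw word counts as the vector representation (the weighting in the truncated snippet is unknown), lower-cases the text, and uses a greedy pass from the longest string to the shortest. The helper names `word_counts`, `cosine_similarity` and `filter_similar` are hypothetical.

```python
import math
import re
from collections import Counter


def word_counts(text):
    """Lower-case the text and count its words (a term-frequency vector)."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def filter_similar(strings, threshold=0.30):
    """Greedy filter: visit strings from most to fewest words and keep a
    string only if it is below the threshold against every string kept so far."""
    vectors = [word_counts(s) for s in strings]
    # Longest strings first, so the longer member of a similar pair survives.
    order = sorted(range(len(strings)),
                   key=lambda i: sum(vectors[i].values()), reverse=True)
    kept = []
    for i in order:
        if all(cosine_similarity(vectors[i], vectors[j]) < threshold for j in kept):
            kept.append(i)
    return [strings[i] for i in sorted(kept)]  # restore the input order


x = ["Dan is a good man and very smart", "A good man is rare",
     "Alex can be trusted with anything", "Dan likes to share his food",
     "Rare are man who can be trusted", "Please share food"]
print(filter_similar(x))
```

Because the pass is greedy, the outcome depends on the visiting order and on the weighting scheme, so it may differ from a result computed with TF-IDF weights.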

ALE*_*HEW

2018-04-19

5 votes

1 answer

410 views

Luigi pipelining: no module named pwd on Windows

I am trying to follow the tutorial at https://marcobonzanini.com/2015/10/24/building-data-pipelines-with-python-and-luigi/.

I can run the program on its own with the local scheduler, which gives me:

Scheduled 2 tasks of which:
* 2 ran successfully:
    - 1 PrintNumbers(n=1000)
    - 1 SquaredNumbers(n=1000)

This progress looks :) because there were no failed tasks or missing external dependencies

===== Luigi Execution Summary =====

However, to visualize the pipeline on the server, I tried running luigid --background, and it throws an error saying that I do not have the pwd module. I cannot find a pwd module for Windows using pip.

  File "c:\users\alex\appdata\local\continuum\anaconda3\lib\site-packages\luigi\process.py", line 79, in daemonize
    import daemon
  File "c:\users\alex\appdata\local\continuum\anaconda3\lib\site-packages\daemon\__init__.py", line 42, in <module>
    from .daemon import DaemonContext
  File "c:\users\alex\appdata\local\continuum\anaconda3\lib\site-packages\daemon\daemon.py", line 25, in <module>
    import pwd
ModuleNotFoundError: No module named 'pwd'

I am working in Spyder (Anaconda) with Python 3.6.
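For context, `pwd` is a Unix-only standard-library module, so it cannot be installed with pip on Windows; the traceback shows it being imported only on the daemonizing path taken by `--background`. A minimal sketch of a check, assuming the scheduler can simply be started in the foreground when `pwd` is missing (the helper names `pwd_available` and `luigid_command` are hypothetical):

```python
import importlib.util


def pwd_available():
    """Return True if the Unix-only 'pwd' module can be imported here."""
    return importlib.util.find_spec("pwd") is not None


def luigid_command(background=True):
    """Build the luigid command line, dropping --background where the
    daemon code path (which imports pwd) cannot work."""
    cmd = ["luigid"]
    if background and pwd_available():
        cmd.append("--background")
    return cmd


# On Windows this is expected to print ['luigid'], i.e. a foreground scheduler.
print(luigid_command())
```

Running the scheduler in the foreground keeps the terminal occupied, so the pipeline itself would be launched from a second terminal.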

python pipelining python-3.x luigi

ALE*_*HEW

lucky-day

4 votes

2 answers

3165 views

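PySpark - aggregate all columns of a dataframe at once

I want to group a dataframe by a single column and then apply an aggregate function to all of the other columns.

For example, I have a df with 10 columns. I want to group by the first column "1" and then apply the aggregate function "sum" to all of the remaining columns (which are all numeric).

The R equivalent of this is summarise_all. Ex. in R: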

df = df%>%group_by(column_one)%>%summarise_all(funs(sum))

I do not want to type the columns manually in PySpark's aggregate command, because the number of columns in the dataframe is dynamic.
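One way to avoid listing the columns by hand is to build the aggregation from `df.columns`. The sketch below assumes `df` is an existing PySpark DataFrame whose non-key columns are all numeric, and relies on `GroupedData.agg` accepting a dict that maps column names to aggregate function names. The helper names `build_sum_exprs` and `sum_all_columns` are hypothetical.

```python
def build_sum_exprs(columns, key):
    """Map every column except the grouping key to the aggregate name 'sum'."""
    return {c: "sum" for c in columns if c != key}


def sum_all_columns(df, key):
    """Group a PySpark DataFrame by `key` and sum all remaining columns."""
    # agg() with a dict names the output columns like "sum(column_name)".
    return df.groupBy(key).agg(build_sum_exprs(df.columns, key))


# Usage, assuming `df` already exists with a grouping column "column_one":
# result = sum_all_columns(df, "column_one")
# result.show()
```

When every remaining column is numeric, calling `df.groupBy("column_one").sum()` with no arguments should give an equivalent result, since it sums each numeric column per group.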

r aggregate-functions apache-spark pyspark

ALE*_*HEW

lucky-day

4 votes

1 answer

20k views

Tag statistics

r ×2

aggregate-functions ×1

apache-spark ×1

luigi ×1

pipelining ×1

pyspark ×1

python ×1

python-3.x ×1
