Posts by lil*_*ffa

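Spark SQL: applying aggregate functions to a list of columns

Is there a way to apply an aggregate function to all (or a list of) columns of a DataFrame when doing a groupBy? In other words, is there a way to avoid doing this for every column: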

df.groupBy("col1")
  .agg(sum("col2").alias("col2"), sum("col3").alias("col3"), ...)
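
One way to avoid writing each aggregate by hand is to generate the expressions from a list of column names. The sketch below does this in plain Python by building a Spark SQL query string; the helper name `build_sum_query` and the view name `my_table` are hypothetical, and it assumes the DataFrame has been registered as a temporary view under that name.

```python
def build_sum_query(table, group_col, value_cols):
    """Return a Spark SQL string that sums every column in value_cols per group_col."""
    if not value_cols:
        raise ValueError("value_cols must contain at least one column name")

    def quote(name):
        # Backticks quote identifiers in Spark SQL; reject names that contain one
        # rather than trying to escape them in this sketch.
        if "`" in name:
            raise ValueError(f"unsupported identifier: {name!r}")
        return f"`{name}`"

    # One "SUM(col) AS col" term per column, so the output keeps the input names.
    sums = ", ".join(f"SUM({quote(c)}) AS {quote(c)}" for c in value_cols)
    group = quote(group_col)
    return f"SELECT {group}, {sums} FROM {quote(table)} GROUP BY {group}"


# Hypothetical view name; the column names are the ones from the question.
query = build_sum_query("my_table", "col1", ["col2", "col3"])
print(query)
```

The resulting string could then be passed to `spark.sql(query)`. With the DataFrame API the same idea is usually written by building the list of aggregate expressions and unpacking it into `agg`, for example `df.groupBy("col1").agg(*[F.sum(c).alias(c) for c in cols])` in PySpark, where `F` stands for `pyspark.sql.functions`.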

aggregate-functions dataframe apache-spark apache-spark-sql

lil*_*ffa

2019 06-11

65
Score

2
Answers

120k
Views

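R Mclust - getting svd error "infinite or missing values"

I am using the Mclust function (from the mclust package) to perform Gaussian mixture clustering. The dataset consists of 21,000+ rows and 10 columns.

I get the following error: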

Error in svd(shape.o, nu = 0) : infinite or missing values in 'x'

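The strange thing is: 1) I have checked for NaN, Inf, etc., and there are none; 2) if I run the model with 9 variables it works fine, but as soon as I add one more variable I get the error. I tried a different set of additional variables and got the same error.

Do you have any idea what is going wrong? Any help is much appreciated.

Edit: the variables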

> str(data_scaled[data_subset, model_variables])
'data.frame':   21304 obs. of  12 variables:
 $ PROD_ALL_OR_NOTHING_PERC: num  -0.064 -0.064 -0.064 -0.064 0.141 ...
 $ PROD_CASH_3_PERC        : num  -0.212 -0.212 -0.212 1.303 0.686 ...
 $ PROD_CASH_4_PERC        : num  -0.18 -0.18 -0.18 1.09 8.75 ...
 $ PROD_EINSTANTS_PERC     : num  -0.502 0.68 2.329 -0.582 …

r svd

lil*_*ffa

2015 04-20

5
Score

1
Answers

1201
Views

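I expected hist(data, labels=TRUE) to give me a histogram with 9 bins, one for zero, one for one, and so on, with each bin showing its value. Instead it lumps 0 and 1 together, and after a day of googling I still can't figure out how to fix it. I also tried setting the number of bins with hist(data, breaks=c(0,8)), but that didn't work. As an alternative I tried histogram from the lattice package, which bins correctly, but I can't figure out how to display the value of each bin. Can you help me either way (getting the right number of columns with hist(), or showing the bin values with histogram())? Thanks a lot.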
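
The merging of 0 and 1 follows from how right-closed bins work: with integer breakpoints 0, 1, ..., 8 the first interval is [0, 1], so both values land in it. The Python sketch below imitates that binning rule (it is not R's `hist` itself; the function name and the sample data are made up for illustration) and shows how breakpoints at half-integers give one bin per value.

```python
from bisect import bisect_left


def right_closed_counts(values, breaks):
    """Count values into intervals (b[i], b[i+1]]; the first interval also includes b[0]."""
    counts = [0] * (len(breaks) - 1)
    for v in values:
        if v < breaks[0] or v > breaks[-1]:
            raise ValueError(f"{v} lies outside the breakpoints")
        # bisect_left finds the first breakpoint >= v, which is the right edge of v's bin.
        # A value equal to the lowest breakpoint is folded into the first bin.
        idx = max(bisect_left(breaks, v) - 1, 0)
        counts[idx] += 1
    return counts


# Made-up integer data covering 0..8.
data = [0, 1, 1, 2, 3, 4, 5, 6, 7, 8]

integer_breaks = list(range(0, 9))          # 0, 1, ..., 8 -> only 8 bins
half_breaks = [i - 0.5 for i in range(10)]  # -0.5, 0.5, ..., 8.5 -> 9 bins

merged = right_closed_counts(data, integer_breaks)
separate = right_closed_counts(data, half_breaks)
print(merged)
print(separate)
```

In R, the equivalent of the second call would be along the lines of hist(data, breaks = seq(-0.5, 8.5, by = 1), labels = TRUE). Note that a vector such as c(0,8) is read as two breakpoints, which yields a single bin rather than eight.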

r histogram

lil*_*ffa

lucky-day

4
Score

1
Answers

4782
Views

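Installation problem: Pentaho 5.1 CE on JDK 1.8

I am new to Pentaho and need to install it on my machine to start a BI project. I downloaded the latest community edition from the community site (biserver-ce-5.1.0.0-752), set PENTAHO_JAVA_HOME to point to my JDK 1.8 installation, and simply unzipped the files and ran it.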

2014-07-22 00:02:48,669 ERROR [org.pentaho.platform.util.logging.Logger] Error: Pentaho
2014-07-22 00:02:48,671 ERROR [org.pentaho.platform.util.logging.Logger] misc-class org.pentaho.platform.plugin.services.pluginmgr.DefaultPluginManager: PluginManager.ERROR_0011 - Failed to register plugin cgg
org.springframework.beans.factory.BeanDefinitionStoreException: Unexpected exception parsing XML document from file [C:\Pentaho\biserver-ce\pentaho-solutions\system\cgg\plugin.spring.xml]; nested exception is java.lang.IllegalStateException: Context namespace element 'annotation-config' and its parser class [org.springframework.context.annotation.AnnotationConfigBeanDefinitionParser] are only available on JDK 1.5 and higher
    at org.springframework.beans.factory.xml.XmlBeanDefinitionReader.doLoadBeanDefinitions(XmlBeanDefinitionReader.java:420)
    at org.springframework.beans.factory.xml.XmlBeanDefinitionReader.loadBeanDefinitions(XmlBeanDefinitionReader.java:342)
    at org.springframework.beans.factory.xml.XmlBeanDefinitionReader.loadBeanDefinitions(XmlBeanDefinitionReader.java:310)
    at org.pentaho.platform.plugin.services.pluginmgr.DefaultPluginManager.getNativeBeanFactory(DefaultPluginManager.java:411)
    at org.pentaho.platform.plugin.services.pluginmgr.DefaultPluginManager.initializeBeanFactory(DefaultPluginManager.java:439)
    at org.pentaho.platform.plugin.services.pluginmgr.DefaultPluginManager.reload(DefaultPluginManager.java:189)
    at org.pentaho.platform.plugin.services.pluginmgr.PluginAdapter.startup(PluginAdapter.java:40)
    at org.pentaho.platform.engine.core.system.PentahoSystem$2.call(PentahoSystem.java:398)
    at org.pentaho.platform.engine.core.system.PentahoSystem$2.call(PentahoSystem.java:389)
    at org.pentaho.platform.engine.core.system.PentahoSystem.runAsSystem(PentahoSystem.java:368)
    at org.pentaho.platform.engine.core.system.PentahoSystem.notifySystemListenersOfStartup(PentahoSystem.java:389)
    at org.pentaho.platform.engine.core.system.PentahoSystem.access$000(PentahoSystem.java:77)
    at org.pentaho.platform.engine.core.system.PentahoSystem$1.call(PentahoSystem.java:326)
    at org.pentaho.platform.engine.core.system.PentahoSystem$1.call(PentahoSystem.java:323)
    at …

installation exception pentaho

lil*_*ffa

lucky-day

4
Score

1
Answers

5704
Views

SparkSQL: conditional sum using two columns

I hope you can help me. I have a DataFrame as follows:

import org.apache.spark.sql.functions.to_date
// toDF and $"..." need the SQL implicits in scope (spark-shell imports them automatically)

val df = sc.parallelize(Seq(
  (1, "a", "2014-12-01", "2015-01-01", 100),
  (2, "a", "2014-12-01", "2015-01-02", 150),
  (3, "a", "2014-12-01", "2015-01-03", 120),
  (4, "b", "2015-12-15", "2015-01-01", 100)
)).toDF("id", "prodId", "dateIns", "dateTrans", "value")
  .withColumn("dateIns", to_date($"dateIns"))
  .withColumn("dateTrans", to_date($"dateTrans"))

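I would like to group by prodId and sum 'value', restricting the sum to date ranges defined by the difference between the 'dateIns' and 'dateTrans' columns. In particular, I would like a way to define a conditional sum that adds up all values falling within a predefined maximum difference between those columns, i.e. all values that occur within 10, 20, or 30 days of dateIns ('dateTrans' - 'dateIns' <= 10, 20, 30).

Is there a predefined aggregate function in Spark that allows conditional sums? Or would you recommend developing an aggregate UDF (and if so, any suggestions)? I am using PySpark, but I would also be happy with a Scala solution. Thanks a lot!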
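
To pin down what such a conditional sum should return, here is a plain-Python sketch of the logic on the sample rows above. It is not Spark code; the function name `conditional_sums` is hypothetical, and it assumes "within n days" means 0 <= dateTrans - dateIns <= n, so transactions dated before dateIns are excluded.

```python
from collections import defaultdict
from datetime import date


def conditional_sums(rows, max_days):
    """Per prodId, sum `value` over rows with 0 <= dateTrans - dateIns <= n, for each n."""
    totals = defaultdict(lambda: {n: 0 for n in max_days})
    for _id, prod_id, date_ins, date_trans, value in rows:
        diff = (date.fromisoformat(date_trans) - date.fromisoformat(date_ins)).days
        bucket = totals[prod_id]  # creates an all-zero entry so every prodId appears
        for n in max_days:
            # Assumption: negative differences (dateTrans before dateIns) do not count.
            if 0 <= diff <= n:
                bucket[n] += value
    return dict(totals)


# Sample rows from the question: (id, prodId, dateIns, dateTrans, value).
rows = [
    (1, "a", "2014-12-01", "2015-01-01", 100),
    (2, "a", "2014-12-01", "2015-01-02", 150),
    (3, "a", "2014-12-01", "2015-01-03", 120),
    (4, "b", "2015-12-15", "2015-01-01", 100),
]

result = conditional_sums(rows, (10, 20, 30))
wider = conditional_sums(rows, (31, 33))
print(result)
print(wider)
```

On this sample every difference for prodId "a" is 31 to 33 days and the one for "b" is negative, so the 10, 20, and 30 day sums are all zero; the wider windows are included only to show non-zero totals. In Spark the same shape is typically expressed without a UDF, as a sum over a conditional expression built from `datediff` and `when`, one aggregate per threshold.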

sql aggregate-functions apache-spark apache-spark-sql pyspark

lil*_*ffa

2015 11-23

3
Score

1
Answers

7534
Views