My question is: apart from defining a function with extra arguments outside of summarise_each, is there a way to pass the arguments directly inside summarise_each?
For example, I want the mean with NAs removed. This approach works:
mean_fun = function(x) mean(x, na.rm = TRUE)
AA_group = AA_new %>% group_by(tractID)
AA_group %>% summarise_each(funs(mean_fun))
I wonder whether there is a way to pass na.rm = TRUE to summarise_each directly, something like a more_args option. If I put the function definition straight into summarise_each,
AA_group %>% summarise_each(funs(function(x) mean(x, na.rm = TRUE)))
the error is
expecting a single value
Does this mean that every time we want to use summarise_each, we have to define a separate function beforehand?
I am still getting this error today. I have installed plyr from GitHub, but it still doesn't work; after installing plyr I restarted RStudio and even my computer.

My question is that I want to run feature selection on data with several categorical variables. I used get_dummies in pandas to generate the dummy (one-hot) matrices for these categorical variables. My question is how sklearn can know that a particular group of dummy columns actually belongs to one feature, so that it selects or drops all of them together. For example, I have a variable called city with three levels, New York, Chicago, and Boston, so the dummy matrix looks like:
[1,0,0]
[0,1,0]
[0,0,1]
How do I tell sklearn that these three "columns" actually belong to one feature, namely city, so it does not end up keeping New York while dropping Chicago and Boston?
Many thanks!
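sklearn itself treats every column as an independent feature, so the grouping has to be maintained on your side: get_dummies names its output columns with the pattern feature-prefix plus level (default prefix_sep="_"), which makes it easy to map each dummy column back to its source feature and then keep or drop the whole group at once. A minimal pure-Python sketch of that bookkeeping (the column names here are hypothetical examples, not your data):

```python
# Dummy column names in the form pandas.get_dummies produces:
# "<feature><prefix_sep><level>", with the default prefix_sep="_"
dummy_cols = ["city_New York", "city_Chicago", "city_Boston", "age"]
original_features = {"city"}  # the categorical features we one-hot encoded

groups = {}
for col in dummy_cols:
    feature = col.split("_", 1)[0]
    # Columns whose prefix is a known categorical feature belong to that
    # feature; everything else (e.g. a numeric column) is its own feature
    key = feature if feature in original_features else col
    groups.setdefault(key, []).append(col)
```

With such a mapping, whenever a selection step deselects the city group you drop every column in groups["city"] together, instead of letting sklearn keep New York while discarding Chicago and Boston.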
I want to import a table from SQL Server, using the JDBC driver locally, and run Spark SQL on it. I downloaded sqljdbc for SQL Server and added this line to spark-env.sh in the conf directory:
SPARK_CLASSPATH= "C:/Program Files/Microsoft SQL Server/sqljdbc_4.0/enu/sqljdbc4.jar" ./bin/spark-shell
as suggested here, and used this line to load the data:
df = sqlContext.load(source="jdbc", url="jdbc:sqlserver:dd", dbtable="Reporting.dbo.datatable")
However, it throws an error:
Py4JJavaError: An error occurred while calling o28.load.
: java.sql.SQLException: No suitable driver found for jdbc:sqlserver:PC-BFS2
at java.sql.DriverManager.getConnection(Unknown Source)
at java.sql.DriverManager.getConnection(Unknown Source)
at org.apache.spark.sql.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:118)
at org.apache.spark.sql.jdbc.JDBCRelation.<init>(JDBCRelation.scala:128)
at org.apache.spark.sql.jdbc.DefaultSource.createRelation(JDBCRelation.scala:113)
at org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:269)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:114)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Unknown Source)
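"No suitable driver found" usually means the JDBC URL is malformed, the driver jar is not on the driver's classpath, or both. Here the URL "jdbc:sqlserver:dd" is missing the "//" and host part: the Microsoft SQL Server driver expects URLs of the form jdbc:sqlserver://host:port;databaseName=db. A small pure-Python helper that builds such a URL (the helper name is my own, not a Spark or driver API):

```python
def sqlserver_jdbc_url(host, port=1433, database=None):
    """Build a JDBC URL in the form the Microsoft SQL Server driver
    expects: jdbc:sqlserver://host:port;databaseName=db"""
    url = f"jdbc:sqlserver://{host}:{port}"
    if database is not None:
        url += f";databaseName={database}"
    return url
```

A URL built this way would replace the url= argument in the sqlContext.load call above; the driver jar itself still has to be visible to the JVM (for example via the classpath setting shown earlier).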
My question is how to change the font size in seaborn when plotting a correlation matrix; I don't know why the font comes out so large for me.
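For heatmap-style correlation plots, the in-cell annotation font is controlled by annot_kws (globally, sns.set(font_scale=...) rescales all text). A minimal sketch with random toy data standing in for a real dataset:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs without a display
import numpy as np
import pandas as pd
import seaborn as sns

# Toy data with hypothetical columns, just to get a correlation matrix
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(20, 3)), columns=["a", "b", "c"])

# annot_kws shrinks only the numbers inside the cells
ax = sns.heatmap(df.corr(), annot=True, annot_kws={"size": 8})
```

annot_kws is forwarded to the text calls that draw the cell values, so any matplotlib text property (size, weight, color) can go there.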
My plot looks like the figure below, and my code is here:
g = sns.FacetGrid(teacherValueFinal3, row='Grade Level', col='Course', margin_titles=True)
g1 = g.set_titles("gg")
g1.map(sns.violinplot, '2013-2014 Assessment Score', 'Gender', color="RdBu")

It seems that seaborn's default is to put the first variable passed to the map function on the x axis. However, for violinplot the grouping argument has to come second. Is there a way to flip the default axis assignment, i.e. put Gender on the x axis and the 2013-2014 assessment scores on the y axis?
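One way around the positional convention of FacetGrid.map is FacetGrid.map_dataframe, which forwards keyword arguments to the plotting function, so x and y can be named explicitly. A minimal sketch with toy stand-in data (the column names here are hypothetical, not the original teacherValueFinal3 ones):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs without a display
import pandas as pd
import seaborn as sns

df = pd.DataFrame({
    "score":  [70, 80, 65, 90, 75, 85],
    "Gender": ["M", "F", "M", "F", "M", "F"],
})

g = sns.FacetGrid(df)
# map_dataframe passes keyword args straight through to violinplot,
# so Gender lands on the x axis and score on the y axis
g.map_dataframe(sns.violinplot, x="Gender", y="score")
```

The same call works with the row/col faceting from the snippet above; the only change is naming x and y instead of relying on argument order.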
I have several categorical features and want to transform them with OneHotEncoder. However, when I try to apply StringIndexer, I get an error:
stringIndexer = StringIndexer(
    inputCol = ['a', 'b', 'c', 'd'],
    outputCol = ['a_index', 'b_index', 'c_index', 'd_index']
)
model = stringIndexer.fit(Data)
An error occurred while calling o328.fit.
: java.lang.ClassCastException: java.util.ArrayList cannot be cast to java.lang.String
at org.apache.spark.ml.feature.StringIndexer.fit(StringIndexer.scala:79)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Thread.java:745)
Traceback (most recent call last):
Py4JJavaError: An error occurred while calling o328.fit.
: java.lang.ClassCastException: java.util.ArrayList cannot be cast to java.lang.String
at …
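The ClassCastException comes from passing lists where StringIndexer expects single strings: inputCol and outputCol each take one column name, so you need one StringIndexer per column, typically chained in a Pipeline (newer Spark versions also offer inputCols/outputCols for the multi-column case). What each per-column indexer computes is just a frequency-ordered label-to-index map; a pure-Python sketch of that idea (not the pyspark API):

```python
from collections import Counter

def string_index(values):
    """Map each distinct label to an index, most frequent label first
    (ties broken alphabetically), mirroring what a single StringIndexer
    does for one column with the default frequency ordering."""
    freq = Counter(values)
    labels = sorted(freq, key=lambda s: (-freq[s], s))
    mapping = {label: i for i, label in enumerate(labels)}
    return [mapping[v] for v in values]

# One indexer per categorical column, as a Pipeline of StringIndexers would do
columns = {"a": ["x", "y", "x"], "b": ["p", "p", "q"]}
indexed = {col + "_index": string_index(vals) for col, vals in columns.items()}
```

In pyspark the equivalent is a list of StringIndexer stages, one per (inputCol, outputCol) pair, fitted together inside a Pipeline.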
I built the model and can generate all the variable importance plots. However, when I use partialPlot on a continuous variable, for example purchase_value:
partialPlot(rf, fraud_data_train, purchase_value, which.class = 1)
the error is
Error in is.finite(x): default method not implemented for type 'list'
For a categorical variable (browser), the error is:
partialPlot(rf, fraud_data_train, browser, which.class = 1)
Error in FUN(X[[i]], ...) :
only defined on a data frame with all numeric variables
The data is available here, and the code is as follows:
rf = randomForest(y = fraud_data_train$class_factor,
                  x = fraud_data_train[, -predictors_notinclude],
                  ntree = 30, mtry = 4, keep.forest = TRUE,
                  importance = TRUE, proximity = TRUE)
partialPlot(rf, fraud_data_train, purchase_value, which.class = 1)
Update:
Update 2 …
My code is below; somehow there is always an error near @Name:
DECLARE @Name nvarchar(MAX) = '(mm.dll, ben and jerry.exe)'
DECLARE @sql nvarchar(MAX)= 'SELECT OrderName,
customer.version,
count(DISTINCT company.CID) as Counts
FROM [CompanyData] company
INNER JOIN [vendor] mav on company.CID = mav.CID
LEFT OUTER JOIN [Customer] customer on company.VendorId = customer.VendorId AND company.DId = customer.DId
WHERE OrderName in' + @Name+ '
GROUP BY
customer.version, OrderName'
EXEC sp_executesql @sql
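The error near @Name comes from the interpolated list itself: after concatenation the statement reads WHERE OrderName in (mm.dll, ben and jerry.exe), which is not valid T-SQL, because each item in an IN list must be a quoted string literal, e.g. IN ('mm.dll', 'ben and jerry.exe'). Building the list by naive concatenation also invites SQL injection, so embedded quotes need doubling at minimum. A small pure-Python sketch of rendering such a list (the helper name is my own, for illustration only):

```python
def sql_in_list(values):
    """Render a T-SQL IN (...) list: each value becomes a quoted
    string literal, with embedded single quotes doubled."""
    quoted = ", ".join("'" + v.replace("'", "''") + "'" for v in values)
    return "(" + quoted + ")"

clause = "WHERE OrderName IN " + sql_in_list(["mm.dll", "ben and jerry.exe"])
```

In pure T-SQL the same fix means setting @Name to '(''mm.dll'', ''ben and jerry.exe'')' so the doubled quotes survive into the dynamic string.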
I want to write a Spark dataframe to Parquet, specifying not a partitionBy column but the number of partitions (numPartitions) or the size of each partition. Is there a simple way to do this in PySpark?