Posts by E B*_*E B

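R caret rpart returns an error in `[.data.frame`(m, labs): undefined columns selected

I am running a classification with rpart. I need to prepare the data in a sparse format to run multiple models.

When I run the rpart method with this call: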

control <- trainControl(method="repeatedcv", number=10, repeats=3)
#Metric Measurement for Model Performance
fitmetric <- "Accuracy"
set.seed(seed)

ptm <- proc.time()
adultFit.cart <- train(response~., data=adultTraining, method="rpart", metric=fitmetric, trControl=control,
                  parms = list( split = "information"),control=rpart.control(cp = 0.04))
proc.time() - ptm

I get this message:

`[.data.frame`(m, labs) : undefined columns selected

I can't figure out what is causing this, since the same setup runs fine for all the other models.

Here is the definition of the df I use to test the function and the example below:

> str(adultTraining)
'data.frame':   22793 obs. of  57 variables:
 $ age                                : num  53 37 42 37 30 23 34 25 32 43 ...
 $ fnlwgt                             : num  234721 …

r classification rpart r-caret

E B*_*E B

2017-09-23

7 votes, 2 answers, 6198 views

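How to access the values of aggregate functions in Python

I created a DataFrame with a patient id and a timestamp column. I grouped it by patient id and aggregated the timestamps to get the minimum and maximum for each group, like this:

bypatient_date = pd.DataFrame(byencounter.agg({'timestamp' : [np.min,np.max]})).reset_index()

The resulting DataFrame looks like this: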

  patient_id  timestamp            
              amin        amax
0         19  3396-08-21  3396-08-25
1         99  2723-09-27  2727-03-17
2       3014  2580-12-02  2581-05-01
3      24581  3399-07-19  3401-04-13

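I am trying to find the difference between the minimum and maximum for each patient id, but I am having trouble accessing the values in timestamp amin and timestamp amax. Is there a way to do this without looping, using built-in pandas or numpy functionality?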
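
For illustration only, here is a minimal sketch of one way the two aggregated columns can be reached without a loop. It assumes the aggregation leaves a two-level column index, as the printed table suggests. The sample rows and the name `span` are hypothetical, and the string aggregators "min" and "max" are used here, so the second-level labels are `min` and `max` rather than the `amin` and `amax` shown above.

```python
import pandas as pd

# Hypothetical sample data; the original post does not show the raw frame.
df = pd.DataFrame({
    "patient_id": [19, 19, 99, 99],
    "timestamp": pd.to_datetime(
        ["2016-08-21", "2016-08-25", "2016-09-27", "2016-10-17"]
    ),
})

# Aggregating with a list of functions produces two-level column labels
# such as ("timestamp", "min") and ("timestamp", "max").
bypatient_date = (
    df.groupby("patient_id")
      .agg({"timestamp": ["min", "max"]})
      .reset_index()
)

# A tuple key selects one second-level column; subtraction is vectorized,
# so no explicit loop over patients is needed.
span = bypatient_date[("timestamp", "max")] - bypatient_date[("timestamp", "min")]

print(span.dt.days.tolist())
```

With the labels from the post, the same tuple lookup would presumably use `("timestamp", "amax")` and `("timestamp", "amin")`.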

python aggregate pandas

E B*_*E B

2016-01-22

5 votes, 1 answer, 5148 views

CVlm {DAAG}: setting printit = FALSE causes a problem - object 'sumss' not found

I am using CVlm {DAAG} and I want to set printit=FALSE, because the default is

printit if TRUE, output is printed to the screen

I tried running the function with plotit and printit set to FALSE, but then got the error: object 'sumss' not found.

Is there a way to actually set printit and plotit to FALSE? I don't want the plot or the tables printed to the screen.

regression r linear-regression lm cross-validation

E B*_*E B

2016-11-23

5 votes, 1 answer, 1283 views

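In R ggplot, how to change the labels of a multi-bar chart

I am trying to change the labels of a multi-variable plot in ggplot.

The plot shows two variables, Count and Total Gross. The x labels are G and N (the values of the variable in my data), and I want to change them to something more descriptive.

How do I update my ggplot statement to bring in these new labels?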

test %>%
  group_by(DiscInd) %>%
  summarise(Count = n(), TotalGross = sum(Gross) / 100000000) %>%
  gather(Var, Val, -DiscInd) %>%
  ggplot(., aes(x = DiscInd, y = Val, fill = Var)) +
  geom_bar(stat = "identity", position = "dodge") +
  xlab("Year vs Released Difference") +
  ylab("Total Gross") +
  ggtitle("Total Movie with Gross ")

Is this possible?

Here are some records from my data frame test:

        DiscInd      Gross
          N        2783918982
          N        2207615668
          N        1670328025
          N        1519479547
          G        1514019071
          G        1404705868

Update: I am also trying to change the labels and format them so that they do not overlap each other.

r bar-chart ggplot2

E B*_*E B

2016-10-03

2 votes, 1 answer, 10k views

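R: caret, how do we pass the k parameter for kNN

I am using caret for knn. I initially ran the procedure with tuneLength=10 and found that the model used k=21.

I want to run it with a specific set of k values, but I get an error whether I pass the values in tuneGrid or pass the k value directly to the train function.

Data: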

library(mlbench)
data(PimaIndiansDiabetes)

Code:

grid = expand.grid(k = c(5,7,9,15,19,21))

compute_learncurve5 <- function(df=adultFile,control=control,ratio=30,fold=10,N=3,metric="Accuracy",
                                seed=1234,scaled=FALSE,DEBUG=FALSE) {
  result_df = c()
  size <- round(size=(ratio/100 * nrow(df)))
  split <-  gsub(" ","",paste(as.character(100-ratio),"/",as.character(ratio)))
  iter <-  N
  trainSize <-  nrow(df)-size
  testSize <-  size

  if (DEBUG){
    print(paste("Dimension of InputDataSet : ", dim(df)))
    print(paste("Test/Train Perct : ",ratio,"|",100-ratio,
                " : Train/Test size = ", trainSize,"|",testSize))
  }

  #Set-up data
  trainpct  <- (100-ratio)/100

  # Set-up Train and Test - Change target variable …

r knn r-caret

E B*_*E B

lucky-day

2 votes, 1 answer, 4342 views

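pyspark: load multiple partitioned files in a single load

I am trying to load multiple files in a single load. They are all partitioned files. When I try it with 1 file it works, but when I list 24 files it gives me the error below. I could not find any documentation about a limit, and the only workaround I found is to union the files after loading them. Are there any other options?

The code below reproduces the problem: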

basePath = '/file/'
paths = ['/file/df201601.orc', '/file/df201602.orc', '/file/df201603.orc',
         '/file/df201604.orc', '/file/df201605.orc', '/file/df201606.orc',
         '/file/df201604.orc', '/file/df201605.orc', '/file/df201606.orc',
         '/file/df201604.orc', '/file/df201605.orc', '/file/df201606.orc',
         '/file/df201604.orc', '/file/df201605.orc', '/file/df201606.orc',
         '/file/df201604.orc', '/file/df201605.orc', '/file/df201606.orc',
         '/file/df201604.orc', '/file/df201605.orc', '/file/df201606.orc',
         '/file/df201604.orc', '/file/df201605.orc', '/file/df201606.orc', ]

df = sqlContext.read.format('orc') \
               .options(header='true', inferschema='true', basePath=basePath) \
               .load(*paths)

I get this error:

 TypeError                                 Traceback (most recent call last)
 <ipython-input-43-7fb8fade5e19> in <module>()

---> 37 df = sqlContext.read.format('orc')                .options(header='true', inferschema='true',basePath=basePath)                .load(*paths)
     38 

TypeError: load() takes at most 4 arguments (24 given)
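
The message is consistent with `*paths` spreading the list across the positional parameters of `load`. The sketch below uses a hypothetical stand-in function (not PySpark itself) with the same parameter layout as `DataFrameReader.load(path=None, format=None, schema=None, **options)` to show the difference between unpacking the list and passing it as one argument. Whether the installed Spark version accepts a list for `path` is an assumption the post does not confirm.

```python
# Hypothetical stand-in mirroring the parameter layout of PySpark's
# DataFrameReader.load; it only records its arguments and reads no files.
def load(path=None, format=None, schema=None, **options):
    return {"path": path, "format": format, "schema": schema}

paths = ["/file/df201601.orc", "/file/df201602.orc",
         "/file/df201603.orc", "/file/df201604.orc"]

# Unpacking maps the first three entries to path, format and schema,
# and any further entry has no positional parameter to land in.
try:
    load(*paths)
    unpack_error = None
except TypeError as exc:
    unpack_error = str(exc)

# Passing the list itself keeps every entry inside the single path argument.
result = load(paths)

print(unpack_error)
print(result["path"])
```

If the real reader accepts a list, the equivalent change in the code above would be `.load(paths)` in place of `.load(*paths)`.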

partitioned-view apache-spark apache-spark-sql pyspark orc

E B*_*E B

2018-01-20

2 votes, 1 answer, 5356 views