如何调试"对比只能应用于具有2级或更多级别的因素"错误?

Tro*_*roy 23 regression r lm glm r-faq

以下是我正在使用的所有变量:

str(ad.train)
$ Date                : Factor w/ 427 levels "2012-03-24","2012-03-29",..: 4 7 12 14 19 21 24 29 31 34 ...
 $ Team                : Factor w/ 18 levels "Adelaide","Brisbane Lions",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ Season              : int  2012 2012 2012 2012 2012 2012 2012 2012 2012 2012 ...
 $ Round               : Factor w/ 28 levels "EF","GF","PF",..: 5 16 21 22 23 24 25 26 27 6 ...
 $ Score               : int  137 82 84 96 110 99 122 124 49 111 ...
 $ Margin              : int  69 18 -56 46 19 5 50 69 -26 29 ...
 $ WinLoss             : Factor w/ 2 levels "0","1": 2 2 1 2 2 2 2 2 1 2 ...
 $ Opposition          : Factor w/ 18 levels "Adelaide","Brisbane Lions",..: 8 18 10 9 13 16 7 3 4 6 ...
 $ Venue               : Factor w/ 19 levels "Adelaide Oval",..: 4 7 10 7 7 13 7 6 7 15 ...
 $ Disposals           : int  406 360 304 370 359 362 365 345 324 351 ...
 $ Kicks               : int  252 215 170 225 221 218 224 230 205 215 ...
 $ Marks               : int  109 102 52 41 95 78 93 110 69 85 ...
 $ Handballs           : int  154 145 134 145 138 144 141 115 119 136 ...
 $ Goals               : int  19 11 12 13 16 15 19 19 6 17 ...
 $ Behinds             : int  19 14 9 16 11 6 7 9 12 6 ...
 $ Hitouts             : int  42 41 34 47 45 70 48 54 46 34 ...
 $ Tackles             : int  73 53 51 76 65 63 65 67 77 58 ...
 $ Rebound50s          : int  28 34 23 24 32 48 39 31 34 29 ...
 $ Inside50s           : int  73 49 49 56 61 45 47 50 49 48 ...
 $ Clearances          : int  39 33 38 52 37 43 43 48 37 52 ...
 $ Clangers            : int  47 38 44 62 49 46 32 24 31 41 ...
 $ FreesFor            : int  15 14 15 18 17 15 19 14 18 20 ...
 $ ContendedPossessions: int  152 141 149 192 138 164 148 151 160 155 ...
 $ ContestedMarks      : int  10 16 11 3 12 12 17 14 15 11 ...
 $ MarksInside50       : int  16 13 10 8 12 9 14 13 6 12 ...
 $ OnePercenters       : int  42 54 30 58 24 56 32 53 50 57 ...
 $ Bounces             : int  1 6 4 4 1 7 11 14 0 4 ...
 $ GoalAssists         : int  15 6 9 10 9 12 13 14 5 14 ...
Run Code Online (Sandbox Code Playgroud)

这是我想要适合的glm:

ad.glm.all <- glm(WinLoss ~ factor(Team) + Season  + Round + Score  + Margin + Opposition + Venue + Disposals + Kicks + Marks + Handballs + Goals + Behinds + Hitouts + Tackles + Rebound50s + Inside50s+ Clearances+ Clangers+ FreesFor + ContendedPossessions + ContestedMarks + MarksInside50 + OnePercenters + Bounces+GoalAssists, 
                  data = ad.train, family = binomial(logit))
Run Code Online (Sandbox Code Playgroud)

我知道这是很多变量(计划是通过前向变量选择减少).但即使知道很多变量,他们要么是int,要么是因素; 据我所知,事情应该与glm一起工作.但是,每当我尝试适应这个模型时,我得到:

Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) : contrasts can be applied only to factors with 2 or more levels
Run Code Online (Sandbox Code Playgroud)

哪种看起来好像R因为某些原因而没有将我的因子变量视为因子变量?

即便是一些简单的事情:

ad.glm.test <- glm(WinLoss ~ factor(Team), data = ad.train, family = binomial(logit))
Run Code Online (Sandbox Code Playgroud)

不工作!(相同的错误消息)

这样的地方:

ad.glm.test <- glm(WinLoss ~ Clearances, data = ad.train, family = binomial(logit))
Run Code Online (Sandbox Code Playgroud)

将工作!

谁知道这里发生了什么?为什么我不能将这些因子变量适合我的glm?

提前致谢!

-Troy

李哲源*_*李哲源 49

介绍

什么是"对比错误"已得到很好的解释:你有一个只有一个级别(或更少)的因素.但实际上这个简单的事实很容易被掩盖,因为实际用于模型拟合的数据可能与您传入的数据非常不同.当您拥有NA数据时,会对数据进行子集化,这是一个因素.有未使用的级别,或者你已经转换了变量并到达NaN某个地方.在这种理想情况下,您很少能直接发现单级因子str(your_data_frame).关于此错误的StackOverflow上的许多问题都是不可重现的,因此人们的建议可能会也可能不会起作用.因此,虽然现在有118个帖子关于这个问题,用户仍然找不到自适应的解决方案,以便一次又一次地提出这个问题.这个答案是我的尝试,"一劳永逸"地解决这个问题,或者至少提供一个合理的指导.

这个答案有丰富的信息,所以让我先快速总结一下.

我定义了3个辅助功能为您提供:debug_contr_error,debug_contr_error2,NA_preproc.

我建议您按以下方式使用它们.

  1. 跑来NA_preproc获得更完整的案件;
  2. 运行您的模型,如果您遇到"对比错误",请使用debug_contr_error2调试.

大多数答案会逐步向您显示这些函数的定义方式和原因.跳过这些开发过程可能没什么坏处,但不要跳过"可重复的案例研究和讨论"中的部分.


修改后的答案

原来的答案 完全适用于OP,并已成功帮助一些人.但由于缺乏适应性,它在其他地方失败了.看一下问题的输出str(ad.train).OP的变量是数字或因子; 没有人物.最初的答案是针对这种情况.如果您有字符变量,虽然它们会在期间lmglm拟合期间被强制转换,但它们不会被代码报告,因为它们不是作为因素提供的,因此is.factor会错过它们.在这个扩展中,我将使原始答案更具适应性.

dat你的数据集传递给lmglm.如果您没有这样的数据框,也就是说,所有变量都分散在全局环境中,则需要将它们收集到数据框中.以下可能不是最好的方法,但它的工作原理.

## `form` is your model formula, here is an example
y <- x1 <- x2 <- x3 <- 1:4
x4 <- matrix(1:8, 4)
form <- y ~ bs(x1) + poly(x2) + I(1 / x3) + x4

## to gather variables `model.frame.default(form)` is the easiest way 
## but it does too much: it drops `NA` and transforms variables
## we want something more primitive

## first get variable names
vn <- all.vars(form)
#[1] "y"  "x1" "x2" "x3" "x4"

## `get_all_vars(form)` gets you a data frame
## but it is buggy for matrix variables so don't use it
## instead, first use `mget` to gather variables into a list
lst <- mget(vn)

## don't do `data.frame(lst)`; it is buggy with matrix variables
## need to first protect matrix variables by `I()` then do `data.frame`
lst_protect <- lapply(lst, function (x) if (is.matrix(x)) I(x) else x)
dat <- data.frame(lst_protect)
str(dat)
#'data.frame':  4 obs. of  5 variables:
# $ y : int  1 2 3 4
# $ x1: int  1 2 3 4
# $ x2: int  1 2 3 4
# $ x3: int  1 2 3 4
# $ x4: 'AsIs' int [1:4, 1:2] 1 2 3 4 5 6 7 8

## note the 'AsIs' for matrix variable `x4`
## in comparison, try the following buggy ones yourself
str(get_all_vars(form))
str(data.frame(lst))
Run Code Online (Sandbox Code Playgroud)

第0步:显式子集化

如果您使用了or 的subset参数,则从显式的子集开始:lmglm

## `subset_vec` is what you pass to `lm` via `subset` argument
## it can either be a logical vector of length `nrow(dat)`
## or a shorter positive integer vector giving position index
## note however, `base::subset` expects logical vector for `subset` argument
## so a rigorous check is necessary here
if (mode(subset_vec) == "logical") {
  if (length(subset_vec) != nrow(dat)) {
    stop("'logical' `subset_vec` provided but length does not match `nrow(dat)`")
    }
  subset_log_vec <- subset_vec
  } else if (mode(subset_vec) == "numeric") {
  ## check range
  ran <- range(subset_vec)
  if (ran[1] < 1 || ran[2] > nrow(dat)) {
    stop("'numeric' `subset_vec` provided but values are out of bound")
    } else {
    subset_log_vec <- logical(nrow(dat))
    subset_log_vec[as.integer(subset_vec)] <- TRUE
    } 
  } else {
  stop("`subset_vec` must be either 'logical' or 'numeric'")
  }
dat <- base::subset(dat, subset = subset_log_vec)
Run Code Online (Sandbox Code Playgroud)

第1步:删除不完整的案例

dat <- na.omit(dat)
Run Code Online (Sandbox Code Playgroud)

如果您已经完成步骤0,则可以跳过此步骤,因为会subset自动删除不完整的案例.

第2步:模式检查和转换

数据框列通常是原子矢量,具有以下模式:"逻辑","数字","复杂","字符","原始".对于回归,不同模式的变量的处理方式不同.

"logical",   it depends
"numeric",   nothing to do
"complex",   not allowed by `model.matrix`, though allowed by `model.frame`
"character", converted to "numeric" with "factor" class by `model.matrix`
"raw",       not allowed by `model.matrix`, though allowed by `model.frame`
Run Code Online (Sandbox Code Playgroud)

逻辑变量很棘手.它既可以被视为虚拟变量(1for TRUE; 0for FALSE),也可以被视为"数字",或者它可以被强制为两级因子.这完全取决于是否model.matrix认为从模型公式的规范中需要"因子"强制.为了简单起见,我们可以理解它:它总是被强制转换为一个因子,但应用对比的结果可能最终得到相同的模型矩阵,就像它直接作为虚拟处理一样.

有些人可能想知道为什么不包括"整数".因为整数向量,例如1:4,具有"数字"模式(尝试mode(1:4)).

数据帧列也可以是具有"AsIs"类的矩阵,但是这样的矩阵必须具有"数字"模式.

我们的检查是在产生错误时

  • 找到"复杂"或"原始";
  • 找到"逻辑"或"字符"矩阵变量;

并继续将"逻辑"和"字符"转换为"因子"类的"数字".

## get mode of all vars
var_mode <- sapply(dat, mode)

## produce error if complex or raw is found
if (any(var_mode %in% c("complex", "raw"))) stop("complex or raw not allowed!")

## get class of all vars
var_class <- sapply(dat, class)

## produce error if an "AsIs" object has "logical" or "character" mode
if (any(var_mode[var_class == "AsIs"] %in% c("logical", "character"))) {
  stop("matrix variables with 'AsIs' class must be 'numeric'")
  }

## identify columns that needs be coerced to factors
ind1 <- which(var_mode %in% c("logical", "character"))

## coerce logical / character to factor with `as.factor`
dat[ind1] <- lapply(dat[ind1], as.factor)
Run Code Online (Sandbox Code Playgroud)

请注意,如果数据框列已经是因子变量,则它不会包含在内ind1,因为因子变量具有"数字"模式(尝试mode(factor(letters[1:4]))).

第3步:删除未使用的因子水平

We won't have unused factor levels for factor variables converted from step 2, i.e., those indexed by ind1. However, factor variables that come with dat might have unused levels (often as the result of step 0 and step 1). We need to drop any possible unused levels from them.

## index of factor columns
fctr <- which(sapply(dat, is.factor))

## factor variables that have skipped explicit conversion in step 2
## don't simply do `ind2 <- fctr[-ind1]`; buggy if `ind1` is `integer(0)`
ind2 <- if (length(ind1) > 0L) fctr[-ind1] else fctr

## drop unused levels
dat[ind2] <- lapply(dat[ind2], droplevels)
Run Code Online (Sandbox Code Playgroud)

step 4: summarize factor variables

Now we are ready to see what and how many factor levels are actually used by lm or glm:

## export factor levels actually used by `lm` and `glm`
lev <- lapply(dat[fctr], levels)

## count number of levels
nl <- lengths(lev)
Run Code Online (Sandbox Code Playgroud)

To make your life easier, I've wrapped up those steps into a function debug_contr_error.

Input:

  • dat is your data frame passed to lm or glm via data argument;
  • subset_vec is the index vector passed to lm or glm via subset argument.

Output: a list with

  • nlevels (a list) gives the number of factor levels for all factor variables;
  • levels (a vector) gives levels for all factor variables.

The function produces a warning, if there are no complete cases or no factor variables to summarize.

debug_contr_error <- function (dat, subset_vec = NULL) {
  if (!is.null(subset_vec)) {
    ## step 0
    if (mode(subset_vec) == "logical") {
      if (length(subset_vec) != nrow(dat)) {
        stop("'logical' `subset_vec` provided but length does not match `nrow(dat)`")
        }
      subset_log_vec <- subset_vec
      } else if (mode(subset_vec) == "numeric") {
      ## check range
      ran <- range(subset_vec)
      if (ran[1] < 1 || ran[2] > nrow(dat)) {
        stop("'numeric' `subset_vec` provided but values are out of bound")
        } else {
        subset_log_vec <- logical(nrow(dat))
        subset_log_vec[as.integer(subset_vec)] <- TRUE
        } 
      } else {
      stop("`subset_vec` must be either 'logical' or 'numeric'")
      }
    dat <- base::subset(dat, subset = subset_log_vec)
    } else {
    ## step 1
    dat <- stats::na.omit(dat)
    }
  if (nrow(dat) == 0L) warning("no complete cases")
  ## step 2
  var_mode <- sapply(dat, mode)
  if (any(var_mode %in% c("complex", "raw"))) stop("complex or raw not allowed!")
  var_class <- sapply(dat, class)
  if (any(var_mode[var_class == "AsIs"] %in% c("logical", "character"))) {
    stop("matrix variables with 'AsIs' class must be 'numeric'")
    }
  ind1 <- which(var_mode %in% c("logical", "character"))
  dat[ind1] <- lapply(dat[ind1], as.factor)
  ## step 3
  fctr <- which(sapply(dat, is.factor))
  if (length(fctr) == 0L) warning("no factor variables to summary")
  ind2 <- if (length(ind1) > 0L) fctr[-ind1] else fctr
  dat[ind2] <- lapply(dat[ind2], base::droplevels.factor)
  ## step 4
  lev <- lapply(dat[fctr], base::levels.default)
  nl <- lengths(lev)
  ## return
  list(nlevels = nl, levels = lev)
  }
Run Code Online (Sandbox Code Playgroud)

Here is a constructed tiny example.

dat <- data.frame(y = 1:4,
                  x = c(1:3, NA),
                  f1 = gl(2, 2, labels = letters[1:2]),
                  f2 = c("A", "A", "A", "B"),
                  stringsAsFactors = FALSE)

#  y  x f1 f2
#1 1  1  a  A
#2 2  2  a  A
#3 3  3  b  A
#4 4 NA  b  B

str(dat)
#'data.frame':  4 obs. of  4 variables:
# $ y : int  1 2 3 4
# $ x : int  1 2 3 NA
# $ f1: Factor w/ 2 levels "a","b": 1 1 2 2
# $ f2: chr  "A" "A" "A" "B"

lm(y ~ x + f1 + f2, dat)
#Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) : 
#  contrasts can be applied only to factors with 2 or more levels
Run Code Online (Sandbox Code Playgroud)

Good, we see an error. Now my debug_contr_error exposes that f2 ends up with a single level.

debug_contr_error(dat)
#$nlevels
#f1 f2 
# 2  1 
#
#$levels
#$levels$f1
#[1] "a" "b"
#
#$levels$f2
#[1] "A"
Run Code Online (Sandbox Code Playgroud)

Note that the original short answer is hopeless here, as f2 is provided as a character variable not a factor variable.

## old answer
tmp <- na.omit(dat)
fctr <- lapply(tmp[sapply(tmp, is.factor)], droplevels)
sapply(fctr, nlevels)
#f1 
# 2 
rm(tmp, fctr)
Run Code Online (Sandbox Code Playgroud)

Now let's see an example with a matrix variable x.

dat <- data.frame(X = I(rbind(matrix(1:6, 3), NA)),
                  f = c("a", "a", "a", "b"),
                  y = 1:4)

dat
#  X.1 X.2 f y
#1   1   4 a 1
#2   2   5 a 2
#3   3   6 a 3
#4  NA  NA b 4

str(dat)
#'data.frame':  4 obs. of  3 variables:
# $ X: 'AsIs' int [1:4, 1:2] 1 2 3 NA 4 5 6 NA
# $ f: Factor w/ 2 levels "a","b": 1 1 1 2
# $ y: int  1 2 3 4

lm(y ~ X + f, data = dat)
#Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) : 
#  contrasts can be applied only to factors with 2 or more levels

debug_contr_error(dat)$nlevels
#f 
#1
Run Code Online (Sandbox Code Playgroud)

Note that a factor variable with no levels can cause an "contrasts error", too. You may wonder how a 0-level factor is possible. Well it is legitimate: nlevels(factor(character(0))). Here you will end up with a 0-level factors if you have no complete cases.

dat <- data.frame(y = 1:4,
                  x = rep(NA_real_, 4),
                  f1 = gl(2, 2, labels = letters[1:2]),
                  f2 = c("A", "A", "A", "B"),
                  stringsAsFactors = FALSE)

lm(y ~ x + f1 + f2, dat)
#Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) : 
#  contrasts can be applied only to factors with 2 or more levels

debug_contr_error(dat)$nlevels
#f1 f2 
# 0  0    ## all values are 0
#Warning message:
#In debug_contr_error(dat) : no complete cases
Run Code Online (Sandbox Code Playgroud)

Finally let's see some a situation where if f2 is a logical variable.

dat <- data.frame(y = 1:4,
                  x = c(1:3, NA),
                  f1 = gl(2, 2, labels = letters[1:2]),
                  f2 = c(TRUE, TRUE, TRUE, FALSE))

dat
#  y  x f1    f2
#1 1  1  a  TRUE
#2 2  2  a  TRUE
#3 3  3  b  TRUE
#4 4 NA  b FALSE

str(dat)
#'data.frame':  4 obs. of  4 variables:
# $ y : int  1 2 3 4
# $ x : int  1 2 3 NA
# $ f1: Factor w/ 2 levels "a","b": 1 1 2 2
# $ f2: logi  TRUE TRUE TRUE FALSE
Run Code Online (Sandbox Code Playgroud)

Our debugger will predict a "contrasts error", but will it really happen?

debug_contr_error(dat)$nlevels
#f1 f2 
# 2  1 
Run Code Online (Sandbox Code Playgroud)

No, at least this one does not fail (the NA coefficient is due to the rank-deficiency of the model; don't worry):

lm(y ~ x + f1 + f2, data = dat)
#Coefficients:
#(Intercept)            x          f1b       f2TRUE  
#          0            1            0           NA
Run Code Online (Sandbox Code Playgroud)

It is difficult for me to come up with an example giving an error, but there is also no need. In practice, we don't use the debugger for prediction; we use it when we really get an error; and in that case, the debugger can locate the offending factor variable.

Perhaps some may argue that a logical variable is no different to a dummy. But try the simple example below: it does depends on your formula.

u <- c(TRUE, TRUE, FALSE, FALSE)
v <- c(1, 1, 0, 0)  ## "numeric" dummy of `u`

model.matrix(~ u)
#  (Intercept) uTRUE
#1           1     1
#2           1     1
#3           1     0
#4           1     0

model.matrix(~ v)
#  (Intercept) v
#1           1 1
#2           1 1
#3           1 0
#4           1 0

model.matrix(~ u - 1)
#  uFALSE uTRUE
#1      0     1
#2      0     1
#3      1     0
#4      1     0

model.matrix(~ v - 1)
#  v
#1 1
#2 1
#3 0
#4 0
Run Code Online (Sandbox Code Playgroud)

More flexible implementation using "model.frame" method of lm

You are also advised to go through R: how to debug "factor has new levels" error for linear model and prediction, which explains what lm and glm do under the hood on your dataset. You will understand that steps 0 to 4 listed above are just trying to mimic such internal process. Remember, the data that are actually used for model fitting can be very different from what you've passed in.

Our steps are not completely consistent with such internal processing. For a comparison, you can retrieve the result of the internal processing by using method = "model.frame" in lm and glm. Try this on the previously constructed tiny example dat where f2 is a character variable.

dat_internal <- lm(y ~ x + f1 + f2, dat, method = "model.frame")

dat_internal
#  y x f1 f2
#1 1 1  a  A
#2 2 2  a  A
#3 3 3  b  A

str(dat_internal)
#'data.frame':  3 obs. of  4 variables:
# $ y : int  1 2 3
# $ x : int  1 2 3
# $ f1: Factor w/ 2 levels "a","b": 1 1 2
# $ f2: chr  "A" "A" "A"
## [.."terms" attribute is truncated..]
Run Code Online (Sandbox Code Playgroud)

In practice, model.frame will only perform step 0 and step 1. It also drops variables provided in your dataset but not in your model formula. So a model frame may have both fewer rows and columns than what you feed lm and glm. Type coercing as done in our step 2 is done by the later model.matrix where a "contrasts error" may be produced.

There are a few advantages to first get this internal model frame, then pass it to debug_contr_error (so that it only essentially performs steps 2 to 4).

advantage 1: variables not used in your model formula are ignored

## no variable `f1` in formula
dat_internal <- lm(y ~ x + f2, dat, method = "model.frame")

## compare the following
debug_contr_error(dat)$nlevels
#f1 f2 
# 2  1 

debug_contr_error(dat_internal)$nlevels
#f2 
# 1 
Run Code Online (Sandbox Code Playgroud)

advantage 2: able to cope with transformed variables

It is valid to transform variables in the model formula, and model.frame will record the transformed ones instead of the original ones. Note that, even if your original variable has no NA, the transformed one can have.

dat <- data.frame(y = 1:4, x = c(1:3, -1), f = rep(letters[1:2], c(3, 1)))
#  y  x f
#1 1  1 a
#2 2  2 a
#3 3  3 a
#4 4 -1 b

lm(y ~ log(x) + f, data = dat)
#Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) : 
#  contrasts can be applied only to factors with 2 or more levels
#In addition: Warning message:
#In log(x) : NaNs produced

# directly using `debug_contr_error` is hopeless here
debug_contr_error(dat)$nlevels
#f 
#2 

## this works
dat_internal <- lm(y ~ log(x) + f, data = dat, method = "model.frame")
#  y    log(x) f
#1 1 0.0000000 a
#2 2 0.6931472 a
#3 3 1.0986123 a

debug_contr_error(dat_internal)$nlevels
#f 
#1
Run Code Online (Sandbox Code Playgroud)

Given these benefits, I write another function wrapping up model.frame and debug_contr_error.

Input:

  • form is your model formula;
  • dat is the dataset passed to lm or glm via data argument;
  • subset_vec is the index vector passed to lm or glm via subset argument.

Output: a list with

  • mf (a data frame) gives the model frame (with "terms" attribute dropped);
  • nlevels (a list) gives the number of factor levels for all factor variables;
  • levels (a vector) gives levels for all factor variables.

## note: this function relies on `debug_contr_error`
debug_contr_error2 <- function (form, dat, subset_vec = NULL) {
  ## step 0
  if (!is.null(subset_vec)) {
    if (mode(subset_vec) == "logical") {
      if (length(subset_vec) != nrow(dat)) {
        stop("'logical' `subset_vec` provided but length does not match `nrow(dat)`")
        }
      subset_log_vec <- subset_vec
      } else if (mode(subset_vec) == "numeric") {
      ## check range
      ran <- range(subset_vec)
      if (ran[1] < 1 || ran[2] > nrow(dat)) {
        stop("'numeric' `subset_vec` provided but values are out of bound")
        } else {
        subset_log_vec <- logical(nrow(dat))
        subset_log_vec[as.integer(subset_vec)] <- TRUE
        } 
      } else {
      stop("`subset_vec` must be either 'logical' or 'numeric'")
      }
    dat <- base::subset(dat, subset = subset_log_vec)
    }
  ## step 0 and 1
  dat_internal <- stats::lm(form, data = dat, method = "model.frame")
  attr(dat_internal, "terms") <- NULL
  ## rely on `debug_contr_error` for steps 2 to 4
  c(list(mf = dat_internal), debug_contr_error(dat_internal, NULL))
  }
Run Code Online (Sandbox Code Playgroud)

Try the previous log transform example.

debug_contr_error2(y ~ log(x) + f, dat)
#$mf
#  y    log(x) f
#1 1 0.0000000 a
#2 2 0.6931472 a
#3 3 1.0986123 a
#
#$nlevels
#f 
#1 
#
#$levels
#$levels$f
#[1] "a"
#
#
#Warning message:
#In log(x) : NaNs produced
Run Code Online (Sandbox Code Playgroud)

Try subset_vec as well.

## or: debug_contr_error2(y ~ log(x) + f, dat, c(T, F, T, T))
debug_contr_error2(y ~ log(x) + f, dat, c(1,3,4))
#$mf
#  y   log(x) f
#1 1 0.000000 a
#3 3 1.098612 a
#
#$nlevels
#f 
#1 
#
#$levels
#$levels$f
#[1] "a"
#
#
#Warning message:
#In log(x) : NaNs produced
Run Code Online (Sandbox Code Playgroud)

Model fitting per group and NA as factor levels

If you are fitting model by group, you are more likely to get a "contrasts error". You need to

  1. split your data frame by the grouping variable (see ?split.data.frame);
  2. work through those data frames one by one, applying debug_contr_error2 (lapply function can be helpful to do this loop).

Some also told me that they can not use na.omit on their data, because it will end up too few rows to do anything sensible. This can be relaxed. In practice it is the NA_integer_ and NA_real_ that have to be omitted, but NA_character_ can be retained: just add NA as a factor level. To achieve this, you need to loop through variables in your data frame:

  • if a variable x is already a factor and anyNA(x) is TRUE, do x <- addNA(x). The "and" is important. If x has no NA, addNA(x) will add an unused <NA> level.
  • if a variable x is a character, do x <- factor(x, exclude = NULL) to coerce it to a factor. exclude = NULL will retain <NA> as a level.
  • if x is "logical", "numeric", "raw" or "complex", nothing should be changed. NA is just NA.

<NA> factor level will not be dropped by droplevels or na.omit, and it is valid for building a model matrix. Check the following examples.

## x is a factor with NA

x <- factor(c(letters[1:4], NA))  ## default: `exclude = NA`
#[1] a    b    c    d    <NA>     ## there is an NA value
#Levels: a b c d                  ## but NA is not a level

na.omit(x)  ## NA is gone
#[1] a b c d
#[.. attributes truncated..]
#Levels: a b c d

x <- addNA(x)  ## now add NA into a valid level
#[1] a    b    c    d    <NA>
#Levels: a b c d <NA>  ## it appears here

droplevels(x)    ## it can not be dropped
#[1] a    b    c    d    <NA>
#Levels: a b c d <NA>

na.omit(x)  ## it is not omitted
#[1] a    b    c    d    <NA>
#Levels: a b c d <NA>

model.matrix(~ x)   ## and it is valid to be in a design matrix
#  (Intercept) xb xc xd xNA
#1           1  0  0  0   0
#2           1  1  0  0   0
#3           1  0  1  0   0
#4           1  0  0  1   0
#5           1  0  0  0   1
Run Code Online (Sandbox Code Playgroud)

## x is a character with NA

x <- c(letters[1:4], NA)
#[1] "a" "b" "c" "d" NA 

as.factor(x)  ## this calls `factor(x)` with default `exclude = NA`
#[1] a    b    c    d    <NA>     ## there is an NA value
#Levels: a b c d                  ## but NA is not a level

factor(x, exclude = NULL)      ## we want `exclude = NULL`
#[1] a    b    c    d    <NA>
#Levels: a b c d <NA>          ## now NA is a level
Run Code Online (Sandbox Code Playgroud)

Once you add NA as a level in a factor/character, your dataset might suddenly have more complete cases. Then you can run your model. If you still get a "contrasts error", use debug_contr_error2 to see what has happened.

For your convenience, I write a function for this NA preprocessing.

Input:

  • dat is your full dataset.

Output:

  • a data frame, with NA added as a level for factor/character.

NA_preproc <- function (dat) {
  for (j in 1:ncol(dat)) {
    x <- dat[[j]]
    if (is.factor(x) && anyNA(x)) dat[[j]] <- base::addNA(x)
    if (is.character(x)) dat[[j]] <- factor(x, exclude = NULL)
    }
  dat
  }
Run Code Online (Sandbox Code Playgroud)

Reproducible case studies and Discussions

The followings are specially selected for reproducible case studies, as I just answered them with the three helper functions created here.

There are also a few other good-quality threads solved by other StackOverflow users:

This answer aims to debug the "contrasts error" during model fitting. However, this error can also turn up when using predict for prediction. Such behavior is not with predict.lm or predict.glm, but with predict methods from some packages. Here are a few related threads on StackOverflow.

Also note that the philosophy of this answer is based on that of lm and glm. These two functions are a coding standard for many model fitting routines, but maybe not all model fitting routines behave similarly. For example, the following does not look transparent to me whether my helper functions would actually be helpful.

Although a bit off-topic, it is still useful to know that sometimes a "contrasts error" merely comes from writing a wrong piece of code. In the following examples, OP passed the name of their variables rather than their values to lm. Since a name is a single value character, it is later coerced to a single-level factor and causes the error.


How to resolve this error after debugging?

In practice people want to know how to resolve this matter, either at a statistical level or a programming level.

If you are fitting models on your complete dataset, then there is probably no statistical solution, unless you can impute missing values or collect more data. Thus you may simply turn to a coding solution to drop the offending variable. debug_contr_error2 returns nlevels which helps you easily locate them. If you don't want to drop them, replace them by a vector of 1 (as explained in How to do a GLM when "contrasts can be applied only to factors with 2 or more levels"?) and let lm or glm deal with the resulting rank-deficiency.

If you are fitting models on subset, there can be statistical solutions.

Fitting models by group does not necessarily require you splitting your dataset by group and fitting independent models. The following may give you a rough idea:

If you do split your data explicitly, you can easily get "contrasts error", thus have to adjust your model formula per group (that is, you need to dynamically generate model formulae). A simpler solution is to skip building a model for this group.

You may also randomly partition your dataset into a training subset and a testing subset so that you can do cross-validation. R: how to debug "factor has new levels" error for linear model and prediction briefly mentions this, and you'd better do a stratified sampling to ensure the success of both model estimation on the training part and prediction on the testing part.

  • 谢谢!这表明我的一个变量是"谎言"它的因素水平.(我是愚蠢的,只使用一个只包含其中一个因素的子集 - 因此错误).你是冠军! (2认同)

归档时间:

查看次数:

48480 次

最近记录:

6 年,4 月 前