Sort a factor based on the values in one or more other columns

Sca*_*ard 30 r r-factor

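I've gone through a lot of posts about ordering factors, but haven't found one that matches my problem. Unfortunately, my knowledge of R is still pretty rudimentary.

I have a subset of an archaeological artifact catalog that I'm working with. I'm trying to cross-tabulate diagnostic historical artifact types against site testing locations. That part is easy enough with ddply or tapply.

My problem is that I want the artifact types (a factor) sorted by their mean diagnostic date (numeric, a year), and I keep getting them in alphabetical order. I know I need to make it an ordered factor, but I can't figure out how to order it by the year value in another column.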

IDENTIFY                                      MIDDATE
engine-turned fine red stoneware              1769
white salt-glazed stoneware, scratch blue     1760
wrought nail, 'L' head                        1760
yellow lead-glazed buff earthenware           1732
...

Needs to be ordered as:

IDENTIFY                                      MIDDATE
yellow lead-glazed buff earthenware           1732
white salt-glazed stoneware, scratch blue     1760
wrought nail, 'L' head                        1760
engine-turned fine red stoneware              1769
...

The factor (IDENTIFY) needs to be ordered by the date (MIDDATE). I thought I had it with

Catalog$IDENTIFY<-factor(Catalog$IDENTIFY,levels=Catalog$MIDDATE,ordered=TRUE)

but I get the warning:

In `levels<-`(`*tmp*`, value = if (nl == nL) as.character(labels) 
else paste0(labels,: duplicated levels will not be allowed 
in factors anymore

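IDENTIFY has ~130 factor levels, and many of them share the same MIDDATE value, so I need to order IDENTIFY by MIDDATE and by another column, TYPENAME.

More details:

I have a data frame, Catalog, which breaks down as follows (i.e. str(Catalog)):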

> str(Catalog)
'data.frame':   2211 obs. of  15 variables:
 $ TRENCH  : Factor w/ 7 levels "DRT 1","DRT 2",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ U_TYPE  : Factor w/ 3 levels "EU","INC","STP": 1 1 1 1 1 1 1 1 1 1 ...
 $ U_LBL   : Factor w/ 165 levels "001","005","007",..: 72 72 72 72 72 72 ...
 $ L_STRAT : Factor w/ 217 levels "#2-7/25","[3]",..: 4 4 4 4 4 4 89 89 89 89 ...
 $ START   : num  0 0 0 0 0 0 39.4 39.4 39.4 39.4 ...
 $ END     : num  39.4 39.4 39.4 39.4 39.4 39.4 43.2 43.2 43.2 43.2 ...
 $ Qty     : int  1 1 3 5 1 1 6 8 1 1 ...
 $ MATNAME : Factor w/ 6 levels "Ceramics","Chipped Stone",..: 1 1 1 5 5 6 ...
 $ TYPENAME: Factor w/ 9 levels "Architectural Hardware",..: 9 9 9 1 1 3 9 ...
 $ CATNAME : Factor w/ 32 levels "Biface","Bottle Glass",..: 24 29 29 6 24 ...
 $ IDENTIFY: Factor w/ 112 levels "amethyst bottle glass",..: 17 91 96 71 103 ...
 $ BEGDATE : int  1820 1820 1830 1835 1700 NA 1670 1762 1800 1720 ...
 $ ENDDATE : int  1900 1970 1860 1875 1820 NA 1795 1820 1820 1780 ...
 $ OCC_LBL : Ord.factor w/ 5 levels "Late 19th Century"<..: 2 1 2 2 4 5 4 3 ...
 $ MIDDATE : num  1860 1895 1845 1855 1760 ...

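I need to make IDENTIFY an ordered factor, with its levels reordered by MIDDATE -> TYPENAME -> alphabetically by IDENTIFY.

What I'm really not getting is how to reorder by the combined order of multiple columns.

I would do this in the database, but a lot of what I'm running is weighted averages in various cross-tabs (e.g. the weighted mean depth below surface of artifact classes by location)...

...which is doable in Access, but cumbersome and unpredictable. It's much easier and cleaner to manage in R, but I'd rather not have to reorder the resulting tables by hand.

What I'm trying to produce is something like the following: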

>xtab.Catalog<-tapply(Catalog$Qty,list(Catalog$IDENTIFY,Catalog$TRENCH),sum)

IDENTIFY                        DRT1    DRT2    DRT3    DRT4    DRT5    DRT6
Staffordshire stoneware         4       NA      NA      NA      NA      NA  
undecorated delftware           6       4       NA      NA      NA      NA  
unidentified wrought nail       15      9       3       1       3       NA  
white salt-glazed stoneware     6       1       1       NA      2       1   
white salt-glazed scratch blue  1       NA      NA      NA      NA      NA  
white stoneware, slip-dipped    NA      NA      NA      NA      NA      NA  
wrought nail, 'L' head          2       NA      NA      NA      NA      NA  
wrought nail, 'rose' head       62      21      4       NA      1       1   
wrought nail, 'T' head          2       NA      1       NA      NA      1   
yellow lead-glazed              12      NA      NA      NA      1       3   
...

...but I need the rows in a logical (i.e. chronological/type) order rather than alphabetical order.

Mat*_*erg 41

Here's a reproducible sample, with a solution:

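set.seed(0)
a = sample(1:20, replace = FALSE)  # first sort key
b = sample(1:20, replace = FALSE)  # second sort key (tie-breaker)
f = as.factor(letters[1:20])       # factor whose levels will be reordered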

> a
 [1] 18  6  7 10 15  4 13 14  8 20  1  2  9  5  3 16 12 19 11 17
> b
 [1] 16 18  4 12  3  5  6  1 15 10 19 17  9 11  2  8 20  7 13 14
> f
 [1] a b c d e f g h i j k l m n o p q r s t
Levels: a b c d e f g h i j k l m n o p q r s t

Now for the new factor:

fn = factor(f, levels=unique(f[order(a,b,f)]), ordered=TRUE)

> fn
 [1] a b c d e f g h i j k l m n o p q r s t
20 Levels: k < l < o < f < n < b < c < i < m < d < s < q < g < h < e < ... < j

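The levels are sorted on 'a' first, then 'b', and finally 'f' itself (although in this example 'a' has no repeated values, so the tie-breakers never come into play).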
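Applied to the question's columns, the same idea might look like the sketch below. The rows are invented toy data and the TYPENAME values are assumptions; only the column names and the four IDENTIFY/MIDDATE pairs come from the question. Note that `levels` has to hold the factor's own unique values in the desired order, which is why passing `Catalog$MIDDATE` as the levels produces the duplicated-levels warning.

```r
# Toy stand-in for the real Catalog (Qty, TRENCH and TYPENAME values are made up)
Catalog <- data.frame(
  IDENTIFY = c("engine-turned fine red stoneware",
               "white salt-glazed stoneware, scratch blue",
               "wrought nail, 'L' head",
               "yellow lead-glazed buff earthenware",
               "wrought nail, 'L' head"),
  TYPENAME = c("Ceramics", "Ceramics", "Hardware", "Ceramics", "Hardware"),
  MIDDATE  = c(1769, 1760, 1760, 1732, 1760),
  TRENCH   = c("DRT 1", "DRT 1", "DRT 2", "DRT 1", "DRT 1"),
  Qty      = c(1, 2, 3, 4, 5),
  stringsAsFactors = TRUE
)

# Row order by date, then type, then name; ties fall through to the next key
ord <- with(Catalog, order(MIDDATE, TYPENAME, IDENTIFY))

# Levels are the unique IDENTIFY values taken in that row order
Catalog$IDENTIFY <- factor(Catalog$IDENTIFY,
                           levels  = unique(Catalog$IDENTIFY[ord]),
                           ordered = TRUE)

# The cross-tab rows now follow the level order instead of the alphabet
xtab <- tapply(Catalog$Qty, list(Catalog$IDENTIFY, Catalog$TRENCH), sum)
rownames(xtab)
```

The data frame itself is left in its original row order; only the factor's levels change, which is all that tapply uses to arrange the rows of the table.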

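  • If someone has answered your question, I'd suggest marking it as accepted with the check mark (it turns green). It's polite, and a big help to future searchers. (2 upvotes)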

zac*_*ach 23

I'd suggest the following dplyr-based approach (h/t daattali), which extends to any number of columns:

library(dplyr)
Catalog <- Catalog %>%
  arrange(MIDDATE, TYPENAME) %>%               # sort your dataframe
  mutate(IDENTIFY = factor(IDENTIFY, unique(IDENTIFY))) # reset your factor-column based on that order

  • This is really useful; it should be a "dplyr" function. (2 upvotes)
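As a rough sketch of what such a function could look like, the arrange-then-unique pattern can be wrapped in a small helper. `reorder_levels_by` is a hypothetical name, not something dplyr provides, and the toy rows and TYPENAME values are invented for illustration.

```r
library(dplyr)

# Hypothetical helper (not part of dplyr): sort the rows by the columns in
# `...`, then reset the levels of `col` to their order of first appearance.
reorder_levels_by <- function(data, col, ...) {
  data %>%
    arrange(...) %>%
    mutate("{{ col }}" := factor({{ col }},
                                 levels = unique(as.character({{ col }}))))
}

# Toy data; only the column names come from the question
toy <- data.frame(
  IDENTIFY = c("engine-turned fine red stoneware",
               "white salt-glazed stoneware, scratch blue",
               "wrought nail, 'L' head",
               "yellow lead-glazed buff earthenware"),
  TYPENAME = c("Ceramics", "Ceramics", "Hardware", "Ceramics"),
  MIDDATE  = c(1769, 1760, 1760, 1732)
)

# Date first, then type, then the name itself as the final tie-breaker
toy <- reorder_levels_by(toy, IDENTIFY, MIDDATE, TYPENAME, IDENTIFY)
levels(toy$IDENTIFY)
```

Unlike a levels-only change, this also leaves the rows of the data frame sorted, because `arrange()` runs before the factor is rebuilt.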

YCR*_*YCR 8

The function fct_reorder2 does exactly that.

Note the subtlety that fct_reorder sorts in ascending order, while fct_reorder2 sorts in descending order.
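A small sketch of that difference in direction, using made-up vectors `f`, `x` and `y`. Both functions take a `.desc` argument to flip the default. As far as I understand the defaults, `fct_reorder2` summarises each level with `last2()` (the `y` value at the largest `x`), so it is aimed at things like legend ordering in line plots rather than a lexicographic sort on two columns; with one observation per level, as here, it simply sorts by `y`.

```r
library(forcats)

# Made-up example: one observation per level
f <- factor(c("a", "b", "c"))
x <- c(3, 1, 2)
y <- c(20, 10, 30)

levels(fct_reorder(f, x))                     # ascending by x:  "b" "c" "a"
levels(fct_reorder(f, x, .desc = TRUE))       # descending by x: "a" "c" "b"

levels(fct_reorder2(f, x, y))                 # descending by y: "c" "a" "b"
levels(fct_reorder2(f, x, y, .desc = FALSE))  # ascending by y:  "b" "a" "c"
```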

Code from the documentation:

library(forcats)

df0 <- tibble::tribble(
  ~color,     ~a, ~b,
  "blue",      1,  2,
  "green",     6,  2,
  "purple",    3,  3,
  "red",       2,  3,
  "yellow",    5,  1
)
df0$color <- factor(df0$color)

# Levels sorted by min(a) within each level, ascending
fct_reorder(df0$color, df0$a, min)
 #> [1] blue   green  purple red    yellow
 #> Levels: blue red purple yellow green

# Levels sorted using both a and b, descending by default
fct_reorder2(df0$color, df0$a, df0$b)