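I have a data frame of crime data and associated "prices", organized by country and year (though I don't think that matters here). Here is a subset of my data: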
> crime
# A tibble: 8 x 8
iso year theft robbery burglary theft_price robbery_price burglary_price
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 ALB 2003 3694 199 874 32.9 115 49.3
2 ALB 2004 3694 199 874 38.2 134 57.3
3 ALB 2005 3694 199 874 42.8 150 64.2
4 ALB 2006 3450 164 779 47.0 165 70.5
5 AUS 2003 722334 14634 586266 408.4 1427 612.4
6 AUS 2004 636717 14634 512551 481.3 1683 721.2
7 AUS 2005 598700 14634 468558 536.7 1877 804.5
8 AUS 2006 594111 14634 433974 564.8 1973 846.5
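I want to create new columns containing the product of each crime type and its price, so theft x theft_price = theft_prod, and so on. My actual dataset has many more crime types, so I need something that scales to more variables than this subset contains.
I like the syntax of the dplyr package, so I have been trying something like the following, but I can't find a solution. I don't think it is possible to reference variables other than those listed in vars(). Is that correct?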
crime %>%
mutate_at(vars(theft, robbery, burglary),
funs(prod = . * ????))
Thanks.
Using dplyr and tidyr:
library(dplyr); library(tidyr);
df %>%
gather(crime, value, -iso, -year) %>%
separate(crime, c('crime', 'type'), sep='_', fill = 'right') %>%
replace_na(list(type = 'amount')) %>%
spread(type, value) %>%
transmute(
iso = iso, year = year,
crime = paste(crime, 'prod', sep = '_'),
prod = amount * price
) %>%
spread(crime, prod)
# iso year burglary_prod robbery_prod theft_prod
#1 ALB 2003 43088.2 22885 121532.6
#2 ALB 2004 50080.2 26666 141110.8
#3 ALB 2005 56110.8 29850 158103.2
#4 ALB 2006 54919.5 27060 162150.0
#5 AUS 2003 359029298.4 20882718 295001205.6
#6 AUS 2004 369651781.2 24629022 306451892.1
#7 AUS 2005 376954911.0 27468018 321322290.0
#8 AUS 2006 367358991.0 28872882 335553892.8
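For readers on tidyr 1.0.0 or later, where gather() and spread() are superseded by pivot_longer() and pivot_wider(), the same reshape can be sketched as follows. This is not the code from the answer above, only a variant of the same idea. It assumes the count columns can be given a temporary `_count` suffix (a name chosen here for illustration) and that the crime names themselves contain no underscore. Only the first two rows of the question's data are used.

```r
library(dplyr)
library(tidyr)

# First two rows of the question's data, for illustration only
df <- tibble::tribble(
  ~iso, ~year, ~theft, ~robbery, ~burglary, ~theft_price, ~robbery_price, ~burglary_price,
  "ALB", 2003, 3694, 199, 874, 32.9, 115, 49.3,
  "ALB", 2004, 3694, 199, 874, 38.2, 134, 57.3
)

res <- df %>%
  # Assumption: tag count columns with "_count" so every name reads <crime>_<type>
  rename_with(~ paste0(.x, "_count"), !c(iso, year) & !ends_with("_price")) %>%
  # ".value" turns the second name piece (count / price) into separate columns
  pivot_longer(-c(iso, year), names_to = c("crime", ".value"), names_sep = "_") %>%
  mutate(prod = count * price) %>%
  # Back to one <crime>_prod column per crime type
  pivot_wider(id_cols = c(iso, year), names_from = crime,
              values_from = prod, names_glue = "{crime}_prod")

res
```

If a crime name does contain an underscore, `names_sep = "_"` would split it in the wrong place, so a `names_pattern` would likely be needed instead.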
Another option that does not reshape the data, assuming the price columns follow the crime_price naming convention:
library(tidyverse)
# find out the crimes columns
crimes = grep('^(?!.*_price$)', names(df)[-c(1,2)], perl = TRUE, value = TRUE)
# construct the crimes prices columns
crimes_prices = paste(crimes, 'price', sep = '_')
crimes_prod = paste(crimes, 'prod', sep = '_')
# loop through crime and crime price columns and multiply them
map2(crimes, crimes_prices, ~ df[[.x]] * df[[.y]]) %>%
set_names(crimes_prod) %>%
as_tibble() %>%
bind_cols(select(df, iso, year))
# A tibble: 8 x 5
# theft_prod robbery_prod burglary_prod iso year
# <dbl> <int> <dbl> <fct> <int>
#1 121533. 22885 43088. ALB 2003
#2 141111. 26666 50080. ALB 2004
#3 158103. 29850 56111. ALB 2005
#4 162150 27060 54920. ALB 2006
#5 295001206. 20882718 359029298. AUS 2003
#6 306451892. 24629022 369651781. AUS 2004
#7 321322290 27468018 376954911 AUS 2005
#8 335553893. 28872882 367358991 AUS 2006
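On dplyr 1.0.0 or later, the same no-reshape idea can probably be written closer to the mutate_at() attempt in the question, using across() together with cur_column(). The sketch below is an addition rather than part of the answer above. It assumes every non-id column without a `_price` suffix has a matching `<name>_price` column, and it uses only the first two rows of the question's data.

```r
library(dplyr)

# First two rows of the question's data, for illustration only
df <- tibble::tribble(
  ~iso, ~year, ~theft, ~robbery, ~burglary, ~theft_price, ~robbery_price, ~burglary_price,
  "ALB", 2003, 3694, 199, 874, 32.9, 115, 49.3,
  "ALB", 2004, 3694, 199, 874, 38.2, 134, 57.3
)

res <- df %>%
  mutate(across(
    # every column except the ids and the *_price columns
    !c(iso, year) & !ends_with("_price"),
    # cur_column() is the name of the column being processed;
    # get() looks up the matching <name>_price column (assumed to exist)
    ~ .x * get(paste0(cur_column(), "_price")),
    .names = "{.col}_prod"
  ))

res
```

Unlike the map2() version, this keeps the original columns and appends the `_prod` columns to the same data frame.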
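The best tidyverse way to do this kind of operation is to reshape the data so that it is tidy. A purrr approach is also possible, but it is likely to depend on the order of the columns, which may not always be reliable. Instead, you can do the following:
1. gather up all of your measure columns.
2. mutate a new column measure_type that indicates whether the value is a count or a price, and remove the _price suffix from crime_type. Now we have separate columns for the crime type and for the metric we are using for that crime. Each row is a single iso-year-crime-metric combination.
3. spread the measure types back out, so that we have separate count and price columns for every crime, and then multiply them with mutate.
4. To return to the wide format, gather up count, price and our new product column, unite them with the crime type, and spread back out.
library(tidyverse)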
tbl <- read_table2(
"iso year theft robbery burglary theft_price robbery_price burglary_price
ALB 2003 3694 199 874 32.9 115 49.3
ALB 2004 3694 199 874 38.2 134 57.3
ALB 2005 3694 199 874 42.8 150 64.2
ALB 2006 3450 164 779 47.0 165 70.5
AUS 2003 722334 14634 586266 408.4 1427 612.4
AUS 2004 636717 14634 512551 481.3 1683 721.2
AUS 2005 598700 14634 468558 536.7 1877 804.5
AUS 2006 594111 14634 433974 564.8 1973 846.5"
)
tidy_tbl <- tbl %>%
gather(crime_type, measure, -iso, - year) %>%
mutate(
measure_type = if_else(str_detect(crime_type, "_price$"), "price", "count"),
crime_type = str_remove(crime_type, "_price")
) %>%
spread(measure_type, measure) %>%
mutate(product = count * price)
tidy_tbl
#> # A tibble: 24 x 6
#> iso year crime_type count price product
#> <chr> <int> <chr> <dbl> <dbl> <dbl>
#> 1 ALB 2003 burglary 874 49.3 43088.
#> 2 ALB 2003 robbery 199 115 22885
#> 3 ALB 2003 theft 3694 32.9 121533.
#> 4 ALB 2004 burglary 874 57.3 50080.
#> 5 ALB 2004 robbery 199 134 26666
#> 6 ALB 2004 theft 3694 38.2 141111.
#> 7 ALB 2005 burglary 874 64.2 56111.
#> 8 ALB 2005 robbery 199 150 29850
#> 9 ALB 2005 theft 3694 42.8 158103.
#> 10 ALB 2006 burglary 779 70.5 54920.
#> # ... with 14 more rows
tidy_tbl %>%
gather(measure_type, measure, count:product) %>%
unite("colname", crime_type, measure_type) %>%
spread(colname, measure)
#> # A tibble: 8 x 11
#> iso year burglary_count burglary_price burglary_product robbery_count
#> <chr> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 ALB 2003 874 49.3 43088. 199
#> 2 ALB 2004 874 57.3 50080. 199
#> 3 ALB 2005 874 64.2 56111. 199
#> 4 ALB 2006 779 70.5 54920. 164
#> 5 AUS 2003 586266 612. 359029298. 14634
#> 6 AUS 2004 512551 721. 369651781. 14634
#> 7 AUS 2005 468558 804. 376954911 14634
#> 8 AUS 2006 433974 846. 367358991 14634
#> # ... with 5 more variables: robbery_price <dbl>, robbery_product <dbl>,
#> # theft_count <dbl>, theft_price <dbl>, theft_product <dbl>
Created on 2018-08-15 by the reprex package (v0.2.0).
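As a side note for newer tidyr versions (1.0.0 or later), the final gather/unite/spread step can likely be collapsed into a single pivot_wider() call. The sketch below is an addition, not part of the reprex above. It rebuilds a three-row stand-in for tidy_tbl by hand from the first rows of the printed output, and the glue template for the column names is an assumption chosen to reproduce the crime_measure naming.

```r
library(tidyr)

# Hand-built stand-in for tidy_tbl (ALB, 2003 only), for illustration
tidy_tbl <- tibble::tribble(
  ~iso, ~year, ~crime_type, ~count, ~price,
  "ALB", 2003, "burglary", 874, 49.3,
  "ALB", 2003, "robbery", 199, 115,
  "ALB", 2003, "theft", 3694, 32.9
)
tidy_tbl$product <- tidy_tbl$count * tidy_tbl$price

wide <- pivot_wider(
  tidy_tbl,
  names_from = crime_type,
  values_from = c(count, price, product),
  # build names such as burglary_count, theft_product
  names_glue = "{crime_type}_{.value}"
)

wide
```

The column names match the spread() result, although the column order may differ.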