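I'm slowly moving from Excel to R, but I keep running into problems with tasks that would take two seconds in Excel. For example, consider the following sample of GDP data for France and the UK:
Suppose I want to calculate the percentage change relative to 1929 (i.e. the Great Depression). In Excel, I would do something like this in a new column for France: =(B2/$B$11)*100 and then fill the formula down to the cells below. Then I would repeat that for the UK.
How would you do this in R? (Note that this is just an example; I'm interested in the thought process behind it.) Obviously, the data would be structured differently, with three variables: year, country, and GDP.
I was thinking of using mutate() and then case_when() to pick out the right country, but that's where I get stuck. Here is my code. The data is from Maddison: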
library(tidyverse)
library(ggplot2)
library(haven)
library(readxl)
# Loading df
df <- read_excel("/PATH TO DATA/mpd2018.xlsx", sheet = 2)
# Tidy dataset
df <- df %>%
  transmute(
    cntry = as_factor(countrycode), # Rename and define as factor
    year = zap_labels(year),        # Zap labels
    gdp = zap_labels(rgdpnapc)      # Rename and zap labels
  ) %>%
  dplyr::filter(
    cntry %in% c("FRA", "GBR"),     # Keep only FRA and GBR
    year >= 1920 & year <= 1950     # Only the interval between 1920 and 1950
  )
# Calculations
df <- df %>% mutate(
  gdp_rel = case_when(
    cntry == "FRA" ~ (df$gdp/df[10,3])*100,
    cntry == "GBR" ~ (df$gdp/df[41,3])*100
  ))
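First of all, the code produces an error. But more importantly, I believe there must be a better way to do this than indexing with df[x, y]. What if the data frame were larger?
There are several ways to get the result you want. Here are two different options.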
library(tidyverse)
# Seed for reproducibility
set.seed(1234)
# Example data
data <- data.frame(Year = 1920:1939,
                   France = 1920:1939 * 3 + rnorm(1939 - 1920 + 1, 5, 10),
                   Germany = 1920:1939 * 3.5 + rnorm(1939 - 1920 + 1, 2, 18))
row_id <- which(data$Year == 1929)
# dplyr. Note that "across" performs the calculation across all columns
# selected in the first argument
data %>%
  mutate(across(-Year, # All columns except for Year
                # Row 10 (row_id) has Year == 1929
                ~ . / .[[row_id]] * 100,
                # Add the column name to the new transformed result.
                .names = '{.col}_return'))
# Manual way
res <- list()
for(i in names(data)[-1]){
  # Manual mutate: divide each column by its 1929 value (row_id)
  res[[paste0(i, '_return')]] <- data[[i]] / data[row_id, i] * 100
}
# Combine result
cbind(data, res)
Both produce the following result (on the simulated data):
Year France Germany France_return Germany_return
1 1920 5752.929 6724.414 99.47830 99.81832
2 1921 5770.774 6716.668 99.78687 99.70334
3 1922 5781.844 6721.070 99.97830 99.76869
4 1923 5750.543 6740.773 99.43704 100.06115
5 1924 5781.291 6723.513 99.96873 99.80495
6 1925 5785.061 6713.432 100.03391 99.65531
7 1926 5777.253 6753.346 99.89889 100.24779
8 1927 5780.534 6728.074 99.95563 99.87266
9 1928 5783.355 6749.728 100.00442 100.19408
10 1929 5783.100 6736.653 100.00000 100.00000
11 1930 5790.228 6776.841 100.12326 100.59656
12 1931 5788.016 6751.939 100.08502 100.22691
13 1932 5793.237 6751.230 100.17530 100.21639
14 1933 5804.645 6758.477 100.37255 100.32397
15 1934 5816.595 6741.676 100.57919 100.07457
16 1935 5808.897 6753.483 100.44608 100.24983
17 1936 5807.890 6738.759 100.42867 100.03127
18 1937 5806.888 6757.362 100.41134 100.30741
19 1938 5810.628 6779.703 100.47602 100.63904
20 1939 5846.158 6780.114 101.09040 100.64514
Following SnupSnurre's comment, here is an example that assumes the data is stored in "long" (vertical) format.
# Use pivot_longer to make wide data long
data_long <- pivot_longer(data,
                          -Year,
                          names_to = 'Country')
# Calculate on long format:
(return_1929 <- data_long %>%
   # Group by country, calculations will be done for each country
   group_by(Country) %>%
   # Perform the actual calculations
   mutate(value_return = value / value[Year == 1929] * 100) %>%
   # Remove the country grouping
   ungroup()
)
# Return to wide format
return_1929 %>%
  pivot_wider(id_cols = Year,
              # Column to "expand" to a wide format.
              names_from = Country,
              # Columns to get values from
              values_from = c(value, value_return)
  )
Run Code Online (Sandbox Code Playgroud)
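Since the data in the question is already in long format (cntry, year, gdp), the same group-and-divide idea can be applied to it directly, with no case_when() and no hard-coded row numbers. The following is only a minimal sketch: df_example is a hypothetical stand-in with made-up values, not the Maddison data, and it assumes each country has exactly one row for 1929.

```r
library(dplyr)

# Hypothetical stand-in for the filtered data in the question
# (made-up values; column names follow the question's code)
df_example <- data.frame(
  cntry = rep(c("FRA", "GBR"), each = 3),
  year  = rep(c(1928, 1929, 1930), times = 2),
  gdp   = c(4000, 4400, 4200, 7000, 7200, 7100)
)

df_rel <- df_example %>%
  # Every following calculation is done separately for each country
  group_by(cntry) %>%
  # Divide by that country's own 1929 value
  # (assumes exactly one 1929 row per country)
  mutate(gdp_rel = gdp / gdp[year == 1929] * 100) %>%
  ungroup()

df_rel
```

Because the base value is looked up with year == 1929 inside each group, this should keep working if more countries or years are added, which is the situation the question was worried about.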