Transforming a data frame to make a waterfall chart in ggplot2

nak*_*120 3 r waterfall ggplot2 dplyr

I would like to transform my data frame into a format suitable for a waterfall chart.

My data frame looks like this:

employee <- c('A','B','C','D','E','F', 
              'A','B','C','D','E','F',
              'A','B','C','D','E','F',
              'A','B','C','D','E','F')
revenue <- c(10, 20, 30, 40, 10, 40, 
              8, 10, 20, 50, 20, 10,
              2,  5, 70, 30, 10, 50,
             40,  8, 30, 40, 10, 40)
date <- as.Date(c('2017-03-01','2017-03-01','2017-03-01',
                  '2017-03-01','2017-03-01','2017-03-01',
                  '2017-03-02','2017-03-02','2017-03-02',
                  '2017-03-02','2017-03-02','2017-03-02',
                  '2017-03-03','2017-03-03','2017-03-03',
                  '2017-03-03','2017-03-03','2017-03-03',
                  '2017-03-04','2017-03-04','2017-03-04',
                  '2017-03-04','2017-03-04','2017-03-04'))
df <- data.frame(date,employee,revenue)

         date employee revenue
1  2017-03-01        A      10
2  2017-03-01        B      20
3  2017-03-01        C      30
4  2017-03-01        D      40
5  2017-03-01        E      10
6  2017-03-01        F      40
7  2017-03-02        A       8
8  2017-03-02        B      10
9  2017-03-02        C      20
10 2017-03-02        D      50
11 2017-03-02        E      20
12 2017-03-02        F      10
13 2017-03-03        A       2
14 2017-03-03        B       5
15 2017-03-03        C      70
16 2017-03-03        D      30
17 2017-03-03        E      10
18 2017-03-03        F      50
19 2017-03-04        A      40
20 2017-03-04        B       8
21 2017-03-04        C      30
22 2017-03-04        D      40
23 2017-03-04        E      10
24 2017-03-04        F      40

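How can I transform this data frame so that it is in the right shape for a waterfall chart in ggplot2?

The amount column is the change in the employee's total for the day relative to that employee's previous day.

The end column is the start column plus the amount column.

The start column is the end value of the Total row from the previous day.

The final data frame should look like this: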

         date employee     start    end    amount    total_for_day
1  2017-03-01        A         0     10        10               10
2  2017-03-01        B         0     20        20               20
3  2017-03-01        C         0     30        30               30
4  2017-03-01        D         0     40        40               40
5  2017-03-01        E         0     10        10               10
6  2017-03-01        F         0     40        40               40
7  2017-03-01    Total         0    150       150              150
8  2017-03-02        A       150    148        -2                8
9  2017-03-02        B       150    140       -10               10
10 2017-03-02        C       150    140       -10               20
11 2017-03-02        D       150    160        10               50 
12 2017-03-02        E       150    160        10               20
13 2017-03-02        F       150    120       -30               10  
14 2017-03-02    Total       150    118       -32              118
15 2017-03-03        A       118    112        -6                2                      
16 2017-03-03        B       118    113        -5                5                  
17 2017-03-03        C       118    168        50               70
18 2017-03-03        D       118     98       -20               30  
19 2017-03-03        E       118    108       -10               10  
20 2017-03-03        F       118    158        40               50
21 2017-03-03    Total       118    167        49              167
22 2017-03-04        A       167    205        38               40
23 2017-03-04        B       167    170         3                8
24 2017-03-04        C       167    127       -40               30
25 2017-03-04        D       167    177        10               40
26 2017-03-04        E       167    167         0               10
27 2017-03-04        F       167    157       -10               40 
28 2017-03-04    Total       167    168         1              168

Mar*_*son 6

There are a few steps to get you there, and I think the dplyr package will help (it is used heavily below).

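My understanding is that revenue gives the cumulative total revenue, not the change on each day. If that is wrong, you will need to reverse some of these calculations.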
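To illustrate what "reversing" would mean: if revenue actually held each day's change, a running total per employee would recover the levels that the rest of this answer works with. Below is a minimal sketch under that assumption. The data frame `daily` and the column name `level` are made up for illustration and are not part of the question's data.

```r
library(dplyr)

# Hypothetical input: revenue is the change on each day, not the running total
daily <- data.frame(
  date     = as.Date(c("2017-03-01", "2017-03-02", "2017-03-03",
                       "2017-03-01", "2017-03-02", "2017-03-03")),
  employee = c("A", "A", "A", "B", "B", "B"),
  revenue  = c(10, -2, -6, 20, -10, -5),
  stringsAsFactors = FALSE
)

levels_from_changes <- daily %>%
  arrange(employee, date) %>%
  group_by(employee) %>%
  mutate(
    change      = revenue,         # the input already is the daily change
    level       = cumsum(change),  # running total per employee
    previousDay = level - change   # where each day's bar starts
  ) %>%
  ungroup()
```

With these columns, `level` would take the place of revenue as `ymax` in the plots, and `previousDay` would stay as `ymin`.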

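The first step is to build a new data.frame of daily totals and bind it back onto the original data.frame. You can then group_by employee (including "Total") and add columns that are computed separately for each employee: the previous day's value, the change, and whether that change was an increase or a decrease.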

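library(dplyr)
library(ggplot2)  # needed for the plots further down

toPlot <-
  bind_rows(
    df
    # daily totals, added as an extra "employee" called Total
    , df %>%
      group_by(date) %>%
      summarise(revenue = sum(revenue)) %>%
      mutate(employee = "Total") 
  ) %>%
  group_by(employee) %>%
  mutate(
    # previous day's value for the same employee (0 before the first day)
    previousDay = lag(revenue, default = 0) 
    , change = revenue - previousDay
    , direction = ifelse(change > 0
                         , "Positive"
                         , "Negative"))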

This returns:

         date employee revenue previousDay change direction
       <date>    <chr>   <dbl>       <dbl>  <dbl>     <chr>
1  2017-03-01        A      10           0     10  Positive
2  2017-03-01        B      20           0     20  Positive
3  2017-03-01        C      30           0     30  Positive
4  2017-03-01        D      40           0     40  Positive
5  2017-03-01        E      10           0     10  Positive
6  2017-03-01        F      40           0     40  Positive
7  2017-03-02        A       8          10     -2  Negative
8  2017-03-02        B      10          20    -10  Negative
9  2017-03-02        C      20          30    -10  Negative
10 2017-03-02        D      50          40     10  Positive
# ... with 18 more rows

We can then plot toPlot with:

toPlot %>%
  ggplot(aes(xmin = date - 0.5
             , xmax = date + 0.5
             , ymin = previousDay
             , ymax = revenue
             , fill = direction)) +
  geom_rect(col = "black"
            , show.legend = FALSE) +
  facet_wrap(~employee
             , scales = "free_y") +
  scale_fill_brewer(palette = "Set1")

This gives one waterfall panel per employee, including "Total", each with its own y-axis scale.

Note that including "Total" throws off the scale (which is why free scales are needed), so I would rather leave it out:

toPlot %>%
  filter(employee != "Total") %>%
  ggplot(aes(xmin = date - 0.5
             , xmax = date + 0.5
             , ymin = previousDay
             , ymax = revenue
             , fill = direction)) +
  geom_rect(col = "black"
            , show.legend = FALSE) +
  facet_wrap(~employee) +
  scale_fill_brewer(palette = "Set1")

With a shared y-axis, this allows direct comparison between employees.

And this for the overall total:

toPlot %>%
  filter(employee == "Total") %>%
  ggplot(aes(xmin = date - 0.5
             , xmax = date + 0.5
             , ymin = previousDay
             , ymax = revenue
             , fill = direction)) +
  geom_rect(col = "black"
            , show.legend = FALSE) +
  scale_fill_brewer(palette = "Set1")


That said, I still find a line chart easier to interpret (especially for comparing employees):

toPlot %>%
  filter(employee != "Total") %>%
  ggplot(aes(x = date
             , y = revenue
             , col = employee)) +
  geom_line() +
  scale_color_brewer(palette = "Dark2")


If you want to plot the changes themselves by day, you can do something like this:

toPlot %>%
  filter(employee != "Total") %>%
  ggplot(aes(x = date
             , y = change
             , fill = employee)) +
  geom_col(position = "dodge") +
  scale_fill_brewer(palette = "Dark2")

This gives a dodged bar chart of each employee's daily change.

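But now you are quite far from a "waterfall" plot. If you really, really want a waterfall that is comparable across panels, you can have one, but it will be rather ugly (I strongly recommend the line chart above).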

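Here, you need to shift the boxes manually, which will take some tinkering if you change the output aspect ratio (or size) or the number of employees. You also need to encode both the employee and the direction of change with color, which starts to look rough. This falls into the "can, but probably shouldn't" category: there is probably a better way to display these data.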

toPlot %>%
  filter(employee != "Total") %>%
  ungroup() %>%
  mutate(empNumber = as.numeric(as.factor(employee))) %>%
  ggplot(aes(xmin = (empNumber) - 0.4
             , xmax = (empNumber) + 0.4
             , ymin = previousDay
             , ymax = revenue
             , col = direction
             , fill = employee)) +
  geom_rect(size = 1.5) +
  facet_grid(~date) +
  scale_fill_brewer(palette = "Dark2") +
  theme(axis.text.x = element_blank()
        , axis.ticks.x = element_blank())

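If you need the exact columns from the question (start, end, amount, total_for_day), the same ideas can be rearranged. The following is only a sketch, assuming that amount is each employee's change from their own previous day, start is the previous day's daily total (0 on the first day), and end is start + amount. The names `totals` and `waterfall` are my own.

```r
library(dplyr)

# Same values as the question's df, built compactly
df <- data.frame(
  date     = rep(as.Date("2017-03-01") + 0:3, each = 6),
  employee = rep(LETTERS[1:6], times = 4),
  revenue  = c(10, 20, 30, 40, 10, 40,
                8, 10, 20, 50, 20, 10,
                2,  5, 70, 30, 10, 50,
               40,  8, 30, 40, 10, 40),
  stringsAsFactors = FALSE
)

# Daily totals; every bar on a given day starts at the previous day's total
totals <- df %>%
  group_by(date) %>%
  summarise(revenue = sum(revenue)) %>%
  mutate(employee = "Total",
         start = lag(revenue, default = 0))

waterfall <- bind_rows(df, totals %>% select(-start)) %>%
  group_by(employee) %>%
  arrange(date, .by_group = TRUE) %>%
  mutate(amount = revenue - lag(revenue, default = 0)) %>%  # change vs. own previous day
  ungroup() %>%
  left_join(totals %>% select(date, start), by = "date") %>%
  mutate(end = start + amount) %>%
  arrange(date, employee) %>%
  select(date, employee, start, end, amount, total_for_day = revenue)
```

Under these assumptions, the Total rows get total_for_day equal to that day's sum of revenue (150, 118, 167, 168), and start and end can be passed directly to `ymin` and `ymax` in `geom_rect`.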