An alluvial diagram is a variant of a Parallel Coordinates Plot (PCP) for categorical variables. Variables are assigned to parallel vertical axes. Values are represented with blocks on each axis. Observations are represented with alluvia (sing. “alluvium”) spanning all the axes.
You create alluvial diagrams with the function alluvial(). Here is one example using the Titanic dataset. Let’s convert it to a data frame
library(alluvial)
library(dplyr)  # provides %>%, group_by(), summarise() used below
tit <- as.data.frame(Titanic, stringsAsFactors = FALSE)
head(tit)
## Class Sex Age Survived Freq
## 1 1st Male Child No 0
## 2 2nd Male Child No 0
## 3 3rd Male Child No 35
## 4 Crew Male Child No 0
## 5 1st Female Child No 0
## 6 2nd Female Child No 0
and create the alluvial diagram.
alluvial(tit[,1:4], freq=tit$Freq,
         col = ifelse(tit$Survived == "Yes", "orange", "grey"),
         border = ifelse(tit$Survived == "Yes", "orange", "grey"),
         hide = tit$Freq == 0,
         cex = 0.7
)
We have four variables:
- Class on the ship the passenger occupied
- Sex of the passenger
- Age of the passenger
- Survived: whether the passenger survived

Vertical sizes of the blocks are proportional to the frequency, and so are the widths of the alluvia. Alluvia represent all combinations of values of the variables in the dataset. By default the vertical order of the alluvia is determined by sorting them lexicographically, using the alphabetical order of the values on each variable (the last variable changes first), and drawing them from bottom to top. In this example, the color is determined by passengers’ survival status, i.e. passengers who survived are represented with orange alluvia.
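To make the link between frequencies and block heights concrete, the totals that the block sizes are proportional to can be computed directly. This is a minimal sketch in base R; the helper name block_sizes is made up for illustration and is not part of the alluvial package.

```r
# Rebuild the data frame so the snippet runs on its own
tit <- as.data.frame(Titanic, stringsAsFactors = FALSE)

# Hypothetical helper: total frequency per category of one variable,
# i.e. the quantity each block's height is proportional to
block_sizes <- function(data, variable, freq = "Freq") {
  tapply(data[[freq]], data[[variable]], sum)
}

# One named vector of totals per axis
axes <- c("Class", "Sex", "Age", "Survived")
sizes <- lapply(axes, function(v) block_sizes(tit, v))
names(sizes) <- axes

sizes$Survived
```

Every element of sizes sums to the same grand total, because each axis accounts for all observations in the data.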
Alluvial diagrams are very useful for reading various conditional and unconditional distributions in a multivariate dataset.
Minimal use requires supplying a data frame as the first argument and a vector of frequencies as the freq argument. By default all alluvia are drawn using gray, mildly transparent colors.
Two variables, Class and Survived:
# Survival status and Class
tit %>% group_by(Class, Survived) %>%
  summarise(n = sum(Freq)) -> tit2d
alluvial(tit2d[,1:2], freq=tit2d$n)
Three variables, Sex, Class, and Survived:
# Survival status, Sex, and Class
tit %>% group_by(Sex, Class, Survived) %>%
  summarise(n = sum(Freq)) -> tit3d
alluvial(tit3d[,1:3], freq=tit3d$n)
There are several ways to customize alluvial diagrams with alluvial(). The following sections illustrate what are probably the most common use cases.
Colors of the alluvia can be customized with the col, border and alpha arguments. For example:
alluvial(
  tit3d[,1:3],
  freq = tit3d$n,
  col = ifelse(tit3d$Sex == "Female", "pink", "lightskyblue"),
  border = "grey",
  alpha = 0.7,
  blocks = FALSE
)
Sometimes it is desirable to omit some of the alluvia from the plot. This is most frequently the case with larger datasets in which many combinations of values of the variables have very small frequencies, or even 0s. Alluvia can be hidden with the argument hide, which expects a logical vector of length equal to the number of rows in the data. Alluvia for which hide is TRUE are not plotted. For example, to hide alluvia with frequency less than 150:
alluvial(tit2d[,1:2], freq=tit2d$n, hide=tit2d$n < 150)
This skips drawing the alluvia corresponding to the following rows in the tit2d data frame:
tit2d %>% select(Class, Survived, n) %>%
filter(n < 150)
## Source: local data frame [2 x 3]
## Groups: Class [2]
##
## Class Survived n
## <chr> <chr> <dbl>
## 1 1st No 122
## 2 2nd Yes 118
You can see the gaps, e.g. on the “Yes” and “No” category blocks on the Survived axis.
If you would rather omit these rows from the plot altogether (i.e. no gaps), you need to filter your data before it is used by alluvial().
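A minimal sketch of that filtering step, assuming the same threshold of 150 as above. The aggregation is redone in base R so that the snippet runs on its own; the object names tit2d_base and kept are illustrative, and the plotting call is skipped when the alluvial package is not installed.

```r
tit <- as.data.frame(Titanic, stringsAsFactors = FALSE)

# Total frequency for each Class x Survived combination
tit2d_base <- aggregate(Freq ~ Class + Survived, data = tit, FUN = sum)
names(tit2d_base)[3] <- "n"

# Drop the small rows up front instead of flagging them via `hide`
kept <- tit2d_base[tit2d_base$n >= 150, ]

# Plot only the remaining rows, so no gaps are left on the axes
if (requireNamespace("alluvial", quietly = TRUE)) {
  alluvial::alluvial(kept[, 1:2], freq = kept$n)
}
```

Note that the frequency vector has to be taken from the filtered data as well, so that it stays aligned with the rows being plotted.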
By default alluvia are plotted in the same order in which the rows are ordered in the dataset.
Consider simple data:
d <- data.frame(
  x = c(1, 2, 3),
  y = c(3, 2, 1),
  freq = c(1, 1, 1)
)
d
## x y freq
## 1 1 3 1
## 2 2 2 1
## 3 3 1 1
As there are three rows, we will have three alluvia:
alluvial(d[,1:2], freq=d$freq, col=1:3, alpha=1)
# Reversing the order
alluvial(d[ 3:1, 1:2 ], freq=d$freq, col=3:1, alpha=1)
Note that to keep the colors matched to the alluvia in the same way we had to reverse the col argument too. Instead of reordering the data and keeping track of the other arguments, the plotting order can be adjusted with the layer argument:
alluvial(d[,1:2], freq=d$freq, col=1:3, alpha=1,
layer=3:1)
The value of layer is passed to order, so it is possible to use logical vectors, e.g. if you only want to put some of the flows on top. For example, to put the alluvia for all survivors on top in the Titanic data we can use:
alluvial(tit3d[,1:3], freq=tit3d$n,
         col = ifelse(tit3d$Survived == "Yes", "orange", "grey"),
         alpha = 0.8,
         layer = tit3d$Survived == "No"
)
The first layer is the one on top, the second layer is below the first, and so on. Consequently, in the example above, Survived == "No" is ordered after Survived == "Yes", so the former is below the latter.
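The effect of a logical layer can be checked without plotting. The sketch below assumes, as stated above, that layer is simply passed to order(); the vector survived is a made-up stand-in for the Survived column.

```r
# Made-up stand-in for a Survived column with four rows
survived <- c("Yes", "No", "Yes", "No")

# Logical layer: FALSE for survivors, TRUE for everyone else
layer <- survived == "No"

# order() sorts FALSE before TRUE, so the survivors' rows come first
plot_order <- order(layer)
plot_order

# Rows listed from the top layer to the bottom layer
survived[plot_order]
```

Because FALSE sorts before TRUE, the rows for which the condition is FALSE end up in the first layers, that is, on top.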
This feature is experimental!
Usually the order of the variables (axes) is rather unimportant. However, placing two particular variables next to each other makes it easier to analyze the dependency between them. In alluvial diagrams the ordering of the variables determines the vertical plotting order of the alluvia. This vertical order, together with setting blocks to FALSE, can be used to turn category blocks into stacked barcharts.
Consider two versions of subsets of the Titanic data that differ only in the order of variables.
tit %>% group_by(Sex, Age, Survived) %>%
  summarise(n = sum(Freq)) -> x
tit %>% group_by(Survived, Age, Sex) %>%
  summarise(n = sum(Freq)) -> y
In x we have Sex-Age-Survived-n, while in y we have Survived-Age-Sex-n.
If we color the alluvia according to the first axis, the category blocks of Age and Survived become barcharts showing relative frequencies of Men and Women within categories of Age and Survived.
alluvial(x[,1:3], freq=x$n,
         col = ifelse(x$Sex == "Male", "orange", "grey"),
         alpha = 0.8,
         blocks = FALSE
)
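The relative frequencies that such stacked bars display can also be computed numerically. The following is a base R sketch for the Age axis only; the object names age_by_sex and sex_share are illustrative.

```r
tit <- as.data.frame(Titanic, stringsAsFactors = FALSE)

# Cross-tabulate total frequencies: rows are Age, columns are Sex
age_by_sex <- xtabs(Freq ~ Age + Sex, data = tit)

# Row-wise proportions: share of each sex within each Age category
sex_share <- prop.table(age_by_sex, margin = 1)
round(sex_share, 2)
```

Each row of sex_share sums to 1 and corresponds to one category block on the Age axis.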
Now we can read off, for example, the proportions of men and women among children (Age == "Child").

The argument ordering can be used to fully customize the ordering of each alluvium on each axis without the need to reorder the axes themselves. This feature is experimental, as you can easily break things. It expects a list with one element for each variable in the data, each element being either a numeric vector or NULL. A numeric vector, such as the result of order(), sets the order of the alluvia on the corresponding axis, while NULL does not change the default order on that axis. For example:
alluvial(y[,1:3], freq=y$n,
         # col = RColorBrewer::brewer.pal(8, "Set1"),
         col = ifelse(y$Sex == "Male", "orange", "grey"),
         alpha = 0.8,
         blocks = FALSE,
         ordering = list(
           order(y$Survived, y$Sex == "Male"),
           order(y$Age, y$Sex == "Male"),
           NULL
         )
)
The list passed to ordering has three elements corresponding to Survived, Age, and Sex respectively (that’s the order of the variables in y). The elements of this list are:
- A call to order sorting the alluvia on the Survived axis. The alluvia need to be sorted according to Survived first (otherwise the categories “Yes” and “No” will be destroyed) and according to Sex second.
- A call to order sorting the alluvia on the Age axis. The alluvia need to be sorted according to Age first and Sex second.
- NULL, which leaves the default ordering on the Sex axis.

In the example below alluvia are colored by sex (red=Female, blue=Male) and survival status (bright=survived, dark=did not survive). Each category block is a stacked barchart showing relative frequencies of men/women who did/did not survive. The alluvia are reordered on the last axis (Age) so that Sex categories are next to each other (red together and blue together):
pal <- c("red4", "lightskyblue4", "red", "lightskyblue")
tit %>%
  mutate(
    ss = paste(Survived, Sex),
    k = pal[ match(ss, sort(unique(ss))) ]
  ) -> tit
alluvial(tit[,c(4,2,3)], freq=tit$Freq,
         hide = tit$Freq < 10,
         col = tit$k,
         border = tit$k,
         blocks = FALSE,
         ordering = list(
           NULL,
           NULL,
           order(tit$Age, tit$Sex)
         )
)
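To see what kind of vector goes into ordering, it can help to run order() with two keys on a tiny example. This is only a sketch on made-up data (the data frame toy and the vector ord are illustrative) and it shows the behavior of order() itself, not of alluvial().

```r
# Made-up data: one row per alluvium
toy <- data.frame(
  Survived = c("Yes", "No", "Yes", "No"),
  Sex      = c("Male", "Male", "Female", "Female"),
  stringsAsFactors = FALSE
)

# Sort by Survived first; within each Survived category put
# Sex == "Male" last (FALSE sorts before TRUE)
ord <- order(toy$Survived, toy$Sex == "Male")
ord

# Rows in sorted order: the "No" rows stay together, then the "Yes" rows
toy[ord, ]
```

Sorting on the axis variable first keeps rows of the same category contiguous, which is why each order() call in the examples above starts with the variable of the axis being reordered.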
sessionInfo()
## R version 3.3.1 (2016-06-21)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.1 LTS
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=pl_PL.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=pl_PL.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=pl_PL.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=pl_PL.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] dplyr_0.5.0 alluvial_0.1-2
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.7 digest_0.6.10 assertthat_0.1 R6_2.1.3
## [5] DBI_0.5 formatR_1.4 magrittr_1.5 evaluate_0.9
## [9] stringi_1.1.1 lazyeval_0.2.0 rmarkdown_1.0 tools_3.3.1
## [13] stringr_1.1.0 yaml_2.1.13 htmltools_0.3.5 knitr_1.14
## [17] tibble_1.2