Many quantities in epidemiology can be thought of as the time between two events, or “delays”. Important examples include the incubation period (the time from infection to symptom onset), the serial interval (the time between symptom onset in an infector and symptom onset in the person they infect), and the reporting delay (the time from symptom onset to case report).
We encompass all of these delays as the time between a “primary event” and a “secondary event”. Unfortunately, estimating delays accurately from observational data is usually difficult. In our experience, the two main issues are interval censoring of the event times and right truncation of the observed delays.
Don’t worry if you’ve not come across these terms before. First, in Section @ref(data), we explain interval censoring and right truncation by simulating data similar to what we might observe during an ongoing infectious disease outbreak. Then, in Section @ref(fit), we show how `epidist` can be used to accurately estimate delay distributions by using a statistical model which properly accounts for these two issues.
If you would like more technical details, the `epidist` package implements models following best practices as described in Park et al. (2024) and Charniga et al. (2024). We also recommend the introduction provided by the nowcasting and forecasting infectious disease dynamics course.
To run this vignette yourself, you will need the `epidist` package as well as a few supporting packages.
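A minimal setup, inferred from the functions used in this vignette (the exact set of packages may differ), is:
library(epidist)
library(dplyr)
library(ggplot2)
library(gt)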
First, we use the Gillespie algorithm to generate infectious disease outbreak data (Figure @ref(fig:outbreak)) from a stochastic compartmental model.
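A sketch of this simulation step, assuming the package's `simulate_gillespie()` helper (the seed shown is arbitrary):
# Simulate primary event (infection) times from a stochastic compartmental model
outbreak <- simulate_gillespie(seed = 101)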
(ref:outbreak) Early on in the epidemic, there is a high rate of growth in new cases. As more people are infected, the rate of growth slows. (Only every 50th case is shown to avoid over-plotting.)
outbreak |>
filter(case %% 50 == 0) |>
ggplot(aes(x = ptime, y = case)) +
geom_point(col = "#56B4E9") +
labs(x = "Primary event time (day)", y = "Case number") +
theme_minimal()
`outbreak` is a `data.frame` with two columns: `case` and `ptime`. Here `ptime` is a numeric column giving the time of infection. In reality, it is more common to receive primary event times as a date rather than a numeric value.
head(outbreak)
#> case ptime
#> 1 1 0.04884052
#> 2 2 0.06583120
#> 3 3 0.21857827
#> 4 4 0.24963421
#> 5 5 0.30133392
#> 6 6 0.31425010
To generate secondary events, we will use a lognormal distribution (Figure @ref(fig:lognormal)) for the delay between primary and secondary events:
secondary_dist <- data.frame(mu = 1.8, sigma = 0.5)
class(secondary_dist) <- c("lognormal_samples", class(secondary_dist))
secondary_dist <- add_mean_sd(secondary_dist)
obs <- outbreak |>
simulate_secondary(
meanlog = secondary_dist[["mu"]],
sdlog = secondary_dist[["sigma"]]
)
(ref:lognormal) The lognormal distribution is skewed to the right, so long delays still occur with some probability.
ggplot(data.frame(x = c(0, 30)), aes(x = x)) +
geom_function(
fun = dlnorm,
args = list(
meanlog = secondary_dist[["mu"]],
sdlog = secondary_dist[["sigma"]]
)
) +
theme_minimal() +
labs(
x = "Delay between primary and secondary event (days)",
y = "Probability density"
)
(ref:delay) Secondary events (in green) occur with a delay drawn from the lognormal distribution (Figure @ref(fig:lognormal)). As with Figure @ref(fig:outbreak), to make this figure easier to read, only every 50th case is shown.
obs |>
filter(case %% 50 == 0) |>
ggplot(aes(y = case)) +
geom_segment(
aes(x = ptime, xend = stime, y = case, yend = case),
col = "grey"
) +
geom_point(aes(x = ptime), col = "#56B4E9") +
geom_point(aes(x = stime), col = "#009E73") +
labs(x = "Event time (day)", y = "Case number") +
theme_minimal()
`obs` is now a `data.frame` with further columns for `delay` and `stime`. The secondary event time is simply the primary event time plus the delay.
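As a quick, illustrative check (using only columns already present in `obs`), we can verify this relationship directly:
# stime should equal ptime plus the simulated delay, up to numerical precision
all.equal(obs$stime, obs$ptime + obs$delay)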
If we were to receive the complete data `obs` as above, then it would be simple to accurately estimate the delay distribution. However, in reality, during an outbreak we almost never receive the data as above.
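To see why complete data would make this easy, a simple moment-style estimate of the lognormal parameters from the full set of delays (purely illustrative, not part of the package workflow) already lands close to the true `mu = 1.8` and `sigma = 0.5`:
# Naive estimates of meanlog and sdlog from the complete, uncensored delays
c(meanlog = mean(log(obs$delay)), sdlog = sd(log(obs$delay)))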
First, the times of primary and secondary events will usually be censored. This means that rather than exact event times, we observe event times within an interval. Here we suppose that the interval is daily, meaning that only the date of the primary or secondary event, not the exact event time, is reported (Figure @ref(fig:cens)):
obs_cens <- obs |>
mutate(
ptime_lwr = floor(.data$ptime),
ptime_upr = .data$ptime_lwr + 1,
stime_lwr = floor(.data$stime),
stime_upr = .data$stime_lwr + 1,
delay_daily = stime_lwr - ptime_lwr
)
(ref:cens) Interval censoring of the primary and secondary event times obscures the delay times. A common example of this is when events are reported as daily aggregates. While daily censoring is most common, `epidist` supports the primary and secondary events having other censoring intervals.
obs_cens |>
filter(case %% 50 == 0, case <= 250) |>
ggplot(aes(y = case)) +
geom_segment(
aes(x = ptime, xend = stime, y = case, yend = case),
col = "grey"
) +
# The primary event censoring intervals
geom_errorbarh(
aes(xmin = ptime_lwr, xmax = ptime_upr, y = case),
col = "#56B4E9", height = 5
) +
# The secondary event censoring intervals
geom_errorbarh(
aes(xmin = stime_lwr, xmax = stime_upr, y = case),
col = "#009E73", height = 5
) +
geom_point(aes(x = ptime), fill = "white", col = "#56B4E9", shape = 21) +
geom_point(aes(x = stime), fill = "white", col = "#009E73", shape = 21) +
labs(x = "Event time (day)", y = "Case number") +
theme_minimal()
During an outbreak we will usually be estimating delays in real time. The result is that only those cases with a secondary event occurring before the current observation time are observed. This is called (right) truncation, and it biases the observation process towards shorter delays:
obs_time <- 25
obs_cens_trunc <- obs_cens |>
mutate(obs_time = obs_time) |>
filter(.data$stime_upr <= .data$obs_time)
Finally, in reality, it’s not possible to observe every case. We suppose that a sample of individuals of size `sample_size` are observed.
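A sketch of this sampling step is given below; we assume `sample_size = 200`, matching the 200 observations reported in the model fit later in this vignette:
# Draw a simple random sample of cases from the truncated data
sample_size <- 200
obs_cens_trunc_samp <- obs_cens_trunc |>
  slice_sample(n = sample_size, replace = FALSE)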
This sample size corresponds to 8.7% of the data.
Another issue, which `epidist` currently does not account for, is that sometimes only the secondary event might be observed, and not the primary event. For example, symptom onset may be reported, but the start of infection unknown. Discarding events of this type leads to what are called ascertainment biases. Whereas each case is equally likely to appear in the sample above, under ascertainment bias some cases are more likely to appear in the data than others.
With our censored, truncated, and sampled data, we are now ready to try to recover the underlying delay distribution using `epidist`.
If we had access to the complete and unaltered `obs`, it would be simple to estimate the delay distribution. However, with access only to `obs_cens_trunc_samp`, the delay distribution we observe is biased (Figure @ref(fig:obs-est)) and we must use a statistical model.
(ref:obs-est) The histogram of delays from the complete, retrospective data `obs_cens` matches the underlying distribution quite closely, whereas the delays from `obs_cens_trunc_samp` show substantial systematic bias. In this instance the extent of the bias caused by censoring is less than that caused by right truncation. Nonetheless, we recommend adjusting for censoring whenever it is present (Charniga et al. 2024, Table 2).
bind_rows(
obs_cens |>
mutate(type = "Complete, retrospective data") |>
select(delay = delay_daily, type),
obs_cens_trunc_samp |>
mutate(type = "Censored, truncated,\nsampled data") |>
select(delay = delay_daily, type)
) |>
group_by(type, delay, .drop = FALSE) |>
summarise(n = n()) |>
mutate(p = n / sum(n)) |>
ggplot() +
geom_col(
aes(x = delay, y = p, fill = type, group = type),
position = position_dodge2(preserve = "single")
) +
scale_fill_manual(values = c("#CC79A7", "#0072B2")) +
geom_function(
data = data.frame(x = c(0, 30)), aes(x = x),
fun = dlnorm,
args = list(
meanlog = secondary_dist[["mu"]],
sdlog = secondary_dist[["sigma"]]
)
) +
labs(
x = "Delay between primary and secondary event (days)",
y = "Probability density",
fill = ""
) +
theme_minimal() +
theme(legend.position = "bottom")
The main function you will use for modelling is called `epidist()`.¹ We will fit the `epidist_latent_model` model, which uses latent variables for the time of the primary and secondary event of each individual.² To do so, we first prepare the data using `as_epidist_latent_model()`:
linelist_data <- as_epidist_linelist_data(
obs_cens_trunc_samp$ptime_lwr,
obs_cens_trunc_samp$ptime_upr,
obs_cens_trunc_samp$stime_lwr,
obs_cens_trunc_samp$stime_upr,
obs_time = obs_cens_trunc_samp$obs_time
)
data <- as_epidist_latent_model(linelist_data)
class(data)
#> [1] "epidist_latent_model" "epidist_linelist_data" "tbl_df"
#> [4] "tbl" "data.frame"
The `data` object now has the class `epidist_latent_model`. Using this `data`, we now call `epidist()` to fit the model. The parameters of the model are inferred using Bayesian inference. In particular, we use the No-U-Turn Sampler (NUTS) Markov chain Monte Carlo (MCMC) algorithm via the `brms` R package (Bürkner 2017).
Note that here we use the default `rstan` backend, but we generally recommend the `cmdstanr` backend for faster sampling and additional features. This can be set using `backend = "cmdstanr"` after following the CmdStan installation instructions in the README.
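The fitting call itself is sketched below. We assume two chains, matching the summary output shown later; other arguments (for example `cores`) are passed through to `brms` and may differ in your own analysis:
# Fit the latent model; additional arguments are forwarded to brms
fit <- epidist(data = data, chains = 2, cores = 2)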
The `fit` object is a `brmsfit` object containing MCMC samples for each of the parameters in the model, shown in the table below. Many of these parameters (e.g. `swindow` and `pwindow`) are so-called latent variables, and have lengths corresponding to the `sample_size`.
pars <- fit |>
variables(reserved = FALSE) |>
gsub(pattern = "\\[\\d+\\]", replacement = "")
data.frame(
Parameter = unique(pars),
# Align the counts with the parameter names (order of first appearance)
Length = as.numeric(table(pars)[unique(pars)])
) |>
gt()
Parameter | Length |
---|---|
b_Intercept | 1 |
b_sigma_Intercept | 1 |
Intercept | 1 |
Intercept_sigma | 1 |
lprior | 1 |
lp__ | 1 |
swindow_raw | 200 |
pwindow_raw | 200 |
pwindow | 200 |
swindow | 200 |
Users familiar with Stan and `brms` can work with `fit` directly. Any tool that supports `brms` fitted model objects will be compatible with `fit`. For example, we can use the built-in `summary()` function
summary(fit)
#> Family: latent_lognormal
#> Links: mu = identity; sigma = log
#> Formula: delay | vreal(relative_obs_time, pwindow, swindow) ~ 1
#> sigma ~ 1
#> Data: data (Number of observations: 200)
#> Draws: 2 chains, each with iter = 2000; warmup = 1000; thin = 1;
#> total post-warmup draws = 2000
#>
#> Regression Coefficients:
#> Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
#> Intercept 1.75 0.05 1.67 1.85 1.00 2281 1375
#> sigma_Intercept -0.78 0.07 -0.91 -0.64 1.00 2364 1575
#>
#> Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
#> and Tail_ESS are effective sample size measures, and Rhat is the potential
#> scale reduction factor on split chains (at convergence, Rhat = 1).
to see that the model has converged and recovered the delay distribution parameters reasonably well.
The `epidist` package also provides functions to make common post-processing tasks easy. For example, individual predictions of the lognormal delay parameters can be extracted.
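A sketch of this step, assuming the package's `predict_delay_parameters()` helper, which we take to return a data frame of posterior draws with `mu` and `sigma` columns (used as `pred` below):
# Posterior draws of the lognormal delay parameters
pred <- predict_delay_parameters(fit)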
Figure @ref(fig:fitted-lognormal) shows the lognormal delay distribution obtained for 100 draws from the posterior distribution of the `mu` and `sigma` parameters. Whereas in Figure @ref(fig:obs-est) the histogram of censored, truncated, sampled data was substantially different to the underlying delay distribution, using `epidist()` we have obtained a much closer match to the truth.
(ref:fitted-lognormal) The fitted delay distribution (pink lines, 100 draws from the posterior) compared to the true underlying delay distribution (black line). Each pink line represents one possible delay distribution based on the fitted model parameters.
# Sample 100 random draws from the posterior
set.seed(123)
sample_draws <- sample(nrow(pred), 100)
ggplot() +
# Plot the true distribution
geom_function(
data = data.frame(x = c(0, 30)),
aes(x = x),
fun = dlnorm,
linewidth = 1.5,
args = list(
meanlog = secondary_dist[["mu"]],
sdlog = secondary_dist[["sigma"]]
)
) +
# Plot 100 sampled posterior distributions
lapply(sample_draws, \(i) {
geom_function(
data = data.frame(x = c(0, 30)),
aes(x = x),
fun = dlnorm,
args = list(
meanlog = pred$mu[i],
sdlog = pred$sigma[i]
),
alpha = 0.1,
color = "#CC79A7"
)
}) +
labs(
x = "Delay between primary and secondary event (days)",
y = "Probability density"
) +
theme_minimal()
¹ Technically, `epidist()` is an S3 generic which allows it to work differently for inputs of different classes. This is in part why inputs must be prepared first via `as_epidist_latent_model()`, so that they are of the appropriate class!

² In a future vignette, we will explain in more detail the structure of the model!