The epinowcast
package aims to be a modular toolbox for
real-time infectious disease surveillance both in outbreak and routine
contexts. As such we provide a modular modelling framework that is
optimised for a range of common surveillance tasks whilst maintaining
flexibility and supporting user extension.
We provide a flexible semi-parametric model for the underlying generative process similar to that implemented in other real-time infectious disease modelling packages[1,2]. This optionally includes a renewal process[3,4] and latent reporting process[1,5,6]. Combined with the appropriate generation time distribution this approach has been shown to correspond to a Susceptible-Exposed-Infected-Recovered (SEIR) model[7] with the addition of reporting delays. However, our default model contains minimal mechanism in order to more flexibly fit highly informative data.
Our nowcasting approach is an extension of that proposed by Günther
et al.[8] which was itself an
extension of the model proposed by Höhle and Heiden[9] and
implemented in the surveillance
R package[10].
Compared to the model proposed in Günther et al.[8],
epinowcast
adds support for jointly nowcasting multiple
related datasets, a flexible formula interface allowing for the
specification of a large range of models, and an optional parametric
assumption for the underlying reporting delay.
We also support flexible joint modelling of missing data by assuming that the reporting delay is consistent between reported and unreported observations following the methodology of Lison et al.[11].
Our modelling framework is implemented in the stan
probabilistic programming language via cmdstanr
[12,13] with a focus on computational
efficiency and robustness.
In the following sections we outline our modelling methodology, current feature set stratified by module, and highlight implementation details. For each module we present both our default implementation as well as the more generic framework we support. Note that extensions to this methodology are warmly welcomed with our community site being a good first point of contact.
We are concerned with outcomes that occur at a time of reference (e.g., date of symptom onset or test for a disease) that are reported only with a delay, at the time of report (e.g., the date onsets are entered into a central database and so become available for analysis). We assume that these times are measured in discrete time steps, usually of a day (in which case the times are dates).
We follow the approach of Höhle and Heiden[9] and consider the distribution of notifications (ng, t, d) by date of reference (t) and reporting delay (d) conditional on the final observed count Ng, t for each dataset (g) such that,
where D represents the maximum delay between date of reference and time of report which in theory could be infinite but in practice we set to a finite value in order to make the model identifiable and computationally feasible. For each t and g these notifications are assumed to be drawn from a multinomial distribution with Ng, t trials and a probability vector (pg, t, d) of length D. The aim of nowcasting is to predict the final observed counts Ng, t given information available up to time t. We do this by estimating the components of this probability vector jointly with the expected number of final notifications (λg, t = 𝔼[Ng, t]) in dataset g at time t.
An alternative approach would be to consider each ng, t, d independently at which point the model can be defined as a regression that can be fit using standard software with the appropriate observation model and adjustment for reporting delay (i.e., it becomes a Poisson or Negative Binomial regression). An implementation of this approach is available in Bastos et al.[14]. A downside of this simplified approach is that reporting is not conditionally dependent which may make specifying models for complex reporting distributions difficult.
Here we follow the approach of Günther et al.[8] and specify the model for expected final notifications as a first order random walk. This simple model is highly flexible and so a good fit for nowcasting problems where the data is highly informative.
where Ng0, the first time point for expected observations in dataset d, is assumed to have been completely observed.
In settings where data is sparse or where users want to understand the underlying generative process our flexible default model is likely not a good choice. In these settings our generic model offers a range of options that are context specific. Our generic model is currently based on a renewal process[3,4] with additional latent reporting delays[1,5,6]. As previously noted[7], this corresponds to the commonly used Susceptible-Exposed-Infected-Recovered (SEIR) model when appropriate generation time is specified[7]
We model the instantaneous reproduction number (Rt) on the log scale (though support for other link functions is planned). When the generation time is fixed to be a day this can be interpreted as the instantaneous growth rate (rt) defined as the difference in the log of the expected number of final notifications between time t and t − 1.
where r0 is the
optional intercept, Xr is the design
matrix for fixed effects (βf, r),
and Zr is
the design matrix for random effects (βr, r.
Within this specification the default model can be specified as a random
effect on the day with no intercept. Alternative specifications that may
be of interest include a weekly random walk (specified as
~ 1 + rw(week)
), a piecewise linear model (specified as
~ 1 + day:week
), and a group specific random effect
(specified as ~ 1 + (1 | .group)
).
We model the expected number of infections/latent notifications (λl) using a renewal process[3,4]. This model is a generalisation of the default model and can be used to model the expected number of latent notifications in a setting where the generation time is not fixed to be a day. It implies that current infections/notifications are dependent on past infections/notifications based on a kernel (usually interpreted as the generation time or serial interval). An instantaneous daily growth rate model can be recovered by setting the generation time to be fixed at 1 day. The model is defined as follows,
Where Gg(p, t − p) is the probability of an infection p days after infection t − p days ago for group g, and P is the maximum generation time. To initialise the model we assume that the first P latent notifications are log-normally distributed with mean μg, tl and standard deviation σg, tl. This is equivalent to assuming that the first P latent notifications are independent and identically distributed. The mean of the log-normal distribution for each group is the log of the latest reported case count for the first reference date for that group scaled by the sum of the latent reporting delay. The standard deviation is assumed to be 1. Both of these assumptions can be altered by the user.
In some settings there may be additional reporting delays on top of
those that are directly observed in the data, and therefore
“nowcastable”, a common example is the delay from exposure to symptom
onset. For these settings we support modelling “latent” reporting delays
as a convolution of the underlying expected counts with the potential
for these delays to vary over time and by group. This implementation is
similar to that implemented in EpiNow2
and
epidemia
as well as other similar models[1,5,6,11]. In addition to this we
support modelling ascertainment through the use of improper probability
mass functions (i.e., by not enforcing a sum to 1 constraint) and
inferring ascertainment where possible (for example day of the week
reporting patterns).
Where νg, t is the inferred ascertainment and is modelled flexibly using an optional intercept (ν0), a design matrix (Xν) for fixed effects (βf, ν), and a design matrix (Zν) for random effects (βr, ν.
Given case counts both by date of reference and by date of report, we can estimate the reporting delay distribution directly and jointly with the underlying process model, rather than relying on external estimates from other sources (though we may want to account for external information in our priors). In the following section, we describe our default parametric delay distribution model and its extension into a generic, highly flexible delay model based on discrete time-to-event modelling.
In our default model, we consider the reporting delay to follow a LogNormal(μd, σd) distribution, with parameters
This distribution is discretised into daily probabilities pg, t, d
(for a case in group g with
reference date t to be
reported with delay d) and
adjusted for the maximum delay, see
vignette("distributions")
for details.
We generalise this model in order to support a range of delay distributions as well as effects of the reporting process on the delay. Following the approach of Günther et al.[8] and others, we parameterise the delay probability (pg, t, d) via a discrete-time hazard model, i.e.,
where
hg, t, d = P(delay = d|delay ≥ d, Wg, t, d), is the so-called reporting hazard. For a case in group g with reference date t, the hazard hg, t, d states the probability of being reported with delay d, given that the case is not reported earlier. The hazard depends on a design matrix Wg, t, d, which encodes a baseline delay distribution and covariates that affect the reporting delay. We extend the model of Günther et al. by decomposing the hazard into three components,
Each component adds to the overall hazard through a regression with a logit link
where the maximum delay hazard is hg, t, D = 1, in order to enforce the assumption that all observations are reported within the specified maximum delay. In the following, we describe the parameterisation of the different components.
The parametric baseline hazard γg, t, d
for a case in group g with
reference date t is modelled
according to a certain discretised parametric probability distribution
with parameters μg, t
and υg, t.
Currently, epinowcast
supports four different parametric
distributions: (i) log-normal (default), (ii) exponential, (iii) gamma,
and (iv) log-logistic. The distributions are discretised and adjusted
for an assumed maximum delay, see vignette("distributions")
for details. The delay probabilities pg, t, d′
obtained from the discretised delay distribution are converted into
hazards on the logit scale using
In the default case of the log-normal distribution, the parameters μg, t and υg, t represent the log mean and log standard deviation. Each parameter is defined using a (log-)linear model. The model consists of an intercept and a number of arbitrary, shared covariates, indexed by reference date. The covariates are multiplied by fixed (βf, i) and random (βr, i) coefficients (note that these can include auto-regressive terms), i.e.,
In addition to parametric reporting effects there may also be non-parametric effects referenced by both reference and report dates. These are represented by the non-distributional logit hazard components for the date of reference and report, defined using an intercept (δ0) and arbitrary, shared covariates with fixed (βf, i) and random (βr, i) coefficients (note these can include auto-regressive terms).
All fixed (βf, i)
and random (βr, i)
coefficients have standard normal priors by default with standard
half-normal priors for pooled standard deviations. For further
implementation details see enw_reference()
for delays
linked to the date of reference, enw_report()
for delays
linked to the date of report.
Expected notifications by date of reference (t) and reporting delay can now be found by multiplying expected final notifications for each t with the probability of reporting for each day of delay (pg, t, d). We assume a negative binomial observation model, by default, with a joint overdispersion parameter (with a standard half normal prior on 1 over square root of the overdispersion[15]) and produce a nowcast of final observed notifications at each reference date by summing posterior estimates for unobserved notification and observed notifications for that reference date.
Where αg, t
is the proportion of cases by reference date that will not report their
reference date. By default this is not modelled and is set to zero , see
the accounting for reported cases with a missing reference date section
for further defaults. Other observation models such as the Poisson
distribution are also supported. See the documentation
enw_obs()
for details.
In order to make best use of observed data when nowcasting we use observations where available and where they have not been reported for a given report and reference date we use the posterior prediction from the observation model above. This means that as nowcast dates become increasingly truncated they depend more on modelled estimates whereas when they are more complete the majority of the final count is known. Depending on your use case the posterior predictions alone may also be of interest.
In real-world settings observations may be reported without a linked reference date. A common example of this is cases by date of symptom onset where report date is often known but onset date may not be. To account for this we support modelling this missing process by assuming that cases with a missing reference date have the same reporting delay distribution as cases with a known reference date and that processes that drive the probability of having a missing reference date (αg, t) are linked to the unknown date of reference rather than the date of report based on Lison et al.[11]. We model this probability flexibly on a logit scale as follows,
Where α0 represents the intercept, βf, α fixed effects, and βr, α random effects. To link with observations by date of report with a missing reference date (Mg, t) we convolve expected notifications with the probability of having a missing reference date and the probability of reporting on a given day as follows,
As for cases with known reference dates other observation models are
supported. For further implementation details see
enw_missing()
.
The model is implemented in the probabilistic programming language
stan
and we use cmdstanr
to interact with the
model[12,13]. Optional within chain
parallelisation is available across times of reference to reduce
runtimes. Sparse design matrices have been used for all covariates to
limit the number of probability mass functions that need to be
calculated. epinowcast
incorporates additional
functionality written in R[16] to enable plotting nowcasts and
posterior predictions, summarising nowcasts, and scoring them using
scoringutils
[17]. A flexible formula
interface is provided to enable easier implementation of complex user
specified models without interacting with the underlying code base. All
functionality is modular allowing users to extend and alter the
underlying model whilst continuing to use the package framework.