--- title: "Hash-based Matched Pseudo-Random Number Generation" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{hashprng} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", eval = FALSE ) ``` ```{r setup, echo=FALSE} library(hashprng) # required for vignette knitting only library(Rcpp) # used for data transformation + plotting library(data.table) library(ggplot2) library(parallel) ``` # Overview This vignette demonstrates the practical use of `hashprng::hash_seed`; if you want a detailed explanation of *why* to use this approach, read the paper associated with the package citation: (TODO). Briefly, hash-based matched pseudo-random number generation (HBM-PRNG) can ensure that the same "events" across different simulation scenarios are resolved consistently, while also still permitting stochastic simulation, and incurring minimal additional computational overhead. Matched samples can then be analyzed pair-wise, leading to higher resolution calculation of the overall impact of the cross-scenario varying features. However, researchers still need to specify what constitutes "the same" event, both in terms of what features are hashed and how they implement the model structure. In this vignette, we focus on how to use `hashprng::hash_seed`, but do so using two different implementations of the same model world and two different definitions of "same" for events. # Standards of Matching Throughout this vignette we consider a few matching regimes: - Non-matched (NON): this is equivalent to drawing an item from this random process and then one from the same-but-slightly-perturbed process and comparing them. Note: this kind of simulation should still be seeded, to ensure reproducibile PRNG. - Seed-Matched-Only (SMO): this scheme ensures that compared runs both have the same PRNG seeding. This can guarantee, for example, that the history leading up to an intervention is the same. However, generally once simulation trajectory diverges, the PRNG is exercised differently, and events that _should_ be the same between runs are not guaranteed to see the same PRNG draws. - Fully-Event-Matched (FEM): this scheme ensures that compared runs have all of their identical events resolved consistently. Any event which occurs in both simulations gets the same draw from the PRNG. Hash-based matched (HBM) PRNG is an approach to acheiving FEM simulations. There are alternatives, for example managing multiple PRNGs or precomputing all possible outcomes. In general, HBM imposes a guaranteed-but-low computational cost, scaling with the number of the PRNG draws in any given simulation, with minimal implementation burden on the modeller. We'll be making comparisons across NON, SMO, and HBM in terms of implementation, compute cost, and calculation resolution. # Model World and Scenario We'll be considering an infectious disease system of the *S*usceptible-*I*nfectious- *R*emoved variety and an intervention where, after some delay from initial infection introduction, a portion of the population are spontaneously moved to *R* (representing e.g. a reactive vaccination campaign). # Model Implementation We'll implement this model in terms of discrete time steps, discrete individuals, and event probabilities. At each time step: 1. We'll consider if infectious individuals have exposing contact with susceptible individuals - this converts *S* to *I* at the end of the step. 2. 
# Model World and Scenario

We'll be considering an infectious disease system of the
*S*usceptible-*I*nfectious-*R*emoved variety and an intervention where, after
some delay from initial infection introduction, a portion of the population is
spontaneously moved to *R* (representing e.g. a reactive vaccination campaign).

# Model Implementation

We'll implement this model in terms of discrete time steps, discrete
individuals, and event probabilities. At each time step:

1. We'll consider if infectious individuals have exposing contact with
   susceptible individuals - this converts *S* to *I* at the end of the step.
2. We'll consider if infectious individuals recover - *I* to *R*.
3. If it's time for the intervention, we'll consider which susceptible
   individuals to remove - *S* to *R*.

This is essentially a perturbation of the classic Reed-Frost model, with the
modification that people may be infectious for more than a single time step
(as well as the introduction of an intervention).

We're also going to implement this model in two ways, each with a switch for
whether or not to use hash-based matching for the random number draws. We'll
flip this switch to show how matching (or not) changes conclusions when
comparing across scenarios, as well as to measure the resource cost associated
with the higher resolution measurement. We'll also create a setup where the
seed is set only once, for the NON comparison.

The final element we need to specify for the model is what constitutes an
event. We're going to offer two definitions and explore the consequences.
Those two definitions hinge on whether the model's discrete individuals have
*identity* or not. If individuals have identity, then events are distinguished
by *who* is involved. If they do not, then events are only distinguished by
*what* and *when*.

```{r definitions}
# define some shared constants for disease states
# for the non-identity model, these are compartment indices
# for the identity model, these are individual states
susceptible <- 1L
infectious <- 2L
removed <- 3L
```

## Non-Identity Model Implementation

For the non-identity model, we assume that at each time step only the *size*
of the respective populations determines event outcomes, e.g. how many people
are infected. We can accomplish that by drawing from the binomial
distribution.

For HBM, we need to match on time and event. But since there are (at most)
three draws per step, always occurring in the same order, we can simply match
on time and then draw quantiles in a fixed order. We are assuming here that
`runif` always advances the underlying PRNG state by the number of deviates
requested, but that `rbinom` might advance state by a variable amount
depending on both parameters and state, hence the need to draw quantiles and
then invert them to outcomes.

N.B. `?rbinom` indicates draws are done per "Kachitvichyanukul, V. and
Schmeiser, B. W. (1988) Binomial random variate generation. Communications of
the ACM, 31, 216–222", which describes an algorithm (BTPE) that does indeed
draw uniform deviates a variable number of times per binomial draw, depending
on the $n$ and $p$ arguments.

```{r nonident}
#' @param t an integer scalar, the simulation time
#' @param y an integer vector, the current counts of individuals in S, I, and R
#' @param parameters a list, with `transmission_p`, `recovery_p`,
#'   `vaccination_p`, and `vax_t`
#' @param seed optional; if non-NULL, use hash-based matching (HBM)
nonidentity_dStep <- function(t, y, parameters, seed = NULL) {
  with(parameters, {
    # extract the S, I population counts
    Ninf <- y[infectious]
    Nsus <- y[susceptible]
    # if we are matching events
    if (!is.null(seed)) {
      # draw the same quantiles for each time point
      hash_seed(seed, t)
      evtqs <- runif(2)
      # convert these to the outcomes
      newinf <- qbinom(evtqs[1], Nsus, 1 - (1 - transmission_p)^Ninf)
      newrem <- qbinom(evtqs[2], Ninf, recovery_p)
    } else {
      # simply make the binomial draws
      newinf <- rbinom(1, Nsus, 1 - (1 - transmission_p)^Ninf)
      newrem <- rbinom(1, Ninf, recovery_p)
    }
    # move infectees S -> I, recoveries I -> R
    dY <- c(S = -newinf, I = newinf - newrem, R = newrem)
    if (t == vax_t) {
      if (!is.null(seed)) {
        # n.b. still on a consistent PRNG state for this time, so can just draw
        newvax <- qbinom(runif(1), Nsus - newinf, vaccination_p)
      } else {
        newvax <- rbinom(1, Nsus - newinf, vaccination_p)
      }
      # move vaccinees S -> R
      dY[susceptible] <- dY[susceptible] - newvax
      dY[removed] <- dY[removed] + newvax
    }
    # return the overall state change, as well as new incidence
    return(list(dY, newinf))
  })
}
```
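To see how a single step behaves, here is a small, illustrative call; the
`demo_pars` and `y_demo` values below are made up purely for this example:

```{r nonidentdemo}
demo_pars <- list(
  transmission_p = 0.0002, recovery_p = 0.3,
  vaccination_p = 0.05, vax_t = 10
)
y_demo <- c(3995L, 5L, 0L) # counts in S, I, R
# unmatched: ordinary binomial draws from the current PRNG stream
set.seed(1)
nonidentity_dStep(t = 1, y = y_demo, parameters = demo_pars)
# matched: the same (seed, t) combination always resolves the same way,
# regardless of the surrounding PRNG state
nonidentity_dStep(t = 1, y = y_demo, parameters = demo_pars, seed = 101L)
nonidentity_dStep(t = 1, y = y_demo, parameters = demo_pars, seed = 101L)
```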
## Higher Resolution Implementation

For the identity-based model, we assume it matters *who* in particular gets
infected or vaccinated. Thus, for HBM we need to hash both *who* and *when*
for infection and recovery. We can again take advantage of a consistent number
of draws (because we always consider exposing all individuals, susceptible or
not) to avoid resetting the PRNG for the recovery draw.

We also reseed the PRNG for vaccination. This probably isn't strictly
necessary: vaccination is itself the only factor that might change
trajectories, so there are no differences in trajectory prior to considering
vaccination. However, relying on that is a very fragile arrangement: adding
any other interventions or otherwise perturbing model assumptions could easily
change traversal of the PRNG and result in unmatched vaccinee selection.

N.B., for both the matched and unmatched versions we're generally overdrawing
random number deviates. While it might be possible to reduce computational
time with fewer draws, that would entail either other computational work
(e.g. calculating who is eligible for infection) or more convoluted data
structures. Those would be plausible for the unmatched version at this model
complexity, but likely become less practical with increasing complexity.

```{r stepfunction}
#' @param t an integer scalar, the simulation time
#' @param y an integer vector, the individual states: S, I, R
#' @param parameters a list, with `transmission_p`, `recovery_p`,
#'   `vaccination_p`, and `vax_t`
#' @param salt an integer scalar, optional; if present, use hash-based matching
identity_dStep <- function(t, y, parameters, salt = NULL) {
  with(parameters, {
    N <- length(y)
    dY <- integer(N)
    incI <- 0
    # consider whether each infectious individual exposes susceptible
    # individuals
    infectable <- which(y == susceptible)
    for (infector in which(y == infectious)) {
      if (!is.null(salt)) {
        # if HBM, need to reseed the RNG for each (infector, time) pair =>
        # easier to salt => draw deviates for all individuals => slice the
        # still-susceptible ones
        hash_seed(salt, "inf", infector, t)
        infectees <- infectable[(runif(N) < transmission_p)[infectable]]
      } else {
        infectees <- infectable[sample.int(
          length(infectable), rbinom(1, length(infectable), transmission_p)
        )]
      }
      dY[infectees] <- 1
      infectable <- setdiff(infectable, infectees)
      # if this infector would recover, indicate 2 => 3 transition
      if (runif(1) < recovery_p) dY[infector] <- 1
    }
    incI <- sum((dY == 1) & (y != infectious))
    # if it's time for vaccination
    if (t == vax_t) {
      if (!is.null(salt)) {
        # if HBM, need to reseed for the vaccination event
        # include the t here - what if the model changed vax_t?
        hash_seed(salt, "vax", t)
        vaccinees <- infectable[(runif(N) < vaccination_p)[infectable]]
      } else {
        vaccinees <- infectable[sample.int(
          length(infectable), rbinom(1, length(infectable), vaccination_p)
        )]
      }
      # for people still susceptible, indicate 1 => 3 transition
      dY[vaccinees] <- 2
    }
    return(list(dY, incI))
  })
}
```
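As a quick, illustrative check (the states and parameters below are made up
for this example), a salted call does not depend on the surrounding global
PRNG state, so identical events resolve identically no matter what else has
been simulated:

```{r identdemo}
y_id_demo <- rep.int(susceptible, 50L)
y_id_demo[1:3] <- infectious
id_demo_pars <- list(
  transmission_p = 0.02, recovery_p = 0.3,
  vaccination_p = 0.05, vax_t = 5
)
# with a salt, the step is fully determined by (t, y, parameters, salt) ...
set.seed(1)
m1 <- identity_dStep(2, y_id_demo, id_demo_pars, salt = 7L)
set.seed(999)
m2 <- identity_dStep(2, y_id_demo, id_demo_pars, salt = 7L)
identical(m1, m2) # TRUE
# ... whereas unmatched draws depend on the global PRNG state
set.seed(1)
u1 <- identity_dStep(2, y_id_demo, id_demo_pars)
set.seed(999)
u2 <- identity_dStep(2, y_id_demo, id_demo_pars)
identical(u1, u2) # almost certainly FALSE
```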
# Comparison

Let's run these simulations for a population of 4000 individuals, 75 time
steps, and 500 samples. We'll consider a very small amount of vaccination -
only 3.3%.

We want to assess the cumulative averted _incidence_ of this scenario compared
to no vaccination. For the "non-matching" version, we will still match on
random number seeds; we refer to that as seed-matched-only, or SMO. Because
our model starts vaccination after some delay, we should still see absolutely
identical outcomes for SMO up to that point. However, after that point, the
RNG will see a different traversal.

We're going to discard outcomes where the epidemic dies out prior to
vaccination. This is a debate-worthy choice - e.g. if we want to understand
the potential benefit of purchasing some program, that should include the
possibility that it was unnecessary. For this model world, however, those
sorts of considerations apply equally to either HBM or SMO, and we want to
focus on the ability to resolve intervention impact when there is an epidemic.
Hence, we discard those outcomes.

```{r parameters}
# setup parameters
samplen <- 500L
pop <- 4000L
maxt <- 75L
initI <- 5L

ps <- list(
  transmission_p = 1 - exp(-0.78 / pop), # ~ 0.78 infections produced per day in a fully susceptible population
  recovery_p = 1 - exp(-0.44) # ~ 2.3 days infectious
  # together these imply R0 ~ 1.8
)

# non-intervention: set vax_t < 0, so vaccination never occurs
nonps <- c(ps, list(
  vax_t = -1, vaccination_p = NA
))

# intervention at vax_t == 10; each susceptible has a 3.3% chance of vaccination
intps <- c(ps, list(
  vax_t = 10, vaccination_p = 0.033
))

# create populations
yinit <- rep.int(susceptible, pop)
yinit[1:initI] <- infectious
yinit_nonid <- c(pop - initI, initI, 0L)

stepper <- function(
  maxtime = maxt, y0, dFUN, pars, seed, HBM
) {
  # if not HBM & a seed provided => SMO => set initial seeding
  if (!missing(seed) && !HBM) set.seed(seed)
  # reserve data structure for per-step incidence (index 1 == time 0)
  res <- integer(maxtime + 1)
  res[1] <- 1
  y <- y0
  # run the simulation
  for (i in seq_len(maxtime)) {
    dY <- if (HBM) dFUN(i, y, pars, seed) else dFUN(i, y, pars)
    y <- y + dY[[1]]
    res[i + 1] <- dY[[2]]
  }
  return(res)
}
```
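Before generating the full set of samples, here is a quick single-pair check
of the SMO claim above, using an arbitrary seed chosen for illustration: the
two scenarios produce identical incidence up to and including the vaccination
time step, after which they are free to diverge.

```{r smocheck}
demo_seed <- 20L # arbitrary, for illustration only
no_vax <- stepper(
  y0 = yinit_nonid, dFUN = nonidentity_dStep, pars = nonps,
  seed = demo_seed, HBM = FALSE
)
with_vax <- stepper(
  y0 = yinit_nonid, dFUN = nonidentity_dStep, pars = intps,
  seed = demo_seed, HBM = FALSE
)
# identical history up to (and including) the vaccination time step ...
identical(no_vax[1:(intps$vax_t + 1)], with_vax[1:(intps$vax_t + 1)])
# ... but not guaranteed to match afterwards
identical(no_vax, with_vax)
```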
# Sampling

Now we run all the samples, using the plain sample ID as the SMO seed or HBM
salt, for all the models.

```{r runsamples}
set.seed(8675309)
nonid_no_match <- system.time(
  ni_non_dt <- seq_len(samplen) |> parallel::mclapply(function(seed) {
    data.table(
      type = "NON", model = "nonID",
      not_i = cumsum(stepper(
        y0 = yinit_nonid, dFUN = nonidentity_dStep, pars = nonps, HBM = FALSE
      )),
      withi = cumsum(stepper(
        y0 = yinit_nonid, dFUN = nonidentity_dStep, pars = intps, HBM = FALSE
      ))
    )
  }, mc.cores = 5) |> rbindlist(idcol = "sample")
)

set.seed(8675309)
id_no_match <- system.time(
  wi_non_dt <- seq_len(samplen) |> parallel::mclapply(function(seed) {
    data.table(
      type = "NON", model = "ID",
      not_i = cumsum(stepper(
        y0 = yinit, dFUN = identity_dStep, pars = nonps, HBM = FALSE
      )),
      withi = cumsum(stepper(
        y0 = yinit, dFUN = identity_dStep, pars = intps, HBM = FALSE
      ))
    )
  }, mc.cores = 5) |> rbindlist(idcol = "sample")
)

nonid_seed_match_only <- system.time(
  ni_smo_dt <- seq_len(samplen) |> parallel::mclapply(function(seed) {
    data.table(
      type = "SMO", model = "nonID",
      not_i = cumsum(stepper(
        y0 = yinit_nonid, dFUN = nonidentity_dStep, pars = nonps,
        seed = seed, HBM = FALSE
      )),
      withi = cumsum(stepper(
        y0 = yinit_nonid, dFUN = nonidentity_dStep, pars = intps,
        seed = seed, HBM = FALSE
      ))
    )
  }, mc.cores = 5) |> rbindlist(idcol = "sample")
)

id_seed_match_only <- system.time(
  wi_smo_dt <- seq_len(samplen) |> parallel::mclapply(function(seed) {
    data.table(
      type = "SMO", model = "ID",
      not_i = cumsum(stepper(
        y0 = yinit, dFUN = identity_dStep, pars = nonps,
        seed = seed, HBM = FALSE
      )),
      withi = cumsum(stepper(
        y0 = yinit, dFUN = identity_dStep, pars = intps,
        seed = seed, HBM = FALSE
      ))
    )
  }, mc.cores = 5) |> rbindlist(idcol = "sample")
)

nonid_hash_based_matching <- system.time(
  ni_hbm_dt <- seq_len(samplen) |> parallel::mclapply(function(seed) {
    data.table(
      type = "HBM", model = "nonID",
      not_i = cumsum(stepper(
        y0 = yinit_nonid, dFUN = nonidentity_dStep, pars = nonps,
        seed = seed, HBM = TRUE
      )),
      withi = cumsum(stepper(
        y0 = yinit_nonid, dFUN = nonidentity_dStep, pars = intps,
        seed = seed, HBM = TRUE
      ))
    )
  }, mc.cores = 5) |> rbindlist(idcol = "sample")
)

id_hash_based_matching <- system.time(
  wi_hbm_dt <- seq_len(samplen) |> parallel::mclapply(function(seed) {
    data.table(
      type = "HBM", model = "ID",
      not_i = cumsum(stepper(
        y0 = yinit, dFUN = identity_dStep, pars = nonps,
        seed = seed, HBM = TRUE
      )),
      withi = cumsum(stepper(
        y0 = yinit, dFUN = identity_dStep, pars = intps,
        seed = seed, HBM = TRUE
      ))
    )
  }, mc.cores = 5) |> rbindlist(idcol = "sample")
)

sampledt <- rbind(ni_non_dt, wi_non_dt, ni_smo_dt, wi_smo_dt, ni_hbm_dt, wi_hbm_dt)
```

Timings:

```{r timings, echo=FALSE}
print("Non-matched (NON), non-identity:")
print(nonid_no_match)
print("Seed-Matched-Only (SMO), non-identity:")
print(nonid_seed_match_only)
print("Hash-Based Matched (HBM), non-identity:")
print(nonid_hash_based_matching)
print("Non-matched (NON), with identity:")
print(id_no_match)
print("Seed-Matched-Only (SMO), with identity:")
print(id_seed_match_only)
print("Hash-Based Matched (HBM), with identity:")
print(id_hash_based_matching)
```

Now we compute the difference in each series pair, and determine whether there
are any samples to drop:

```{r summarycalcs}
sampledt[, averted := not_i - withi]
sampledt[, time := seq_len(.N) - 1L, by = .(type, model, sample)]

keepers <- sampledt[,
  .(keep = not_i[intps$vax_t + 1L] != not_i[.N]),
  by = .(type, model, sample)
]
keepers[, .(frac_kept = sum(keep) / .N), by = .(model, type)]

reduced_dt <- sampledt[keepers, on = .(type, model, sample)][keep == TRUE]

qtiled_dt <- reduced_dt[, {
  qs <- quantile(averted, probs = c(0.25, 0.5, 0.75))
  .(lo = qs[1], md = qs[2], hi = qs[3])
}, by = .(type, model, time)]
```
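One compact way to summarize the resolution differences is the width of the
interquartile band of the averted series at the final time step, by matching
regime; a narrower band corresponds to a higher-resolution estimate of the
intervention impact:

```{r iqrwidth}
# interquartile-range width of cumulative averted incidence at the final
# time step, by model and matching regime
qtiled_dt[time == maxt, .(iqr_width = hi - lo), by = .(model, type)]
```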
...and lastly, we visualize the results: first the per-sample averted
trajectories, then the interquartile ribbons, and finally the same ribbons
with the NON comparison excluded.

```{r spaghetti, echo=FALSE, out.width="100%"}
ggplot(reduced_dt) + aes(time, averted, color = type, group = interaction(type, sample)) +
  facet_grid(. ~ model) +
  geom_line(alpha = 0.2) +
  scale_color_discrete(NULL, guide = guide_legend(override.aes = list(alpha = 1))) +
  theme_minimal() +
  theme(
    legend.position = c(1, 1) - 0.05,
    legend.justification = c(1, 1)
  )
```

```{r ribbons, echo=FALSE, out.width="100%"}
ggplot(qtiled_dt) + aes(time) +
  facet_grid(. ~ model) +
  geom_ribbon(aes(fill = type, ymax = hi, ymin = lo), alpha = 0.25) +
  geom_line(aes(color = type, y = md)) +
  scale_color_discrete(
    NULL, aesthetics = c("color", "fill")
  ) +
  theme_minimal() +
  theme(
    legend.position = c(1, 1) - 0.05,
    legend.justification = c(1, 1)
  )
```

```{r noncrazyribbons, echo=FALSE, out.width="100%"}
ggplot(qtiled_dt[type != "NON"]) + aes(time) +
  facet_grid(. ~ model) +
  geom_ribbon(aes(fill = type, ymax = hi, ymin = lo), alpha = 0.25) +
  geom_line(aes(color = type, y = md)) +
  scale_color_discrete(
    NULL, aesthetics = c("color", "fill")
  ) +
  theme_minimal() +
  theme(
    legend.position = c(1, 1) - 0.05,
    legend.justification = c(1, 1)
  )
```