--- title: "Hash-based Matched Pseudo-Random Number Generation" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{hashprng} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", eval = FALSE ) ``` ```{r setup, echo=FALSE} library(hashprng) # required for vignette knitting only library(Rcpp) # used for data transformation + plotting library(data.table) library(ggplot2) library(parallel) ``` # Overview This vignette demonstrates the practical use of `hashprng::hash_seed`; if you want a detailed explanation of *why* to use this approach, read the paper associated with the package citation: (TODO). Briefly, hash-based matched pseudo-random number generation (HBM-PRNG) can ensure that the same "events" across different simulation scenarios are resolved consistently, while also still permitting stochastic simulation, and incurring minimal additional computational overhead. Matched samples can then be analyzed pair-wise, leading to higher resolution calculation of the overall impact of the cross-scenario varying features. However, researchers still need to specify what constitutes "the same" event, both in terms of what features are hashed and how they implement the model structure. In this vignette, we focus on how to use `hashprng::hash_seed`, but do so using two different implementations of the same model world and two different definitions of "same" for events. # Standards of Matching Throughout this vignette we consider a few matching regimes: - Non-matched (NON): this is equivalent to drawing an item from this random process and then one from the same-but-slightly-perturbed process and comparing them. Note: this kind of simulation should still be seeded, to ensure reproducibile PRNG. - Seed-Matched-Only (SMO): this scheme ensures that compared runs both have the same PRNG seeding. This can guarantee, for example, that the history leading up to an intervention is the same. However, generally once simulation trajectory diverges, the PRNG is exercised differently, and events that _should_ be the same between runs are not guaranteed to see the same PRNG draws. - Fully-Event-Matched (FEM): this scheme ensures that compared runs have all of their identical events resolved consistently. Any event which occurs in both simulations gets the same draw from the PRNG. Hash-based matched (HBM) PRNG is an approach to acheiving FEM simulations. There are alternatives, for example managing multiple PRNGs or precomputing all possible outcomes. In general, HBM imposes a guaranteed-but-low computational cost, scaling with the number of the PRNG draws in any given simulation, with minimal implementation burden on the modeller. We'll be making comparisons across NON, SMO, and HBM in terms of implementation, compute cost, and calculation resolution. # Model World and Scenario We'll be considering an infectious disease system of the *S*usceptible-*I*nfectious- *R*emoved variety and an intervention where, after some delay from initial infection introduction, a portion of the population are spontaneously moved to *R* (representing e.g. a reactive vaccination campaign). # Model Implementation We'll implement this model in terms of discrete time steps, discrete individuals, and event probabilities. At each time step: 1. We'll consider if infectious individuals have exposing contact with susceptible individuals - this converts *S* to *I* at the end of the step. 2. 
# Model World and Scenario

We'll be considering an infectious disease system of the
*S*usceptible-*I*nfectious-*R*emoved variety and an intervention where, after
some delay from initial infection introduction, a portion of the population is
spontaneously moved to *R* (representing e.g. a reactive vaccination campaign).

# Model Implementation

We'll implement this model in terms of discrete time steps, discrete
individuals, and event probabilities. At each time step:

1. We'll consider if infectious individuals have exposing contact with
   susceptible individuals - this converts *S* to *I* at the end of the step.
2. We'll consider if infectious individuals recover - *I* to *R*.
3. If it's time for the intervention, we'll consider which susceptible
   individuals to remove - *S* to *R*.

This is essentially a perturbation of the classic Reed-Frost model, with the
modification that people may be infectious for more than a single time step
(as well as the introduction of an intervention).

We're also going to implement this model in two ways, each with a switch for
whether or not to use hash-based matching for the random number draws. We'll
flip this switch to show how matching (or not) changes conclusions when
comparing across scenarios, as well as to measure the resource cost associated
with the higher resolution measurement. We'll also create a setup where the
seed is set only once, for the NON comparison.

The final element we need to specify for the model is what constitutes an
event. We're going to offer two definitions and explore the consequences.
Those two definitions hinge on whether the model's discrete individuals have
*identity* or not. If individuals have identity, then events are distinguished
by *who* is involved. If they do not, then events are only distinguished by
*what* and *when*.

```{r definitions}
# define some shared constants for disease states
# for the non-identity model, these are compartment indices
# for the identity model, these are individual states
susceptible <- 1L
infectious <- 2L
removed <- 3L
```

## Non-Identity Model Implementation

For the non-identity model, we assume that at each time step only the *size*
of the respective populations determines event outcomes, e.g. how many people
are infected. We can accomplish that by drawing from the binomial
distribution.

For HBM, we need to match on time and event. But since there are (at most)
three draws per step, always occurring in the same order, we can simply match
on time and then draw quantiles in a fixed order. We are assuming here that
`runif` always advances the underlying PRNG state by the number of deviates
requested, but that `rbinom` might advance state by a variable amount
depending on both parameters and state, hence the need to draw quantiles and
then invert them to outcomes.

N.B. `?rbinom` indicates draws are done per "Kachitvichyanukul, V. and
Schmeiser, B. W. (1988) Binomial random variate generation. Communications of
the ACM, 31, 216–222", which describes an algorithm (BTPE) that does indeed
draw uniform deviates a variable number of times per binomial draw, depending
on the $n$ and $p$ arguments.

```{r nonident}
#' @param t an integer scalar, the simulation time
#' @param y an integer vector, the current counts of individuals in S, I, and R
#' @param parameters a list, with `transmission_p`, `recovery_p`,
#'   `vaccination_p`, and `vax_t`
#' @param seed optional; if non-NULL, use hash-based matching (HBM)
nonidentity_dStep <- function(t, y, parameters, seed = NULL) {
  with(parameters, {
    # extract the S, I population counts
    Ninf <- y[infectious]
    Nsus <- y[susceptible]
    # if we are matching events
    if (!is.null(seed)) {
      # draw the same quantiles for each time point
      hash_seed(seed, t)
      evtqs <- runif(2)
      # convert these to the outcomes
      newinf <- qbinom(evtqs[1], Nsus, 1 - (1 - transmission_p)^Ninf)
      newrem <- qbinom(evtqs[2], Ninf, recovery_p)
    } else {
      # simply make the binomial draws
      newinf <- rbinom(1, Nsus, 1 - (1 - transmission_p)^Ninf)
      newrem <- rbinom(1, Ninf, recovery_p)
    }
    # move infectees S -> I, recoveries I -> R
    dY <- c(S = -newinf, I = newinf - newrem, R = newrem)
    if (t == vax_t) {
      if (!is.null(seed)) {
        # n.b. still on a consistent PRNG state for this time, so can just draw
        newvax <- qbinom(runif(1), Nsus - newinf, vaccination_p)
      } else {
        newvax <- rbinom(1, Nsus - newinf, vaccination_p)
      }
      # move vaccinees S -> R
      dY[susceptible] <- dY[susceptible] - newvax
      dY[removed] <- dY[removed] + newvax
    }
    # return the overall state change, as well as new incidence
    return(list(dY, newinf))
  })
}
```
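To see how a single step behaves, here is a small, illustrative call; the
`demo_pars` and `y_demo` values below are made up purely for this example:

```{r nonidentdemo}
demo_pars <- list(
  transmission_p = 0.0002, recovery_p = 0.3,
  vaccination_p = 0.05, vax_t = 10
)
y_demo <- c(3995L, 5L, 0L) # counts in S, I, R
# unmatched: ordinary binomial draws from the current PRNG stream
set.seed(1)
nonidentity_dStep(t = 1, y = y_demo, parameters = demo_pars)
# matched: the same (seed, t) combination always resolves the same way,
# regardless of the surrounding PRNG state
nonidentity_dStep(t = 1, y = y_demo, parameters = demo_pars, seed = 101L)
nonidentity_dStep(t = 1, y = y_demo, parameters = demo_pars, seed = 101L)
```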
## Higher Resolution Implementation

For the identity-based model, we assume it matters *who* in particular gets
infected or vaccinated. Thus, for HBM we need to hash both *who* and *when*
for infection and recovery. We can again take advantage of a consistent number
of draws (because we always consider exposing all individuals, susceptible or
not) to avoid resetting the PRNG for the recovery draw.

We also reseed the PRNG for vaccination. This probably isn't strictly
necessary: vaccination is itself the only factor that might change
trajectories, so there are no differences in trajectory prior to considering
vaccination. However, relying on that is a very fragile arrangement: adding
any other interventions or otherwise perturbing model assumptions could easily
change traversal of the PRNG and result in unmatched vaccinee selection.

N.B., for both the matched and unmatched versions we're generally overdrawing
random number deviates. While it might be possible to reduce computational
time with fewer draws, that would entail either other computational work
(e.g. calculating who is eligible for infection) or more convoluted data
structures. Those would be plausible for the unmatched version at this model
complexity, but likely become less practical with increasing complexity.

```{r stepfunction}
#' @param t an integer scalar, the simulation time
#' @param y an integer vector, the individual states: S, I, R
#' @param parameters a list, with `transmission_p`, `recovery_p`,
#'   `vaccination_p`, and `vax_t`
#' @param salt an integer scalar, optional; if present, use hash-based matching
identity_dStep <- function(t, y, parameters, salt = NULL) {
  with(parameters, {
    N <- length(y)
    dY <- integer(N)
    incI <- 0
    # consider whether each infectious individual exposes susceptible
    # individuals
    infectable <- which(y == susceptible)
    for (infector in which(y == infectious)) {
      if (!is.null(salt)) {
        # if HBM, need to reseed the RNG for each (infector, time) pair =>
        # easier to salt => draw deviates for all individuals => slice the
        # still-susceptible ones
        hash_seed(salt, "inf", infector, t)
        infectees <- infectable[(runif(N) < transmission_p)[infectable]]
      } else {
        infectees <- infectable[sample.int(
          length(infectable), rbinom(1, length(infectable), transmission_p)
        )]
      }
      dY[infectees] <- 1
      infectable <- setdiff(infectable, infectees)
      # if this infector would recover, indicate 2 => 3 transition
      if (runif(1) < recovery_p) dY[infector] <- 1
    }
    incI <- sum((dY == 1) & (y != infectious))
    # if it's time for vaccination
    if (t == vax_t) {
      if (!is.null(salt)) {
        # if HBM, need to reseed for the vaccination event
        # include the t here - what if the model changed vax_t?
        hash_seed(salt, "vax", t)
        vaccinees <- infectable[(runif(N) < vaccination_p)[infectable]]
      } else {
        vaccinees <- infectable[sample.int(
          length(infectable), rbinom(1, length(infectable), vaccination_p)
        )]
      }
      # for people still susceptible, indicate 1 => 3 transition
      dY[vaccinees] <- 2
    }
    return(list(dY, incI))
  })
}
```
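As a quick, illustrative check (the states and parameters below are made up
for this example), a salted call does not depend on the surrounding global
PRNG state, so identical events resolve identically no matter what else has
been simulated:

```{r identdemo}
y_id_demo <- rep.int(susceptible, 50L)
y_id_demo[1:3] <- infectious
id_demo_pars <- list(
  transmission_p = 0.02, recovery_p = 0.3,
  vaccination_p = 0.05, vax_t = 5
)
# with a salt, the step is fully determined by (t, y, parameters, salt) ...
set.seed(1)
m1 <- identity_dStep(2, y_id_demo, id_demo_pars, salt = 7L)
set.seed(999)
m2 <- identity_dStep(2, y_id_demo, id_demo_pars, salt = 7L)
identical(m1, m2) # TRUE
# ... whereas unmatched draws depend on the global PRNG state
set.seed(1)
u1 <- identity_dStep(2, y_id_demo, id_demo_pars)
set.seed(999)
u2 <- identity_dStep(2, y_id_demo, id_demo_pars)
identical(u1, u2) # almost certainly FALSE
```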
# Comparison

Let's run these simulations for a population of 4000 individuals, 75 time
steps, and 500 samples. We'll consider a very small amount of vaccination -
only 3.3%.

We want to assess the cumulative averted _incidence_ of this scenario compared
to no vaccination. For the "non-matching" version, we will still match on
random number seeds; we refer to that as seed-matched-only, or SMO. Because
our model starts vaccination after some delay, we should still see absolutely
identical outcomes for SMO up to that point. However, after that point, the
RNG will see a different traversal.

We're going to discard outcomes where the epidemic dies out prior to
vaccination. This is a debate-worthy choice - e.g. if we want to understand
the potential benefit of purchasing some program, that should include the
possibility that it was unnecessary. For this model world, however, those
sorts of considerations apply equally to either HBM or SMO, and we want to
focus on the ability to resolve intervention impact when there is an epidemic.
Hence, we discard those outcomes.

```{r parameters}
# setup parameters
samplen <- 500L
pop <- 4000L
maxt <- 75L
initI <- 5L

ps <- list(
  transmission_p = 1 - exp(-0.78 / pop), # ~ 0.78 infections produced per day in a fully susceptible population
  recovery_p = 1 - exp(-0.44) # ~ 2.3 days infectious
  # together these imply R0 ~ 1.8
)

# non-intervention: set vax_t < 0, so vaccination never occurs
nonps <- c(ps, list(
  vax_t = -1, vaccination_p = NA
))

# intervention at vax_t == 10; each susceptible has a 3.3% chance of vaccination
intps <- c(ps, list(
  vax_t = 10, vaccination_p = 0.033
))

# create populations
yinit <- rep.int(susceptible, pop)
yinit[1:initI] <- infectious
yinit_nonid <- c(pop - initI, initI, 0L)

stepper <- function(
  maxtime = maxt, y0, dFUN, pars, seed, HBM
) {
  # if not HBM & a seed provided => SMO => set initial seeding
  if (!missing(seed) && !HBM) set.seed(seed)
  # reserve data structure for per-step incidence (index 1 == time 0)
  res <- integer(maxtime + 1)
  res[1] <- 1
  y <- y0
  # run the simulation
  for (i in seq_len(maxtime)) {
    dY <- if (HBM) dFUN(i, y, pars, seed) else dFUN(i, y, pars)
    y <- y + dY[[1]]
    res[i + 1] <- dY[[2]]
  }
  return(res)
}
```
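Before generating the full set of samples, here is a quick single-pair check
of the SMO claim above, using an arbitrary seed chosen for illustration: the
two scenarios produce identical incidence up to and including the vaccination
time step, after which they are free to diverge.

```{r smocheck}
demo_seed <- 20L # arbitrary, for illustration only
no_vax <- stepper(
  y0 = yinit_nonid, dFUN = nonidentity_dStep, pars = nonps,
  seed = demo_seed, HBM = FALSE
)
with_vax <- stepper(
  y0 = yinit_nonid, dFUN = nonidentity_dStep, pars = intps,
  seed = demo_seed, HBM = FALSE
)
# identical history up to (and including) the vaccination time step ...
identical(no_vax[1:(intps$vax_t + 1)], with_vax[1:(intps$vax_t + 1)])
# ... but not guaranteed to match afterwards
identical(no_vax, with_vax)
```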
# Sampling

Now we run all the samples, using the plain sample ID as the SMO seed or HBM
salt, for all the models.

```{r runsamples}
set.seed(8675309)
nonid_no_match <- system.time(
  ni_non_dt <- seq_len(samplen) |> parallel::mclapply(function(seed) {
    data.table(
      type = "NON", model = "nonID",
      not_i = cumsum(stepper(
        y0 = yinit_nonid, dFUN = nonidentity_dStep, pars = nonps, HBM = FALSE
      )),
      withi = cumsum(stepper(
        y0 = yinit_nonid, dFUN = nonidentity_dStep, pars = intps, HBM = FALSE
      ))
    )
  }, mc.cores = 5) |> rbindlist(idcol = "sample")
)

set.seed(8675309)
id_no_match <- system.time(
  wi_non_dt <- seq_len(samplen) |> parallel::mclapply(function(seed) {
    data.table(
      type = "NON", model = "ID",
      not_i = cumsum(stepper(
        y0 = yinit, dFUN = identity_dStep, pars = nonps, HBM = FALSE
      )),
      withi = cumsum(stepper(
        y0 = yinit, dFUN = identity_dStep, pars = intps, HBM = FALSE
      ))
    )
  }, mc.cores = 5) |> rbindlist(idcol = "sample")
)

nonid_seed_match_only <- system.time(
  ni_smo_dt <- seq_len(samplen) |> parallel::mclapply(function(seed) {
    data.table(
      type = "SMO", model = "nonID",
      not_i = cumsum(stepper(
        y0 = yinit_nonid, dFUN = nonidentity_dStep, pars = nonps,
        seed = seed, HBM = FALSE
      )),
      withi = cumsum(stepper(
        y0 = yinit_nonid, dFUN = nonidentity_dStep, pars = intps,
        seed = seed, HBM = FALSE
      ))
    )
  }, mc.cores = 5) |> rbindlist(idcol = "sample")
)

id_seed_match_only <- system.time(
  wi_smo_dt <- seq_len(samplen) |> parallel::mclapply(function(seed) {
    data.table(
      type = "SMO", model = "ID",
      not_i = cumsum(stepper(
        y0 = yinit, dFUN = identity_dStep, pars = nonps,
        seed = seed, HBM = FALSE
      )),
      withi = cumsum(stepper(
        y0 = yinit, dFUN = identity_dStep, pars = intps,
        seed = seed, HBM = FALSE
      ))
    )
  }, mc.cores = 5) |> rbindlist(idcol = "sample")
)

nonid_hash_based_matching <- system.time(
  ni_hbm_dt <- seq_len(samplen) |> parallel::mclapply(function(seed) {
    data.table(
      type = "HBM", model = "nonID",
      not_i = cumsum(stepper(
        y0 = yinit_nonid, dFUN = nonidentity_dStep, pars = nonps,
        seed = seed, HBM = TRUE
      )),
      withi = cumsum(stepper(
        y0 = yinit_nonid, dFUN = nonidentity_dStep, pars = intps,
        seed = seed, HBM = TRUE
      ))
    )
  }, mc.cores = 5) |> rbindlist(idcol = "sample")
)

id_hash_based_matching <- system.time(
  wi_hbm_dt <- seq_len(samplen) |> parallel::mclapply(function(seed) {
    data.table(
      type = "HBM", model = "ID",
      not_i = cumsum(stepper(
        y0 = yinit, dFUN = identity_dStep, pars = nonps,
        seed = seed, HBM = TRUE
      )),
      withi = cumsum(stepper(
        y0 = yinit, dFUN = identity_dStep, pars = intps,
        seed = seed, HBM = TRUE
      ))
    )
  }, mc.cores = 5) |> rbindlist(idcol = "sample")
)

sampledt <- rbind(ni_non_dt, wi_non_dt, ni_smo_dt, wi_smo_dt, ni_hbm_dt, wi_hbm_dt)
```

Timings:

```{r timings, echo=FALSE}
print("Non-matched (NON), non-identity:")
print(nonid_no_match)
print("Seed-Matched-Only (SMO), non-identity:")
print(nonid_seed_match_only)
print("Hash-Based Matched (HBM), non-identity:")
print(nonid_hash_based_matching)
print("Non-matched (NON), with identity:")
print(id_no_match)
print("Seed-Matched-Only (SMO), with identity:")
print(id_seed_match_only)
print("Hash-Based Matched (HBM), with identity:")
print(id_hash_based_matching)
```

Now we compute the difference in each series pair, and determine whether there
are any samples to drop:

```{r summarycalcs}
sampledt[, averted := not_i - withi]
sampledt[, time := seq_len(.N) - 1L, by = .(type, model, sample)]

keepers <- sampledt[,
  .(keep = not_i[intps$vax_t + 1L] != not_i[.N]),
  by = .(type, model, sample)
]
keepers[, .(frac_kept = sum(keep) / .N), by = .(model, type)]

reduced_dt <- sampledt[keepers, on = .(type, model, sample)][keep == TRUE]

qtiled_dt <- reduced_dt[, {
  qs <- quantile(averted, probs = c(0.25, 0.5, 0.75))
  .(lo = qs[1], md = qs[2], hi = qs[3])
}, by = .(type, model, time)]
```
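One compact way to summarize the resolution differences is the width of the
interquartile band of the averted series at the final time step, by matching
regime; a narrower band corresponds to a higher-resolution estimate of the
intervention impact:

```{r iqrwidth}
# interquartile-range width of cumulative averted incidence at the final
# time step, by model and matching regime
qtiled_dt[time == maxt, .(iqr_width = hi - lo), by = .(model, type)]
```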
...and lastly, we visualize the results: first the per-sample averted
trajectories, then the interquartile ribbons, and finally the same ribbons
with the NON comparison excluded.

```{r spaghetti, echo=FALSE, out.width="100%"}
ggplot(reduced_dt) + aes(time, averted, color = type, group = interaction(type, sample)) +
  facet_grid(. ~ model) +
  geom_line(alpha = 0.2) +
  scale_color_discrete(NULL, guide = guide_legend(override.aes = list(alpha = 1))) +
  theme_minimal() +
  theme(
    legend.position = c(1, 1) - 0.05,
    legend.justification = c(1, 1)
  )
```

```{r ribbons, echo=FALSE, out.width="100%"}
ggplot(qtiled_dt) + aes(time) +
  facet_grid(. ~ model) +
  geom_ribbon(aes(fill = type, ymax = hi, ymin = lo), alpha = 0.25) +
  geom_line(aes(color = type, y = md)) +
  scale_color_discrete(
    NULL, aesthetics = c("color", "fill")
  ) +
  theme_minimal() +
  theme(
    legend.position = c(1, 1) - 0.05,
    legend.justification = c(1, 1)
  )
```

```{r noncrazyribbons, echo=FALSE, out.width="100%"}
ggplot(qtiled_dt[type != "NON"]) + aes(time) +
  facet_grid(. ~ model) +
  geom_ribbon(aes(fill = type, ymax = hi, ymin = lo), alpha = 0.25) +
  geom_line(aes(color = type, y = md)) +
  scale_color_discrete(
    NULL, aesthetics = c("color", "fill")
  ) +
  theme_minimal() +
  theme(
    legend.position = c(1, 1) - 0.05,
    legend.justification = c(1, 1)
  )
```