--- title: "Fitting distributions using primarycensored and fitdistrplus" description: "A guide on how to fit distributions using primarycensored and fitdistrplus." author: Sam Abbott output: bookdown::html_document2: fig_caption: yes code_folding: show pkgdown: as_is: true bibliography: library.bib csl: https://raw.githubusercontent.com/citation-style-language/styles/master/apa-numeric-superscript-brackets.csl link-citations: true vignette: > %\VignetteIndexEntry{Fitting distributions using primarycensored and fitdistrplus} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- # Introduction ## What are we going to do in this Vignette In this vignette, we'll demonstrate how to use `primarycensored` in conjunction with `fitdistrplus` for fitting distributions. We'll cover the following key points: 1. Simulating censored delay distribution data 2. Fitting a naive model using `fitdistrplus` 3. Evaluating the naive model's performance 4. Fitting an improved model using `primarycensored` functionality 5. Comparing the `primarycensored` model's performance to the naive model ## What might I need to know before starting This vignette assumes some familiarity with the `fitdistrplus` package. If you are not familiar with it then you might want to start with the [Introduction to `fitdistrplus`](https://cran.r-project.org/web/packages/fitdistrplus/vignettes/fitdistrplus_vignette.html) vignette. ## Packages used in this vignette Alongside the `primarycensored` package we will use the `fitdistrplus` package for fitting distributions. We will also use the `ggplot2` package for plotting and `dplyr` for data manipulation. ```{r setup, message = FALSE} library(primarycensored) library(fitdistrplus) library(ggplot2) library(dplyr) ``` # Simulating censored and truncated delay distribution data We'll start by simulating some censored and truncated delay distribution data. We'll use the `rprimarycensored` function (actually we will use the `rpcens ` alias for brevity). ```{r sample-lognormal} set.seed(123) # For reproducibility # Define the true distribution parameters n <- 1000 shape <- 1.77 # This gives a mean of 4 and sd of 3 for a gamma distribution rate <- 0.44 # Generate fixed pwindow, swindow, and obs_time pwindows <- rep(1, n) swindows <- rep(1, n) obs_times <- rep(8, n) # Truncation at 8 # Function to generate a single sample generate_sample <- function(pwindow, swindow, obs_time) { rpcens( 1, rgamma, shape = shape, rate = rate, pwindow = pwindow, swindow = swindow, D = obs_time ) } # Generate samples samples <- mapply(generate_sample, pwindows, swindows, obs_times) # Create initial data frame delay_data <- data.frame( pwindow = pwindows, swindow = swindows, obs_time = obs_times, observed_delay = samples, observed_delay_upper = samples + swindows ) head(delay_data) # Compare the samples with and without secondary censoring to the true # distribution # Calculate empirical CDF empirical_cdf <- ecdf(samples) # Create a sequence of x values for the theoretical CDF x_seq <- seq(0, 8, length.out = 100) # Calculate theoretical CDF theoretical_cdf <- pgamma(x_seq, shape = shape, rate = rate) # Create a long format data frame for plotting cdf_data <- data.frame( x = rep(x_seq, 2), probability = c(empirical_cdf(x_seq), theoretical_cdf), type = rep(c("Observed", "Theoretical"), each = length(x_seq)), stringsAsFactors = FALSE ) # Plot ggplot(cdf_data, aes(x = x, y = probability, color = type)) + geom_step(linewidth = 1) + scale_color_manual( values = c(Observed = "#4292C6", Theoretical = "#252525") ) + geom_vline( aes(xintercept = mean(samples), color = "Observed"), linetype = "dashed", linewidth = 1 ) + geom_vline( aes(xintercept = shape / rate, color = "Theoretical"), linetype = "dashed", linewidth = 1 ) + labs( title = "Comparison of Observed vs Theoretical CDF", x = "Delay", y = "Cumulative Probability", color = "CDF Type" ) + theme_minimal() + theme( panel.grid.minor = element_blank(), plot.title = element_text(hjust = 0.5), legend.position = "bottom" ) + coord_cartesian(xlim = c(0, 8)) # Set x-axis limit to match truncation ``` # Fitting a naive model using `fitdistrplus` We first fit a naive model using the `fitdistcens()` function. This function is designed to handle secondary censored data but does not handle primary censoring or truncation without extension. ```{r fit-naive-model} fit <- delay_data |> dplyr::select(left = observed_delay, right = observed_delay_upper) |> fitdistcens( distr = "gamma", start = list(shape = 1, rate = 1) ) summary(fit) ``` We see that the naive model has fit poorly due to the primary censoring and right truncation in the data. # Fitting an improved model using `primarycensored` and `fitdistrplus` We'll now fit an improved model using the `primarycensored` package. To do this we need to define the custom distribution functions using the `primarycensored` package that are required by `fitdistrplus`. Rather than using `fitdistcens` we use `fitdist` because our functions are handling the censoring themselves. ```{r fit-improved-model} # Define custom distribution functions using primarycensored # The try catch is required by fitdistrplus dpcens_gamma <- function(x, shape, rate) { result <- tryCatch( { dprimarycensored( x, pgamma, shape = shape, rate = rate, pwindow = 1, swindow = 1, D = 8 ) }, error = function(e) { rep(NaN, length(x)) } ) return(result) } ppcens_gamma <- function(q, shape, rate) { result <- tryCatch( { pprimarycensored( q, pgamma, shape = shape, rate = rate, pwindow = 1, D = 8 ) }, error = function(e) { rep(NaN, length(q)) } ) return(result) } # Fit the model using fitdistcens with custom gamma distribution pcens_fit <- samples |> fitdist( distr = "pcens_gamma", start = list(shape = 1, rate = 1) ) summary(pcens_fit) ``` We see very good agreement between the true and estimated parameters. Rather than using `fitdist()` directly `primarycensored` provides a wrapper function `fitdistdoublecens()` that can be used to estimate double censored and truncated data. A bonus of this approach is we can specify our data using the `fitdistcens` `left` and `right` formulation and support mixed secondary censoring intervals. ```{r primarycensored-fitdistdoublecens} fitdistdoublecens_fit <- delay_data |> dplyr::select(left = observed_delay, right = observed_delay_upper) |> fitdistdoublecens( distr = "gamma", start = list(shape = 1, rate = 1), D = 8, pwindow = 1 ) summary(fitdistdoublecens_fit) ``` **Note** Currently this functionality is limited to a single pwindow, and observation time for all data. If this functionality is of interest then please open an issue on the `primarycensored` GitHub page.