
Phylogenetic Missing Data Imputation
Achaz von Hardenberg
2026-04-16
Source:vignettes/04_missing_data.Rmd
04_missing_data.RmdThe because package handles missing values (NAs)
automatically using Bayesian imputation. Unlike frequentist methods that
often require deleting rows with missing values (list-wise deletion),
because treats missing values as unknown parameters to be
estimated by the model.
When installing the because.phybase extension,
because handles missing values imputation using also the
information available in the phylogenetic correlations among species
both in the response nodes (imputing both based on the predictors as
well as the phylogenetic correlations) as well as in parent-only nodes
(Automatically treating the node as a response variable to enable
phylogenetic imputation.
Setup
First, let’s load the package and the standard rhino
dataset.
library(because.phybase)
data(rhino.dat)
data(rhino.tree)Simulating Missing Data
We will artificially introduce missing values (NA) into
the dataset to demonstrate how because handles them. We’ll
add missingness to: * LS (Litter Size): A response variable
in our model. * BM (Body Mass): A parent-only variable.
# Create a copy of the data
rhino_na <- rhino.dat
# Introduce 10% missing values in Litter Size (Response)
set.seed(123)
na_indices_ls <- sample(seq_len(nrow(rhino_na)), size = 10)
rhino_na$LS[na_indices_ls] <- NA
# Introduce 5% missing values in Body Mass (Predictor)
na_indices_bm <- sample(seq_len(nrow(rhino_na)), size = 5)
rhino_na$BM[na_indices_bm] <- NA
# Verify missingness
sum(is.na(rhino_na$LS))
sum(is.na(rhino_na$BM))The Model
We define the same structural equation model (sem8) as in the
previous vignette: * Body Mass
(BM) affects Litter Size (LS) and Nose Length
(NL). * Range Size (RS) affects Nose Length
(NL). * Nose Length (NL) affects Dispersal
Distance (DD).
sem8.eq <- list(
LS ~ BM,
NL ~ BM + RS,
DD ~ NL
)Running the Model
When we run because() with missing data, it
automatically detects the NA values and adjusts the JAGS
model structure.
sem8_na.bc <- because(
equations = sem8.eq,
data = rhino_na,
structure = rhino.tree
)
summary(sem8_na.bc)
# Mean SD Naive SE Time-series SE 2.5% 50% 97.5%
# alphaBM 1.283 0.545 0.010 0.029 0.170 1.294 2.348
# alphaDD 0.574 0.280 0.005 0.007 0.030 0.566 1.133
# alphaLS 0.346 0.407 0.007 0.015 -0.496 0.362 1.113
# alphaNL -0.014 0.330 0.006 0.013 -0.659 -0.007 0.655
# beta_DD_NL 0.538 0.076 0.001 0.001 0.394 0.539 0.687
# beta_LS_BM 0.478 0.092 0.002 0.002 0.296 0.479 0.656
# beta_NL_BM 0.457 0.070 0.001 0.002 0.316 0.458 0.595
# beta_NL_RS 0.601 0.067 0.001 0.001 0.468 0.602 0.732
# lambdaBM 0.809 0.088 0.002 0.003 0.592 0.827 0.931
# lambdaDD 0.455 0.130 0.002 0.003 0.213 0.453 0.705
# lambdaLS 0.656 0.116 0.002 0.003 0.409 0.668 0.848
# lambdaNL 0.722 0.089 0.002 0.002 0.525 0.731 0.868
# sigma_BM_phylo 1.750 0.260 0.005 0.008 1.244 1.745 2.255
# sigma_BM_res 0.812 0.135 0.002 0.004 0.581 0.799 1.099
# sigma_DD_phylo 0.793 0.171 0.003 0.004 0.497 0.781 1.158
# sigma_DD_res 0.858 0.088 0.002 0.002 0.690 0.854 1.033
# sigma_LS_phylo 1.188 0.208 0.004 0.005 0.811 1.175 1.614
# sigma_LS_res 0.837 0.113 0.002 0.002 0.629 0.829 1.067
# sigma_NL_phylo 1.037 0.150 0.003 0.004 0.772 1.029 1.366
# sigma_NL_res 0.628 0.079 0.001 0.002 0.488 0.622 0.798
# Rhat n.eff
# alphaBM 1.004 356
# alphaDD 1.000 1437
# alphaLS 1.002 734
# alphaNL 1.000 623
# beta_DD_NL 1.000 2910
# beta_LS_BM 1.000 2366
# beta_NL_BM 1.000 2198
# beta_NL_RS 1.002 2819
# lambdaBM 1.000 1152
# lambdaDD 1.000 2211
# lambdaLS 1.000 1630
# lambdaNL 1.003 1421
# sigma_BM_phylo 1.000 1183
# sigma_BM_res 1.000 1493
# sigma_DD_phylo 1.000 2275
# sigma_DD_res 1.000 2853
# sigma_LS_phylo 1.000 1742
# sigma_LS_res 1.000 2156
# sigma_NL_phylo 1.002 1307
# sigma_NL_res 1.002 2183
#
# DIC:
# Mean deviance: 884.1
# penalty 189.7
# Penalized deviance: 1074What happened automatically?
You will notice two important messages if you run the above model:
“Detected missing data in predictor-only variables: BM” JAGS cannot impute missing values in a “fixed” covariate data node. To solve this,
becauseautomatically adds an intercept-only equation for the missing predictor (e.g.,BM ~ 1). This convertsBMinto a stochastic node (a response variable) so that JAGS can sample its missing values during MCMC.“Using Latent Variable (GLMM) approach…” When variables have missing data,
becausehandles them carefully to preserve phylogenetic signal. Instead of simple imputation, it uses an element-wise likelihood that integrates the phylogenetic error structure, ensuring that the relationships are estimated using the full phylogenetic information available.
Inspecting Imputed Values
Since the missing values are now parameters in the model, we can
actually inspect their estimated values (posterior distributions). To do
this, you must run because() with the argument
monitor = "all". This ensures that the imputed stochastic
nodes are saved in the output.
You can use the helper function extract_imputed() to
easily retrieve these values:
# Run strictly monitoring 'all' parameters to get imputed values
sem8_na_monitored.bc <- because(
equations = sem8.eq,
data = rhino_na,
structure = rhino.tree,
monitor = "all"
)
# Extract posterior summaries for imputed values
imputed_values <- extract_imputed(sem8_na_monitored.bc)
# View the first few imputed values
head(imputed_values)
# Variable ID RowIndex Mean SD Q2.5 Q50
# Mean LS s14 14 -0.004710686 0.9342152 -1.8430269 0.02246534
# Mean1 LS s25 25 1.678043988 1.0829263 -0.4576528 1.73566859
# Mean2 LS s31 31 0.561576970 1.1507743 -1.6971686 0.57087689
# Mean3 LS s42 42 1.162445522 1.2150872 -1.3452741 1.13780737
# Mean4 LS s43 43 2.781955491 1.0852266 0.7474752 2.79039031
# Mean5 LS s50 50 2.762965171 1.1253402 0.5564907 2.71909773
# Q97.5
# Mean 1.740653
# Mean1 3.646009
# Mean2 2.760267
# Mean3 3.589646
# Mean4 4.957594
# Mean5 4.982451This returns a tidy data frame with the Mean,
SD, and Credible Intervals (Q2.5,
Q97.5) for each missing data point, labeled by the Species
ID.
The variable LS had missing values. In the JAGS model,
the missing values for LS are treated as latent variables.
The variable BM (missing predictor) was converted to a
response variable (BM ~ 1) and also imputed.