Multivariable Fractional Polynomials with Spike at Zero

In epidemiological and clinical studies, a continuous covariate x may have a mixture of zeros and positive values. Such semi-continuous variables are said to have a “spike” at zero in an otherwise continuous distribution. Examples include alcohol or tobacco consumption, where some individuals have no exposure while others have positive values. When x = 0 defines a distinct subpopulation, the outcome for these individuals can be represented with a binary variable Z, while positive values are modeled using a continuous fractional polynomial (FP) function.

Assume that \(x \ge 0\) for all individuals. In this approach, fractional polynomial (FP) transformations are applied only when \(x > 0\), so there is no need to shift the origin of \(x\). The spike at zero is modeled using a binary indicator \(z\).

Consider a model with index \(\eta\) defined as:

\[ \eta = \begin{cases} \beta_0 + \beta, & x = 0 \\ \beta_0 + \phi_2(x; p), & x > 0 \end{cases} \]

so that \(\eta\) is, for example an FP2 function when \(x > 0\), and when \(x = 0\), it equals the intercept plus the spike effect (\(\beta_0 + \beta\)). This makes \(\eta\) a discontinuous function of \(x\) with a possible jump at \(x = 0\).

Equivalently, \(\eta\) can be expressed as:

\[ \eta = \beta_0 + \beta z + \phi_2^+(x; p), \]

where

\[ z = \begin{cases} 1, & x = 0 \\ 0, & x > 0 \end{cases} \]

and

\[ \phi_2^+(x; p) = \begin{cases} 0, & x = 0 \\ \phi_2(x; p), & x > 0 \end{cases} \]

To select a model, a two-stage procedure has been proposed and implemented in the mfp2() function as described below.

Two-Stage Model Selection Procedure

Stage 1: Functional Form Selection

The first stage of the SAZ algorithm is analogous to the FSP of MFP, except that Z is always forced into the model and FP transformations are applied only to the positive values of x. Therefore, the analyst must choose a significance level (select) if the criterion is "pvalue" for variable selection. For function selection, a significance level (alpha) must also be specified. Other options include "aic" or "bic" if an information criterion is desired. Additionally, the maximum complexity of the FP function must be specified, with FP2 as the default. In this example, FP2 is considered the maximum complexity allowed.

The model selection procedure is sequential. Starting with the most complex model (best FP2 + Z), it is compared to the null model on 5 degrees of freedom. If the test is significant, the steps of the FSP for selecting an FP function are followed, with Z always included in the model; otherwise, the null model is selected. Subsequent comparisons evaluate the best FP2 + Z model against a simpler linear + Z model on 3 degrees of freedom, and then against the best FP1 + Z model on 2 degrees of freedom. At each stage, only models that improve the fit significantly are retained, and the binary indicator Z is always included. To ensure the spike-at-zero variable remains in the model regardless of significance, select can be set to 1, or the keep argument can be used.

Stage 2: Component Assessment

Once a functional form is selected, the contribution of each component is evaluated. If the null model was chosen in Stage 1, Stage 2 is skipped. Otherwise, the selected model (e.g., best FP2 + Z) is assessed by comparing it to two simpler models: best FP2 alone, to assess the contribution of Z, and Z alone, to assess the contribution of the FP term.

Decisions are made as follows:

  • If both comparisons are significant, both Z and the FP2 term are retained in the final model.
  • If only the comparison with FP2 alone is significant, only Z is retained, as the functional form adds no additional information.
  • If only the comparison with Z alone is significant, the FP2 term is retained while Z is excluded.
  • If neither comparison is significant, the final model is chosen based on deviance, selecting the simpler model that better fits the data.

When "aic" or "bic" is used as the criterion, the information criteria for the best FP2 + Z, best FP2 alone, and Z alone models are compared, and the model with the minimum value is selected. If Z is removed, the spike at zero plays no specific role, and a standard FP function is used for the positive values of x. If the FP function is removed, leaving only Z, the effect of the covariate is entirely captured by the binary indicator. Since Z is always included in Stage 1, its presence may influence the selection of FP powers for positive values, resulting in final FP powers that differ from those obtained in a standard FP analysis where argument zero is set to TRUE. In other words, including Z can alter the shape of the fitted FP function for positive values, because some of the variation that would otherwise be modeled by the FP function is now captured by the spike indicator.

Example: Simulation of Spike-at-Zero Data

Below is a simple simulation demonstrating SAZ modeling with a small dataset.

Simulation of Spike-at-Zero (SAZ) Data

In this section, we simulate a covariate x from a Gamma distribution with a mixture of zeros (spike) and positive values. We also generate an outcome variable y that depends on both the spike indicator Z and the positive values of x. Specifically, the outcome y combines the spike effect (beta1 * Z), the effect of positive x values (beta2 * log(x)), and random noise (epsilon).

# ================================================================ SIMPLE
# SIMULATION OF SPIKE-AT-ZERO (SAZ) MODELING
# ================================================================

set.seed(123)  # For reproducibility
n <- 500  # Total sample size
prop_zero <- 0.25  # Proportion of zeros (spike)

# Initialize covariate 'x' and assign zeros
x <- numeric(n)
zero_indices <- sample(1:n, size = round(n * prop_zero))
x[zero_indices] <- 0

# Assign positive values to the remaining observations
x_pos <- rgamma(n - length(zero_indices), shape = 2, rate = 0.5)
x[-zero_indices] <- x_pos

# Create spike indicator Z
Z <- ifelse(x == 0, 1, 0)

# Outcome parameters
intercept <- 1
beta1 <- 1.5  # effect of the spike
beta2 <- 2  # effect of positive x values
sigma <- 1  # standard deviation of random noise

# Generate outcome
epsilon <- rnorm(n, mean = 0, sd = sigma)
y <- beta1 * Z + beta2 * log(ifelse(x > 0, x, 1)) + epsilon

# Combine into a data frame
df1 <- data.frame(y = y, x = x, Z = Z)

# Check the number of spike vs non-spike observations
table(df1$Z)
#> 
#>   0   1 
#> 375 125

Fractional Polynomial Analysis with SAZ

Next, we fit a fractional polynomial (FP) model using the mfp2() function, which accounts for the spike at zero while modeling the positive values of x with an FP function. Setting fp(x, spike = TRUE) ensures that FP transformations are applied only to positive values of x, while also assessing whether the spike indicator Z is needed. Specifying spike = TRUE triggers the two-stage SAZ algorithm.

The argument criterion = "pvalue" selects the FP powers using sequential testing. The maximum complexity of the FP function allowed is FP2, indicated by df = 4. The argument alpha sets the significance level for function selection, while select sets the significance level for variable selection in Stages 1 and 2 of the algorithm.

We can visualize the fitted FP function using fracplot():

  • partial_only = FALSE shows the partial predictor plus residuals, including the effect of the spike.
  • partial_only = TRUE shows only the FP function for positive x values along with the estimated contribution of the spike, \(\hat{\beta}_0 + \hat{\beta}\), from Z.
# Fit a fractional polynomial (FP) model with spike-at-zero
fit1 <- mfp2(y ~ fp(x, spike = TRUE, center = FALSE), data = df1, criterion = "pvalue",
    df = 4, select = 0.05, alpha = 0.05)
#> 
#> i Initial degrees of freedom:
#>    x
#> df 4
#> 
#> i Visiting order: x
#> 
#> ---------------------
#> i Running MFP Cycle 1
#> ---------------------
#> 
#> Variable: x (keep = FALSE, spike = TRUE)
#> Stage 1 of Spike at Zero Algorithm:
#>                  Powers    DF    Deviance    Versus           Deviance diff. P-value
#> FP2 + Binary     0, 3      7     1435.4      .                .              .      
#> null             NA        2     1942.7      FP2 + Binary     507.3          0.0000 
#> linear + Binary  1         4     1577.5      FP2 + Binary     142.1          0.0000 
#> FP1 + Binary     0         5     1435.6      FP2 + Binary     0.2            0.9042 
#> Selected: FP1 + Binary
#> Stage 2 of Spike at Zero Algorithm: 
#>              Powers    DF    Deviance    Versus           Deviance diff. P-value
#> FP1 + Binary 0         5     1435.6      .                .              .      
#> FP1          0         4     1826.5      FP1 + Binary     390.9          0.0000 
#> Binary       1         3     1917.0      FP1 + Binary     481.4          0.0000 
#> Selected: FP1 + Binary
#> 
#> ---------------------
#> i Running MFP Cycle 2
#> ---------------------
#> 
#> Variable: x (keep = FALSE, spike = TRUE)
#> Stage 1 of Spike at Zero Algorithm:
#>                  Powers    DF    Deviance    Versus           Deviance diff. P-value
#> FP2 + Binary     0, 3      7     1435.4      .                .              .      
#> null             NA        2     1942.7      FP2 + Binary     507.3          0.0000 
#> linear + Binary  1         4     1577.5      FP2 + Binary     142.1          0.0000 
#> FP1 + Binary     0         5     1435.6      FP2 + Binary     0.2            0.9042 
#> Selected: FP1 + Binary
#> Stage 2 of Spike at Zero Algorithm: 
#>              Powers    DF    Deviance    Versus           Deviance diff. P-value
#> FP1 + Binary 0         5     1435.6      .                .              .      
#> FP1          0         4     1826.5      FP1 + Binary     390.9          0.0000 
#> Binary       1         3     1917.0      FP1 + Binary     481.4          0.0000 
#> Selected: FP1 + Binary
#> 
#> i Fractional polynomial fitting algorithm converged after 2 cycles.

# Histogram of x including spike at zero
ggplot(df1, aes(x = x)) + geom_histogram(binwidth = 0.5, fill = "skyblue", color = "black") +
    geom_vline(xintercept = 0, linetype = "dashed", color = "red") + labs(title = "Histogram of Covariate x",
    x = "x values", y = "Frequency") + theme_minimal()


# Visualize the fitted model Full fit including spike effect
fracplot(fit1, partial_only = FALSE, color_line = "red", linewidth = 1)
#> $x


# FP effect only (positive x values)
fracplot(fit1, partial_only = TRUE, color_line = "red", linewidth = 1)
#> $x

References

  • Ambler, G. and Royston, P. (2001). Fractional polynomial model selection procedures: investigation of type I error rate. J. Stat. Comput. Simul., 69, 89–108.

  • Becher, H., Lorenz, E., Royston, P., and Sauerbrei, W. (2012). Analysing covariates with spike at zero: a modified FP procedure and conceptual issues. Biom. J., 54 (5), 686–700.

  • Binder, H. and Sauerbrei, W. (2010). Adding local components to global functions for continuous covariates in multivariable regression modeling. Stat. Med., 29, 808–817.

  • Binder, H., Sauerbrei, W., and Royston, P. (2013). Comparison between splines and fractional polynomials for multivariable model building with continuous covariates: a simulation study with continuous response. Stat. Med., 32, 2262–2277.

  • Buchholz, A. and Sauerbrei, W. (2011). Comparison of procedures to assess non-linear and time-varying effects in multivariable models for survival data. Biom. J., 53, 308–331. DOI: 10.1002/bimj.201000159.