Title: | Multivariable Fractional Polynomial Models with Extensions |
---|---|
Description: | Multivariable fractional polynomial algorithm simultaneously selects variables and functional forms in both generalized linear models and Cox proportional hazard models. Key references are Royston and Altman (1994) <doi:10.2307/2986270> and Royston and Sauerbrei (2008, ISBN:978-0-470-02842-1). In addition, it can model a sigmoid relationship between variable x and an outcome variable y using the approximate cumulative distribution transformation proposed by Royston (2014) <doi:10.1177/1536867X1401400206>. This feature distinguishes it from a standard fractional polynomial function, which lacks the ability to achieve such modeling. |
Authors: | Edwin Kipruto [aut, cre], Michael Kammer [aut], Patrick Royston [aut], Willi Sauerbrei [aut] |
Maintainer: | Edwin Kipruto <[email protected]> |
License: | GPL-3 |
Version: | 1.0.2 |
Built: | 2025-02-15 04:55:28 UTC |
Source: | https://github.com/edwinkipruto/mfp2 |
Applies the acd transformation as outlined in Royston (2014) and Royston and
Sauerbrei (2016).
Designed to work with the output of fit_acd()
, Please refer to the corresponding
documentation for more details.
apply_acd(x, beta0, beta1, power, shift, scale, ...)
apply_acd(x, beta0, beta1, power, shift, scale, ...)
x |
a numeric vector. |
beta0 , beta1
|
each a numeric value, representing the coefficients of the FP1 model for the ACD transformation. |
power |
a numeric value, estimated power to be used in the FP1 model for the ACD transformation. |
shift |
a numeric value that is used to shift the values of |
scale |
a numeric value used to scale |
... |
not used. |
The transformed input vector x
.
A function that is used to shift x values to positive values if it contains negative or zero values.If all values of x are positive then the original values of x is returned without shifting but scaled if the scaling factor is not equal to 1. If x has already been shifted and scaled then the function does nothing.
apply_shift_scale(x, scale = NULL, shift = NULL)
apply_shift_scale(x, scale = NULL, shift = NULL)
x |
A vector of predictor variable |
scale |
scaling factors for x of interest. Must be positive integers. Default is NULL and scaling factors are automatically estimated using find_scale_factor() function else it uses user supplied scaling factors. If no scaling is needed just use scale = 1 |
shift |
adjustment factors required for shifting x to positive values. Default is NULL and adjustment factors are estimated automatically using find_shift_factor() function |
A numeric value that has been shifted and scaled.
x = 1:1000 apply_shift_scale(x)
x = 1:1000 apply_shift_scale(x)
The ART data set mimics the GBSG breast cancer study in terms of the distribution of predictors and correlation structure.
data(art)
data(art)
The dataset has 250 observations and 10 covariates
Continuous response variable.
Continuous covariates.
Binary variable.
Ordinal variable with 3 levels.
Binary variable.
Nominal variable with 3 levels.
Determine the number of unique values in a variable. To be used in mfp2()
.
assign_df(x, df_default = 4)
assign_df(x, df_default = 4)
x |
input matrix. |
df_default |
default df to be used. Default is 4. |
Variables with fewer than or equal to three unique values, for example, will be assigned df = 1. df = 2 will be assigned to variables with 4-5 unique values, and df = 4 will be assigned to variables with unique values greater than or equal to 6.
Vector of length ncol(x)
with degrees of freedom for each variable in x
.
x <- matrix(1:100, nrow = 10) assign_df(x)
x <- matrix(1:100, nrow = 10) assign_df(x)
To be used in fit_mfp()
.
calculate_df(p)
calculate_df(p)
p |
power of a variable. |
An example calculation: if p is the power(s) and p = c(1,2), then df = 4 but if p = NA then df = 0.
returns numeric value denoting the number of degrees of freedom (df).
Alternative to likelihood ratio tests in normal / Gaussian error models.
calculate_f_test(deviances, dfs_resid, n_obs, d1 = NULL)
calculate_f_test(deviances, dfs_resid, n_obs, d1 = NULL)
deviances |
a numeric vector of length 2 with deviances. Typically
ordered in increasing order (i.e. null model first, then full model) and
used to test the difference |
dfs_resid |
a numeric vector with residual degrees of freedom. |
n_obs |
a numeric value with the number of observations. |
d1 |
a numeric value giving |
Uses formula on page 23 from here: https://www.stata.com/manuals/rfp.pdf:
where refers to deviances of two models 1 and 2.
is the number of additional parameters used in in model 2 as
compared to model 1, i.e.
dfs_resid[1] - dfs_resid[2]
.
is the number of residual degrees of freedom minus the number of
estimated powers for model 2, i.e.
dfs_resid[2]
.
#' The p-value then results from the use of a F-distribution with
(d1, d2) degrees of freedom.
Note that this computation is completely equivalent to the computation of a F-test using sum of squared errors as in e.g. Kutner at al. (2004), p 263. The formula there is given as
where the terms refer to residual degrees of freedom, and
and
to the reduced (model 1) and full model (model 2), respectively.
A list with three entries giving the test statistic and p-value for the F-test
for the comparison of deviance[1]
to deviance[2]
.
statistic
: test statistic.
pvalue
: p-value.
dev_diff
: difference in deviances tested.
Kutner, M.H., et al., 2004. Applied linear statistical models. McGraw-Hill Irwin.
Function to calculate p-values for likelihood-ratio test
calculate_lr_test(logl, dfs)
calculate_lr_test(logl, dfs)
logl |
a numeric vector of length 2 with log-likelihoods. Typically
ordered in increasing order (i.e. null model first, then full model) and
used to test the ratio |
dfs |
a numeric vector with degrees of freedom. |
Uses Wilk's theorem that -2log(LR) (LR = likelihood ratio) asymptotically approaches a Chi-square distribution under the null hypothesis that both likelihoods are equal.
Model likelihoods can then be compared by computing D = -2 log(likelihood reduced model / likelihood full model), and then use a Chi-square distribution with df_full - df_reduced degrees of freedom to derive a p-value.
This is basically the same way as stats::anova()
implements the
likelihood ratio test.
A list with two entries for the likelihood ratio test for the ratio
logl[1] / logl[2]
.
statistic
: test statistic.
pvalue
: p-value
mfp2
Mostly used within an mfp step to compare between the different fp models of a variable.
calculate_model_metrics(obj, n_obs, df_additional = 0)
calculate_model_metrics(obj, n_obs, df_additional = 0)
obj |
a list returned by |
n_obs |
a numeric value indicating the number of observations for the
data used to fit |
df_additional |
a numeric value indicating the number of additional degrees of freedom to be accounted for in the computations of AIC and BIC. These may be necessary when a model uses FP terms, as these add another degree of freedom per estimated power. |
A numeric vector with the following entries:
df
: number of degrees of freedom of model (i.e. coefficients plus
df_additional
).
deviance_rs
: "deviance", i.e. minus twice the log likelihood.
This is not the usual definition of deviance used by R, which is defined as
twice the difference between the log likelihoods of the saturated model
(one parameter per observation) and the null (or reduced) model.
It is, however, the definition used in Royston and Sauerbrei (2008) and in
mfp
. For selection of fps this does not really play a role, as the common
factor would be cancelled anyway when comparing models based on deviances.
sse
: sum of squared residuals as returned by fit_model()
.
deviance_gaussian
: deviance computed by deviance_gaussian()
,
applicable to Gaussian models and used for F-test computations.
aic
: Akaike information criterion, defined as
-2logL + 2(df + df_additional)
.
bic
: Bayesian information criterion, defined as
-2logL + log(n_obs)(df + df_additional)
.
df_resid
: residual degrees of freedom, defined as n_obs - df
.
For consistency with stata we subtract the scale parameter from df
.
Royston, P. and Sauerbrei, W., 2008. Multivariable Model - Building:
A Pragmatic Approach to Regression Anaylsis based on Fractional Polynomials
for Modelling Continuous Variables. John Wiley & Sons.
This function takes a list x
containing fractional polynomial powers for all variables
and calculates the total number of powers across the variables.
calculate_number_fp_powers(x)
calculate_number_fp_powers(x)
x |
A list of fractional polynomial powers for all variables. |
Numeric value denoting total number of fractional polynomial powers in the adjustment variables.
To be used in predict.mfp2()
.
calculate_standard_error(model, X, xref = NULL)
calculate_standard_error(model, X, xref = NULL)
model |
fitted |
X |
transformed input matrix with variables of interest for partial predictor. |
xref |
transformed reference value for variable of interest. Default is
|
See pages 91-92 and following in the book by Royston and Sauerbrei 2008 for the formulas and mathematical details.
Standard error.
Royston, P. and Sauerbrei, W., 2008. Multivariable Model - Building:
A Pragmatic Approach to Regression Anaylsis based on Fractional Polynomials
for Modelling Continuous Variables. John Wiley & Sons.
Simple function to center data
center_matrix(mat, centers = NULL)
center_matrix(mat, centers = NULL)
mat |
a transformed data matrix. |
centers |
a vector of centering values. Length must be equal to the
number of columns in |
Centering is done by means for continuous variables (i.e. more than 2 distinct values), and the minimum for binary variables.
It is assumed all categorical variables in the data are represented by binary dummy variables.
Transformed data matrix. Has an attribute scaled:center
that stores
values used for centering.
mat = matrix(1:100, nrow = 10) center_matrix(mat)
mat = matrix(1:100, nrow = 10) center_matrix(mat)
mfp2
This function is a method for the generic stats::coef()
function for
objects of class mfp2
.
## S3 method for class 'mfp2' coef(object, ...)
## S3 method for class 'mfp2' coef(object, ...)
object |
an object of class |
... |
not used. |
Named numeric vector of coefficients extracted from the model object
.
To be used in fit_mfp()
.
convert_powers_list_to_matrix(power_list)
convert_powers_list_to_matrix(power_list)
power_list |
list of powers created in |
a matrix.
Simple function to create dummy variables for ordinal and nominal variables
create_dummy_variables( data, var_ordinal = NULL, var_nominal = NULL, drop_variables = FALSE )
create_dummy_variables( data, var_ordinal = NULL, var_nominal = NULL, drop_variables = FALSE )
data |
A dataframe containing the ordinal variable. |
var_ordinal |
Names of ordinal variables in the data for which dummy variables should be created. |
var_nominal |
Names of nominal variables in the data for which dummy variables should be created. |
drop_variables |
Specifies whether to drop the original variables after dummy variables have been created. The default value is FALSE, and the original variables are kept in the data. |
This function creates dummy variables based on ordinal and categorical coding described in the Royston and Sauerbrei (2008) book (Chapter 3, Table 3.5). It uses the levels of the categorical variable if they exist; otherwise, it will extract the unique values of the variable, sort them, and use them as levels. We recommend that the user sets the levels of categorical variables and specifies their reference group. You can use the factor() function in base R. If the levels are 1, 2, and 3, then 1 will be the reference group. On the other hand, if the levels are 3, 2, and 1, then 3 will be the reference group. In brief, the first level will be taken as the reference group.
A dataframe with new dummy variables.
data("gbsg") # create dummy variable for grade using ordinal coding gbsg <- create_dummy_variables(gbsg, var_ordinal = "grade", drop_variables = TRUE) head(gbsg)
data("gbsg") # create dummy variable for grade using ordinal coding gbsg <- create_dummy_variables(gbsg, var_ordinal = "grade", drop_variables = TRUE) head(gbsg)
To be used in fit_mfp()
.
create_fp_terms(fp_powers, acdx, df, select, alpha, criterion)
create_fp_terms(fp_powers, acdx, df, select, alpha, criterion)
fp_powers |
powers of the created FP terms. |
acdx |
a logical vector of length nvars indicating which continuous variables should undergo the approximate cumulative distribution (ACD) transformation. |
df |
a numeric vector of length nvars of degrees of freedom. |
select |
a numeric vector of length nvars indicating significance levels for backward elimination. |
alpha |
a numeric vector of length nvars indicating significance levels for tests between FP models of different degrees. |
criterion |
a character string defining the criterion used to select variables and FP models of different degrees. |
Dataframe with overview of all fp terms. Each row represents a variable,
with rownames giving the name of the variable. Variables with acd
transformation are prefixed by A_
by the print
and summary
methods.
The dataframe comprises the following columns:
df_initial
: initial degrees of freedom.
select
: significance level for backward elimination.
alpha
: significance level for fractional polyomial terms.
selected
: logical value encoding presence in the model.
df_final
: final estimated degrees of freedom.
acd
: logical value encoding use of ACD transformation.
powerN
: one or more columns with the final estimated fp powers (numbered
1 to N).
Deviance computations as used in mfp in stata
deviance_gaussian(rss, weights, n)
deviance_gaussian(rss, weights, n)
rss |
residual sum of squares. |
weights |
numeric vector of weights used in computation of |
n |
number of observations used to compute |
Note that this is not the usual formula of deviance used in R, but uses the formula found here https://www.stata.com/manuals/rfp.pdf.
It can be applied for normal error models, but should not be used for other kinds of glms.
A numeric value representing the deviance of a Gaussian model.
Used to make sure dimensions of matrix rows match.
ensure_length(x, size, fill = NA)
ensure_length(x, size, fill = NA)
x |
input vector or matrix. |
size |
length or size of |
fill |
value to fill in if |
This function estimates the best FP functions for all predictors in the
current cycle. To be used in fit_mfp()
.
find_best_fp_cycle( x, y, powers_current, df, weights, offset, family, criterion, select, alpha, keep, powers, method, strata, verbose, ftest, control, rownames, nocenter, acdx )
find_best_fp_cycle( x, y, powers_current, df, weights, offset, family, criterion, select, alpha, keep, powers, method, strata, verbose, ftest, control, rownames, nocenter, acdx )
x |
an input matrix of dimensions nobs x nvars. Does not contain intercept, but columns are already expanded into dummy variables as necessary. Data are assumed to be shifted and scaled. |
y |
a vector for the response variable or a |
powers_current |
a list of length equal to the number of variables,
indicating the fp powers to be used in the current step for all variables
(except |
df |
a numeric vector of length nvars of degrees of freedom. |
weights |
a vector of observation weights of length nobs. |
offset |
a vector of length nobs of offsets. |
family |
a character string representing a family object. |
criterion |
a character string defining the criterion used to select variables and FP models of different degrees. |
select |
a numeric vector of length nvars indicating significance levels for backward elimination. |
alpha |
a numeric vector of length nvars indicating significance levels for tests between FP models of different degrees. |
keep |
a character vector with names of variables to be kept in the model. |
powers |
a named list of numeric values that sets the permitted FP powers for each covariate. |
method |
a character string specifying the method for tie handling in Cox regression model. |
strata |
a factor of all possible combinations of stratification
variables. Returned from |
verbose |
a logical; run in verbose mode. |
ftest |
a logical indicating the use of the F-test for Gaussian models. |
control |
a list with parameters for model fit. See |
rownames |
passed to |
nocenter |
a numeric vector with a list of values for fitting Cox
models. See |
acdx |
a logical vector of length nvars indicating which continuous variables should undergo the approximate cumulative distribution (ACD) transformation. |
A cycle is defined as a complete pass through all the predictors in the input
matrix x
, while a step is defined as the assessment of a single predictor.
This algorithm is described in Sauerbrei et al. (2006) and given in detail
in Royston and Sauerbrei (2008), in particular chapter 6.
Briefly, a cycle works as follows: it takes as input the data matrix along with
a set of current best fp powers for each variable. In each step, the fp
powers of a single covariate are assessed, while adjusting for other
covariates. Adjustment variables are transformed using their current
fp powers (this is done in transform_data_step()
) and the fp powers
of the variable of interest are tested using the closed test procedure
(conducted in find_best_fp_step()
).
Some of the adjustment variables may have their fp power set to NA
,
which means they were not selected from the working model and are not used
in that step. The results from all steps are returned, completing a cycle.
Note that in each cycle every variable is evaluated.This includes variables that may have been eliminated in previous cycles. They will re-enter each new cycle for potential inclusion in the working model or to be re-evaluated for elimination.
The current adjustment set is always given through the current fp powers,
which are updated in each step (denoted as powers_current
).
current FP powers
Royston, P. and Sauerbrei, W., 2008. Multivariable Model - Building:
A Pragmatic Approach to Regression Anaylsis based on Fractional Polynomials
for Modelling Continuous Variables. John Wiley & Sons.
Sauerbrei, W., Meier-Hirmer, C., Benner, A. and Royston, P., 2006.
Multivariable regression model building by using fractional
polynomials: Description of SAS, STATA and R programs.
Comput Stat Data Anal, 50(12): 3464-85.
Sauerbrei, W. and Royston, P., 1999. Building multivariable prognostic
and diagnostic models: transformation of the predictors by using fractional
polynomials. J Roy Stat Soc a Sta, 162:71-94.
See mfp2()
for a brief summary on the notation used here and
fit_mfp()
for an overview of the fitting procedure.
find_best_fp_step( x, y, xi, weights, offset, df, powers_current, family, criterion, select, alpha, keep, powers, method, strata, nocenter, acdx, ftest, control, rownames, verbose )
find_best_fp_step( x, y, xi, weights, offset, df, powers_current, family, criterion, select, alpha, keep, powers, method, strata, nocenter, acdx, ftest, control, rownames, verbose )
x |
an input matrix of dimensions nobs x nvars. Does not contain intercept, but columns are already expanded into dummy variables as necessary. Data are assumed to be shifted and scaled. |
y |
a vector for the response variable or a |
xi |
a character string indicating the name of the current variable of interest, for which the best fractional polynomial transformation is to be estimated in the current step. |
weights |
a vector of observation weights of length nobs. |
offset |
a vector of length nobs of offsets. |
df |
a numeric vector indicating the maximum degrees of freedom for the
variable of interest |
powers_current |
a list of length equal to the number of variables,
indicating the fp powers to be used in the current step for all variables
(except |
family |
a character string representing a family object. |
criterion |
a character string defining the criterion used to select variables and FP models of different degrees. |
select |
a numeric value indicating the significance level
for backward elimination of |
alpha |
a numeric value indicating the significance level
for tests between FP models of different degrees for |
keep |
a character vector with names of variables to be kept in the model. |
powers |
a named list of numeric values that sets the permitted FP powers for each covariate. |
method |
a character string specifying the method for tie handling in Cox regression. |
strata |
a factor of all possible combinations of stratification
variables. Returned from |
nocenter |
a numeric vector with a list of values for fitting Cox
models. See |
acdx |
a logical vector of length nvars indicating continuous variables to undergo the approximate cumulative distribution (ACD) transformation. |
ftest |
a logical indicating the use of the F-test for Gaussian models. |
control |
a list with parameters for model fit. |
rownames |
a parameter for Cox models. |
verbose |
a logical; run in verbose mode. |
The function selection procedure (FSP) is used if the p-value criterion is chosen, whereas the criteria AIC and BIC select the model with the smallest AIC and BIC, respectively.
It uses transformations for all other variables to assess the FP form of the current variable of interest. This function covers three main use cases:
the linear case (df = 1
) to test between null and linear models (see
select_linear()
). This step differs from the mfp case because
linear models only use 1 df, while estimation of (every) fp power adds
another df. This is also the case applied for categorical variables for
which df
are set to 1.
the case that an acd transformation is requested (acdx
is TRUE
for xi
) for the variable of interest (see find_best_fpm_step()
).
the (usual) case of the normal mfp algorithm to assess non-linear
functional forms (see find_best_fpm_step()
).
Note that these cases do not encompass the setting that a variable is not
selected, because the evaluation is done for each variable in each cycle.
A variable which was de-selected in earlier cycles may be added to the
working model again. Also see find_best_fp_cycle()
.
The adjustment in each step uses the current fp powers given in
powers_current
for all other variables to determine the adjustment set
and transformations in the working model.
Note that the algorithm starts by setting all df = 1
, and higher fps
are evaluated in turn starting from the first step in the first cycle.
A numeric vector indicating the best powers for xi
. Entries can be
NA
if variable is to be removed from the working model. Note that this
vector may include up to two NA
entries when ACD transformation is
requested, but otherwise is either a vector with all numeric entries, or a
single NA
.
There are 3 criteria to decide for the current best functional form of a continuous variable.
The first option for criterion = "pvalue"
is the function selection
procedure as outlined in e.g. Chapters 4 and 6 of Royston and
Sauerbrei (2008), also abbreviated as "RA2".
It is a closed testing procedure and is implemented in select_ra2()
and
extended for ACD transformation in select_ra2_acd()
according to
Royston and Sauerbrei (2016).
For the other criteria aic
and bic
all FP models up to the desired degree
are fitted and the model with the lowest value for the information criteria
is chosen as the final one. This is implemented in select_ic()
.
Royston, P. and Sauerbrei, W., 2008. Multivariable Model - Building:
A Pragmatic Approach to Regression Anaylsis based on Fractional Polynomials
for Modelling Continuous Variables. John Wiley & Sons.
Royston, P. and Sauerbrei, W., 2016. mfpa: Extension of mfp using the ACD covariate transformation for enhanced parametric multivariable modeling. The Stata Journal, 16(1), pp.72-87.
To be used in fit_acd()
.
find_best_fp1_for_acd(x, y, powers)
find_best_fp1_for_acd(x, y, powers)
x |
a numeric vector. |
y |
normal cdf of rank transform of |
powers |
a vector of allowed FP powers. The default value is |
The best FP power with smallest deviance and the fitted model.
Handles the FP1 and the higher order FP cases. For parameter definitions, see
find_best_fp_step()
.
find_best_fpm_step(x, xi, degree, y, powers_current, powers, acdx, ...)
find_best_fpm_step(x, xi, degree, y, powers_current, powers, acdx, ...)
x |
an input matrix of dimensions nobs x nvars. Does not contain intercept, but columns are already expanded into dummy variables as necessary. Data are assumed to be shifted and scaled. |
xi |
a character string indicating the name of the current variable of interest, for which the best fractional polynomial transformation is to be estimated in the current step. |
degree |
degrees of freedom for fp transformation of |
y |
a vector for the response variable or a |
powers_current |
a list of length equal to the number of variables,
indicating the fp powers to be used in the current step for all variables
(except |
powers |
a named list of numeric values that sets the permitted FP powers for each covariate. |
acdx |
a logical vector of length nvars indicating continuous variables to undergo the approximate cumulative distribution (ACD) transformation. |
... |
passed to |
The "best" model is determined by the highest likelihood (or smallest deviance by our definition as minus twice the log-likelihood). This is also the case for the use of information criteria, as all models investigated in this function have the same df, so the penalization term is equal for all models and only their likelihoods differ.
Note that the estimation of each fp power adds a degree of freedom. Thus, all fp1s have 2 df, all fp2s have 4 df and so on.
In the case that degree = 1
, the linear model (fp power of 1) is NOT
returned, as it is not considered to be a fractional polynomial in this
algorithm.
A linear model has only one df, whereas the same function regarded as fp
would have 2 fp.
A list with several components:
acd
: logical indicating if an ACD transformation was applied for xi
.
powers
: fp powers investigated in step.
power_best
: the best power found. power_best
will always be a
two-column matrix when an ACD transformation is used, otherwise the number
of columns will depend on degree
.
metrics
: a matrix with performance indices for all models investigated.
Same number of rows as, and indexed by, powers
.
model_best
: row index of best model in metrics
.
This function also handles the case of ACD transformations if acdx
is set
to TRUE
for xi
. In this case, if degree = 1
, then 7 models are
assessed (like for the non-acd case it excludes the linear case),
and if degree = 2
, then 64 models are assessed (unlike the 36 models
for non-acd transformation). Other settings for degree
are currently not
supported when used with ACD transformations.
Function that calculates an integer used to scale predictor
find_scale_factor(x)
find_scale_factor(x)
x |
a numeric vector already shifted to positive values (see
#' @examples x = 1:1000 find_scale_factor(x) |
For details on why scaling is useful see the corresponding section in the
documentation of mfp2()
.
The determination of the scaling factor is independent (i.e. not affected by) shifts in the input data, as it only depends on the range of the input data.
Note that the estimation of powers is unaffected by scaling, the same powers are found for scaled input data. In extreme cases scaling is necessary to preserve accuracy, see Royston and Sauerbrei (2008). This formula uses the scaling formula from Section 4.11.1 of Royston and Sauerbrei (2008). Further information can also be found in the Stata manual for mfp at https://www.stata.com/manuals/rfp.pdf.
An integer that can be used to scale x
to a reasonable range. For binary
variables 1 is returned.
Royston, P., and Sauerbrei, W., 2008. Multivariable Model - Building: A Pragmatic Approach to Regression Anaylsis based on Fractional Polynomials for Modelling Continuous Variables. John Wiley & Sons.
Function that calculates a value used to shift predictor
find_shift_factor(x)
find_shift_factor(x)
x |
a numeric vector. |
For details on why shifting is necessary see the corresponding section in the
documentation of mfp2()
.
This function implements the formula in Section 4.7 of Royston and Sauerbrei (2008).
A numeric value that can be used to shift x
to positive values.
If all values are positive, or if x
is binary then 0 is returned.
Royston, P., and Sauerbrei, W., 2008. Multivariable Model - Building: A Pragmatic Approach to Regression Anaylsis based on Fractional Polynomials for Modelling Continuous Variables. John Wiley & Sons.
x = 1:1000 find_shift_factor(x)
x = 1:1000 find_shift_factor(x)
Fits ACD transformation as outlined in Royston (2014). The ACD transformation smoothly maps the observed distribution of a continuous covariate x onto one scale, namely, that of an approximate uniform distribution on the interval (0, 1).
fit_acd(x, powers = NULL, shift = 0, scale = 1)
fit_acd(x, powers = NULL, shift = 0, scale = 1)
x |
a numeric vector. |
powers |
a vector of allowed FP powers. The default value is |
shift |
a numeric that is used to shift the values of |
scale |
a numeric used to scale |
Briefly, the estimation works as follows. First, the input data are shifted to positive values and scaled as requested. Then
is computed, where is the number of elements in
x
,
with ties in the ranks handled as averages. To approximate ,
an FP1 model (least squares) is used, i.e.
, where
is chosen such that it
provides the best fitting model among all possible FP1 models.
The ACD transformation is then given as
where the fitted values of the estimated model are used.
If the relationship between a response Y and acd(x) is linear,
say, , the relationship between Y
and x is nonlinear and is typically sigmoid in shape.
The parameters
and
in such a model are
interpreted as the expected values of Y at the minimum and maximum of x,
that is, at acd(x) = 0 and 1, respectively.
The parameter
represents the range of predictions of
across the whole observed distribution of x (Royston 2014).
A list is returned with components
acd
: the acd transformed input data.
beta0
: intercept of estimated model.
beta1
: coefficient of estimated model.
power
: estimated power.
shift
: shift value used for computations.
scale
: scaling factor used for computations.
Royston, P. and Sauerbrei, W. (2016). mfpa: Extension of mfp using the ACD covariate transformation for enhanced parametric multivariable modeling. The Stata Journal, 16(1), pp.72-87.
Royston, P. (2014). A smooth covariate rank transformation for use in regression models with a sigmoid dose–response function. The Stata Journal, 14(2), 329-341.
set.seed(42) x = apply_shift_scale(rnorm(100)) y = rnorm(100) fit_acd(x, y)
set.seed(42) x = apply_shift_scale(rnorm(100)) y = rnorm(100) fit_acd(x, y)
Function that fits Cox proportional hazards models
fit_cox( x, y, strata, weights, offset, control, method, rownames, nocenter, fast = TRUE )
fit_cox( x, y, strata, weights, offset, control, method, rownames, nocenter, fast = TRUE )
x |
a matrix of predictors excluding intercept with nobs observations. |
y |
a |
strata , control , rownames , nocenter
|
passed to |
weights |
a numeric vector of length nobs of 'prior weights' to be used in the fitting process. |
offset |
a numeric vector of length nobs of of a priori known component to be included in the linear predictor during fitting. |
method |
a character string specifying the method for tie handling.
See |
fast |
a logical which determines how the model is fitted. The default
|
A list with the following components:
logl
: the log likelihood of the fitted model.
coefficients
: regression coefficients.
df
: number of parameters (degrees of freedom).
sse
: residual sum of squares (not used).
fit
: the fitted model object.
Function that fits generalized linear models
fit_glm(x, y, family, weights, offset, fast = TRUE)
fit_glm(x, y, family, weights, offset, fast = TRUE)
x |
a matrix of predictors with nobs observations. |
y |
a vector for the outcome variable. |
family |
a family function e.g. |
weights |
a numeric vector of length nobs of 'prior weights' to be used
in the fitting process. see |
offset |
a numeric vector of length nobs of of a priori known component to be included in the linear predictor during fitting. |
fast |
a logical which determines how the model is fitted. The default
|
A list with the following components:
logl
: the log likelihood of the fitted model.
coefficients
: regression coefficients.
df
: number of parameters (degrees of freedom).
sse
: residual sum of squares.
fit
: the fitted model object.
"Linear" model here refers to a model which includes the variable
of interest xi
with a fp power of 1. Note that xi
may be ACD transformed
if indicated by acdx[xi]
.
For parameter definitions, see find_best_fp_step()
. All parameters
captured by ...
are passed on to fit_model()
.
fit_linear_step(x, xi, y, powers_current, powers, acdx, ...)
fit_linear_step(x, xi, y, powers_current, powers, acdx, ...)
x |
an input matrix of dimensions nobs x nvars. Does not contain intercept, but columns are already expanded into dummy variables as necessary. Data are assumed to be shifted and scaled. |
xi |
a character string indicating the name of the current variable of interest, for which the best fractional polynomial transformation is to be estimated in the current step. |
y |
a vector for the response variable or a |
powers_current |
a list of length equal to the number of variables,
indicating the fp powers to be used in the current step for all variables
(except |
powers |
a named list of numeric values that sets the permitted FP powers for each covariate. |
acdx |
a logical vector of length nvars indicating continuous variables to undergo the approximate cumulative distribution (ACD) transformation. |
... |
passed to |
A list with two entries:
powers
: fp power(s) of xi
(or its ACD transformation) in fitted model.
metrics
: a matrix with performance indices for fitted model.
This function is not exported and is intended to be called from
the mfp2()
function. While most parameters are explained in
the documentation of mfp2()
, their form may differ in this
function. Note that this function does not check its arguments
and expects that its input has been prepared in mfp2()
function.
fit_mfp( x, y, weights, offset, cycles, scale, shift, df, center, family, criterion, select, alpha, keep, xorder, powers, method, strata, nocenter, acdx, ftest, control, verbose )
fit_mfp( x, y, weights, offset, cycles, scale, shift, df, center, family, criterion, select, alpha, keep, xorder, powers, method, strata, nocenter, acdx, ftest, control, verbose )
x |
an input matrix of dimensions nobs x nvars. Does not contain intercept, but columns are already expanded into dummy variables as necessary. Data are assumed to be shifted and scaled. |
y |
a vector for the response variable or a |
weights |
a vector of observation weights of length nobs. |
offset |
a vector of length nobs of offsets. |
cycles |
an integer representing the maximum number of iteration cycles during which FP powers for all predictor are updated. |
scale |
a numeric vector of length nvars of scaling factors. Not applied,
but re-ordered to conform to |
shift |
a numeric vector of length nvars of shifts. Not applied,
but re-ordered to conform to |
df |
a numeric vector of length nvars of degrees of freedom. |
center |
a logical vector of length nvars indicating if variables are to be centered. |
family |
a character string representing a family object. |
criterion |
a character string defining the criterion used to select variables and FP models of different degrees. |
select |
a numeric vector of length nvars indicating significance levels for backward elimination. |
alpha |
a numeric vector of length nvars indicating significance levels for tests between FP models of different degrees. |
keep |
a character vector with names of variables to be kept in the model. |
xorder |
a string determining the order of entry of the covariates into the model-selection algorithm. |
powers |
a named list of numeric values that sets the permitted FP powers for each covariate. |
method |
a character string specifying the method for tie handling in Cox regression model. |
strata |
a factor of all possible combinations of stratification
variables. Returned from |
nocenter |
a numeric vector with a list of values for fitting Cox
models. See |
acdx |
a logical vector of length nvars indicating which continuous variables should undergo the approximate cumulative distribution (ACD) transformation. |
ftest |
a logical indicating the use of the F-test for Gaussian models. |
control |
a list with parameters for model fit. See |
verbose |
a logical; run in verbose mode. |
See mfp2()
for details on the returned object.
Step 1: order variables according to xorder
. This step may involve
fitting a regression model to determine order of significance.
Step 2: input data pre-processing. Setting initial powers for fractional polynomial terms, checking if acd transformation is required and allowed. Note that the initial powers of all variables are always set to 1, and higher FPs are only evaluated in turn for each variables in the first cycle of the algorithm. See e.g. Sauerbrei and Royston (1999).
Step 3: run mfp algorithm cycles. See find_best_fp_cycle()
for more
details.
Step 4: fit final model using estimated powers.
Sauerbrei, W. and Royston, P., 1999. Building multivariable prognostic and diagnostic models: transformation of the predictors by using fractional polynomials. J Roy Stat Soc a Sta, 162:71-94.
mfp2
Fits generalized linear models and Cox proportional hazard models.
fit_model( x, y, family = "gaussian", weights = NULL, offset = NULL, method = NULL, strata = NULL, control = NULL, rownames = NULL, nocenter = NULL, fast = TRUE )
fit_model( x, y, family = "gaussian", weights = NULL, offset = NULL, method = NULL, strata = NULL, control = NULL, rownames = NULL, nocenter = NULL, fast = TRUE )
x |
a matrix of predictors (excluding intercept) with column names.
If column names are not provided they are set according to
|
y |
a vector for the outcome variable for glms, and a |
family |
a character strong specifying glm family to be used, or "cox" for Cox models. The default family is set to 'Gaussian'. |
method |
a character string specifying the method for tie handling.
See |
strata , control , weights , offset , rownames , nocenter
|
parameters for Cox
or glm. See |
fast |
passed to @return A list with the following components:
|
Computations rely on fit_glm()
and fit_cox()
.
"Null" model here refers to a model which does not include the variable
of interest xi
.
For parameter definitions, see find_best_fp_step()
. All parameters
captured by ...
are passed on to fit_model()
.
fit_null_step(x, xi, y, powers_current, powers, acdx, ...)
fit_null_step(x, xi, y, powers_current, powers, acdx, ...)
x |
an input matrix of dimensions nobs x nvars. Does not contain intercept, but columns are already expanded into dummy variables as necessary. Data are assumed to be shifted and scaled. |
xi |
a character string indicating the name of the current variable of interest, for which the best fractional polynomial transformation is to be estimated in the current step. |
y |
a vector for the response variable or a |
powers_current |
a list of length equal to the number of variables,
indicating the fp powers to be used in the current step for all variables
(except |
powers |
a named list of numeric values that sets the permitted FP powers for each covariate. |
acdx |
a logical vector of length nvars indicating continuous variables to undergo the approximate cumulative distribution (ACD) transformation. |
... |
passed to |
A list with two entries:
powers
: fp power(s) of xi
in fitted model - in this case NA
.
metrics
: a matrix with performance indices for fitted model.
Used in formula interface to mfp2()
.
fp( x, df = 4, alpha = 0.05, select = 0.05, shift = NULL, scale = NULL, center = TRUE, acdx = FALSE, powers = NULL ) fp2(...)
fp( x, df = 4, alpha = 0.05, select = 0.05, shift = NULL, scale = NULL, center = TRUE, acdx = FALSE, powers = NULL ) fp2(...)
x |
a vector representing a continuous variable undergoing fp-transformation. |
df , alpha , select , shift , scale , center , acdx
|
See |
powers |
a vector of powers to be evaluated for |
... |
used in alias |
The vector x
with new attributes relevant for fp-transformation. All
arguments passed to this function will be stored as attributes.
fp2()
: Alias for fp()
- use in formula when both mfp
and mfp2
are loaded to avoid name shadowing.
xr = 1:10 fp(xr) fp2(xr)
xr = 1:10 fp(xr) fp2(xr)
mfp2
objectPlots the partial linear predictors with confidence limits against the selected covariate(s) of interest.
fracplot( model, terms = NULL, partial_only = FALSE, type = c("terms", "contrasts"), ref = NULL, terms_seq = c("data", "equidistant"), alpha = 0.05, color_points = "#AAAAAA", color_line = "#000000", color_fill = "#000000", shape = 1, size_points = 1, linetype = "solid", linewidth = 1, alpha_fill = 0.1 ) plot_mfp(...)
fracplot( model, terms = NULL, partial_only = FALSE, type = c("terms", "contrasts"), ref = NULL, terms_seq = c("data", "equidistant"), alpha = 0.05, color_points = "#AAAAAA", color_line = "#000000", color_fill = "#000000", shape = 1, size_points = 1, linetype = "solid", linewidth = 1, alpha_fill = 0.1 ) plot_mfp(...)
model |
fitted |
terms |
character vector with variable names to be plotted. |
partial_only |
a logical value indicating whether only the partial
predictor (component) is drawn ( |
type , ref , terms_seq
|
arguments of |
alpha |
|
color_line , linetype , linewidth
|
|
color_fill , alpha_fill
|
|
shape , size_points , color_points
|
|
... |
used in alias |
The confidence limits of the partial linear predictors or contrasts are obtained from the variance–covariance matrix of the final fitted model, which takes into account the uncertainty in estimating the model parameters but not the FP powers. This can lead to narrow confidence intervals. A simple way to obtain more realistic confidence intervals within the FP is by using bootstrap, which is not currently implemented. See Royston and Sauerbrei (2008) chapter 4.9.2 for guidance on conducting bootstrapping within the FP class.
The component-plus-residual, is the partial linear predictor plus residuals,
where deviance residuals are used in generalized linear regression models,
while martingale residuals are used in Cox models, as done in Stata mfp program.
This kind of plot is only available if type = "terms"
.
A list of ggplot2
plot objects, one for each term requested. Can be
drawn as individual plots or facetted / combined easily using e.g.
patchwork::wrap_plots
and further customized.
plot_mfp()
: Alias for fracplot.
# Gaussian response data("prostate") x = as.matrix(prostate[,2:8]) y = as.numeric(prostate$lpsa) # default interface fit = mfp2(x, y, verbose = FALSE) fracplot(fit) # generate plots
# Gaussian response data("prostate") x = as.matrix(prostate[,2:8]) y = as.numeric(prostate$lpsa) # default interface fit = mfp2(x, y, verbose = FALSE) fracplot(fit) # generate plots
Breast cancer dataset used in the Royston and Sauerbrei (2008) book.
data(gbsg)
data(gbsg)
A dataset with 686 observations and 11 variables.
Patient identifier.
Age in years.
Menopausal status (0 = premeno, 1 = postmeno).
Tumor size (mm).
Tumor grade.
Number of positive lymph nodes.
exp(-0.12*nodes).
Progesterone receptor status.
Estrogen receptor status.
Tamoxifen treatment.
Time (days) to death or cancer recurrence.
Censoring (0 = censored, 1 = event).
This very simple helper generates combinations with replacement.
generate_combinations_with_replacement(x, k)
generate_combinations_with_replacement(x, k)
x |
vector of elements to choose from. |
k |
number of elements to choose. |
This is replicating the functionality from arrangements::combinations
with
replace = TRUE
. Note that base R function utils::combn
only returns
combinations without replacement, thus pairs like (0, 0) are not in the
output.
Note that this function is extremely inefficient and only intended to be used with small use cases, i.e. small k. This is typically the case in the context of MFP, but a warning is given if this is not the case since the algorithm may take a while to compute the combinations, and even longer to do model selection.
A m x k matrix, where m is the number of combinations.
Function that generates a matrix of FP powers for any degree
generate_powers_fp(degree = NULL, powers = NULL) generate_powers_acd(degree = NULL, powers = NULL)
generate_powers_fp(degree = NULL, powers = NULL) generate_powers_acd(degree = NULL, powers = NULL)
degree |
The degree of fractional polynomial. For example, degree = 1 is FP1 and returns 8 powers; degree 2 is FP2 and returns 36 pairs of powers; degree 3 is FP3 and returns 120 triples of powers, and so on. If the ACD transformation is used, this degree is assumed to be 2. |
powers |
the set of allowed powers for the fractional polynomials.
Default is |
For FP powers, this function returns all combinations of the powers of
length degree
, that is all pairs in which each entry is taken from the
set powers
, but no pair is repeated (i.e. the order of the entries does
not matter).
Thus, for the default set of powers and degree 2, this function returns
36 combinations.
For ACD powers, this function simply returns all possible tuples of
powers of length n.
Thus, for the default set of powers, this function returns 8 possible
powers, and for degree 2 it returns 64 pairs of powers. Higher degrees
are not supported by the function. In case that degree = 0
or degree = 1
,
the first column of the matrix representing untransformed data are set to
NA
to indicate that the normal data do not play a role. Higher degrees
than two are not supported.
A matrix of powers with degree columns and rows depending on the degree
.
For ACD powers always a matrix with two columns. For normal fps each row
will be sorted in increasing order (in alignment with
how transform_vector_fp()
processes the data).
generate_powers_acd()
: Function to generate acd powers.
powx <- c(-2, -1, -0.5, 0, 0.5, 1, 2, 3) generate_powers_fp(degree = 2, powers = powx) generate_powers_acd(degree = 2, powers = powx)
powx <- c(-2, -1, -0.5, 0, 0.5, 1, 2, 3) generate_powers_fp(degree = 2, powers = powx) generate_powers_acd(degree = 2, powers = powx)
Function to generate all requested FP transformations for a single variable
generate_transformations_fp(x, degree, powers) generate_transformations_acd(x, degree, powers)
generate_transformations_fp(x, degree, powers) generate_transformations_acd(x, degree, powers)
x |
a numeric vector of length nobs assumed to have been shifted and scaled. |
degree |
numeric indicating the degree of FPs. Assumed to be 2 for acd transformation. |
powers |
a vector of allowed FP powers. |
Any FP transformation is given by a vector of powers, e.g. (p1, p2) for
degree 2. These correspond to powers x^p1 and x^p2. Thus, we only need to
consider combinations of all values in powers
, since order of the entries
does not matter. See generate_powers_fp()
.
A special case are repeated powers, i.e. p1 = p2. In this case, the repeated
entries are multiplied by log(x) (see transform_vector_fp()
).
When the ACD transformation is requested, then all pairs of length 2
are considered, i.e. 64. See generate_powers_acd()
.
If degree = 0
then these functions return the data unchanged for fp,
or simply the acd transformation of the input variable, i.e. in both cases
the power is set to 1 (linear).
A list with two entries:
data
: a list with length equal to the number of possible FPs for the
variable of interest. Each entry is a matrix with degree many columns,
and nobs observations comprising the FP transformed input variable.
For example, for degree = 2 and nobs = 10, each entry is a 10 x 2 matrix.
Values are not centered. If degree = 0
, the single entry has a single
column.
powers
: the associated FP powers for each entry in data.
generate_transformations_acd()
: Function to generate acd transformations.
mfp2
objectSimply extracts all variables for which not all powers are estimated to
be NA
. The names refer to the original names in the dataset and do not
include transformations.
get_selected_variable_names(object)
get_selected_variable_names(object)
object |
fitted |
Character vector of names, ordered as defined by xorder
in mfp2()
.
# Gaussian model data("prostate") x = as.matrix(prostate[,2:8]) y = as.numeric(prostate$lpsa) # default interface fit = mfp2(x, y, verbose = FALSE) get_selected_variable_names(fit)
# Gaussian model data("prostate") x = as.matrix(prostate[,2:8]) y = as.numeric(prostate$lpsa) # default interface fit = mfp2(x, y, verbose = FALSE) get_selected_variable_names(fit)
Selects the multivariable fractional polynomial (MFP) model that best predicts
the outcome variable. It also has the ability to model a sigmoid relationship
between x
and an outcome variable y
using the approximate cumulative
distribution (ACD) transformation proposed by Royston (2014).
This function provides two interfaces for input data: one for inputting
data matrix x
and outcome vector y
directly and the other for using a
formula
object together with a dataframe data
. Both interfaces are
equivalent in terms of functionality.
mfp2(x, ...) ## Default S3 method: mfp2( x, y, weights = NULL, offset = NULL, cycles = 5, scale = NULL, shift = NULL, df = 4, center = TRUE, subset = NULL, family = c("gaussian", "poisson", "binomial", "cox"), criterion = c("pvalue", "aic", "bic"), select = 0.05, alpha = 0.05, keep = NULL, xorder = c("ascending", "descending", "original"), powers = NULL, ties = c("breslow", "efron", "exact"), strata = NULL, nocenter = NULL, acdx = NULL, ftest = FALSE, control = NULL, verbose = TRUE, ... ) ## S3 method for class 'formula' mfp2( formula, data, weights = NULL, offset = NULL, cycles = 5, scale = NULL, shift = NULL, df = 4, center = TRUE, subset = NULL, family = c("gaussian", "poisson", "binomial", "cox"), criterion = c("pvalue", "aic", "bic"), select = 0.05, alpha = 0.05, keep = NULL, xorder = c("ascending", "descending", "original"), powers = NULL, ties = c("breslow", "efron", "exact"), strata = NULL, nocenter = NULL, ftest = FALSE, control = NULL, verbose = TRUE, ... )
mfp2(x, ...) ## Default S3 method: mfp2( x, y, weights = NULL, offset = NULL, cycles = 5, scale = NULL, shift = NULL, df = 4, center = TRUE, subset = NULL, family = c("gaussian", "poisson", "binomial", "cox"), criterion = c("pvalue", "aic", "bic"), select = 0.05, alpha = 0.05, keep = NULL, xorder = c("ascending", "descending", "original"), powers = NULL, ties = c("breslow", "efron", "exact"), strata = NULL, nocenter = NULL, acdx = NULL, ftest = FALSE, control = NULL, verbose = TRUE, ... ) ## S3 method for class 'formula' mfp2( formula, data, weights = NULL, offset = NULL, cycles = 5, scale = NULL, shift = NULL, df = 4, center = TRUE, subset = NULL, family = c("gaussian", "poisson", "binomial", "cox"), criterion = c("pvalue", "aic", "bic"), select = 0.05, alpha = 0.05, keep = NULL, xorder = c("ascending", "descending", "original"), powers = NULL, ties = c("breslow", "efron", "exact"), strata = NULL, nocenter = NULL, ftest = FALSE, control = NULL, verbose = TRUE, ... )
x |
for |
... |
not used. |
y |
for |
weights |
a vector of observation weights of length nobs.
Default is |
offset |
a vector of length nobs that is included in the linear
predictor. Useful for the poisson family (e.g. log of exposure time).
Default is |
cycles |
an integer, specifying the maximum number of iteration cycles. Default is 5. |
scale |
a numeric vector of length nvars or single numeric specifying
scaling factors. If a single numeric, then the value will be replicated as
necessary. The formula interface |
shift |
a numeric vector of length nvars or a single numeric specifying
shift terms. If a single numeric, then the value will be replicated as
necessary. The formula interface |
df |
a numeric vector of length nvars or a single numeric that sets the
(default) degrees of freedom (df) for each predictor. If a single numeric,
then the value will be replicated as necessary. The formula interface
|
center |
a logical determining whether variables are centered before
final model fitting. The default |
subset |
an optional vector specifying a subset of observations
to be used in the fitting process. Default is |
family |
a character string representing a |
criterion |
a character string specifying the criterion used to select
variables and FP models of different degrees.
Default is to use p-values in which case the user can specify
the nominal significance level (or use default level of 0.05) for variable and
functional form selection (see |
select |
a numeric vector of length nvars or a single numeric that
sets the nominal significance levels for variable selection on each predictor
by backward elimination. If a single numeric, then the value will be replicated
as necessary. The formula interface |
alpha |
a numeric vector of length nvars or a single numeric that
sets the significance levels for testing between FP models of
different degrees. If a single numeric, then the value will be replicated
as necessary. The formula interface |
keep |
a character vector with names of variables to be kept
in the model. In case that |
xorder |
a string determining the order of entry of the covariates
into the model-selection algorithm. The default is |
powers |
a named list of numeric values that sets the permitted FP
powers for each covariate. The default is NULL, and each covariate is assigned
|
ties |
a character string specifying the method for tie handling in
Cox regression. If there are no tied death times all the methods are
equivalent. Default is the Breslow method. This argument is used for Cox
models only and has no effect on other model families.
See |
strata |
a numeric vector or matrix of variables that define strata
to be used for stratification in a Cox model. A new factor, whose levels are
all possible combinations of the variables supplied will be created.
Default is |
nocenter |
a numeric vector with a list of values for fitting Cox
models. See |
acdx |
a numeric vector of names of continuous variables to undergo
the approximate cumulative distribution (ACD) transformation.
It also invokes the function-selection procedure to determine the
best-fitting FP1(p1, p2) model (see Details section). Not present in the
formula interface |
ftest |
a logical; for normal error models with small samples, critical
points from the F-distribution can be used instead of Chi-Square
distribution. Default |
control |
a list object with parameters controlling model fit details.
Returned by either |
verbose |
a logical; run in verbose mode. |
formula |
for |
data |
for |
mfp2()
returns an object of class inheriting from glm
or copxh
,
depending on the family
parameter.
The function summary()
(i.e. summary.mfp2()
) can be used to obtain or
print a summary of the results.
The generic accessor function coef()
can be used to extract the vector of
coefficients from the fitted model object.
The generic predict()
can be used to obtain predictions from the fitted
model object.
An object of class mfp2
is a list containing all entries as for glm
or coxph
, and in addition the following entries:
convergence_mfp: logical value indicating convergence of mfp algorithm.
fp_terms: a data.frame with information on fractional polynomial terms.
transformations: a data.frame with information on shifting, scaling and centering for all variables.
fp_powers: a list with all powers of fractional polynomial terms. Each entry of the list is named according to the transformation of the variable.
acd: a vector with information for which variables the acd transformation was applied.
x_original: the scaled and shifted input matrix but without transformations.
y: the original outcome variable.
x: the final transformed input matrix used to fit the final model.
call_mfp: the call to the mfp2()
function.
family_string: the family stored as character string.
The mfp2
object may contain further information depending on family.
mfp2(default)
: Default method using input matrix x
and outcome vector y
.
mfp2(formula)
: Provides formula interface for mfp2
.
In the following we denote fractional polynomials for a variable by
increasing complexity as either FP1 or FP2. In this example,
for
is the most flexible FP transformation,
where
When (repeated powers), the FP2 model is given by
The powers and
are usually chosen from a predefined set
of powers
where the power
of 0 indicates the natural logarithm. The best FP2 is then estimated by using a
closed testing procedure that seeks the best combination from all 36 pairs of
powers
. Functions that only involve a single power of
the variable are denoted as FP1, i.e.
For details see e.g. Sauerbrei et al (2006).
family
optionmfp2()
supports the family object as used by stats::glm()
. The built in
families are specified via a character string. mfp2(..., family = "binomial")
fits a logistic regression model, while mfp2(..., family = "gaussian")
fits a linear regression (ordinary least squares) model.
For Cox models, the response should preferably be a Surv
object,
created by the survival::Surv()
function, and the family = "cox"
.
Only right-censored data are currently supported. To fit stratified Cox
models, the strata
option can be used, or alternatively strata
terms
can be included in the model formula when using the formula interface
mfp2.formula
.
Fractional polynomials are defined only for positive variables due to the
use of logarithms and other powers. Thus, mfp2()
estimates shifts for
each variables to ensure positivity or assumes that the variables are
already positive when computing fractional powers of the input variables
in case that shifting is disabled manually.
If the values of the variables are too large or too small, it is important to
conduct variable scaling to reduce the chances of numerical underflow or
overflow which can lead to inaccuracies and difficulties in estimating the
model. Scaling can be done automatically or by directly specifying the
scaling values so that the magnitude of the x
values are not too extreme.
By default scaling factors are estimated by the program as follows.
After adjusting the location of so that its minimum value is positive,
creating
, automatic scaling will divide each value of
by
where the exponent
is given by
The FP transformation of is centered on the mean of the observed
values of
. For example, for the FP1 model
,
the actual model fitted by the software would be
. This approach ensures that
the revised constant
or baseline hazard function in a Cox
model retains a meaningful interpretation.
So in brief: shifting is required to make input values positive, scaling
helps to bring the values to a reasonable range. Both operations are
conducted before estimating the FP powers for an input variable.
Centering, however, is done after estimating the FP functions for each
variable.
Centering before estimating the FP powers may result in different powers and
should be avoided. Also see transform_vector_fp()
for some more details.
subset
argumentNote that subsetting occurs after data pre-processing (shifting and scaling),
but before model selection and fitting. In detail, when the option subset
is
used and scale, shift or centering values are to be estimated, then mfp2()
first estimates these parameters using the full dataset (no subsetting).
It then conduct subsetting before proceeding to perform model selection and
fitting on the specified subset of the data.
Therefore, subsetting in mfp2()
is not equivalent to subsetting the data
before passing it to mfp2()
, and thus cannot be used to implement, for example,
cross-validation or to remove NA
. These tasks should be done by the caller
beforehand. However, it does allow to use the same data pre-processing
for different subsets of the data. An example use case is when separate
models are to be estimated for women and men in the dataset, but a common
data pre-processing should be applied. In this case the subset
option
can be used to restrict model selection to either women or men, but the
data processing (e.g. shifting factors) will be shared between the two models.
The approximate cumulative distribution (ACD) transformation (Royston 2014)
converts each predictor, , smoothly to an approximation,
,
of its empirical cumulative distribution function.
This is done by smoothing a probit transformation of
the scaled ranks of
.
could be used instead of
as a covariate. This has the advantage of providing sigmoid curves, something
that regular FP functions cannot achieve.
Details of the precise definition and some possible uses of the ACD
transformation in a univariate context are given by Royston (2014).
Royston and Sauerbrei (2016) describes how one could go further and replace FP2
functions with a pair of FP1 functions, one in
and the other in
.
This alternative class of four-parameter functions provides about
the same flexibility as the standard FP2 family, but the ACD component offers
the additional possibility of sigmoid functions.
Royston (2014) discusses how the extended class of functions known as
, namely
can be fitted optimally by seeking the best combination of all 64 pairs of
powers (p1, p2). The optimisation is invoked by use of the acdx
parameter.
Royston (2014) also described simplification of the chosen function through
model reduction by applying significance testing to six sub-families of
functions,M1-M6, giving models M1 (most complex) through M6 (null):
M1: FP1(p1, p2) (no simplification)
M2: FP1(p1, .) (regular FP1 function of )
M3: FP1(., p2) (regular FP1 function of )
M4: FP1(1, .) (linear function of )
M5: FP1(., 1) (linear function of )
M6: Null ( omitted entirely)
Selection among these six sub-functions is performed by a closed test
procedure known as the function-selection pocedure FSPA.
It maintains the family-wise type 1 error
probability for selecting at the value determined by the
select
parameter. To obtain a 'final' model, a structured sequence of up
to five tests is carried out, the first at the significance level specified
by the select
parameter, and the remainder at the significance level
provided by the alpha
option.
The sequence of tests is as follows:
Test 1: Compare the deviances of models 6 and 1 on 4 d.f.
If not significant then stop and omit , otherwise continue to step 2.
Test 2: Compare the deviances of models 4 and 1 on 3 d.f. If not significant then accept model 4 and stop. Otherwise, continue to step 3.
Test 3: Compare the deviance of models 2 and 1 on 2 d.f. If not significant then accept model 2 and stop. Otherwise continue to step 4.
Test 4: Compare the deviance of models 3 and 1 on 2 d.f. If significant then model 1 cannot be simplified; accept model 1 and stop. Otherwise continue to step 5.
Test 5: Compare the deviances of models 5 and 3 on 1 d.f. If significant then model 3 cannot be simplified; accept model 3. Otherwise, accept model 5. End of procedure.
The result is the selection of one of the six models.
formula
mfp2
supports model specifications using two different interfaces: one
which allows passing of the data matrix x
and outcome vector y
directly
(as done in e.g. stats::glm.fit()
or glmnet
) and another which conforms
to the formula interface used by many commonly used R modelling functions
such as stats::glm()
or survival::coxph()
.
Both interfaces are equivalent in terms of possible fitted models, only the
details of specification differ. In the standard interface all details
regarding FP-transformations are given as vectors. In the formula interface
all details are specified using special fp()
function. These support the
specification of degrees of freedom (df
), nominal significance level for
variable selection (select
), nominal significance level for functional form
selection (alpha
), shift values (shift
), scale values (scale
),
centering (center
) and the ACD-transformation (acd
). Values specified
through fp()
function override the values specified as defaults and passed to
the mfp2()
function.
The formula may also contain strata
terms to fit stratified Cox models, or
an offset
term to specify a model offset.
Note that for a formula using .
, such as y ~ .
the mfp2()
function may not
fit a linear model, but may perform variable and functional form selection
using FP-transformations, depending on the default settings of df
,
select
and alpha
passed as arguments to mfp2()
.
For example, using y ~ .
with default settings means that mfp2()
will
apply FP transformation with 4 df to all continuous variables and use alpha
equal to 0.05 to select functional forms, along with the selection algorithm
with a significance level of 0.05 for all variables.
mfp
packagemfp2
is an extension of the mfp
package and can be used to reproduce
the results from a model fitted by mfp
. Since both packages implement the
MFP algorithm, they use functions with the same names (e.g fp()
). Therefore,
if you load both packages using a call to library
, there will
be namespace conflicts and only the functions from the package loaded last
will work properly.
Typically, mfp2
requires two to four cycles to achieve convergence. Lack of
convergence involves oscillation between two or more models and is extremely
rare. If the model does not converge, you can try changing the nominal
significance levels for variable (select
) or function selection (alpha
).
Royston, P. and Sauerbrei, W., 2008. Multivariable Model - Building:
A Pragmatic Approach to Regression Anaylsis based on Fractional Polynomials
for Modelling Continuous Variables. John Wiley & Sons.
Sauerbrei, W., Meier-Hirmer, C., Benner, A. and Royston, P., 2006.
Multivariable regression model building by using fractional
polynomials: Description of SAS, STATA and R programs.
Comput Stat Data Anal, 50(12): 3464-85.
Royston, P. 2014. A smooth covariate rank transformation for use in
regression models with a sigmoid dose-response function.
Stata Journal 14(2): 329-341.
Royston, P. and Sauerbrei, W., 2016. mfpa: Extension of mfp using the
ACD covariate transformation for enhanced parametric multivariable modeling.
The Stata Journal, 16(1), pp.72-87.
Sauerbrei, W. and Royston, P., 1999. Building multivariable prognostic and diagnostic models: transformation of the predictors by using fractional polynomials. J Roy Stat Soc a Sta, 162:71-94.
summary.mfp2()
, coef.mfp2()
, predict.mfp2()
, fp()
# Gaussian model data("prostate") x = as.matrix(prostate[,2:8]) y = as.numeric(prostate$lpsa) # default interface fit1 = mfp2(x, y, verbose = FALSE) fit1$fp_terms fracplot(fit1) # generate plots summary(fit1) # formula interface fit1b = mfp2(lpsa ~ fp(age) + fp(svi, df = 1) + fp(pgg45) + fp(cavol) + fp(weight) + fp(bph) + fp(cp), data = prostate) # logistic regression model data("pima") xx <- as.matrix(pima[, 2:9]) yy <- as.vector(pima$y) fit2 <- mfp2(xx, yy, family = "binomial", verbose = FALSE) fit2$fp_terms # Cox regression model data("gbsg") # create dummy variable for grade using ordinal coding gbsg <- create_dummy_variables(gbsg, var_ordinal = "grade", drop_variables = TRUE) xd <- as.matrix(gbsg[, -c(1, 6, 10, 11)]) yd <- survival::Surv(gbsg$rectime, gbsg$censrec) # fit mfp and keep hormon in the model fit3 <- mfp2(xd, yd, family = "cox", keep = "hormon", verbose = FALSE) fit3$fp_terms
# Gaussian model data("prostate") x = as.matrix(prostate[,2:8]) y = as.numeric(prostate$lpsa) # default interface fit1 = mfp2(x, y, verbose = FALSE) fit1$fp_terms fracplot(fit1) # generate plots summary(fit1) # formula interface fit1b = mfp2(lpsa ~ fp(age) + fp(svi, df = 1) + fp(pgg45) + fp(cavol) + fp(weight) + fp(bph) + fp(cp), data = prostate) # logistic regression model data("pima") xx <- as.matrix(pima[, 2:9]) yy <- as.vector(pima$y) fit2 <- mfp2(xx, yy, family = "binomial", verbose = FALSE) fit2$fp_terms # Cox regression model data("gbsg") # create dummy variable for grade using ordinal coding gbsg <- create_dummy_variables(gbsg, var_ordinal = "grade", drop_variables = TRUE) xd <- as.matrix(gbsg[, -c(1, 6, 10, 11)]) yd <- survival::Surv(gbsg$rectime, gbsg$censrec) # fit mfp and keep hormon in the model fit3 <- mfp2(xd, yd, family = "cox", keep = "hormon", verbose = FALSE) fit3$fp_terms
Helper function to name transformed variables
name_transformed_variables(name, n_powers, acd = FALSE)
name_transformed_variables(name, n_powers, acd = FALSE)
name |
character with name of variable being transformed. |
n_powers |
number of resulting variables from FP-transformation. |
acd |
logical indicating the use of ACD-transformation |
Character vector of names of length n_powers
.
To be used in fit_mfp()
.
order_variables(xorder = "ascending", x = NULL, ...) order_variables_by_significance( xorder, x, y, family, weights, offset, strata, method, control, nocenter )
order_variables(xorder = "ascending", x = NULL, ...) order_variables_by_significance( xorder, x, y, family, weights, offset, strata, method, control, nocenter )
xorder |
a string determining the order of entry of the covariates
into the model-selection algorithm. The default is |
x |
a design matrix of dimension n * p where n is the number of observations and p the number of predictors including intercept for glms, or excluding intercept for Cox models. |
... |
passed to |
y |
a vector of responses for glms, or a |
family |
a character string naming a family function supported by
|
weights , offset
|
parameters for both glm and Cox models, see either
|
strata , method , control , nocenter
|
Cox model specific parameters, see
|
A vector of the variable names in x
, ordered according to xorder
.
order_variables_by_significance()
: Order by significance in regression model. The
number of columns of x
should be greater than 1 for Cox models.
The dataset arises from an investigation of potential predictors of the onset of diabetes in a cohort of 768 female Pima Indians of whom 268 developed diabetes. Missing values were imputed using the ice procedure for Stata.
data(pima)
data(pima)
A dataset with 768 observations and 9 variables.
Patient identifier.
Number of times pregnant.
Plasma glucose concentration at 2h in an oral glucose tolerance test.
Diastolic blood pressure in mmHg.
Triceps skin fold thickness in mm.
2-h serum insulin.
Body mass index.
Diabetes pedigree function.
Age in years.
Binary outcome variable (diabetes, yes/no).
mfp2
FitsObtains predictions from an mfp2
object.
## S3 method for class 'mfp2' predict( object, newdata = NULL, type = NULL, terms = NULL, terms_seq = c("equidistant", "data"), alpha = 0.05, ref = NULL, strata = NULL, newoffset = NULL, ... )
## S3 method for class 'mfp2' predict( object, newdata = NULL, type = NULL, terms = NULL, terms_seq = c("equidistant", "data"), alpha = 0.05, ref = NULL, strata = NULL, newoffset = NULL, ... )
object |
a fitted object of class |
newdata |
optionally, a matrix with column names in which to look for
variables with which to predict. See |
type |
the type of prediction required. The default is on the
scale of the linear predictors. See |
terms |
a character vector of variable names specifying for which
variables term or contrast predictions are desired.
Only used in case |
terms_seq |
a character string specifying how the range of variable
values for term predictions are handled. The default |
alpha |
significance level used for computing confidence intervals in terms prediction. |
ref |
a named list of reference values used when |
strata |
stratum levels used for predictions. |
newoffset |
A vector of offsets used for predictions. This parameter is important when newdata is supplied. The offsets will be directly added to the linear predictor without any transformations. |
... |
further arguments passed to |
To prepare the newdata
for prediction, this function applies any
necessary shifting and scaling based on the factors obtained from the
training data.
It is important to note that if the shifting factors are not sufficiently
large as estimated from the training data, variables in newdata
may end up
with negative values, which can cause prediction errors if non-linear
functional forms are used. A warning is given in this case by the function.
The next step involves transforming the data using the selected
fractional polynomial (FP) powers. If necessary, centering of variables is
conducted. Once the transformation (and centering) is complete, the
transformed data is passed to either predict.glm()
or predict.coxph()
,
depending on the chosen family of models and when type is not
terms
and contrasts
.
For any type
other than "terms"
the output conforms to the output
of predict.glm()
or predict.coxph()
.
If type = "terms"
or type = "contrasts"
, then a named list with entries
for each variable requested in terms
(excluding those not present in the
final model).
Each entry is a data.frame
with the following columns:
variable
: variable values on original scale.
variable_pre
: variable with pre-transformation applied, i.e. shifted,
scaled and centered as required.
value
: partial linear predictor or contrast (depending on type
).
se
: standard error of partial linear predictor or contrast.
lower
: lower limit of confidence interval.
upper
: upper limit of confidence interval.
This function allows to compute the partial linear predictors
for each variable selected into the final model if type = "terms"
. Note
that the results returned from this function are different from those of
predict.glm()
and predict.coxph()
since these functions do not take
into account that a single variable can be represented by multiple terms.
This functionality is useful to assess model fit, since it also allows to
draw data points based on residuals.
This functions allows to compute contrasts with reference to a specified
variable value if type = "contrasts"
. In this case, the fitted partial
predictors will be centered at the reference value (i.e. 0), and also
confidence intervals will have width 0 at that point.
mfp2()
, stats::predict.glm()
, survival::predict.coxph()
# Gaussian model data("prostate") x = as.matrix(prostate[,2:8]) y = as.numeric(prostate$lpsa) # default interface fit1 = mfp2(x, y, verbose = FALSE) predict(fit1) # make predictions
# Gaussian model data("prostate") x = as.matrix(prostate[,2:8]) y = as.numeric(prostate$lpsa) # default interface fit1 = mfp2(x, y, verbose = FALSE) predict(fit1) # make predictions
To be used in predict.mfp2()
.
prepare_newdata_for_predict( object, newdata, strata = NULL, offset = NULL, apply_pre = TRUE, apply_center = TRUE, check_binary = TRUE )
prepare_newdata_for_predict( object, newdata, strata = NULL, offset = NULL, apply_pre = TRUE, apply_center = TRUE, check_binary = TRUE )
object |
fitted |
newdata |
dataset to be prepared for predictions. Its columns can be a subset of the columns used for fitting the model. |
strata , offset
|
passed from |
apply_pre |
logical indicating wether the fitted pre-transformation is applied or not. |
apply_center |
logical indicating whether the fitted centers are applied after transformation or not. |
check_binary |
passed to |
A dataframe of transformed newdata
Function for verbose printing of function selection procedure (FSP)
print_mfp_step(xi, criterion, fit) print_mfp_pvalue_step(xi, fit, criterion) print_mfp_ic_step(xi, fit, criterion)
print_mfp_step(xi, criterion, fit) print_mfp_pvalue_step(xi, fit, criterion) print_mfp_ic_step(xi, fit, criterion)
xi |
a character string indicating the name of the current variable of interest, for which the best fractional polynomial transformation is to be estimated in the current step. |
criterion |
a character string defining the criterion used to select variables and FP models of different degrees. |
fit |
intermediary model fit in |
print_mfp_pvalue_step()
: Helper for verbose printing based on p-value.
print_mfp_ic_step()
: Helper for verbose printing based on information criterion.
mfp2
Enhances printing by information on data processing and fractional polynomials.
## S3 method for class 'mfp2' print(x, ...)
## S3 method for class 'mfp2' print(x, ...)
x |
|
... |
passed to |
Two dataframes: the first one contains preprocessing parameters (shifting, scaling, and
centering), and the second one includes additional parameters such as df
, select
, and
alpha
passed through mfp2
. It also returns a list of the final model fitted, which can be
either a GLM or Cox model depending on the chosen family.
Prostate cancer dataset used in the Royston and Sauerbrei (2008) book.
data(prostate)
data(prostate)
A dataset with 97 observations and 8 variables.
Observation number.
Age in years.
Seminal vessel invasion (yes/no).
Percentage Gleason score 4 or 5.
Cancer volume (mm).
Prostate weight (g).
Amount of benign prostatic hyperplasia (g).
Amount of capsular penetration (g).
Log PSA concentration (outcome variable).
To be used in fit_mfp()
.
This function resets the acdx
parameter (logical vector) of variables with
less than 5 distinct values to FALSE
.
reset_acd(x, acdx)
reset_acd(x, acdx)
x |
a design matrix of dimension nobs x nvars where nvars is the number of predictors excluding an intercept. |
acdx |
a named logical vector of length nvars indicating which continuous
variables should undergo the approximate cumulative distribution (ACD)
transformation. May be ordered differently than the columns of |
Logical vector of same length as acdx
.
Used in find_best_fp_step()
when criterion = "aic"
or "bic"
.
For parameter explanations, see find_best_fp_step()
. All parameters
captured by ...
are passed on to fit_model()
.
select_ic( x, xi, keep, degree, acdx, y, powers_current, powers, criterion, ftest, select, alpha, ... ) select_ic_acd( x, xi, keep, degree, acdx, y, powers_current, powers, criterion, ftest, select, alpha, ... )
select_ic( x, xi, keep, degree, acdx, y, powers_current, powers, criterion, ftest, select, alpha, ... ) select_ic_acd( x, xi, keep, degree, acdx, y, powers_current, powers, criterion, ftest, select, alpha, ... )
x |
an input matrix of dimensions nobs x nvars. Does not contain intercept, but columns are already expanded into dummy variables as necessary. Data are assumed to be shifted and scaled. |
xi |
a character string indicating the name of the current variable of interest, for which the best fractional polynomial transformation is to be estimated in the current step. |
keep |
a character vector with names of variables to be kept in the model. |
degree |
integer > 0 giving the degree for the FP transformation. |
acdx |
a logical vector of length nvars indicating continuous variables to undergo the approximate cumulative distribution (ACD) transformation. |
y |
a vector for the response variable or a |
powers_current |
a list of length equal to the number of variables,
indicating the fp powers to be used in the current step for all variables
(except |
powers |
a named list of numeric values that sets the permitted FP powers for each covariate. |
criterion |
a character string defining the criterion used to select variables and FP models of different degrees. |
ftest |
a logical indicating the use of the F-test for Gaussian models. |
select |
a numeric value indicating the significance level
for backward elimination of |
alpha |
a numeric value indicating the significance level
for tests between FP models of different degrees for |
... |
passed to fitting functions. |
In case an information criterion is used to select the best model the selection procedure simply fits all relevant models and selects the best one according to the given criterion.
"Relevant" models for a given degree are the null model excluding the variable of interest, the linear model and all best FP models up to the specified degree.
In case an ACD transformation is requested, then the models assessed are the null model, the linear model in x and A(x), the best FP1 models in x and A(x), and the best FP1(x, A(x)) model.
Note that the "best" FPx model used in this function are given by the models
using a FPx transformation for the variable of interest and having the
highest likelihood of all such models given the current powers for all other
variables, as outlined in Section 4.8 of Royston and Sauerbrei (2008).
These best FPx models are computed in find_best_fpm_step()
.
Keep in mind that for a fixed number of degrees of freedom (i.e. fixed m),
the model with the highest likelihood is the same as the model with the best
information criterion of any kind since all the models share the same
penalty term.
When a variable is forced into the model by including it in keep
, then
this function will not exclude it from the model (by setting its power to
NA
), but will only choose its functional form.
A list with several components:
keep
: logical indicating if xi
is forced into model.
acd
: logical indicating if an ACD transformation was applied for xi
,
i.e. FALSE
in this case.
powers
: (best) fp powers investigated in step, indexing metrics
.
Ordered by increasing complexity, i.e. null, linear, FP1, FP2 and so on.
For ACD transformation, it is null, linear, linear(., A(x)), FP1(x, .),
FP1(., A(x)) and FP1(x, A(x)).
power_best
: a numeric vector with the best power found. The returned
best power may be NA
, indicating the variable has been removed from the
model.
metrics
: a matrix with performance indices for all best models
investigated. Same number of rows as, and indexed by, powers
.
model_best
: row index of best model in metrics
.
pvalue
: p-value for comparison of linear and null model, NA
in this
case..
statistic
: test statistic used, depends on ftest
, NA
in this
case.
select_ic_acd()
: Function to select ACD based transformation.
To be used in find_best_fp_step()
. Only used if df = 1
for a variable.
Handles all criteria for selection.
For parameter explanations, see find_best_fp_step()
. All parameters
captured by ...
are passed on to fit_model()
.
select_linear( x, xi, keep, degree, acdx, y, powers_current, powers, criterion, ftest, select, alpha, ... )
select_linear( x, xi, keep, degree, acdx, y, powers_current, powers, criterion, ftest, select, alpha, ... )
x |
an input matrix of dimensions nobs x nvars. Does not contain intercept, but columns are already expanded into dummy variables as necessary. Data are assumed to be shifted and scaled. |
xi |
a character string indicating the name of the current variable of interest, for which the best fractional polynomial transformation is to be estimated in the current step. |
keep |
a character vector with names of variables to be kept in the model. |
degree |
not used. |
acdx |
a logical vector of length nvars indicating continuous variables to undergo the approximate cumulative distribution (ACD) transformation. |
y |
a vector for the response variable or a |
powers_current |
a list of length equal to the number of variables,
indicating the fp powers to be used in the current step for all variables
(except |
powers |
a named list of numeric values that sets the permitted FP powers for each covariate. |
criterion |
a character string defining the criterion used to select variables and FP models of different degrees. |
ftest |
a logical indicating the use of the F-test for Gaussian models. |
select |
a numeric value indicating the significance level
for backward elimination of |
alpha |
a numeric value indicating the significance level
for tests between FP models of different degrees for |
... |
passed to fitting functions. |
This function assesses a single variable of interest xi
regarding its
functional form in the current working model as indicated by
powers_current
, with the choice between a excluding xi
("null model") and
including a linear term ("linear fp") for xi
.
Note that this function handles an ACD transformation for xi
as well.
When a variable is forced into the model by including it in keep
, then
this function will not exclude it from the model (by setting its power to
NA
), but will only choose the linear model.
A list with several components:
keep
: logical indicating if xi
is forced into model.
acd
: logical indicating if an ACD transformation was applied for xi
.
powers
: fp powers investigated in step, indexing metrics
.
power_best
: a numeric vector with the best power found. The returned
best power may be NA
, indicating the variable has been removed from the
model.
metrics
: a matrix with performance indices for all models investigated.
Same number of rows as, and indexed by, powers
.
model_best
: row index of best model in metrics
.
pvalue
: p-value for comparison of linear and null model.
statistic
: test statistic used, depends on ftest
.
Used in find_best_fp_step()
when criterion = "pvalue"
.
For parameter explanations, see find_best_fp_step()
. All parameters
captured by ...
are passed on to fit_model()
.
select_ra2( x, xi, keep, degree, acdx, y, powers_current, powers, criterion, ftest, select, alpha, ... )
select_ra2( x, xi, keep, degree, acdx, y, powers_current, powers, criterion, ftest, select, alpha, ... )
x |
an input matrix of dimensions nobs x nvars. Does not contain intercept, but columns are already expanded into dummy variables as necessary. Data are assumed to be shifted and scaled. |
xi |
a character string indicating the name of the current variable of interest, for which the best fractional polynomial transformation is to be estimated in the current step. |
keep |
a character vector with names of variables to be kept in the model. |
degree |
integer > 0 giving the degree for the FP transformation. |
acdx |
a logical vector of length nvars indicating continuous variables to undergo the approximate cumulative distribution (ACD) transformation. |
y |
a vector for the response variable or a |
powers_current |
a list of length equal to the number of variables,
indicating the fp powers to be used in the current step for all variables
(except |
powers |
a named list of numeric values that sets the permitted FP powers for each covariate. |
criterion |
a character string defining the criterion used to select variables and FP models of different degrees. |
ftest |
a logical indicating the use of the F-test for Gaussian models. |
select |
a numeric value indicating the significance level
for backward elimination of |
alpha |
a numeric value indicating the significance level
for tests between FP models of different degrees for |
... |
passed to fitting functions. |
In case criterion = "pvalue"
the function selection procedure as outlined
in Chapters 4 and 6 of Royston and Sauerbrei (2008) is used.
Step 1: test the best FPm function against a null model at level
select
with 2m df. If not significant, the variable is excluded.
Otherwise continue with step 2.
Step 2: test the best FPm versus a linear model at level alpha
with 2m - 1 df. If not significant, use a linear model.
Otherwise continue with step 3.
Step 3: test the best FPm versus the best FP1 at
level alpha
with 2m - 2 df. If not significant, use the best FP1 model.
Otherwise, repeat this step for all remaining higher order FPs until
FPm-1, which is tested at level alpha
with 2 df against FPm.
If the final test is not significant, use a FPm-1 model, otherwise use FPm.
Note that the "best" FPx model used in each step is given by the model using
a FPx transformation for the variable of interest and having the highest
likelihood of all such models given the current powers for all other
variables, as outlined in Section 4.8 of Royston and Sauerbrei (2008).
These best FPx models are computed in find_best_fpm_step()
.
When a variable is forced into the model by including it in keep
, then
this function will not exclude it from the model (by setting its power to
NA
), but will only choose its functional form.
A list with several components:
keep
: logical indicating if xi
is forced into model.
acd
: logical indicating if an ACD transformation was applied for xi
,
i.e. FALSE
in this case.
powers
: (best) fp powers investigated in step, indexing metrics
.
Always starts with highest power, then null, then linear, then FP in
increasing degree (e.g. FP2, null, linear, FP1).
power_best
: a numeric vector with the best power found. The returned
best power may be NA
, indicating the variable has been removed from the
model.
metrics
: a matrix with performance indices for all models investigated.
Same number of rows as, and indexed by, powers
.
model_best
: row index of best model in metrics
.
pvalue
: p-value for comparison of linear and null model.
statistic
: test statistic used, depends on ftest
.
Royston, P. and Sauerbrei, W., 2008. Multivariable Model - Building: A Pragmatic Approach to Regression Anaylsis based on Fractional Polynomials for Modelling Continuous Variables. John Wiley & Sons.
Used in find_best_fp_step()
when criterion = "pvalue"
and an
ACD transformation is requested for xi
.
For parameter explanations, see find_best_fp_step()
. All parameters
captured by ...
are passed on to fit_model()
.
select_ra2_acd( x, xi, keep, degree, acdx, y, powers_current, powers, criterion, ftest, select, alpha, ... )
select_ra2_acd( x, xi, keep, degree, acdx, y, powers_current, powers, criterion, ftest, select, alpha, ... )
x |
an input matrix of dimensions nobs x nvars. Does not contain intercept, but columns are already expanded into dummy variables as necessary. Data are assumed to be shifted and scaled. |
xi |
a character string indicating the name of the current variable of interest, for which the best fractional polynomial transformation is to be estimated in the current step. |
keep |
a character vector with names of variables to be kept in the model. |
degree |
integer > 0 giving the degree for the FP transformation. |
acdx |
a logical vector of length nvars indicating continuous variables to undergo the approximate cumulative distribution (ACD) transformation. |
y |
a vector for the response variable or a |
powers_current |
a list of length equal to the number of variables,
indicating the fp powers to be used in the current step for all variables
(except |
powers |
a named list of numeric values that sets the permitted FP powers for each covariate. |
criterion |
a character string defining the criterion used to select variables and FP models of different degrees. |
ftest |
a logical indicating the use of the F-test for Gaussian models. |
select |
a numeric value indicating the significance level
for backward elimination of |
alpha |
a numeric value indicating the significance level
for tests between FP models of different degrees for |
... |
passed to fitting functions. |
This function extends the algorithm used in select_ra2()
to allow the
usage of ACD transformations. The implementation follows the description
in Royston and Sauerbrei (2016). The procedure is outlined in detail in
the corresponding section in the documentation of mfp2()
.
When a variable is forced into the model by including it in keep
, then
this function will not exclude it from the model (by setting its power to
NA
), but will only choose its functional form.
A list with several components:
keep
: logical indicating if xi
is forced into model.
acd
: logical indicating if an ACD transformation was applied for xi
,
i.e. FALSE
in this case.
powers
: (best) fp powers investigated in step, indexing metrics
.
Ordering: FP1(x, A(x)), null, linear, FP1(x, .), linear(., A(x)),
FP1(., A(x)).
power_best
: a numeric vector with the best power found. The returned
best power may be NA
, indicating the variable has been removed from the
model.
metrics
: a matrix with performance indices for all models investigated.
Same number of rows as, and indexed by, powers
.
model_best
: row index of best model in metrics
.
pvalue
: p-value for comparison of linear and null model.
statistic
: test statistic used, depends on ftest
.
Royston, P. and Sauerbrei, W., 2016. mfpa: Extension of mfp using the ACD covariate transformation for enhanced parametric multivariable modeling. The Stata Journal, 16(1), pp.72-87.
mfp2
model fitsThis function is a method for the generic base::summary()
function for
objects of class mfp2
.
## S3 method for class 'mfp2' summary(object, ...)
## S3 method for class 'mfp2' summary(object, ...)
object |
an object of class |
... |
further arguments passed to the summary functions for |
An object returned from stats::summary.glm()
or
survival::summary.coxph()
, depending on the family parameter of object
.
mfp2()
, stats::glm()
, stats::summary.glm()
, survival::coxph()
,
survival::summary.coxph()
Function to extract and transform adjustment variables
transform_data_step(x, xi, powers_current, df, powers, acdx)
transform_data_step(x, xi, powers_current, df, powers, acdx)
x |
a matrix of predictors that includes the variable of interest |
xi |
name of the continuous predictor for which the FP function will be
estimated. There are no binary or two-level variables allowed. All variables
except |
powers_current |
a named list of FP powers of all variables of interest,
including |
df |
a numeric vector of degrees of freedom for |
powers |
a set of allowed FP powers. |
acdx |
a logical vector indicating the use of acd transformation. |
After extracting the adjustment variables this function, using their
corresponding FP powers stored in powers_current
, transforms them.
This is necessary When evaluating x of interest, as we must account for other
variables, which can be transformed or untransformed, depending on the
individual powers. It's worth noting that some powers can be NA, indicating
that the variable has been left out of the adjustment variables. It also
returns the FP data, which is dependent on the degrees of freedom. For example,
df = 2
is equivalent to FP degree one, resulting in the generation of 8
variables. If acdx
for the current variables of interest is set to TRUE
,
however, 64 variables are generated.
When df = 1
, this function returns data unchanged, i.e. a "linear"
transformation with power equal to 1. In case acdx[xi] = TRUE
, the
acd transformation is applied.
A list containing the following elements:
powers_fp
: fp powers used for data_fp
.
data_fp
: a list with all possible fp transformations for xi
, see the
data
component of the output of generate_transformations_fp()
and
generate_transformations_acd()
.
powers_adj
: fp powers for adjustment variables in data_adj
.
data_adj
: adjustment data, i.e. transformed input data for adjustment
variables.
Function to transform each column of matrix using final FP powers or acd
transform_matrix( x, power_list, center, acdx, keep_x_order = FALSE, acd_parameter_list = NULL, check_binary = TRUE )
transform_matrix( x, power_list, center, acdx, keep_x_order = FALSE, acd_parameter_list = NULL, check_binary = TRUE )
x |
a matrix with all continuous variables shifted and scaled. |
power_list |
a named list of FP powers to be applied to the columns of
|
center |
a named logical vector specifying whether the columns in |
acdx |
a named logical vector specifying the use of acd transformation. |
keep_x_order |
a logical indicating whether the order of columns
should be kept as in the input matrix |
acd_parameter_list |
a named list. Only required when transformation
are to be applied to new data. Entries must correspond to the entries where
|
check_binary |
passed to |
For details on the transformations see transform_vector_fp()
and
transform_vector_acd()
.
If all elements of power_list
are NA
then this function returns NULL
.
Otherwise a list with three entries: the first x_transformed
is a matrix
with transformed variables as named in power_list
.
The number of columns may possibly be different to the
input matrix due to higher order FP transformations.
The second entry centers
stores the values used to center the variables if
for any variable center = TRUE
(note that usually all variables are
centered, or none of them).
The third entry acd_parameter
stores a named list of estimated
acd_parameters
. May be empty if no ACD transformation is applied.
Generally the original variable names are suffixed with ".i", where
i enumerates the powers for a given variable in power_list
. If a term
uses an acd transformation, then the variable is prefixed with A_
.
x = matrix(1:100, nrow = 10) colnames(x) = paste0("x", 1:ncol(x)) powx = setNames(replicate(ncol(x), c(1,2), simplify = FALSE), colnames(x)) center = setNames(rep(FALSE, ncol(x)), colnames(x)) acdx = setNames(rep(FALSE, ncol(x)), colnames(x)) transform_matrix(x, powx, center, acdx)
x = matrix(1:100, nrow = 10) colnames(x) = paste0("x", 1:ncol(x)) powx = setNames(replicate(ncol(x), c(1,2), simplify = FALSE), colnames(x)) center = setNames(rep(FALSE, ncol(x)), colnames(x)) acdx = setNames(rep(FALSE, ncol(x)), colnames(x)) transform_matrix(x, powx, center, acdx)
These functions generate fractional polynomials for a variable similar to
fracgen
in Stata. transform_vector_acd
generates the acd transformation
for a variable.
transform_vector_fp( x, power = 1, scale = 1, shift = 0, name = NULL, check_binary = TRUE ) transform_vector_acd( x, power = c(1, 1), shift = 0, powers = NULL, scale = 1, acd_parameter = NULL, name = NULL )
transform_vector_fp( x, power = 1, scale = 1, shift = 0, name = NULL, check_binary = TRUE ) transform_vector_acd( x, power = c(1, 1), shift = 0, powers = NULL, scale = 1, acd_parameter = NULL, name = NULL )
x |
a vector of a predictor variable. |
power |
a numeric vector indicating the FP power. Default is 1 (linear).
Must be a vector of length 2 for acd transformation. Ignores |
scale |
scaling factor for x of interest. Must be a positive integer
or |
shift |
shift required for shifting x to positive values. Default is 0,
meaning no shift is applied. If |
name |
character used to define names for the output matrix. Default
is |
check_binary |
a logical indicating whether or not input |
powers |
passed to |
acd_parameter |
a list usually returned by |
The fp transformation generally transforms x
as follows. For each pi in
power
= (p1, p2, ..., pn) it creates a variable x^pi and returns the
collection of variables as a matrix. It may process the data using
shifting and scaling as desired. Centering has to be done after the
data is transformed using these functions, if desired.
A special case are repeated powers, i.e. when some pi = pj. In this case, the fp transformations are given by x^pi and x^pi * log(x). In case more than 2 powers are repeated they are repeatedly multiplied with log(x) terms, e.g. pi = pj = pk leads to x^pi, x^pi * log(x) and x^pi * log(x)^2.
Note that the powers pi are assumed to be sorted. That is, this function
sorts them, then proceeds to compute the transformation. For example,
the output will be the same for power = c(1, 1, 2)
and
power = c(1, 2, 1)
. This is done to make sense of repeated powers and
to uniquely define FPs. In case an ACD transformation is used, there is a
specific order in which powers are processed, which is always the same (but
not necessarily sorted).
Thus, throughout the whole package powers will always be given and processed
in either sorted, or ACD specific order and the columns of the matrix
returned by this function will always align with the powers used
throughout this package.
Binary variables are not transformed, unless check_binary
is set to
FALSE
. This is usually not necessary, the only special case to set it to
FALSE
is when a single value is to be transformed during prediction (e.g.
to transform a reference value). When this is done, binary variables are
still returned unchanged, but a single value from a continuous variable will
be transformed as desired by the fitted transformations. For model fit,
check_binary
should always be at its default value.
Returns a matrix of transformed variable(s). The number of columns
depends on the number of powers provided, the number of rows is equal to the
length of x
. The columns are sorted by increased power.
If all powers are NA
, then this function returns NULL
.
In case an acd transformation is applied, the output is a list with two
entries. The first acd
is the matrix of transformed variables, the acd
term is returned as the last column of the matrix (i.e. in case that the
power for the normal data is NA
, then it is the only column in the matrix).
The second entry acd_parameter
returns a list of estimated parameters
for the ACD transformation, or simply the input acd_parameter
if it was
not NULL
.
transform_vector_acd()
: Function to generate acd transformation.
An important note on data processing. Variables are shifted and scaled
before being transformed by any powers. That is to ensure positive values
and reasonable scales. Note that scaling does not change the estimated
powers, see also find_scale_factor()
.
However, they may be centered after transformation. This is not done by
these functions.
That is to ensure that the correlation between variables stay intact,
as centering before transformation would affect them. This is described
in Sauerbrei et al (2006), as well as in the Stata manual of mfp
.
Also, centering is not recommended, and should only be done for the final
model if desired.
Sauerbrei, W., Meier-Hirmer, C., Benner, A. and Royston, P., 2006. Multivariable regression model building by using fractional polynomials: Description of SAS, STATA and R programs. Comput Stat Data Anal, 50(12): 3464-85.
z = 1:10 transform_vector_fp(z) transform_vector_acd(z)
z = 1:10 transform_vector_fp(z) transform_vector_acd(z)
Simple function to transform vector by a single power
transform_vector_power(x, power = 1)
transform_vector_power(x, power = 1)
x |
a vector of a predictor variable. |
power |
single power. |
A vector of transformed values if power is not equal to 1