Copyright © 2019 Society of Actuaries
In regularization methods, we concentrate on estimating the coefficients of the features, the βs, and feature selection
arises as a by-product of the regularization. Other feature selection methods concentrate on calculating a goodness-of-fit
measure for different possible subsets of the potential features. For example, subset selection methods calculate
a goodness-of-fit measure, such as AIC or BIC, for all possible subsets. The final model is chosen to be the model with
the best value for the goodness-of-fit measure, such as the smallest value of the AIC or the BIC. There are 2^p
possible models if there are p possible features, so this method is feasible only if p is relatively small (for example, less than
30). In large-scale problems, iterative algorithms can be used to find an optimal model. For example, in stepwise
methods, suppose we have a chosen subset of features at one iteration; the following iteration considers the effect
on the goodness-of-fit measure of including any one of the features that are currently not used or removing one
feature that is currently used. The algorithm would then choose the change that leads to the largest improvement in
the measure of goodness of fit. Randazzo and Kinney (2015) use a logistic regression to identify likely frequent visitors
to the emergency room. While they do not specify the variable selection approach, they do highlight the top 20
predictors in order of their contribution to the model.
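The contrast between exhaustive best-subset search (2^p model fits) and greedy stepwise search can be sketched as follows. The `aic` function below is a hypothetical stand-in for a fitted model's AIC, not a real model fit; it simply rewards two "useful" features and charges a complexity penalty per included feature, mimicking AIC's 2k term.

```python
from itertools import combinations

def aic(subset):
    # Hypothetical score, lower is better: pretend features 0 and 2 are
    # informative, and charge a penalty of 2 per included feature (AIC's 2k term).
    fit = -10 * len({0, 2} & set(subset))
    return fit + 2 * len(subset)

p = 4

# Best-subset selection: score all 2^p possible subsets.
all_subsets = [c for k in range(p + 1) for c in combinations(range(p), k)]
best_subset = min(all_subsets, key=aic)

# Forward stepwise selection: at each iteration, add the unused feature that
# most improves the score; stop when no addition improves it.
current, improved = set(), True
while improved:
    improved = False
    candidates = [current | {j} for j in range(p) if j not in current]
    best = min(candidates, key=lambda s: aic(s), default=None)
    if best is not None and aic(best) < aic(current):
        current, improved = best, True

print(len(all_subsets))      # 16 = 2^4 models scored by best subset
print(sorted(best_subset))   # [0, 2]
print(sorted(current))       # [0, 2] -- stepwise agrees here, but need not in general
```

Best subset scores all 16 models, while stepwise scores at most p models per iteration; for large p only the greedy search remains feasible, at the cost of possibly missing the global optimum.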
Many approaches to feature selection concentrate on finding a single model. Bayesian approaches offer the
opportunity to combine all possible models in the analysis. Predictions can be made by combining the predictions for
each model with weights determined by the fit of each model to the data. In the Bayesian approach, a prior
distribution is defined for all parameters, and we are able to use the decision to include or exclude a feature as a
parameter. We define γ_i to be 1 if the ith feature is included and 0 if it is not included, and γ = (γ_1, …, γ_p) is a vector
of these indicators for all possible features. Then, conditional on the included features, a prior distribution can be specified for the
regression coefficients for the included features, β_γ, and other parameters, θ. Although the prior distributions can be
chosen using expert information, they are usually chosen to lead to a suitably simple model. A typical setup is that
γ_1, …, γ_p are independent and γ_i ~ Bernoulli(q). This implies that the prior expected number of included features is pq, and this
can be used to choose a value of q that leads to a suitable prior mean for the number of included features.
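This prior calibration can be checked with a short simulation. The values of p and q below are illustrative only, chosen to match the "less than 30 features" scale discussed above.

```python
import random

# Indicator prior sketch: gamma_i ~ Bernoulli(q), independently, i = 1..p.
p, q = 30, 0.1          # hypothetical values: 30 candidate features, 10% prior inclusion
prior_mean = p * q      # prior expected number of included features

# Monte Carlo check: draw many gamma vectors and average the number of 1s.
random.seed(1)
draws = [sum(random.random() < q for _ in range(p)) for _ in range(20000)]
mc_mean = sum(draws) / len(draws)

print(prior_mean)  # 3.0
```

Solving prior_mean = p * q for q is how an analyst would pick q to match a desired prior model size.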
Once a prior distribution has been chosen, Markov chain Monte Carlo (MCMC) algorithms or variational Bayes
algorithms (Carbonetto et al. 2017) can be used to calculate the posterior distribution of sets of included features.
This allows a prediction to be calculated by averaging predictions from the different sets of included features weighted
by their respective posterior probabilities.
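The averaging step itself is a weighted sum. In the sketch below, both the posterior probabilities of the feature subsets and each subset's prediction are hypothetical numbers, standing in for quantities an MCMC or variational run would produce.

```python
# Bayesian model averaging sketch. Keys are included-feature subsets (gamma),
# values are hypothetical posterior probabilities P(gamma | data) and
# per-model predictions E[y_new | gamma, data].
posterior = {(0,): 0.5, (0, 2): 0.3, (2,): 0.2}
prediction = {(0,): 1.0, (0, 2): 1.4, (2,): 0.8}

# Average the per-model predictions, weighted by posterior model probability.
bma_prediction = sum(posterior[g] * prediction[g] for g in posterior)

print(round(bma_prediction, 3))  # 0.5*1.0 + 0.3*1.4 + 0.2*0.8 = 1.08
```

Unlike selecting a single best model, this prediction reflects the uncertainty about which features belong in the model.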
Nonlinear Classification and Regression Models
Generalized linear regression models assume that the relationship between the features and the response is linear
(usually, after some transformation of the effect of the features through a link function). In particular, this restricts
the expected response to be either increasing or decreasing as a function of each feature. Generalized linear
regression models can be made more flexible by including powers of features or interactions between features.
However, if the number of features is large, this approach can lead to models with huge numbers of parameters,
which makes estimation challenging. These difficulties with generalized linear models have led to interest in nonlinear
classification and regression models, which allow more general relationships between the features and the expected
response, including interactions. In this section, we will first review the popular classification and regression tree
(CART) approach before considering developments of this approach into so-called ensemble methods that include
random forests and a Bayesian alternative, Bayesian additive regression trees (BART).
Classification and Regression Trees (CART)
In GLMs, a single model is used for all values of the features. In contrast, a popular approach to nonlinear models
assumes that different models are used for different combinations of the features. For example, if we have two
features, age and sex, we might be able to build separate models for men over 60, men under 60, women over 50