Generalized Additive Models

Recall the form of a generalized linear model:

g(E[Y|X_1, X_2, ..., X_p]) = \beta_0 + \beta_1 X_1 + ... + \beta_p X_p

where the p predictors/variables X_1, X_2, ..., X_p have been selected to describe the phenomenon of interest and g() is a link function. Here we assume that the (link-transformed) expected response is explained by a linear combination of the independent variables. However, there is no theoretical justification why the dependent variable should have a linear relationship with the independent variables X_1, X_2, ..., X_p. This oversimplification may distort the predictions when the true relationship between g() and a given variable X_i is non-linear. The distortion can be reduced by assuming a more flexible relationship between g(E[Y|X_1, X_2, ..., X_p]) and X_1, X_2, ..., X_p. In fact, when we have no reason to believe the relationship is truly linear, it is logical to move away from linear functions and model this dependence nonparametrically. This motivates the definition of generalized additive models (GAMs).
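As a concrete illustration of the GLM form above, consider a Poisson regression with log link fitted in base R. This is a sketch on simulated data; all variable names and coefficient values are made up for the example.

```r
# Sketch of the GLM form above: a Poisson regression with log link
# on simulated data (variable names and coefficients are illustrative).
set.seed(1)
n  <- 200
x1 <- runif(n)
x2 <- runif(n)
mu <- exp(0.5 + 1.2 * x1 - 0.8 * x2)  # E[Y|X] = g^{-1}(beta_0 + beta_1 x1 + beta_2 x2)
y  <- rpois(n, mu)

fit <- glm(y ~ x1 + x2, family = poisson(link = "log"))
coef(fit)  # estimates of beta_0, beta_1, beta_2
```

Note that each covariate enters the linear predictor exactly as-is; this is the restriction that GAMs relax.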

Generalized additive models allow a non-linear relationship between the response variable Y and the explanatory variables. Specifically, a GAM assumes the model:

g(E[Y|X_1, X_2, ..., X_p]) = s_0 + s_1(X_1) + ... + s_p(X_p)

where the s_j(X_j)'s are smooth functions and s_0 is an intercept. By choosing s_j(X_j) := \beta_j X_j we can still include linear terms in the model. Fitting the model requires a specification of the smooth terms and of the fitting algorithm. For example, the smooth terms can be expressed as cubic regression splines, thin plate regression splines, P-splines, spherical splines, and so on. Similarly, parameter estimation can be achieved via generalized cross validation, maximum likelihood estimation, restricted maximum likelihood, etc.
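These choices can be sketched with R's mgcv package, where the bs argument of s() selects the basis for each smooth term and the method argument selects the fitting criterion. The data below are simulated purely for illustration.

```r
# Sketch: choosing smooth bases and a fitting criterion in mgcv
# (mgcv ships with R; the data here are simulated and illustrative).
library(mgcv)

set.seed(1)
n <- 200
x <- runif(n)
y <- sin(2 * pi * x) + rnorm(n, sd = 0.3)  # a non-linear signal plus noise

# bs= selects the basis of each smooth term:
fit_tp <- gam(y ~ s(x, bs = "tp"), method = "GCV.Cp")  # thin plate regression spline
fit_cr <- gam(y ~ s(x, bs = "cr"), method = "GCV.Cp")  # cubic regression spline
fit_ps <- gam(y ~ s(x, bs = "ps"), method = "REML")    # P-spline, fitted by REML instead

# A purely linear term is obtained by leaving the covariate outside s():
fit_lin <- gam(y ~ x, method = "GCV.Cp")
```

The formula interface keeps the additive structure explicit: each s() term corresponds to one s_j(X_j) in the model above.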

To construct an example, we will attempt to estimate the salary of a professor using time characteristics. This allows the model to capture the dynamics of salary as explained by duration effects, especially when these relationships happen to be non-linear.

For our example, we select thin plate regression splines for the s_j because they are low rank (using far fewer coefficients than there are data points) and can be constructed as the optimal smoother of any given basis rank (Wood, 2003). Since thin plate regression splines are defined via a spectral (eigen-)decomposition of a full thin plate spline, they only require a choice of basis rank and do not require any specification of knot locations. Broadly speaking, penalized thin plate splines tend to provide the best mean squared error performance among the smooth term options available in R. The fitting criterion selected for smoothing parameter estimation is generalized cross validation, which guards against over-fitting by approximating leave-one-out cross validation without explicitly refitting the model on held-out subsets.
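Concretely, in the Gaussian case the GCV score being minimized can be written (following Wood's formulation, with A denoting the influence, or "hat", matrix of the fitted model) as:

V_g = n \|y - Ay\|^2 / [n - tr(A)]^2

Smoothing parameters are chosen to minimize V_g, which trades goodness of fit in the numerator against the effective degrees of freedom tr(A) in the denominator.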

Mathematically, this example will be given by:

S = s_0 + s_1(X_1) + s_2(X_2)

where S is salary, X_1 is the number of years since the professor obtained their PhD, and X_2 is their years of service.

This can be programmed in R using the mgcv package. For example, here we regress the salary of professors against years since they obtained their PhD and their years in service.

library(car)   #provides the Salaries data
library(mgcv)  #GAM package
data(Salaries) #import data
gam_form <- as.formula("salary ~ s(yrs.since.phd) + s(yrs.service)") #define model
gam_model <- gam(gam_form, data=Salaries, method="GCV.Cp") #fit the GAM via GCV
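Once the model has been fitted, its output can be inspected and the smooths plotted. The sketch below refits the same model so that it is self-contained; it assumes the car and mgcv packages are installed.

```r
# Sketch: refit the salary GAM and inspect the estimated smooths
# (assumes the car and mgcv packages are installed).
library(mgcv)
library(car)  # provides the Salaries data

data(Salaries)
gam_model <- gam(salary ~ s(yrs.since.phd) + s(yrs.service),
                 data = Salaries, method = "GCV.Cp")

summary(gam_model)          # edf > 1 for a smooth suggests a non-linear effect
plot(gam_model, pages = 1)  # partial effect plot of each smooth on one page
```

The effective degrees of freedom (edf) reported by summary() give a quick check on linearity: an edf near 1 means the penalized fit has collapsed that smooth to (roughly) a straight line.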

[Figure: plot(gam_model) output — estimated smooth effects of years since PhD and years of service on salary]

As the plots above show, the GAM regression captures the non-linear effects of both predictors.

In summary, GAMs are a practical class of models for capturing seemingly non-linear associations between explanatory variables and the link-transformed response, here salary. By placing fewer restrictions on the model structure than a linear model does, the predicted salary can come closer to the true value, which is our motivation for implementing them.