Recall the form of a generalized linear model:

$$g(E[Y]) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p,$$

where the predictors $x_1, \ldots, x_p$ have been selected to describe the response $Y$ through the link function $g$. Here we assume that the (link-transformed) expected response is explained by a linear combination of the independent variables. However, there is no theoretical reason why the dependent variable $Y$ should have a linear relationship with the independent variables $x_1, \ldots, x_p$. This simplification may distort the predictions if the true relationship between $Y$ and a given variable $x_j$ is non-linear. The distortion can be reduced by allowing a more flexible relationship between $Y$ and $x_j$. When we have no reason to believe the relationship is truly linear, it is natural to move away from linear functions and model this dependence in a nonparametric fashion. This motivates the definition of generalized additive models (GAMs).

Generalized additive models allow a non-linear relationship between the response variable $Y$ and the explanatory variables. A GAM regression assumes the model:

$$g(E[Y]) = \beta_0 + f_1(x_1) + f_2(x_2) + \cdots + f_p(x_p),$$

where the $f_j$'s are smooth functions. By choosing $f_j(x_j) = \beta_j x_j$ we may accept linear terms in the model. Fitting the model requires specifying both the form of the smooth terms and the fitting criterion. For example, the smooth terms can be expressed as cubic regression splines, thin plate regression splines, P-splines, splines on the sphere, and so on. Similarly, the smoothing parameters can be estimated via generalized cross validation, maximum likelihood, restricted maximum likelihood, and other criteria.
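The choices described above map onto arguments of `gam()` in the mgcv package. The following is a minimal sketch on simulated data (the data, variable names, and the sine-shaped true effect are assumptions made purely for illustration): `x1` enters through a smooth term, `x2` through an ordinary linear term, and the `bs` and `method` arguments select the spline basis and the smoothing parameter criterion.

```r
library(mgcv)

set.seed(1)                       # simulated data, purely illustrative
n   <- 200
x1  <- runif(n)
x2  <- runif(n)
y   <- sin(2 * pi * x1) + 0.5 * x2 + rnorm(n, sd = 0.3)
dat <- data.frame(y, x1, x2)

# x1 is modelled by a smooth function f(x1); x2 is kept as a linear term
fit_cr <- gam(y ~ s(x1, bs = "cr") + x2, data = dat, method = "REML")    # cubic regression spline
fit_tp <- gam(y ~ s(x1, bs = "tp") + x2, data = dat, method = "GCV.Cp")  # thin plate regression spline
fit_ps <- gam(y ~ s(x1, bs = "ps") + x2, data = dat, method = "ML")      # P-spline

# effective degrees of freedom of s(x1) under each basis/criterion pair;
# a value near 1 would indicate an essentially linear fit
edf_by_basis <- sapply(list(cr = fit_cr, tp = fit_tp, ps = fit_ps),
                       function(m) summary(m)$edf)
edf_by_basis
```

With a clearly curved true effect, all three fits should report effective degrees of freedom well above 1 for `s(x1)`, while `x2` contributes a single linear coefficient.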

To construct an example, we will attempt to estimate the salary of a professor using time characteristics. This allows the model to capture the dynamics of salary as explained by duration effects, especially when these relationships happen to be non-linear.

For our example, we select thin plate regression splines for the smooth terms $f_j$ because they are low rank relative to a full thin plate spline (which has as many coefficients as there are observations) and can be regarded as the optimal smoother of any given basis rank (Wood, 2003). Since thin plate regression splines are built from a truncated eigendecomposition of the full spline basis, they only require a choice of basis dimension and do not require any specification of knot locations. Broadly speaking, penalized thin plate regression splines tend to perform well in terms of mean squared error among the smooth term options available in R, and they are the default basis in mgcv. The criterion used for smoothing parameter estimation is restricted maximum likelihood (REML), matching method="REML" in the code below. Generalized cross validation, the mgcv default, is a common alternative; it guards against over-fitting by approximating leave-one-out prediction error, without explicitly partitioning the data.
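To make these choices concrete, here is a hedged sketch that spells out the thin plate basis and compares the two smoothing criteria. It assumes the carData package (which supplies the `Salaries` data used in this example) is installed; the basis dimension `k = 10` is simply mgcv's default for a one-dimensional smooth written out explicitly, not a tuned value.

```r
library(mgcv)
data(Salaries, package = "carData")   # assumes carData is installed

# bs = "tp" selects thin plate regression splines; k is the basis dimension.
# No knot locations are supplied.
form <- salary ~ s(yrs.since.phd, bs = "tp", k = 10) +
                 s(yrs.service,   bs = "tp", k = 10)

fit_reml <- gam(form, data = Salaries, method = "REML")    # restricted maximum likelihood
fit_gcv  <- gam(form, data = Salaries, method = "GCV.Cp")  # generalized cross validation

# effective degrees of freedom chosen for each smooth under each criterion
edf_cmp <- rbind(REML = summary(fit_reml)$edf,
                 GCV  = summary(fit_gcv)$edf)
colnames(edf_cmp) <- c("s(yrs.since.phd)", "s(yrs.service)")
edf_cmp
```

If the two rows differ noticeably, the amount of smoothing is sensitive to the criterion, which is worth knowing before interpreting the fitted curves.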

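Mathematically, this example is given by:

$$g(E[Y]) = \beta_0 + f_1(x_1) + f_2(x_2),$$

where $Y$ is salary, $x_1$ is the number of years since the professor obtained their PhD, and $x_2$ is their years in service. The code below uses the default Gaussian family, so $g$ is the identity link.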

This can be programmed in R using the mgcv package. For example, here we regress the salary of professors against years since they obtained their PhD and their years in service.


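library(car)    # loads carData, which provides the Salaries data set
library(mgcv)   # GAM package
data(Salaries)  # import data
gam_form <- as.formula("salary ~ s(yrs.since.phd) + s(yrs.service)")  # define model; s() defaults to thin plate regression splines
gam_model <- gam(gam_form, data = Salaries, method = "REML")  # run GAM regression, smoothing parameters chosen by REML
plot(gam_model, pages = 1)  # plot both estimated smooth terms on one page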



The plots of the estimated smooth terms produced by this code show the non-linear effects captured by the GAM regression.
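Beyond inspecting the plots, one rough way to check whether the smooth terms earn their keep is to compare the GAM with its purely linear counterpart and to look at a prediction. The sketch below is illustrative only: it assumes carData is installed, refits both models with method="ML" so that the AIC comparison is made on the same footing, and uses a hypothetical professor (15 years since PhD, 10 years of service) that does not come from the original example.

```r
library(mgcv)
data(Salaries, package = "carData")   # assumes carData is installed

# same predictors, smooth versus strictly linear terms, both fitted by ML
gam_ml <- gam(salary ~ s(yrs.since.phd) + s(yrs.service),
              data = Salaries, method = "ML")
lin_ml <- gam(salary ~ yrs.since.phd + yrs.service,
              data = Salaries, method = "ML")

AIC(lin_ml, gam_ml)    # lower AIC indicates the preferred model
summary(gam_ml)        # edf near 1 for a term would suggest a linear effect

# predicted salary, with standard error, for a hypothetical professor
new_prof <- data.frame(yrs.since.phd = 15, yrs.service = 10)
pred <- predict(gam_ml, newdata = new_prof, se.fit = TRUE)
pred$fit
pred$se.fit
```

AIC is only one of several possible comparisons; `anova()` on the two fits or out-of-sample error would be reasonable alternatives.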

In summary, GAMs are a practical class of models for capturing non-linear associations between the explanatory variables and the (link-transformed) expected response, which here is salary. By placing fewer restrictions on model structure than a linear model does, a GAM can produce predicted salaries that track the true values more closely when the underlying relationships are non-linear, and this is our motivation for using it.