Beta Regression Modelling

It is known that the distribution of LGD, EAD, FUC, or CCF (see for the definitions of CCF and FUC) when capped between 0 and 1 exhibits a non-normal shape with bimodal peaks (near 0 and 1). Hence, it is common to use a logistic link function to achieve a unimodal distribution close to normal, prior to running a regression.

[Figure: ead_dist]

However, it may be sufficient to transform the target before applying the link function. Dependent variables with support on (0,1), such as proportions or rates, can be modelled using the class of beta distributions (Smithson and Verkuilen, 2006), and this approach has since been applied to LGD modelling (Loterman et al., 2012).

In this post we consider CCF as our target variable for modelling, which has support on [0,1]. We may similarly swap CCF for any of the aforementioned targets, which share similar characteristics with respect to the shape and support of their distributions.

Since our capped CCF values exist on [0,1] and the beta regression requires support on (0,1), we make the following transformation:

CCF_{transform}=(CCF(n-1)+0.5)/n

Where n is the sample size (Smithson and Verkuilen, 2006). The beta distribution is entirely determined by two shape parameters \alpha,\beta>0 which can be estimated using the sample mean \mu and sample variance \sigma^2 of CCF_{transform}. These parameters can be calculated via the expressions:

\alpha=((1-\mu)/\sigma^2 -1/ \mu) \mu^2
\beta=\alpha(1/\mu-1)
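
As a minimal sketch of the two steps above, the R snippet below applies the transformation to a small made-up vector of capped CCF values and then computes the method-of-moments shape parameters. The vector `ccf` and the names `alpha_hat` and `beta_hat` are hypothetical and only illustrate the arithmetic.

```r
# Hypothetical capped CCF values, including the boundary values 0 and 1
ccf <- c(0, 0.1, 0.25, 0.5, 0.8, 1, 1, 0)
n <- length(ccf)

# Transformation from [0, 1] into (0, 1)
ccf_transform <- (ccf * (n - 1) + 0.5) / n
range(ccf_transform)  # smallest value is 0.5 / n, largest is (n - 0.5) / n

# Method-of-moments estimates of the shape parameters
mu <- mean(ccf_transform)
sigma2 <- var(ccf_transform)
alpha_hat <- ((1 - mu) / sigma2 - 1 / mu) * mu^2
beta_hat <- alpha_hat * (1 / mu - 1)

# Sanity check: the implied beta mean alpha / (alpha + beta) reproduces mu
all.equal(alpha_hat / (alpha_hat + beta_hat), mu)
```

Note that the estimates are only positive when the sample variance is smaller than \mu(1-\mu), which holds for the values used here.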

It is convenient to use the alternate parameterization by setting (Ferrari and Cribari-Neto, 2004):

w=\alpha/(\alpha+\beta)

\phi= \alpha + \beta

So that CCF_{transform} \sim Beta(w,\phi), where w is the mean and \phi acts as a precision parameter. Hence, for a random sample such that CCF_{transform,i} \sim Beta(w_i,\phi), the beta regression model is defined as

g(w_i)=\gamma_0+x_i^T\gamma

Where \gamma=(\gamma_1,...,\gamma_k)^T is a k \times 1 vector of unknown regression coefficients, \gamma_0 is an intercept, x_i=(x_{i1},...,x_{ik})^T is a vector of k risk drivers, and g(\cdot): (0,1) \rightarrow \mathbb{R} is a link function. For our purposes we let g(\cdot) be the logit function \log(w_i/(1-w_i)) so that:

\log(w_i/(1-w_i ))=\gamma_0+\sum_{j=1}^k x_{ij} \gamma_j
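
The mean-precision parameterization and the logit link can be checked numerically. The sketch below uses made-up shape parameters, coefficients, and risk drivers (all hypothetical) and relies only on base R, where `qlogis` is the logit and `plogis` is its inverse.

```r
# Hypothetical shape parameters
alpha <- 2
beta_shape <- 6

# Mean-precision parameterization
w <- alpha / (alpha + beta_shape)  # mean
phi <- alpha + beta_shape          # precision

# The shapes are recovered as alpha = w * phi and beta = (1 - w) * phi
stopifnot(isTRUE(all.equal(w * phi, alpha)),
          isTRUE(all.equal((1 - w) * phi, beta_shape)))

# Logit link and its inverse
eta <- qlogis(w)       # log(w / (1 - w))
w_back <- plogis(eta)  # 1 / (1 + exp(-eta))

# Hypothetical linear predictor with an intercept and two risk drivers
gamma0 <- -1
gamma <- c(0.5, -0.2)
x_i <- c(1.2, 3.0)
w_i <- plogis(gamma0 + sum(x_i * gamma))  # always lies in (0, 1)
```

Because the inverse logit maps any real number into (0,1), the fitted mean w_i is a valid beta mean whatever the values of the risk drivers.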

Rather than beta-transforming CCF_{transform} and performing an OLS regression, as in Gupton and Stein (2002), we employ maximum likelihood estimation on a generalized linear model where CCF_{transform} is beta-distributed and the link function is the logit (Cribari-Neto and Zeileis, 2010). This approach makes use of the betareg R package.
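
To make the maximum likelihood step concrete, here is a rough base-R sketch that fits the same kind of model by minimizing the negative beta log-likelihood with `optim`. It is not how betareg is implemented. The data are simulated, and the sample size, coefficients, and precision are all assumptions chosen for illustration.

```r
set.seed(1)

# Simulated data (hypothetical): one risk driver, logit link, constant precision
n <- 1000
x <- rnorm(n)
w_true <- plogis(-0.5 + 0.8 * x)
phi_true <- 20
y <- rbeta(n, shape1 = w_true * phi_true, shape2 = (1 - w_true) * phi_true)

# Negative log-likelihood of Beta(w_i, phi) with logit(w_i) = g0 + g1 * x_i;
# phi is optimized on the log scale so that it stays positive
negloglik <- function(par) {
  w <- plogis(par[1] + par[2] * x)
  phi <- exp(par[3])
  -sum(dbeta(y, shape1 = w * phi, shape2 = (1 - w) * phi, log = TRUE))
}

fit <- optim(c(0, 0, 0), negloglik, method = "BFGS")
estimates <- c(gamma0 = fit$par[1], gamma1 = fit$par[2], phi = exp(fit$par[3]))
estimates
```

With a sample of this size the estimates should land close to the simulated values; in practice betareg is preferable because it also provides standard errors and diagnostics.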

In the betareg example below, we regress the CCF on two independent variables. They are evidently not strong predictors of CCF, but they are chosen for illustration purposes anyway.

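#BETA REGRESSION MODELLING

#PACKAGES
library(sas7bdat)  # read.sas7bdat()
library(betareg)   # betareg()

#GLOBAL VARIABLES
directory <- "C:\\66220\\BETA"
variable_list <- c("Time_to_Default", "Commitment_Size")
target <- "ccf"
target_transform <- paste0(target, "_transform")

#MAIN
setwd(directory)
data <- read.sas7bdat("cohort1.sas7bdat")
n <- nrow(data)
# Move the capped CCF from [0,1] into (0,1), as required by the beta likelihood
data[[target_transform]] <- (data[[target]] * (n - 1) + 0.5) / n

#PLOT
hist(data[[target]], breaks=seq(0,1,by=0.1), xlim=c(0,1), col='cyan', main="CCF Distribution", xlab="CCF", ylab="Frequency", labels=TRUE)

#REGRESSION
# Builds the formula: ccf_transform ~ Time_to_Default + Commitment_Size
model_form <- as.formula(paste(target_transform, "~", paste(variable_list, collapse=" + ")))
# na.rm is kept from the original call; it is not a documented betareg argument,
# and missing values are handled through na.action instead
beta_model <- betareg(model_form, data=data, na.rm=T)
summary(beta_model)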

The summary of our output is:


Call:
betareg(formula = model_form, data = data, na.rm = T)

Standardized weighted residuals 2:
    Min      1Q  Median      3Q     Max 
-1.2188 -0.7586 -0.0776  0.7578  1.2289 

Coefficients (mean model with logit link):
                  Estimate Std. Error z value Pr(>|z|)    
(Intercept)     -8.251e-01  1.385e-01  -5.959 2.54e-09 ***
Time_to_Default  1.036e-01  1.643e-02   6.304 2.90e-10 ***
Commitment_Size  3.037e-05  1.928e-05   1.575    0.115    

Phi coefficients (precision model with identity link):
      Estimate Std. Error z value Pr(>|z|)    
(phi)  0.34591    0.01036   33.39   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Type of estimator: ML (maximum likelihood)
Log-likelihood:  3517 on 4 Df
Pseudo R-squared: 0.05602
Number of iterations: 17 (BFGS) + 1 (Fisher scoring) 
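
To read the coefficients, the sketch below plugs the estimates from the summary above into the inverse logit for a single facility. The driver values (`time_to_default`, `commitment_size`) and their units are hypothetical; only the coefficients and the precision come from the output.

```r
# Coefficients as reported in the summary above
b0 <- -8.251e-01
b_ttd <- 1.036e-01
b_size <- 3.037e-05
phi_hat <- 0.34591

# Hypothetical facility: values and units of the drivers are assumptions
time_to_default <- 6
commitment_size <- 10000

# Linear predictor and fitted mean of ccf_transform (inverse logit)
eta <- b0 + b_ttd * time_to_default + b_size * commitment_size
w_hat <- plogis(eta)
w_hat

# Implied shape parameters under the estimated precision
shapes <- c(alpha = w_hat * phi_hat, beta = (1 - w_hat) * phi_hat)
shapes
```

Since the estimated \phi is below 1, both implied shape parameters are below 1 for any fitted mean, which corresponds to a U-shaped beta density with mass piled up near 0 and 1.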

The regression yields a predicted mean for CCF_{transform}, and hence a predicted CCF, which can be used directly to solve for EAD if desired.
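
As a rough illustration of that last step, the sketch below inverts the transformation to get back to the CCF scale and then computes an EAD. The EAD convention used here (drawn balance plus CCF times the undrawn amount) is an assumption, since definitions of CCF vary, and every input value is hypothetical.

```r
# Invert the transformation: CCF = (CCF_transform * n - 0.5) / (n - 1)
inverse_transform <- function(ccf_transform, n) {
  (ccf_transform * n - 0.5) / (n - 1)
}

# Hypothetical inputs
n <- 5000           # sample size used when fitting (assumed)
ccf_hat_t <- 0.525  # predicted mean on the transformed scale (assumed)
drawn <- 40000      # drawn balance at the reference date (assumed)
limit <- 100000     # committed limit (assumed)

ccf_hat <- inverse_transform(ccf_hat_t, n)

# Assumed convention: EAD = drawn + CCF * (limit - drawn)
ead_hat <- drawn + ccf_hat * (limit - drawn)
ead_hat
```

Because the transformation is linear in CCF, applying its inverse to the predicted mean gives the predicted mean on the original scale.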

Smithson, M., & Verkuilen, J. (2006). A better lemon squeezer? Maximum-likelihood regression with beta-distributed dependent variables. Psychological Methods, 11(1), 54-71. doi:10.1037/1082-989x.11.1.54

Loterman, G., Brown, I., Martens, D., Mues, C., & Baesens, B. (2012). Benchmarking regression algorithms for loss given default modeling. International Journal of Forecasting, 28(1), 161-170.

Ferrari, S. L. P., & Cribari-Neto, F. (2004). Beta regression for modelling rates and proportions. Journal of Applied Statistics, 31(7), 799-815.

Gupton, G. M., & Stein, R. M. (2002). LOSSCALC: model for predicting loss given default (LGD). Tech. rep. Moody’s.

Cribari-Neto, F., & Zeileis, A. (2010). Beta regression in R. Journal of Statistical Software, 34(2), 1-24.