
Abstract

In insurance analytics, it is now common to utilize generalized linear models (GLMs) of outcomes such as automobile and homeowners claims for pricing and other purposes. More recently, analysts have explored time-to-event tools, such as logistic regression modeling of lapse, to understand the duration that a customer will stay with a company, a key driver of long-run profitability.

This tutorial and the companion paper consider a joint model of insurance claims and lapsation. For example, if a policyholder is aggressive or a risk seeker (or just careless), then we would expect that customer to have both large auto and large homeowners claims. As another example, if a policyholder has an auto claim during the year, then we might think that this outcome is related to the decision to lapse (or its converse, renew) an insurance contract.

Using simulated data, this tutorial shows how to estimate two types of claims (auto and homeowners) using familiar generalized linear models that employ the Tweedie distribution. Logistic regression is used for the lapse model. The novel aspect is that we specify their joint behavior through a copula. Estimation is done using both an established composite likelihood approach and a new (in this context) generalized method of moments technique. The models and estimation techniques are built into this demonstration; they are not required knowledge for reviewing and interacting with this tutorial.

1 Background

Consider the case where we follow policyholders over time. During the year, there are three outcome variables of interest. The claims outcomes are Y_1, the amount of auto claims, and Y_2, the amount of homeowners claims.

Naturally, these can be claims from any coverage - we use auto and homeowners for illustration purposes. As claims outcomes, these variables may take on a value of zero (representing no claim) and are otherwise positive continuous outcomes (representing the claim amount). We use subscripts i to distinguish among policyholders and t to distinguish observations over time. Thus, Y_{1,it} and Y_{2,it} represent auto and homeowners claims for the ith policyholder at time t.

The third random variable, L, is a binary variable that represents a policyholder’s decision to lapse the policy. Specifically, L_{it} = 1 if the ith policyholder lapses at time t, and L_{it} = 0 otherwise.

Note that if L_{it}=1 then we do not observe the policy at time t. In the same way, if L_{it}=0, then we observe the policy at time t, subject to limitations on the number of time periods available. We use m to represent the maximal number of observations over time.

Associated with each policyholder is a set of (possibly time varying) rating variables \mathbf{x}_{it} for the ith policyholder at time t that is described in Section 2.1.2. We represent the marginal distribution of each outcome variable in terms of a generalized linear model. Specifically, following standard insurance industry practice, we represent the marginal distributions of the claims random variables using a Tweedie distribution so that the distributions have a mass at zero and are otherwise positive. The marginal distribution of the lapse variable is modeled using a logit function. Marginal distributions may use common rating variables and so are naturally related in this sense.

The dependence among lapse and claims outcomes is captured using a copula function. That is, we allow outcomes from the same time period (and the same policyholder) to depend on one another. This specification permits, for example, large claims to influence (directly) the tendency to lapse a policy or (indirectly) a latent variable to simultaneously influence both lapse and claims outcomes. Lapsation dictates the availability of data in a way that may be related to the outcomes, a violation of the statistical assumption known as missing at random. This means that analyzing claims while ignoring lapse can lead to biased estimation. Thus, joint modeling of lapse and claims is critical because the claims model depends on the data observed through the lapsation/renewal process.

This tutorial is interactive in two ways. First, if you are viewing the .html file in a web browser, you will be able to reveal R code and output by clicking on the text. For example, here is a list of the R packages needed to run this tutorial.

Click Here to Show R Code for a List of Packages
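The package list itself is hidden behind the link above. As a rough sketch, a tutorial of this type relies on packages along the following lines; the exact set loaded in the .Rmd may differ.

```r
# Illustrative package list -- the actual .Rmd may load additional packages.
library(tweedie)   # dtweedie/ptweedie/qtweedie for the claims margins
library(statmod)   # tweedie() family object for glm()
library(mvtnorm)   # multivariate normal draws for the Gaussian copula
library(copula)    # normalCopula() and dCopula() for dependence modeling
```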

Second, you may change selected input parameters so that the tutorial reflects a business situation of interest to you. This is accomplished by making changes to the companion .Rmd file that is written in R’s version of markdown, known as R Markdown. The code that includes the label Settings contains input parameters that you can readily change. For example, the code for specifying the number of policyholders is given as follows.

R Code for Data Structure Parameter Settings
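As a sketch, the settings block might resemble the following; the variable names here are hypothetical and the labels in the actual .Rmd may differ.

```r
# Hypothetical settings block -- names and values are illustrative.
n.policy <- 10000   # number of policyholders
n.years  <- 5       # maximal number of years (m) each policyholder is followed
```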

From this input, for this demonstration we have 10^{4} policyholders followed for a maximum of m = 5 years.

In the following, this tutorial is split into five additional sections.

  1. Data Generation Process. Claims follow a Tweedie model with two explanatory/rating variables. Retention/Lapse follows a logistic model with a covariate (time). The user can set the sample size, correlation among outcome variables, mean claim size, the proportion of zero claims (no claim), and mean retention.
  2. Summarize the Data. Ignoring that the data are simulated, this section provides basic summary statistics to understand data features.
  3. Fit the Marginal Distributions. This section provides the usual regression model fitting using Tweedie distributions for claims and a logistic model for renewal. Residuals from the model fits are calculated and used to display patterns among outcomes that are not accounted for in the marginal models.
  4. Joint Model Specification. A copula (Gaussian) model is specified to accommodate dependencies. This model is fit using composite maximum likelihood, as is available from the literature.
  5. Estimation Using Generalized Method of Moments. This procedure is novel in the copula context.

In addition, two appendices provide an overview of the theory that underpins the joint model and estimation procedures.

2 Data Generation Process

Because the data for this tutorial are simulated, users can conduct sensitivity tests to see how the estimation techniques behave as the data varies. For example, how large a data set is needed for reliable inference? How does the required sample size depend on the strength of the dependencies or what if they follow a different pattern? Users can interactively address these and other questions with access to this tutorial. Moreover, simulated data allows us to readily share the tutorial. In contrast, for real data, restrictions often apply.

2.1 Marginal Models

2.1.1 Marginal Outcome Models

As outlined in Section 1 (Background), we represent the marginal distributions of the claims random variables using a Tweedie distribution so that the distributions have a mass at zero and are otherwise positive. For each claim variable, we use a logarithmic link to form the mean claims of the form \mu_{j,it} = \exp\left(\mathbf{x}_{it}^{\prime} \boldsymbol \beta_j\right), j=1,2 . Thus, the parameters are allowed to differ between auto and homeowners claims. As a consequence, you need not use the same variables for each claim type (a zero beta means that the variable is not part of the mean). Each claim is simulated using the Tweedie distribution, a mean, and two other parameters, \phi (for dispersion) and P (the power parameter).

Recall, for a Tweedie distribution, that the variance is \phi_j \mu^{P}. For this tutorial, we use P=1.67 for both auto and home based on our experiences analyzing real insurance data sets. In the Tweedie model, the probability of a zero claim is e^{-\lambda}, where \lambda = \mu^{2-P} /\left(\phi(2-P)\right). So, if we use \mu = 1000, then the probability of a zero claim is \exp\left[-1000^{0.33}/(0.33\,\phi)\right]. For example, by selecting \phi=42, the probability of a zero claim is 49.4%.
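This arithmetic is easy to verify directly in R; here is a minimal check of the zero-claim probability quoted above.

```r
# Probability of a zero claim under the Tweedie model.
P      <- 1.67                          # power parameter
mu     <- 1000                          # mean claim
phi    <- 42                            # dispersion
lambda <- mu^(2 - P) / (phi * (2 - P))  # Poisson rate in the Tweedie mixture
exp(-lambda)                            # approx 0.494, as quoted above
```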

Interactive users have an opportunity to change the overall mean as well as dispersion parameters for each claim type.

R Code for Marginal Claim Parameter Settings

For the lapse variable, the expected value is of the form \pi_{it} = \frac{\exp\left(\mathbf{x}_{it}^{\prime} \boldsymbol \beta_L\right)}{1+\exp\left(\mathbf{x}_{it}^{\prime} \boldsymbol \beta_L\right)}, a common form for the logit model. Interactive users have an opportunity to change the overall mean lapse parameter.

R Code for Marginal Lapse Parameter Settings

2.1.2 Rating Variables

For this tutorial, we have five rating (explanatory) variables:

  • x_1, a binary variable that takes on values 1 or 2 depending on whether or not an attribute holds

  • x_2, x_3, x_4, generic continuous explanatory variables

  • x_5, time trend (x_{5,it} = t)

Interactive users have an opportunity to change the covariate parameters.

R Code for Covariate Parameters Settings

With these values of covariate parameters, the systematic components are

  • auto: \mathbf{x}_{it}^{\prime} \boldsymbol \beta_1 = \beta_{0,1} + 0.2 x_1 + 2 x_2

  • home: \mathbf{x}_{it}^{\prime} \boldsymbol \beta_2 = \beta_{0,2} + 0.3 x_3 + 3 x_4

  • lapse: \mathbf{x}_{it}^{\prime} \boldsymbol \beta_L = \beta_{0,L} + 2 x_2 - 0.25 x_5 .

Intercept parameters are determined using the overall mean terms specified above.

We specify a negative coefficient (-0.25) associated with the time trend variable x_5 to reflect the fact that lapse probabilities tend to decrease with policyholder duration.

The following gives the code to generate the rating variables (covariates).

R Code for Generating Covariates
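The generating code is hidden above. As a minimal sketch, covariate generation might look like the following; the distributions for x_2, x_3, and x_4, the seed, and the object names are assumptions for illustration only.

```r
set.seed(2019)                          # assumed seed, for reproducibility
n <- 10000; m <- 5
x1 <- sample(1:2, n, replace = TRUE)    # binary attribute taking values 1 or 2
x2 <- runif(n); x3 <- runif(n); x4 <- runif(n)  # generic continuous covariates
dat <- expand.grid(id = 1:n, year = 1:m)        # long-format panel
dat$x1 <- x1[dat$id]; dat$x2 <- x2[dat$id]
dat$x3 <- x3[dat$id]; dat$x4 <- x4[dat$id]
dat$x5 <- dat$year                              # time trend, x_{5,it} = t
```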

2.2 Dependence Model

Dependence among outcome variables is taken to be a Gaussian copula with the following structure \boldsymbol \Sigma = \left( \begin{array}{ccc} 1 & \rho_{LA} & \rho_{LH} \\ \rho_{LA} & 1 & \rho_{AH} \\ \rho_{LH} & \rho_{AH} & 1 \\ \end{array} \right) . Interactive users have an opportunity to change the dependence parameters.

R Code for Dependence Model Parameter Settings

For this tutorial, we have the values \rho_{LA}= 0.4, \rho_{LH}= 0.4, and \rho_{AH}= 0.1. Here, we use positive values for the association between lapse and claims; large claims are associated with a higher tendency to lapse (equivalently, lower renewal). We also use a positive value for the association between the two claim types.

The following R code shows how to simulate dependent outcomes.

R Code for Simulating Dependent Outcomes
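The essential idea is to draw correlated normal variates, transform them to uniforms, and then apply the marginal quantile (inverse distribution) functions. Here is a minimal sketch for a single period; the marginal means and dispersions below are illustrative placeholders, as in practice they come from \mathbf{x}_{it}^{\prime} \boldsymbol \beta_j.

```r
library(mvtnorm)   # rmvnorm
library(tweedie)   # qtweedie

set.seed(2019)
n <- 10000
pi.it  <- 0.10                 # illustrative lapse probability
mu1.it <- 1000; phi1 <- 42     # illustrative auto mean and dispersion
mu2.it <- 1000; phi2 <- 42     # illustrative home mean and dispersion

Sigma <- matrix(c(1.0, 0.4, 0.4,     # (lapse, auto, home) copula correlation
                  0.4, 1.0, 0.1,
                  0.4, 0.1, 1.0), 3, 3)
z <- rmvnorm(n, sigma = Sigma)  # correlated standard normals
u <- pnorm(z)                   # Gaussian copula: uniform(0,1) margins

lapse <- as.numeric(u[, 1] > 1 - pi.it)  # binary lapse indicator
y1 <- qtweedie(u[, 2], power = 1.67, mu = mu1.it, phi = phi1)  # auto claims
y2 <- qtweedie(u[, 3], power = 1.67, mu = mu2.it, phi = phi2)  # home claims
```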

3 Summarize the Data

This section demonstrates some basic techniques to look at the data. Although the data are simulated, we use the same techniques to understand this sample as if it represented real data.

3.1 Basic Summary Statistics

From the panel of 10^{4} policyholders observed over 5 years, there are N = 40776 potential observations in the data set. Remember that a policyholder who lapses is not observed in subsequent periods, so we have fewer than the 5 \times 10^{4} = 50{,}000 observations that one would expect with a balanced panel. Of these potential observations, there were 3491 lapses, for a lapse rate of 8.56%.

For type 1 (auto) claims, we have 3517 positive claims out of 5 \times 10^{4}, or about 7.03%. For type 2 (home) claims, we have 4366 positive claims out of 5 \times 10^{4}, or about 8.73%.

As is common, we begin by examining basic measures that summarize the distribution of each variable, initially focusing on continuous outcomes.

R Code for Summarizing Continuous Variables
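For example, assuming the simulated outcomes are stored in a data frame dat (a hypothetical name), a quick look at the auto claims might be:

```r
summary(dat$y1)               # all auto claims, including the zeros
summary(dat$y1[dat$y1 > 0])   # positive auto claims only
sd(dat$y1)                    # overall standard deviation
```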

Tables can be used to summarize discrete variables.

R Code for Summarizing Discrete Variables

Graphical summaries, such as scatter plots, can be used to demonstrate relationships among variables.

R Code for Basic Scatter Plots

The structure of our problem set-up is complicated. We have three outcome variables, several rating variables, and observe a cross-section of policyholders over time. Analysts encountering data with this structure typically do a far more complete analysis of the basic summary statistics than presented here. The purpose of this section is just to provide a taste of the type of analyses needed at this step. We assume that readers are familiar with these tasks and so proceed to more interesting steps.

3.2 Outcomes by Year

You should examine the distribution of outcomes, auto and home claims, as well as lapse, over time. After all, the whole point is to think about how the availability of observations, as dictated by lapse/renewal, impacts the claims distribution.

For this tutorial, by design the distribution of the claims frequency and severity is fairly stable over time. The largest type 1 (auto) claim is 888.27 and the largest type 2 (home) claim is 2366.46, both in thousands.

Claims Summary by Year

                            2010     2011     2012     2013     2014
Number of Potential Obs    10000     8789     7904     7264     6819
Number of Lapses            1211      885      640      445      310
Number of Claims 1           845      799      686      605      582
Average Claim 1          4448.52  5462.49  4849.93  4585.87  4430.28
Number of Claims 2          1064      962      828      785      727
Average Claim 2          9487.35  9593.35  9886.32  9234.29 10644.43
R Code for Summarizing Outcomes by Year
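One way to construct such a summary, assuming a data frame dat with columns year, lapse, y1, and y2 (hypothetical names), is sketched below.

```r
# By-year summary of lapses and claims.
by.year <- function(v, f) tapply(v, dat$year, f)
rbind(
  "Number of Potential Obs" = by.year(dat$y1, length),
  "Number of Lapses"        = by.year(dat$lapse, sum),
  "Number of Claims 1"      = by.year(dat$y1 > 0, sum),
  "Average Claim 1"         = by.year(dat$y1, function(y) mean(y[y > 0])),
  "Number of Claims 2"      = by.year(dat$y2 > 0, sum),
  "Average Claim 2"         = by.year(dat$y2, function(y) mean(y[y > 0]))
)
```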

3.3 Dependence Summary Statistics

Of particular interest is the relationship among outcome variables. First, we take a look at the number of observations where a policy has:

  • Type 1: neither an auto nor a home claim

  • Type 2: an auto but not a home claim

  • Type 3: not an auto but a home claim

  • Type 4: both an auto and a home claim

This frequency distribution is given for our simulated data in the table below, followed by the R code that generated the table.

Types
    1     2     3     4 
33371  3039  3888   478 
R Code for Summarizing Dependent Claims Numbers
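A compact way to build this four-category indicator, under the same hypothetical data frame dat, is:

```r
# 1 = neither, 2 = auto only, 3 = home only, 4 = both
type <- 1 + (dat$y1 > 0) + 2 * (dat$y2 > 0)
table(type)
```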

We can also summarize relationships among outcome variables using association measures such as correlations. However, our claims variables are a hybrid combination of zeros (for no claims) and long-tailed continuous variables (for positive claim amounts). Although this feature is captured by the Tweedie distribution, it can sometimes be difficult to establish dependence with basic summary statistics. Depending on parameter values, there can be many zeros (74.52% for this data set) and when positive, claims distributions tend to be right skewed. Here is some code that provides Spearman correlations, a nonparametric correlation coefficient.

As you experiment with different parameter values, you will find that the more zeros in the data, the more difficult it is to establish dependence with basic techniques. This is interesting because we know, from generating the data, that important dependencies exist. Recall that the known theoretical association measures are \rho_{LA}= 0.4, \rho_{LH}= 0.4, and \rho_{AH}= 0.1.

Spearman Outcome Correlations

          Lapse  Claims 1  Claims 2
Lapse     1.000     0.209     0.146
Claims 1  0.209     1.000     0.029
Claims 2  0.146     0.029     1.000
R Code for Summarizing Dependent Claims Associations
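A sketch of the correlation calculation, again assuming the hypothetical data frame dat:

```r
# Spearman (rank) correlations among the three outcome variables.
round(cor(dat[, c("lapse", "y1", "y2")], method = "spearman"), 3)
```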

When summarizing the data, it is sometimes convenient to work in terms of lapse, as this is the event that is of interest to insurers. However, going forward, we work with its complement, renewal ( = one minus lapse). This is slightly more convenient mathematically in that we condition on a policy renewing to examine the claims distribution in subsequent periods.

4 Fit the Marginal Distributions

After careful work to summarize the data (only a small portion is shown here), the next step is to fit marginal models. In this context, the descriptor marginal means analyzing each outcome without reference to the others. In subsequent sections, we join the marginal models via the copula.

Marginal model estimation is typically done assuming that each year has the same set of parameters and that observations from different years are independent. This is not necessary but provides a convenient starting point.

4.1 Logistic (Lapse) Regression Results

To model lapse, we employ a simple logistic regression (marginal) model.

Logistic Lapse Model Summary

             True Value  Estimate  Std. Error  z value  Pr(>|z|)
(Intercept)      -6.193    -5.893       0.104  -56.426         0
x2                2.000     1.896       0.041   46.736         0
year             -0.250    -0.277       0.014  -19.991         0

An advantage of using simulated data is that the underlying theoretical relationships are known. For example, the true coefficient associated with x_2 is \beta_{L1} = 2, which can be readily compared to that estimated from the simulated data, \hat{\beta}_{L1} = 1.896. When examining the proximity of these two quantities, we can also look to the standard error, 0.041.

R Code for Estimating the Logistic Renewal Marginal Model
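A minimal sketch of the marginal lapse fit, assuming the hypothetical data frame dat:

```r
# Logistic regression of lapse on x2 and the time trend.
fit.lapse <- glm(lapse ~ x2 + year, family = binomial(link = "logit"),
                 data = dat)
summary(fit.lapse)
```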

4.2 Tweedie (Claims) Regression Results

The Tweedie distribution is commonly used in insurance applications for claims. In part, this is because it can be expressed as a generalized linear model. In the following illustrative code, we have skipped the determination of the power parameter (P = 1.67 for us).

For type 1 (auto) claims model, we have the following:

Tweedie Claims 1 (Auto) Model Summary

             True Value  Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)       3.64      3.807       0.144   26.522     0.000
x1                0.20      0.199       0.058    3.409     0.001
x2                2.00      1.918       0.052   37.209     0.000

The externally specified dispersion parameter \phi is 500. The estimated value of \phi is 488.09.

As noted above, with simulated data, we know the theoretical relationships. For example, the coefficient associated with x_2 is \beta_2 = 2. From the table, the estimated value is 1.918 with a standard error of 0.0515. If the difference between the true parameter and its estimated value is large (relative to the standard error), then one easy solution is to increase the sample size n.

R Code for Estimating the Tweedie Marginal Auto Model
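A minimal sketch of the Tweedie fit using the statmod family object, with the power parameter fixed at 1.67 (data frame and variable names assumed):

```r
library(statmod)   # tweedie() family for glm()

# Tweedie GLM with log link (link.power = 0) and variance power 1.67.
fit.auto <- glm(y1 ~ x1 + x2,
                family = tweedie(var.power = 1.67, link.power = 0),
                data = dat)
summary(fit.auto)

# Moment-type estimate of the dispersion phi.
sum(residuals(fit.auto, type = "pearson")^2) / df.residual(fit.auto)
```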

The type 2 (home) claims model estimation is similar:

Tweedie Claims 2 (Home) Model Summary

             True Value  Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)      4.364      4.194       0.159   26.293         0
x3               0.300      0.346       0.082    4.212         0
x4               3.000      3.052       0.078   39.094         0
R Code for Estimating the Tweedie Marginal Home Model

4.3 Residual Checking

As with all model estimation procedures, it is good standard practice to check model assumptions via an examination of the residuals. For generalized linear models, one typically examines deviance residuals. The following provides an example of a standard set of diagnostic plots based on the deviance residuals.

R Code for Diagnostic Plots for the Tweedie Marginal Auto Model
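For instance, a basic plot of deviance residuals against fitted values for the auto model (using the hypothetical fit.auto object from the sketch above) could be:

```r
# Deviance residuals versus fitted values on the log scale.
dev.res <- residuals(fit.auto, type = "deviance")
plot(log(fitted(fit.auto)), dev.res,
     xlab = "Log fitted values", ylab = "Deviance residuals")
abline(h = 0, lty = 2)
```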

Standard residual plots from the Tweedie model can be difficult to assess due to the mass at zero. Another type of residual can be calculated via the probability integral transform. That is, if Y is a continuous random variable with distribution function F, then F(Y) has a uniform distribution. We can use this relationship to assess the quality of our identification of the distribution. Like the deviance residuals, this relationship breaks down in the presence of mass points, e.g., zeros, but can still be used to supplement the usual diagnostic model checking. We refer to these as Cox-Snell residuals.

R Code for Cox-Snell Residuals from the Probability Integral Transform
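A sketch of the Cox-Snell computation for the auto model, applying the fitted Tweedie distribution function to each outcome (object names and the plugged-in dispersion are assumptions):

```r
library(tweedie)   # ptweedie

# Probability integral transform of the observed claims; with a mass at
# zero these are not exactly uniform but remain useful diagnostics.
cs.auto <- ptweedie(dat$y1, power = 1.67, mu = fitted(fit.auto), phi = 488.09)
hist(cs.auto, breaks = 20, main = "Cox-Snell residuals, auto claims")
```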

Dependence among residuals suggests patterns not accounted for in the marginal modeling. The first matrix shows correlations among residuals computed via the probability integral transform (Cox-Snell); the second shows correlations among deviance residuals. These results are qualitatively similar, which is reassuring: it means that the analyst can use either definition and may select the measure that best fits the audience of the work.

Spearman Correlations of Cox-Snell (CS) Residuals

                       Lapse Resids  Claims 1 Resids  Claims 2 Resids
Lapse Resids (CS)             1.000            0.643            0.041
Claims 1 Resids (CS)          0.643            1.000            0.003
Claims 2 Resids (CS)          0.041            0.003            1.000

Spearman Correlations of Deviance (Dev) Residuals

                       Lapse Resids  Claims 1 Resids  Claims 2 Resids
Lapse Resids (Dev)            1.000            0.612            0.051
Claims 1 Resids (Dev)         0.612            1.000            0.006
Claims 2 Resids (Dev)         0.051            0.006            1.000
R Code for Residual Correlations

5 Joint Model Specification

Estimation procedures are based on a likelihood analysis. With the assumption of conditional independence over time, the joint distribution function is \begin{array}{cl} & \Pr \left(L_{i1} \le r_{1}, \ldots, L_{im} \le r_{m}, Y_{1,i1} \le y_{11}, \ldots, Y_{1,im} \le y_{1m} , Y_{2,i1} \le y_{21}, \ldots, Y_{2,im} \le y_{2m} \right) \\ & \ \ \ \ = \prod_{t=1}^m C\left(F_{Lit}(r_{t}), F_{1,it}(y_{1t}), F_{2,it}(y_{2t}) \right) . \end{array} Here, F_{Lit}, F_{1,it}, and F_{2,it} represent the marginal distributions of L_{it}, Y_{1,it}, and Y_{2,it}, respectively.

Using the so-called inference for margins technique, we first estimate the parameters for the marginal distributions as described in Section 4. With this, the marginal distributions F_{Lit}, F_{1,it}, and F_{2,it} can be taken as given and we only need to estimate the parameters of the copula C.

With the joint distribution given above, we can determine the likelihood. This is a straightforward exercise when all of the outcomes are continuous (you only need to take derivatives to determine the likelihood). For our situation, it is more complex because the lapse variable L is discrete (binary) and the claims outcomes Y_1 and Y_2 are hybrid combinations of discrete (mass at zero) and continuous (for positive claims) components. In Appendix A, you will find a short development of the likelihood function. A more detailed treatment is available in the paper.

It is well known that if the observation process depends on the outcomes then parameter estimation may be biased and inconsistent, and hence unreliable. Depending on the parameter settings, you may observe this in the Tweedie coefficient estimates in Sections 4.2 and 4.3. Note, however, that the lapse coefficients are reliable because their estimation does not depend on other outcomes (auto and home claims).

In order to demonstrate dependence estimation, we assume that reliable estimates of the marginals are available from some source. For example, insurers typically have external data for estimating coefficients of rating variables. Alternatively, one can use first-year claims that are not affected by a lapse/renewal decision. Here is the R code needed to use the true marginal coefficients as inputs for our dependence estimation.

R Code for Setting Marginals to Given Parameters

5.1 R Code for Trivariate Likelihoods

Here is code that shows practical details for evaluating the joint likelihood.

R Code for Trivariate Likelihood Calculation
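The full likelihood must handle the binary lapse margin and the mass at zero, as developed in Appendix A. As a sketch of just the all-continuous building block, the trivariate Gaussian copula density can be evaluated with the copula package:

```r
library(copula)

# Gaussian copula density at probability-integral-transformed outcomes.
# Only the all-continuous contribution is shown; the full likelihood also
# sums/integrates over the binary lapse and the Tweedie mass at zero.
rho <- c(0.4, 0.4, 0.1)                          # (LA, LH, AH)
cop <- normalCopula(rho, dim = 3, dispstr = "un")
u   <- cbind(0.7, 0.6, 0.9)                      # illustrative uniforms
dCopula(u, cop)                                  # copula density contribution
```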

5.2 Visualizing Trivariate Likelihoods

With the margins fixed, the likelihood is a function of three association parameters, \rho_{LA}, \rho_{LH}, and \rho_{AH}. Although it is possible to plot the likelihood as a function of these parameters, we find it more helpful to plot the likelihood as a function of a single parameter, holding the other two fixed.

Logarithmic Likelihood

To interpret these plots, recall that we specified the values to be \rho_{LA}= 0.4, \rho_{LH}= 0.4, and \rho_{AH}= 0.1. When creating these plots, we use the (known) specified values to help understand the likelihood curvature for each parameter.

R Code for Visualizing Trivariate Likelihoods

5.3 Estimation Results Based on Trivariate Likelihoods

We maximized this likelihood and determined standard errors in the usual fashion (by using a numerical approximation of the second derivative). The following provides the results, followed by the code.

Likelihood Estimation Results

             True Value  Estimate  Std Error
Lapse-Auto          0.4    0.4219     0.0191
Lapse-Home          0.4    0.3983     0.0097
Auto-Home           0.1    0.0779     0.0137
R Code for Trivariate Likelihood Estimation
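A sketch of the maximization step, assuming a hypothetical function negloglik that returns the negative composite log-likelihood for rho = (rho_LA, rho_LH, rho_AH):

```r
# Maximize the composite likelihood; standard errors come from the
# numerically approximated Hessian of the negative log-likelihood.
fit <- optim(c(0, 0, 0), negloglik, hessian = TRUE)
est <- fit$par
se  <- sqrt(diag(solve(fit$hessian)))
cbind(Estimate = est, "Std Error" = se)
```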

6 Estimation Using Generalized Method of Moments

In Appendix B, you will find a short development of the details needed for estimation using generalized method of moments, GMM. A more detailed treatment is available in the paper.

6.1 GMM Scores

Here is some code for functions needed later.

R Code for GMM Scores
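As a generic sketch of how such scores feed into a GMM objective (the score function here is a hypothetical placeholder returning one moment vector per observation):

```r
# Quadratic-form GMM objective with the efficient weight matrix.
gmm.obj <- function(rho, score.fn) {
  S    <- score.fn(rho)                    # N x k matrix of per-obs scores
  gbar <- colMeans(S)                      # sample moment conditions
  W    <- solve(crossprod(S) / nrow(S))    # inverse of score covariance
  nrow(S) * drop(t(gbar) %*% W %*% gbar)   # objective to minimize
}
```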

6.2 Visualize the Scores

First, let us see how the sum of scores behaves. Recall that for a score function, the objective is to find those parameters so that the score equals zero. The plots that follow mark the line where the score equals zero.

Sum of Scores

R Code for Visualizing GMM Sum of Scores

For an iterative method such as GMM, one needs starting values. It is possible to use the sum of squared scores for this purpose. If you wish to explore this, click on the link below, which shows the plots and code. However, we will use the likelihood estimators developed in Section 5.

Visualizing GMM Sum of Scores

6.3 Visualize the GMM Score Function

With the initial estimator, we can now calculate the GMM score function. Minimizing this function yields the GMM estimator, together with its asymptotic variance.

GMM Scores

R Code for Visualizing GMM Score Function

6.4 GMM Estimation Results

The following table summarizes the GMM estimation results and compares these results to true parameter values and estimation results from the Section 5 likelihood estimation. The code follows.

GMM Estimation Results

             True Value  Like Est  Like Std Error  GMM Est  GMM Std Error
Lapse-Auto          0.4    0.4219          0.0191   0.4185         0.0191
Lapse-Home          0.4    0.3983          0.0097   0.3976         0.0098
Auto-Home           0.1    0.0779          0.0137   0.0761         0.0136
R Code for GMM Score Function Estimation

7 Appendix A – Joint Model Specification

To keep this tutorial self-contained, here is a short development of the likelihood function. A more detailed treatment is available in the paper where you can also find references to supporting work in the academic literature.

Show Likelihood Appendix

Here is some R code for functions used for the likelihood and GMM scores.

Show R Code for Likelihood

8 Appendix B – Estimation Using Generalized Method of Moments

To keep this tutorial self-contained, here is a short development of the details needed for estimation using generalized method of moments, GMM. A more detailed treatment is available in the paper where you can also find references to supporting work in the academic literature.

Show GMM Appendix


Run Time

Time taken for this tutorial to compile:

Time difference of 2.079783 hours