Loss Data Analytics

1.1 Read in Property Fund Data

Insample <- read.csv("Data/PropertyFundInsample.csv", header=T, na.strings=c("."), stringsAsFactors=FALSE)
Insample2010 <- subset(Insample, Year==2010)

A few quick notes on these commands:

read.csv reads a csv file and creates data frame from it, with cases corresponding to rows and variables to columns in the file.
The assignment operator <- is analogous to an equal sign in mathematics. The command Insample <- read.csv("PropertyFundInsample.csv", header=T, na.strings=c("."), stringsAsFactors=FALSE) means we give the name Insample to the data read.
The subset() function is used to select variables and observations. In this illustration, we selected observations from year 2010.

1.2 Claim Frequency Distribution

In 2010 there were 1,110 policyholders in the property fund. Table 1.1 shows the distribution of the 1,377 claims.

1.2.1 Property fund distribution for 2010

Table 1.1

library(pander)
Table<-as.data.frame(table(Insample2010$Freq))
names(Table)<-c("Number of Claims", "Frequncy")
pander(t(Table))

Number of Claims	0	1	2	3	4	5	6	7	8	9	10	11	13	14	15	16	17	18	19	30	39	103	239
Frequncy	707	209	86	40	18	12	9	4	6	1	3	2	1	2	1	2	1	1	1	1	1	1	1

The average number of claims for this sample was 1.24 (=1377/1110). See table 1.2 below.

Table 1.2

pander(summary(Insample2010$Freq))

Min.	1st Qu.	Median	Mean	3rd Qu.	Max.
0	0	0	1.241	1	239

A few quick notes on these commands:

Many useful R functions come in packages and to use these functions you have to install them. One way to install a package is by using the command line install.packages("<the package's name>") . In addition, to read more about a function you use the command help("function name") .
The pander function is used here to create nicer tables than regular R output. To use this function you need to download the pander package. For the normal R output in the illustration above, use the command line summary(Insample2010$Freq).
The names() function is used to to get or assign names of an object . In this illustration, we assigned Number of Claims and Frequency to the two columns in the data frame .
The t() function is used to transpose a dataframe or a matrix.

1.3 Average Severity Distribution for 2010

Table 1.3 summarizes the sample distribution of average severity from the 403 policyholders; Figure 1.2 provides further information about the distribution of sample claims, showing a distribution that is dominated by this single large claim so that the histogram is not very helpful. Even when removing the large claim, you will find a distribution that is skewed to the right. A generally accepted technique is to work with claims in logarithmic units especially for graphical purposes; the corresponding figure in the right-hand panel is much easier to interpret.

Table 1.3

InsamplePos2010 <- subset(Insample2010, yAvg>0)
pander(summary(InsamplePos2010$yAvg))

Min.	1st Qu.	Median	Mean	3rd Qu.	Max.
166.7	2226	4951	56330	11900	12920000

length(InsamplePos2010$yAvg)

## [1] 403

Note: The length() function sets the length of a vector (list) or other objects.

1.3.1 Plot of average claims

Figure 1.2

par(mfrow=c(1, 2))
hist(InsamplePos2010$yAvg, main="", xlab="Average Claims")
hist(log(InsamplePos2010$yAvg), main="", xlab="Logarithmic Average Claims")

A few quick notes on these commands:

The par(mfrow) function is handy for creating a simple multi-paneled plot. mfrow is a vector of length 2, where the first argument specifies the number of rows and the second the number of columns of plots.
The hist() computes a histogram of the given data values. You put the name of your dataset in between the parentheses of this function

1.4 Rating Variables

Earlier we considered a sample of 1,110 observations which may seem like a lot. However, as we will seen in our forthcoming applications, because of the preponderance of zeros and the skewed nature of claims, actuaries typically yearn for more data. One common approach that we adopt here is to examine outcomes from multiple years, thus increasing the sample size.

Table 1.4 shows that the average claim varies over time.

1.4.1 Average claims over time

Table 1.4

library(doBy)
T1A <- summaryBy(Freq ~ Year, data = Insample,   
                 FUN = function(x) { c(m = mean(x), num=length(x)) } )
T1B <- summaryBy(yAvg    ~ Year, data = Insample,   
                 FUN = function(x) { c(m = mean(x), num=length(x)) } )
T1C <- summaryBy(BCcov    ~ Year, data = Insample,   
                 FUN = function(x) { c(m = mean(x), num=length(x)) } )
Table1In <- cbind(T1A[1],T1A[2],T1B[2],T1C[2],T1A[3])
names(Table1In) <- c("Year", "Average Freq","Average Sev", "Average Coverage","No. of Policyholders")
pander(Table1In)

Year	Average Freq	Average Sev	Average Coverage	No. of Policyholders
2006	0.9515	9695	32498186	1154
2007	1.167	6544	35275949	1138
2008	0.9742	5311	37267485	1125
2009	1.219	4572	40355382	1112
2010	1.241	20452	41242070	1110

A few quick notes on these commands:

The summaryBy() function provides summary statistics of a variable across different groups. You need to install the doBy package to use the command.
The cbind() combines vector, matrix or data frame by columns. The row number of the two datasets must be equal.
The c() function combines its arguments to form a vector.

For a different look at this five-year sample, Table 1.5 summarizes the distribution of our two outcomes, frequency and claims amount. In each case, the average exceeds the median, suggesting that the distributions are right-skewed.

1.4.2 Frequency and claims statistics of full data

Table 1.5

t1<- summaryBy(Insample$Freq ~ 1, data = Insample, 
                 FUN = function(x) { c(ma=min(x), m1=median(x),m=mean(x),mb=max(x)) } )
names(t1) <- c("Minimum", "Median","Average", "Maximum")
t2 <- summaryBy(Insample$yAvg ~ 1, data = Insample, 
                 FUN = function(x) { c(ma=min(x), m1=median(x), m=mean(x),mb=max(x)) } )
names(t2) <- c("Minimum", "Median","Average", "Maximum")
t3 <- summaryBy(Deduct ~ 1, data = Insample, 
                 FUN = function(x) { c(ma=min(x), m1=median(x), m=mean(x),mb=max(x)) } )
names(t3) <- c("Minimum", "Median","Average", "Maximum")
t4 <- summaryBy(BCcov/1000 ~ 1, data = Insample, 
                 FUN = function(x) { c(ma=min(x), m1=median(x), m=mean(x),mb=max(x)) } )
names(t4) <- c("Minimum", "Median","Average", "Maximum")
Table2 <- rbind(t1,t2,t3,t4)
Table2a <- round(Table2,3)
Rowlable <- rbind("Claim Frequency","Claim Severity","Deductible","Coverage (000's)")
Table2aa <- cbind(Rowlable,as.matrix(Table2a))
pander(Table2aa)

	Minimum	Median	Average	Maximum
Claim Frequency	0	0	1.109	263
Claim Severity	0	0	9291.565	12922217.84
Deductible	500	1000	3364.87	1e+05
Coverage (000’s)	8.937	11353.566	37280.855	2444796.98

A few quick notes on these commands:

The rbind() combines vector, matrix or data frame by rows. The column of the two datasets must be same.
The round() function rounds the values in its first argument to the specified number of decimal places (default 0).

Table 1.6 describes the rating variables considered in this chapter. To handle the skewness, we henceforth focus on logarithmic transformations of coverage and deductibles. To get a sense of the relationship between the non-continuous rating variables and claims, Table 1.7 relates the claims outcomes to these categorical variables.

1.4.3 Rating variable description

See table 1.6 below for variables and variable description.

Table 1.6

R code for table 1.6

pander(des)

Variable	Description
BCcov	Total building and content coverage in dollars
Deduct	Deductible in dollars
Entity Type	Categorical variable that is one of six types: (Village, City, County, Misc, School, or Town)
AlarmCredit	Categorical variable that is one of four types: (0%, 5%, 10%, or 15%), for automatic smoke alarms in main rooms
NoClaimCredit	Binary variable to indicate no claims in the past two years
Fire5	Binary variable to indicate the fire class is below 5. (The range of fire class is 0~10)

1.4.4 Frequency and claims by rating variables

Table 1.7 shows claims summary by Entity Type, Fire Class, and No Claim Credit.

Table 1.7: R code for table 1.7

pander(Table4)

	Freq.num	Freq.m	yAvg.m
Village	1341	0.452	10645.206
City	793	1.941	16924.035
County	328	4.899	15453.206
Misc	609	0.186	43036.076
School	1597	1.434	64346.394
Town	971	0.103	19831.048
Fire5–No	2508	0.502	13935.421
Fire5–Yes	3131	1.596	41421.263
NoClaimCredit–No	3786	1.501	31365.085
NoClaimCredit–Yes	1853	0.31	30498.714
Total	5639	1.109	31206.155

Table 1.8 shows claims summary by Entity Type and Alarm Credit

Table 1.8:

R code for table 1.8

pander(Table4a) #Claims Summary by Entity Type and Alarm Credit==00

	Freq.m	yAvg.m	Freq.num
Village	0.326	11077.997	829
City	0.893	7575.979	244
County	2.14	16012.719	50
Misc	0.117	15122.127	386
School	0.422	25522.708	294
Town	0.083	25257.084	808
Total	0.318	15118.491	2611

pander(Table4b) #Claims Summary by Entity Type and Alarm Credit==05

		Freq.m	yAvg.m	Freq.num
	Village	0.278	8086.057	54
	City	2.077	4150.125	13
t3	County	0	0	1
	Misc	0.278	13063.933	18
	School	0.41	14575.003	122
	Town	0.194	3937.29	31
	Total	0.431	10762.112	239

pander(Table4c) #Claims Summary by Entity Type and Alarm Credit==10

	Freq.m	yAvg.m	Freq.num
Village	0.5	8792.376	50
City	1.258	8625.169	31
County	2.125	11687.969	8
Misc	0.077	3923.375	26
School	0.488	11596.912	168
Town	0.091	2338.06	44
Total	0.517	10194.094	327

pander(Table4d) #Claims Summary by Entity Type and Alarm Credit==15

	Freq.m	yAvg.m	Freq.num
Village	0.725	10543.752	408
City	2.485	20469.514	505
County	5.513	15475.74	269
Misc	0.341	87020.878	179
School	2.008	85139.974	1013
Town	0.261	9489.613	88
Total	2.093	41458.312	2462

Loss Data Analytics - R Codes

Chapter 1 Introduction to Loss Data Analytics

1.1 Read in Property Fund Data

1.2 Claim Frequency Distribution

1.2.1 Property fund distribution for 2010

1.3 Average Severity Distribution for 2010

1.3.1 Plot of average claims

1.4 Rating Variables

1.4.1 Average claims over time

1.4.2 Frequency and claims statistics of full data

1.4.3 Rating variable description

1.4.4 Frequency and claims by rating variables