In the previous post, we explored Binomial and Poisson distributions in R with examples in Actuarial Science. In this post, we will look at two more important distributions Exponential and Gamma in R and their application as a Loss Distributions.
Gamma Distribution
The gamma distribution is a family of continuous distributions. Its importance is largely seen in insurance for modelling claim sizes.
The gamma family of distributions has two parameters - the shape parameter α, and the rate parameter λ. It is important to note here that the rate parameter is not to be misinterpreted as the scale parameter. The scale parameter is defined as the reciprocal of the rate parameter, 1/λ.
A random variable X which is gamma distributed with parameters α and λ, is written as X ~ Gamma (α, λ) and its probability density function is given as:
The gamma function, Γ(α) appears in the denominator of this PDF. It is defined for α > 0 as follows:
In the first figure, the shape parameter α is kept at 3 and all the gamma distributions have the same skewness. However, the gamma distribution becomes more spread out as the rate parameter λ decreases (or as scale parameter 1/λ increases).
In the second figure, the rate parameter λ is kept at 2, the gamma distributions become less skewed as the shape parameter α increases.
Examples of Gamma Distribution, the size of loan defaults, loading of web servers, annual rainfall, varies forms of telecom exchange, etc. can be modelled by a gamma distribution.
In Actuarial/Insurance, the Gamma distribution is often used for, modelling the aggregate size of insurance claims and modelling insurance claims severity.
It is important to understand that it is not easy to compute probabilities of gamma distribution directly from the formula as there is no closed-form. The easiest method is by using the well-known functions and packages in R. The probabilities can also be calculated by using the relationship of converting it into exponential and chi-squared.
Note: In R, the Gamma distribution is represented as dgamma, pgamma, qgamma and rgamma.
Let's look at an example, The annual claim amounts (in $m) for an insurance company is modelled by a gamma distribution with α = 4 and λ = 2. Then let X denote the annual claims amounts (in $m) such that X ~ Gamma(4, 2).
Let's answer some questions!
What is the probability that
(i) the annual claim amounts are between $0.5m and $2m?
(ii) the annual claim amounts are greater than $4m
Solution: (i) The probability that the annual claims are between $0.5m and $2m is P(0.5 < X < 2) = P(X < 2) - P(X < 0.5). This can be computed using the pgamma() function in R.
> pgamma(2, shape = 4, rate = 2) - pgamma(0.5, shape = 4, rate = 2)
[1] 0.5475417
(ii) The probability that the annual claims are greater than $4m is P(X > 4). This can be computed using pgamma() function in R with argument lower.tail = FALSE to compute P(X > x) instead of P(X <= x).
> pgamma(4, shape = 4, rate = 2, lower.tail = FALSE)
[1] 0.04238011
Special Cases of Gamma Distribution
The exponential and the chi-squared distributions are the special cases of the gamma distribution.
If we change the values of the shape parameter (α) and the scale parameter (λ) then we can transform the gamma distribution and change its properties. When the shape parameter, α = 1, and the rate parameter λ being variable then the gamma distribution becomes the Exponential Distribution with mean 1/λ.
When the shape parameter, α = ν/2, and the scale parameter, λ = 1/2 then the gamma distribution becomes a Chi-square distribution with degrees of freedom 2α = v. Another way of looking at this is when W ~ Gamma (α, λ), then 2λW has a Chi-square distribution with 2α = v degrees of freedom.
Exponential Distribution
The exponential distribution is used as a simple model to predict the amount of waiting time until the next event where the events occur according to a Poisson process with rate λ. The parameter of the exponential distribution λ is the same as the parameter of the Poisson process λ, the rate at which the events occur.
For example, the amount of time until the next claim call arrives at a general insurance company's call centre.
A random variable X which is exponentially distributed with parameter λ is written as X ~ Exp(λ).
Suppose that the claims to the general insurance company arrive at a rate of 0.5 per day. Here 0.5 is a rate. The number of claims can then be modelled as a Poisson distribution and the Poisson rate will be 0.5. Now, what it means is that during a day, the claim arrives 0.5 times on average. Converting this into time terms, it takes 2 days until a claim arrives. The number of visitors on your blog in a day, the number of motor accidents in a week, etc., are all rates (λ) of the unit of time, which is the parameter of the Poisson distribution.
Although, when we model the waiting time between events, we consider time instead of rate, e.g., the number of days until a claim arrives at a general insurance company is 2 days (instead of saying 0.5 claims/day, which is a rate). Thus, the mean of the exponential distribution is given as 1/λ.
The probability density function of the exponential distribution is given as:
The Cumulative distribution function of the exponential distribution is given as:
Moments of the exponential distribution: μ = 1/λ and σ² = 1/λ².
The exponential distribution along with gamma, lognormal, etc. is also used as a simple loss distribution to model claim amounts.
We will see later in this blog, what are loss distributions and why exponential, gamma, lognormal, etc. are used as a simple loss distribution to model claim amounts.
Let's look at an example, suppose that the claims to a general insurance company arrive at a rate of 2 per day. Then the number of claims, X, in a day can be modelled as a Poisson distribution with mean λ = 2. Hence, the waiting time, T, between claims can be modelled as an exponential distribution with mean 1/λ = 0.5 and parameter λ = 2, T ~ Exp(2).
Let's answer some questions!
What is the probability that
(i) the next claim arrives within a day
(ii) the next claim arrives after 2 days
Solution: (i) The probability that the next claim arrives within a day is P(X < 1). This can be computed using pexp() function in R.
> pexp(q = 1, rate = 2)
[1] 0.8646647
(ii) The probability that the next claim arrives after 2 days is P(X > 2). This can be computed using pexp() function in R with argument lower.tail = FALSE to compute P(X > x) instead of P(X <= x).
> pexp(q = 2, rate = 2, lower.tail = FALSE)
[1] 0.01831564
Applications of Gamma and Exponential Distributions in Modelling Insurance claims
Loss Distributions
Loss Distributions are statistical distributions that are used to model individual claim amounts. The typical pattern of the individual claim amounts looks like the one shown below, with a few small claims, rising to a peak, then tailing off gradually with a few very large claims.
The tail of the distribution is of fundamental importance to the insurance company since the claim amounts in the tail has a high severity (although they are less frequent). To prevent itself from large losses the insurance company should model the claim amounts appropriately such as the modelled distribution closely follows the tail of the distribution.
The general conclusion about the claim amount distributions is that they tend to be positively skewed and long-tailed.
A range of statistical techniques can be used to describe the distribution of these random variables (individual claim amounts). The object is to find a loss distribution that adequately describes the claims that actually occur. This can be broken down in three steps:
1) Data Exploration - Looking at the data, it's distribution and making an initial assumption about the statistical distribution that can possibly fit the data well.
2) Estimation - First, assuming that the claim amounts are a member of a certain family of distributions. Then estimating the parameters of the family using the claim amounts data by an appropriate method such as the method of moments, maximum likelihood, etc.
3) Goodness-of-fit test - To test whether a given loss distribution provides a good model for the observed claim amounts. This can be done in various ways but two of the most useful ways are using Q-Q Plot and applying the chi-squared goodness-of-fit test.
Why Gamma, Exponential, etc. are used as a Loss Distribution?
As we have seen earlier the claim amount distributions tend to be positively skewed and long-tailed. This is the reason that the positively skewed distributions provide a good approximation to the claim amount distributions. Gamma, Exponential, Lognormal, etc. are positively skewed distributions. Thus they are widely used in the insurance industry for modelling individual claim amounts.
Application
Dataset - We have the following dataset available for claim amounts of a general insurance company (in $).
Importing the claims dataset
Before we import the dataset, save it in your working directory. To know your working directory in R use getwd(). We will import the claims dataset using read.csv function and name it claim_amount. The argument header = T tells the function that our dataset has column names (headers).
> claim_amount <- read.csv(file = "claim_amount.csv", header = T)
Looking at the top 6 rows of the dataset using the head function:
> head(claim_amount)
claimsize
1 1392
2 3424
3 351
4 416
5 839
6 689
As we can see there's only one column in our dataset, so we extract this column of data as a vector, just to make it easier to work with:
> claim_size <- claim_amount$claimsize
Summary statistics
Looking at the summary statistics of the claim sizes using the summary function.
> summary(claim_size)
Min. 1st Qu. Median Mean 3rd Qu. Max.
31.0 626.5 1182.5 1358.8 1801.0 6220.0
We can see here that the size of the claims ranges from $31 to $6220. Thus, It would be a better choice to scale the values before fitting a distribution to them. This will help us speed our code and will prevent us from any future errors. From now on we will work with claim amounts in ($ 000s).
Scaling the values
For this, we divide the claim amounts by 1000.
> claim_size <- claim_size/1000
> summary(claim_size)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0310 0.6265 1.1825 1.3588 1.8010 6.2200
Histogram of claim amounts
The maximum claim size observation is 6.22 (in $ 000s) so we’ll plot a histogram with the x-axis on the range (0, 7):
> hist(claim_size, freq = F, xlim = c(0, 7), ylim = c(0, 0.6), xlab = "Claim Amounts (in $000s)", main = "Histogram of Claim Amounts")
The option freq = FALSE tells R to plot the PDF, rather than the frequencies.
"fitdistrplus" package
The fitdistrplus package is a general package that aims at helping the fit of univariate parametric distributions to censored or non-censored data. The two main functions are fitdist for fit on non-censored data and fitdistcens for fit on censored data - Package Documentation.
Installing the package using install.packages function and then loading the package using library function.
#To install the package: install.packages("fitdistrplus")
> library(fitdistrplus)
Using the fitdist function to fit the "gamma" distribution to the claim amounts by the method of maximum likelihood estimation. Here the arguments distr and method tells the function which distribution to fit and which method to apply.
> gamMLE <- fitdist(data = claim_size, distr = "gamma", method = "mle")
> summary(gamMLE)
Fitting of the distribution ' gamma ' by maximum likelihood
Parameters :
estimate Std. Error
shape 1.960318 0.1818196
rate 1.442773 0.1523692
Loglikelihood: -239.2944 AIC: 482.5888 BIC: 489.1855
Correlation matrix:
shape rate
shape 1.0000000 0.8782432
rate 0.8782432 1.0000000
We have got our estimates of the parameters. Now, we will store the estimates in a new variable est to make it easier to work with them.
> est <- gamMLE$estimate
> est
shape rate
1.960318 1.442773
Superimposing the density curve of the fitted gamma distribution
We have already plotted the histogram of the claim amounts. Now we plot our fitted gamma distribution on the same graph as the histogram. For this, we first create a sequence of x-values for plotting the density curve on the same range as that of the histogram.
> x <- seq(from = 0, to = 7, by = 0.1)
> dens_gam_est <- dgamma(x, shape = est[1], rate = est[2])
> lines(x, dens_gam_est, col = "red")
We can clearly see from the above graph that the gamma distribution has provided a very good approximation to the claim amounts. But still, we would be fitting exponential distribution to get an idea about when to fit exponential and when to fit gamma or some other distribution.
Fitting the exponential distribution
We will again use fitdist function but this time to fit the exponential distribution and then store the estimates as earlier we did.
> expMLE <- fitdist(data = claim_size, distr = "exp", method = "mle")
> summary(expMLE)
Fitting of the distribution ' exp ' by maximum likelihood
Parameters :
estimate Std. Error
rate 0.7359354 0.05203839
Loglikelihood: -261.3226 AIC: 524.6452 BIC: 527.9435
> estE <- expMLE$estimate
> estE
rate
0.7359354
Superimposing the density curve of the fitted gamma distribution
Now we plot our fitted exponential distribution on the same graph. For this, we will use the same sequence of x-values for plotting the density curve.
> dens_exp_est <- dexp(x, estE)
> lines(x, dens_exp_est, col = "green")
> legend("topright", col = c("red", "green"), lty = 1, leg = c("Gamma", "Exp"))
As we can see gamma distribution provides a good fit to the claim amounts data as compared to the exponential distribution. The reason behind this is the shape of the gamma distribution which closely follows the general shape of claim amounts distribution. That is, a few small claims at the start, then rising to a peak, then tailing off gradually with a few very large claims. On the other hand, the exponential distribution provides a good fit to the type of claim amount distribution which starts with a peak, then tail off gradually with a few very large claims at the end.
Alternative: Plotting the empirical density function against the fitted distribution
We’ll use the density function on the range from 0 to max(claim_size):
> plot(density(claim_size, from=0, to=max(claim_size)), xlab="Claim Size (in $000s)", main = "Empirical density function", ylim = c(0,0.6), col = "blue")
We check the maximum claim size:
> max(claim_size)
[1] 6.22
Hence, we set our x coordinates as:
> x <- seq(0, 6.22, 0.1)
We have already calculated the PDF of the fitted gamma distribution and stored it in dens_gam_est. But we will not be using it because the x lengths differ. Finally, we add the fitted gamma curve to our empirical density plot:
> lines(x, dgamma(x, est[1], est[2]), col = "red")
> legend("topright", lwd = 1, col = c("red", "blue"), leg = c("fitted gamma", "Empirical density"))
This looks like a good fit.
The Goodness of Fit test - Q-Q Plot
One way to test for goodness of fit is to create a Q-Q Plot.
A Q‐Q plot is a scatterplot created by plotting two sets of quantiles against one another. It will take your sample data, sort it in ascending order, and then plot each point in the data against the quantiles calculated from a theoretical (fitted) distribution.
The Q‐Q plot is used for inspecting how closely the dataset fits the chosen theoretical distribution. A straight diagonal line indicates a perfect fit.
We have already estimated the parameters of the fitted gamma and exponential distribution. To plot the Q‐Q plot for the fitted distributions, we first simulate a set of data from the fitted distributions (first using gamma). Note: Our sample only has 200 observations (length(claim_size)) but that doesn’t mean we have to simulate 200 values from the fitted distributions. In fact, it’s a good idea to simulate a fairly large sample so that R won’t have to interpolate so much between missing quantiles:
> set.seed(999)
> fitted_gam <- rgamma(1000, est[1], est[2])
Now we’ll create a Q‐Q plot with the fitted quantiles on the x-axis and the sample quantiles on the y-axis:
> par(mfrow=c(1,2))
> qqplot(fitted_gam, claim_size, main = "Gamma")
> abline(0, 1)
Here we have used the set.seed function for reproducible result and the abline function in the code above adds a straight line with intersect = 0 and slope = 1. So this indicates the line of the perfect fit. (Note: The par() function is used to plot the graphs side-by-side).
Similarly, we did the same for the fitted exponential distribution.
> fitted_exp <- rexp(1000, estE)
> qqplot(fitted_exp, claim_size, main = "Exp")
> abline(0,1)
Clearly, the gamma distribution is a much better fit to the data! Of course, this is no guarantee that these particular parameters for the gamma distribution are the best fit for our claims data. It does not even guarantee that the gamma distribution is any better than, say, the lognormal or Pareto distributions. Nonetheless, Q‐Q plots are a useful tool to visualize the goodness‐of‐fit of a distribution to a given dataset.
The Goodness of Fit test - Chi-squared test
Another useful way of analysing the goodness-of-fit is by using a chi-squared test.
Here we are testing:
H0: claim amounts are gamma-distributed
H1: claim amounts are not gamma-distributed
First, we create the frequency table of the claims data by dividing the data into separate intervals. Here we created the last interval (3, 7] like this because the sample data has very few observations for claim amounts greater than $4,000.
> claim.cut <- cut(claim_size, breaks=c(0:3, 7)) #binning data
> table(claim.cut) # binned data table
size.cut
(0,1] (1,2] (2,3] (3,7]
86 74 26 14
Next, we convert the table of observed values into vector form using the as.vector function and store them in variable f.os for later use.
> f.os <- as.vector(table(claim.cut))
> f.os
[1] 86 74 26 14
Now we need to compute the expected proportions from the fitted gamma distribution. For example, for the interval (1, 2] the expected proportion will be calculated as P(X<2)-P(X<1). We store the expected proportions as a vector.
> f.ex <- c((pgamma(1,est[1],est[2])-pgamma(0,est[1],est[2])),(pgamma(2,est[1],est[2])-pgamma(1,est[1],est[2])),(pgamma(3,est[1],est[2])-pgamma(2,est[1],est[2])),(pgamma(7,est[1],est[2])-pgamma(3,est[1],est[2])))
> f.ex
[1] 0.43534146 0.35598714 0.14186906 0.06638094
> sum(f.ex)
[1] 0.9995786
The sum of the expected proportions is very close to 1 but not exactly equal to 1. Thus, to prevent error while running the chi-squared test we set the rescale.p argument to TRUE in the chisq.test function. Doing this the values of the expected proportions are rescaled to sum to 1. Other arguments of the function are x (vector of observed values from the data) and p (vector of expected proportions calculated using the fitted distribution).
> test <- chisq.test(x = f.os, p = f.ex, rescale.p = T)
> test
Chi-squared test for given probabilities
data: f.os
X-squared = 0.3613, df = 3, p-value = 0.9481
The p-value of the chi-squared test is much greater than the significance level of 0.05 (say). So we do not have enough evidence to reject H0 and we can conclude that the claim amounts are gamma-distributed.
We can also extract the expected and observed values from the test as follows:
> test$expected
[1] 87.10500 71.22744 28.38577 13.28178
> test$observed
[1] 86 74 26 14
You can download the text file of all the R code used above from below:
Thank you for reading the post. We hope you enjoyed it! Subscribe to the blog to get notified about the next posts. We have a question for you, do you know any other actuarial applications of these distributions discussed above? Do let us know in the comments, if you enjoyed reading this post, we would be very grateful if you’d help it spread by sharing it with your friends and network. Thank you!
—Abhishek Goel and Bhavyadeep Gupta
We both are currently actuarial students and the above blog is to the best of our knowledge. We do not claim to be the experts in the field. We are just sharing our learning with others and this is an effort to help other students learn R. Your feedback and suggestions are welcomed.
Useful Resources:
Comments