PredictionIO Blog

Customer LTV: A Machine Learning Approach (Part 1)

21 Jan 2016

This is the first post in a two-part series on Machine Learning approaches to estimating Customer Lifetime Value (LTV). It covers how to define high-value customers.

Customer lifetime value, or LTV, is a quantity that features in many marketing and e-commerce blogs and articles. However, LTV is not by any means a well-defined quantity, and formulas arise by modeling different aspects of customer behavior.

In this series we will take a different perspective on a customer’s worth to a company. In particular, we explicitly define customer value, or CV, to be the total amount spent by a customer in the first $n$ months from their recorded signup time. We also define a high-value customer to be one whose customer value is greater than or equal to $r$, the high-value cutoff, which we will determine shortly.

Ultimately, this will allow early prediction of high-value customers based on signup activity. First, we present several approaches to systematically determining an appropriate high-value cutoff, answering the question:

Who are my high-value customers?

Different Approaches to Defining High Value Customers

It is up to each individual company to choose a time interval for $n$. However, this choice should be made according to the constraints of the data set. For our example, $n$ is twelve months (or $365$ days), which results in a data set with a little over one million observations.
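As a rough illustration of the CV definition, the following Python sketch sums each customer's purchases that fall within the first $365$ days after signup. The record layout (`signups`, `purchases`) and the function name `customer_values` are assumptions made for this example; they are not the format of the data set used in this post.

```python
from datetime import datetime, timedelta

def customer_values(signups, purchases, window_days=365):
    """Total spend per customer within `window_days` of signup.

    signups:   dict mapping customer id -> signup datetime (hypothetical layout)
    purchases: iterable of (customer id, purchase datetime, amount) tuples
    """
    totals = {customer: 0.0 for customer in signups}
    for customer, when, amount in purchases:
        signup = signups.get(customer)
        if signup is None:
            continue  # purchase from a customer with no recorded signup
        # Count only purchases made inside the window that starts at signup.
        if signup <= when < signup + timedelta(days=window_days):
            totals[customer] += amount
    return totals

# Made-up records for illustration only.
signups = {"a": datetime(2014, 1, 1), "b": datetime(2014, 6, 1)}
purchases = [
    ("a", datetime(2014, 2, 1), 40.0),
    ("a", datetime(2015, 3, 1), 25.0),  # outside the 365-day window
    ("b", datetime(2014, 6, 2), 10.0),
]
cv = customer_values(signups, purchases)

# The "positive CV" values are the ones belonging to customers who spent anything.
positive_cv = [value for value in cv.values() if value > 0]
```

Customers who never purchase keep a CV of zero and drop out of `positive_cv`, which is the subset the cutoff is defined on.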

In order to systematically choose the high-value cutoff, $r$, it helps to rephrase the problem as the following question:

For customers who will spend money in their first year, what is an unusually large amount?

We can use this question to motivate a formal definition of the high-value cutoff. Fixing some number $p$ between $0$ and $1$, we define the high-value cutoff $r$ such that, given a CV greater than $0$, the probability of observing a CV greater than $r$ is at most $1 - p$. This is more commonly known as the $(100 \cdot p)^{\text{th}}$ percentile of the positive CV distribution. In this post, I’ll fix $p = 0.9$ for illustration purposes, and I will show some tools for arriving at, and quantifying your confidence in, an estimate for the true $90^{\text{th}}$ percentile of this distribution, including:

1. Empirical quantile estimates
2. Bootstrapped confidence intervals for quantile estimates
3. Density-based quantile estimates
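Written as a formula, one common way to formalize the cutoff (conventions differ slightly in how they treat ties, so take this as an assumed convention rather than the only one) is the smallest value whose conditional cumulative probability reaches $p$:

$$
r = \inf\left\{ x : P(\mathrm{CV} \le x \mid \mathrm{CV} > 0) \ge p \right\},
\qquad \text{so that} \qquad
P(\mathrm{CV} > r \mid \mathrm{CV} > 0) \le 1 - p .
$$

With $p = 0.9$, this says that among customers who spend anything at all, at most $10\%$ are expected to spend more than $r$.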

You may be wondering at this point what I mean by the “positive CV distribution”. It’s helpful to think about the CV values in the following way: imagine that there is some factory that’s generating random CV values over time. The underlying distribution dictates the probability of observing a particular CV value that is generated by this imaginary factory. This is a bit abstract, but it is precisely this line of thinking that lets you generalize any insights extracted from the data to future observations, rather than providing statistics that are, from a probabilistic viewpoint, only relevant to the specific data set at hand.

The Data

We have a data set consisting of a company’s customer behavioral data which we cleaned up and processed. The resulting data set contains roughly $30000$ positive CV scores (i.e., customers that made purchases) which we will use for the remainder of this blog post. The positive CV score data is plotted as a histogram below.

Empirical Quantile Estimates

Remember that we want to estimate the $90^{\text{th}}$ percentile for CV scores greater than $0$. This section will cover the simplest way to estimate this quantity from the data.

I will start with a very simple example. Suppose that you observe the following list of values: $\$200, \$200, \$100, \$300, \$500, \$2000, \$200, \$100, \$300, \$500$, and you want to compute the $90^{\text{th}}$ percentile. Note that the list contains ten values, and also note that $\frac{9}{10} = 0.9$. Based on this information, one approach to computing the $90^{\text{th}}$ percentile is to sort the list of values, giving $\$100, \$100, \$200, \$200, \$200, \$300, \$300, \$500, \$500, \$2000$, and take the percentile to be the ninth value on the list, which here is $\$500$.

The resulting percentile estimates are called empirical quantile estimates. The empirical quantile estimate for the $90^{\text{th}}$ percentile of the sample data set (i.e. the high-value cutoff) was computed to be approximately $455.72$. I saved the code used for computing quantile estimates as a GitHub Gist.
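The sort-and-pick rule from the toy example can be sketched in Python as follows. This is a minimal nearest-rank version written for illustration; it is not the code from the Gist, and the function name `empirical_quantile` is an assumption. Statistical packages often interpolate between neighboring values instead, so their results can differ slightly from this rule.

```python
import math

def empirical_quantile(values, p):
    """Nearest-rank quantile: the ceil(p * n)-th smallest value (1-indexed)."""
    if not values:
        raise ValueError("values must be non-empty")
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    ordered = sorted(values)
    # Round before taking the ceiling so that floating-point noise
    # (e.g. 0.7 * 10 == 7.000000000000001) does not bump the rank.
    rank = math.ceil(round(p * len(ordered), 9))
    return ordered[rank - 1]

# The ten values from the toy example.
toy = [200, 200, 100, 300, 500, 2000, 200, 100, 300, 500]
cutoff = empirical_quantile(toy, 0.9)  # ninth of the ten sorted values
```

On the toy list the rank is $\lceil 0.9 \cdot 10 \rceil = 9$, so `cutoff` is the ninth sorted value.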

Now, go back to our imaginary factory thought experiment. As you can imagine, if we look at different sets of observations generated by the same factory then the resulting empirical quantile estimates will likely differ from set to set since we would be looking at a different list of numbers each time.

Given this observation, how can we quantify our confidence in our particular estimate for the $90^{\text{th}}$ percentile? One tool that statisticians like to use for this purpose is called a confidence set, which will be the topic of the next section.

To explain the intuition behind this, suppose that we could get $B$ different sets of positive CV values from our imaginary factory. We could then compute an empirical quantile estimate for each of these different data sets. You can see that this procedure would effectively give us a distribution for the empirical quantile estimates, which, again, would allow us to make statements about the probability of observing particular values using this estimation approach. A $95\%$ confidence set for our estimate gives us a range of empirical quantile estimates for which the probability of observing an empirical quantile estimate outside of this range is less than $5\%$.

Bootstrapping Confidence Intervals

Unfortunately, it is rarely the case that we can organically produce a collection of quantile estimates in this manner for obvious reasons. In order to emulate this procedure and compute a confidence set, statisticians like to employ a tool called bootstrapping.

Bootstrap re-sampling, or bootstrapping, generates $B$ different data sets of size $n$ (here $n$ denotes a sample size, not the number of months used earlier) by sampling observations with replacement from the original data set. As before, empirical quantile estimates are then computed for each of the different sets, from which we can make probabilistic inferences regarding the values produced by using empirical quantile estimates for the $90^{\text{th}}$ percentile of the positive CV distribution, or the high-value cutoff.
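A minimal Python sketch of this resampling step is shown below. It assumes each resample has the same size as the original data set and uses a nearest-rank quantile; the function names, the seed, and the synthetic log-normal scores are all illustrative assumptions and not the data or code behind the numbers in this post.

```python
import math
import random

def empirical_quantile(values, p):
    """Nearest-rank quantile: the ceil(p * n)-th smallest value."""
    ordered = sorted(values)
    rank = max(1, math.ceil(round(p * len(ordered), 9)))
    return ordered[rank - 1]

def bootstrap_quantiles(values, p, n_boot, seed=0):
    """Quantile estimates from `n_boot` resamples drawn with replacement."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    size = len(values)
    estimates = []
    for _ in range(n_boot):
        resample = rng.choices(values, k=size)  # sampling with replacement
        estimates.append(empirical_quantile(resample, p))
    return estimates

# Synthetic, made-up data standing in for positive CV scores.
data_rng = random.Random(42)
scores = [data_rng.lognormvariate(5.0, 1.0) for _ in range(2000)]

# B = 200 bootstrap estimates of the 90th percentile.
boot = bootstrap_quantiles(scores, 0.9, n_boot=200)
```

The spread of the values in `boot` is what the confidence intervals discussed next are built from.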

It is this probabilistic inference that allows us to estimate $95\%$ confidence sets for the empirical quantile estimates. The particular confidence sets we will be interested in estimating are $95\%$ confidence intervals centered around a given CV score $c$. A $95\%$ confidence interval centered around $c$ is a numeric interval $[a, b]$ such that (1) the probability of observing a value between $a$ and $c$ is $\left(\frac{95}{2}\right)\%$, and (2) the probability of observing a value between $c$ and $b$ is $\left(\frac{95}{2}\right)\%$.

I will now explain how to use bootstrapping to approximate this type of confidence interval via a toy example. Let’s say you have your own data set of CV scores and you compute the empirical quantile estimate for the $90^{\text{th}}$ percentile to be $\$100$. You then proceed by creating the following $10$ empirical quantile bootstrap estimates:

Now, you append the empirical quantile estimate obtained from the full data set, which in this toy example is $\$100$, and sort the resulting list:

I underlined the full-data empirical quantile estimate in the list above, as we will use this list to compute the full-data estimate’s percentile relative to the bootstrap samples. You can see that this is roughly $\frac{6}{11} \approx 0.55$.

Since we are trying to compute a $95\%$ confidence interval around $\$100$, we want to compute the $\left(55 - \frac{95}{2}\right) = 7.5^{\text{th}}$ and $\left(55 + \frac{95}{2}\right) = 102.5^{\text{th}}$ percentiles of the sorted list of values. Now, finding the $102.5^{\text{th}}$ percentile just doesn’t make sense, so we settle for a confidence interval whose endpoints correspond to the $7.5^{\text{th}}$ and $100^{\text{th}}$ percentile values.

You can use the code I provided in the previous section to calculate these percentiles. You should find that an approximate $95\%$ confidence interval centered around $\$100$ for this toy example is $[95, 110]$. You can find my R code for carrying out this procedure in this GitHub Gist.

I used this code to obtain an approximate $95\%$ confidence interval for my positive CV score data centered around the full-data empirical quantile estimate, which was computed in the previous section to be $455.72$. The computed interval is $[442.47, 495.62]$.
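For readers who prefer Python, the centered-interval procedure described above can be sketched as follows. This is an illustrative reimplementation under assumptions (nearest-rank percentiles, ties resolved to the first occurrence), not a translation of the R Gist, and the bootstrap values in the example are made up rather than the ones from the toy example.

```python
import math

def percentile_value(ordered, pct):
    """Nearest-rank percentile of an already sorted list, pct in [0, 100]."""
    rank = math.ceil(round(pct / 100 * len(ordered), 9))
    rank = min(max(rank, 1), len(ordered))  # keep the rank inside the list
    return ordered[rank - 1]

def centered_interval(full_estimate, boot_estimates, level=95.0):
    """Interval spanning level/2 percent of estimates on each side of the full-data estimate."""
    ordered = sorted(list(boot_estimates) + [full_estimate])
    # Percentile rank of the full-data estimate in the combined list
    # (first occurrence if the value is tied with a bootstrap estimate).
    position = ordered.index(full_estimate) + 1
    center_pct = 100.0 * position / len(ordered)
    # Step half the level to each side, clipping to the valid 0-100 range.
    low_pct = max(center_pct - level / 2, 0.0)
    high_pct = min(center_pct + level / 2, 100.0)
    return percentile_value(ordered, low_pct), percentile_value(ordered, high_pct)

# Hypothetical bootstrap estimates around a full-data estimate of 100.
hypothetical_boot = [92, 97, 98, 99, 101, 103, 104, 106, 108, 111]
interval = centered_interval(100, hypothetical_boot)
```

With only ten bootstrap estimates, the clipped percentiles land at or near the extremes of the list, which is why far more resamples are used in practice.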

The plot below shows a histogram of the bootstrap estimates with the confidence region shaded in red and the full-data estimate as a blue line:

The important thing to keep in mind is that this interval is making a statement regarding the precision of the empirical quantile estimate, and not its accuracy.

The last section will very briefly talk about a different approach for estimating quantiles using the notion of a probability density function.

Density Quantile Estimates

At a high level, a probability density function is a mathematical function that takes a value as input and tells you how likely it is to observe values near it; the probability of landing in a range of values is the area under the function over that range. This is precisely the function that determines how our imaginary factory produces different values for CV.

If we somehow knew the probability density function associated with the customer value data, then we could compute quantiles in a non-random fashion. Density estimation is a large topic and the mathematics can become quite complex. However, I will give a very high-level overview of two approaches that can be used to approximate probability density functions.

The first method that is commonly used for estimating a probability density is called kernel density estimation. This method is intimately related to the histograms we have been using to analyze the data. The $90^{\text{th}}$ percentile estimated from the entire data set using the kernel density estimation method is roughly $524.50$, which is outside of the confidence set produced in the previous section.
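To make the idea concrete, here is a small Python sketch of a density-based quantile using a Gaussian kernel density estimate. The bandwidth rule (a normal-reference rule of thumb) and the bisection search are assumptions chosen for simplicity; the post does not state which kernel or bandwidth produced the $524.50$ figure, so this sketch is not expected to reproduce it.

```python
import math
import statistics

def kde_cdf(x, data, bandwidth):
    """CDF of a Gaussian kernel density estimate evaluated at x."""
    total = 0.0
    for xi in data:
        z = (x - xi) / bandwidth
        total += 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))  # standard normal CDF
    return total / len(data)

def kde_quantile(data, p, bandwidth=None, tol=1e-6):
    """Invert the KDE's CDF by bisection to find the p-quantile."""
    if bandwidth is None:
        # Normal-reference rule of thumb; an assumption, not the post's choice.
        bandwidth = 1.06 * statistics.stdev(data) * len(data) ** (-0.2)
    lo = min(data) - 10 * bandwidth  # CDF is essentially 0 here
    hi = max(data) + 10 * bandwidth  # CDF is essentially 1 here
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if kde_cdf(mid, data, bandwidth) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Toy values for illustration only.
toy = [100, 100, 200, 200, 200, 300, 300, 500, 500, 2000]
kde_q90 = kde_quantile(toy, 0.9)
```

Because the kernel estimate smooths the data, its quantiles need not coincide with any observed value, which is one reason the density-based and empirical estimates can disagree.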

The second method used for estimating probability densities arises from assuming that the data comes from a known probability distribution that is governed by some set of parameters. The parameters are then estimated using the data. This procedure is called fitting a probability distribution to the data. After delving into the data, I decided to fit an inverse-gamma distribution and obtained a $90^{\text{th}}$ percentile estimate of $444.99$, which is contained in the confidence set produced in the previous section.
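The fitting step can be illustrated with the Python sketch below. It swaps in two simple techniques that are assumptions of this example, not necessarily what was used for the $444.99$ estimate: the parameters are estimated by the method of moments, and the quantile is approximated by simulation, using the fact that the reciprocal of a gamma variable is inverse-gamma distributed. The data are synthetic.

```python
import math
import random
import statistics

def fit_inverse_gamma_moments(data):
    """Method-of-moments estimates (shape alpha, scale beta) for an inverse-gamma.

    Uses mean = beta / (alpha - 1) and variance = mean^2 / (alpha - 2),
    which only makes sense when alpha > 2.
    """
    mean = statistics.fmean(data)
    var = statistics.variance(data)
    alpha = mean * mean / var + 2.0
    beta = mean * (alpha - 1.0)
    return alpha, beta

def inverse_gamma_quantile(alpha, beta, p, n_draws=100_000, seed=0):
    """Approximate the p-quantile of InverseGamma(alpha, beta) by simulation.

    If G ~ Gamma(shape=alpha, scale=1/beta), then 1/G ~ InverseGamma(alpha, beta).
    """
    rng = random.Random(seed)
    draws = sorted(1.0 / rng.gammavariate(alpha, 1.0 / beta) for _ in range(n_draws))
    rank = math.ceil(round(p * n_draws, 9))
    return draws[rank - 1]

# Synthetic data drawn from a known inverse-gamma (shape 6, scale 500).
gen = random.Random(1)
sample = [1.0 / gen.gammavariate(6.0, 1.0 / 500.0) for _ in range(20_000)]

alpha_hat, beta_hat = fit_inverse_gamma_moments(sample)
q90 = inverse_gamma_quantile(alpha_hat, beta_hat, 0.9)
```

A maximum-likelihood fit from a statistics library would usually be preferred for heavy-tailed data, since sample variances can be unstable there.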

This is all I will say regarding these two different estimation methods, as a detailed explanation requires much more technical machinery than should be discussed here. I should also note that bootstrap confidence intervals can also be computed for both of these percentile estimates in exactly the same fashion as before.

In particular, the plot below shows the bootstrap estimates for the inverse-gamma quantile estimation method with the confidence set shaded in red and the full-data estimate shown as a blue line.

You should compare this with the plot of the empirical quantile bootstrap estimates from the previous section.

Predicting High-value Customers

In Part 1, we set out to complete the first step in predicting high-value users, which was to determine the cutoff, $r$, that defines a high-value customer. We defined the cutoff as a percentile of the data and showed a simple but powerful way to compute this percentile. We briefly mentioned more sophisticated methods for quantile estimation. We also showed a robust method for estimating confidence intervals that can be used for any statistic you wish to estimate from a data set.

For Part 2 in this series on Machine Learning approaches to Customer Lifetime Value, we will use the average of the empirical and inverse-gamma quantile estimates, which is approximately $\$450$, as our cutoff. We will cover methods for predicting high-value users (i.e., those who spend $\$450$ or more) using only data obtained from a customer’s signup event, such as signup time, country of registration, page views and interactions with marketing campaigns.

By Marco Vivero