*This is the first post in a two-part series on machine learning approaches to
estimating Customer Lifetime Value (LTV). This post covers how to define
high-value customers.*

Customer lifetime value, or **LTV**, is a quantity that features in many
marketing and e-commerce blogs and articles.
However, LTV is not by any means a well-defined quantity, and formulas arise by
modeling different aspects of customer behavior.

In this series we will take a different perspective on a customer’s worth to a
company. In particular, we explicitly define customer value, or **CV**, to be
the *total amount spent by a customer in a fixed number of months following
their recorded signup time*. We also define a **high-value customer** to be one
whose customer value is greater than or equal to the high-value **cutoff**,
which we will determine shortly.

Ultimately, this will allow early prediction of high-value customers based on signup activity. First, we present several approaches to systematically determining an appropriate high-value cutoff, answering the question:

*Who are my high-value customers?*

### Different Approaches to Defining High Value Customers

It is up to each individual company to choose the time interval over which CV is measured. However, this choice should be made according to the constraints of the data set; for our example, the interval is twelve months. This results in a data set with a little over one million observations.

In order to systematically choose the high-value cutoff, it helps to rephrase the problem as the following question:

*For customers who will spend money in their first year, what is an unusually
large amount?*

We can use this question to motivate a formal definition of the high-value
cutoff. Fixing some number $p$ between $0$ and $1$, we will define the
high-value cutoff such that, given a CV greater than $0$, the probability of
observing a CV value greater than the cutoff is less than $p$. This is more
commonly known as the $100(1 - p)$th percentile of the positive CV
distribution. In this post, I’ll fix a single value of $p$ for illustration
purposes, and I will show some tools for arriving at and quantifying your
confidence in an estimate for the *true* percentile of this distribution,
including:

- Empirical quantile estimates
- Bootstrapped confidence intervals for quantile estimates
- Density-based quantile estimates

You may be wondering at this point what I mean by the “positive CV distribution”. It’s helpful to think about the CV values in the following way: imagine that there is some factory that’s generating random CV values over time. The underlying distribution dictates the probability of observing a particular CV value that is generated by this imaginary factory. This is a bit abstract, but it is precisely this line of thinking that lets you generalize any insights extracted from the data to future observations, rather than providing statistics that are, from a probabilistic viewpoint, only relevant to the specific data set at hand.

### The Data

We have a data set consisting of a company’s customer behavioral data, which we cleaned up and processed. The resulting data set contains the positive CV scores (i.e., customers that made purchases), which we will use for the remainder of this blog post. The positive CV score data is plotted as a histogram below.

### Empirical Quantile Estimates

Remember that we want to estimate the chosen quantile for CV scores greater than zero. This section will cover the simplest way to estimate this quantity from the data.

I will start with a very simple example. Suppose that you observe a list of ten values and you want to compute the 90th percentile. Note that the list contains ten values, and also note that $9/10 = 0.9$. Based on this information, one approach to computing the percentile is to sort the list of values and take the 90th percentile to be the ninth value on the list.
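The sort-and-pick rule described above can be sketched in a few lines of Python. This is only an illustration: the function name `empirical_quantile`, the rounding convention (take the ceiling of the rank), and the ten scores are all assumptions made here, and this is not the code from the Gist mentioned in this post.

```python
def empirical_quantile(values, pct):
    """Return the value at rank ceil(pct * n / 100) in the sorted data."""
    if not values:
        raise ValueError("values must be non-empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    n = len(ordered)
    # Ceiling of pct * n / 100, computed with floor division on the negation.
    rank = int(-(-pct * n // 100))
    rank = min(max(rank, 1), n)
    return ordered[rank - 1]


# Hypothetical CV scores, made up for illustration (not the post's data).
toy_scores = [12, 5, 48, 7, 150, 23, 9, 31, 64, 18]

# With ten values, the 90th percentile is the ninth value of the sorted list.
cutoff = empirical_quantile(toy_scores, 90)
```

Other conventions exist (for example, interpolating between neighbouring values), so different tools can return slightly different numbers for the same data.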

The resulting percentile estimates are called empirical quantile estimates. Applying this approach to the sample data set gives the empirical quantile estimate for the chosen percentile (i.e., the high-value cutoff). I saved the code used for computing quantile estimates as a GitHub Gist.

Now, go back to our imaginary factory thought experiment. As you can imagine, if we look at different sets of observations generated by the same factory then the resulting empirical quantile estimates will likely differ from set to set since we would be looking at a different list of numbers each time.

Given this observation, how can we quantify our confidence in our particular
estimate for the percentile? One tool that statisticians like to use for this
purpose is called a **confidence set**, which will be the topic of the next
section.

To explain the intuition behind this, suppose that we could get different
sets of positive CV values from our imaginary factory. We could then compute an
empirical quantile estimate for each of these different data sets. You can see
that this procedure would effectively give us a distribution *for the empirical
quantile estimates*, which, again, would allow us to make statements about the
probability of observing particular values *using this estimation approach*. A
confidence set for our estimate gives us a range of empirical quantile
estimates for which the probability of observing an empirical quantile estimate
outside of this range is less than some small, pre-specified level.

### Bootstrapping Confidence Intervals

Unfortunately, it is rarely the case that we can organically produce a collection of quantile estimates in this manner for obvious reasons. In order to emulate this procedure and compute a confidence set, statisticians like to employ a tool called bootstrapping.

Bootstrap re-sampling, or **bootstrapping**, generates a number of different
data sets of a fixed size by sampling observations with replacement from the
original data set. As before, empirical quantile estimates are then computed
for each of the different sets, from which we can make probabilistic inferences
regarding the values produced by using empirical quantile estimates for the
chosen percentile of the positive CV distribution, or the high-value cutoff.
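As a rough sketch of the resampling step, the Python below draws bootstrap data sets and records one empirical quantile estimate per data set. The names `bootstrap_quantiles`, `n_boot`, and `seed` are hypothetical, and drawing each resample at the same size as the original data is the usual convention assumed here rather than something this post specifies.

```python
import random


def empirical_quantile(values, pct):
    """Return the value at rank ceil(pct * n / 100) in the sorted data."""
    ordered = sorted(values)
    n = len(ordered)
    rank = int(-(-pct * n // 100))
    rank = min(max(rank, 1), n)
    return ordered[rank - 1]


def bootstrap_quantiles(values, pct, n_boot=1000, seed=0):
    """Compute one empirical quantile estimate per bootstrap resample."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    n = len(values)
    estimates = []
    for _ in range(n_boot):
        # Sample n observations with replacement from the original data.
        resample = rng.choices(values, k=n)
        estimates.append(empirical_quantile(resample, pct))
    return estimates


# Hypothetical CV scores, made up for illustration (not the post's data).
toy_scores = [12, 5, 48, 7, 150, 23, 9, 31, 64, 18]
boot_estimates = bootstrap_quantiles(toy_scores, 90, n_boot=200, seed=1)
```

The spread of `boot_estimates` is what stands in for the distribution of estimates we would see if the imaginary factory could hand us fresh data sets.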

It is this probabilistic inference that allows us to estimate confidence sets for the empirical quantile estimates. The particular confidence sets we will be interested in estimating are confidence intervals centered around a given CV score $x$. A confidence interval with coverage $\gamma$ centered around $x$ is a numeric interval $[a, b]$ such that (1) the probability of observing a value between $a$ and $x$ is $\gamma / 2$, and (2) the probability of observing a value between $x$ and $b$ is $\gamma / 2$.

I will now explain how to use bootstrapping to approximate this type of confidence interval via a toy example. Let’s say you have your own data set of CV scores and you compute the empirical quantile estimate for your chosen percentile. You then proceed by creating a list of empirical quantile bootstrap estimates.

Now, you append the empirical quantile estimate obtained from the full data set to this list and sort the result.

The position of the full-data empirical quantile estimate in the sorted list gives the full-data estimate’s percentile relative to the bootstrap samples.

Since we are trying to compute a confidence interval centered around the full-data estimate, we want to compute the two percentiles of the sorted list that lie half of the desired coverage below and half of the desired coverage above the full-data estimate’s percentile. In this toy example, finding one of these percentiles just doesn’t make sense, so we settle for a confidence interval whose endpoints correspond to percentile values that do.

You can use the code I provided in the previous section to calculate these percentiles and obtain an approximate confidence interval centered around the full-data estimate for this toy example. You can find my R code for carrying out this procedure in this GitHub Gist.
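A minimal Python sketch of this procedure follows. It is not the R code from the Gist; the function names, the "at or below" tie rule for the full-data estimate's percentile, and the choice to clip an endpoint percentile to the range 0 to 100 when it falls outside that range are all assumptions made for illustration.

```python
def empirical_quantile(values, pct):
    """Return the value at rank ceil(pct * n / 100) in the sorted data."""
    ordered = sorted(values)
    n = len(ordered)
    rank = int(-(-pct * n // 100))
    rank = min(max(rank, 1), n)
    return ordered[rank - 1]


def centered_interval(boot_estimates, full_estimate, coverage=0.95):
    """Approximate an interval centered (in probability) on full_estimate."""
    # Append the full-data estimate and sort, as in the toy walk-through.
    ordered = sorted(list(boot_estimates) + [full_estimate])
    n = len(ordered)
    # Percentile of the full-data estimate: share of values at or below it.
    at_or_below = sum(1 for v in ordered if v <= full_estimate)
    center_pct = 100 * at_or_below / n
    half = 100 * coverage / 2
    # Assumption: clip endpoint percentiles that fall outside [0, 100].
    lower_pct = max(center_pct - half, 0.0)
    upper_pct = min(center_pct + half, 100.0)
    lower = ordered[0] if lower_pct == 0 else empirical_quantile(ordered, lower_pct)
    upper = empirical_quantile(ordered, upper_pct)
    return lower, upper


# Hypothetical bootstrap estimates 1..99 with a full-data estimate of 50.
interval = centered_interval(list(range(1, 100)), 50, coverage=0.9)
```

When an endpoint is clipped, the interval covers less than the requested probability on that side, which is worth keeping in mind when reading the result.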

I used this code to obtain an approximate confidence interval for my positive CV score data, centered around the full-data empirical quantile estimate computed in the previous section.

The plot below shows a histogram of the bootstrap estimates with the confidence region shaded in red and the full-data estimate as a blue line:

The important thing to keep in mind is that this interval is making a statement
regarding the *precision* of the empirical quantile estimate, and not its
accuracy.

The last section will very briefly talk about a different approach for estimating quantiles using the notion of a probability density function.

### Density Quantile Estimates

A **probability density function**, at a high level, is a mathematical function
that takes an input value and tells you how likely it is to observe values near
it; the probability of landing in a given range is the area under the function
over that range. This is precisely the function that determines *how* our
imaginary factory produces different values for CV.

If we somehow knew the probability density function associated with the customer value data, then we could compute quantiles in a non-random fashion. Density estimation is a large topic and the mathematics can become quite complex; however, I will give a very high-level overview of two approaches that can be used to approximate probability density functions.
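For reference, the link between a density and a quantile can be written down directly. The notation is introduced here only for illustration: $f$ is the density, $F$ its cumulative distribution function, and $q_p$ the quantile at level $p$ (so the 90th percentile corresponds to $p = 0.9$).

$$
F(q_p) = \int_{-\infty}^{q_p} f(x)\,dx = p
$$

Once $f$ is known or estimated, solving this equation for $q_p$ involves no sampling, which is the sense in which the quantile is computed in a non-random fashion.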

The first method that is commonly used for estimating a probability density is
called **kernel density estimation**. This method is intimately related to the
histograms we have been using to analyze the data. The chosen percentile
estimated from the entire data set using the kernel density estimation method
falls outside of the confidence set produced in the previous section.
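To give a feel for how a quantile can be read off a kernel density estimate, here is a small Python sketch using only the standard library. It assumes a Gaussian kernel and a normal-reference rule-of-thumb bandwidth, and finds the quantile by bisection on the estimated cumulative distribution; the function name `kde_quantile` and all of these choices are assumptions for illustration, not the settings used for the analysis in this post.

```python
from statistics import NormalDist, stdev


def kde_quantile(values, pct, bandwidth=None, iterations=100):
    """Quantile of a Gaussian kernel density estimate, found by bisection."""
    if not 0 < pct < 100:
        raise ValueError("pct must be strictly between 0 and 100")
    n = len(values)
    if bandwidth is None:
        # Normal-reference rule of thumb (an assumed default).
        bandwidth = 1.06 * stdev(values) * n ** (-1 / 5)
    if bandwidth <= 0:
        raise ValueError("bandwidth must be positive")
    std_normal = NormalDist()

    def cdf(x):
        # The KDE's CDF is the average of one normal CDF per observation.
        return sum(std_normal.cdf((x - v) / bandwidth) for v in values) / n

    target = pct / 100
    lo = min(values) - 10 * bandwidth
    hi = max(values) + 10 * bandwidth
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if cdf(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


# Hypothetical CV scores, made up for illustration (not the post's data).
toy_scores = [12, 5, 48, 7, 150, 23, 9, 31, 64, 18]
kde_cutoff = kde_quantile(toy_scores, 90)
```

Because the estimated density is smooth, the result is generally not one of the observed values, which is one reason it can differ from the empirical quantile estimate.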

The second method used for estimating probability densities arises from assuming
that the data comes from a known probability distribution that is governed by
some set of parameters. The parameters are then estimated using the data. This
procedure is called fitting a probability distribution to the data. After
delving into the data, I decided to fit an **inverse-gamma distribution**; the
resulting percentile estimate is contained in the confidence set produced in
the previous section.

This is all I will say regarding these two estimation methods, as a detailed explanation requires much more technical machinery than should be discussed here. I should note that bootstrap confidence intervals can also be computed for both of these percentile estimates in exactly the same fashion as before.

In particular, the plot below shows the bootstrap estimates for the Inverse-Gamma quantile estimation method with the confidence set shaded in red and the full-data estimate shown as a blue line.

You should compare this with the plot of the empirical quantile bootstrap estimates from the previous section.

### Predicting High-value Customers

In *Part 1*, we set out to complete the first step in predicting high-value
users, which was to determine the cutoff that defines a high-value customer.
Our approach was to define the cutoff as a percentile in the data, and we
showed a simple but powerful way to compute this percentile. We briefly
mentioned more sophisticated methods for quantile estimation. We also showed a
robust method for estimating confidence intervals that can be used for any
statistic you wish to estimate from a data set.

For *Part 2* in this series on Machine Learning approaches to Customer Lifetime
Value, we will use the average of the empirical and inverse-gamma quantile
estimates as our cutoff. We will cover methods for predicting high-value users
(i.e., those who spend at least the cutoff amount) using only data obtained
from a customer’s signup event, such as signup time, country of registration,
page views and interactions with marketing campaigns.