Appendix: Primer on hierarchical Bayesian inference and Poisson-Gamma models

The purpose of this appendix is to illustrate some key principles of hierarchical Bayesian inference. This statistical method is an integral part of generalizing scientific information across products in the Net Impact Model. The appendix does not aim to be comprehensive with regard to hierarchical Bayesian inference, but rather to explain the Poisson-Gamma model that is used in quantifying net impact.

The Poisson distribution with a known event rate

If an event X happens randomly at a known rate, the Poisson distribution tells us the probability that the event happens k times during one time interval: $P(k \text{ events in interval } t) = \frac{\lambda^k e^{-\lambda}}{k!}$, where $\lambda$ is the average number of events in time $t$.

Or as Wikipedia describes it:

In probability theory and statistics, the Poisson distribution, named after French mathematician Siméon Denis Poisson, is a discrete probability distribution that expresses the probability of a given number of events occurring in a fixed interval of time or space if these events occur with a known constant mean rate and independently of the time since the last event. The Poisson distribution can also be used for the number of events in other specified intervals such as distance, area or volume.

For instance, a call center receives an average of 180 calls per hour, 24 hours a day. The calls are independent; receiving one does not change the probability of when the next one will arrive. The number of calls received during any minute has a Poisson probability distribution with mean 3: the most likely numbers are 2 and 3, but 1 and 4 are also likely, and there is a small probability of it being as low as zero and a very small probability it could be 10.

Another example: We’re running a tiny cafe and want to know how many seats we need for our morning customers. We know from last year’s bookkeeping that in the mornings we get an average of 8 customers per hour.

Let’s mark each customer arrival as event X. Let’s also assume the arrivals have a constant mean rate and are independent of each other. In this case, the probability of getting k customers in an hour is $P(k) = \frac{8^k e^{-8}}{k!}$, where $k$ is the number of arrivals in an hour.
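To make this concrete, here is a minimal sketch that evaluates the formula above with scipy.stats.poisson (the values of k are chosen just for illustration):

```python
from scipy.stats import poisson

rate = 8  # average customer arrivals per hour
# Probability of exactly k arrivals in one hour, P(k) = 8^k e^-8 / k!
for k in [4, 8, 12]:
    print(f"P({k} arrivals in an hour) = {poisson.pmf(k, mu=rate):.3f}")
```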

Unknown (latent) event rate

In many real-life cases we don’t know the actual average rate of events, since it is a latent variable that we can’t directly measure. However, we can observe the number of occurrences and make an educated guess about the underlying event rate. The longer the interval we observe, the better our educated guess becomes.

In the cafe example, we might want to know how many customers our rival cafe has per hour. They won’t show us their bookkeeping, but we can observe customer arrivals and make an educated guess. Observing one hour gives us some idea, and observing an hour every day for a month gives a pretty good estimate.

One way to answer the rival cafe question is by using the Poisson distribution as a likelihood function.

The likelihood function

The likelihood function answers the what-if question: if the latent event rate were X, how likely would it be to get these observations?

Or as Wikipedia describes it:

In statistics, the likelihood function (often simply called the likelihood) measures the goodness of fit of a statistical model to a sample of data for given values of the unknown parameters.

For example, if we observed that the rival cafe had 10 customers during an hour, we could ask ourselves:

  • How likely would 10 occurrences be if the latent event rate was 5?

  • How likely would 10 occurrences be if the latent event rate was 10?

  • How likely would 10 occurrences be if the latent event rate was 20?

And answer these questions with our Poisson likelihood function:

  • Likelihood of 10 observations if event rate was 5: $\frac{5^{10} e^{-5}}{10!} \approx 1.8\%$

  • Likelihood of 10 observations if event rate was 10: $\frac{10^{10} e^{-10}}{10!} \approx 12.5\%$

  • Likelihood of 10 observations if event rate was 20: $\frac{20^{10} e^{-20}}{10!} \approx 0.6\%$
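
These likelihoods are straightforward to reproduce; a minimal sketch, again using scipy.stats.poisson:

```python
from scipy.stats import poisson

observed = 10  # customers observed during one hour
for rate in [5, 10, 20]:
    likelihood = poisson.pmf(observed, mu=rate)
    print(f"Likelihood of {observed} observations if event rate was {rate}: "
          f"{likelihood:.1%}")
```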

These are not yet very useful insights. But we can turn the likelihood function into a (posterior) event rate estimate using Bayesian inference.

Bayesian inference

Bayesian inference allows us to estimate the value of a random variable based on a likelihood function and observations. Bayesian inference also allows combining prior beliefs and observations into a posterior estimate (as seen later in this section).

Or as Wikipedia describes it:

Bayesian inference is a method of statistical inference in which Bayes’ theorem is used to update the probability for a hypothesis as more evidence or information becomes available.

Bayesian inference may become complex and often requires sampling instead of analytical solving. Luckily, some smart mathematician has shown that the gamma distribution is the conjugate prior of the Poisson distribution.

This means that if we estimate a latent event rate X and:

  • we believe that the likelihood function for X is a Poisson distribution

  • we have no prior beliefs about the value of X

  • we have observed k event occurrences over a period of t intervals

Then the best estimate for the underlying event rate is a gamma distribution with shape k and rate t: $X \sim \mathrm{Gamma}(k, t)$.

In our cafe example, the posterior estimate for our rival’s customer rate is $X \sim \mathrm{Gamma}(10, 1)$.
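
A minimal sketch for inspecting this posterior with scipy.stats.gamma (note that scipy uses a scale parameter, which is the inverse of the rate; the 90% interval is chosen just for illustration):

```python
from scipy.stats import gamma

k, t = 10, 1  # 10 observed arrivals over 1 hour
posterior = gamma(a=k, scale=1/t)  # scipy's scale is the inverse of the rate
print(f"posterior mean: {posterior.mean():.1f} customers/hour")
print(f"90% credible interval: {posterior.interval(0.9)}")
```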

The Gamma distribution

The gamma distribution is driven by two parameters: shape ($\alpha$) and rate ($\beta$).

Interpreting the point estimate and certainty of a gamma distribution:

  • The point estimate (expected value) of a variable following the gamma distribution is its shape divided by its rate, $\frac{\alpha}{\beta}$

  • The certainty (informativeness) of a gamma distribution can be described by its rate parameter $\beta$. The variance of the gamma distribution is $\frac{\alpha}{\beta^2}$, or the point estimate divided by the rate.

For example, if X is estimated to follow $\mathrm{Gamma}(8, 2)$, then the expected value of X would be 4, and X would likely fall between 2 and 6 (with probability 86%, to be precise).

Code Snippet

```python
from scipy.stats import gamma

alpha, beta = 8, 2
# scipy parametrizes the gamma distribution with a scale parameter,
# which is the inverse of the rate: scale = 1 / beta
lb = gamma.cdf(x=2, a=alpha, scale=1/beta)
ub = gamma.cdf(x=6, a=alpha, scale=1/beta)
print(ub - lb)  # ~0.86, i.e. X falls between 2 and 6 with ~86% probability
```

Inference with a prior

Sometimes prior information about X exists and is relevant to include in our estimate.

For example, we might know that our competitor has the same customer arrival rate as we have (8 per hour). We might, for example, trust this belief equally as much as we trust our 1-hour observation. In this case we would formulate our prior belief as a gamma distribution: $X_{prior} \sim \mathrm{Gamma}(8, 1)$.

Based on the rules of a conjugate prior, this leads to the posterior estimate $X_{posterior} \sim \mathrm{Gamma}(\alpha_{prior} + k, \beta_{prior} + t) = \mathrm{Gamma}(18, 2)$.
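
As a minimal sketch of this conjugate update in code, using the numbers from the cafe example:

```python
from scipy.stats import gamma

alpha_prior, beta_prior = 8, 1  # prior: ~8 customers/hour, trusted like 1 hour of data
k, t = 10, 1                    # observed: 10 customers in 1 hour

alpha_post, beta_post = alpha_prior + k, beta_prior + t  # conjugate update
posterior = gamma(a=alpha_post, scale=1/beta_post)
print(f"posterior: Gamma({alpha_post}, {beta_post}), "
      f"mean {posterior.mean():.1f} customers/hour")  # mean 9.0
```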

Inference with several observation intervals

In the cafe example, we might observe our rival on another morning and count 12 customers in two hours.

Based on the rules of conjugate priors and Bayesian inference, this allows us to update our belief about our rival’s customer arrival rate: $X_{posterior} \sim \mathrm{Gamma}(\alpha_{prior} + k_1 + k_2, \beta_{prior} + t_1 + t_2) = \mathrm{Gamma}(30, 4)$.

Summarizing information from multiple Poisson-distributed observations of the same process

Because the conjugate inference is based on addition:

  • The order of observations doesn’t matter

  • It doesn’t matter whether we update our prior belief “all observations at once” or “one observation at a time”

Because of these, inferred Gamma random variables can be summarized by adding up the shapes and rates. If multiple observation intervals of a single process yield separate inferences

$x_i \sim \mathrm{Gamma}(\alpha_i, \beta_i), \quad i = 1, 2, 3, \ldots,$

then the best estimate of how the process behaves is

$x \sim \mathrm{Gamma}\left(\sum_i \alpha_i, \sum_i \beta_i\right).$
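
For instance, a minimal sketch that combines the Gamma(8, 1) prior and the two cafe observations from above; since only the sums matter, adding the components in any order, or all at once, gives the same result:

```python
from scipy.stats import gamma

# (shape, rate) components: the Gamma(8, 1) prior plus the two
# observation intervals (10 customers / 1 h and 12 customers / 2 h)
components = [(8, 1), (10, 1), (12, 2)]

alpha = sum(a for a, _ in components)  # 30
beta = sum(b for _, b in components)   # 4
combined = gamma(a=alpha, scale=1/beta)
print(f"combined: Gamma({alpha}, {beta}), mean {combined.mean():.2f}")  # mean 7.50
```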

Hierarchical interpretation

Besides combining several observation intervals, the above arithmetic is useful when working with hierarchical information. If information about a general phenomenon can be understood as informative about a specific variant of the same phenomenon, it is reasonable to treat the general information as a prior for the specific phenomenon. Direct observations about the specific phenomenon then update our generic, prior beliefs. The volumes of observations of the generic and specific phenomena determine how much each affects the updated inference.

To illustrate, let's assume that a new apple retailer has just opened and needs to understand how many apples they are going to sell in a given time period so that they can arrange the right amount of (perishable) inventory.

Before opening and selling even the first apples, the retailer needs to make an educated guess about their rate of sales. With no direct data, the retailer investigates similar retailers, and finds that other fruit stores operating in the area are selling roughly 5 fruits per hour. There is some observed variance between stores, so the retailer estimates a variance of 2 for this data. The retailer assumes a Poisson process for the fruit retailing business and solves the equation pair

$$\begin{align*}
\frac{\alpha}{\beta} &= 5 \\
\frac{\alpha}{\beta^2} &= 2
\end{align*}$$

to arrive at the prior estimate $X_{prior} \sim \mathrm{Gamma}(12.5, 2.5)$.
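
Solving the pair takes only a line or two of arithmetic; as a sketch:

```python
# Match the gamma moments to the benchmark data:
# mean = alpha / beta and variance = alpha / beta**2
mean, variance = 5, 2
beta = mean / variance  # 2.5
alpha = mean * beta     # 12.5
print(f"prior: Gamma({alpha}, {beta})")
```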

Assume that during their opening day, the retailer operates for 8 hours and sells 64 apples in that period. This exceeds their prior expectations, so the retailer updates their belief about their rate of sales to $\mathrm{Gamma}(12.5 + 64, 2.5 + 8) = \mathrm{Gamma}(76.5, 10.5)$.
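
A minimal sketch of this update; note that the posterior mean (about 7.3 apples per hour) falls between the prior mean (5) and the raw observed rate (64/8 = 8), reflecting both sources of information:

```python
from scipy.stats import gamma

alpha_prior, beta_prior = 12.5, 2.5  # benchmark-based prior, mean 5 fruits/hour
k, t = 64, 8                         # opening day: 64 apples in 8 hours

posterior = gamma(a=alpha_prior + k, scale=1/(beta_prior + t))  # Gamma(76.5, 10.5)
print(f"posterior mean: {posterior.mean():.2f} apples/hour")    # ~7.29
```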

Having direct observations about their business, the retailer now has more confidence in their rate of apple sales, but it is worth noting that they do not completely disregard the prior benchmarking data they collected. This posterior belief now reflects both the direct and indirect data the retailer has collected.

It is likely that the retailer sells apples at a higher rate than the average fruit retailer, but this is not certain yet. As the retailer keeps operating and collecting data, they will subsequently update their beliefs about their rate of sales and learn if they indeed have lasting traction with their customers or if they just had a successful opening day.
