In [149]:
%load_ext cython
import numpy as np
import pandas as pd
import seaborn as sns
import time
import numpy.linalg as la
import scipy.stats as stats
import matplotlib.pyplot as plt
from sklearn import preprocessing
from itertools import product
%matplotlib inline
The cython extension is already loaded. To reload it, use:
  %reload_ext cython

Hamiltonian Monte Carlo

STA663 Final Project

Authors: Ilan Man and Sanjay Hariharan

1. Introduction

Hamiltonian Monte Carlo (HMC) is a relatively recent approach to posterior estimation. It improves on the simple random-walk proposal of the Metropolis algorithm, by ensuring that these proposals are in the direction of steepest ascent (Neal, 2011). In doing so, it avoids the slow exploration of the state space from these random proposals. HMC reduces the correlation between successive sampled states by using Hamiltonian evolution between states, a concept derived from quantum mechanics. The energy preserving aspect of Hamiltonian dynamics will be extremely useful in derivating and implementating this algorithm.

This paper is structured as follows:

  • In section 2 we introduce and review common MCMC methods.
    • In particular we discuss Gibbs Sampling and Metropolis in the context of a bivariate Normal distribution.
  • In section 3 we introduce Hamiltonian dynamics and discuss its connection to MCMC and probability models.
  • In section 4 we apply Hamiltonian Monte Carlo to the same bivariate Normal distribution and discuss its convergence under various parameter settings
  • In section 5 we present empirical comparisons between Gibbs Sampling, Metropolis and Hamiltonian Monte Carlo when we fit a linear regression model.
    • We also include examples of using Cython to speed up inefficiencies in the Metropolis algorithm.

2. MCMC review

Markov Chain Monte Carlo (MCMC) methods are a class of algorithms for sampling from a proabability distribution based on sequential draws where each draw is dependent only on the previous value drawn, hence the draws form a Markov Chain.

In Bayesian inference, parameter estimates are found by repeated sampling from the posterior distribution. MCMC methods are used extensively in calculating posterior distributions, which are often complex and intractable, and therefore must be approximated numerically. MCMC algorithms have certain attractive properties such as convergence to a limiting distribution, which are exploited in solving Bayesian inference problems.

2.1 Motivation: Bivariate Normal

In the case of hierarchical Bayesian models, often the posterior distribution is comprised of several parameters and is often impossible to solve analytically in closed form. Even when tractable, as the dimensionality of the problem increases, we may find ourselves having to compute several complex integrals which require large amounts of computing time. In real-world applications, practitioners commonly use MCMC methods to solve for these parameters. There are many flavors of MCMC and the modeler should identify which one to apply to their specific case.

For the purposes of comparing the convergence of MCMC algorithms, we use a bivariate Normal distribution. The parameters are as follows:

$$(\theta_1, \theta_2) \sim N(\mu, \Sigma), \:\:\: \mu = \begin{pmatrix} x_1\\x_2\end{pmatrix}, \:\:\: \Sigma=\begin{pmatrix} 1 && \rho \\ \rho && 1 \end{pmatrix}$$

In our examples, $x_1 = 2, x_2 = 1$ and $\rho = 0.2$ and each algorithm is run for $N = 1000$ iterations.

Algorithm specific parameters are:

  1. Gibbs
    • Assume uniform prior.
  2. Metropolis
    • Random Walk $N(0,1)$ proposal distribution
  3. Hamiltonian Monte Carlo
    • Leap steps: 3
    • $\delta$ = 0.3
In [150]:
"""
INITIALIZE PARAMETERS FOR CONVERGENCE COMPARISONS FOR BIVARIATE DISTRIBUTION
"""

# shared parameters
N = 1000
corr = 0.2

Covariance = np.array([[1,corr],
                       [corr,1]])

# gibbs parameters
mu = np.array([2,1])

# hmc parameters
delta = 0.3
leap = 3
In [151]:
def acceptance_criterion(accept, alpha, theta_star, theta_previous):
    """
    Acceptance criterion

    Parameters
    ----------
    accept: current number of accepted samples
    alpha: acceptance ratio
    theta_star : proposed sample
    theta_previous: previous value of theta
    
    Returns
    -------
    next value for theta[t] (either theta_star or theta_previous)
    updated value for acceptance (either accept + 1 or accept)
    """
    
    return (theta_star, accept + 1) if np.random.random() < alpha else (theta_previous, accept)

2.2 Gibbs Sampling

Gibbs sampling is a specific variant of MCMC where each parameter of interest can be sampled directly, when conditioned across all the other parameters. For many problems involving standard statistical models, it is possible to sample directly from most or all of the conditional posterior distributions of the parameters (Gelman, 2014). This algorithm generates an instance from the distribution of each variable in turn, conditional on the current values of the other variables. This sequence of samples constitutes a Markov Chain whose stationary distribution is the sought-after joint distribution, $p(\theta_1, \theta_2,\dots,\theta_D)$

Gibbs Sampling Pseudo-Code

  1. set $t = 0$
  2. generate an initial state $\theta^{(0)} \sim \pi^{(0)}$
  3. repeat until $t = N$
    1. set $t = t+1$
    2. for each dimension $i = 1, ..., D$
      • draw $\theta_i^{(t)}$ from $p(\theta_i|\theta_1^{(t)},\theta_2^{(t)},\dots,\theta_{i-1}^{(t)},\theta_{i+1}^{(t)},\dots,\theta_D^{(t)})$

For the case of the bivariate normal, the conditonal distributions are:

$$\theta_1|\theta_2, x \sim N(x_1 + \rho(\theta_2-x_2), 1-\rho^2)$$$$\theta_2|\theta_1, x \sim N(x_2 + \rho(\theta_1-x_1), 1-\rho^2)$$
In [152]:
def gibbs_sampler(N, mu, corr):
    """
    Gibbs Sampling algorithm for a bivariate normal

    Parameters
    ----------
    N: number of iteration of the algorithm
    mu: mean of bivariate
    corr: correlation between x1 and x2
    
    Returns
    -------
    theta : ndarray of shape 2 x N. Transformed variable with Gibbs samples
    
    Raises
    ------
    Error if mu is not 2 dimensional
    Error if abs(corr) > 1
    
    """
    
    assert(mu.shape[0] == 2), "mean vector must be 2 dimensional for bivariate normal"
    assert(abs(corr) <= 1), "correlation must be between -1 and 1"
    
    theta = np.zeros(2*N).reshape(N,2)
    dimensions = range(mu.shape[0])

    for t in np.arange(N):
        for dim in dimensions:
            
            # conditional distribution for bivariate normal
            theta[t][dim] = stats.norm.rvs(loc = mu[dim] + corr*(theta[t][1-dim]-mu[1-dim]), 
                                           scale = np.sqrt(1 - corr**2))
    
    return theta

Shortcomings

The main shortcoming of Gibbs Sampling is the successive autocorrelation of the samples. Due to the fact that the sampling generates a Markov Chain, successive samples are correlated with each other. Thinning and block sampling may help reduce this correlation, but these heuristics sometimes don't work well or require lots of data (Gelman, 2014). Finally, an obvious shortcoming specific to Gibbs Sampling is that it requires identifying full conditional distributions for each parameter. As with the posterior, some of these conditional distributions may have complex, intractable forms and will need other MCMC methods to compute them. Note that it is common to combine Gibbs Sampling when a specific conditional distribution for $\theta_i$ is known, with other methods when the conditional distribution of $\theta_j$ (for $j \ne i$) is unknown.

2.3 Metropolis

The Metropolis algorithm is a generalization of Gibbs Sampling, used when the conditional sampling distributions cannot be computed in closed form. At each iteration of this algorithm, a candidate is proposed as the next sample value of the parameter in question. The most common proposal method is a simple random walk, i.e. a Standard Normal Distribution. Then, with probability $\alpha$ the candidate is either accepted (updating the chain's next value with the candidate) or rejected (making the chain's next value equal to the current). $\alpha$ is the minimum of 1 and the ratio, $r$, of the posterior distribution with the proposed parameter over the posterior distribution with the previous parameter.

The acceptance/rejection rule of the Metropolis algorithm can be understood as follows:

  1. if the jump increases the posterior density, set $\theta^{t} = \theta^{*}$
  2. if the jump decreases the posterior density, set $\theta^{t} = \theta^{*}$
    1. with probability equal to the density ratio, $r$
    2. otherwise set $\theta^t = \theta^{t-1}$

Metropolis Pseudo-Code

  1. set $t = 0$
  2. generate an initial state $\theta^{(0)} \sim \pi^{(0)}$
  3. repeat until $t = N$
    1. set $t = t+1$
    2. generate a proposal state $\theta^*$ from $q(\theta | \theta^{(t-1)})$
    3. calculate the proposal correction factor $c = \frac{q(\theta^{(t-1)} | \theta^*) }{q(\theta^*|\theta^{(t-1)})}$
    4. calculate the acceptance probability $\alpha = \text{min} \left (1,\frac{p(\theta^*)}{p(\theta^{(t-1)})} \times c\right )$
    5. draw a random number $u$ from Unif(0,1)
      1. if $u \leq \alpha$ accept the proposal state $\theta^*$ and set $\theta^{(t)}=\theta^*$
      2. else set $\theta^{(t)} = \theta^{(t-1)}$
In [153]:
def metropolis_MC(N, mu, Cov):
    """
    Metropolis Monte Carlo algorithm for a bivariate normal

    Parameters
    ----------
    N: number of iteration of the algorithm
    mu: mean of bivariate
    Cov: covariance matrix
    
    Returns
    -------
    theta : ndarray of shape 2 x N. Transformed variable with Metropolis samples
    acceptance_rate : % of samples accepted
    
    Raises
    ------
    Error if covariance is not positive semi definite
    Error if mu is not 2 dimensional
    
    """
    
    assert(mu.shape[0] == 2), "mean vector must be 2 dimensional for bivariate normal"
    assert(np.all(la.eigvals(Cov)>=0)), "Covariance matrix must be Positive Semi-Definite"
    
    theta = np.zeros(2*N).reshape(N,2)
    accept = 0
                  
    for t in np.arange(1,N):

        # multivariate normal proposal
        theta_star = np.random.multivariate_normal(theta[t-1], Cov)

        # acceptance ratio
        alpha = min(1, stats.multivariate_normal.pdf(theta_star, mean = mu, cov = Cov)/
                    stats.multivariate_normal.pdf(theta[t-1], mean = mu, cov = Cov))
        
        # accept/reject criterion
        (theta[t], accept) = acceptance_criterion(accept, alpha, theta_star, theta[t-1])
    
    return theta, float(accept)/N

Shortcomings

The three main shortcomings of the Metropolis Algorithm are that:

  1. It explores the posterior space very slowly. This is due to the random walk behavior of the proposals. This makes the algorithm inefficient and requires a lot of computation to accurately estimate low-variance parameters.
  2. As with other MCMC methods, assessing convergence is difficult in practice, and one must run the chain long enough to be satisfied that it converged. Otherwise the simulations may be very unrepresentative of the target distribution.
  3. As with other iterative procedures (like Gibbs Sampling above) simulation inference from correlated draws is generally less precise than from the same number of independent draws. This adds to the required number of runs necessary.

3. Hamiltonian Dynamics

In order to understand Hamiltonian Monte Carlo, one must become familiar with Hamiltonian dynamics. Hamiltonian dynamics describes an object's motion in terms of its location $x$ and momentum $p$ (ref). Recall from high school physics that momentum $p$ is equal to an object's mass $m$ times its velocity $v$. For each location an object takes, there is an associated potential energy $U(x)$, and for each momentum there is an associated kineric energy $K(p)$. The total energy of the system is constant and is known as the Hamiltonian $H(x,p)$, defined as:

$$H(x,p) = U(x) + K(p)$$

This description is implemented quantitatively via a set of differential equations known as the Hamiltonian equations:

$$\frac{\partial x_i}{\partial t} = \frac{\partial H}{\partial p_i} = \frac{\partial K(\mathbf{p})}{\partial p_i}$$$$\frac{\partial p_i}{\partial t} = -\frac{\partial H}{\partial x_i} = -\frac{\partial U(\mathbf{x})}{\partial x_i}$$

If we can solve these equations, we can apply Hamiltonian Dynamics to more general cases. However, in order to simulate these dynamics numerically for computation, it is necessary to approximate the Hamiltonian equations by discretizing time. This is done by splitting up the interval $T$ into a series of smaller intervals of length $\delta$. The smaller the value of $\delta$ the closer the approximation is to the dynamics in continuous time, but the more computationally expensive is the procedure.

3.1 The Leap Frog Method

The Leap Frog Method is a popular method for numerically integrating differential equations, such as the Hamiltonian equations. The method updates the momentum and position variables sequentially, starting by simulating the momentum dynamics over a small interval of time $\frac{\delta}{2}$, then simulating the position dynamics over a slightly longer interval in time $\delta$, then completing the momentum simulation over another small interval of time $\frac{\delta}{2}$ so that $x$ and $p$ now exist at the same point in time. Specifically:

  1. Take a half step in time to update the momentum variable:

    $p_i(t + \delta/2) = p_i(t) - (\delta /2)\frac{\partial U}{\partial x_i(t)}$

  2. Take a full step in time to update the position variable:

    $x_i(t + \delta) = x_i(t) + \delta \frac{\partial K}{\partial p_i(t + \delta/2)}$

  3. Take the remaining half step in time to finish updating the momentum variable:

    $p_i(t + \delta) = p_i(t + \delta/2) - (\delta/2) \frac{\partial U}{\partial x_i(t+\delta)}$

This method can be run for $L$ steps to simulate dynamics over $L \times \delta$ units of time. This particular method preserves the energy of the system and is time-reversible (Neal, 2011), and thus it is a popular choice for numerical approximations, especially in comparison with other approaches such as Euler's method.

3.2 MCMC from Hamiltonian dynamics

Using Hamiltonian dynamics to sample from a distribution requires translating the density function of the distribution to a potential energy function by introducing momentum variables to correspond to the original variables of interest, now denoted as position variables. Note that the act of introducing momentum variables, which are used only in helping to determine the position variables, is an example of auxiliary MCMC. We can then simulate a Markov chain in which each iteration resamples the momentum and then does a Metropolis update with a proposal found using Hamiltonian dynamics.

In order to choose the Hamiltonian function, we relate $H(x, p)$ to $p(x)$ using a concept known as the canonical distribution (Neal, 2011). For any energy function $E(\theta)$, define the corresponding canonical distribution as:

$p(\theta) = \frac{1}{Z}e^{-E(\mathbf\theta)}$

The variable $Z$ is a normalizing constant called the partition function that scales the canonical distribution such that is sums to one, creating a valid probability distribution.

Using the above equation for the total energy, we can factorize the canonical distribution into functions of $x$ and $p$:

$p(x,p) \propto e^{-H(x,p)} \\ = e^{-[U(x) - K(p)]} \\ = e^{-U(x)}e^{-K(p)} \\ \propto p(x)p(p)$

We can now use Hamiltonian dynamics to sample from the joint canonical distribution over $p$ and $x$ and simply ignore the momentum contributions.

3.3 Probability and the Hamiltonian

In HMC, we use Hamiltonian dynamics as a proposal function in order to explore the canonical (posterior) density $p(x)$ defined by $U(x)$ more efficiently. Starting at an initial state $[x_0, p_0]$, we simulate Hamiltonian dynamics for a short time using the Leap Frog method. We then use the position and momentum variables at the end of the simulation as our proposed state variables $x^*$ and $p^*$. This proposed state is accepted with an update rule similar to Metropolis above:

$p(x^*, p^*) \propto e^{-[U(x^*) + K(p^*)]} \\ p(x_0, p_0) \propto e^{-[U(x^{(t-1)}), K(p^{(t-1)})]}$

Accept with probability:

min(1, $\frac{p(x^*, p^*)}{p(x^0, p^0)}$)

If the state is rejected, the next state of the Markov Chain is set to the previous state. Note that Hamiltonian dynamics will follow contours of constant energy in phase space. Therefore, in order to explore all of the posterior distribution, we draw a random momentum from the corresponding canonical distribution $p(\mathbf{p})$ before running the dynamics prior to each sampling iteration $t$ (Neal, 2011).

3.4 Hamiltonian Monte Carlo Algorithm

  1. set $t = 0$
  2. generate an initial position state $x^{(0)} \sim \pi^{(0)}$
  3. repeat until $t = N$:
    1. set $t = t+1$
    2. sample a new initial momentum variable from the momentum canonical distribution $p_0 \sim p(p)$
    3. set $x_0 = x^{(t-1)}$
    4. run Leap Frog algorithm starting at $[x_0, p_0]$ for L steps and stepsize $\delta$ to obtain proposed states $x^*$ and $p^*$
    5. calculate the Metropolis acceptance probability:
    6. - $\alpha = \text{min}(1,\exp(-U(x^*) + U(x_0) - K(p^*) + K(p_0)))$
    7. draw a random number u from $\text{Unif}$(0,1)
      - if $u \leq \alpha$ accept the proposed state position $x^*$ and set the next state in the Markov chain $x^{(t)}=x^*$
      - else set $x^{(t)} = x^{(t-1)}$

One of the open questions in using HMC is parameter tuning. Below we discuss this, but note that it's generally accepted that one should choose values for $\delta$ and $L$ such that $\delta \times L=1$ (Gelman, 2014).

In [154]:
""" Hamiltonian helper functions """

def potential_energy(theta, Cov, mu):
    """
    Evaluates the Posterior Distribution for theta
    
    Parameters
    ----------
    theta : Matrix of shape n x 2
    Cov: covariance matrix
    mu : mean of bivariate distribution to sample from
    
    Returns
    -------
    The Posterior Distribution under inputs
    """
        
    return (theta - mu).dot(la.inv(Cov.T)).dot((theta - mu).T)
In [155]:
def grad_potential_energy(theta, Cov, mu):
    """
    Gradient of the potential energy function

    Parameters
    ----------
    theta : Matrix of shape n x 2
    Cov: covariance matrix
    mu : mean of bivariate distribution to sample from
    
    Returns
    -------
    The gradient of the posterior distribution under theta
    """
        
    return (theta - mu).dot(la.inv(Cov))
In [156]:
def kinetic_energy(p):
    """
    Evaluates the PDF for the Momentum Variables (Multivariate Normal). This is used in accept/reject step.
    
    Parameters
    ----------
    p : Vector of Momentum Values to calculate the PDF for
    
    Returns
    -------
    Single value containing the PDF for a MVN as the p vetor values
    """

    return np.sum(p.T.dot(p)/2)
In [157]:
def hamiltonian_monte_carlo(N, leap, delta, Cov, mu):
    """
    
    Hamiltonian Monte Carlo algorithm for a bivariate normal

    Parameters
    ----------
    N: number of iteration of the algorithm
    Cov: covariance matrix
    mu : mean of bivariate distribution to sample from
    leap: number of steps of leap frog
    delta: step size for discrete approximation
    
    Returns
    -------
    theta : ndarray of shape 2 x N. Transformed variable with Hamiltonian samples
    acceptance_rate : % of samples accepted
        
    Raises
    ------
    Error if mu is not 2 dimensional
    Error if leap or delta are negative
    Error if covariance is not positive semi definite
    
    """
    
    assert(leap > 0), "leap must be positive"
    assert(delta > 0), "delta must be positive"
    assert(np.all(la.eigvals(Cov)>=0)), "Covariance matrix must be Positive Semi-Definite"
    
    theta = np.zeros(2*N).reshape(N,2)
    accept = 0

    for t in range(1,N):

        #   SAMPLE RANDOM MOMENTUM
        p0 = np.random.randn(2)

        #   SIMULATE HAMILTONIAN DYNAMICS
        #   FIRST 1/2 STEP OF MOMENTUM
        p_star = p0 - 1/2*delta*grad_potential_energy(theta[t-1], Cov, mu)

        #   FIRST FULL STEP FOR POSITION/SAMPLE
        theta_star = theta[t-1] + delta*p_star

        #   FULL STEPS
        for i in range(leap):
            # MOMENTUM
            p_star = p_star - delta*grad_potential_energy(theta_star, Cov, mu)
            # POSITION/SAMPLE
            theta_star = theta_star + delta*p_star

        p_star = p_star - 1/2*delta*grad_potential_energy(theta_star, Cov, mu)

        U0 = potential_energy(theta[t-1], Cov, mu)
        U_star = potential_energy(theta_star, Cov, mu)

        K0 = kinetic_energy(p0)
        K_star = kinetic_energy(p_star)

        # acceptance ratio
        alpha = min(1, np.exp((U0 + K0) - (U_star + K_star)))

        # accept/reject criterion
        (theta[t], accept) = acceptance_criterion(accept, alpha, theta_star, theta[t-1])
        
    return theta, float(accept)/N

4. Empirical Analysis: Bivariate Normal

Below we run the 4 MCMC methods on the bivariate distribution. We show convergence plots and trace plots. Recall the parameters of the target distribution:

$$(\theta_1, \theta_2) \sim N(\mu, \Sigma), \:\:\: \mu = \begin{pmatrix} 2\\1\end{pmatrix}, \:\:\: \Sigma=\begin{pmatrix} 1 && 0.2 \\ 0.2 && 1 \end{pmatrix}$$
In [158]:
x_GIBBS = gibbs_sampler(N, mu, corr)
x_MMC, MMC_accept = metropolis_MC(N, mu, Covariance)
x_HMC, HMC_accept = hamiltonian_monte_carlo(N, leap, delta, Covariance, mu)
In [159]:
def convergenceplots(x_GIBBS, x_MMC, x_HMC, early_samples):
    """
    CONVERGENCE PLOTS
    
    Parameters
    ----------
    x_GIBBS : N x 2 samples from a run of the Gibbs sampler
    x_MHMC : N x 2 samples from a run of the Metropolis
    x_HMC : N x 2 samples from a run of the Hamiltonian Monte Carlo
    early_samples : highlight the first "early_samples" number of samples to check early convergence
    
    Returns
    -------
    3 plots of the coverage of each algorithm, specifically the early convergence
    """

    fig, axes = plt.subplots(nrows=1, ncols=3, figsize=(16,6))

    axes[0].scatter(x_GIBBS[:,0],x_GIBBS[:,1])
    axes[0].plot(x_GIBBS[:early_samples,0],x_GIBBS[:early_samples,1],'ro-')
    axes[0].legend(['First 50 states','Samples'],fontsize=12)
    axes[0].set_title("Gibbs Sampler for Bivariate Normal\n correlation = {}".format(corr),size=14)
    axes[0].tick_params(axis='both', which='major', labelsize=12)
    axes[0].set_xlabel(r"$\theta_1$",size=16)
    axes[0].set_ylabel(r"$\theta_2$",size=16)

    axes[1].scatter(x_MMC[:,0],x_MMC[:,1])
    axes[1].plot(x_MMC[:early_samples,0],x_MMC[:early_samples,1],'ro-')
    axes[1].legend(['First 50 states','Samples'],fontsize=12)
    axes[1].set_title("Metropolis Monte Carlo for Bivariate Normal\n correlation = {}".format(corr),size=14)
    axes[1].tick_params(axis='both', which='major', labelsize=12)
    axes[1].set_xlabel(r"$\theta_1$",size=16)
    axes[1].set_ylabel(r"$\theta_2$",size=16)

    axes[2].scatter(x_HMC[:,0],x_HMC[:,1])
    axes[2].plot(x_HMC[:early_samples,0],x_HMC[:early_samples,1],'ro-')
    axes[2].legend(['First 50 states','Samples'],fontsize=12)
    axes[2].set_title("Hamiltonian Monte Carlo for Bivariate Normal\n correlation = {}".format(corr),size=14)
    axes[2].tick_params(axis='both', which='major', labelsize=12)
    axes[2].set_xlabel(r"$\theta_1$",size=16)
    axes[2].set_ylabel(r"$\theta_2$",size=16)
    fig.tight_layout()

    pass
In [160]:
convergenceplots(x_GIBBS, x_MMC, x_HMC, 50)
In [161]:
def traceplots(x_GIBBS, x_MMC, x_HMC, burnin, N):   
    """
    TRACEPLOTS
    
    Parameters
    ----------
    x_GIBBS : N x 2 samples from a run of the Gibbs sampler
    x_MHMC : N x 2 samples from a run of the Metropolis
    x_HMC : N x 2 samples from a run of the Hamiltonian Monte Carlo
    burn in : burnin period
    N : number of iterations
    
    Returns
    -------
    3 traceplots, post burn in
    """

    fig, axes = plt.subplots(nrows=1, ncols=3, figsize=(14,4))
    xloc = plt.MaxNLocator(5)

    axes[0].plot(range(burnin,N),x_GIBBS[burnin:,0],alpha=0.6)
    axes[0].hlines(2,burnin,N,color='red',linewidth=3,linestyle='--')
    axes[0].set_title("Traceplot for " r"$\theta_1$: Gibbs Sampler",size=14)
    axes[0].tick_params(axis='both', which='major', labelsize=12)
    axes[0].set_ylabel(r"$\theta_1$",size=12)
    axes[0].set_xlabel("Number of iterations",size=12)
    axes[0].xaxis.set_major_locator(xloc)

    axes[1].plot(range(burnin,N),x_MMC[burnin:,0],alpha=0.6)
    axes[1].hlines(2,burnin,N,color='red',linewidth=3,linestyle='--')
    axes[1].set_title("Traceplot for " r"$\theta_1$: Metropolis Monte Carlo",size=14)
    axes[1].tick_params(axis='both', which='major', labelsize=12)
    axes[1].set_ylabel(r"$\theta_1$",size=12)
    axes[1].set_xlabel("Number of iterations",size=12)
    axes[1].xaxis.set_major_locator(xloc)

    axes[2].plot(range(burnin,N),x_HMC[burnin:,0],alpha=0.6)
    axes[2].hlines(2,burnin,N,color='red',linewidth=3,linestyle='--')
    axes[2].set_title("Traceplot for " r"$\theta_1$: Hamiltonian Monte Carlo",size=14)
    axes[2].tick_params(axis='both', which='major', labelsize=12)
    axes[2].set_ylabel(r"$\theta_1$",size=12)
    axes[2].set_xlabel("Number of iterations",size=12)
    axes[2].xaxis.set_major_locator(xloc)
    fig.tight_layout()

    pass
In [162]:
traceplots(x_GIBBS, x_MMC, x_HMC, 100, 1000)
In [163]:
print ("Metropolis ratio acceptance ratio: ", MMC_accept)
print ("Hamiltonian ratio acceptance ratio: ", HMC_accept)
Metropolis ratio acceptance ratio:  0.538
Hamiltonian ratio acceptance ratio:  0.701

Visually, the plots above all converge to the target distribution in less than 50 samples. This is expected since the target distribution is a simple bivariate normal and the parameters are not very correlated, i.e. $\rho$ = 0.2. Also note that the acceptance ratio for Metropolis and HMC are similar, with HMC having slightly higher acceptance. According to Gelman, the optimal acceptance ratio is between 0.5 and 0.6, making these rates satisfactory.

Now we'll perform a similar analysis with more correlated variables:

$$(\theta_1, \theta_2) \sim N(\mu, \Sigma), \:\:\: \mu = \begin{pmatrix} 2\\1\end{pmatrix}, \:\:\: \Sigma=\begin{pmatrix} 1 && 0.8 \\ 0.8 && 1 \end{pmatrix}$$
In [164]:
corr = 0.8
Covariance = np.array([[1,corr],
                       [corr,1]])
x_GIBBS = gibbs_sampler(N, mu, corr)
x_MMC, MMC_accept = metropolis_MC(N, mu, Covariance)
x_HMC, HMC_accept = hamiltonian_monte_carlo(N, leap, delta, Covariance, mu)
In [165]:
convergenceplots(x_GIBBS, x_MMC, x_HMC, 50)
In [166]:
traceplots(x_GIBBS, x_MMC, x_HMC, 100, 1000)
In [167]:
print ("Metropolis ratio acceptance ratio: ", MMC_accept)
print ("Hamiltonian ratio acceptance ratio: ", HMC_accept)
Metropolis ratio acceptance ratio:  0.563
Hamiltonian ratio acceptance ratio:  0.757

When the variables are more highly correlated, Gibbs sampling appears to converge to an incorrect value for $\theta_1$. Metropolis requires more than the first 50 samples to explore the space. Finally, Hamiltonian Monte Carlo covers the space most efficiently. And its traceplot reinforces this.

Lastly, HMC accepts a greater number of samples, for the same set of inputs, than Metropolis. Along with the traceplot showing convergences, this indicates that HMC more efficiently finds the target distribution than Metropolis.

4.1 Hamiltonian parameter tuning

Below we analyze the impact of tuning the Hamiltonian Monte Carlo in 2 ways:

  • The Scaling Factor $\delta$ of the step-size
  • The number of Leapfrog Steps $L$ per iteration

Specifically, we look at the rate of acceptance of the algorithm under various values for $\delta$ and $L$.

In [168]:
deltas = [0.005,0.05,0.1,0.2,0.3,0.5,0.7,1,1.5,2]
leaps = [1,5,10,20,30,50,80]
corr = 0.8
Covariance = np.array([[1,corr],
                       [corr,1]])
In [169]:
def acceptances_HMC(deltas, leaps, Covariance, mu):
    """
    Calculate acceptance rates for various values of deltas and leaps
    
    Parameters
    ----------
    deltas: list of values for delta
    leaps: list of values for leap steps
    Covariance: covariance of bivariate normal
    mu: mean of bivariate to test
    
    Returns
    -------
    List of tuples of acceptance ratios for each delta/leap pair
    """

    acceptance_rates = []
    
    for i, j in product(deltas,leaps):
        acceptance_rates.append((i,j,hamiltonian_monte_carlo(N, j, i, Covariance, mu)[1]))
    
    return acceptance_rates
In [170]:
def create_dataframe(acceptances):
    """
    Pretty dataframe output of the list of tuples which have the acceptance ratios for each delta/leap pair
    
    Parameters
    ----------
    acceptances : List of tuples of acceptance ratios for each delta/leap pair
    
    Returns
    -------
    Pretty looking data frame to analyze results
    """
    
    df = pd.DataFrame.from_records(acceptances)
    
    df = df.pivot(index=0,columns=1,values=2)
    df.index.name = "Delta"
    df.columns.name = "Leaps"
    
    return df
In [171]:
create_dataframe(acceptances_HMC(deltas, leaps, Covariance, mu))
Out[171]:
Leaps 1 5 10 20 30 50 80
Delta
0.005 0.986 0.982 0.954 0.928 0.913 0.868 0.796
0.050 0.937 0.834 0.750 0.769 0.763 0.762 0.899
0.100 0.894 0.735 0.770 0.675 0.747 0.742 0.837
0.200 0.776 0.773 0.668 0.977 0.661 0.670 0.750
0.300 0.727 0.688 0.700 0.643 0.752 0.766 0.694
0.500 0.689 0.678 0.717 0.759 0.638 0.733 0.616
0.700 0.680 0.528 0.553 0.741 0.575 0.566 0.500
1.000 0.099 0.005 0.000 0.000 0.000 0.000 0.000
1.500 0.004 0.000 0.000 0.000 0.000 0.000 0.000
2.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000

From the table above, its clear that when $\delta$ is large, the algorithm doesn't accept many samples, especially for values greater than 1.0. When $\delta$ is very small, the algorithm accepts almost all the samples. Accepting too many samples could be ineffecient and perhaps it won't explore the entire space before the number of iterations has run out. As mentioned above, acceptance rates should be between 0.5 and 0.6. Therefore we should use values for $\delta$ around 0.3.

It appears that the number of leap steps doesn't impact the acceptance ratio too much. As a result, in order to speed up the algorithm run time, we should aim for a smaller number of leap steps, given a $\delta$. As mentioned above, using a rule of thumb of $\delta \times L = 1$ is consistent with an acceptance ratio around 0.5.

5. Empirical Analysis: Linear Regression

We compare different MCMC methods on a larger, real data set. The dataset contains 9568 observations collected in a Turkish Power Plant over 6 years. Features consist of hourly average ambient variables Temperature (T), Ambient Pressure (AP), Relative Humidity (RH) and Exhaust Vacuum (V) to predict the net hourly electrical energy output (EP) of the plant.

For our model, we assume a simple Linear Regression framework, with 4 Predictors (T, AP, RH, V) and 1 Response (EP). We normalize the data in advance, as we are more interested in comparisons across methods than interpretation of coefficients.

$$\text{Model: }y = X\beta + \epsilon$$$$\epsilon \sim N(0, \tau^{-1})$$$$\beta \sim N(0, I)$$$$\tau \sim \text{Gamma}(1, 1)$$
In [172]:
#Read in Power Plant Data and Normalize#
Data = pd.read_csv("MV_Data.csv")
std_scale = preprocessing.StandardScaler().fit(Data)
Data = std_scale.transform(Data)

#Set Predictor Matrix X and Response Vector Y#
X = Data[:,:4]; Y = Data[:,4]

Gibbs Sampling

In Bayesian Linear Regression, the conditional distribution of the parameters under our Priors are:

$$p(\beta | y, X, \tau) \sim N_4(\hat{\mu}, \hat{\Sigma})$$$$\hat{\Sigma} = (I + \tau X^TX)^{-1}$$$$\hat{\mu} = (I + \tau X^TX)^{-1} (\tau X^Ty)$$$$p(\tau | y, X, \beta) \sim \text{Gamma}(1 + \frac{n}{2}, 1 + \frac{1}{2} \displaystyle\sum_{i=1}^{n} (y_i - \beta^T x_i)^2)$$
In [173]:
def Gibbs_LinReg(X, y, niters):
    """
    Gibbs Sampler Algorithm for Bayesian Linear Regression Model
    
    Parameters
    ----------
    X : Matrix of shape n x p containing Predictors
    y : Vector of length n containing Response
    niters : Number of Iterations to Run
        
    Returns
    -------
    Beta : Matrix of Beta Samples after Burn-In
    Tau : Vector of Tau Samples after Burn-In
    """ 
    
    ##Set Priors##
    
    #Prior for Beta#
    b0 = np.zeros(4); E0 = np.eye(4)
    
    #Prior for Tau#
    a = 1; b = 1
    
    #Number of Observations
    n = X.shape[0]
    
    #Create Empty Matrix to store samples for Memory Allocation#
    Beta = np.zeros((niters+1)*4).reshape(niters+1,4)
    Tau = np.zeros(niters+1)

    #Run Sampler for 'niters' Iterations#
    for j in range(niters):
    
        #Calculate Mean and Variance of Beta Conditional Posterior#
        Mean = np.dot(la.inv(la.inv(E0) + np.dot(X.T, X)*Tau[j]),(np.dot(la.inv(E0), b0) + np.dot(X.T, y)*Tau[j]))        
        Var = la.inv(la.inv(E0) + np.dot(X.T, X)*Tau[j])
        
        #Update Beta
        Beta[j+1,:] = np.random.multivariate_normal(Mean, Var)
        
        #Update Tau
        Tau[j+1] = np.random.gamma(a + n/2, (b + (1/2)*np.dot(y - np.dot(X,Beta[j+1,:]).T,(y - np.dot(X, Beta[j+1,:]))))**(-1))
        
    #Set Burn In#
    Burn_In = int(niters*0.1)
    
    return Beta[Burn_In:,:], Tau[Burn_In:]

Metropolis

We follow the same formulation as above in our Metropolis algorithm. In testing, we found that proposing each Beta value individually resulted in more acceptancces and thus more accurate estimations of our parameters. As the Metropolis algorithm involves a slow exploration of the posterior distribution, we Cythonize the Metropolis algorithm to speed it up

Python

In [174]:
def MH_Posterior_Python(Beta1, Beta2, Beta3, Beta4, Tau, Y, X):
    """
    Calculate Posterior Log-PDF For Specific Values of Beta and Tau
    
    Parameters
    ----------
    Beta1 : Coefficient for Temperature (T)
    Beta2 : Coefficient for Ambient Pressure (AP)
    Beta3 : Coefficient for Relative Humidity (RH)
    Beta4 : Coefficient for Exhaust Vaccuum (EV)
    Tau : Coefficient for Precision
    X : Matrix of shape n x p containing Predictors
    y : Vector of length n containing Response
        
    Returns
    -------
    Value of Posterior Log-PDF at specific Beta and Tau values
    """
    
    #Initialize n and Beta Vector#
    n = X.shape[0]
    Beta = np.array([Beta1, Beta2, Beta3, Beta4])
    
    #Calculate Log-Likelihood#
    Log_Lik = (n/2)*np.log(Tau) - (Tau/2) * np.dot(Y - np.dot(X,Beta).T,(Y - np.dot(X, Beta)))
    
    #Initialize Beta and Tau RV Objects#
    Beta_Prior = stats.multivariate_normal(mean = np.array([0,0,0,0]), cov = np.eye(4))
    Tau_Prior = stats.gamma(a = 1)
    
    #Return Log Posterior PDF#
    return(Log_Lik + Beta_Prior.logpdf(Beta) + Tau_Prior.logpdf(Tau))

def Beta1_Step(Beta1, Beta2, Beta3, Beta4, Tau, X, Y):
    """
    Run Metropolis Step for Beta1
    
    Parameters
    ----------
    Beta1 : Coefficient for Temperature (T)
    Beta2 : Coefficient for Ambient Pressure (AP)
    Beta3 : Coefficient for Relative Humidity (RH)
    Beta4 : Coefficient for Exhaust Vaccuum (EV)
    Tau : Coefficient for Precision
    X : Matrix of shape n x p containing Predictors
    y : Vector of length n containing Response
        
    Returns
    -------
    Beta1 vector after accept/reject decision
    """
    Beta1_Star = np.random.normal(loc = Beta1[-1], scale = 0.1)
    rho = MH_Posterior_Python(Beta1_Star, Beta2[-1], Beta3[-1], Beta4[-1], Tau[-1], Y, X) \
        - MH_Posterior_Python(Beta1[-1], Beta2[-1], Beta3[-1], Beta4[-1], Tau[-1], Y, X)
    U = np.random.uniform()
    if np.log(U) < rho:
        return(np.append(Beta1, Beta1_Star))
    else:
        return(Beta1)
    
def Beta2_Step(Beta1, Beta2, Beta3, Beta4, Tau, X, Y):
    """
    Run Metropolis Step for Beta2
    
    Parameters
    ----------
    Beta1 : Coefficient for Temperature (T)
    Beta2 : Coefficient for Ambient Pressure (AP)
    Beta3 : Coefficient for Relative Humidity (RH)
    Beta4 : Coefficient for Exhaust Vaccuum (EV)
    Tau : Coefficient for Precision
    X : Matrix of shape n x p containing Predictors
    y : Vector of length n containing Response
        
    Returns
    -------
    Beta2 vector after accept/reject decision
    """
    Beta2_Star = np.random.normal(loc = Beta2[-1], scale = 0.05)
    rho = MH_Posterior_Python(Beta1[-1], Beta2_Star, Beta3[-1], Beta4[-1], Tau[-1], Y, X) \
        - MH_Posterior_Python(Beta1[-1], Beta2[-1], Beta3[-1], Beta4[-1], Tau[-1], Y, X)
    U = np.random.uniform()
    if np.log(U) < rho:
        return(np.append(Beta2, Beta2_Star))
    else:
        return(Beta2)
    
def Beta3_Step(Beta1, Beta2, Beta3, Beta4, Tau, X, Y):
    """
    Run Metropolis Step for Beta3
    
    Parameters
    ----------
    Beta1 : Coefficient for Temperature (T)
    Beta2 : Coefficient for Ambient Pressure (AP)
    Beta3 : Coefficient for Relative Humidity (RH)
    Beta4 : Coefficient for Exhaust Vaccuum (EV)
    Tau : Coefficient for Precision
    X : Matrix of shape n x p containing Predictors
    y : Vector of length n containing Response
        
    Returns
    -------
    Beta3 vector after accept/reject decision
    """
    Beta3_Star = np.random.normal(loc = Beta3[-1], scale = 0.01)
    rho = MH_Posterior_Python(Beta1[-1], Beta2[-1], Beta3_Star, Beta4[-1], Tau[-1], Y, X) \
        - MH_Posterior_Python(Beta1[-1], Beta2[-1], Beta3[-1], Beta4[-1], Tau[-1], Y, X)
    U = np.random.uniform()
    if np.log(U) < rho:
        return(np.append(Beta3, Beta3_Star))
    else:
        return(Beta3)
    
def Beta4_Step(Beta1, Beta2, Beta3, Beta4, Tau, X, Y):
    """
    Run Metropolis Step for Beta4
    
    Parameters
    ----------
    Beta1 : Coefficient for Temperature (T)
    Beta2 : Coefficient for Ambient Pressure (AP)
    Beta3 : Coefficient for Relative Humidity (RH)
    Beta4 : Coefficient for Exhaust Vaccuum (EV)
    Tau : Coefficient for Precision
    X : Matrix of shape n x p containing Predictors
    y : Vector of length n containing Response
        
    Returns
    -------
    Beta4 vector after accept/reject decision
    """
    Beta4_Star = np.random.normal(loc = Beta4[-1], scale = 0.03)
    rho = MH_Posterior_Python(Beta1[-1], Beta2[-1], Beta3[-1], Beta4_Star, Tau[-1], Y, X) \
            - MH_Posterior_Python(Beta1[-1], Beta2[-1], Beta3[-1], Beta4[-1], Tau[-1], Y, X)
    U = np.random.uniform()
    if np.log(U) < rho:
        return(np.append(Beta4, Beta4_Star))
    else:
        return(Beta4)
    
def Tau_Step(Beta1, Beta2, Beta3, Beta4, Tau, X, Y):
    """
    Run Metropolis Step for Tau
    
    Parameters
    ----------
    Beta1 : Coefficient for Temperature (T)
    Beta2 : Coefficient for Ambient Pressure (AP)
    Beta3 : Coefficient for Relative Humidity (RH)
    Beta4 : Coefficient for Exhaust Vaccuum (EV)
    Tau : Coefficient for Precision
    X : Matrix of shape n x p containing Predictors
    y : Vector of length n containing Response
        
    Returns
    -------
    Tau vector after accept/reject decision
    """
    Tau_Star = np.random.normal(loc = Tau[-1], scale = 1)
    rho = MH_Posterior_Python(Beta1[-1], Beta2[-1], Beta3[-1], Beta4[-1], Tau_Star, Y, X) \
            - MH_Posterior_Python(Beta1[-1], Beta2[-1], Beta3[-1], Beta4[-1], Tau[-1], Y, X)
    U = np.random.uniform()
    if np.log(U) < rho:
        return(np.append(Tau, Tau_Star))
    else:
        return(Tau)

def MH_LinReg_Python(X, Y, niters):
    """
    Metropolis algorithm for Bayesian Linear Regression
    
    Parameters
    ----------
    X : Matrix of shape n x p containing Predictors
    y : Vector of length n containing Response
    niters : Number of Iterations to Run
        
    Returns
    -------
    Beta : Matrix of Beta Samples after Burn-In
    Tau : Vector of Tau Samples after Burn-In
    """
    
    #Initialize Beta and Tau Vectors Individually
    Beta1 = np.array([0]); Beta2 = np.array([0]); Beta3 = np.array([0]); Beta4 = np.array([0])
    Tau = np.array([1])
    
    #Run sampler for 'niters' iterations#
    for T in range(niters):
        
        ##Propose Values, Calculate Log-Rho Value, and Accept/Reject##
        
        Beta1 = Beta1_Step(Beta1, Beta2, Beta3, Beta4, Tau, X, Y)
        Beta2 = Beta2_Step(Beta1, Beta2, Beta3, Beta4, Tau, X, Y)
        Beta3 = Beta3_Step(Beta1, Beta2, Beta3, Beta4, Tau, X, Y)
        Beta4 = Beta4_Step(Beta1, Beta2, Beta3, Beta4, Tau, X, Y)
        Tau = Tau_Step(Beta1, Beta2, Beta3, Beta4, Tau, X, Y)

        
        # Beta1 #
        Beta1_Star = np.random.normal(loc = Beta1[-1], scale = 0.1)
        rho = MH_Posterior_Python(Beta1_Star, Beta2[-1], Beta3[-1], Beta4[-1], Tau[-1], Y, X) - \
            MH_Posterior_Python(Beta1[-1], Beta2[-1], Beta3[-1], Beta4[-1], Tau[-1], Y, X)
        U = np.random.uniform()
        if np.log(U) < rho:
            Beta1 = np.append(Beta1, Beta1_Star)
            
        # Beta2 #
        Beta2_Star = np.random.normal(loc = Beta2[-1], scale = 0.05)
        rho = MH_Posterior_Python(Beta1[-1], Beta2_Star, Beta3[-1], Beta4[-1], Tau[-1], Y, X) - \
            MH_Posterior_Python(Beta1[-1], Beta2[-1], Beta3[-1], Beta4[-1], Tau[-1], Y, X)
        U = np.random.uniform()
        if np.log(U) < rho:
            Beta2 = np.append(Beta2, Beta2_Star)
            
        # Beta3 #
        Beta3_Star = np.random.normal(loc = Beta3[-1], scale = 0.01)
        rho = MH_Posterior_Python(Beta1[-1], Beta2[-1], Beta3_Star, Beta4[-1], Tau[-1], Y, X) - \
                MH_Posterior_Python(Beta1[-1], Beta2[-1], Beta3[-1], Beta4[-1], Tau[-1], Y, X)
        U = np.random.uniform()
        if np.log(U) < rho:
            Beta3 = np.append(Beta3, Beta3_Star)
            
        # Beta4 #
        Beta4_Star = np.random.normal(loc = Beta4[-1], scale = 0.03)
        rho = MH_Posterior_Python(Beta1[-1], Beta2[-1], Beta3[-1], Beta4_Star, Tau[-1], Y, X) - \
            MH_Posterior_Python(Beta1[-1], Beta2[-1], Beta3[-1], Beta4[-1], Tau[-1], Y, X)
        U = np.random.uniform()
        if np.log(U) < rho:
            Beta4 = np.append(Beta4, Beta4_Star)
        
        # Tau #
        Tau_Star = np.random.normal(loc = Tau[-1], scale = 1)
        rho = MH_Posterior_Python(Beta1[-1], Beta2[-1], Beta3[-1], Beta4[-1], Tau_Star, Y, X) - \
            MH_Posterior_Python(Beta1[-1], Beta2[-1], Beta3[-1], Beta4[-1], Tau[-1], Y, X)
        U = np.random.uniform()
        if np.log(U) < rho:
            Tau = np.append(Tau, Tau_Star)
    
        
    return np.array([Beta1, Beta2, Beta3, Beta4]), Tau

Cython

In [175]:
%%cython -lgsl

#Import Necessary Python Modules#
import numpy as np
cimport numpy as np
import cython
from cython_gsl cimport *
from libc.math cimport sqrt, log, cos

cdef float pi = 3.14159265

cdef gsl_rng_type * T
cdef gsl_rng * r
gsl_rng_env_setup()
T = gsl_rng_default
r = gsl_rng_alloc (T)

cdef double MH_Posterior(double Beta1, double Beta2, double Beta3, double Beta4, \
                         double Tau, double[:] Y, double[:,:] X):
    """
    Calculate Posterior Log-PDF For Specific Values of Beta and Tau
    
    Parameters
    ----------
    Beta1 : Coefficient for Temperature (T)
    Beta2 : Coefficient for Ambient Pressure (AP)
    Beta3 : Coefficient for Relative Humidity (RH)
    Beta4 : Coefficient for Exhaust Vaccuum (EV)
    Tau : Coefficient for Precision
    X : Matrix of shape n x p containing Predictors
    y : Vector of length n containing Response
        
    Returns
    -------
    Value of Posterior Log-PDF at specific Beta and Tau values
    """
    
    #Initialize n and Beta Vector#
    cdef int n = X.shape[0]
    cdef double[:] Beta = np.array([Beta1, Beta2, Beta3, Beta4])

    #Calculate Log-Likelihood#
    cdef double Log_Lik = (n/2)*log(Tau) - (Tau/2) * np.dot(Y - np.dot(X,Beta).T,(Y - np.dot(X, Beta)))
    
    cdef double b1 = log(1/sqrt(2*pi)) - (1/2)*(Beta1)**2
    cdef double b2 = log(1/sqrt(2*pi)) - (1/2)*(Beta2)**2
    cdef double b3 = log(1/sqrt(2*pi)) - (1/2)*(Beta3)**2
    cdef double b4 = log(1/sqrt(2*pi)) - (1/2)*(Beta4)**2
    
    #Return Log Posterior PDF#
    return(Log_Lik + b1 + b2 + b3 + b4 + Tau)
    
cdef double Box_Muller():
    """
    Box Muller Transform to sample Standard Random Normal variables
        
    Returns
    -------
    Standard Random Normal Sample
    """
    cdef double U1 = gsl_rng_uniform(r)
    cdef double U2 = gsl_rng_uniform(r)
    return(sqrt(-2 * log(U1)) * cos(2*pi*U2))

cdef double[:] Beta1_Step(double[:] Beta1, double[:] Beta2, double[:] Beta3, double[:] Beta4, \
                          double[:] Tau, double[:,:] X, double[:] Y):
    """
    Run Metropolis Step for Beta1
    
    Parameters
    ----------
    Beta1 : Coefficient for Temperature (T)
    Beta2 : Coefficient for Ambient Pressure (AP)
    Beta3 : Coefficient for Relative Humidity (RH)
    Beta4 : Coefficient for Exhaust Vaccuum (EV)
    Tau : Coefficient for Precision
    X : Matrix of shape n x p containing Predictors
    y : Vector of length n containing Response
        
    Returns
    -------
    Beta1 vector after accept/reject decision
    """
    cdef double Beta1_Star = Beta1[-1] + sqrt(.1)*Box_Muller()
    cdef double rho = MH_Posterior(Beta1_Star, Beta2[-1], Beta3[-1], Beta4[-1], Tau[-1], Y, X) \
                    - MH_Posterior(Beta1[-1], Beta2[-1], Beta3[-1], Beta4[-1], Tau[-1], Y, X)
    cdef double U = gsl_rng_uniform(r)
    if log(U) < rho:
        return(np.append(Beta1, Beta1_Star))
    else:
        return(Beta1)
    
cdef double[:] Beta2_Step(double[:] Beta1, double[:] Beta2, double[:] Beta3, double[:] Beta4, \
                          double[:] Tau, double[:,:] X, double[:] Y):
    """
    Run Metropolis Step for Beta2
    
    Parameters
    ----------
    Beta1 : Coefficient for Temperature (T)
    Beta2 : Coefficient for Ambient Pressure (AP)
    Beta3 : Coefficient for Relative Humidity (RH)
    Beta4 : Coefficient for Exhaust Vaccuum (EV)
    Tau : Coefficient for Precision
    X : Matrix of shape n x p containing Predictors
    y : Vector of length n containing Response
        
    Returns
    -------
    Beta2 vector after accept/reject decision
    """
    cdef double Beta2_Star = Beta2[-1] + sqrt(0.05)*Box_Muller()
    cdef double rho = MH_Posterior(Beta1[-1], Beta2_Star, Beta3[-1], Beta4[-1], Tau[-1], Y, X) \
                    - MH_Posterior(Beta1[-1], Beta2[-1], Beta3[-1], Beta4[-1], Tau[-1], Y, X)
    cdef double U = gsl_rng_uniform(r)
    if log(U) < rho:
        return(np.append(Beta2, Beta2_Star))
    else:
        return(Beta2)
    
cdef double[:] Beta3_Step(double[:] Beta1, double[:] Beta2, double[:] Beta3, double[:] Beta4, \
                          double[:] Tau, double[:,:] X, double[:] Y):
    """
    Run Metropolis Step for Beta3
    
    Parameters
    ----------
    Beta1 : Coefficient for Temperature (T)
    Beta2 : Coefficient for Ambient Pressure (AP)
    Beta3 : Coefficient for Relative Humidity (RH)
    Beta4 : Coefficient for Exhaust Vaccuum (EV)
    Tau : Coefficient for Precision
    X : Matrix of shape n x p containing Predictors
    y : Vector of length n containing Response
        
    Returns
    -------
    Beta3 vector after accept/reject decision
    """ 
    cdef double Beta3_Star = Beta3[-1] + sqrt(0.01)*Box_Muller()
    cdef double rho = MH_Posterior(Beta1[-1], Beta2[-1], Beta3_Star, Beta4[-1], Tau[-1], Y, X) \
                    - MH_Posterior(Beta1[-1], Beta2[-1], Beta3[-1], Beta4[-1], Tau[-1], Y, X)
    cdef double U = gsl_rng_uniform(r)
    if log(U) < rho:
        return(np.append(Beta3, Beta3_Star))
    else:
        return(Beta3)
    
cdef double[:] Beta4_Step(double[:] Beta1, double[:] Beta2, double[:] Beta3, double[:] Beta4, \
                          double[:] Tau, double[:,:] X, double[:] Y):
    """
    Run Metropolis Step for Beta4
    
    Parameters
    ----------
    Beta1 : Coefficient for Temperature (T)
    Beta2 : Coefficient for Ambient Pressure (AP)
    Beta3 : Coefficient for Relative Humidity (RH)
    Beta4 : Coefficient for Exhaust Vaccuum (EV)
    Tau : Coefficient for Precision
    X : Matrix of shape n x p containing Predictors
    y : Vector of length n containing Response
        
    Returns
    -------
    Beta4 vector after accept/reject decision
    """    
    cdef double Beta4_Star = Beta4[-1] + sqrt(0.03)*Box_Muller()
    cdef double rho = MH_Posterior(Beta1[-1], Beta2[-1], Beta3[-1], Beta4_Star, Tau[-1], Y, X) \
                    - MH_Posterior(Beta1[-1], Beta2[-1], Beta3[-1], Beta4[-1], Tau[-1], Y, X)
    cdef double U = gsl_rng_uniform(r)
    if log(U) < rho:
        return(np.append(Beta4, Beta4_Star))
    else:
        return(Beta4)
    
cdef double[:] Tau_Step(double[:] Beta1, double[:] Beta2, double[:] Beta3, double[:] Beta4, \
                        double[:] Tau, double[:,:] X, double[:] Y):
    """
    Run Metropolis Step for Tau
    
    Parameters
    ----------
    Beta1 : Coefficient for Temperature (T)
    Beta2 : Coefficient for Ambient Pressure (AP)
    Beta3 : Coefficient for Relative Humidity (RH)
    Beta4 : Coefficient for Exhaust Vaccuum (EV)
    Tau : Coefficient for Precision
    X : Matrix of shape n x p containing Predictors
    y : Vector of length n containing Response
        
    Returns
    -------
    Tau vector after accept/reject decision
    """
    cdef double Tau_Star = Tau[-1] + Box_Muller()
    cdef double rho = MH_Posterior(Beta1[-1], Beta2[-1], Beta3[-1], Beta4[-1], Tau_Star, Y, X) \
                    - MH_Posterior(Beta1[-1], Beta2[-1], Beta3[-1], Beta4[-1], Tau[-1], Y, X)
    cdef double U = gsl_rng_uniform(r)
    if log(U) < rho:
        return(np.append(Tau, Tau_Star))
    else:
        return(Tau)

@cython.boundscheck(False)
def MH_LinReg_Cython(double[:,:] X, double[:] Y, int niters):
    """
    Metropolis algorithm for Bayesian Linear Regression
    
    Parameters
    ----------
    X : Matrix of shape n x p containing Predictors
    y : Vector of length n containing Response
    niters : Number of Iterations to Run
        
    Returns
    -------
    Beta : Matrix of Beta Samples after Burn-In
    Tau : Vector of Tau Samples after Burn-In
    """
    #Initialize Beta and Tau Vectors Individually

    Beta1 = np.array([0.0]); Beta2 = np.array([0.0]); Beta3 = np.array([0.0]); Beta4 = np.array([0.0])
    Tau = np.array([1.0])
    
    #Run sampler for 'niters' iterations#
    cdef int T

    for T in range(niters):
        
        ##Propose Values, Calculate Log-Rho Value, and Accept/Reject##
        Beta1 = Beta1_Step(Beta1, Beta2, Beta3, Beta4, Tau, X, Y)
        
        Beta2 = Beta2_Step(Beta1, Beta2, Beta3, Beta4, Tau, X, Y)

        Beta3 = Beta3_Step(Beta1, Beta2, Beta3, Beta4, Tau, X, Y)
  
        Beta4 = Beta4_Step(Beta1, Beta2, Beta3, Beta4, Tau, X, Y)
    
        Tau = Tau_Step(Beta1, Beta2, Beta3, Beta4, Tau, X, Y)
     
    #return np.array([Beta1, Beta2, Beta3, Beta4]), Tau
    return([np.array([Beta1]), np.array([Beta2]), np.array([Beta3]), np.array([Beta4])], 
           np.array([Tau]))
In [176]:
%timeit MH_LinReg_Python(X, Y, 10)
1 loop, best of 3: 240 ms per loop
In [177]:
%timeit MH_LinReg_Cython(X, Y, 10)
100 loops, best of 3: 8.43 ms per loop

Speed

Above we include the Metropolis MCMC Sampler in Python, as well as the more efficient and faster version in Cython. Some steps we took to improve the speed in Cython:

  1. Use GSL Random Number Functions for random number generation
  2. Use Box-Muller Transform to generate random value from Standard Normal Distribution
    • Given $U_1$ and $U_2$ Independent Random Variables Drawn from Unif(0,1)
    • Z = $\sqrt{-2\text{ln}U_1}\text{cos}(2\pi U_2)$ is a random draw from the Standard Normal Distribution
  3. Eliminate the use of scipy.stats package and calculate Posterior manually
  4. Calculate Log, Exp, and Sin from C Math Package rather than NumPy

Performing these changes resulted in a 30X Speedup of our Metropolis code, which is phenomenal. We could have achieved even greater speedup by pre-allocating array space and using memory typed arrays to not rely on NumPy arrays

Hamiltonian Monte Carlo

For our Hybrid Monte Carlo Algorithm, we need 2 pieces of information from our Linear Regression Model. For ease of computation, we will use the Logarithm of both results below, and set $\theta = \text{log}(\tau)$

  1. Posterior Distribution
    • The 'Potential Energy' Function that we evaluate within our acceptance criterion
    • $p(\beta, \theta|X, y) = (\frac{n}{2} + 1)\theta - e^{\theta}(\frac{1}{2}(y - X\beta)^T(y - X\beta) + 1) - \frac{1}{2}\beta^T \beta$
  2. Gradient of Posterior Distribution
    • Simulating Hamiltonian Dynamics, to simulate the subsequent Momentum and Parameter Variables
    • $\nabla L(\beta, \theta|X, y) = \begin{bmatrix} e^{\theta}X^T(y - X\beta) - \beta \\ \frac{n}{2} + 1 - e^{\theta}(\frac{1}{2}(y - X\beta)^T(y - X\beta) + 1) \end{bmatrix}$

Given the complexity of writing a an HMC algorithm for a complex Linear Regression model, we opted to not Cythonize this code. As such, it does take quite a while to run for 10,000 iterations, but it is important to know that even with 100 iterations, the algorithm converges to the true value very quickly.

In [178]:
def unif(delta, low=.85, high=1.15):
    """
    Function to adaptively vary step size
    
    Parameters
    ----------
    delta : Current Step Size
    low : Lower bound of Uniform Random Sample
    high : Upper Bound of Uniform Random Sample
    
    Returns
    -------
    Updated Step Size between 0.85 and 1.15 times current Step Size
    
    """
    return np.random.uniform(low, high) * delta
    
def log_grad(X, Y, Beta, Tau):
    """
    Evaluates the Log-Gradient of the Posterior Distribution for Beta and Tau
    
    Parameters
    ----------
    X : Matrix of shape n x p containing Predictors
    Y : Vector of length n containing Response
    Beta : Vector of Current Beta Values
    Tau : Vector of Current Tau Value
    
    Returns
    -------
    Gradient vector of length p+1
    
    """
    ##Set Priors##
    
    #Prior for Beta#
    b0 = np.zeros(4); E0 = np.eye(4)
    
    #Prior for Tau#
    a = 1; b = 1
    
    #Calculate number of observations#
    n = X.shape[0]
    
    dBeta = np.exp(Tau) * np.dot(X.T,(Y - np.dot(X,Beta))) - np.dot(la.inv(E0),(Beta - b0))
    dTau = n/2 + a - np.exp(Tau)*((1/2)*np.dot((Y - np.dot(X,Beta)).T,(Y - np.dot(X,Beta))) + b)
    return np.append(dBeta, dTau)
    
def potential_energy(X, Y, Beta, Tau):
    """
    Evaluates the Log of the Posterior Distribution for Beta and Tau
    
    Parameters
    ----------
    X : Matrix of shape n x p containing Predictors
    Y : Vector of length n containing Response
    Beta : Vector of Current Beta Values
    Tau : Vector of Current Tau Value
    
    Returns
    -------
    The Log of the Posterior Distribution under inputs
    
    """
    ##Set Priors##
    
    #Prior for Beta#
    b0 = np.zeros(4); E0 = np.eye(4)
    
    #Prior for Tau#
    a = 1; b = 1
    
    #Calculate number of observations#
    n = X.shape[0]
    
    return((n/2 + a)*Tau \
           - np.exp(Tau)*((1/2)*np.dot(Y - np.dot(X,Beta).T,(Y - np.dot(X, Beta))) + b) \
           - (1/2)*np.dot(Beta.T,np.dot(la.inv(E0),Beta)))
        
def kinetic_energy(p):
    """
    Evaluates the PDF for the Momentum Variables (Multivariate Normal). This is used in accept/reject step
    
    Parameters
    ----------
    p : Vector of Momentum Values to calculate the PDF for
    
    Returns
    -------
    Single value containing the PDF for a MVN as the p vetor values
    
    """
    mvn = stats.multivariate_normal()
    return (np.sum(mvn.logpdf(p)))

def HMC_LinReg(X, Y, niters):
    """
    Hamiltonian Monte Carlo algorithm for Bayesian Linear Regression
    
    Parameters
    ----------
    X : Matrix of shape n x p containing Predictors
    y : Vector of length n containing Response
    niters : Number of Iterations to Run
        
    Returns
    -------
    Beta : Matrix of Beta Samples after Burn-In
    Tau : Vector of Tau Samples after Burn-In
    """
    
    #Calculate number of observations#
    n = X.shape[0]
    
    #Set Step-Size (Delta) And Number of Leaps#
    delta = 0.25/np.sqrt(n)
    leap = int(1/delta)
    
    #Initialize Array to Hold Beta and Tau Accepted Proposals
    Vars = np.array([0,0,0,0,0]).reshape(1,5)
    
    #Run sampler for 'niters' iterations
    for t in range(niters):
        
        #Choose step size adaptively at each iteration#
        delta = unif(delta)
                    
        #Sample Random Momentum#
        p0 = np.random.randn(5)
            
        ##Simulate Hamiltonian Dynamics##
        
        #First 1/2 Step of Momentum
        p_star = p0 - (1/2)*delta*-log_grad(X, Y, Vars[-1,:4], Vars[-1,4])
        
        #Set Proposed Variable Values to previous accepted Value#
        Vars_Star = Vars[-1,:]
 
        #Full Steps of Momentum and Variable#
        for i in range(leap):
            
            #First full step of position/sample
            Vars_Star = Vars_Star + delta*p_star
            
            if i != leap-1:
            
                #Update Momentum on all but last leap step#
                p_star = p_star - delta*-log_grad(X, Y, Vars_Star[:4], Vars_Star[4])
        
        #Final 1/2 Step of Momentum#
        p_star = p_star - delta/2*-log_grad(X, Y, Vars_Star[:4], Vars_Star[4])
        
        #Negate p_star to conserve energy#
        p_star = -p_star
        
        #Potential Energy at Proposed and Previous Variables#
        U0 = potential_energy(X, Y, Vars[-1,:4], Vars[-1,4])
        U_star = potential_energy(X, Y, Vars_Star[:4], Vars_Star[4])
    
        #Kinetic Energy for Proposed and Previous Momentum#
        K0 = kinetic_energy(p0)
        K_star = kinetic_energy(p_star)
        
        #Log Acceptance Ratio#
        alpha = U_star + K_star - (U0 + K0)
        
        #Accept/Reject Condition
        if np.log(np.random.random()) < alpha:
            Vars = np.vstack([Vars, Vars_Star])

    return Vars[:,:4], np.exp(Vars[:,4])
In [179]:
Beta_G, Tau_G = Gibbs_LinReg(X, Y, 10000)
Beta_M, Tau_M = MH_LinReg_Cython(X, Y, 10000)
Beta_HMC, Tau_HMC = HMC_LinReg(X, Y, 10000)
OLS = np.dot(la.inv(np.dot(X.T, X)), np.dot(X.T, Y))

Comparisons

$$\text{Parameter Estimates}$$
In [180]:
OLS_Beta = np.dot(la.inv(np.dot(X.T, X)), np.dot(X.T, Y))
OLS = np.array([np.round(OLS_Beta[0], 4), np.round(OLS_Beta[1], 4), 
                np.round(OLS_Beta[2], 4), np.round(OLS_Beta[3], 4),"---"])
B_G = np.mean(Beta_G, axis = 0)
G = np.array([B_G[0], B_G[1], B_G[2], B_G[3], np.mean(Tau_G)])
M = np.array([np.mean(Beta_M[0]), np.mean(Beta_M[1]), np.mean(Beta_M[2]), 
              np.mean(Beta_M[3]), np.mean(Tau_M)])
B_HMC = np.mean(Beta_HMC, axis = 0)
HMC = np.array([B_HMC[0], B_HMC[1], B_HMC[2], B_HMC[3], np.mean(Tau_HMC)])
pd.DataFrame(np.vstack([OLS,np.around(G, 4), np.around(M, 4), np.around(HMC, 4)]), 
             columns = ["Beta1", "Beta2", "Beta3", "Beta4", "Tau"], 
             index = ["OLS", "Gibbs", "Metropolis", "HMC"])
Out[180]:
Beta1 Beta2 Beta3 Beta4 Tau
OLS -0.8635 -0.1742 0.0216 -0.1352 ---
Gibbs -0.8633 -0.1743 0.0216 -0.1351 13.9798
Metropolis -0.8344 -0.1871 0.0242 -0.129 14.008
HMC -0.87 -0.1716 0.0157 -0.1377 13.8447

We can see that all MCMC Methods and OLS provide very accurate estimates for the $\beta$ and $\tau$ coefficients. Across all the methods, it is interesting to look at the variance of the estimates as well as the number of iterations until convergence.

$$\text{Variances}$$
In [181]:
B_G = np.var(Beta_G, axis = 0)
G = np.array([B_G[0], B_G[1], B_G[2], B_G[3], np.var(Tau_G)])
M = np.array([np.var(Beta_M[0]), np.var(Beta_M[1]), np.var(Beta_M[2]), 
              np.var(Beta_M[3]), np.var(Tau_M)])
B_HMC = np.var(Beta_HMC, axis = 0)
HMC = np.array([B_HMC[0], B_HMC[1], B_HMC[2], B_HMC[3], np.var(Tau_HMC)])
pd.DataFrame(np.vstack([G, M, HMC]), columns = ["Beta1", "Beta2", "Beta3", "Beta4", "Tau"], 
             index = ["Gibbs", "Metropolis", "HMC"])
Out[181]:
Beta1 Beta2 Beta3 Beta4 Tau
Gibbs 0.000045 0.000030 0.000011 0.000013 0.041130
Metropolis 0.012018 0.002303 0.000157 0.000825 0.538395
HMC 0.000186 0.000020 0.000014 0.000010 0.167809

We define convergence here as the Iteration at which the difference between the accepted parameter sample and the previous sample is less than $10^{-5}$. Note that this metric is purely a heuristic and will be used to demonstrate the efficieny of Hamiltonian Monte Carlo vs the other methods.

$$\text{Number of Iterations until Convergence (As Percentage of Total Accepted Proposals)}$$

As an example, if the Trace Plot for your Parameter contained 1000 Accepted Proposals, but you "converged" in 100, this result will be $\frac{100}{1000} = 0.1$

In [182]:
#Gibbs
def Num_Iters_Gibbs(P, tol=10**(-5)):
    """
    Evaluates the Number of Iterations (As a Percentage) Until Convergence. In Gibbs case, include Burn-In
    
    Parameters
    ----------
    P : Trace Plot of Parameter to Evaluate
    tol : Tolerance to evaluate convergence to 
    
    Returns
    -------
    Percentage as number of iterations / total accepted. 
    
    """
    for i in range(1,len(P)):
        if abs(P[i] - P[i-1]) < tol:
            return((i+1000)/(len(P)+1000))

#HMC and Metropolis
def Num_Iters_HMC_M(P, tol=10**(-5)):
    """
    Evaluates the Number of Iterations (As a Percentage) Until Convergence
    
    Parameters
    ----------
    P : Trace Plot of Parameter to Evaluate
    tol : Tolerance to evaluate convergence to 
    
    Returns
    -------
    Percentage as number of iterations / total accepted. 
    
    """
    for i in range(1,len(P)):
        if abs(P[i] - P[i-1]) < tol:
            return(i/(len(P)))
In [183]:
G = np.array([Num_Iters_Gibbs(Beta_G[:,0]), Num_Iters_Gibbs(Beta_G[:,1]), 
              Num_Iters_Gibbs(Beta_G[:,2]), Num_Iters_Gibbs(Beta_G[:,3]), Num_Iters_Gibbs(Tau_G)])
M = np.array([Num_Iters_HMC_M(Beta_M[0]), Num_Iters_HMC_M(Beta_M[1]), Num_Iters_HMC_M(Beta_M[2]), 
              Num_Iters_HMC_M(Beta_M[3]), Num_Iters_HMC_M(Tau_M)])
HMC = np.array([Num_Iters_HMC_M(Beta_HMC[:,0]), Num_Iters_HMC_M(Beta_HMC[:,1]), 
                Num_Iters_HMC_M(Beta_HMC[:,2]), Num_Iters_HMC_M(Beta_HMC[:,3]), Num_Iters_HMC_M(Tau_HMC)])
pd.DataFrame(np.vstack([G, M, HMC]), columns = ["Beta1", "Beta2", "Beta3", "Beta4", "Tau"], 
             index = ["Gibbs", "Metropolis", "HMC"])
Out[183]:
Beta1 Beta2 Beta3 Beta4 Tau
Gibbs 0.188281 0.194081 0.138886 0.173183 None
Metropolis None None None None None
HMC 0.103583 0.0123098 0.0643515 0.0910729 0.13731

We can see that Gibbs Sampling converges well with low variances of the parameter estimates, but it takes quite a number of iterations to converge. This result is likely due to the 'Burn In' period needed for the Gibbs algorithm.

Our Metropolis Sampler converged to close to the true value, but we can see that in a few cases, it never truly converged with respect to our criterion. The main flaw of the Metropolis algorithm is its 'random walk' behavior, and we can see that with the exceptionally high variances of our accepted parameter estimates.

The Hamiltonian Monte Carlo algorithm balanced both a low variance in the parameter estimates and a low number of iterations until convergence. In a separate test of just 100 Iterations, the HMC algorithm is able to quickly find the appropriate parameter estimates.

6. Conclusion

This paper has explored the Hamiltonian Monte Carlo algorithm in great detail, comparing its efficiency and behavior with that of other MCMC methods, specifically Gibbs and Metropolis. Given the quick convergence and success under highly correlated parameters, it is no surprise that this algorithm is becoming increasingly popular in the Statistics and Machine Learning community. Several variations have been proposed, including No U-Turn Sampling (NUTS), Langevin HMC, Riemannian Manifold HMC, Stochastic Gradient HMC, and Spherical HMC, to name a few. We hope this write-up will aid future scholars in understanding the theoretical basis and empirical effectiveness of this algorithm, in the hope that future discoveries may be made to progress Bayesian methodologies.

7. References

In [ ]: