Predicting European Soccer Match Results
Predicting the outcome of sporting competitions is a well-studied subject, of interest to gamblers and sports fans alike. Given a Kaggle dataset of competing European soccer teams, we set out to build a model that correctly predicts the winner of head-to-head matchups. The dataset contains match-level results and team-specific attributes spanning eight years, from which we estimate the relative strength of teams using a Bradley-Terry model.
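The Bradley-Terry model assigns each team a latent strength p_i such that P(team i beats team j) = p_i / (p_i + p_j). A minimal sketch of the classic minorization-maximization (Zermelo) fitting procedure, using an invented toy win matrix rather than the actual Kaggle data:

```python
def fit_bradley_terry(wins, n_teams, n_iter=200):
    """Fit Bradley-Terry strengths with the classic MM update
    (Zermelo's algorithm). wins[i][j] = times team i beat team j.
    Draws, common in soccer, are ignored in this sketch."""
    p = [1.0] * n_teams
    for _ in range(n_iter):
        new_p = []
        for i in range(n_teams):
            w_i = sum(wins[i])  # total wins by team i
            denom = sum((wins[i][j] + wins[j][i]) / (p[i] + p[j])
                        for j in range(n_teams) if j != i)
            new_p.append(w_i / denom if denom > 0 else p[i])
        s = sum(new_p)
        p = [x / s for x in new_p]  # scale is arbitrary; normalize
    return p

def win_prob(p, i, j):
    """Model probability that team i beats team j."""
    return p[i] / (p[i] + p[j])

# Toy example (invented counts, not the Kaggle data):
# team 0 dominates, team 2 is weakest.
strengths = fit_bradley_terry([[0, 3, 4], [1, 0, 3], [0, 1, 0]], 3)
```

The MM update converges whenever the "who beat whom" graph is strongly connected, which holds over a full season of league play.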
The Effect of Racial Heterogeneity on Graduation Rates
We assess whether a statistic designed to measure the holistic level of racial diversity in a school district has a statistically significant effect on that district's graduation rate. The analysis uses data from the Data for Diplomas project, an amalgamation of data from various sources, including the 2010 US Census and the American Community Survey.
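The abstract does not name the diversity statistic; one common "holistic" measure is Simpson's diversity index, the probability that two randomly chosen students belong to different racial groups. A hedged sketch of that assumed choice:

```python
def simpson_diversity(counts):
    """Simpson's diversity index: probability that two randomly
    chosen students belong to different racial groups.
    counts: number of students in each racial group.
    (An assumed stand-in; the project may use a different statistic.)"""
    total = sum(counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts)
```

The index is 0 for a completely homogeneous district and approaches 1 as enrollment spreads evenly across more groups, which makes it a natural single regressor for graduation rate.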
Mapping NYC Boroughs with Supervised Learning
This project explores a larger data set from the New York City Open Data repository: public safety violations across the entire city from 2007 to 2015, a 5 GB dataset containing over 10 million records. After geocoding the location of each observation, we build a boundary map outlining the five New York City boroughs using a supervised learning algorithm. We also explore a question of our own interest and create a novel visualization of the data.
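One simple supervised approach to recovering borough boundaries is k-nearest neighbors: label each geocoded violation with its borough, then classify any query coordinate by majority vote among its nearest labeled points. A sketch with invented coordinates (the abstract does not specify the algorithm, so this is illustrative only):

```python
import math
from collections import Counter

def knn_borough(train, point, k=5):
    """Classify a (lat, lon) point into a borough by majority vote
    among its k nearest labeled neighbors. Plain Euclidean distance
    in degrees is a reasonable shortcut at city scale."""
    nearest = sorted(
        train,
        key=lambda t: math.hypot(t[0] - point[0], t[1] - point[1]))[:k]
    votes = Counter(label for _, _, label in nearest)
    return votes.most_common(1)[0][0]

# Invented training points; real labels would come from the violations data.
train = [
    (40.85, -73.87, "Bronx"), (40.86, -73.88, "Bronx"), (40.84, -73.86, "Bronx"),
    (40.65, -73.95, "Brooklyn"), (40.66, -73.96, "Brooklyn"), (40.64, -73.94, "Brooklyn"),
]
```

Evaluating the classifier on a dense grid of points and coloring each cell by its predicted label yields exactly the kind of borough boundary map described above.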
Exploring the Multi-Armed Bandit Problem
This paper explores the multi-armed bandit (MAB) problem. A multi-armed bandit is a sequential experiment whose goal is to achieve the largest possible reward from a payoff distribution with unknown parameters. We review several strategies for selecting bandits, including a creative application of dynamic programming. There are many variations of the stochastic MAB, including the contextual and adversarial bandits, which are outside the scope of this paper.
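A concrete baseline strategy (not necessarily one of the paper's) is epsilon-greedy: with probability eps pull a random arm to explore, otherwise pull the arm with the best empirical mean. A sketch on Bernoulli-reward arms with assumed true payoff probabilities:

```python
import random

def epsilon_greedy(true_means, n_rounds=10000, eps=0.1, seed=0):
    """Epsilon-greedy on Bernoulli arms: explore a random arm with
    probability eps, otherwise exploit the best empirical mean.
    true_means are the (unknown to the player) payoff probabilities."""
    rng = random.Random(seed)
    n_arms = len(true_means)
    pulls = [0] * n_arms     # times each arm was pulled
    est = [0.0] * n_arms     # running empirical mean per arm
    total = 0
    for _ in range(n_rounds):
        if rng.random() < eps:
            a = rng.randrange(n_arms)                      # explore
        else:
            a = max(range(n_arms), key=lambda i: est[i])   # exploit
        r = 1 if rng.random() < true_means[a] else 0
        pulls[a] += 1
        est[a] += (r - est[a]) / pulls[a]  # incremental mean update
        total += r
    return est, pulls, total
```

Over many rounds the empirical means concentrate near the true payoffs, and the better arm absorbs the bulk of the pulls; smarter strategies (UCB, Thompson sampling, the dynamic-programming approach reviewed in the paper) mainly reduce the cost of exploration.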
Hamiltonian Monte Carlo
Hamiltonian Monte Carlo (HMC) is a relatively recent approach to posterior estimation. It improves on the simple random-walk proposal of the Metropolis algorithm by steering proposals in the direction of steepest ascent (Neal, 2011), avoiding the slow exploration of the state space that random proposals produce. HMC reduces the correlation between successive sampled states by using Hamiltonian evolution between states, a concept borrowed from classical mechanics. The energy-preserving aspect of Hamiltonian dynamics is extremely useful in deriving and implementing this algorithm.
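To make the mechanics concrete, here is a one-dimensional HMC sketch (illustrative, not the paper's implementation): each proposal simulates Hamiltonian dynamics with the leapfrog integrator, then a Metropolis step corrects for discretization error. `logp` and `logp_grad` denote the log target density and its gradient.

```python
import math
import random

def hmc_sample(logp_grad, logp, x0, n_samples=2000, eps=0.2, n_leapfrog=20, seed=0):
    """One-dimensional HMC sketch: leapfrog integration of (x, p),
    then Metropolis accept/reject on the total energy
    H(x, p) = -logp(x) + p^2 / 2."""
    rng = random.Random(seed)
    x = x0
    samples = []
    for _ in range(n_samples):
        p = rng.gauss(0, 1)          # fresh momentum each iteration
        x_new, p_new = x, p
        # Leapfrog: half step in momentum, alternating full steps.
        p_new += 0.5 * eps * logp_grad(x_new)
        for _ in range(n_leapfrog - 1):
            x_new += eps * p_new
            p_new += eps * logp_grad(x_new)
        x_new += eps * p_new
        p_new += 0.5 * eps * logp_grad(x_new)
        # Accept with probability exp(H_old - H_new); near 1 when
        # the integrator conserves energy well.
        h_old = -logp(x) + 0.5 * p * p
        h_new = -logp(x_new) + 0.5 * p_new * p_new
        if rng.random() < math.exp(min(0.0, h_old - h_new)):
            x = x_new
        samples.append(x)
    return samples
```

Running it on a standard normal target (logp(x) = -x^2/2, gradient -x) recovers mean near 0 and variance near 1, with far lower autocorrelation than a random-walk Metropolis chain of the same length.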
La Quinta vs. Denny's
"La Quinta is Spanish for 'next to Denny's'" is a joke made famous by the late comedian Mitch Hedberg. We test the veracity of this claim by scraping location data from the two companies' websites, geocoding the addresses, and statistically analyzing the resulting distances. The project finishes with an interesting visualization and our findings.
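The core geometric step is measuring, for each La Quinta, the distance to its nearest Denny's. A sketch using the haversine great-circle distance; `nearest_dennys` is a hypothetical helper and the coordinates in the test are illustrative:

```python
import math

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (lat, lon) points."""
    r = 3958.8  # mean Earth radius in miles
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

def nearest_dennys(la_quinta, dennys_list):
    """Hypothetical helper: distance from one La Quinta (lat, lon)
    to the closest Denny's in a list of (lat, lon) pairs."""
    return min(haversine_miles(*la_quinta, *d) for d in dennys_list)
```

The distribution of these nearest-neighbor distances, compared against distances to random points, is what lets the joke be tested statistically.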
Web Scraping Indeed Job Postings
We create a unique Shiny app that scrapes job postings from Indeed for a user-specified keyword and location. We perform a term frequency-inverse document frequency (TF-IDF) analysis on the resulting corpus and display the most relevant words in a word cloud. The app also lets the user display more or fewer words and view a detailed word summary.
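TF-IDF weights a term by its frequency within a posting, discounted by how many postings contain it, so boilerplate words shared by every job ad score near zero. A minimal sketch over pre-tokenized documents (a Python stand-in, not the app's actual R/Shiny code):

```python
import math
from collections import Counter

def tf_idf(docs):
    """TF-IDF scores for a list of tokenized documents.
    Returns one {term: score} dict per document, where
    score = (term count / doc length) * log(n_docs / doc frequency)."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # document frequency: one count per doc
    scores = []
    for doc in docs:
        tf = Counter(doc)
        total = len(doc)
        scores.append({t: (c / total) * math.log(n / df[t])
                       for t, c in tf.items()})
    return scores
```

Terms appearing in every document get a score of exactly zero (log of 1), while rarer terms rise to the top, which is what drives the word-cloud sizing.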