Regression is used primarily when you wish to predict or explain numeric data

Regression | Photo by Hao Wang

This article is a continuation of a series that last covered Bayesian statistics.

Volatility is inherent within the stock market, but perhaps the rise of “meme stocks” is something beyond traditional comprehension. This article is not about such “stonks” — the ironic misspelling of stocks that encapsulates the new investing zeitgeist amongst Millennials (broadly speaking). Rather, I was pondering the fabulous rollercoaster ride of $GME as a timely example of something to apply linear regression to; one can fit a linear regression model to a given stock to predict its price in the future. …
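To make the idea concrete, here is a minimal sketch of fitting a straight line to closing prices and extrapolating one step ahead. The prices below are synthetic placeholders, not real $GME data, and a naive linear extrapolation is, of course, a toy forecast rather than investment advice:

```python
import numpy as np

# Hypothetical closing prices for 10 trading days (synthetic, not real $GME data)
days = np.arange(10)
prices = np.array([19.9, 20.4, 20.1, 21.0, 21.3, 21.2, 22.0, 22.4, 22.1, 23.0])

# Fit a degree-1 polynomial: price ~ slope * day + intercept
slope, intercept = np.polyfit(days, prices, 1)

# Extrapolate to day 10 (the "next" trading day): the naive linear forecast
forecast = slope * 10 + intercept
print(round(slope, 3), round(forecast, 2))
```

The fitted slope is the average daily price change; the forecast simply rides that trend one day forward, which is exactly why such a model struggles with meme-stock volatility.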

Getting Started

Hard to believe there was once a controversy over probabilistic statistics

Frequentists vs Bayesians | Photo by Lucas Benjamin

This article builds on my previous article about Bootstrap Resampling.

Introduction to Bayes Models

Bayesian models are a rich class of models, which can provide attractive alternatives to Frequentist models. Arguably the most well-known feature of Bayesian statistics is Bayes’ theorem (more on this later). With the recent advent of greater computational power and general acceptance, Bayes methods are now widely used in areas ranging from medical research to natural language processing (NLP) to web search.
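As a quick taste of the theorem before we get there, here is Bayes’ rule applied to a classic disease-screening problem. All of the probabilities below are hypothetical numbers chosen for illustration:

```python
# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
# Screening example with hypothetical numbers:
p_disease = 0.01              # prior: 1% of the population has the disease
p_pos_given_disease = 0.95    # test sensitivity
p_pos_given_healthy = 0.05    # false-positive rate

# Total probability of a positive test (law of total probability)
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

# Posterior: probability of disease given a positive test
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(round(p_disease_given_pos, 3))  # -> 0.161
```

Even with a 95%-sensitive test, a positive result here implies only about a 16% chance of disease, because the prior is so low. This kind of prior-driven reasoning is the heart of the Bayesian view.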

In the early 20th century there was a big debate about the legitimacy of what is now called Bayesian statistics, which is essentially a probabilistic way of…

Simple, straightforward, convenient.

Another Data Dimension | Photo by Rene Böhmer

No, not Twitter Bootstrap — this bootstrapping is a way of sampling data, and it is one of the most important tools for understanding what underlies the variation in our numbers and distributions. To that end, bootstrapping works really, really well. For Data Scientists, Machine Learning Engineers, and statisticians alike, it is vital to understand resampling methods.

But why use resampling? We use resampling because we only have a limited amount of data; we are limited by time and economics, to say the least. What then is resampling? Resampling is when you take a sample, and then…
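The core move is simple enough to sketch in a few lines: resample the data you have, with replacement, and watch how your statistic varies. The sample values below are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)

# A small observed sample; in practice this is all the data we could afford to collect
sample = np.array([4.2, 5.1, 3.8, 6.0, 5.5, 4.9, 5.2, 4.4, 5.8, 4.6])

# Draw many resamples *with replacement*, each the same size as the original,
# and record the statistic of interest (here, the mean) for each one
boot_means = np.array([rng.choice(sample, size=sample.size, replace=True).mean()
                       for _ in range(5000)])

# The spread of the bootstrap means approximates the sampling variability of the mean
low, high = np.percentile(boot_means, [2.5, 97.5])
print(round(low, 2), round(high, 2))
```

The percentile interval gives a rough 95% confidence interval for the mean without any distributional assumptions, which is exactly the appeal of the method.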

“It’s tough to make predictions, especially about the future!” — Yogi Berra

Time series analysis abstraction | Photo by Luca Bravo

Some wisdom transcends the ages!


This article provides an overview of time series analysis. Time series are an extremely common data type. A quick Google search yields many applications, including:

  • Demand forecasting: electricity production, traffic management, inventory management
  • Medicine: Time-dependent treatment effects, EKG
  • Financial markets and economics: seasonal unemployment, price/return series, risk analysis
  • Engineering/Science: signal analysis, analysis of physical processes

For this article I will cover:

  • Basic properties of time series
  • How to perform and understand decomposition of time series
  • The ARIMA model
  • Forecasting
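As a taste of the decomposition topic above, here is a minimal sketch of classical additive decomposition on a simulated monthly series, recovering the trend with a moving average exactly one season wide (for real work, statsmodels provides `seasonal_decompose`; everything below is synthetic):

```python
import numpy as np

rng = np.random.default_rng(0)
n, period = 120, 12                              # ten years of monthly observations

t = np.arange(n)
trend = 0.5 * t                                  # slow upward drift
seasonal = 10 * np.sin(2 * np.pi * t / period)   # yearly cycle
noise = rng.normal(0, 1, n)
series = trend + seasonal + noise                # additive model: trend + season + noise

# Estimate the trend with a moving average one full season wide,
# so the seasonal component averages out to zero inside each window
kernel = np.ones(period) / period
est_trend = np.convolve(series, kernel, mode="valid")

# Each window mean sits at the window's centre, i.e. true trend 0.5 * (i + 5.5)
true_at_centres = 0.5 * (np.arange(est_trend.size) + (period - 1) / 2)
err = np.abs(est_trend - true_at_centres).mean()
print(round(err, 2))
```

The small residual error shows the moving average tracking the true trend closely once the seasonal cycle has been averaged away; subtracting that trend and averaging by month-of-year would then recover the seasonal component.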

References: a selection of references you can use to go deeper into time series analysis with Python:

Or how I learned to love the sigmoid “squishification” function for categorical data classification

Logistic Regression | Photo by Denise Bossarte

This article is a brief continuation of my regression series.

So far the regression examples I have been illustrating have been numeric: predicting a continuous variable. With the Galton family height dataset, we were predicting children’s height — a continuously varying parameter. Yet note how a regression line fails to fit a binary outcome, a classification example:
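Where a straight line fails, the sigmoid succeeds. Here is a minimal from-scratch sketch of logistic regression fit by gradient descent on simulated binary data (the true coefficient of 3 is an arbitrary choice for the simulation):

```python
import numpy as np

def sigmoid(z):
    # The "squishification" function: maps any real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
# Toy binary outcome: class 1 becomes more likely as x grows (true coefficient = 3)
x = rng.normal(0, 1, 200)
y = (rng.uniform(size=200) < sigmoid(3 * x)).astype(float)

# Fit weight w and bias b by gradient descent on the mean log-loss
w, b = 0.0, 0.0
for _ in range(2000):
    p = sigmoid(w * x + b)            # predicted probabilities
    w -= 0.1 * np.mean((p - y) * x)   # gradient of log-loss w.r.t. w
    b -= 0.1 * np.mean(p - y)         # gradient of log-loss w.r.t. b
print(round(w, 2))
```

Instead of predicting a raw number, the model predicts a probability squished into (0, 1), which is what makes it suitable for classification.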

The continuing adventures of regularization and the eternal quest to prevent model overfitting!

LASSO, Ridge, and Elasticnet regression | Photo by Daniele Levis Pulusi

This article is a continuation of last week’s intro to regularization with linear regression. Let us wander back into the nitty-gritty of making the best data science/machine learning models possible with more advanced techniques for simplifying our models. How do we simplify our models? By removing as many features as possible. Why do I want to do this? I want a simpler model. Why do I want a simpler model? Because a simpler model generalizes better. What does generalization mean? It means that you can actually use the model in the real world. …
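One concrete way regularization removes features is the LASSO, whose soft-thresholding step drives small coefficients exactly to zero. Below is a toy coordinate-descent sketch on simulated data, just to show the mechanics (in practice you would reach for scikit-learn’s `Lasso`; the data and penalty here are arbitrary choices):

```python
import numpy as np

def soft_threshold(rho, lam):
    # Proximal step that drives small coefficients exactly to zero
    return np.sign(rho) * np.maximum(np.abs(rho) - lam, 0.0)

def lasso_cd(X, y, lam, n_sweeps=200):
    # Plain coordinate descent for the LASSO (assumes standardized columns)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_sweeps):
        for j in range(d):
            r = y - X @ w + X[:, j] * w[j]       # partial residual excluding feature j
            rho = X[:, j] @ r / n
            w[j] = soft_threshold(rho, lam) / (X[:, j] @ X[:, j] / n)
    return w

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 5))
X = (X - X.mean(0)) / X.std(0)                   # standardize each column
# Only the first two features actually matter; the other three are pure noise
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(0, 0.5, 200)

w = lasso_cd(X, y, lam=0.3)
print(np.round(w, 2))
```

The three noise features come out with coefficients of exactly zero, i.e. they have been removed from the model, while the two real features survive (slightly shrunk). That is feature selection happening automatically.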

The one where we correct overfitting

Regularization and Linear Regression | Photo by Jr Korpa

This article is a continuation of my series on linear regression and bootstrap and Bayesian statistics.

Previously I talked at length about linear regression, and now I am going to continue that topic. As I hinted at previously, I am going to bring up regularization. What regularization does is simplify the model. And the reason why you want a simpler model is that, usually, a simpler model will perform better in most, if not all, of the tests that can be run.
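The simplest regularizer to see in code is ridge regression, which has a closed-form solution. A minimal sketch on simulated data (the coefficients and penalty below are arbitrary illustrations):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(0, 1, 50)

def ridge(X, y, lam):
    # Closed-form ridge solution: (X^T X + lam * I)^-1 X^T y
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

w_ols = ridge(X, y, 0.0)    # lam = 0 recovers ordinary least squares
w_reg = ridge(X, y, 10.0)   # a larger lam pulls all coefficients toward zero
print(np.round(w_ols, 2), np.round(w_reg, 2))
```

The penalized fit always has a smaller coefficient norm than the unpenalized one; shrinking the coefficients is precisely the “simplifying the model” that helps it generalize.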

N.B: It is not kosher to use training…

There were others who had forced their way to the top from the lowest rung by the aid of their bootstraps

Bootstrapping Linear Regression | Photo by Ahmad Dirini

This article builds on my Linear Regression and Bootstrap Resampling pieces.

For the literary-minded among my readers, the subtitle is a quote from ‘Ulysses’ (1922) by James Joyce! The origin of the term “bootstrap” is in literature, though not from Joyce. The usage denotes bettering oneself by one’s own efforts — further evolving to encompass metaphors for self-sustaining processes that proceed without external help, the context we are likely most familiar with.

For data scientists and machine learning engineers, this bootstrapping context is an important tool for sampling data. For this reason, it is one of…
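Combining the two ideas from the title, here is a sketch of the pairs bootstrap applied to linear regression: resample the (x, y) rows with replacement, refit the line each time, and read off the variability of the slope. The data are simulated with a true slope of 2:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 100
x = rng.normal(0, 1, n)
y = 2.0 * x + 1.0 + rng.normal(0, 1, n)   # true slope 2, intercept 1

# Pairs bootstrap: resample (x, y) rows with replacement, refit, collect slopes
slopes = np.empty(2000)
for b in range(2000):
    idx = rng.integers(0, n, n)           # row indices drawn with replacement
    slopes[b] = np.polyfit(x[idx], y[idx], 1)[0]

lo, hi = np.percentile(slopes, [2.5, 97.5])
print(round(lo, 2), round(hi, 2))
```

The percentile interval on the bootstrap slopes quantifies the uncertainty of the fitted coefficient without leaning on the usual normal-theory formulas.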

Oh boy, homoscedasticity

Multivariate Regression | Photo by Paweł Czerwiński

This article is a continuation of my previous one on Linear Regression.

It is important to reiterate the error formulae in least-squares regression from my last article.

The Central Limit Theorem (CLT). Something we likely learned in high school math (AP Stats for me). What I remember about it was that, because of the CLT, the magic number for sampling was n = 30. Like many sleep-deprived teens, I nodded and jotted that down in my notebook as I sat in the back of the class, struggling to read the faded projector screen. As an aside, I swear this was among the last projectors in the entire school, with all my other classes having those fancy smart boards.
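The “magic” n = 30 is easy to see in simulation: draw many samples of size 30 from a decidedly non-normal population and watch the sample means behave normally anyway. A minimal sketch using an exponential population (a skewed distribution with mean 1 and standard deviation 1):

```python
import numpy as np

rng = np.random.default_rng(0)

# Draw many samples of size n = 30 from a *skewed* (exponential) population
n, reps = 30, 10000
means = rng.exponential(scale=1.0, size=(reps, n)).mean(axis=1)

# CLT: the sample means cluster around the population mean (1.0), with a
# standard deviation close to sigma / sqrt(n) = 1 / sqrt(30) ~ 0.183
print(round(means.mean(), 2), round(means.std(), 3))
```

Even though each individual draw is heavily skewed, the distribution of the 10,000 sample means is nearly symmetric and tight around 1.0, which is the CLT doing its work at n = 30.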

While I…

James Andrew Godwin

Writer, Data Scientist and huge Physics nerd
