What Kind of Data Do I Need to Start with Machine Learning?

Learn how much historical data is needed to start with machine learning. Then, capture your most valuable business outcomes.

David Mora

To make predictions, machine learning needs historical data that spans a time frame wide enough to capture the kinds of variation you’d expect to see.

This data should be stored in a spreadsheet that has:

  • a column representing what you want to predict;
  • one or more columns with data related to that prediction column.
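To make that layout concrete, here is a minimal Python sketch that reads a spreadsheet exported as CSV and separates the prediction column from the related columns. The column names ("outcome", "input_a", "input_b") and the values are made up for illustration; your own spreadsheet will look different.

```python
import csv
import io

# Hypothetical spreadsheet exported as CSV; names and values are invented.
SHEET = """outcome,input_a,input_b
yes,3,north
no,7,south
no,5,north
"""

TARGET = "outcome"  # assumed name of the column you want to predict

# Each row of the spreadsheet becomes a dict keyed by column name.
rows = list(csv.DictReader(io.StringIO(SHEET)))

# The prediction column becomes the list of known answers...
labels = [row[TARGET] for row in rows]

# ...and every other column becomes an input describing that row.
inputs = [{name: value for name, value in row.items() if name != TARGET}
          for row in rows]

print(labels)
print(inputs[0])
```

Most machine learning tools expect this same split: one column of known answers, and the remaining columns as the information used to predict them.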

Example: Data for Predicting User Turnover

You run an online, subscription-based product. But users are cancelling their subscriptions. You want to predict this ahead of time, and implement strategies to retain those users.

To do this with machine learning:

1. You'd need a spreadsheet where each row is a customer with:

  • A “yes/no” column called “User cancelled their subscription”
  • A series of columns that help predict whether a user cancels. Columns might include: user demographics like gender and region, or engagement measures like items purchased.

2. You’d want to be sure you have data on customers from all representative time frames:

  • If user behavior changes little month to month, a few months of data could work.
  • If user behavior follows yearly cycles, you'd want at least a few years of data to capture that variation.
  • Further, you’d want a sufficient sample of both customers who did and didn’t cancel.
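Both checks in step 2 are easy to run before any modeling. The sketch below is one way to do it in Python, assuming a hypothetical export with a "signup_date" column and a yes/no "cancelled" column. The rows are invented, and what counts as "enough" time or "enough" cancellations is a judgment call for your business, not something the code decides.

```python
import csv
import io
from collections import Counter
from datetime import date

# Hypothetical customer spreadsheet exported as CSV; all values are invented.
SHEET = """signup_date,region,items_purchased,cancelled
2021-01-15,EU,4,no
2021-07-02,US,0,yes
2022-03-20,US,9,no
2022-11-05,EU,1,yes
2023-02-11,EU,6,no
"""

rows = list(csv.DictReader(io.StringIO(SHEET)))

# Check 1: how much calendar time does the data cover?
dates = [date.fromisoformat(row["signup_date"]) for row in rows]
span_days = (max(dates) - min(dates)).days
span_years = span_days / 365.25

# Check 2: are both outcomes represented?
# Counter returns 0 for a missing outcome, so an absent class shows up as 0.
outcome_counts = Counter(row["cancelled"] for row in rows)
cancelled = outcome_counts["yes"]
stayed = outcome_counts["no"]
minority_share = min(cancelled, stayed) / len(rows)

print(f"Data covers about {span_years:.1f} years")
print(f"Cancelled: {cancelled}, stayed: {stayed}")
print(f"Smaller group is {minority_share:.0%} of all customers")
```

If the time span is shorter than the cycles you expect, or one of the two groups is close to empty, that is a sign to collect more history before going further.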

Machine learning mimics the way a human with no context for your business would learn from your data. Use that to estimate the data you’ll need.

Ask yourself: with enough time and a good enough memory, could an intelligent person find the factors needed to make strong predictions from your data alone? If so, there's a very good chance machine learning can as well.

Once you have that data, you’ll need to process it.

So how should you format your data to get maximum power out of machine learning?