Data Munging for Machine Learning

Your data can be useless if you don't prepare it well. Luckily, predictive analytics software is making the data wrangling process easier.

David Mora

The process: how to get there

  1. Outline your end goal.

Open a spreadsheet and fill in the column headers with what you want to predict and the data that's predictive of it. Write down all the data sources you'll need to pull from. Include as many stakeholders as possible in this planning, since each has a unique perspective on the outcome and its potential drivers.

  2. Gather & clean the data.

This is infamously time-consuming. That’s why it’s crucial to rely on good tooling here. For technical details of data cleaning, see the full checklist below.

Full checklist for preparing data for machine learning

1. Merge all your files

If you have multiple data sources, merge them into one file.
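As a minimal sketch of this step, assuming two hypothetical tables that share a `student_id` key (the column names here are invented for illustration), a pandas merge brings them into one file:

```python
import pandas as pd

# Hypothetical sources that share a "student_id" key.
grades = pd.DataFrame({"student_id": [1, 2], "gpa": [3.1, 3.7]})
attendance = pd.DataFrame({"student_id": [1, 2], "days_absent": [4, 0]})

# Merge on the shared key so each student's data lands in one table.
merged = grades.merge(attendance, on="student_id", how="left")
```

A left join keeps every row from the first source even when the second is missing a match, which is usually the safer default when sources are incomplete.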

2. Aggregate data to one row per instance you want to predict

For example, each row might represent a student's monthly stats (meaning multiple rows for each student), but you want to aggregate them so you can make a single yearly prediction for each student.
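The monthly-to-yearly example above can be sketched with a pandas group-by (the column names and stats are hypothetical):

```python
import pandas as pd

# Hypothetical monthly stats: multiple rows per student.
monthly = pd.DataFrame({
    "student_id": [1, 1, 2, 2],
    "month": ["Jan", "Feb", "Jan", "Feb"],
    "score": [70, 80, 90, 85],
})

# Collapse to one row per student for a single yearly prediction.
yearly = monthly.groupby("student_id", as_index=False).agg(
    mean_score=("score", "mean"),
    months_observed=("month", "count"),
)
```

Which aggregates to use (mean, sum, min, max, count) depends on what's actually predictive, so it's worth computing a few candidates per column.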

3. Ensure each row is roughly independent from all other rows

ML algorithms assume each row is "independent." To understand what that means, let's look at a dataset of student yearly data, where each row is a student's performance for a year of school. If we have multiple years for the same student, then each row is not independent: if a student did poorly the year before, this clearly has an impact on how they do the year after. In other words, later rows for a student "depend" significantly on earlier rows. So how do we fix it? We might add a new column called "performance in previous years." While not perfect, this makes past performance an explicit part of the individual row, rather than leaving its dependence on other rows implicit and unaddressed.
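The "performance in previous years" fix can be sketched as a lagged feature: sort each student's rows by year, then shift the score down one row within each student (the columns here are hypothetical):

```python
import pandas as pd

# Hypothetical yearly data: multiple years per student.
df = pd.DataFrame({
    "student_id": [1, 1, 2, 2],
    "year": [2019, 2020, 2019, 2020],
    "score": [60, 70, 90, 85],
})

# Make past performance an explicit column on each row:
# within each student, pull the previous year's score forward.
df = df.sort_values(["student_id", "year"])
df["prev_year_score"] = df.groupby("student_id")["score"].shift(1)
```

Each student's first year has no prior score, so that cell is left missing; you'd decide separately whether to drop those rows or fill them.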

4. Filter to just quality data

It's great to have lots of data, but you'll want to throw out rows or columns where much of the data is missing or erroneous. Left in, they'll largely just dilute the model. That said, if a column has quality data but you're not sure if it's useful, definitely keep it, and see if the model finds it predictive.
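A simple sketch of this filtering, assuming a hypothetical missingness threshold of 50% for columns: drop columns that are mostly empty first, then drop rows that still have gaps.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "gpa": [3.1, np.nan, 3.7, 2.9],
    "mostly_missing": [np.nan, np.nan, np.nan, 1.0],
})

# Drop columns where more than half the values are missing...
keep_cols = df.columns[df.isna().mean() <= 0.5]
df = df[keep_cols]

# ...then drop any rows still missing required values.
df = df.dropna()
```

The 50% cutoff is arbitrary; the right threshold depends on how much data you have and how important the column is.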

5. Ensure you have a representative sample

It’s great to have lots of data, but does that data capture all the cases you hope to predict? Specifically, does it have a robust sample of your most important outcome? This is a classic challenge in ML: often the most important outcome happens rarely. For example: credit card fraud, machine failure, or customer churn. Work around this by pulling data from a larger span of time.
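Before modeling, it's worth measuring exactly how rare the outcome is. A minimal sketch, using an invented churn dataset:

```python
import pandas as pd

# Hypothetical churn labels: the outcome of interest is rare.
df = pd.DataFrame({"churned": [0] * 95 + [1] * 5})

# Count each class and compute the positive rate.
counts = df["churned"].value_counts()
positive_rate = df["churned"].mean()
```

If the positive rate is in the low single digits, that's a signal to pull a longer span of history (or more sources) before training.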

6. Clean & refine columns

  • Some numeric measures may be stored as categorical strings, e.g. "First" or "1 - low engagement." Convert those to numbers.
  • Make sure dates are formatted in a machine-readable way. You may also want to split a date into a categorical column, e.g. from "1/1/2020" to "January".
  • Give columns clearer names so the dataset is easy to use and share.
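The three refinements above can be sketched in pandas (the column names and values here are invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "engagement": ["1 - low engagement", "3 - high engagement", "2 - medium engagement"],
    "enrolled": ["1/1/2020", "2/15/2020", "3/1/2020"],
})

# Categorical strings that encode a number -> pull out the number.
df["engagement_level"] = df["engagement"].str.extract(r"(\d+)", expand=False).astype(int)

# Parse dates into a machine-readable type, then derive a
# categorical month column from them.
df["enrolled"] = pd.to_datetime(df["enrolled"], format="%m/%d/%Y")
df["enrolled_month"] = df["enrolled"].dt.month_name()

# Clearer column names make the dataset easier to use and share.
df = df.rename(columns={"enrolled": "enrollment_date"})
```

Passing an explicit `format` to `pd.to_datetime` avoids silent month/day ambiguity when dates arrive as strings.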

Data preparation drains 60+% of data scientists’ time. Invest in a data pipeline that’s automated and doesn’t require code.

Regardless of how you choose to prepare data, your tooling should do these three things:

  • Provide a powerful, code-free visual interface anyone on your team can use to build pipelines. Historically, data prep is costly, slow, and brittle because it’s done strictly in code, so only data scientists can tweak it. Avoid this trap.
  • Make pipelines standard and reusable.
  • Integrate with your existing data flow: pull directly from data sources, and feed the prepped data directly into the machine learning algorithm.