Your data can be useless if you don't prepare it well. Luckily, predictive analytics software is making this process easier.
Open a spreadsheet and fill in the column headers with what you want to predict and the data that's predictive of it. Write down all the data sources you'll need to pull from. Include as many stakeholders as possible in this planning, since each has a unique perspective on the outcome and its potential drivers.
Data cleaning is infamously time-consuming. That’s why it’s crucial to rely on good tooling here. For technical details of data cleaning, see the full checklist below.
If you have multiple data sources, merge them into one file.
For example, each row might represent a student's monthly stats (meaning multiple rows for each student), but you want to aggregate them so you can make a single yearly prediction for each student.
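As a rough illustration of that aggregation, here is a minimal sketch in plain Python. The column names (`student_id`, `absences`, `avg_grade`) and the choice of summaries (a total and a mean) are assumptions made up for this example, not a prescribed schema.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical monthly records: several rows per student.
monthly_rows = [
    {"student_id": "s1", "month": 1, "absences": 2, "avg_grade": 80.0},
    {"student_id": "s1", "month": 2, "absences": 0, "avg_grade": 90.0},
    {"student_id": "s2", "month": 1, "absences": 1, "avg_grade": 70.0},
    {"student_id": "s2", "month": 2, "absences": 3, "avg_grade": 60.0},
]


def aggregate_by_student(rows):
    """Collapse monthly rows into one summary row per student."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["student_id"]].append(row)

    return [
        {
            "student_id": student_id,
            "months_observed": len(group),
            "total_absences": sum(r["absences"] for r in group),
            "mean_grade": mean(r["avg_grade"] for r in group),
        }
        for student_id, group in grouped.items()
    ]


yearly_rows = aggregate_by_student(monthly_rows)
```

Which summaries to keep (totals, means, minimums, trends) depends on what you expect to drive the outcome, so treat the two used here as placeholders.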
ML algorithms assume each row is "independent." To understand what that means, let's look at a dataset of student yearly data, where each row is a student's performance for a year of school. If we have multiple years for the same student, then the rows are not independent: if a student did poorly the year before, this clearly has an impact on how they do the year after. In other words, later rows for a student "depend" significantly on earlier rows. So how do we fix it? We might add a new column called "performance in previous years." While not perfect, this makes past performance an explicit part of the individual row, rather than leaving its dependence on other rows implicit and unaddressed.
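One way such a "performance in previous years" column could be built is sketched below. It assumes hypothetical fields `student_id`, `year`, and `score`, and it uses only the single most recent earlier year; an average over all earlier years would be an equally reasonable choice.

```python
# Hypothetical yearly records: one row per student per school year.
yearly_rows = [
    {"student_id": "s1", "year": 2021, "score": 72.0},
    {"student_id": "s1", "year": 2022, "score": 78.0},
    {"student_id": "s2", "year": 2022, "score": 91.0},
    {"student_id": "s1", "year": 2023, "score": 81.0},
]


def add_previous_score(rows):
    """Return copies of the rows with a 'previous_score' column added.

    The value is the same student's score from their most recent earlier
    row, or None when the student has no earlier row.
    """
    last_score = {}  # student_id -> score from the latest row seen so far
    result = []
    # Sort so each student's years are visited in chronological order.
    for row in sorted(rows, key=lambda r: (r["student_id"], r["year"])):
        enriched = dict(row)
        enriched["previous_score"] = last_score.get(row["student_id"])
        last_score[row["student_id"]] = row["score"]
        result.append(enriched)
    return result


rows_with_history = add_previous_score(yearly_rows)
```

A student's first recorded year has no history, so its `previous_score` is left as `None`; how to treat those rows (drop them, or fill in a default) is a separate decision.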
It's great to have lots of data, but you'll want to throw out rows or columns where much of the data is missing or erroneous. These are largely just going to dilute the model. That said, if a column has quality data, but you're not sure if it's useful, definitely keep it, and see if the model finds it predictive.
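A simple version of this filtering might look like the sketch below. The 50% cutoff (`max_missing_fraction`) and the sample columns are assumptions for illustration; the right threshold depends on your data, and this only handles missing values, not erroneous ones.

```python
def is_missing(value):
    """Treat None and empty strings as missing."""
    return value is None or value == ""


def drop_sparse(rows, max_missing_fraction=0.5):
    """Drop columns, then rows, whose share of missing values exceeds the limit."""
    if not rows:
        return []
    columns = list(rows[0].keys())

    # Keep only columns that are filled in often enough.
    kept_columns = [
        c for c in columns
        if sum(is_missing(r.get(c)) for r in rows) / len(rows) <= max_missing_fraction
    ]
    if not kept_columns:
        return []

    # Among the remaining columns, keep only rows that are filled in often enough.
    cleaned = []
    for r in rows:
        missing = sum(is_missing(r.get(c)) for c in kept_columns)
        if missing / len(kept_columns) <= max_missing_fraction:
            cleaned.append({c: r.get(c) for c in kept_columns})
    return cleaned


# Hypothetical sample: 'locker' is mostly empty, and the third row is mostly empty.
raw_rows = [
    {"id": 1, "age": 15, "locker": None, "score": 80},
    {"id": 2, "age": None, "locker": None, "score": 75},
    {"id": 3, "age": None, "locker": None, "score": None},
    {"id": 4, "age": 16, "locker": "B12", "score": 90},
]

clean_rows = drop_sparse(raw_rows)
```

Columns are filtered before rows here so that a mostly empty column does not cause otherwise complete rows to be discarded.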
It’s great to have lots of data, but does that data capture all the cases you hope to predict? Specifically, does it have a robust sample of your most important outcome? This is a classic challenge in ML: often the most important outcome happens rarely. For example: credit card fraud, machine failure, or customer churn. Work around this by pulling data from a larger span of time.
Regardless of how you choose to prepare data, these three factors should be present:
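Before modeling, it can help to simply count how often each outcome appears. The sketch below does that for a hypothetical `is_fraud` column; the sample data is invented, and how many examples count as "robust" is left to your judgment.

```python
from collections import Counter


def outcome_summary(rows, outcome_column):
    """Return {outcome_value: (count, share_of_all_rows)} for one column."""
    counts = Counter(row[outcome_column] for row in rows)
    total = sum(counts.values())
    return {value: (count, count / total) for value, count in counts.items()}


# Hypothetical transactions: 98 legitimate, 2 fraudulent.
transactions = (
    [{"amount": 20.0, "is_fraud": False}] * 98
    + [{"amount": 950.0, "is_fraud": True}] * 2
)

summary = outcome_summary(transactions, "is_fraud")
fraud_count, fraud_share = summary[True]
```

If the count for the outcome you care about is very small, that is a signal to gather more history before training.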