Have you ever tried to reshape a dataset and found that it took far longer than expected? The task looks simple, but it involves a lot of error-prone, manual, and repetitive work.
Data munging, more commonly referred to as data preparation, is often the most time-consuming part of an analytical project. You can do it manually, but that is a slow and error-prone process. Some teams try to automate it using Alteryx, Excel macros, or Python, but these tools take a long time to set up and often come with a steep learning curve. Even a simple task like extracting the day of the week can take far longer than expected.
The New York Times reports that “Data scientists, according to interviews and expert estimates, spend from 50 percent to 80 percent of their time mired in this more mundane labor of collecting and preparing unruly digital data, before it can be explored for useful nuggets.”
Repetitive data cleaning steps
As data piles up in every department, the need for structured, optimized data cleaning keeps growing. It is usually a task for data scientists or data analysts, but the results are used across many departments. If you manage a sales team, it could be the Excel spreadsheet that powers your weekly sales meeting: once or twice a week, you nudge your colleague, the data scientist, to clean that data and build better sales forecasts. Or maybe it's that enormous Google Sheet filled with marketing data from your marketing director. If you're a data scientist, it's the feature engineering needed for machine learning.
Whether you do this in Excel, Google Sheets, or Python, data prep is a laborious task that needs to be faster, simpler, and more powerful.
Repetitive tasks that follow well-defined rules are a perfect fit for automation, and we wanted to provide the easiest, most convenient way to automate them. At Intersect Labs, we set out to build the world's most advanced and easiest-to-use data prep tool. We designed a flow that lets you transform data 11x faster, no matter how simple or complex the transformation is.
Introducing Pasteur: The Data Preparation App
At Intersect Labs, we strive to simplify the data science workflow for everyone. The first step towards that goal is to enable both technical and non-technical professionals to organize and understand their data. In 2020, all kinds of data sources help teams achieve their goals and KPIs. This data is incorporated into almost every business process, and today's professionals must have the tools to go from data to decisions in minutes.
Today, we are launching Pasteur, an automated data preparation app that guides non-technical professionals through understanding and cleaning their data.
How does it work?
Pasteur currently supports 50 pre-built data cleaning steps, organized into Transform blocks, which we'll simply call Transforms from here on. Instead of writing a pile of custom Excel functions or Python scripts, you create a sequence of data cleaning steps, which we call a pipeline.
To apply a data cleaning function, you simply add the “Transform” block from a gallery of transforms. Want to multiply two columns? Add the “Multiply Columns” block, done. Want to group by one column and sum the other columns’ values? Add the “Aggregation” transform block, done.
Example: Cleaning and Analyzing Sales data
Say we want to get insights from the data that we have on our customers' individual orders for 2020. We would like to quickly produce an overview of the performance for each month. So we’d like to:
- Group the orders by month;
- Get the number of sales and revenue per month.
Since this is a repetitive task for us, we want to set up a pipeline that we can reuse in the future. Once the pipeline is set up, we can simply upload the next dataset and out come the results.
Loading your data in Pasteur
1) Set up an account on Pasteur if you haven't already
2) Create a new project
- Go to Projects
- Click on New Project
- Click on Select Data Set
- Go to Upload CSV and select your CSV
- Confirm columns and then press Confirm
You can also connect your data with Google Sheets, MongoDB or SQL databases.
Let's look at the data
We need the Order Date, Quantity and Item Cost to get the Monthly Sales number out.
Setting up the pipeline
Now we can start appending transforms to the pipeline. First, we want to multiply two columns: "Quantity" and "Item Cost". We add the "Multiply Columns" transform, select which columns to multiply, and give the resulting column a new name, "Total Cost".
If we click "Apply Transform" and set the Data Preview to "After", we can see that the transform has added our new "Total Cost" column to the data.
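For readers who are curious what this step does under the hood, here is an illustrative equivalent in pandas (the column names mirror the example above; the data values are made up for the sketch):

```python
import pandas as pd

# Sample order data, mirroring the columns used in the example above
orders = pd.DataFrame({
    "Quantity": [2, 1, 3],
    "Item Cost": [10.0, 25.0, 5.0],
})

# The equivalent of the "Multiply Columns" transform:
# multiply the two columns element-wise into a new "Total Cost" column
orders["Total Cost"] = orders["Quantity"] * orders["Item Cost"]

print(orders["Total Cost"].tolist())  # → [20.0, 25.0, 15.0]
```

In Pasteur this is a single block you configure visually, with no code to write or maintain.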
Now we would like to group the data by month. We can use the Order Date column (which is in European, day-first format) for this, and luckily, Pasteur handles it without breaking a sweat. Add the "Extract from Date/Time column" transform to the pipeline. It automatically creates separate day, month, and year columns from the single date column; we can even extract the name of the day or month.
We just added 3 columns for the year, month and day.
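As a rough sketch of what this transform produces, here is the same extraction in pandas (the dates here are invented sample values; note `dayfirst=True` for the European format mentioned above):

```python
import pandas as pd

# Order dates in European (day-first) format, as in the example data
orders = pd.DataFrame({"Order Date": ["05/01/2020", "17/02/2020", "23/02/2020"]})

# The equivalent of the "Extract from Date/Time column" transform:
# parse the dates, then split out year, month, day, and the month's name
parsed = pd.to_datetime(orders["Order Date"], dayfirst=True)
orders["Year"] = parsed.dt.year
orders["Month"] = parsed.dt.month
orders["Day"] = parsed.dt.day
orders["Month Name"] = parsed.dt.month_name()

print(orders[["Year", "Month", "Day", "Month Name"]])
```

Getting `dayfirst` wrong is a classic source of silent date bugs in hand-written scripts, which is exactly the kind of detail the transform block takes care of for you.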
Lastly, let's group the orders on the month column.
This can be done with the "Aggregate Data" transform. First, we select the column to group our orders by; in our case, that is the "Month" column. Then we choose which column to aggregate and how. We want to add up all the "Total Cost" values, so we select sum. There are plenty of other aggregations to choose from, such as the mean, standard deviation (std), maximum, and minimum. The mean gives us the average order size, which is handy to include, so we select that one too.
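For comparison, the same grouping step expressed in pandas might look like this (again with invented sample values, using the "Month" and "Total Cost" columns from the walkthrough):

```python
import pandas as pd

# Sample data after the earlier steps: one row per order,
# with the extracted Month and computed Total Cost columns
orders = pd.DataFrame({
    "Month": [1, 1, 2, 2, 2],
    "Total Cost": [20.0, 25.0, 15.0, 30.0, 15.0],
})

# The equivalent of the "Aggregate Data" transform: group by Month,
# then take the sum (total sales) and mean (average order size)
summary = orders.groupby("Month")["Total Cost"].agg(["sum", "mean"]).reset_index()

print(summary)
```

In Pasteur, the group-by column and the aggregation functions are just dropdown choices inside the block.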
Great! We now have our Total Sales per month and Average Sales per month out of our sales data spreadsheet. Let’s get the total dataset out and make a nice graph, get some coffee, and make our manager happy.
Download the transformed spreadsheet.
Plotting the Data
We have successfully cleaned our sales data using Pasteur, and now we can go back into Excel and plot it. From what looked like messy data at the beginning, we can now plot total revenue per month and average order size, simply because we've organized the data.
That's it. In no time we have set up a data transformation pipeline that can quickly surface insights for your sales meeting. The best part is that we can reuse this pipeline every time we have a new list of orders: simply apply it to a newly uploaded dataset and get fresh insights in under 5 minutes. Of course, this was just a simple example; there are over 50 transform blocks available for whatever data modification you need.
If you find yourself doing these repetitive data cleaning steps, give Pasteur a try. Instead of wrestling with messy data and complex functions, you can build a custom pipeline once and let it work for you.
For some of our users, it saves up to 4 working days a month; they usually automate the whole process, from input to output. For one customer (whom we wrote about on our blog), we built a 150-step pipeline in under an hour. Now, anytime someone responds to their Typeform survey, the response is sent to Pasteur via the API endpoint; the pipeline cleans the data and sends the output to their CRM. They never need to clean that data manually again.