Have you ever tried to transform data and found that it took far longer than expected? The task looks simple, but it involves a lot of error-prone, manual, repetitive work.
Data preparation is often the most time-consuming part of an analytical project. You can do it manually, but that is a slow and error-prone process. Some teams try to automate it with Alteryx, Excel macros, or Python, but these take a long time to set up and often come with a steep learning curve. Even a simple task like extracting the day of the week can take far longer than expected.
The New York Times reports that “Data scientists, according to interviews and expert estimates, spend from 50 percent to 80 percent of their time mired in this more mundane labor of collecting and preparing unruly digital data, before it can be explored for useful nuggets.” Since then, data has become part of even more of our lives, and the data wrangling problem hasn’t gotten any better.
If you’re managing a sales team, it could be the Excel spreadsheet that powers your weekly sales meeting. Maybe it’s that enormous Google Sheet filled with feedback from the survey you’ve been running over the past month. If you’re a data scientist, it’s the feature engineering needed for machine learning.
Whether you are doing this in Excel, Google Sheets or Python, data prep is a laborious task that needs to be faster, simpler, and more powerful.
Repetitive, rule-based tasks like these are a perfect fit for automation, and we wanted to provide the easiest and most convenient way to do it. At Intersect Labs, we set out to build the world’s most advanced and easiest-to-use data prep tool. We designed a flow that lets you transform data 11x faster, no matter how simple or complex the transformation is.
Now you can create a sequence of modifications called a pipeline. You simply add the “Transform” block you want to add to your data pipeline from a gallery of transforms. Want to multiply two columns? Add the “Multiply Columns” block, done. Want to group by one column and sum the other columns’ values? Add the “Aggregation” transform block, done.
Say we want to get insights from the data we have on our customers' individual orders for 2020, and quickly produce an overview of each month's performance. So we’d like to:

1) Multiply the Quantity and Item Cost columns to get a Total Cost per order
2) Extract the month from the Order Date column
3) Group the orders by month and sum up the Total Cost

Since this is a repetitive task for us, we want to set up a pipeline that we can reuse in the future. Once the pipeline is set up, we can simply upload the next dataset and out come the results.
Loading your data in Pasteur
1) Set up an account on Pasteur if you haven't already
2) Create a new project
You can also connect your data with Google Sheets, MongoDB or SQL databases.
Let's look at the data
We need the Order Date, Quantity, and Item Cost columns to compute the monthly sales numbers.
Setting up the pipeline
Now we can start appending transforms to the pipeline. First, we want to multiply two columns: “Quantity” and “Item Cost”. We add the ‘Multiply Columns’ transform, select which columns to multiply, and give the new column a name, “Total Cost”.
If we click on ‘Apply Transform’ and set the Data Preview to ‘After’, we can see that the transform has added a new column, ‘Total Cost’, to the data in our pipeline.
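For readers who think in code: the ‘Multiply Columns’ transform is equivalent to a simple element-wise product in pandas. This is just an illustrative sketch with made-up sample values, not Intersect Labs’ implementation:

```python
import pandas as pd

# Hypothetical sample of the orders data, using the column names
# from this walkthrough
df = pd.DataFrame({
    "Quantity": [2, 1, 3],
    "Item Cost": [10.0, 25.5, 4.0],
})

# Equivalent of the "Multiply Columns" transform:
# Total Cost = Quantity * Item Cost
df["Total Cost"] = df["Quantity"] * df["Item Cost"]
```

In Pasteur, the same result takes a couple of clicks and no code.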
Now we would like to group the data by month. We can use the Order Date column (which is in European, day-first format) for this, and luckily, Intersect Labs handles this without breaking a sweat. Add the ‘Extract from Date/Time column’ transform to the pipeline.
This adds three columns: year, month, and day.
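The equivalent in pandas is to parse the dates (telling the parser they are day-first) and pull out the components. Again, a hypothetical sketch with sample dates:

```python
import pandas as pd

# Hypothetical Order Date values in European (day-first) format
df = pd.DataFrame({"Order Date": ["31/01/2020", "15/02/2020"]})

# dayfirst=True tells pandas that 31/01/2020 means 31 January
dates = pd.to_datetime(df["Order Date"], dayfirst=True)

# Equivalent of "Extract from Date/Time column": three new columns
df["Year"] = dates.dt.year
df["Month"] = dates.dt.month
df["Day"] = dates.dt.day
```

Getting the day-first parsing wrong is a classic source of silent errors in hand-written scripts, which is exactly the kind of detail a transform block handles for you.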
Lastly, let's group the orders on the month column.
This can be done with the “Aggregate Data” transform. First, we select which column to group our orders by; in our case this is the “Month” column. Then, we select which column to aggregate and how. We want to add up all the “Total Cost” values, so we select sum. There are a bunch of aggregations to choose from, such as the mean, standard deviation (std), maximum, and minimum. The mean gives us the average order size, which might be handy to include, so we select that one too.
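For comparison, the same group-and-aggregate step in pandas is a `groupby` with two aggregations. A minimal sketch with hypothetical values, assuming the Month and Total Cost columns from the earlier steps:

```python
import pandas as pd

# Hypothetical orders already carrying Month and Total Cost columns
df = pd.DataFrame({
    "Month": [1, 1, 2],
    "Total Cost": [20.0, 30.0, 10.0],
})

# Equivalent of the "Aggregate Data" transform: group by Month,
# then take both the sum and the mean of Total Cost
summary = df.groupby("Month")["Total Cost"].agg(["sum", "mean"]).reset_index()
```

The `sum` column is the total sales per month and the `mean` column is the average order size, the two numbers we wanted for the report.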
Great! We now have our total sales and average sales per month, straight out of our sales spreadsheet. Let’s export the full dataset, make a nice graph, get some coffee, and make our manager happy.
Download the transformed spreadsheet.
That’s it. In no time we have set up a data transformation pipeline using Intersect Labs that can quickly produce insights for your sales meeting. The best part is that we can reuse this pipeline every time we have a new list of orders: we simply apply it to a newly uploaded dataset and get fresh insights in under 5 minutes. Of course, this was just a simple example; there are over 50 transform blocks available to handle any data modification you need.
Try out the Intersect Labs pipeline yourself using your own dataset.