Data munging, or data wrangling, is the process by which analysts gather, transform and clean big datasets.
It’s also one of the most cumbersome and outdated parts of any data-informed organization. It burdens analysts unnecessarily, and delays data use.
But it doesn’t have to be this way. A new generation of tools and processes is emerging to streamline and automate data work and to enable seamless collaboration around data.
Here’s an inside tour of the past, present, and future of data munging.
What is Data Munging?
The primary goal of data munging is to make “raw” data more valuable for a variety of downstream processes such as analytics, data visualizations, or training machine learning models.
Often, the munging process is done manually, by creating custom scripts or using spreadsheets to sort and make the data meaningful for further analysis.
The terms “data wrangler” and “data munger” were coined 20 years ago by, believe it or not, CNN, which defined the role as “a person who’ll hunt data information for news stories”. The phrases also appeared in science around the same time, referring to the work of a storage administrator who handled large datasets.
Twenty years later, this process has changed little, introducing major challenges in today's data-driven companies.
Most prominently, the technical and logistical hurdles of data wrangling take up to 80% of data analysts’ time, leaving them with very little time to do the job they were actually hired to do: gathering insights.
Obstacles for data-driven organizations
There’s a simple justification for how slow data wrangling is: it’s time well spent, providing the analyst with deeper knowledge of the data’s ins and outs.
But astute readers know that the current process is deeply flawed. It can be very error-prone and time-consuming. Even the slightest change requires the intervention of the data/analytics team. The result? Organizations become very slow in making the right decisions and often fall behind the competition.
Moreover, since data munging is the first step in the data processing journey, it inhibits organizations from becoming truly data-driven.
So what can companies do?
Easy consumption of data
How do organizations get to easy consumption of data?
First and foremost, organizations invest hours, if not days, in gathering, transforming and cleaning data, a.k.a. wrangling it. Only after that can teams and individuals choose tools to further process the data and make it easy to consume. For example, some teams may choose Excel to build reports and visualize data, while others choose more “sophisticated” tools like Power BI and Tableau.
However, being able to consume the data doesn’t guarantee “easy consumption”. The path is often difficult, and it can be overwhelming for anyone in the organization who wears an analyst hat.
On top of that, there are hundreds of people across departments who need to consume data sets in their day-to-day work. These recipients can be data architects, data scientists or business users who consume the data through a report or a dashboard.
Bad data quality or not getting the insights on time can have a big impact on how the company grows, given the many departments that are influenced by this. Some examples where timely data insights influence decisions are:
- Customer Engagement: Identify low engagement and act on it
- Growth: Help marketers optimize their campaigns based on ad performance
- Product: Innovate and upsell services based on product performance reports
- Sales: Prioritize sales leads based on qualifying data factors
Access to data
Oftentimes, data analysts are burdened with requests from various departments asking for a particular dataset or insight. These requests may require going back and manipulating the data again, adding to the analyst’s pile of work. Without a way to easily automate data workflows, many requests are inevitably repetitive, wasting analysts’ time and guaranteeing constant delays for the requester. Everyone loses.
Some organizations, like Envoy, handle these repetitive requests by building internal data products that give other people in the organization access to data. In the words of their data team manager, Arvind Ramesh:
“The advantages of data products is that they enable other people to more easily access data and use it in their workflow without us having to go back and forth for every piece of insight”.
Envoy has an internal platform called Gainsight that showcases interactive dashboards on how customers are interacting with their platform in real time. Imagine if your customer success team had a unified data platform like this, with an intuitive, interactive UI that is easy to understand. Your data analyst could then focus on high-impact work and bring insights to where they are needed most.
Data Munging roadblocks in E-commerce
Imagine that you have an e-commerce business. Every day you get thousands of orders from all over the world.
You use Shopify as your e-commerce platform and your data analyst usually downloads the sales data in an Excel sheet so they can discover anomalies, surface regular reports and enable the fulfillment team to send orders out the door efficiently.
However, the extracted data from Shopify is structured very differently from the way the analyst needs to consume it. Shopify reports transaction data, but the data needs to be aggregated on a customer level. Your analyst likely runs a couple of steps in Excel to first prepare the data in the format she wants to analyze.
Those steps can include cleaning, joining tables, restructuring, aggregating and even enriching the dataset so it is ready for analysis. Then, and only then, can she even think about building graphs and dashboards, manually or in third-party tools.
Here we feel the burden of data munging yet again: 80% of the time is spent on data normalization, leaving only 20% for actual analysis and performance dashboard creation.
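To make those steps concrete, here is a minimal sketch in Python of rolling transaction-level rows up to one record per customer. The column names (order_id, customer_email, total) and the sample rows are illustrative assumptions, not the actual Shopify export schema, and the cleaning rules are just one plausible choice.

```python
import csv
import io
from collections import defaultdict

# Hypothetical transaction-level export; these column names are illustrative
# assumptions, not the actual Shopify export schema.
SAMPLE_EXPORT = """order_id,customer_email,total
1001,ana@example.com,20.00
1002,Ana@Example.com ,15.50
1003,bo@example.com,
1004,bo@example.com,9.99
"""


def aggregate_by_customer(csv_text):
    """Roll transaction rows up to one summary record per customer."""
    summary = defaultdict(lambda: {"orders": 0, "revenue": 0.0})
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Cleaning: normalize the customer key so casing and stray
        # whitespace do not split one customer into two.
        email = row["customer_email"].strip().lower()
        total = row["total"].strip()
        # Cleaning: skip rows with no customer or no amount.
        if not email or not total:
            continue
        # Aggregating: count orders and sum revenue per customer.
        summary[email]["orders"] += 1
        summary[email]["revenue"] += float(total)
    # Restructuring: round revenue and return a plain dict.
    return {
        email: {"orders": s["orders"], "revenue": round(s["revenue"], 2)}
        for email, s in summary.items()
    }


customer_summary = aggregate_by_customer(SAMPLE_EXPORT)
```

Once written, a function like this can be rerun on every new export instead of repeating the same steps by hand in a spreadsheet.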
If only she had an app that could import the data and run her operations on it regularly and automatically!
Suddenly her time could be freed up so she could actually play to her strengths -- helping drive data-driven decisions!
Big pressure on data analysts
As with the e-commerce example, managers often push data teams to provide performance reports on time, without understanding the mundane and slow process of data wrangling behind them. In those cases, analysts may become frustrated, mostly because they aren’t doing the work they were hired to do: gathering insights.
Decision makers need to create a positive environment for data analysts: one that avoids overburdening them and promotes better practices and tools that help them do their jobs well.
In 2018, LinkedIn reported, based on data from its platform, a shortage of 151,717 people with data science skills in the U.S. IBM data, meanwhile, projected that data analyst openings would increase by 364,000 in 2020 and soar even more in the years to come.
This landscape looks bleak. Where do we go from here?
The path towards efficient data workflows
Working with data is a very hands-on and experimental process and it will continue to involve people.
However, “hands-on” doesn’t necessarily mean that analysts need to be overwhelmed with a lot of engineering work, stitching data sources, and repeating a lot of the same data cleaning tasks. Construction is very “hands-on” -- but what would you think if you saw a site still using wheelbarrows, not dump trucks?
Organizations should pursue automated or semi-automated data workflows that can handle frequent use cases and automate most of the redundant data cleaning and merging tasks. They can choose to build these workflows internally or rely on a third-party tool. Having them in place will help companies provide easy access to and consumption of data across departments.
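One way to picture a semi-automated workflow is as an ordered list of small, reusable steps that is defined once and rerun on every new batch of data. The sketch below is only an illustration under that assumption; the step names, the WEEKLY_REPORT_STEPS list and the sample batch are hypothetical and do not describe any particular product.

```python
from functools import reduce


def drop_incomplete(rows):
    """Cleaning step: remove rows that have any empty value."""
    return [r for r in rows if all(v not in ("", None) for v in r.values())]


def dedupe(rows):
    """Cleaning step: keep the first occurrence of each identical row."""
    seen, unique = set(), []
    for r in rows:
        key = tuple(sorted(r.items()))
        if key not in seen:
            seen.add(key)
            unique.append(r)
    return unique


def run_workflow(rows, steps):
    """Apply each step in order, feeding the output of one into the next."""
    return reduce(lambda data, step: step(data), steps, rows)


# Hypothetical workflow definition: defined once, reused on every new batch.
WEEKLY_REPORT_STEPS = [drop_incomplete, dedupe]

batch = [
    {"region": "EU", "sales": 10},
    {"region": "EU", "sales": 10},
    {"region": "", "sales": 5},
]
cleaned = run_workflow(batch, WEEKLY_REPORT_STEPS)
```

Keeping each step as a plain function means a new request is often just a new combination of existing steps rather than another round of manual cleanup.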
The path towards efficient data workflows is not only about finding the best data analysts, but also equipping the business teams with the right tools and support they need.
Only then can companies really utilize the power of data analysis and start making decisions that are truly powered by data.
If you are looking for a solution that can help you build your semi-automated data workflows, let us know.
We built a visual builder that gets you from data to decisions faster.
You'll be able to import data, automate complex data operations, and trigger actions with pre-built components in just a few minutes.