3 Key Learnings from the Webinar “Modern Data Warehousing without the Burden of ETL”

July 17, 2018

As the Eckerson Group’s data management practice lead, Dave Wells is a respected advisory consultant, educator, and research analyst dedicated to building meaningful connections throughout the path from data to business impact. That’s why we recently tapped Dave to co-present with me what turned out to be our most popular webinar to date—a webinar that discussed modern data warehousing without the burden of Extract Transform Load (ETL).

Initially a skeptic, Dave became a fan of Incorta’s approach after learning how our technology bypasses the troublesome ETL process to enable truly modern analytics. Incorta’s approach is completely new, and it encouraged Dave to rethink his stance on data warehousing and its role in the future of modern analytics. As a result, Dave developed three key learnings, which he presented in the webinar, that he believes anyone looking to modernize a data warehouse must acknowledge and embrace in order to succeed. Here they are.

KEY LEARNING #1: The data warehouse is not dead—but it is on life support.

Data warehouses promised us the ultimate in data organization for easy access, understandability, and a single version of the truth when combining data from multiple enterprise sources. Yet, after all these years, we still struggle with painful and time-consuming deployments, slow data capture, ETL that doesn’t scale, and latent batch processing.

But data warehouses persist. In a TDWI poll Dave ran, more than 70 percent of the 105 respondents said they operate two to five data warehouses, and fewer than 10 percent operate only one. In many cases, then, the data warehouse is not dead. But it’s clearly struggling. According to Dave, “Those data warehouses exist and continue to be operated because people and business processes depend upon them. With that reality, we have to accept that the data warehouse is alive, but my belief is it’s alive, but not alive and well.”

KEY LEARNING #2: The data warehouse must modernize to deliver the business value today’s leaders need.

Data warehouses might be important and business-critical, but, as they are, they simply can’t meet current needs. We see challenges in growth management: growth in data volumes, in the number of subject areas, and in the number of simultaneous users. We see extreme workload fluctuations as tech teams try to move increasingly large volumes of data into the data warehouse.

In addition, on-premises data warehouses add a significant burden to data center management, operations, and costs. We see processing bottlenecks and delays, and the restarting and rerunning of ETL processes. We see new projects to add subject areas or functionality to a data warehouse waiting upon more infrastructure, more processing capacity, or more data storage capacity before they can be completed.

Yet, despite its limitations and associated pain, we do want the answers and benefits a data warehouse does deliver—we just don’t want or need the traditional data warehouse. For instance:

We want to capture time-variant data for time series analysis and trending, but we don’t want to do all of the integration, data cleansing, and heavy lifting of ETL and data movement to make that happen.
We want to capture data history as it happens—then integrate, aggregate, cleanse, and store it for analysis—but we don’t want batch ETL that doesn’t scale and batch processing that’s inherently latent.
We want to understand interconnections between data and identify relationships within data—so we can navigate when we need to navigate, and integrate and aggregate on demand—but we don’t want to cast all of those data relationships into a rigid relational or dimensional schema in order to store the data.
We want high-speed queries, but we don’t want a painful, long, expensive deployment to meet our performance requirements (if that’s even possible).
We want a high-performing analytics platform, but we don’t want to exert extra effort to deploy it—and redeploy it—every time something changes.

KEY LEARNING #3: The challenge of modernization is retaining the benefits of the data warehouse while eliminating the pain of managing it.

So how exactly do we do this?

To modernize data warehousing, we need to organize the right data in the right ways to answer whichever questions occur at whichever time. That means we need unlimited dimensionality and flexibility enabled by four things: efficient data capture, efficient data relationships, efficient queries, and efficient deployment.

Let’s take a quick look at each one.

Efficient data capture.

With traditional data warehouses, data moves too slowly. Instead, we need to rapidly capture time-variant data that gives us a reliable record of enterprise history to support time series analysis, and to do so without data modeling that produces rigid schemas, which must be refactored whenever source data changes.

To modernize data warehousing, we need to capture and store data quickly, without a lot of transformation, without recasting schema, and without introducing unnecessary data latency. Data will be available when it’s needed, data warehousing costs will be eliminated, and all of the constraints of rigid star schema will be removed. No longer will limits be imposed on the number of business questions that can be answered.
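
To make the idea of capturing history without recasting schema concrete, here is a minimal sketch in Python. It is an illustration only, not a description of any product: the CaptureLog class, its methods, and the "crm.accounts" source name are all hypothetical. Each source record is stored unmodified alongside a capture time, so a new source column needs no schema change and earlier versions remain available for time series analysis.

```python
from datetime import datetime, timezone


class CaptureLog:
    """Hypothetical append-only log: records are stored unmodified with a capture time."""

    def __init__(self):
        self._rows = []

    def capture(self, source, record, captured_at=None):
        # No transformation and no fixed schema: the record is copied as-is.
        ts = captured_at if captured_at is not None else datetime.now(timezone.utc)
        self._rows.append({"source": source, "captured_at": ts, "record": dict(record)})

    def history(self, source, key_field, key):
        """All captured versions of one record, oldest first."""
        rows = [r for r in self._rows
                if r["source"] == source and r["record"].get(key_field) == key]
        return sorted(rows, key=lambda r: r["captured_at"])

    def as_of(self, source, key_field, key, when):
        """Latest version captured at or before `when`, or None if there is none."""
        versions = [r for r in self.history(source, key_field, key)
                    if r["captured_at"] <= when]
        return versions[-1]["record"] if versions else None


log = CaptureLog()
log.capture("crm.accounts", {"id": 7, "tier": "silver"},
            captured_at=datetime(2018, 1, 1, tzinfo=timezone.utc))
# The source later gains a "region" column; the log accepts it without a schema change.
log.capture("crm.accounts", {"id": 7, "tier": "gold", "region": "EMEA"},
            captured_at=datetime(2018, 6, 1, tzinfo=timezone.utc))

march_view = log.as_of("crm.accounts", "id", 7, datetime(2018, 3, 1, tzinfo=timezone.utc))
july_view = log.as_of("crm.accounts", "id", 7, datetime(2018, 7, 1, tzinfo=timezone.utc))
```

In this sketch, march_view reflects the account as it was first captured, while july_view includes the later change and the new column.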

Efficient, seamless data relationships.

Traditional data warehousing integrates and aggregates all of the data that might be needed—even though only a small percentage of it will ever be used—and requires a variety of connectors to access data stored in many different source types. As a result, updating and maintaining ETL is a nightmare.

Instead, for modern data warehousing, we need to easily connect to data wherever it resides and in whichever format it exists, and only integrate and aggregate data on demand, when required to answer a query. So we need technology, such as the Parquet columnar file format, that supports the on-demand, efficient, high-speed copying of raw, unmodified data from its original source. We need table-to-table mapping to understand the relationships between tables, including automated mapping where existing joins are visible. And we need to be able to look at SQL to see which kinds of joins occur, while also doing additional mapping for relationships unrelated to source data. All of these things enable the on-demand data integration and aggregation that modern data warehousing requires.
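
As a rough illustration of integrating and aggregating only at query time, the sketch below uses Python’s built-in sqlite3 module. The table names, the RELATIONSHIPS mapping, and the total_by helper are assumptions made up for this example. The raw tables are loaded unmodified, the table-to-table relationship is kept as metadata, and the join and aggregation happen only when a question is asked.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Raw tables copied as-is from their (hypothetical) sources; nothing is pre-joined.
conn.executescript("""
    CREATE TABLE raw_orders (order_id INTEGER, customer_id INTEGER, amount REAL);
    CREATE TABLE raw_customers (customer_id INTEGER, region TEXT);
    INSERT INTO raw_orders VALUES (1, 10, 120.0), (2, 11, 80.0), (3, 10, 50.0);
    INSERT INTO raw_customers VALUES (10, 'EMEA'), (11, 'APAC');
""")

# Table-to-table mapping stored as metadata: (child, parent) -> shared join column.
RELATIONSHIPS = {("raw_orders", "raw_customers"): "customer_id"}


def total_by(child, parent, group_col, measure):
    """Join and aggregate on demand, using the stored relationship mapping."""
    key = RELATIONSHIPS[(child, parent)]
    # Identifiers come from trusted metadata here, never from user input.
    sql = (
        f"SELECT p.{group_col}, SUM(c.{measure}) "
        f"FROM {child} AS c JOIN {parent} AS p ON c.{key} = p.{key} "
        f"GROUP BY p.{group_col} ORDER BY p.{group_col}"
    )
    return conn.execute(sql).fetchall()


totals = total_by("raw_orders", "raw_customers", "region", "amount")
# totals == [("APAC", 80.0), ("EMEA", 170.0)]
```

The design choice worth noting is that adding a new relationship means adding one entry to the mapping, not rebuilding a load job.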

Efficient, high-speed queries.

For those companies with traditional data warehouses and perhaps a data lake, the data lake and data warehouses are disconnected, independent silos of information. These companies are constrained, forced to limit themselves to fewer than 10 dimensions per star schema. They can only process data stored in those relational databases; adding a new data source is arduous, with requests often stuck waiting in a technical team’s queue while probing business questions go unanswered. And queries run slowly, with queries of large data sets often slowing other critical business applications in the process.

Modern analytics requires unlimited dimensionality and flexibility, and new heights of speed and performance. We need a high-speed query engine that brings together all of the right data and aggregates it in the right ways with any data source, in any form and format, in real time. We need flexible, adaptable dimension hierarchies for data aggregation, rollup, and drill-down. We need fast query response times, regardless of data volume. And we need to aggregate and integrate data in the middle of a query without degrading the response time or the user experience.
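
For readers less familiar with rollup and drill-down along a dimension hierarchy, here is a small Python sketch. The HIERARCHY tuple, the sample sales rows, and the rollup function are invented for illustration and say nothing about how a real query engine is built. Rolling up aggregates at a coarser level of the hierarchy; drilling down re-aggregates at a finer level within a chosen parent.

```python
from collections import defaultdict

# Hypothetical geography hierarchy, coarsest level first.
HIERARCHY = ("region", "country", "city")

sales = [
    {"region": "EMEA", "country": "DE", "city": "Berlin", "amount": 100},
    {"region": "EMEA", "country": "DE", "city": "Munich", "amount": 40},
    {"region": "EMEA", "country": "FR", "city": "Paris", "amount": 60},
    {"region": "APAC", "country": "JP", "city": "Tokyo", "amount": 90},
]


def rollup(rows, level, **filters):
    """Sum `amount` at the given hierarchy level, optionally within a parent value."""
    depth = HIERARCHY.index(level) + 1
    totals = defaultdict(int)
    for row in rows:
        if all(row[name] == value for name, value in filters.items()):
            # The group key is the path from the top of the hierarchy down to `level`.
            totals[tuple(row[name] for name in HIERARCHY[:depth])] += row["amount"]
    return dict(totals)


by_region = rollup(sales, "region")                         # rollup to the top level
emea_by_country = rollup(sales, "country", region="EMEA")   # drill down into EMEA
```

Here by_region holds one total per region, and emea_by_country breaks the EMEA total into its countries.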

And we can’t ignore data governance during this process. We’re dealing with private and sensitive data—like personally identifying information—so query processes need to be secure and governance-aware.
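
One simple way to picture a governance-aware query is column masking by role. The Python sketch below is an assumption-laden example: the PII_COLUMNS set, the role names, and the governed_query function are hypothetical, and a real deployment would rely on its platform’s own access controls. Sensitive columns are masked unless the caller’s role is cleared to see them.

```python
# Hypothetical policy: which columns are PII and which roles may see them unmasked.
PII_COLUMNS = {"email", "phone"}
ROLES_WITH_PII_ACCESS = {"privacy_officer"}


def mask(value):
    """Hide all but the last two characters of a sensitive value."""
    text = str(value)
    return "*" * max(len(text) - 2, 0) + text[-2:]


def governed_query(rows, columns, role):
    """Project `columns` from `rows`, masking PII unless the role is cleared."""
    cleared = role in ROLES_WITH_PII_ACCESS
    result = []
    for row in rows:
        result.append({
            col: row[col] if cleared or col not in PII_COLUMNS else mask(row[col])
            for col in columns
        })
    return result


customers = [{"id": 1, "email": "ana@example.com", "region": "EMEA"}]
analyst_view = governed_query(customers, ["id", "email", "region"], role="analyst")
officer_view = governed_query(customers, ["id", "email", "region"], role="privacy_officer")
```

In this sketch the analyst still gets the non-sensitive columns needed for analysis, while only the cleared role sees the email address in full.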

Efficient, pain-free deployment.

With traditional data warehousing, deployment is anything but fast and pain-free. Time-consuming planning and processing eat up budget, energy, and resources, resulting in an analytics architecture that’s rigid and resistant to frequent, easy change.

Instead, what we need is a scalable, elastic data management and delivery environment that can evolve as needed. We need the ability to gracefully blend modern analytics or data warehouse technology into an existing data management ecosystem, without having to rebuild any infrastructure. We need to be given the option of cloud/multi-cloud, on-premises, or hybrid deployment. And we need technology that supports on-premises deployment via low-cost, commodity hardware and also common cloud platforms, such as Amazon Web Services (AWS), Google Cloud Platform (GCP), and Microsoft Azure.

CONCLUSION

To change the game and truly modernize data warehousing, we need to eliminate the heavy burden ETL imposes on us and capture data history as it happens—but integrate and aggregate it on demand, when needs occur—so we can get insights when we need them, without delay.

We need to organize the right data in the right ways to answer whichever questions occur at whichever time, and do this by:

breaking down data silos;
interconnecting data lakes, data warehouses, and any other data sources; and
dealing with data in any form and format.

In summary, we need unlimited dimensionality and flexibility enabled by four things: efficient data capture, efficient data relationships, efficient queries, and efficient deployment. And, according to Dave Wells from the Eckerson Group, “… Incorta has embraced all four of those critical components of modern data warehousing.”

That’s why all of this is possible now, with Incorta.

Want to hear more about Incorta from Dave Wells and the Eckerson Group? Watch the webinar recording in its entirety or download the Eckerson Group analyst report “Modern Data Warehousing without the Burden of ETL.”