Data governance means different things to different people. To some, it involves security and access to data. To others, it’s about how consistent the data is from system to system. And to others, it revolves around master data management. In a large, multifaceted organization, it could mean all of the above.
Data governance originated with transactional systems, such as point-of-sale and shop floor management, and was primarily centered on security and access. As new transactional systems such as ERPs and CRMs came online, naming conventions and standardization across processes were needed to minimize redundant data and duplicative data entry. Ensuring data quality was also important, so automated checks were placed on inputs to keep entry errors from being passed to downstream processes.
A well-designed data governance program today might be run by a dedicated data management team. It could also include executives and business leaders chartered with setting the rules and processes for managing data availability, usability, integrity, and security across the entire enterprise.
Regardless of how it’s done, the aim of data governance is to provide the organization with secure, high-quality, high-confidence data to fuel business processes, analysis, and decision making.
However, traditional data preparation practices and technical constraints have limited data governance within analytic pipelines. This has prevented organizations from using certain types of data and from doing certain types of analysis.
Breaking the Chain
Historically, we haven’t tapped directly into transactional systems for analytics. These are the systems you use to run your business. You wouldn’t have people running analytic queries on the data while customers are standing at cash registers trying to buy things, for example. This could slow down or even crash the system.
To prevent that, we extract the data and put it into a data warehouse where we can aggregate and join data from many sources, transform it into something that’s optimized for analytics, and load it into an analytic engine.
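To make that transformation step concrete, here is a toy example (the table and column names are invented for illustration) of the kind of aggregation a traditional pipeline might perform on an extracted point-of-sale feed:

```python
import pandas as pd

# Hypothetical raw extract from a point-of-sale system: one row per transaction.
sales = pd.DataFrame({
    "txn_id":   [1001, 1002, 1003, 1004],
    "store_id": ["S01", "S01", "S02", "S02"],
    "cashier":  ["amy", "bob", "amy", "cara"],
    "sku":      ["A1", "A1", "B2", "C3"],
    "amount":   [19.99, 19.99, 5.49, 12.00],
    "sold_at":  pd.to_datetime(["2024-03-01", "2024-03-01",
                                "2024-03-02", "2024-03-02"]),
})

# A typical warehouse-style transform: roll the detail up to daily store revenue.
daily_revenue = (
    sales.assign(sold_on=sales["sold_at"].dt.date)
         .groupby(["store_id", "sold_on"], as_index=False)["amount"]
         .sum()
)

# txn_id, cashier, and sku are gone from the result, so a suspicious total
# can no longer be traced back to the rows that produced it.
print(daily_revenue)
```

Only the rolled-up numbers move downstream, which is exactly where the problem described next begins.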
However, in the transformation process, which creates a derivative of the original data, you lose the lineage and fidelity of your data. In other words, you lose the ability to know where the data came from, and you lose the details. You break the chain of custody.
This has several implications for analytics:
- When a number in a report or chart doesn’t look right, it is difficult or impossible to audit. You can’t dig into it to validate it, understand it more deeply, or investigate an anomaly.
- Your ability to segment the data by finer attributes (as you could in the original transactional system) is partially or completely lost.
- The analytic system may compute rollups very differently from the way the transactional system computes them.
- You may not have access to all of the data due to security concerns.
Let’s unpack that last one, security, because this limitation is so common that most people have come to just accept it, and yet the reasons and implications may not be completely obvious.
Dark Data
In a transactional system, data security is enforced through the kinds of things a user can do and the types of data they can access. For example, while everyone can see the organization chart, only HR managers can make changes in the HR system.
In addition, security is often enforced at the item level. Think of rows, columns, and individual cells in a spreadsheet. There are rules governing who can see each piece of data. Health data is a good example: only a few people need to see your prescription history, a somewhat larger group needs to know your health plan coverage, and to everyone else you’re just a statistic.
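As a simple sketch (the roles, columns, and values here are invented for illustration), item-level security can be thought of as per-role rules about which columns and rows each user may see:

```python
import pandas as pd

# Invented health-records table with columns of differing sensitivity.
records = pd.DataFrame({
    "employee_id":   [1, 2, 3],
    "plan_coverage": ["PPO", "HMO", "PPO"],
    "prescriptions": ["drug-a", "drug-b", "drug-c"],
})

# A stand-in for item-level policy: which columns each role may see.
VISIBLE_COLUMNS = {
    "pharmacist":   ["employee_id", "plan_coverage", "prescriptions"],
    "benefits_rep": ["employee_id", "plan_coverage"],
    "analyst":      ["plan_coverage"],  # no identifiers: to the analyst you're a statistic
}

def view_for(role: str, df: pd.DataFrame) -> pd.DataFrame:
    """Return only the columns the given role is allowed to see."""
    return df[VISIBLE_COLUMNS[role]]

print(view_for("analyst", records))
print(view_for("pharmacist", records))
```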
But these security policies don’t necessarily carry over to analytic systems, because of the way we’ve been building analytics pipelines. By transforming and reshaping data to optimize it for analytics, we lose the detail needed to replicate the security controls present in the transactional system. This means we limit the data we can analyze, the kinds of analysis we can do, and the trust we have in the data.
So one of two things happens with sensitive data: either it just doesn’t go into an analytic system and never gets analyzed, or it gets put into a system restricted to a very small number of people. This frequently happens with HR data, sourcing information for components that are trade secrets, and personally identifiable information (PII), for example.
These are important data sets, so not being able to analyze them has significant implications. Say you conduct employee satisfaction surveys on a regular basis and want to see whether there’s a correlation between satisfaction and salary. Or you want to do a vendor cost analysis but don’t know where the goods are coming from. You can’t, because there are too many restrictions and controls on these data sets.
This leaves a lot of dark corners in enterprise analytics where data can’t be shared or blended with your other organizational data, or even seen by most people.
Adapting Data Governance to Analytics
Fortunately, times and technology have changed, and we can approach data preparation and storage for analysis in a new way that preserves good data governance while providing access to data. The basic principles are:
- Extract data from transactional systems and store it directly in a data lake, with no transformation or improvements.
- Tightly control who transfers this data, when, and how it is loaded.
- Preserve the data in its original state. Keep all the details.
- Apply to the data the same security controls and rules that are present in the source system.
- Only transform, enrich, or aggregate data at read time, when it is actually being used for analysis.

In this way, the data’s origin is known and the original data governance is maintained. No transformation occurs until the data is actually used. This makes the data easy to audit and verify, and it is relatively easy to implement new use cases without having to reengineer the data pipeline. The sketch below illustrates the read-time idea.
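Here is a rough sketch of those principles (the salary table and role rules are invented, and this is not any particular product’s API): the raw extract stays untouched, and security filtering and aggregation happen only when a query is served.

```python
import pandas as pd

# Stand-in for a raw extract landed in the lake untouched (in practice this
# would be read straight from lake storage, e.g. a parquet file).
raw_salaries = pd.DataFrame({
    "employee_id": [1, 2, 3, 4],
    "department":  ["engineering", "engineering", "hr", "hr"],
    "salary":      [120_000, 98_000, 85_000, 91_000],
})

# Row-level rules mirrored from the source system (invented for illustration).
ROW_RULES = {
    "hr_manager":   lambda df: df,                                      # sees everything
    "dept_manager": lambda df: df[df["department"] == "engineering"],   # own department only
}

def average_salary_by_department(role: str) -> pd.DataFrame:
    """Apply security and aggregation only at read time; the raw rows stay untouched."""
    allowed = ROW_RULES[role](raw_salaries)
    return allowed.groupby("department", as_index=False)["salary"].mean()

print(average_salary_by_department("dept_manager"))
print(average_salary_by_department("hr_manager"))
```

Because nothing is pre-aggregated, supporting a new rollup or a new role is a matter of adding a query-time rule rather than rebuilding the pipeline.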
“Are You Out of Your Mind?”
Now, if you tell a data professional this afternoon that we can all run analytics on unprocessed original data (aka “third normal form” data), they will say you are out of your mind. That’s because traditional data pipelines, built on ETL and data warehouses, can’t do this.
But with the right foundation — including a hybrid architecture leveraging data lakes, columnar storage, an in-memory analytics engine, and some advanced optimization techniques — we can shift the paradigm and gain the best of all worlds.
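As a toy illustration of what querying third-normal-form data at read time can look like (generic invented tables and a general-purpose dataframe library, not any particular vendor’s engine), the joins are resolved when the question is asked rather than pre-flattened by ETL:

```python
import pandas as pd

# Invented third-normal-form tables, as they might land in the lake unmodified.
customers   = pd.DataFrame({"customer_id": [1, 2],
                            "region":      ["EMEA", "APAC"]})
orders      = pd.DataFrame({"order_id":    [10, 11, 12],
                            "customer_id": [1, 1, 2]})
order_lines = pd.DataFrame({"order_id": [10, 10, 11, 12],
                            "amount":   [50.0, 25.0, 40.0, 30.0]})

# The joins and the rollup happen at query time, not in an upstream ETL job.
revenue_by_region = (
    order_lines
    .merge(orders, on="order_id")
    .merge(customers, on="customer_id")
    .groupby("region", as_index=False)["amount"]
    .sum()
)
print(revenue_by_region)
```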
Incorta is an example of just such a system.
Restoring Confidence and Possibilities
With Incorta, you bring in transactional data for analysis directly and from multiple sources, without having to reshape or pre-aggregate it. Analytical semantics and structure are applied as the data is queried. This brings a host of benefits:
- You gain confidence. Because you’re no longer working with a derivative of the data, you can drill in, explore, validate, and investigate. You can do root-cause analysis.
- You gain agility. You have flexibility to pull together different data sets and do more correlations. You have the ability to do more varied kinds of analysis.
- You gain more data. Since the data can be protected in exactly the same way that it was protected in the original transactional system, you can unlock some of that dark data from within your organization and use it for analysis.
- You gain consistency. Since the same data serves multiple workloads, such as reporting, analytics, and data science, everyone’s answers are the same.
While data governance was originally designed for transactional systems to yield quality data for analysis, the traditional way of building data pipelines limited the application of good data governance practices to analytics. This prevented organizations from doing deep analysis, or any analysis at all, on some very valuable data sets.
We now have technology that supports best practices such as keeping the original data untouched and applying data semantics and security at read time. Good data governance practices are preserved, and we can deliver more high-quality, high-confidence data for analysis and decision making.
Ready to find out what this new paradigm can do for your organization? Spin up a free trial today and try it for yourself.