We are in a very special period in the history of technology, where new and real innovation in data management is in the news every day. It all has me worried, however. It reminds me of “Flowers for Algernon,” an award-winning short story which was made into a good movie called Charly. The storyline is poignant: a mentally disabled man is given an experimental surgery which at first improves his cognitive capabilities modestly, then continues to progress and turns him into a genius. Unfortunately, the effect is not sustainable, and before long he reverts to his original state. I fear that if we do not manage the introduction of data lakes and metadata well, our companies may very well wind up “reverting” and never realizing the benefits.
The ability to remember our past experiences is a special cognitive capability. It drives nostalgia as we long to relive that sweet event from our childhood; it also helps us to not repeat our past mistakes. However, without the ability to understand what our memories mean, they would just be a jumble of images and sounds. “Corporate Memory” is a special (and today, rare) capability, which allows an enterprise to answer questions about complex past events, to learn from past successes and failures, and to explain past behavior to regulators. Data lakes offer a better opportunity than ever before to preserve corporate data memory, but without a preservation of the understanding, the data lake is not much more than a blunt instrument for storage.
Metadata offers that required understanding. Metadata includes the basic business and technical definition of the data item, as well as the lineage (where the data is sourced from) and the operational profile (when it was last loaded, the measured data quality). It can also include references to related fields, the overall schema and taxonomy, and the history of changes for the data element.
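To make the attributes above concrete, here is a minimal sketch of what a single metadata record might hold. The field names and sample values are illustrative assumptions, not any particular vendor's schema:

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class MetadataRecord:
    # Basic business and technical definition of the data item
    name: str
    business_definition: str
    data_type: str
    # Lineage: where the data is sourced from
    source_system: str
    # Operational profile: when it was last loaded, measured quality
    last_loaded: date
    quality_score: float  # e.g., fraction of rows passing validation
    # References to related fields and the history of changes
    related_fields: list = field(default_factory=list)
    change_history: list = field(default_factory=list)

# A hypothetical record for a customer identifier field
record = MetadataRecord(
    name="customer_id",
    business_definition="Unique identifier assigned at customer onboarding",
    data_type="VARCHAR(36)",
    source_system="CRM nightly batch extract",
    last_loaded=date(2016, 4, 1),
    quality_score=0.998,
    related_fields=["account_id", "household_id"],
    change_history=["2014-07: widened from VARCHAR(12) to VARCHAR(36)"],
)
print(record.name, record.quality_score)
```

Even a simple structure like this captures the three layers described above: definition, lineage, and operational profile.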
Metadata has been an under-developed, under-appreciated and under-utilized data service for a number of reasons. Most operations personnel and analysts have a basic understanding of the meaning of the data they use every day, so they view metadata as superfluous. Also, since metadata development touches everything from off-the-shelf transactional systems to the sourcing of data, the data warehouse, reporting, and data quality sub-systems, it becomes difficult to know where to start and how to define success. In addition, vendor platforms have had shortcomings around integration, interoperability and usability. All of this has created an institutional bias: the general perception is that the benefit is low, and metadata projects get placed on the back burner. Lately, there has been renewed interest for a few reasons: Data Governance initiatives have made the case for the linkage between stewardship and metadata; vendor offerings have improved considerably; and most importantly, enterprises are looking to make sure that their data lake investment pays off.
What is the corporate memory problem? With a data lake, we can theoretically re-create a compliance report from 5 years ago using the original source data, and re-analyze that source data to see if there is something we missed. The corporate memory problem arises when a company is trying to “remember” an event from a prior day, month or year, and the context of the data has changed: the source has changed; the data quality is different; the meaning has changed. Metadata, when rendered properly, helps a company to remember these prior interpretations and meanings. Adding to the problem, the unstructured and semi-structured data which a data lake is likely to hold has a very different set of attributes than our current structured data. What is important to know about a “tweet” from five years ago from a metadata perspective? A video clip? A recording of a customer service interaction? It is like having a bunch of 8-track tapes without an 8-track player. I can tell you about that Bachman Turner Overdrive song, but I can’t actually play it for you.
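One way to picture the fix is to keep metadata as dated snapshots rather than a single current record, so that a five-year-old report can be read against the interpretation that was in effect at the time. This is a simplified sketch under assumed field names, not a description of any specific product:

```python
from datetime import date

# Hypothetical history of metadata snapshots for one field, oldest first.
# Each snapshot records the source and meaning in effect from its
# effective_date until the next snapshot supersedes it.
snapshots = [
    {"effective_date": date(2011, 1, 1),
     "source": "legacy billing system",
     "definition": "active accounts, billed monthly"},
    {"effective_date": date(2014, 6, 1),
     "source": "new CRM platform",
     "definition": "active accounts, including trial users"},
]

def metadata_as_of(snapshots, as_of):
    """Return the snapshot that was in effect on the given date."""
    current = None
    for snap in snapshots:
        if snap["effective_date"] <= as_of:
            current = snap
    return current

# Re-creating a 2012 report should use the 2011 interpretation,
# not today's meaning of the same field.
print(metadata_as_of(snapshots, date(2012, 3, 15))["source"])
# prints "legacy billing system"
```

Without this kind of history, the old numbers survive but the old meaning does not, which is exactly the corporate memory problem.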
So there is now a critical issue: the lack of maturity of metadata platform implementations, and the lack of plans to extend these platforms to the enterprise data lake, will interfere with the ability to derive the full value of a data lake. Before the data lake, you could survive without a robust metadata platform, because the schema-rich environment offered most of the structure you needed to have meaning. This also meant that to understand the data from five years ago you needed to either keep the schema from that time period or periodically re-transcribe the older data to the newer schema, which created “generational loss” of complete understanding of the prior data. This generational loss can obscure a real perception of what was actually going on five years ago, and hamper a company’s ability to relate that story completely to regulators. Now, with the data lake offering data dissociated from schema, the metadata is the most essential ingredient: the context and the framework which allow you to understand with “high fidelity” the event from years ago.
So now what? I can offer a few guiding points:
- If you haven’t already, invest in a traditional metadata tool which includes self-discovery features (or invest in a separate auto-discovery tool on top of your existing metadata platform). The better you can get your metadata house in order, the better prepared you will be for the changes required for the data lake.
- Ask your current metadata platform vendor(s) about their ability (or their plans) to do automated discovery and cataloging for a data lake
- Look at leading edge platforms which can help to catalog data lake information automatically and consider how to integrate with your existing metadata
- Work with subject matter experts and advisors to create a metadata roadmap which includes development of data lake metadata capabilities
- Conduct proofs of concept and pilots with your data lake (including access and analysis of older historical data) to prove out the integrations and capabilities and ensure that you can reach the goals for your organization.
Most companies are waking up every day with some level of diminished cognitive capability. With the proper implementation of a data lake platform, coupled with the implementation of integrated metadata, companies can put these handicaps behind them permanently and be truly prepared to meet their future challenges with the full benefit of their past experiences.
Elevondata (www.elevondata.com) is a leading edge data management advisory and data lake solutions company. Vin Siegfried is one of the founders and can be reached at vsiegfried@elevondata.com.