DataStema - Blog - The Modern Data Stack - The Battle of David and Goliath?

January 12, 2022

Overview

In 2021, there was quite a big fuss on the web and different social media channels about the Modern Data Stack. We have articles, influencers, CEOs, investors, sites, conferences, and events evangelizing the Modern Data Stack wave.

Is the Modern Data Stack a new type of data architecture, a new way of working? Do we need a new set of tools, new data roles, and skills to implement and work with it?

So far, I believe it's a bit of everything. We have (or aim to have) an evolving and adaptable architecture, distributed and collaborative data teams following data mesh principles, commercial and/or open-source cloud tools, and the rise of the analytics engineer.

Do organizations need to invest in a modern data stack to stay relevant in the market?

A CEO may ask:

What is the value of having a modern data stack? What am I losing if I don't have or invest in a modern data stack?

If you are a startup or a small company, go for it. Trying and testing various data tools until you find the proper configuration that works is possible, affordable, and accessible with the Modern Data Stack, especially if you have blueprints and scripts to deploy these in minutes in your cloud provider of choice. As people, you can experiment with different roles and skills and find the best ways to work together and deliver value as a team.

For the more mature and established companies, there is a heavy legacy to bear from the start: the often-quoted claim that 80% of data projects fail. Some of you reading this article have been involved in such projects, as I have.

First Generation – The Data Warehouse

It started in the '90s with database replicas for reporting, then with the Inmon and/or Kimball data warehouse. Then we applied the same principles on data warehouse appliances when volumes, complexity, and performance were crucial.

Most of these projects started in IT, with little or no business involvement or C-level sponsorship (totally against what Inmon and Kimball advised, but anyway, only the geeks read the books! Right?). Companies invested millions in these projects, but ultimately, most failed to deliver what was initially promised.

Goliath won the battle, as David had not been born yet!

Second Generation – The Data Lake

Then someone (who?) told us about the 3Vs (or 4Vs or 5Vs or <n>Vs) of data, that unstructured data would rule the world (which is true), about schema-on-read, and that Hadoop was the key to everything.

And Hadoop is "free," is open-source, and runs on commodity hardware! What can go wrong with that? We throw away all the expensive appliances and put Hadoop in place! Right? Well, not really.

We found out about the Data Lake pattern, and that the Data Lake would not replace our beloved Data Warehouse. So we ended up, in the best-case scenario, with a Data Lake on Hadoop and the existing Data Warehouse still on the appliance. In the end, we had increased costs and complexity, late projects, unsatisfied users, and frustrated stakeholders. More millions turned to dust!

At least three good things started with the Hadoop movement, and their effects carry through to today.

First was the explosion of new data tools around the Hadoop ecosystem (e.g., Spark, Kafka, Hive, Druid). For the first time in data history, this was a battle between David (the new data companies started from Social Media companies and the academic environment) and Goliath (the "big" vendors we all know and can count on our fingers).
Second was the rise of the Data Scientist as a vital role in the organization and, by default, the primary beneficiary of the Data Lake.
Third, the Hadoop or Data Lake project got more support from business stakeholders and more C-level sponsorship, especially for the Data Science initiatives. At least we had learned some lessons from the failed data warehouse projects. There was a separation between IT and Business responsibilities, but not a very strict one. Ask any company with a Data Lake who the owner is, and don't be surprised if it is IT again in most cases.

But, what is the situation today?

I have lived and worked in the EU my whole life and career, and to this day, companies are still implementing appliances for their Data Warehouse and Hadoop for their Data Lake, on-premises or hybrid. Coupled with complex and overpriced legacy ETL and BI tools from the '90s, this makes a perfect picture of an organization lost in time and incapable of change, even if it has a CDO appointed to find a solution for this mess.

And speaking of CDOs: in most EU companies, CDOs were appointed for the Data Governance initiative (the GDPR scarecrow!) or just for the company image: "from now on, our company will become data-driven because we named a CDO". The CDO rarely sits on the board, usually reports to the CIO or CTO (IT again?), and gets a budget only when a data breach or data compliance issue arises. I am not trying to be sarcastic here. I'm just relating facts. I happen to know very talented CDOs who are frustrated day after day by the lack of support and budget, trying to get the job done under these conditions. I will always be a voice for them.

I don't know if we have similar patterns in other parts of the world, but if I had to guess, I would say YES. At the end of the day, if we look at the adoption lifecycle curve, there is an unspoken truth about these companies: they will always be in the Late Majority or, more likely, the Laggards when it comes to new technologies such as the Modern Data Stack. This situation opens new and significant opportunities for small and medium companies that are data-enabled at the core to go after the big companies' clients and market share.

It looked like David initially won the battle for a few years, but Goliath won the battle back in the long run!

Third Generation – The Modern Data Stack

The Modern Data Stack is tightly connected with the explosion of data tools in the cloud. The cloud comes with a new model of infrastructure that will help us build these data stacks fast, programmatically, and on-demand, using cloud-native technologies like Kubernetes, Infrastructure as Code (IaC) like Terraform and Pulumi, and cloud best practices of DevOps and security.

So, infrastructure becomes a critical factor in building and implementing a Modern Data Stack.

Besides the tool categories we already know (data warehouse, integration, visualization), we have new ones like:

Data Transformation (with the "T" stripped out from the ETL), which means that we also have a new generation of data integration tools that focus only on Extract and Load. "ELT is the new ETL" is the mantra among data aficionados, but to make it clear: ELT started with the data warehouse appliances more than ten years ago, so this is not something new.
Headless BI / Metrics Layer / Entity Layer, as Ben has been referring to it lately. Somehow it reminds me of the OLAP cubes I used to work on in the '90s, but I know today it's a different "beast" to handle.
Reverse ETL helps integrate data from your data warehouse into your SaaS applications. There are 8000+ SaaS apps (and counting) in the Martech space, each with its own data model and API interfaces; this was one source of inspiration for our Data Platform Generator site experiment. This category is by far the "star" of the Modern Data Stack, just looking at the valuation and acquisition prices of the companies building these tools.
DataOps is a methodology and a set of processes and tools put together, inspired by DevOps, Agile, and Quality Management best practices. With DataOps, you should control all the data operations of your Modern Data Stack.
Data Observability is a new set of tools that goes beyond Data Quality to give us more clarity and information about the end-to-end data pipeline value chain, lineage, behavior, and metadata.
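
To make the ELT and Reverse ETL categories more concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions: an in-memory SQLite database stands in for the cloud data warehouse, the sample orders are made up, and the payload shape (`external_id`, `properties`) is a hypothetical SaaS API format, not the contract of any real product.

```python
import sqlite3

# Hypothetical raw records, as an Extract-and-Load tool might deliver them.
RAW_ORDERS = [
    ("c1", "2021-12-01", 120.0),
    ("c1", "2021-12-15", 80.0),
    ("c2", "2021-12-20", 40.0),
]


def load_raw(conn, rows):
    """EL step: land the data untouched in a raw table."""
    conn.execute(
        "CREATE TABLE raw_orders (customer_id TEXT, order_date TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", rows)


def transform(conn):
    """T step: runs inside the warehouse as plain SQL, after loading."""
    conn.execute(
        """
        CREATE TABLE customer_revenue AS
        SELECT customer_id,
               SUM(amount) AS total_revenue,
               COUNT(*)    AS order_count
        FROM raw_orders
        GROUP BY customer_id
        """
    )


def reverse_etl_payloads(conn):
    """Reverse ETL step: shape modeled rows into payloads for a SaaS API.

    The payload layout is an assumption made for this sketch.
    """
    cursor = conn.execute(
        "SELECT customer_id, total_revenue, order_count "
        "FROM customer_revenue ORDER BY customer_id"
    )
    return [
        {"external_id": cid, "properties": {"total_revenue": rev, "order_count": n}}
        for cid, rev, n in cursor
    ]


def run_pipeline(rows):
    """Run load, transform, and reverse ETL against a throwaway database."""
    conn = sqlite3.connect(":memory:")
    try:
        load_raw(conn, rows)
        transform(conn)
        return reverse_etl_payloads(conn)
    finally:
        conn.close()


if __name__ == "__main__":
    for payload in run_pipeline(RAW_ORDERS):
        print(payload)
```

The point of the sketch is the ordering: the raw data lands first, the transformation happens afterwards inside the warehouse, and only the modeled result is pushed back out to operational tools.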

But the essential tool in this Modern Data Stack is (and always will be) the Data Warehouse, more precisely, the new type of Cloud Data Warehouse (CDW) that comes with the separation of storage and compute.

This characteristic is the critical enabler at the core of every modern data stack, and today it comes in a new architecture pattern: the Data Lakehouse, which is the dream some Hadoop aficionados had back in the day of running Data Lake workloads alongside Data Warehouse workloads.

The CDW started with Snowflake, the largest software IPO to date, and we now have a few other key players:

Databricks, which coined the Data Lakehouse term, continued the Spark journey to a higher level on data and AI, implementing a new open data management architecture.
SingleStore, which is a company somewhat in the shadow of the first two, but with a powerful real-time unified database engine that allows customers to run operational and analytical workloads simultaneously, making it a serious competitor in the market.
Dremio, which is based on an open data architecture, with an engine built on Apache Arrow and data virtualization at scale, and
Firebolt, which is the new kid on the block that intends to rewrite history for all CDWs.

Do all these tools qualify as a "Data Lakehouse"? It's too soon to draw a conclusion. Time, lessons learned, and success stories will be good references, but in the end, the use cases you need to implement for your Modern Data Stack may suit one technology better than another, so choose wisely.

Now, what tools do we use to discover, explore, analyze, and visualize the data on top of this Data Lakehouse? Your DOLAP or BI tool sitting on your computer as an extension (and usually a data source!) for your favorite spreadsheet tool is not enough.

It's not about the technical ability to connect and get data from a Data Lakehouse engine; it's about a new way of working transparently and collaboratively.

Hopefully, there will be no more meetings where the spreadsheet "guru" from the Sales department and the spreadsheet "guru" from the Marketing department poke each other's eyes out over the latest active customers number (a metric that is still a mystery for many companies out there).

Alignment and consensus should be reached between individuals, not enforced by some data governance rule.
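
One concrete form this alignment can take is a single, shared definition of the metric that both departments call instead of each keeping a private spreadsheet formula. The sketch below is a hypothetical example only: the `Order` record, the 30-day window, and the sample data are assumptions for illustration, and every organization has to agree on its own definition of an "active customer".

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Order:
    """Hypothetical order record; field names are assumptions for this sketch."""
    customer_id: str
    order_date: date


def active_customers(orders, as_of, window_days=30):
    """Count distinct customers with at least one order in the window.

    The window covers the `window_days` days ending on `as_of` (inclusive).
    """
    start = as_of - timedelta(days=window_days)
    return len({o.customer_id for o in orders if start < o.order_date <= as_of})


# Made-up sample data shared by "Sales" and "Marketing".
ORDERS = [
    Order("c1", date(2021, 12, 28)),
    Order("c1", date(2021, 12, 30)),
    Order("c2", date(2021, 11, 15)),
    Order("c3", date(2021, 12, 5)),
]

if __name__ == "__main__":
    print(active_customers(ORDERS, as_of=date(2021, 12, 31)))
```

Because the window is an explicit parameter, a disagreement between teams becomes a visible choice (30 days or 60?) rather than a hidden difference between two spreadsheets.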

A new category of data analysts or, better said, analytics engineers will rise to fulfill this role. Filling this role requires time, patience, good leadership, and inspiration to promote or hire people with a sense of data curiosity and "data hacking" abilities beyond the CV. These profiles are rare because these skills come naturally to them; it's not all about education and experience. If you find and recognize them, hire them on the spot.

We have an ever-growing ecosystem of commercial and open-source tools for advanced data users, such as data scientists and machine learning engineers, to do their job, again focusing on collaboration and alignment. These tools are for heavy, interactive users who like to try and test different tools and libraries and to run multiple models with different feature sets, in parallel if possible. Because modern data stack tools are cloud-based, these users can quickly spin up multiple clusters to support such workloads.

Both Analytics and AI/ML tools look like data stacks themselves. Most of them come with data extraction, transformation, exploration, processing, and visualization capabilities on top.

But don't be fooled by aggressive vendor salespeople: the function of these tools is not to replace your data architecture components but to complement them.

Also, I see the trend of having one tool/stack for both the Analytics and the AI/ML teams, which comes with great value in the end. It doesn't mean that we need to convert a data analyst to a data scientist and vice-versa, but why should these teams not work together?

Today, it looks like they are sitting in different corners of a room:

"The BI team", which is still working with the ETL tools and DWH appliances but has a deep understanding of data and business problems, built over long years of projects.
"The Hadoop/Cloud team", whose members, regardless of the roles they fill, are the accidental administrators and wizards of the domain, writing the scripts and tools that keep the spice flowing.
"The Data Scientists team", which is highly skilled at understanding and writing complex algorithms, ends up doing data engineering 80% of the time, and sometimes lacks the business acumen to explain the results it generates.
"The Analysts team", which in most cases is not a dedicated team but a collection of spreadsheet "gurus" from each business department who rarely collaborate or share their work with each other.

How difficult is it to bring all these teams together and form a "Data Avengers" team that approaches and solves all the data problems in the organization? What do we need?

A valid answer is data governance, and data governance, simplified, is about two things: people and change.

When it comes to people, it is about who is sitting at the table (and we partially identified that). And when it comes to change, it is about how these people will work together to deliver value beyond all the documented models, processes, and rules (which no one reads). Implementing data governance is tedious and particular to each organization.

Don't drink the Kool-Aid of the vendors pushing to sell you a magic tool that solves this overnight because they are the "experts." You'll end up with a governance model that is not sustainable for your organization and an expensive tool that needs yearly maintenance and that few or no people use.

The best person for this job is you: the business leader, the stakeholder, the manager, who knows down to the smallest detail how the organization works and what it needs to work better and more efficiently. No vendor "expert" will be at this level, ever!

Once you have this in place, or even while you are still implementing it, start looking for a tool, but there is a catch here.

Most data governance tools have overlapping capabilities, and of course, vendors claim they are better at everything. In reality, some are good at data cataloging, others at data lineage, others at data security, and others at data quality and observability. I'm not advising you to buy a tool for each capability, but just remember one phrase: "end-to-end".

It doesn't matter whether you have a first- or second-generation data platform or even the modern data stack itself. What is essential here is that for all the components you have in your data architecture (e.g., ETL/ELT, Data Warehouse, Data Visualization, Reverse ETL, Metrics Layer), the governance tool(s) must natively support (not via custom implementation or other 3rd party tools) end-to-end data security, data lineage, data quality, data discovery, etc. for your data flows and pipelines.
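
A simple way to test a vendor's "end-to-end" claim is to write the requirement down as a component-by-capability matrix and look for the holes. The sketch below does that in Python. The component and capability names come from the paragraph above, while the `NATIVE_SUPPORT` matrix describes an imaginary tool and is purely hypothetical; you would fill it in from your own evaluation.

```python
# Components and capabilities named in the text.
COMPONENTS = [
    "ETL/ELT",
    "Data Warehouse",
    "Data Visualization",
    "Reverse ETL",
    "Metrics Layer",
]
CAPABILITIES = ["security", "lineage", "quality", "discovery"]

# Hypothetical native-support matrix for an imaginary governance tool.
# A missing component means the tool has no native support for it at all.
NATIVE_SUPPORT = {
    "ETL/ELT": {"security", "lineage", "quality", "discovery"},
    "Data Warehouse": {"security", "lineage", "quality", "discovery"},
    "Data Visualization": {"security", "lineage", "quality", "discovery"},
    "Reverse ETL": {"security", "discovery"},
}


def coverage_gaps(components, capabilities, native_support):
    """Return {component: [missing capabilities]} for every uncovered pair."""
    gaps = {}
    for component in components:
        supported = native_support.get(component, set())
        missing = [cap for cap in capabilities if cap not in supported]
        if missing:
            gaps[component] = missing
    return gaps


if __name__ == "__main__":
    for component, missing in coverage_gaps(
        COMPONENTS, CAPABILITIES, NATIVE_SUPPORT
    ).items():
        print(f"{component}: missing {', '.join(missing)}")
```

An empty result would mean the tool covers every component natively; any remaining entry is a place where you would need custom work or another tool, which is exactly what the "end-to-end" requirement is meant to rule out.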

To summarize the governance part that will guide and support the "Data Avengers" team, I propose the following pillars:

Data Availability: Data should be made available quickly and be easy to deploy on a new infrastructure for various experiments – Benefit from the Modern Data Stack cloud infrastructure.
Data Documentation: Document and catalog the data to reflect the business context – Data Discovery & Cataloging, Business Glossary, Data Dictionary.
Data Integrity: Harmonize data coming from various sources to build consensus and trust across the organization – Data Lakehouse, Data Lineage, Data Privacy & Security.
Data Operations: Govern, monitor, and control all your data operations – Data Governance, DataOps, Data Quality and Observability, Data Privacy & Security.
Data Literacy: Enable all users (not only the data team) in the organization to understand the complex insights of data by providing a full view of the opportunities and challenges when working with data – Data Leadership, Coaching, and Mentorship.

What will the competition for the implementation of a Modern Data Stack look like? Will the primary cloud providers join the movement? Or will they just be happy that we use their infrastructure? They would be the perfect Goliath candidates. Will the big vendors step up and start throwing money around to buy all the little fish and protect their existing market share and clients? Or will the Modern Data Stack companies ramp up and cannibalize the market? Well, time will tell. We only need patience.

Who will win the battle for the Modern Data Stack between David and Goliath? I bet on David this time! What do you think?

My conclusions:

The Modern Data Stack takes the best practices from the Data Warehousing (30+ years) and Data Lake (10+ years) worlds and brings them into a much simpler model that is easy to deploy and use, thanks to the agility of cloud infrastructure. Gone are the days when you needed a two-to-three-year project and an army of consultants to create a simple Data Mart. Also, the availability and affordability of the Modern Data Stack make it easy for any company, from a startup to a big organization, to implement it – data democratization.
New categories of tools such as Reverse ETL, Headless BI, and Data Observability will have to pass the test of time, thousands of implementations, lessons learned, and best practices to win the right to be part of a reference data architecture.
The Modern Data Stack will not change how people work, as this is the job of a data leader and sponsor in the organization. It will help via the new category of tools focusing on collaboration between different teams, but the human factor will be responsible for the success in the end.
The explosion of data tools, as documented in Matt Turck's MAD Landscape, will continue in the following years. This will put some pressure on companies looking to implement a Modern Data Stack, which will have not just 2-3 options to choose from but dozens or hundreds. This was one of the reasons we developed the Data Platform Generator concept site: to help other data aficionados become familiar with the Modern Data Stack concept and the multitude of tools available in the market under the Data Solution Blueprints© format.

Everyone who knows me can confirm that I practice what I preach, so I will focus on exploring, building, deploying, and testing different Modern Data Stack configurations this year. I have decided to become more involved and active on social media channels and to document these experiences in detail. I will come back with an update on this article next year, in the same month, to see what 2022 had in store for us and the Modern Data Stack wave.
