Cloud and big data: What you need to know in 2021 and beyond

May 4, 2021

During this year’s Super Bowl LV weekend, I watched a performance that was arguably the greatest of all time (GOAT).

The Weeknd was great, and Tom Brady won another Most Valuable Player award. But I’m talking about this video of Stanford’s Professor Chris Re (a MacArthur fellow, which means he’s an academic GOAT). He describes a new software 2.0 paradigm for artificial intelligence (AI) model development.

What does this have to do with cloud computing and big data? Everything.

Professor Re talks about a future where the best insights will be achieved by excelling at data engineering and domain understanding.

How is that possible? In part because of emerging deep learning models called transformers, which can be pre-trained on a massive data set and then fine-tuned for a particular use case. So instead of creating a new model to solve a problem, we can fine-tune a pre-trained one.
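To make the idea concrete, here is a minimal sketch of one simplified form of fine-tuning: the pre-trained part stays frozen and only a small task-specific "head" is trained on a handful of labeled examples. This is not a transformer. The function `pretrained_features` is a hypothetical stand-in for a real pre-trained model, and the toy task (flag values whose magnitude exceeds 1) and the names `fine_tune`, `predict` and `EXAMPLES` are assumptions made purely for illustration.

```python
import math

def pretrained_features(x):
    # Hypothetical stand-in for a frozen pre-trained model: it turns raw
    # input into features and is never updated during fine-tuning.
    return [x, x * x, 1.0]

def score(weights, x):
    # Linear score of the task-specific head on top of the frozen features.
    return sum(w * f for w, f in zip(weights, pretrained_features(x)))

def fine_tune(examples, epochs=1000, learning_rate=0.1):
    # Train only the small logistic-regression head with gradient descent.
    weights = [0.0, 0.0, 0.0]
    for _ in range(epochs):
        for x, label in examples:
            features = pretrained_features(x)
            prediction = 1.0 / (1.0 + math.exp(-score(weights, x)))
            error = prediction - label
            weights = [w - learning_rate * error * f
                       for w, f in zip(weights, features)]
    return weights

def predict(weights, x):
    # Classify as 1 when the head's score is positive, otherwise 0.
    return 1 if score(weights, x) > 0 else 0

# A few labeled examples for the assumed task: label 1 when |x| > 1.
EXAMPLES = [(-2.0, 1), (-1.5, 1), (-0.5, 0), (0.0, 0),
            (0.5, 0), (1.5, 1), (2.0, 1)]

head = fine_tune(EXAMPLES)
print(predict(head, 0.0), predict(head, 2.0))
```

The point of the sketch is the division of labor: the expensive, general-purpose part is reused as-is, and the work that remains is choosing and labeling good examples for the problem at hand.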

While researchers and highly skilled data scientists will still create AI models, most of us will use those models the way we use open-source software today. Automation will also increasingly select the model type and tune the hyperparameters.

What’s left for us to do? Most companies will need dramatically more domain experts and data engineers who understand the problem well enough to:

apply the available data to best effect;
monitor the model to make sure it’s making the right predictions and handle exceptions when it’s not; and
fill gaps in the data using their expertise and knowledge about the problem.

We’ll differentiate less on the models and more on preparing data, applying it to the models and monitoring the accuracy. To lead in AI means leading in data.

This brings me to the topic of cloud and big data. Thanks to transformers and other emerging technologies like edge computing, how we look at cloud and big data today is different from how we looked at them five years ago.

What is big data?

The simple answer first: Big data is a massive data set created by digital technology systems. It’s so voluminous that it can’t be processed by traditional IT (e.g., by getting a bigger and better server).

Here’s the slightly more nuanced answer: Big data is what happened when AI’s demand for data exceeded what traditional IT could supply. We always had business intelligence and analytics on data, but not enough demand to capture data in its original form. AI created a demand to mine for complex patterns, deeper insights and real-time streaming. And so we needed a new set of technologies.

Big data is a competitive advantage of cloud-first businesses like Google, Uber and Netflix. These cloud natives compete on their ability to derive actionable insights from their big data foundations.

But your company doesn’t have to be a cloud native to take advantage of big data. Thanks to the wide availability of cloud solutions, you can compete toe-to-toe with any digital upstart.

5 ways big data and cloud are related

We can scale big data because of cloud computing—both in how it works and how anyone can access it. So I can’t really talk about big data without cloud. Here are five main ways cloud supports big data:

Cloud enables access to big data in a cost-effective, pay-per-use, scalable manner. Why didn’t we save and process all that data back in the day? It was too expensive because of the monolithic systems we used. To handle more data, we needed bigger machines and the cost scaled exponentially. By contrast, big data is based on parallelized architectures that scale linearly and elastically and take advantage of cloud’s pay-per-use and on-demand access mechanisms.
Cloud is the “easy button” that handles all the hard big data stuff. Standing up, managing and securing a big data cluster is hard. Cloud natives figured it out, but it shouldn’t be a core competency of every business. The cloud provider handles a lot of this infrastructure for you. Thanks to cloud’s power to democratize IT, your company doesn’t have to approach big data like a risky science project.
Cloud tools make it easy to experiment with data. We can get the best insights when it’s easy to work with the data. Thankfully, cloud offers tools for model management and data pipelines that let data scientists and engineers create, experiment and publish models, connect them in a pipeline and monitor performance. In other words, cloud handles the data “plumbing” so you can focus on the insights that can help your business.
Cloud helps you manage your data. You need a unified view of your data: who owns it, who can access it, which privacy restrictions apply, what its quality is and how it connects to other data. Emerging cloud tools offer pre-defined industry data models and metadata systems that give you a single logical view across multiple systems, cloud vendors and even partners. These systems catalog the data you have available.
Cloud lets you tap into “citizen” users. It equips those industry experts who best understand the problem. Low-code/no-code tools turn data into a self-service capability that anyone in the enterprise can use, not just the data experts and software engineers. The business analyst, the domain expert, the operations engineer — all have data insights at their fingertips thanks to cloud.
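To picture the "unified view" described in the fourth point, here is a minimal sketch of a data catalog. It is not any cloud provider’s actual API. The class names `CatalogEntry` and `DataCatalog`, their fields, and the sample datasets are all hypothetical and chosen only to mirror the questions listed above: ownership, access, privacy, quality and connections to other data.

```python
from __future__ import annotations

from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    # One record per dataset; every field here is an illustrative assumption.
    name: str
    owner: str
    allowed_roles: set[str]
    contains_personal_data: bool
    quality_score: float  # assumed scale: 0.0 (poor) to 1.0 (excellent)
    linked_datasets: list[str] = field(default_factory=list)

class DataCatalog:
    """A single logical view over datasets that may live in many systems."""

    def __init__(self) -> None:
        self._entries: dict[str, CatalogEntry] = {}

    def register(self, entry: CatalogEntry) -> None:
        self._entries[entry.name] = entry

    def can_access(self, dataset: str, role: str) -> bool:
        # Unknown datasets are treated as inaccessible.
        entry = self._entries.get(dataset)
        return entry is not None and role in entry.allowed_roles

    def datasets_for(self, role: str) -> list[str]:
        # Everything a given role is allowed to see, in alphabetical order.
        return sorted(name for name, entry in self._entries.items()
                      if role in entry.allowed_roles)

catalog = DataCatalog()
catalog.register(CatalogEntry(
    name="transactions", owner="payments-team",
    allowed_roles={"analyst", "compliance"},
    contains_personal_data=True, quality_score=0.9,
    linked_datasets=["customers"]))
catalog.register(CatalogEntry(
    name="customers", owner="crm-team",
    allowed_roles={"compliance"},
    contains_personal_data=True, quality_score=0.8))

print(catalog.datasets_for("analyst"))
```

Keeping this metadata in one place is what lets a business analyst discover which data exists and whether it can be used, without knowing which system or vendor stores it.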

To put all these advantages in perspective, here’s an example from my work.

A group of midsized banks wanted to share costs and improve the detection of suspicious activity. Our solution was a collaborative anti-money laundering application. We used the cloud provider’s big data platform tools to stand up a common big data environment used by multiple banks that could scale as new banks were onboarded.

We applied a common industry data model that helped us map data quickly across banks. Model management tools allowed business users to validate pre-defined models that were improved by sharing what worked for other banks. Low-code/no-code tools allowed these users to create unique views of data and outcomes for their bank. None of this would have been possible so quickly or so efficiently without cloud.

The future of data and cloud: 5 “vintage” trends emerging now

In 2018, I published a data maturity model (PDF) that charts the journey to becoming a data-driven enterprise.

The journey has been unfolding as I charted. In fact, cloud has sped up the journey by handling data infrastructure and platform basics.

We’re now seeing the emergence of the final Industrialized level of maturity, where data is a competitive advantage for the enterprise. The differentiator now is the quality of data in the digital ecosystem.

Let’s look at the trends in the mature/Industrialized level identified in 2018 and how they are taking shape today.

Business strategy is data- and outcome-centric. The strategy is aligned to business goals (e.g., growing the customer base, creating personalized recommendations, etc.) and outlines how to obtain the data to reach these goals. More than ever, every business is data-driven and thinks about its data as a set of differentiated products.
Data architectures are extending across ecosystems. A data-centric and secure architecture makes it easy to factor in data and models from other lines of business, partners and the edge. This new ability powers cross-ecosystem collaboration. My favorite perspective on this data-centric trend is from Zhamak Dehghani (a data GOAT), who articulates the concept of a data mesh as a way for companies to extend their data foundations to work together.
An agile data preparation process is automated and model-driven. Today human intervention is most needed in setting up the data flow because domain expertise is required to map and label data. It’s a key focus of ours: We want to enable “citizen users” with more automation and model-driven development. Domain experts like doctors and plant operators who know the problem best have a direct hand in curating and applying data and overcoming data gaps with their expertise.
Risk, regulation and compliance move beyond legal agreements and are automated and enforced programmatically. This development allows companies to monitor and support regulatory compliance, which increases trust and makes data sharing less risky. Our own David Treat explains how shared data will be more protected than ever thanks to emerging privacy-preserving technology. It will work with cloud to enable private and secure data sharing at rest (when stored), in motion (in transit), and even through to the compute step (when processed).
Prescriptive insight-driven actions are becoming available to all users. Contextualized data-insight services are vital for companies. They can be supported by a domain knowledge graph, the same technology that underpins web search. This is possible today, as in our work with digital twins for an oil and gas company. In that instance, the knowledge graph recommended not just relevant data sources but also insights on well operations.
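As an illustration of the fourth trend, compliance that is "enforced programmatically" can be pictured as rules written in code and checked on every data-sharing request, rather than terms that live only in a legal agreement. The sketch below is a simplified assumption, not a real regulatory rule set or any vendor’s product. The names `SharingRequest`, `APPROVED_REGIONS`, `policy_violations` and `approve`, and the rules themselves, are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SharingRequest:
    # A hypothetical request to share a dataset with a partner.
    dataset: str
    recipient_region: str
    contains_personal_data: bool
    encrypted_at_rest: bool
    encrypted_in_transit: bool

# Assumed example policy: regions where personal data may be sent.
APPROVED_REGIONS = frozenset({"EU", "US"})

def policy_violations(request: SharingRequest) -> list:
    # Return a human-readable reason for every rule the request breaks.
    violations = []
    if not request.encrypted_at_rest:
        violations.append("data must be encrypted at rest")
    if not request.encrypted_in_transit:
        violations.append("data must be encrypted in transit")
    if (request.contains_personal_data
            and request.recipient_region not in APPROVED_REGIONS):
        violations.append("personal data cannot be shared with this region")
    return violations

def approve(request: SharingRequest) -> bool:
    # A request is approved only when it breaks no rule.
    return not policy_violations(request)

request = SharingRequest(
    dataset="transactions", recipient_region="APAC",
    contains_personal_data=True,
    encrypted_at_rest=True, encrypted_in_transit=False)

for reason in policy_violations(request):
    print(reason)
```

Because the check runs automatically and explains each rejection, the same rules can be applied consistently across partners and audited later.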

What do all these trends point to? A federated approach to using data and models from others while remaining differentiated on your own.

And that’s important because, as Professor Re predicts, most of us will be doing less model creation and more model application. More than ever, we’ll differentiate on our ability to get the best data from many sources (including our domain experts) and how it’s seamlessly captured, integrated and applied. Cloud turns big data into data products that connect as part of an even bigger data continuum.

Can we still say “big data” without sounding corny?

You bet. Cloud is making data, and big data in particular, the most valuable asset your company has today. Thanks to cloud, data is now available on demand, at scale and democratized in its access, no matter where your company is located or what line of business it is in.

I hope you now understand why I spent my Super Bowl weekend watching Professor Re’s talk. Those who are best poised to unlock value from data have access to that data and know the domain.

Who better than your own business to become the next data GOAT?

WRITTEN BY

Teresa Tung

Cloud First Chief Technologist