The two things that will make or break your data lake strategy

Data lakes are different

Data lakes differ from data warehouses in that they store data in its raw form and in varied formats. One purpose of a data lake initiative is to run analytics and data science/AI algorithms on a broader set of data to get better insights. A data lake is usually cheaper to assemble than a data warehouse because data can be loaded with little or no processing. However, that also creates a problem!

Technology is now the easiest part of a data lake initiative, as multiple cloud providers offer the capability. Storage options exist for every kind of data, from structured to unstructured, so there is a suitable option for any given format.

Anyone doing even rudimentary data science understands the importance of data quality. Is the data clean and deduplicated? Are the data points unified? Without that, it is at best “Garbage in, garbage out”, and this wisdom is not new.

Any data lake initiative therefore needs a data governance strategy that continuously ensures the data is, and remains, of high quality. It is not a one-off exercise but a continuous effort. Think of it like a filtration plant: even data that has already passed through the plant needs to be passed through again after sitting in storage for a long time, to ensure its quality remains intact.
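The filtration-plant idea can be sketched as a quality gate that is run when data arrives and re-run on data already in storage. This is a minimal illustration, not a prescribed implementation: the function names `quality_report` and `passes_gate`, the two metrics, and the threshold values are all assumptions chosen for the example.

```python
def quality_report(records, required_fields, key_field):
    """Measure completeness and duplicate rate for a batch of records.

    Hypothetical helper: the metrics and field names are illustrative.
    """
    total = len(records)
    if total == 0:
        return {"completeness": 1.0, "duplicate_rate": 0.0}
    # A record is complete when every required field has a non-empty value.
    complete = sum(
        1
        for rec in records
        if all(rec.get(field) not in (None, "") for field in required_fields)
    )
    # Records sharing the same key value are counted as duplicates.
    unique_keys = len({rec.get(key_field) for rec in records})
    return {
        "completeness": complete / total,
        "duplicate_rate": 1 - unique_keys / total,
    }


def passes_gate(report, min_completeness=0.95, max_duplicate_rate=0.01):
    """Return True when the batch meets the (assumed) quality thresholds."""
    return (
        report["completeness"] >= min_completeness
        and report["duplicate_rate"] <= max_duplicate_rate
    )


# Example batch: one record is missing a country, one serial number repeats.
batch = [
    {"serial_no": "A1", "country": "DE"},
    {"serial_no": "A2", "country": ""},
    {"serial_no": "A3", "country": "US"},
    {"serial_no": "A3", "country": "US"},
]
report = quality_report(batch, required_fields=["serial_no", "country"], key_field="serial_no")
```

Running the same check on a schedule against stored data, and not only at ingestion, is what makes the effort continuous.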

“Data lake success needs both technical and semantic data quality, period”

Technical cleanup

Data quality should not be approached with a myopic view of data cleaning that is limited to filling missing values, enrichment, and deduplication. I would term these technical cleanup. They are important, but it is equally important for the data cleaning engine to have a notion of semantic cleanliness.
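As a rough sketch of what technical cleanup looks like in practice, the snippet below fills missing values and removes duplicates from a list of records. The function name `technical_cleanup`, the `serial_no` key, and the default values are assumptions made for the example.

```python
def technical_cleanup(records, defaults, key_field="serial_no"):
    """Fill missing values, normalize the key, and drop duplicate records.

    Hypothetical helper: field names and defaults are illustrative only.
    """
    seen = set()
    cleaned = []
    for rec in records:
        row = dict(rec)  # copy so the raw record stays untouched
        # Fill missing or empty fields with the supplied defaults.
        for field, default in defaults.items():
            if row.get(field) in (None, ""):
                row[field] = default
        # Normalize the key so " ab-1 " and "AB-1" count as the same record.
        key = str(row.get(key_field) or "").strip().upper()
        if key and key in seen:
            continue  # duplicate of a record already kept
        if key:
            seen.add(key)
        row[key_field] = key
        cleaned.append(row)
    return cleaned


raw = [
    {"serial_no": " ab-1 ", "country": None},
    {"serial_no": "AB-1", "country": "DE"},
    {"serial_no": "cd-2", "country": ""},
]
cleaned = technical_cleanup(raw, defaults={"country": "UNKNOWN"})
```

Note that this sketch keeps the first occurrence of a duplicate, even though the second one here carries a better `country` value. Deciding which copy to trust is already a step toward semantic questions.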

Semantic cleanup

Semantic cleanup needs domain knowledge and purpose-built data quality engines. To be effective, the engine should understand the domain and the relationships between its objects. For example, in the industrial OEM world, equipment and parts are two important categories of objects, and classifying each record correctly as one or the other is important for many analyses to be meaningful.
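To make the equipment-versus-part example concrete, here is a deliberately simple rule-based sketch. The keyword lists, the `classify` function, and the rule that a part keyword wins over an equipment keyword are all assumptions standing in for real domain knowledge; a purpose-built engine would be far richer.

```python
import re

# Hypothetical domain vocabulary; a real catalog would be curated by experts.
EQUIPMENT_KEYWORDS = {"filler", "compressor", "labeler"}
PART_KEYWORDS = {"gasket", "seal", "bearing", "filter"}


def classify(description):
    """Classify a record as 'part', 'equipment', or 'needs_review'."""
    words = set(re.findall(r"[a-z]+", description.lower()))
    is_part = bool(words & PART_KEYWORDS)
    is_equipment = bool(words & EQUIPMENT_KEYWORDS)
    # Assumed domain rule: parts are often named after the equipment they
    # belong to ("compressor seal kit"), so a part keyword takes priority.
    if is_part:
        return "part"
    if is_equipment:
        return "equipment"
    # Anything the rules cannot decide is routed to a domain expert.
    return "needs_review"


labels = [
    classify("Rotary Filler 3000"),
    classify("Compressor seal kit"),
    classify("Misc item"),
]
```

The `needs_review` outcome matters as much as the two labels: it is the point where human domain knowledge feeds back into the rules.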

Ideally, OEMs would have curated catalogs in place that can be fed into the system, but real-world warriors know that this is hardly ever the case. There are, though, gladiators in the system who, with their tribal knowledge, know how to fit things together. Institutionalizing that knowledge is very important, and for that reason solutions or platforms that treat semantic notions as first-class concepts become important for any analysis to be effective.

The rules and insights can then be captured in automated data science algorithms to make them scalable. Automation is successful only when it is built with both technical and semantic inputs.

Evaluate your data lake initiatives and make sure that data governance and data quality, in both technical and semantic dimensions, exist as core elements.

How to achieve Installed Base Visibility?

June 28, 2024

Lalit Bhatt

