Discover everything about Data Lakes: their usefulness, advantages, and best practices for efficiently storing and managing massive data. Optimize your data infrastructure with this article.

24/11/2022

EVERYTHING ABOUT DATA LAKES: Benefits, Features and Best Practices

What if data were the new wealth of the 21st century? After black gold comes digital gold, which many companies have quickly learned to exploit and which has fueled their rapid growth.

Like all wealth, however, data must be cultivated, maintained, and stored in order to be used efficiently.

With the ever-increasing amount of data being created and exchanged, and above all the issues of security and privacy, it is becoming increasingly complex to find the ideal way to store and access this data.

For a long time the trend was to set up a data warehouse; today it is the data lake that is making the news. Let's look at this approach and what it can do for your company.

What is Data?

In computer science, contrary to common belief, data is not information but rather the representation of that information. Data becomes information only when it is placed in a context.

For example, the number 10 is data. This number can be placed in different contexts and become real information: 10 years, 10 degrees, 10 products that are still in stock… 
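The distinction between a raw value and a value in context can be sketched in a few lines of Python. This is only an illustration: the `Information` class and its field names are hypothetical and do not come from any particular library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Information:
    """A raw value paired with the context that gives it meaning (hypothetical type)."""
    value: int     # the data itself
    context: str   # what the value refers to

    def describe(self) -> str:
        return f"{self.value} {self.context}"


raw_value = 10  # on its own, this is only data

# The same data placed in three different contexts becomes three pieces of information.
readings = [
    Information(raw_value, context)
    for context in ("years", "degrees", "products in stock")
]

for reading in readings:
    print(reading.describe())
```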

Data is a bit like the raw material of information, in a form adapted to machines. All data is ultimately encoded in binary, a succession of 1s and 0s. It can be exchanged, stored, or deleted.

Today, everything is considered data: your email address, the search you made yesterday on Google, and even this article.

Indeed, computer data, which the machine handles as binary code, can be presented as text, sound, images, and so on.

There are several types of data:

  • Primary vs. secondary: primary data is raw data that has not undergone any processing or modification, unlike secondary data
  • Structured vs. unstructured: structured data can be easily analyzed because of its form. It can usually be organized and classified in spreadsheets (text fields, dates…). Unstructured data, by contrast, is difficult to analyze because of its nature. This category includes images, videos…

Don't confuse Big Data, Data Warehouse and Data Lake

Big data

Every second, 29,000 GB of data are exchanged worldwide, and this figure has been growing steadily in recent years. This enormous mass of data, sometimes called megadata, has become known as Big Data. Experts commonly highlight three essential characteristics:

  • Volume: we all know the “traditional” units of data measurement such as kilobytes, megabytes and, in companies, even terabytes. With Big Data, servers will need the capacity to process data volumes measured in exabytes (10^18 bytes) and even zettabytes (10^21 bytes)
  • Velocity: one of the major challenges of data access and exchange is immediacy. Today, users expect results right away, without waiting. Systems therefore have to be faster and faster, and even anticipate requests
  • Variety: whether structured or unstructured, data needs to be processed, and the variety of data types logically leads to a variety of processing methods, but also to questions about how all the incoming data is managed and controlled
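To give a sense of scale for the volume characteristic, here is a short back-of-the-envelope calculation. It assumes decimal units (1 GB = 10^9 bytes) and takes the 29,000 GB per second figure quoted above at face value; it is a rough estimate, not a measurement.

```python
# Decimal (SI) byte units; this is an assumption, binary units would differ slightly.
GB = 10**9
EB = 10**18
ZB = 10**21

RATE_GB_PER_SECOND = 29_000      # figure quoted in the article
SECONDS_PER_DAY = 24 * 60 * 60   # 86,400

bytes_per_day = RATE_GB_PER_SECOND * GB * SECONDS_PER_DAY
bytes_per_year = bytes_per_day * 365

print(f"{bytes_per_day / EB:.2f} EB per day")
print(f"{bytes_per_year / ZB:.2f} ZB per year")
```

Under these assumptions the total comes to about 2.5 exabytes per day and a little over 0.9 zettabytes per year, which is why terabytes are no longer a sufficient unit at this scale.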

All this data is kept on servers that are grouped in data centers. Today, the largest data center, owned by China Telecom, covers about 1 million m² and is reported to house 1.2 million servers.

The Data Warehouse

While these data centers also house corporate data, companies can also have their own data warehouse.

This is a unified storage space for the data from all the systems of an organization. It can take the form of a physical server or be hosted in the Cloud.

The data stored in the data warehouse has a defined purpose and can easily be used for reporting. It therefore holds secondary (already processed) data.

The architecture of a data warehouse is simple and consists of:

  • A basic structure that makes all the data available so that users can access and use it
  • A staging area where the data is cleaned before it is stored in the basic structure. This step is called data cleansing
  • A system of data marts that separates data according to business processes or departments (sales, marketing, etc.). This makes it possible to access data more quickly. It also strengthens security, since users can only access the data they need.
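The data mart idea described above can be sketched as follows. This is a minimal in-memory illustration, not a real warehouse: the sample rows, the `USER_DEPARTMENT` mapping and both function names are hypothetical.

```python
from collections import defaultdict

# A toy "warehouse": cleaned rows coming from several departments (sample data).
warehouse = [
    {"department": "sales", "metric": "orders", "value": 120},
    {"department": "marketing", "metric": "leads", "value": 45},
    {"department": "sales", "metric": "revenue", "value": 9800},
]


def build_data_marts(rows):
    """Split warehouse rows into one data mart per department."""
    marts = defaultdict(list)
    for row in rows:
        marts[row["department"]].append(row)
    return dict(marts)


# Hypothetical mapping of users to the single department they belong to.
USER_DEPARTMENT = {"alice": "sales", "bob": "marketing"}


def read_mart(user, marts):
    """Return only the rows of the data mart the user is entitled to see."""
    department = USER_DEPARTMENT[user]
    return marts.get(department, [])


marts = build_data_marts(warehouse)
print(read_mart("alice", marts))  # sales rows only
print(read_mart("bob", marts))    # marketing rows only
```

Because each user reads a smaller, department-specific subset, queries touch less data and nobody sees rows outside their own scope.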

An ODS (Operational Data Store) can also be set up to store heterogeneous data that will be processed before being integrated into the data warehouse. 

The Data Lake

Finally, there is more and more talk these days about data lakes. A data lake is simply an alternative storage space to the data warehouse, one that can store any kind of data.

A data lake is therefore not intended for end users but for data analysts, who analyze the raw data to make it understandable and usable.

Let’s go into a little more detail to understand the stakes.

The Data Lake in detail

As we said above, the amount of data exchanged keeps increasing year after year, in society at large as well as within companies.

Beyond the multitude of sources, it is also the diversity of data structures that makes exchanging and analyzing data within an organization complex.

The Origins of the Data Lake

The term was first used in 2010 by James Dixon, who criticized data warehouses, and data marts in particular, for being too limited in size.

He compared a data mart to bottled water, packaged for easy consumption. Staying with the aquatic metaphor, he contrasted it with a lake in a more natural state, fed by different sources.

"If you think of a Data Mart as a store of bottled water, cleansed and packaged and structured for easy consumption, the Data Lake is a large body of water in a more natural state. The contents of the Data Lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples."

James Dixon

Very quickly, large companies, including the growing Big Tech firms, embraced these data lakes. But not all implementations were successful, and some lakes even turned into swamps: the data swamp.

Without organization and a real structure, any data lake can turn into a data swamp, a space where any type of data ends up without any thought behind it. One could almost call it an abandoned lake.

How to Avoid the Data Swamp?

It is essential to establish a governance framework. This governance rests on three principles:

  • The data must be locatable and accessible: it must be secure, available in the format required by its intended use, and its transformation process must be known
  • The information has a useful life: it should be deleted when it is no longer needed
  • The data must be orchestrated. We speak of orchestration when the data is configured, managed and coordinated in a simple and, above all, automated way. Unlike automation, which concerns a single task, orchestration is a more complex process that often spans several systems.
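The "useful life" principle above lends itself to a small sketch. The following is one possible way to express a retention rule, assuming each dataset in a catalog records when it was last used and how long it may be kept; the `Dataset` fields and function names are hypothetical.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Dataset:
    name: str
    last_used: date
    retention_days: int  # how long the data may stay unused (assumed policy field)


def expired(dataset: Dataset, today: date) -> bool:
    """A dataset is expired once it has gone unused longer than its retention period."""
    return today - dataset.last_used > timedelta(days=dataset.retention_days)


def apply_retention(catalog, today):
    """Split the catalog into datasets to keep and datasets to delete."""
    kept = [d for d in catalog if not expired(d, today)]
    removed = [d for d in catalog if expired(d, today)]
    return kept, removed


today = date(2024, 1, 31)
catalog = [
    Dataset("clickstream", last_used=date(2024, 1, 1), retention_days=90),
    Dataset("old_exports", last_used=date(2023, 1, 1), retention_days=180),
]
kept, removed = apply_retention(catalog, today)
print([d.name for d in kept], [d.name for d in removed])
```

In a real deployment a rule like this would typically be run on a schedule by the orchestration layer rather than by hand.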

What are the General Characteristics of the Data Lake?

A data lake holds data that may prove useful in the future, even though its purpose and eventual use are not yet known.

Behind this technology is the concept of freedom to store any type of data. But to qualify as a data lake, a system must have three key characteristics:

  • A single storage space
  • An orchestration capability
  • Applications or data flows that act on the data

The data lake is particularly useful for so-called data-driven companies, which organize their operations around the data they collect.

In these companies, most decisions are based on this data. It is therefore essential to have the right structures and architectures in place so that the data can be analyzed effectively.

What are the Key Components of a Data Lake Architecture?

A data lake is not set up at random: it responds to business needs.

The first step is to analyze these needs. Objectives have to be defined with a double vision: knowing what we want in the short and medium/long term, but also where to start in order to reach these objectives.

From there, it is easier to identify possible obstacles and to set up the tools needed to anticipate and overcome them. It is also essential to know what data is available, and in particular its availability and quality problems.

Several tools come out of this work, such as a prioritization matrix or a business roadmap.

Then comes the choice of a pragmatic architecture, i.e., an architecture that meets current needs and can evolve to meet future ones.

Within any data lake architecture, there are five key components:

Data ingestion: a system capable of ingesting data from multiple sources and in multiple formats (web pages, applications, IoT systems, etc.). It must therefore be flexible enough to run in different modes (in real time, as a one-off load, or in batches).
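A batch ingestion step of this kind might look like the sketch below. It is only an illustration under simple assumptions: the lake is a local folder, files land unchanged under a `raw/<source>/` path, and a small JSON sidecar records where each file came from. The folder layout and the `ingest` function are hypothetical, not a standard.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def ingest(lake_root: Path, source: str, filename: str, payload: bytes) -> Path:
    """Store the payload unchanged under raw/<source>/ and write a metadata sidecar."""
    target_dir = lake_root / "raw" / source
    target_dir.mkdir(parents=True, exist_ok=True)

    target = target_dir / filename
    target.write_bytes(payload)  # no parsing or conversion: the data stays raw

    metadata = {
        "source": source,
        "size_bytes": len(payload),
        "ingested_at": datetime.now(timezone.utc).isoformat(),
    }
    sidecar = target_dir / (filename + ".meta.json")
    sidecar.write_text(json.dumps(metadata))
    return target


# A temporary folder stands in for the lake's storage in this example.
lake = Path(tempfile.mkdtemp())

# One batch mixing formats and sources (sample data).
batch = [
    ("web", "page.html", b"<html>hello</html>"),
    ("iot", "sensor.json", b'{"temp": 10}'),
]
stored = [ingest(lake, source, name, payload) for source, name, payload in batch]
print([str(path.relative_to(lake)) for path in stored])
```

Keeping the payload byte-for-byte and recording its origin separately is what later lets analysts locate the data and trace where it came from.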

Data storage: a scalable system capable of storing compressed and encrypted data while remaining efficient, particularly in terms of cost.

Data security: a data lake must offer a maximum level of security thanks to multi-factor authentication and role-based access.

Data analysis: once the data has been ingested, it must be possible to analyze it in an agile way, using tools that extract the desired information before transferring the selected data to another storage space.

Data governance: it is important to keep track of the modifications made to the data lake in order to remain compliant during audits. In addition, all data processing must be simplified as much as possible to guarantee a certain level of data quality for professional use.

What are the Advantages of the Data Lake?

A schema-on-read basis: the data does not need to have a specific format when ingested, because no schema is imposed at write time. The structure is applied only later, when the data is read for analysis.

Since a data lake imposes no fixed structure, it is flexible and can easily adapt to changes and to the integration of new data.
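Schema-on-read can be illustrated with a short sketch. It assumes the lake stores raw JSON lines exactly as they arrived, and that a reader decides at query time which fields it needs and which types to give them; the sample records and the `READ_SCHEMA` mapping are made up for the example.

```python
import json

# Raw records as they were ingested: inconsistent fields, everything stored as text.
raw_records = [
    '{"id": "1", "temp": "10", "unit": "C"}',
    '{"id": "2", "temp": "12.5"}',
    '{"id": "3", "humidity": "40"}',
]

# The schema lives with the reader, not with the storage: field name -> type to cast to.
READ_SCHEMA = {"id": int, "temp": float}


def read_with_schema(lines, schema):
    """Apply a schema while reading: keep the requested fields, cast them, fill gaps with None."""
    for line in lines:
        record = json.loads(line)
        row = {}
        for field, cast in schema.items():
            value = record.get(field)
            row[field] = cast(value) if value is not None else None
        yield row


rows = list(read_with_schema(raw_records, READ_SCHEMA))
print(rows)
```

A different analysis could read the same raw records with a different schema (for instance one that keeps `humidity`), without anything having to be re-ingested.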

A storage space that is suitable for all types of data and that breaks with traditional silos: data does not need to be processed beforehand, and ingestion is much faster than in a data warehouse, for example.

Data centralization: all the company's data is in a single system, which makes it easier to search or compare.

A secure space that can be accessed from anywhere: all the data is accessible both within the company and remotely, for example by a data analyst working from home. To make this possible, security has to be a priority in order to avoid data loss or theft.

On-Premises or Cloud Data Lake: Which to Choose?

For a very long time, data storage brought to mind long corridors lined with server cabinets in which all the data was kept. Today, Cloud technology makes it possible to outsource this storage to a Cloud provider.

The constraints and advantages are ultimately the same as for any other system. An on-premises data lake requires space, equipment and skills, as well as a maintenance budget.

For a data lake in the Cloud, hardware costs are not borne directly by the company. Nevertheless, you must pay attention to storage costs, which can be high depending on the amount of data you have. The main providers are the same as for general-purpose Cloud services: AWS, Azure and Google Cloud.
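To see how Cloud storage costs grow with volume, a rough estimate can be sketched as below. The price per GB is a made-up placeholder chosen for illustration, not a quote from AWS, Azure or Google Cloud, and real bills also include requests, transfers and storage tiers that this sketch ignores.

```python
def monthly_storage_cost(volume_tb: float, price_per_gb_month: float) -> float:
    """Storage-only monthly cost, using decimal units (1 TB = 1000 GB)."""
    return volume_tb * 1000 * price_per_gb_month


# Hypothetical price, for illustration only; check your provider's current pricing.
ASSUMED_PRICE_PER_GB_MONTH = 0.02

for volume_tb in (10, 100, 1000):
    cost = monthly_storage_cost(volume_tb, ASSUMED_PRICE_PER_GB_MONTH)
    print(f"{volume_tb:>5} TB -> {cost:,.0f} per month")
```

The cost scales linearly with the volume kept, which is one more reason to delete data once it is no longer needed.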
