‘Unstructured Data’ is a term we now hear frequently. As organizations go about their daily operations, they continuously generate unstructured data. Such data has no predefined data model or organized structure. However, the vast amount of data coming from non-traditional sources is potentially very important for any organization.
Processing unorganized data is a laborious and time-consuming task. Broadly speaking, data that never gets stored inside a database tends to be unorganized and unstructured.
Moreover, data from the various tools that an organization uses internally also contributes to the pool of unstructured data. The data generated by these tools must first be combined and then processed to produce a clean, structured version that the organization can put to use for its desired purpose.
Origin of the ‘Data Lake’ Term
The term ‘Data Lake’ was originally coined by James Dixon, Chief Technology Officer (CTO) of Pentaho. According to Dixon, “If you think of a DataMart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state. The contents of the data lake stream in from multiple sources to fill the lake, and various users of the lake can come to examine, dive in, or take samples.” (Source: wikipedia.org)
What is a Data Lake?
A ‘Data Lake’ is a storage repository that holds a vast amount of raw, unstructured data in its native format until needed. It is a great place for investigating, exploring, experimenting, and refining data in addition to archiving data.
With so many data sources that companies can use to make better business decisions, Data Lakes are becoming very important. These sources include social networks, review websites, online news, weather data, weblogs, and sensor data. All of these ‘big data’ sources lead to rapidly increasing data volumes and new data streams that need to be carefully analyzed.
Data Lake supports the following capabilities:
1) It captures raw data and then stores it at a low cost.
2) It can store any type of data within the same repository.
3) It can also transform data into a form that users can effectively apply to their own purposes.
4) It follows the Schema on Read approach. With Schema on Read, data can be inserted into the repository in its original form without any predefined schema; structure is applied only when the data is read. This matters because, at the time data is inserted, it is impossible to predict how it will be needed later.
5) It can support new types of Data Processing.
6) It can perform single subject analytics based on some specified use cases.
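The Schema on Read idea from the list above can be illustrated with a minimal Python sketch. It assumes raw records arrive as JSON strings; the records, the field names, and the `read_with_schema` helper are all hypothetical and chosen only for illustration.

```python
import json

# Raw records are stored exactly as they arrive; nothing is validated on write.
# The records and field names below are invented for illustration.
raw_lake = [
    '{"source": "sensor", "temp_c": "21.5", "ts": "2024-01-01T10:00:00"}',
    '{"source": "review", "rating": 4, "text": "Good product"}',
    '{"source": "sensor", "temp_c": "19.0"}',
]


def read_with_schema(lines, schema):
    """Apply a schema at read time.

    Keeps only records that contain every field named in the schema,
    and casts each kept field with the schema's converter.
    """
    for line in lines:
        record = json.loads(line)
        if all(name in record for name in schema):
            yield {name: cast(record[name]) for name, cast in schema.items()}


# Each consumer chooses its own schema when reading, not when storing.
sensor_schema = {"source": str, "temp_c": float}
review_schema = {"rating": int, "text": str}

sensor_rows = list(read_with_schema(raw_lake, sensor_schema))
review_rows = list(read_with_schema(raw_lake, review_schema))
print(sensor_rows)
print(review_rows)
```

The same raw store serves two different readers here, each with its own view of the data, and neither view had to be known when the records were written.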
How Data Lake Works
The Data Lake is a data repository where data from varied sources is accumulated for future use. It removes the cumbersome and costly data preparation overhead required by traditional data warehouses. Business users retrieve data from the Data Lake using metadata tags, which saves Data Analysts from having to involve IT in every step of the process. The Data Lake mainly works on the following three principles:
- Stores Anything
A Data Lake holds all kinds of data: raw data retained over an extended period of time as well as processed data.
- Flexible Analysis
Enables users to use, analyze, and research the data on their own terms.
- Easy Accessibility
Enables multiple data access patterns over a shared infrastructure including batch, interactive, online, search, in-memory and other processing engines.
Fig. 1: Data Lake Principles
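Retrieval by metadata tags, mentioned above, can be sketched as a small in-memory catalog. This is an assumption-laden illustration rather than how any particular product works: `LakeObject`, `TagCatalog`, the keys, and the tag names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class LakeObject:
    """One stored item: an opaque payload plus descriptive metadata tags."""
    key: str
    payload: bytes
    tags: set = field(default_factory=set)


class TagCatalog:
    """Hypothetical in-memory catalog that indexes lake objects by tag."""

    def __init__(self):
        self._objects = {}  # key -> LakeObject
        self._index = {}    # tag -> set of keys

    def put(self, obj):
        """Store the object untouched and index it under each of its tags."""
        self._objects[obj.key] = obj
        for tag in obj.tags:
            self._index.setdefault(tag, set()).add(obj.key)

    def find(self, *tags):
        """Return the sorted keys of objects that carry all the given tags."""
        if not tags:
            return sorted(self._objects)
        key_sets = [self._index.get(tag, set()) for tag in tags]
        return sorted(set.intersection(*key_sets))


catalog = TagCatalog()
catalog.put(LakeObject("logs/web-01.log", b"GET /index", {"weblog", "raw"}))
catalog.put(LakeObject("defects/export.xml", b"<defects/>", {"defect", "raw"}))
catalog.put(LakeObject("reports/defects.csv", b"id,status", {"defect", "processed"}))

raw_defects = catalog.find("defect", "raw")
print(raw_defects)
```

The payloads are never parsed by the catalog; only the tags are indexed, which is what lets very different kinds of data share one repository while remaining findable.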
In the diagram below, the data sources are shown on the left side. Data sources are typically operational systems, such as an IT Service Management tool, a Defect Management tool, or a Requirement Management tool, along with social media data. These data are then loaded into the Data Lake. The unprocessed data can be kept in the Data Lake for as long as it is needed, and the raw copy should not be deleted.
Fig. 2: How Data Lake Works
On the right side of the diagram, once users have identified the data they need, it is analyzed, transformed accordingly, and placed into a data mart or data warehouse.
After that, the data can be used with a tool such as Excel, or with any custom application created by the user, to gain insight into the processed data.
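The flow described above (raw data selected from the lake, transformed, and placed into a data mart for querying) can be sketched with Python's built-in `sqlite3` module standing in for the mart. The defect records, field names, and table layout are invented for illustration.

```python
import sqlite3

# Raw defect records as they might arrive from a defect management tool.
# All field names and values are invented for illustration.
raw_defects = [
    {"id": "D-1", "severity": "HIGH ", "status": "open"},
    {"id": "D-2", "severity": "low", "status": "Closed"},
    {"id": "D-3", "severity": "high", "status": "OPEN"},
]


def transform(record):
    """Normalize one raw record into the row shape the mart table expects."""
    return (
        record["id"],
        record["severity"].strip().lower(),
        record["status"].strip().lower(),
    )


# An in-memory SQLite database plays the role of the data mart.
mart = sqlite3.connect(":memory:")
mart.execute("CREATE TABLE defects (id TEXT PRIMARY KEY, severity TEXT, status TEXT)")
mart.executemany(
    "INSERT INTO defects VALUES (?, ?, ?)",
    [transform(r) for r in raw_defects],
)
mart.commit()

# A reporting tool would now query the cleaned table instead of the raw records.
open_high = mart.execute(
    "SELECT COUNT(*) FROM defects WHERE severity = 'high' AND status = 'open'"
).fetchone()[0]
print(open_high)
```

The raw list is left unchanged by the transformation, mirroring the idea that the lake keeps the original data while the mart holds a cleaned copy shaped for one purpose.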
Benefits of the Data Lake
Fig. 3: Benefits of Data Lake
a. Centralization: In a Data Lake, data from various sources is centralized in a shared place. Once gathered, the data can be processed with Big Data technologies or search techniques in ways that would otherwise have been impossible.
b. Data Scalability: A Data Lake allows a huge amount of data to be stored in its repository. Storage cost is the primary concern at this scale, and a Data Lake addresses it by offering low-cost storage. A Data Lake can also expand in size as the amount of data increases.
c. Every Data is Important: Since the cost of storage in a Data Lake is minimal, huge volumes of data can be stored at any time. Data that seems insignificant today can become significant in the future, so there is no need to decide up front which data is relevant. One can simply store the data in the Data Lake.
d. Variety of Data Types: A Data Lake can store different types of data, such as logs, XML, multimedia, transactional data, API data, sensor data, binary, social data, chat, and people data.
e. Flexibility of Data Access: Because the contents of the Data Lake are stored in a central repository, users from various departments have the flexibility to access them. As a result, a user can easily collect the data deemed important for driving business decisions in the organization.
f. Data Analysis: Because the data stored in the Data Lake comes from varied sources, analysis across those sources is likely to produce better results than analysis of the data held in a single source.
Implementation of Data Lake in Kovair
Fig. 4: Data Lake Implementation in Kovair
From the above figure, we can see that Kovair Data Lake serves as a repository for data from various third-party tools such as Rally, Jira, and HP Quality Center, as well as for Kovair Omnibus Integrations transactional data.
Kovair Omnibus is an Enterprise Service Bus (ESB) platform that seamlessly connects applications and data using a Service-Oriented Architecture (SOA). It integrates more than 75 third-party, legacy, open source, and homegrown tools across various functional domains. That means data from one third-party tool can be synced to another tool through Kovair Omnibus.
Kovair Data Lake can contain data from any third-party tool as well as Omnibus transactional data. For any third-party tool, both mapped and unmapped data can be pushed into the Kovair Data Lake repository, depending on the client's requirements. This data can take the form of records, project information, user information, relationship data, attachments, or comments.
This way, the data from various sources can be collected in a single repository. Organizations using multiple instances of the same or different third-party tools can use Kovair Data Lake to analyze the data in the future and prepare analytics with it through the Kovair Data Lake web portal.
The web portal is a user interface for displaying reports and dashboards, exporting reports to Excel, and managing users and licenses.
Kovair Data Lake also supports capturing the data generated by transactions that occur through Kovair Omnibus. A transaction in Kovair Omnibus records the details of data that has been synced or transferred from a source third-party tool to a target third-party tool. This transaction-related information, which is stored in multiple Omnibus transactional data repositories, is logged into a single database used by Kovair Data Lake.
As a result, users can fetch data from multiple Kovair Omnibus instances and use it to generate reports as and when the organization needs them.
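The consolidation idea can be illustrated generically. The sketch below is not Kovair's schema or API; it only assumes that each instance keeps a list of sync transactions. The instance names, field names, and the `consolidate` and `sync_report` helpers are hypothetical.

```python
from collections import Counter

# Hypothetical sync-transaction logs, one list per integration-bus instance.
# Field names and values are invented for illustration.
instance_logs = {
    "instance-a": [
        {"source_tool": "Jira", "target_tool": "Rally", "status": "success"},
        {"source_tool": "Jira", "target_tool": "Rally", "status": "failed"},
    ],
    "instance-b": [
        {"source_tool": "Rally", "target_tool": "Jira", "status": "success"},
        {"source_tool": "Jira", "target_tool": "Rally", "status": "success"},
    ],
}


def consolidate(logs_by_instance):
    """Merge per-instance logs into one list, tagging each row with its origin."""
    merged = []
    for instance, rows in logs_by_instance.items():
        for row in rows:
            merged.append({**row, "instance": instance})
    return merged


def sync_report(rows):
    """Count transactions per (source tool, target tool, status) combination."""
    return Counter(
        (row["source_tool"], row["target_tool"], row["status"]) for row in rows
    )


all_rows = consolidate(instance_logs)
report = sync_report(all_rows)
for (source, target, status), count in sorted(report.items()):
    print(f"{source} -> {target} [{status}]: {count}")
```

Once the rows sit in one place, a report that spans instances becomes a single aggregation rather than one query per repository.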
In the future, Kovair Data Lake will also support interactive report generation from the data stored in its repository, based on user requirements.