Introduction
Moving your organization to a data lake entails more than just building its architecture and collecting massive amounts of data. Your organization must also continuously manage the data lake and consider how to operate it cost-effectively and efficiently at scale, so it provides value in the long run.
Key elements to consider are monitoring and optimizing the data lake's operation and performance, analyzing utilization patterns, and using that information to optimize the lake's cost and value.
In this article, we will show how to optimize the performance of data lake analytics so that the data lake becomes more of an asset and less of a hindrance.
Consequences of unmanaged data lakes
Much of the criticism of data lake technology centers on a few recurring problems and challenges.
An unmanaged data lake that stores all of the organization's raw, ungoverned data can quickly become riddled with duplicates, old versions, and redundant copies. Continuously syncing this data between applications or local data stores and the main repository means unnecessary ETL jobs and expenses; it can also cause problems when data becomes outdated between syncs.
Another challenge is data governance. Regulations such as GDPR place restrictions on where sensitive data may reside, so organizations often cannot move it into the cloud or a centralized repository and have to keep it in local storage. As a result, they cannot use that data to derive business insights, yet they keep accumulating operational costs.
These shortcomings ultimately result in a situation where organizations cannot access all their data quickly and effectively, or serve it reliably to downstream applications.
The result is low ROI, delayed project launches, and decreased value alongside increased operational costs.
Performance optimization for faster and cheaper data lake analytics
Data lake query engines are often based on brute force: they scan massive amounts of data, most of which is then filtered out because it is not relevant to the query. Even with partitioning, the organization incurs high costs for the compute resources required to perform those scans. Given the scale of data lake deployments, these costs become a major factor preventing the data lake from delivering value. Data consumers often complain about slow query performance, which drives data teams to throw more compute at queries, pushing costs even higher.
Reducing the cost while improving performance is key to getting value and more business insights from data.
Query (and cost) acceleration
Faster business insights mean competitive advantage, and query optimization is one way to get them sooner. It relies on a time-honored technique for making queries run faster in data systems: processing less data by skipping whatever is not relevant. Queries and jobs accessing data tables can then omit large swathes of data, which significantly speeds up query execution.
This can be achieved in a number of ways. One of them is for the data lake to automatically collect metadata about its files, for example, minimum and maximum values per column. At query time, the engine can skip whole chunks of data just by reading that metadata, without ever opening the underlying files.
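As a minimal sketch of the idea, the snippet below uses Parquet row-group statistics to skip data without reading it. The file name, column name, and filter value are assumptions for illustration, and it presumes the "event_date" column is stored as an ISO-formatted string.

```python
# Minimal sketch of metadata-based data skipping: row groups whose min/max
# statistics cannot match the filter are never read.
# "events.parquet", "event_date", and the target date are assumptions.
import pyarrow.parquet as pq

TARGET_DATE = "2024-01-15"  # hypothetical filter value

pf = pq.ParquetFile("events.parquet")
date_idx = pf.schema_arrow.get_field_index("event_date")

matching_row_groups = []
for rg in range(pf.metadata.num_row_groups):
    stats = pf.metadata.row_group(rg).column(date_idx).statistics
    # Only the footer metadata is inspected here, not the column data itself.
    if stats is None or (stats.min <= TARGET_DATE <= stats.max):
        matching_row_groups.append(rg)

# Read only the row groups that might contain the target date.
table = pf.read_row_groups(matching_row_groups)
print(f"Scanned {len(matching_row_groups)} of {pf.metadata.num_row_groups} row groups")
```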
Another query acceleration method is called dynamic indexing. It essentially breaks data across any column into nano blocks and automatically assigns the optimal index to each nano block. This novel indexing technology makes for very fast execution of queries without the need to model data or move it to optimized data platforms.
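The exact implementation is vendor-specific, but the idea can be illustrated with a toy sketch: split a column into small blocks and pick an index type for each block based on its local statistics. The block size, thresholds, and index names below are illustrative assumptions, not any product's actual algorithm.

```python
# Toy illustration of per-block index selection; block size, thresholds,
# and index types are illustrative assumptions.
from collections import Counter

BLOCK_SIZE = 10_000  # "nano block" size, chosen arbitrarily for the sketch

def choose_index(block):
    if len(set(block)) <= 32:
        return "bitmap"        # few distinct values: a bitmap index is compact
    if block == sorted(block):
        return "min_max"       # already ordered: min/max ranges are enough
    return "value_list"        # otherwise fall back to a dense value index

def index_column(values):
    plan = Counter()
    for start in range(0, len(values), BLOCK_SIZE):
        block = values[start:start + BLOCK_SIZE]
        plan[choose_index(block)] += 1
    return plan

# Example: a column with an ordered prefix and a low-cardinality tail.
column = list(range(20_000)) + [1, 2, 3] * 10_000
print(index_column(column))
```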
Another way to optimize queries is to use optimal data formats. When transforming raw data assets into normalized, query-ready formats, the data lake architecture should favor formats that compress data and reduce the storage footprint. This also improves query performance in common data lake analytics services.
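As a minimal sketch, assuming a local CSV file named raw_events.csv and the pyarrow library, converting raw data into compressed, columnar Parquet might look like this:

```python
# Minimal sketch: convert a raw CSV file into compressed, columnar Parquet.
# The file names and compression settings are assumptions for illustration.
import pyarrow.csv as pv
import pyarrow.parquet as pq

table = pv.read_csv("raw_events.csv")   # raw, row-oriented input
pq.write_table(
    table,
    "events.parquet",
    compression="snappy",               # cheap-to-decode compression
    row_group_size=1_000_000,           # row groups sized for efficient scans
)
```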
When you're choosing a query engine platform for managing your data lake, look for a provider that offers a workload management solution. Workload management is critical to driving insights while cutting operational spending: it automatically chooses the right optimizations across your workloads, including prioritizing queries, applying smart indexing, and dynamically caching data sets.
These platforms can often optimize memory so that the most frequently accessed data is kept in the fastest tier. This speeds up data retrieval and cuts the time to business insights.
Additionally, look for a platform with predictive data retrieval capabilities. Such a solution assesses which types of data are accessed most often and places them in the fastest memory tier as well, from where they can be quickly retrieved.
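The underlying idea can be sketched with a simple frequency-based cache. This is a toy model under assumed capacity and eviction rules; real platforms use far more sophisticated prediction.

```python
# Toy sketch of keeping the most frequently accessed items in a small,
# fast cache; capacity and eviction policy are illustrative assumptions.
from collections import Counter

class HotCache:
    def __init__(self, capacity=100):
        self.capacity = capacity
        self.hits = Counter()   # access counts per key
        self.fast = {}          # simulated "fast memory" tier

    def get(self, key, load_from_slow_storage):
        self.hits[key] += 1
        if key in self.fast:
            return self.fast[key]               # served from fast memory
        value = load_from_slow_storage(key)     # fall back to slow storage
        if key in dict(self.hits.most_common(self.capacity)):
            if len(self.fast) >= self.capacity:
                coldest = min(self.fast, key=self.hits.__getitem__)
                del self.fast[coldest]          # evict the least-used entry
            self.fast[key] = value              # promote a hot key
        return value

cache = HotCache(capacity=2)
print(cache.get("daily_sales", lambda k: f"loaded {k} from object storage"))
```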
Storage and file sizes
One of the best practices for reducing analytics processing costs is to reduce the amount of data stored and scanned, since analytics platforms often charge by the amount of data scanned. This improves query performance, too. Reduced storage is achieved with the help of optimized data formats such as Apache Parquet, a file format specifically designed for querying large amounts of data that is agnostic to the data processing framework, data model, or programming language.
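As a small sketch, assuming a hypothetical partitioned Parquet dataset in object storage, reading only the needed columns and partitions keeps the scanned volume, and therefore the bill, small:

```python
# Minimal sketch: read only the columns and partitions a query needs,
# so the engine scans a fraction of the stored data.
# The dataset path, column names, and filter values are assumptions.
import pyarrow.dataset as ds

dataset = ds.dataset("s3://my-lake/events/", format="parquet", partitioning="hive")

table = dataset.to_table(
    columns=["user_id", "revenue"],              # column pruning
    filter=ds.field("region") == "eu-west-1",    # partition/predicate pushdown
)
print(table.num_rows)
```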
Also, pay attention to file sizes. Analytics engines calculate usage on a per-file basis. Microsoft recommends organizing data into larger files (256 MB to 100 GB) for better performance, though some engines might have a hard time processing files larger than 100 GB. Sometimes, Microsoft notes, data pipelines have limited control over raw data that arrives as a large number of small files; a common remedy is to compact those small files into larger ones, as in the sketch below.
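A minimal compaction sketch with pyarrow; the input and output paths and the target row counts are assumptions for illustration.

```python
# Minimal sketch: compact many small Parquet files into fewer, larger ones.
# Paths and target file/row-group sizes are illustrative assumptions.
import pyarrow.dataset as ds

small_files = ds.dataset("raw/small_files/", format="parquet")

# Rewrite the dataset, packing rows into fewer, larger output files.
ds.write_dataset(
    small_files,
    "curated/compacted/",
    format="parquet",
    max_rows_per_file=5_000_000,      # aim for a few hundred MB per file
    max_rows_per_group=1_000_000,
)
```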
Draining the “data swamp”: Pipelines
Many organizations have already decided to drain the data lake by freeing it of data sources that don't belong there, using it only for the temporary storage of data from new sources. Once the organization decides on the use cases for that data, it builds data pipelines that deliver the data outside the data lake for simpler, more efficient delivery.
Draining the data lake also places greater emphasis on the data pipelines themselves. Companies increasingly reconsider their architecture and adopt one centered on analytic pipelines to prevent the data lake from becoming a big data swamp.
Data pipelines complement cloud data lakes, which are then used only for storing and preprocessing data. Pipelines orchestrate the data, making sure that data from multiple sources is properly transformed, blended, governed, and delivered before the analytics even begins. They process and deliver data to applications as soon as it arrives, freeing the data lake from that burden.
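In its simplest form, such a pipeline is just a sequence of transform, govern, and deliver steps applied to each batch as it arrives. The sketch below is conceptual; the step names, records, and destination are assumptions, not a specific tool's API.

```python
# Conceptual sketch of a pipeline that transforms, governs, and delivers
# each batch as it arrives; fields and destinations are assumptions.
def transform(records):
    # Normalize types before analytics (here: cast revenue to float).
    return [{**r, "revenue": float(r["revenue"])} for r in records]

def govern(records):
    # Drop fields that must not leave the source region (illustrative rule).
    return [{k: v for k, v in r.items() if k != "email"} for r in records]

def deliver(records, destination):
    print(f"delivered {len(records)} records to {destination}")

def run_pipeline(batch):
    deliver(govern(transform(batch)), destination="analytics_warehouse")

run_pipeline([{"user_id": 1, "revenue": "9.99", "email": "a@example.com"}])
```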
Conclusion
Companies expect their data lakes to provide the most recent data, reliable analytics, fast business insights, and reduced expenditures while delivering value from the data. A data lake that is not optimized cannot meet those expectations in today's world of constantly inflowing data that demands immediate analysis and reaction.
Data teams should focus on continuous performance and cost optimization, which involves query optimization, reducing the size of the lake, and implementing analytic pipelines.
Big data can deliver value only if one can quickly, affordably, and reliably analyze it.