Best practices for data lakes
How to build, maintain, and derive value from your Hadoop data lake.
Hadoop is an extraordinary technology. The types of analyses that were previously possible only on costly combinations of proprietary software and hardware, as part of cumbersome enterprise data warehouses (EDWs), can now be performed by organizations of all types and sizes simply by deploying free open source software on commodity hardware clusters.
Early use cases for Hadoop were trumpeted as successes based on their low cost and agility. But as more mainstream use cases emerged, organizations found that they still needed the management and governance controls that dominated in the EDW era. The data lake has become a middle ground between EDWs and “data dumps,” offering systems that are still agile and flexible but that have the safeguards and auditing features necessary for business-critical data.
Integrated data lake management solutions like Bedrock and Mica are now delivering the necessary controls without making Hadoop as slow and inflexible as its predecessor solutions. Use cases are emerging even in sensitive industries like health care, financial services, and retail.
Enterprises are also looking ahead. They see that to be truly valuable, the data lake can’t be a silo, but must be one of several platforms in a carefully considered end-to-end modern enterprise data architecture. Just as you must think of metadata from an enterprise-wide perspective, you need to be able to integrate your data lake with external tools that are part of your enterprise-wide data view. Only then will you be able to build a data lake that is open, extensible, and easy to integrate into your other business-critical platforms.
A checklist for success
Are you ready to build a data lake? Here is a checklist of what you need to make sure you are doing so in a controlled yet flexible way.
Business-benefit priority list
As you start a data lake project, you need to have a very strong alignment with the business. After all, the data lake needs to provide value that the business is not getting from its EDW.
This may come from solving pain points or from creating net new revenue streams that you can enable business teams to deliver. Being able to define and articulate this value from a business standpoint, and to convince partners to join you on the journey, is very important to your success.
Architectural oversight
Once you have the business alignment and you know what your priorities are, you need to define the upfront architecture: what are the different components you will need, and what will the end technical platform look like? Keep in mind that this is a long-term investment, so you need to think carefully about where the technology is moving. Naturally, you may not have all the answers upfront, so it might be necessary to perform a proof of concept to get some experience and to tune and learn along the way. An especially important aspect of your architectural plans is a good data-management strategy that covers data governance and metadata, and how you will capture them. This is critical if you want to build a managed and governed data lake instead of the much-maligned “data swamp.”
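To make the metadata point concrete, here is a minimal sketch of capturing ingestion-time metadata alongside each dataset landed in the lake. It is a hypothetical Python example; the field names and the JSON-file “catalog” are illustrative stand-ins for whatever metadata repository your architecture actually uses.

```python
# A minimal sketch of recording technical metadata at ingestion time so each
# dataset in the lake carries its lineage. Field names are illustrative only.
import json
import datetime

def build_ingestion_record(source_system, target_path, schema, row_count):
    """Assemble a metadata record to register alongside the ingested data."""
    return {
        "source_system": source_system,
        "target_path": target_path,
        "schema": schema,                      # column name -> type
        "row_count": row_count,
        "ingested_at": datetime.datetime.utcnow().isoformat() + "Z",
    }

record = build_ingestion_record(
    source_system="crm_exports",
    target_path="/data/raw/crm/2016-03-01/",
    schema={"customer_id": "string", "signup_date": "date"},
    row_count=125000,
)

# Persist the record to a metadata catalog (here, simply a JSON file).
with open("ingestion_metadata.json", "w") as fh:
    json.dump(record, fh, indent=2)
```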
Security strategy
Outline a robust security strategy, especially if your data lake will be a shared platform used by multiple lines of business or by both internal and external stakeholders. Data privacy and security are critical, especially for sensitive data such as protected health information (PHI) and personally identifiable information (PII), and you may have regulatory requirements to comply with. You must also think about multitenancy: certain users might not be permitted to share data with other users. If you are serving multiple external audiences, each customer might have individual data agreements with you, and you need to honor them.
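As one illustration of the multitenancy point, the sketch below masks sensitive fields before a record is exposed to a tenant that is not entitled to see them. The tenant names, field names, and policy table are hypothetical; in practice these controls would live in your security and entitlement layer rather than in application code.

```python
# A minimal sketch of per-tenant masking of PII fields before sharing data.
# Tenant names, fields, and the policy table are hypothetical placeholders.
import hashlib

TENANT_POLICIES = {
    "analytics_team": {"mask_fields": ["ssn", "email"]},
    "billing_team": {"mask_fields": []},  # entitled to see raw PII
}

def mask_value(value):
    """Replace a sensitive value with a truncated one-way hash."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()[:16]

def apply_policy(record, tenant):
    """Return a copy of the record with the tenant's masked fields obscured."""
    masked = dict(record)
    for field in TENANT_POLICIES[tenant]["mask_fields"]:
        if field in masked:
            masked[field] = mask_value(masked[field])
    return masked

row = {"customer_id": "C-1001", "ssn": "123-45-6789", "email": "a@example.com"}
print(apply_policy(row, "analytics_team"))
```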
I/O and memory model
As part of your technology platform and architecture, you must think about what the scale-out capabilities of your data lake will look like. For example, will you decouple the storage and compute layers? If so, what is the persistent storage layer? Enterprises are already using cloud storage such as Azure or Amazon S3 to store data persistently, spinning up clusters dynamically and spinning them down again when processing is finished. If you plan to operate this way, you need to thoroughly understand your data-ingestion throughput requirements, which will dictate the throughput of storage and network as well as whether you can process the data in a timely manner. You need to articulate all of this upfront.
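The sketch below illustrates the decoupled pattern described above: a transient Spark job that reads raw data from persistent object storage, writes curated results back, and then releases its compute. It assumes a PySpark environment already configured with S3 access; the bucket and paths are hypothetical placeholders.

```python
# A minimal sketch of decoupled storage and compute: an ephemeral Spark job
# reads raw events from object storage, aggregates them, and writes results
# back before the cluster is torn down. Bucket and paths are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("daily-ingest").getOrCreate()

# Persistent storage lives in S3; the compute cluster is transient.
raw = spark.read.json("s3a://example-lake/raw/events/2016-03-01/")

daily_counts = (
    raw.groupBy("event_type")
       .agg(F.count("*").alias("events"))
)

daily_counts.write.mode("overwrite").parquet(
    "s3a://example-lake/curated/daily_counts/2016-03-01/"
)

spark.stop()
```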
Workforce skill set evaluation
For any data lake project to be successful, you must have the right people. You need experts who have hands-on experience building data platforms, along with extensive experience in data management and data governance, so they can define the policies and procedures upfront. You also need the data scientists who will be consumers of the platform; bring them in as stakeholders early in the process of building the data lake to hear their requirements and how they would prefer to interact with it when it is finished.
Operations plan
Think about your data lake from a service-level agreement (SLA) perspective: what SLA requirements will your business stakeholders expect, especially for business-critical, revenue-impacting applications? You need proper SLAs covering uptime and covering data being ingested, processed, and transformed in a repeatable manner. Going back to the point about people and skills, it’s critical to assemble an operations team of people experienced in managing these environments so you can support the SLAs and meet the business requirements.
Communications plan
Once you have the data lake platform in place, how will you advertise the fact and bring in additional users? You need to get different business stakeholders interested and show some successes for your data lake environment to flourish, as the success of any IT platform ultimately is based upon business adoption.
Disaster recovery plan
Depending on the business criticality of your data lake and on the SLAs you have in place with your different user groups, you need a disaster recovery plan that can support them.
Five-year vision
Given that the data lake is going to be a key foundational platform for the next generation of data technology in enterprises, organizations need to plan how to incorporate data lakes into their long-term strategies. We see data lakes overtaking EDWs as organizations attempt to be more agile and generate more timely insights from more of their data. Organizations must be aware that data lakes will eventually become hybrids of data stores, including HDFS, NoSQL, and graph databases. They will also eventually support real-time data processing and streaming analytics: not just rollups of the data in a streaming manner, but machine-learning models that produce analytics online as the data comes in and generate insights in either a supervised or unsupervised manner. Deployment options will also increase, with companies that don’t want to move into public clouds building private clouds within their own environments, leveraging patterns established in public clouds.

Across all these dimensions, enterprises need to plan for a robust set of capabilities: to ingest and manage the data; to store and organize it; and to prepare, analyze, secure, and govern it. This is essential no matter what underlying platform you choose, whether streaming, batch, object storage, flash, in-memory, or file, and it must be provided consistently through all the evolutions the data lake will undergo over the next few years.
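To ground the “online” analytics idea mentioned above, here is a minimal sketch of a model that is updated incrementally as batches of records arrive, rather than retrained from scratch. It uses scikit-learn’s partial_fit as a stand-in for whatever streaming framework and model you would actually deploy; the synthetic data stream is purely illustrative.

```python
# A minimal sketch of online (incremental) learning on streaming data.
# The data generator is a hypothetical stand-in for a real event stream.
import numpy as np
from sklearn.linear_model import SGDClassifier

model = SGDClassifier()          # linear model trained incrementally
classes = np.array([0, 1])       # label set must be declared for partial_fit

def stream_batches(n_batches=10, batch_size=100, seed=42):
    """Yield small batches of (features, labels) as if they arrived from a stream."""
    rng = np.random.RandomState(seed)
    for _ in range(n_batches):
        X = rng.rand(batch_size, 5)
        y = (X[:, 0] + X[:, 1] > 1.0).astype(int)
        yield X, y

for X_batch, y_batch in stream_batches():
    # Update the model with each incoming batch instead of retraining in bulk.
    model.partial_fit(X_batch, y_batch, classes=classes)

print("learned coefficients:", model.coef_)
```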
Download the full report Architecting Data Lakes by Alice LaPlante and Ben Sharma to learn more.
Editor’s note: Ben Sharma is speaking on the topic of “Building a modern data architecture” at Strata + Hadoop World in San Jose, March 28-31, 2016.
This post is a collaboration between O’Reilly and Zaloni. See our statement of editorial independence.