Chapter 7. Securing Your Data Lake

Security architectures in the cloud are very different from those on-premises. Today, the cloud is reasonably secure, but it has taken some time to get here.

When the public cloud began, it lacked security functionality. For example, AWS EC2-Classic instances received public IP addresses. A few years later, Amazon introduced its virtual private cloud (Amazon VPC), which included private subnetting and boundaries. Since then, the cloud has matured from typical compute in which security is a minor consideration to an environment with the extensibility and functionality that allow a security professional to more reasonably protect infrastructure.

We’ve seen three generations of cloud security thus far. In the first generation, there was little security, and what existed was typically ad hoc. The second generation introduced virtual private clouds and third-party security services such as application firewalls. The third generation has added logging and event-driven approaches, such as setting up Lambdas to trigger on certain events, along with AWS Gatekeeper. There is room for a fourth generation to improve cloud security even more.

It’s very important to recognize that the cloud’s security primitives differ from those of on-premises setups, and to adopt those primitives to create a secure data infrastructure.

When securing a data lake in the cloud for the first time, security engineers need to:

  • Understand the different parties involved in cloud security.

  • Expect a lot ...

