Chapter 5. Automate Your Infrastructure

Christiano Anderson

One of the roles of a data engineer is to deploy data pipelines on a cloud provider such as Amazon Web Services (AWS), Google Cloud Platform (GCP), or Microsoft Azure. The web console makes it easy to link components together and stand up a full data pipeline.

Taking AWS as an example, we might use API Gateway as our representational state transfer (REST) interface for data ingestion, a few Lambda functions to validate the ingested data, Kinesis Data Streams to provide real-time analysis, Kinesis Data Firehose to deliver the data, and Simple Storage Service (S3) as the persistence layer. We may also want Athena on top, to query and analyze the stored data.
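To make the wiring concrete, here is a minimal sketch, using boto3 (the AWS SDK for Python), of how three of those components could be created and linked from code rather than from the console. Every name, the region, and the role ARN are hypothetical placeholders; the API Gateway, Lambda, and Athena pieces are omitted for brevity, and a role that Firehose can assume is presumed to exist already (one way to create it appears below).

```python
import boto3

REGION = "us-east-1"                     # assumption: adjust to your region
BUCKET = "my-ingestion-bucket"           # hypothetical names throughout
STREAM = "ingestion-stream"
FIREHOSE_ROLE_ARN = "arn:aws:iam::123456789012:role/firehose-delivery-role"  # placeholder

s3 = boto3.client("s3", region_name=REGION)
kinesis = boto3.client("kinesis", region_name=REGION)
firehose = boto3.client("firehose", region_name=REGION)

# Persistence layer: an S3 bucket to hold the delivered data.
s3.create_bucket(Bucket=BUCKET)

# Real-time layer: a Kinesis data stream for incoming records.
kinesis.create_stream(StreamName=STREAM, ShardCount=1)
kinesis.get_waiter("stream_exists").wait(StreamName=STREAM)

# Delivery layer: a Firehose delivery stream that reads from the
# Kinesis stream and writes batches into the S3 bucket.
stream_arn = kinesis.describe_stream(StreamName=STREAM)["StreamDescription"]["StreamARN"]
firehose.create_delivery_stream(
    DeliveryStreamName="ingestion-delivery",
    DeliveryStreamType="KinesisStreamAsSource",
    KinesisStreamSourceConfiguration={
        "KinesisStreamARN": stream_arn,
        "RoleARN": FIREHOSE_ROLE_ARN,
    },
    ExtendedS3DestinationConfiguration={
        "RoleARN": FIREHOSE_ROLE_ARN,
        "BucketARN": f"arn:aws:s3:::{BUCKET}",
    },
)
```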

That example alone involves about six components, and each may require additional setup. On top of that, we have to manage a number of identity and access management (IAM) roles and access-control lists (ACLs) to handle permissions. We could do everything by clicking through the console and linking the components together, and that is indeed the fastest way to build the infrastructure if you need only one simple ingestion pipeline.
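To give a feel for the IAM work involved: even the single Firehose-to-S3 link needs a role that the Firehose service can assume, plus a policy granting it read access to the stream and write access to the bucket. A hedged sketch with boto3, with the role name, policy name, and resource ARNs as illustrative assumptions:

```python
import json
import boto3

iam = boto3.client("iam")

# Trust policy: allow the Firehose service to assume this role.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "firehose.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}

iam.create_role(
    RoleName="firehose-delivery-role",   # hypothetical name
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)

# Permissions policy: read from the Kinesis stream, write into the bucket.
iam.put_role_policy(
    RoleName="firehose-delivery-role",
    PolicyName="ingestion-pipeline-access",
    PolicyDocument=json.dumps({
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["kinesis:DescribeStream",
                           "kinesis:GetShardIterator",
                           "kinesis:GetRecords"],
                "Resource": "arn:aws:kinesis:us-east-1:123456789012:stream/ingestion-stream",
            },
            {
                "Effect": "Allow",
                "Action": ["s3:PutObject", "s3:GetBucketLocation", "s3:ListBucket"],
                "Resource": ["arn:aws:s3:::my-ingestion-bucket",
                             "arn:aws:s3:::my-ingestion-bucket/*"],
            },
        ],
    }),
)
```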

But if you have to set everything up by hand again, and again, and again, it takes a lot of extra time and creates more opportunities to make a mistake, or even to open a security hole. That is why data engineers should automate their infrastructure, treating it as code that can be reviewed, versioned, and replayed.
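One minimal way to capture that idea, sketched here with boto3 and a hypothetical ensure_bucket helper, is to make each provisioning step idempotent so the same script can be replayed safely across environments:

```python
import boto3
from botocore.exceptions import ClientError

def ensure_bucket(name: str, region: str = "us-east-1") -> None:
    """Create the bucket only if it does not already exist (idempotent)."""
    s3 = boto3.client("s3", region_name=region)
    try:
        s3.head_bucket(Bucket=name)      # raises ClientError if missing
        print(f"bucket {name} already exists, skipping")
    except ClientError:
        s3.create_bucket(Bucket=name)
        print(f"created bucket {name}")

# The same script now provisions dev, staging, and prod identically.
for env in ("dev", "staging", "prod"):
    ensure_bucket(f"my-ingestion-bucket-{env}")  # hypothetical naming scheme
```

In practice, a dedicated infrastructure-as-code tool such as Terraform or AWS CloudFormation gives you this repeatability, plus state tracking and change review, out of the box.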
