Building Distributed Pipelines for Data Science Using Kafka, Spark, and Cassandra
Learn how to introduce a distributed data science pipeline in your organization
Sign up before this course sells out!
Building a distributed pipeline is a huge and complex undertaking. Join Andy Petrella and Xavier Tordoir for this practical, hands-on course if you want a pipeline that is scalable, offers fast in-memory processing, handles real-time or streaming data feeds with high throughput and low latency, is well suited for ad hoc queries, can be spread across multiple data centers, allocates resources efficiently, and is designed to accommodate future changes.
What you’ll learn—and how you can apply it
By the end of this course, you’ll have a solid understanding of:
- The most important technologies for a distributed pipeline, when they should be used—and how
- How to integrate scalable technologies into your company’s existing data architecture
- How to build a successful, scalable, elastic, distributed pipeline using a lean approach
This course is for you if…
- You’re a data scientist experienced in data modeling, business intelligence, or traditional data pipelines, and you need to deal with bigger or faster data
- You’re a software or data engineer with experience in architecting solutions in Scala, Java, or Python and you need to integrate scalable technologies in your company’s architecture
Prerequisites:
- Intermediate knowledge of an object-oriented language and basic knowledge of a functional programming language, as well as basic experience with a JVM
- Understanding of classic web architecture and service-oriented architecture
- Basic understanding of ETL, streaming data, and distributed data architectures
- Intermediate understanding of Docker and UNIX, as well as some basic knowledge about networks (IP, DNS, SSH, etc.)
Schedule
- Day 1
  - Introduction, Spark, Spark Notebook, and Kafka
  - Assignment #1
- Day 2
  - Streaming: Spark, Kafka, and Cassandra
  - Data analysis and external libraries
  - Assignment #2
- Day 3
  - Microservices, cluster management, job orchestration, and live demo of end-to-end distributed pipeline
  - Final discussion & wrap up
Register now
Participate in this workshop from the convenience of your home, your office… whatever environment you find most comfortable and conducive to an intensive educational experience.
With post-course support: $799
Individual ticket plus the ability to correspond with the instructors about the content of the course for 2 weeks after the course ends. (Consulting for specific use cases is not included.)
If you already have a ticket and would like to add post-course support, please contact customer service.
Group ticket
Working as a team? Learn as a team.
Taking this course as a team ensures that everyone is on the same page and understands both the immediate and long-term goals of your project. Exploring new ideas and collaborating on exercises together is a great team-building experience; everyone on your team will have the opportunity to ask questions, discuss use cases, and learn from other participants.
For group tickets and enterprise licensing, please contact onlinetraining@oreilly.com.
Once you have registered, further details about joining the workshop will be available in your members.oreilly.com account. After the event concludes, access to the recording of the event will be added to your account.