Video description
Part 1 is all about data scraping and data mining. You will cover important concepts such as internet browser execution and communication with the server, synchronous and asynchronous requests, parsing the data in the server's response, tools for data scraping, the Python requests module, and more.
Part 2 is designed to reflect the most in-demand Scala skills. It provides an in-depth understanding of core Scala concepts and wraps up with a discussion of MapReduce and ETL pipelines using Spark from AWS S3 to AWS RDS (includes six mini-projects and one Scala Spark project).
Part 3 covers PySpark for data analysis. You will explore Spark RDDs and DataFrames, some Spark SQL queries, the transformations and actions that can be performed on data using RDDs and DataFrames, the Spark and Hadoop ecosystems, and their underlying architecture. You will also learn how to leverage AWS storage, databases, and compute, and how Spark can communicate with different AWS services.
In Part 4, you will use MongoDB to develop an understanding of NoSQL databases. You will explore the basic operations and the MongoDB query, projection, and update operators. We will wind up this section with two projects: developing a CRUD-based application using Django and MongoDB, and implementing an ETL pipeline using PySpark to dump the data into MongoDB.
By the end of this course, you will be able to relate the concepts and practical aspects of the technologies you have learned to real-world problems.
What You Will Learn
- Build an ETL pipeline from AWS S3 to AWS RDS using Spark
- Explore Spark/Hadoop applications, ecosystem, and architecture
- Learn collaborative filtering in PySpark
- Recognize the distinction between synchronous and asynchronous requests
- Understand MongoDB CRUD, query operators, projection operators, and update operators
- Build APIs for CRUD operations in MongoDB through Django
Audience
This course is designed for absolute beginners who want to create intelligent solutions, study with actual data, and enjoy learning theory and then putting it into practice. Data scientists, machine learning experts, and drop shippers will all benefit from this training.
A basic understanding of programming, HTML tags, Python, SQL, and Node.js is required. However, no prior knowledge of data scraping or Scala is needed.
About The Author
AI Sciences: AI Sciences is a team of experts, PhDs, and artificial intelligence practitioners with backgrounds in computer science, machine learning, and statistics. Some work in big companies such as Amazon, Google, Facebook, Microsoft, KPMG, BCG, and IBM.
AI Sciences produces a series of courses dedicated to beginners and newcomers on techniques and methods of machine learning, statistics, artificial intelligence, and data science. They aim to help those who wish to understand techniques more easily and start with less theory and less extended reading. Today, they publish more comprehensive courses on specific topics for wider audiences.
Their courses have successfully helped more than 100,000 students master AI and data science.
Table of contents
- Chapter 1 : Part 1 - Data Scraping and Data Mining for Beginners to Pro with Python
- Chapter 2 : Requests
- Introduction to Python Requests
- Hands-On with Requests
- Extracting Quotes Manually
- Quiz (Extracting Authors)
- Solution (Extracting Authors)
- Pagination
- Quiz (Extracting Author and Quotes)
- Solution 01 (Extracting Author and Quotes)
- Solution 02 (Extracting Author and Quotes)
- Ajax Requests
- Ajax Requests for Cricinfo
- Ajax Requests Pagination
- Quiz (Extracting Top Stats from Cricinfo)
- Solution 01 (Extracting Top Stats from Cricinfo)
- Solution 02 (Extracting Top Stats from Cricinfo)
- Chapter 3 : Beautiful Soup 4 (BS4)
- Introduction to BS4
- Quiz (Difference Between Requests and BS4)
- Solution (Difference Between Requests and BS4)
- Hands-On with BS4
- Extracting Data from Tree
- Extracting Quotes from the Website
- Quiz (Extracting Author Names)
- Solution (Extracting Author Names)
- Attributes of Tags in BS4
- Multi-Valued Attributes of Tags in BS4
- Scraping Movie Names from IMDB
- Quiz (Getting the Ratings, Year, and Name of the Movie)
- Solution 01 (Getting the Ratings, Year, and Name of the Movie)
- Solution 02 (Getting the Ratings, Year, and Name of the Movie)
- Scraping Time, Genre, and Releasing Date from IMDB 01
- Scraping Time, Genre, and Releasing Date from IMDB 02
- Combining Two Requests Data for IMDB
- Movies Recommender System (Creating Movie URL)
- Movies Recommender System (Creating Director URL)
- Movies Recommender System Using BS4 (Getting Top Four Movies)
- Movies Recommender System Using BS4 (Merge All Requests Together)
- Chapter 4 : CSS Selectors
- Introduction to CSS Selectors
- CSS Selectors Hands-On (Tags)
- Quiz (Tags)
- Solution (Tags)
- CSS Selectors Hands-On (Descendants, ID, Class)
- Quiz (Descendants)
- Solution (Descendants)
- Quiz (ID)
- Solution (ID)
- Solution (Class) Part 1
- Solution (Class) Part 2
- CSS Selectors Hands-On (Nested Tags, ID Tags, Class Tags)
- Quiz (Class with Tag)
- Solution (Class with Tag)
- CSS Selectors Hands-On (Comma Separator, Universal Selectors)
- Quiz (Combining Two Selectors)
- Solution (Combining Two Selectors)
- CSS Selectors Hands-On (Sibling Notations and Direct Child)
- Quiz (Adjacent Sibling)
- Solution (Adjacent Sibling)
- Quiz (General Sibling)
- Solution (General Sibling)
- CSS Selectors Hands-On (Child Selectors)
- Quiz (First Child)
- Solution (First Child)
- Quiz (Only Child)
- Solution (Only Child)
- Quiz (Last Child)
- Solution (Last Child)
- CSS Selectors Hands-On (Negations, Attributes)
- Quiz (Negation)
- Solution (Negation)
- CSS Selectors Hands-On (Attributes, Attributes Values)
- Quiz (Attributes Values)
- Solution (Attributes Values)
- CSS Selectors Hands-On (Attributes Wild Cards Values)
- Quiz (Attributes Wild Card)
- Solution (Attributes Wild Card)
- Chapter 5 : Scrapy
- Introduction to Scrapy
- Comparison of Scrapy and Requests
- Scrapy at a Glance Documentation
- Getting Started with Scrapy
- Running Documentation Spider 1
- Running Documentation Spider 2
- Writing Spider from Scratch
- Understanding the Response (URL, Status)
- Understanding the Response (Headers)
- Understanding the Response (Values in Headers)
- Understanding the Response (Body)
- Understanding the Response (Request)
- Understanding the Response (Meta)
- Understanding the Response (flags, certificate, ip_address, copy)
- Understanding the Response (replace, urljoin, follow, follow_all)
- Response CSS and Scrapy Shell
- Extracting quotes with Scrapy
- Understanding Nested Selectors
- Extracting the Author and Quotes
- Checking for Next Page
- Checking for Next Page in Spider
- Checking for Next Page URL
- Scraping Quotes from Next Pages
- Exporting Extracted Data
- Quiz (Get the Tags)
- Solution (Get the Tags)
- Next Website
- CSS Selectors for Movie Names and URLs
- Combined CSS Selectors for Movie Names and URLs
- Sending Request to the Film Info Page
- Merge Data from Two Callbacks
- Extracting Movie Duration and Genres
- Exporting the Extracted Data
- Quiz (Extracting the Year)
- Solution (Extracting the Year)
- Getting Director Name and URL
- Getting Top Four Movies of Directors
- Extracting Data Anomaly (dont_filter Flag)
- Chapter 6 : Scrapy Project
- Hugo Boss Website for Scraping
- Understanding Site Structure
- Writing CSS Selectors for Listings
- Listings in Scrapy Shell
- Sending Request to Listings URLs
- Extracting Products URL from the Listings
- Sending Requests to Products of the Listings
- Writing CSS to Get the Product Info
- Getting the Bigger Images of the Product
- Checking Next Page URL
- Adding Pagination to Spider and Running It
- Output of the Spider
- Chapter 7 : Selenium
- Introduction to Selenium
- Getting Started with Selenium
- Configuring the Webdriver
- Extracting Quotes with Selenium
- Extracting Quotes and Author Names
- Quiz (Extracting Quotes)
- Solution (Extracting Quotes)
- Clicking on Button
- Pagination and Extracting Data
- Exception Handling for Unavailable Element
- Navigating the Website for Login
- Quiz (Login and Extract Quote)
- Solution (Login and Extract Quote)
- Chapter 8 : Project Selenium
- Chapter 9 : Part 2 - Scala and Spark - Master Big Data with Scala and Spark
- Chapter 10 : Scala Overview
- What is Scala
- Scala Setup (Local Machine)
- Scala Setup (Online)
- Variables in Scala
- Arithmetic Operations on Variables-1
- Arithmetic Operations on Variables-2
- Quiz (Arithmetic Operations)
- Solution (Arithmetic Operations)
- Quiz (Strings)
- Solution (Strings)
- Type Casting
- Taking Input from User
- Quiz (User Input and Type Casting)
- Solution (User Input and Type Casting)
- Chapter 11 : Flow Control
- Overview of Control Statements
- If Else Statements
- Conditions in If
- Quiz (If Statement)
- Solution (If Statement)
- Nested If Else
- Quiz (Nested If Else)
- Solution (Nested If Else)
- Logical Operators
- Quiz (Logical Operators)
- Solution (Logical Operators)
- If Else If
- Quiz (If Else If)
- Solution (If Else If)
- Overview of Loops
- Overview of While Loop
- While Loop
- Quiz (While Loop)
- Solution 1 (While Loop)
- Solution 2 (While Loop)
- Do While Loop
- For Loop
- Quiz 1 (For Loop)
- Solution 1 (For Loop)
- Quiz 2 (For Loop)
- Solution 2 (For Loop)
- Break
- Break Fix
- Project Overview for Flow control
- Project Solution Design
- Project Solution Code 1
- Project Solution Code 2
- Project Solution Code 3
- Project Solution Code 4
- Chapter 12 : Functions
- Overview of Functions
- Writing Addition Function
- Quiz (Basic Function)
- Solution (Basic Function)
- Functions Common Issues
- Named Arguments
- Quiz (String Concatenation Function)
- Solution (String Concatenation Function)
- Quiz (Dividing Code in Functions)
- Solution (Dividing Code in Functions)
- Default Arguments
- Quiz (Default Arguments)
- Solution (Default Arguments)
- Anonymous Functions
- Quiz (Anonymous Functions)
- Solution (Anonymous Functions)
- Scopes
- Project Overview for Functions
- Checking Credentials
- Prompting the menu
- Basic Functions
- Breaking Code in More Functions
- Final Run (Functions)
- Chapter 13 : Classes
- Chapter 14 : Data Structures
- Introduction of Data Structures
- Lists Introduction
- Lists Create and Delete Elements
- Lists Take
- ListBuffer Introduction
- Add Data in ListBuffer
- Remove Data from ListBuffer
- Take Data from ListBuffer
- Project Overview for Data Structures
- Project Architecture Discussion
- Project Architecture Implementation
- User Input for Objects
- Implementing the Control Flow
- Creating Required Functions Inside Class
- Overview of Maps
- Creating Maps
- Check Key in Map
- Update Value in Map
- Add and Remove Items from Maps
- Iterating on Maps
- Project Overview for Data Structures
- Project Architecture for Data Structures
- Project Structure Code
- Using Maps for Word Count
- Final Run
- Sets Overview
- Add and Remove Item from the Set
- Set Operations
- Overview of Stack
- Push and Pop in Stack
- Stack Attributes
- Project Overview
- Project Architecture
- Extra Closing Bracket Use Case
- Extra Starting Bracket Use Case
- Chapter 15 : Project for Scala and Spark
- Project Introduction
- Why Spark
- Hadoop Ecosystem
- Spark Architecture
- Spark Ecosystem
- DataBricks Account
- Setting up DataBricks Cluster
- Spark Local Setup
- Spark Hadoop Setup
- Spark RDDs
- Spark RDDs (textFile, collect)
- Spark Local Run
- Understanding Map
- Understanding Flat Map
- Understanding Reduce by Key
- Word Count Example
- Spark DFs
- Spark DF Read Data
- Spark Print Schema, Select
- Spark GroupBy
- Spark DF Write
- Creating S3 Bucket
- Creating Database in RDS
- Performing ETL
- Chapter 16 : Part 3 - PySpark and AWS - Master Big Data with PySpark and AWS
- Chapter 17 : Introduction to Hadoop, Spark Ecosystems and Architectures
- Why Spark
- Hadoop Ecosystem
- Spark Architecture and Ecosystem
- DataBricks Signup
- Create DataBricks Notebook
- Download Spark and Dependencies
- Java Setup on Windows
- Python Setup on Windows
- Spark Setup on Windows
- Hadoop Setup on Windows
- Running Spark on Windows
- Java Download on MAC
- Installing JDK on MAC
- Setting Java Home on MAC
- Java check on MAC
- Installing Python on MAC
- Set Up Spark on MAC
- Chapter 18 : Spark RDDs
- Spark RDDs Introduction
- Creating Spark RDD
- Running Spark Code Locally
- RDD Map (Lambda)
- RDD Map (Simple Function)
- Quiz (Map)
- Solution 1 (Map)
- Solution 2 (Map)
- RDD FlatMap
- RDD Filter
- Quiz (Filter)
- Solution (Filter)
- RDD Distinct
- RDD GroupByKey
- RDD ReduceByKey
- Quiz (Word Count) with Spark RDDs
- Solution (Word Count) with Spark RDDs
- RDD (Count and CountByValue)
- RDD (saveAsTextFile)
- RDD (Partition)
- Finding Average-1
- Finding Average-2
- Quiz (Average)
- Solution (Average)
- Finding Min and Max
- Quiz (Min and Max)
- Solution (Min and Max)
- Project Overview for Spark RDDs
- Total Students
- Total Marks by Male and Female Student
- Total Passed and Failed Students
- Total Enrolments Per Course
- Total Marks Per Course
- Average Marks Per Course
- Finding Minimum and Maximum Marks
- Average Age of Male and Female Students
- Chapter 19 : Spark DFs
- Introduction to Spark DFs
- Creating Spark DFs
- Spark Infer Schema
- Spark Provide Schema
- Create DF from RDD
- Rectifying the Error
- Select DF Columns
- Spark DF withColumn
- Spark DF withColumnRenamed and Alias
- Spark DF Filter Rows
- Quiz (select, withColumn, filter)
- Solution (select, withColumn, filter)
- Spark DF (Count, Distinct, Duplicate)
- Quiz (Distinct, Duplicate)
- Solution (Distinct, Duplicate)
- Spark DF (sort, orderBy)
- Quiz (sort, orderBy)
- Solution (sort, orderBy)
- Spark DF (Group By)
- Spark DF (Group By - Multiple Columns and Aggregations)
- Spark DF (Group By - Visualization)
- Spark DF (Group By - Filtering)
- Quiz (Group By)
- Solution (Group By)
- Quiz (Word Count) with Spark DFs
- Solution (Word Count) with Spark DFs
- Spark DF (UDFs)
- Quiz (UDFs)
- Solution (UDFs)
- Solution (Cache and Persist)
- Spark DF (DF to RDD)
- Spark DF (Spark SQL)
- Spark DF (Write DF)
- Project Overview
- Project (Count and Select)
- Project (Group By)
- Project (Group By, Aggregations, and Order By)
- Project (Filtering)
- Project (UDF and WithColumn)
- Project (Write)
- Chapter 20 : Collaborative Filtering
- Chapter 21 : Spark Streaming
- Chapter 22 : ETL Pipeline
- Chapter 23 : Project - Change Data Capture / Replication On Going
- Introduction to Project
- Project Architecture
- Creating RDS MySQL Instance
- Creating S3 Bucket
- Creating DMS Source Endpoint
- Creating DMS Destination Endpoint
- Creating DMS Instance
- MySQL WorkBench
- Connecting with RDS and Dumping Data
- Querying RDS
- DMS Full Load
- DMS Replication Ongoing
- Stopping Instances
- Glue Job (Full Load)
- Glue Job (Change Capture)
- Glue Job (CDC)
- Creating Lambda Function and Adding Trigger
- Checking Trigger
- Getting S3 File Name in Lambda
- Creating Glue Job
- Adding Invoke for Glue Job
- Testing Invoke
- Writing Glue Shell Job
- Full Load Pipeline
- Change Data Capture Pipeline
- Chapter 24 : Part 4 - MongoDB-Mastering MongoDB for Beginners (Theory and Projects)
- Chapter 25 : Overview
- Chapter 26 : Basic Mongo Operations
- Chapter 27 : Basic Update Operation
- Chapter 28 : Basic Read Operation
- Chapter 29 : Basic Delete Operation
- Chapter 30 : Query and projection operators
- Module Introduction
- $eq Operator
- $gt Operator
- $lt Operator
- $in Operator
- $ne Operator
- $nin Operator
- $and Operator
- $or Operator
- $not Operator
- $exists Operator
- $type Operator
- $expr Operator
- $mod Operator
- $text Operator
- $all Operator
- $elemMatch Operator
- $size Operator
- $ Operator
- $slice Operator
- Quiz ($eq)
- Solution ($eq)
- Quiz ($gt)
- Solution ($gt)
- Quiz ($gte)
- Solution ($gte)
- Quiz ($in)
- Solution ($in)
- Quiz ($lt)
- Solution ($lt)
- Quiz ($lte)
- Solution ($lte) Part 1
- Solution ($lte)
- Quiz ($ne)
- Solution ($ne)
- Quiz ($nin)
- Solution ($nin) Part 1
- Solution ($nin) Part 2
- Solution ($nin) Part 3
- Quiz ($and)
- Solution ($and)
- Quiz ($or)
- Solution ($or) Part 1
- Solution ($or) Part 2
- Quiz ($not)
- Solution ($not) Part 1
- Solution ($not) Part 2
- Solution ($not) Part 3
- Quiz ($exists)
- Solution ($exists)
- Quiz ($expr)
- Solution ($expr)
- Quiz ($mod)
- Solution ($mod)
- Quiz ($text)
- Solution ($text)
- Quiz ($all)
- Solution ($all) Part 1
- Solution ($all) Part 2
- Quiz ($elemMatch)
- Solution ($elemMatch) Part 1
- Solution ($elemMatch) Part 2
- Quiz ($size)
- Solution ($size)
- Chapter 31 : Update Operators
- $currentDate Operator
- $inc Operator Part 1
- $inc Operator Part 2
- $min Operator
- $max Operator
- $mul Operator
- $rename Operator
- $set Operator Part 1
- $set Operator Part 2
- $unset Operator
- $addToSet Operator
- $pop Operator
- $pull Operator
- $push Operator
- $each Operator
- $position Operator
- $sort Operator
- Quiz 1 (Update Operators)
- Solution 1 (Update Operators) Part 1
- Solution 1 (Update Operators) Part 2
- Solution 1 (Update Operators) Part 3
- Solution 1 (Update Operators) Part 4
- Quiz 2 (Update Operators)
- Solution 2 (Update Operators) Part 1
- Solution 2 (Update Operators) Part 2
- Solution 2 (Update Operators) Part 3
- Chapter 32 : Mongo with Node
- Installing Node on Local Machine
- Installing VS Code
- Mongo Atlas
- Create Cluster on Mongo Atlas
- Creating User in Atlas
- Network Access
- Database and Collections
- Connect Node with Mongo
- Get Databases
- Insert in Mongo Using Node
- Read from Mongo Using Node
- Update in Mongo Using Node
- Delete from Mongo Using Node
- Chapter 33 : Mongo with Python
- Chapter 34 : Django with Mongo
- Chapter 35 : Spark with Mongo
Product information
- Title: 50 Hours of Big Data, PySpark, AWS, Scala, and Scraping
- Author(s): AI Sciences
- Release date: March 2022
- Publisher(s): Packt Publishing
- ISBN: 9781803237039