Smarter pandas for Data Science
Published by Pearson
Rising to the Challenge of Data Science
- Develop an understanding of the underlying design of the pandas API
- Dive into the “why” behind pandas
- Reduce reliance on external resources like StackOverflow
- Work through core concepts and progress into more complex analyses
pandas is a surprisingly coherent and consistently designed tool. Yet users constantly run into errors they don’t understand and are forced to confront design decisions in the API that they cannot comprehend. This is because most pandas introductions present the API at only the most superficial level, suggesting the tool is merely a collection of tiny, constantly changing details you have to memorize. Of course people will be confused and frustrated!
In this course, we will look at a series of problems in pandas, starting with simple examples and quickly progressing to complex analyses. Each step of the way, we'll discuss a core concept in pandas and provide the “why” behind the pandas API, with the goal of transforming pandas from a collection of obscure API details into an obvious, coherent tool for columnar data analysis.
Rising to the Challenge of Data Science series
Data science is more than creating complex models or analyses from data. An effective data scientist must navigate, normalize, clean, analyze, draw conclusions from, and communicate their data instead of relying on data engineers (cleaning/storage) or software developers (dashboarding/communication). To become an independent, full stack data scientist, it is essential to master all of the skills and mechanics in this series. Doing so lets you write more robust, maintainable, and extensible code, and also communicate your results to stakeholders and colleagues yourself, keeping the message within your own control.
- Smarter Shell for Data Science
- Smarter SQL for Data Science
- Smarter Statistics for Data Science
- Smarter Dashboarding for Data Science
- Smarter pandas for Data Science
- Smarter Plotting for Data Science
What you’ll learn and how you can apply it
By the end of the live online course, you’ll understand:
- The basic collection types in Python and pandas and when to use them
- Which operations are fast or slow
- How to know if a transformation will be meaningful
- Common computations that you can perform on a pandas.DataFrame
- Windowing, grouping, or sampling operations and why they are special
And you’ll be able to:
- Use pandas more effectively as a coherent tool for columnar data analysis
- Perform semantically meaningful transformations on a pandas.DataFrame
- Anticipate if certain operations will be fast or slow
- Understand the design of the pandas API, and use pandas without constant reliance on resources like StackOverflow
This live event is for you because...
- You use pandas but sometimes struggle with the API, or rely too heavily on resources like StackOverflow
- You want a better understanding of the underlying design of pandas to reduce the churn, or “fiddliness,” of your code
- You come from a non-traditional (non-programming) background and need to become effective with data science and analytics
- You need the foundational knowledge to empower the data science/data analysis needs of your team
Prerequisites
- Some prior experience with Python and pandas
- Knowledge of relevant syntax and core functionality
Course Set-up
- No specific setup required. Course notes and all materials provided during the session.
- Recommended: Up-to-date Python installation (from www.python.org) and coding environment.
Recommended Preparation
- Attend: Rising to the Challenge of Data Science live online training series by James Powell
Recommended Follow-up
- Read: Pandas for Everyone, First Edition, by Daniel Chen
- Watch: Pandas Data Analysis with Python Fundamentals by Daniel Chen
Schedule
The time frames are only estimates and may vary according to how the class is progressing.
Segment 1: The Index, the Series, the DataFrame
Presentation: (1 hour and 25 minutes)
We’ll start by reviewing the foundational data types provided by Python in the "builtins" and "collections" modules to understand why the "numpy.ndarray" is an inevitable consequence of how Python evolved. This will guide us to an understanding of how to use tools like NumPy more effectively, and provide a foundation for how we should approach pandas. We’ll discuss where pandas builds on top of NumPy, and the purpose and design of the index, index alignment, and indexing mechanisms. Finally, we’ll see what a pandas.DataFrame really is and discuss when it is the most (or least) appropriate tool for our work, as well as how we should approach using it.
In this segment, we will aim to answer the following questions:
- What is a Python list?
- What is a NumPy numpy.ndarray?
- What is a Python dict?
- What is a Python collections.Counter?
- What is a pandas pandas.Array?
- What is a pandas pandas.Series?
- What is a pandas pandas.DataFrame?
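To make these questions concrete, here is a minimal sketch of how the same small piece of data behaves in several of these collection types. The fruit counts and variable names are invented for illustration and are not taken from the course materials.

```python
from collections import Counter

import numpy as np
import pandas as pd

# A list holds arbitrary Python objects; `*` repeats it rather than scaling it.
prices_list = [1.5, 2.0, 3.25]
repeated_list = prices_list * 2         # six elements, values unchanged

# A numpy.ndarray holds a single dtype; arithmetic is elementwise.
prices_array = np.array(prices_list)
doubled_array = prices_array * 2        # each value multiplied by 2

# A Counter is a dict subclass whose addition matches entries by key.
monday = Counter(apple=3, pear=1)
tuesday = Counter(pear=2, plum=5)
week_counter = monday + tuesday         # apple=3, pear=3, plum=5

# A pandas.Series pairs array-like values with an index of labels;
# arithmetic aligns on those labels rather than on position.
monday_s = pd.Series({"apple": 3, "pear": 1})
tuesday_s = pd.Series({"pear": 2, "plum": 5})
week_series = monday_s.add(tuesday_s, fill_value=0)
```

The Series result carries the same totals as the Counter, which is one way to see index alignment: labels, not positions, decide which values are combined.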
Break (10 minutes)
Q&A (10 minutes)
Segment 2: DataFrame Transformations
Presentation: (1 hour and 25 minutes)
Given a thorough understanding of the design of pandas and the purpose and mechanism underlying the "pandas.DataFrame", we will turn our attention to transformations we can perform to manipulate a "pandas.DataFrame". These will include transformations that we will do as part of data cleaning as well as transformations core to our analyses. We’ll discuss the semantics of these transformations, taking special care to understand them in the context of the index and index alignment, and discover how we can approach any given pandas problem without constantly having to reach for the documentation or StackOverflow.
In this segment, we will aim to answer the following questions:
- What are semantically meaningful transformations we can perform on a pandas.DataFrame?
- How do we know if an operation will be fast or slow?
- How do we know if a transformation will be meaningful or not?
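As a rough illustration of the fast/slow and alignment questions above, the sketch below computes the same column two ways and then assigns a Series with a different index. The DataFrame contents and column names are hypothetical, chosen only for this example.

```python
import pandas as pd

# Hypothetical data, invented for illustration only.
df = pd.DataFrame(
    {"price": [10.0, 20.0, 30.0], "quantity": [1, 2, 3]},
    index=["a", "b", "c"],
)

# Vectorized: one operation over whole columns, which for numeric
# dtypes is typically carried out in compiled code.
revenue_fast = df["price"] * df["quantity"]

# Row by row: the same values, but with a Python-level function call
# for every row, which tends to be much slower on large data.
revenue_slow = df.apply(lambda row: row["price"] * row["quantity"], axis=1)

# Alignment: assigning a Series matches on index labels, not position.
discount = pd.Series({"c": 0.5, "a": 0.1})
df["discount"] = discount   # "b" is absent from `discount`, so it becomes NaN
```

Both revenue computations give the same values; the difference is where the loop runs. The final assignment shows why a transformation is only meaningful when the labels on both sides refer to the same things.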
Break (10 minutes)
Q&A (10 minutes)
Segment 3: DataFrame Computations
Presentation: (1 hour and 20 minutes)
Given a thorough understanding of how to load, process, manipulate, clean, and transform our data, we will turn our attention to analyses. We will discuss common analyses we will do on our "pandas.DataFrame" and how we can formulate these in terms of the underlying design of pandas. Rather than recite a bunch of “recipes,” we’ll approach each analytical problem by working through the steps that someone would consider when seeing that problem for the first time. Our goal is to generalize all of the knowledge we have discussed in this and the previous segments to empower you to effortlessly solve any pandas problems you encounter in your work.
In this segment, we will aim to answer the following questions:
- What are the common computations that we can perform on a pandas.DataFrame?
- What are the common windowing, grouping, or sampling operations?
- Why are these operations special?
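For a sense of what these operations look like in practice, here is a small sketch of grouping, windowing, and sampling on one DataFrame. The `sales` data, its column names, and the fixed `random_state` are assumptions made for the example rather than anything specified by the course.

```python
import pandas as pd

# Hypothetical daily sales for two stores, invented for illustration.
sales = pd.DataFrame(
    {
        "store": ["north", "north", "north", "south", "south", "south"],
        "units": [1, 2, 3, 10, 20, 30],
    },
    index=pd.date_range("2024-01-01", periods=6, freq="D"),
)

# Grouping: split rows by key and reduce each group;
# the group keys become the index of the result.
units_per_store = sales.groupby("store")["units"].sum()

# Grouped transform: the result keeps the original index, so it
# aligns directly with the column it was computed from.
share_of_store = sales["units"] / sales.groupby("store")["units"].transform("sum")

# Windowing: a mean over each row and the one before it;
# the first row has no full window, so it is NaN.
rolling_mean = sales["units"].rolling(window=2).mean()

# Sampling: three rows drawn without replacement; random_state
# makes the draw repeatable.
sampled = sales.sample(n=3, random_state=0)
```

In each case the result is defined by how rows are assigned to groups, windows, or draws, and its index tells you how it relates back to the original data.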
Q&A (10 minutes)
Your Instructors
James Powell
James Powell is a world-renowned expert in data science and open source scientific computing. His subject mastery and guidance are sought after by global organizations for consulting, coaching, and staff training. He is one of the most prolific speakers in the community, and he shares new content each week on his YouTube page, LinkedIn, and Discord.
In addition to O’Reilly Online Learning, James and his team at Don't Use This Code provide Python training and consulting services. He would love to talk to you about how your team uses open source tools for AI/ML. E-mail him at james@dutc.io.
James’s courses are dense and fast-paced. Be sure to stay for the full session and don’t worry if it goes by quickly. You’ll want to re-watch the video later to capture all of the details.
Cameron Riddell
Cameron Riddell is a leading specialist on data analysis. He is an expert on analytic and visualization tools such as pandas, Matplotlib, and Bokeh. He delivers key insights to corporate clients to help design and maintain robust, efficient systems for data analysis and visualization.
Cameron is also a prolific presenter, and you will want to check out his weekly blog by signing up for our newsletter.
You can expect that any course led by Cameron will pack a lot of content into a short amount of time! You’ll want to stay for the entire session to see how it comes together and refer back to the recording to take in all of the detail.