Statistics for Data Science

Book description

Get your statistics basics right before diving into the world of data science

About This Book

  • No need to take a degree in statistics: read this book and get a strong statistics base for data science and real-world programs
  • Implement statistics in data science tasks such as data cleaning, mining, and analysis
  • Learn all about probability, statistics, numerical computations, and more with the help of R programs

Who This Book Is For

This book is intended for developers who want to enter the field of data science and are looking for concise information on statistics, delivered through insightful programs and simple explanations. Some basic hands-on experience with R will be useful.

What You Will Learn

  • Analyze the transition from a data developer to a data scientist mindset
  • Get acquainted with the R programs and the logic used for statistical computations
  • Understand mathematical concepts such as variance, standard deviation, probability, matrix calculations, and more
  • Learn to implement statistics in data science tasks such as data cleaning, mining, and analysis
  • Learn the statistical techniques required to perform tasks such as linear regression, regularization, model assessment, boosting, SVMs, and working with neural networks
  • Get comfortable with performing various statistical computations for data science programmatically

In Detail

Data science is an ever-evolving field that is rapidly growing in popularity. It draws on techniques and theories from the fields of statistics, computer science, and, most importantly, machine learning, databases, data visualization, and so on.

This book takes you through an entire journey of statistics, from knowing very little to becoming comfortable using various statistical methods for data science tasks. It starts off with simple statistics and then moves on to the statistical methods used in data science algorithms. The R programs for statistical computation are clearly explained, along with the logic behind them. You will come across various mathematical concepts, such as variance, standard deviation, probability, matrix calculations, and more. You will learn only what is required to implement statistics in data science tasks such as data cleaning, mining, and analysis. You will also learn the statistical techniques required to perform tasks such as linear regression, regularization, model assessment, boosting, SVMs, and working with neural networks.

By the end of the book, you will be comfortable with performing various statistical computations for data science programmatically.

Style and approach

A step-by-step, comprehensive guide with real-world examples

Table of contents

  1. Preface
    1. What this book covers
    2. What you need for this book
    3. Who this book is for
    4. Conventions
    5. Reader feedback
    6. Customer support
      1. Downloading the example code
      2. Downloading the color images of this book
      3. Errata
      4. Piracy
      5. Questions
  2. Transitioning from Data Developer to Data Scientist
    1. Data developer thinking
    2. Objectives of a data developer
      1. Querying or mining
        1. Data quality or data cleansing
        2. Data modeling
        3. Issue or insights
        4. Thought process
      2. Developer versus scientist
        1. New data, new source
        2. Quality questions
        3. Querying and mining
        4. Performance
        5. Financial reporting
        6. Visualizing
        7. Tools of the trade
    3. Advantages of thinking like a data scientist
      1. Developing a better approach to understanding data
      2. Using statistical thinking during program or database designing
      3. Adding to your personal toolbox
      4. Increased marketability
      5. Perpetual learning
      6. Seeing the future
    4. Transitioning to a data scientist
      1. Let's move ahead
    5. Summary
  3. Declaring the Objectives
    1. Key objectives of data science
      1. Collecting data
      2. Processing data
      3. Exploring and visualizing data
      4. Analyzing the data and/or applying machine learning to the data
      5. Deciding (or planning) based upon acquired insight
        1. Thinking like a data scientist
        2. Bringing statistics into data science
        3. Common terminology
          1. Statistical population
          2. Probability
          3. False positives
          4. Statistical inference
          5. Regression
          6. Fitting
          7. Categorical data
          8. Classification
          9. Clustering
          10. Statistical comparison
          11. Coding
          12. Distributions
          13. Data mining
          14. Decision trees
          15. Machine learning
          16. Munging and wrangling
          17. Visualization
          18. D3
          19. Regularization
          20. Assessment
          21. Cross-validation
          22. Neural networks
          23. Boosting
          24. Lift
          25. Mode
          26. Outlier
          27. Predictive modeling
          28. Big Data
          29. Confidence interval
          30. Writing
    2. Summary
  4. A Developer's Approach to Data Cleaning
    1. Understanding basic data cleaning
      1. Common data issues
      2. Contextual data issues
      3. Cleaning techniques
    2. R and common data issues
      1. Outliers
        1. Step 1 – Profiling the data
        2. Step 2 – Addressing the outliers
      2. Domain expertise
      3. Validity checking
      4. Enhancing data
      5. Harmonization
      6. Standardization
    3. Transformations
    4. Deductive correction
    5. Deterministic imputation
    6. Summary
  5. Data Mining and the Database Developer
    1. Data mining
      1. Common techniques
      2. Visualization
        1. Cluster analysis
        2. Correlation analysis
        3. Discriminant analysis
        4. Factor analysis
        5. Regression analysis
        6. Logistic analysis
        7. Purpose
    2. Mining versus querying
      1. Choosing R for data mining
      2. Visualizations
        1. Current smokers
      3. Missing values
      4. A cluster analysis
    3. Dimensional reduction
      1. Calculating statistical significance
    4. Frequent patterning
      1. Frequent item-setting
    5. Sequence mining
    6. Summary
  6. Statistical Analysis for the Database Developer
    1. Data analysis
      1. Looking closer
    2. Statistical analysis
    3. Summarization
      1. Comparing groups
        1. Samples
        2. Group comparison conclusions
      2. Summarization modeling
    4. Establishing the nature of data
    5. Successful statistical analysis
    6. R and statistical analysis
    7. Summary
  7. Database Progression to Database Regression
    1. Introducing statistical regression
      1. Techniques and approaches for regression
        1. Choosing your technique
        2. Does it fit?
    2. Identifying opportunities for statistical regression
      1. Summarizing data
      2. Exploring relationships
      3. Testing significance of differences
    3. Project profitability
    4. R and statistical regression
    5. A working example
      1. Establishing the data profile
        1. The graphical analysis
      2. Predicting with our linear model
        1. Step 1: Chunking the data
        2. Step 2: Creating the model on the training data
        3. Step 3: Predicting the projected profit on test data
        4. Step 4: Reviewing the model
        5. Step 4: Accuracy and error
    6. Summary
  8. Regularization for Database Improvement
    1. Statistical regularization
      1. Various statistical regularization methods
      2. Ridge
      3. Lasso
      4. Least angles
      5. Opportunities for regularization
        1. Collinearity
        2. Sparse solutions
        3. High-dimensional data
        4. Classification
      6. Using data to understand statistical regularization
      7. Improving data or a data model
        1. Simplification
        2. Relevance
        3. Speed
        4. Transformation
        5. Variation of coefficients
        6. Causal inference
        7. Back to regularization
        8. Reliability
      8. Using R for statistical regularization
        1. Parameter Setup
    2. Summary
  9. Database Development and Assessment
    1. Assessment and statistical assessment
      1. Objectives
      2. Baselines
      3. Planning for assessment
      4. Evaluation
    2. Development versus assessment
      1. Planning
    3. Data assessment and data quality assurance
      1. Categorizing quality
      2. Relevance
      3. Cross-validation
        1. Preparing data
    4. R and statistical assessment
      1. Questions to ask
      2. Learning curves
        1. Example of a learning curve
    5. Summary
  10. Databases and Neural Networks
    1. Ask any data scientist
      1. Defining neural network
        1. Nodes
        2. Layers
        3. Training
        4. Solution
        5. Understanding the concepts
      2. Neural network models and database models
        1. No single or main node
        2. Not serial
        3. No memory address to store results
      3. R-based neural networks
        1. References
        2. Data prep and preprocessing
        3. Data splitting
        4. Model parameters
        5. Cross-validation
        6. R packages for ANN development
        7. ANN
        8. ANN2
        9. NNET
        10. Black boxes
      4. A use case
        1. Popular use cases
        2. Character recognition
        3. Image compression
        4. Stock market prediction
        5. Fraud detection
        6. Neuroscience
    2. Summary
  11. Boosting your Database
    1. Definition and purpose
      1. Bias
        1. Categorizing bias
        2. Causes of bias
        3. Bias data collection
        4. Bias sample selection
      2. Variance
        1. ANOVA
      3. Noise
        1. Noisy data
      4. Weak and strong learners
        1. Weak to strong
        2. Model bias
        3. Training and prediction time
        4. Complexity
        5. Which way?
    2. Back to boosting
      1. How it started
      2. AdaBoost
        1. What you can learn from boosting (to help) your database
    3. Using R to illustrate boosting methods
      1. Prepping the data
      2. Training
      3. Ready for boosting
        1. Example results
    4. Summary
  12. Database Classification using Support Vector Machines
    1. Database classification
      1. Data classification in statistics
      2. Guidelines for classifying data
        1. Common guidelines
      3. Definitions
    2. Definition and purpose of an SVM
      1. The trick
      2. Feature space and cheap computations
      3. Drawing the line
      4. More than classification
      5. Downside
      6. Reference resources
      7. Predicting credit scores
    3. Using R and an SVM to classify data in a database
      1. Moving on
    4. Summary
  13. Database Structures and Machine Learning
    1. Data structures and data models
      1. Data structures
      2. Data models
        1. What's the difference?
        2. Relationships
    2. Machine learning
      1. Overview of machine learning concepts
      2. Key elements of machine learning
        1. Representation
        2. Evaluation
        3. Optimization
      3. Types of machine learning
        1. Supervised learning
        2. Unsupervised learning
        3. Semi-supervised learning
        4. Reinforcement learning
          1. Most popular
      4. Applications of machine learning
      5. Machine learning in practice
        1. Understanding
        2. Preparation
        3. Learning
        4. Interpretation
        5. Deployment
        6. Iteration
    3. Using R to apply machine learning techniques to a database
      1. Understanding the data
      2. Preparing
      3. Data developer
      4. Understanding the challenge
      5. Cross-tabbing and plotting
    4. Summary

Product information

  • Title: Statistics for Data Science
  • Author(s): James D. Miller
  • Release date: November 2017
  • Publisher(s): Packt Publishing
  • ISBN: 9781788290678