Book description
What is bad data? Some people consider it a technical phenomenon, like missing values or malformed records, but bad data includes a lot more. In this handbook, data expert Q. Ethan McCallum has gathered 19 colleagues from every corner of the data arena to reveal how they’ve recovered from nasty data problems.
From cranky storage to poor representation to misguided policy, there are many paths to bad data. Bottom line? Bad data is data that gets in the way. This book explains effective ways to get around it.
Among the many topics covered, you’ll discover how to:
- Test drive your data to see if it’s ready for analysis
- Work spreadsheet data into a usable form
- Handle encoding problems that lurk in text data
- Develop a successful web-scraping effort
- Use NLP tools to reveal the real sentiment of online reviews
- Address cloud computing issues that can impact your analysis effort
- Avoid policies that create data analysis roadblocks
- Take a systematic approach to data quality analysis
Table of contents
- Bad Data Handbook
- SPECIAL OFFER: Upgrade this ebook with O’Reilly
- About the Authors
- Preface
- 1. Setting the Pace: What Is Bad Data?
- 2. Is It Just Me, or Does This Data Smell Funny?
- 3. Data Intended for Human Consumption, Not Machine Consumption
- 4. Bad Data Lurking in Plain Text
- 5. (Re)Organizing the Web’s Data
- 6. Detecting Liars and the Confused in Contradictory Online Reviews
- 7. Will the Bad Data Please Stand Up?
- 8. Blood, Sweat, and Urine
- 9. When Data and Reality Don’t Match
- 10. Subtle Sources of Bias and Error
- 11. Don’t Let the Perfect Be the Enemy of the Good: Is Bad Data Really Bad?
- 12. When Databases Attack: A Guide for When to Stick to Files
- 13. Crouching Table, Hidden Network
- 14. Myths of Cloud Computing
  - Introduction to the Cloud
  - What Is “The Cloud”?
  - The Cloud and Big Data
  - Introducing Fred
  - At First Everything Is Great
  - They Put 100% of Their Infrastructure in the Cloud
  - As Things Grow, They Scale Easily at First
  - Then Things Start Having Trouble
  - They Need to Improve Performance
  - Higher IO Becomes Critical
  - A Major Regional Outage Causes Massive Downtime
  - Higher IO Comes with a Cost
  - Data Sizes Increase
  - Geo Redundancy Becomes a Priority
  - Horizontal Scale Isn’t as Easy as They Hoped
  - Costs Increase Dramatically
  - Fred’s Follies
  - Myth 1: Cloud Is a Great Solution for All Infrastructure Components
  - Myth 2: Cloud Will Save Us Money
  - Myth 3: Cloud IO Performance Can Be Improved to Acceptable Levels Through Software RAID
  - Myth 4: Cloud Computing Makes Horizontal Scaling Easy
  - Conclusion and Recommendations
- 15. The Dark Side of Data Science
- 16. How to Feed and Care for Your Machine-Learning Experts
- 17. Data Traceability
- 18. Social Media: Erasable Ink?
- 19. Data Quality Analysis Demystified: Knowing When Your Data Is Good Enough
- Index
- About the Author
- Colophon
- SPECIAL OFFER: Upgrade this ebook with O’Reilly
- Copyright
Product information
- Title: Bad Data Handbook
- Author(s): Q. Ethan McCallum
- Release date: November 2012
- Publisher(s): O'Reilly Media, Inc.
- ISBN: 9781449324971