Book description
A web application involves many specialists, but it takes people in web ops to ensure that everything works together throughout an application's lifetime. It's the expertise you need when your start-up gets an unexpected spike in web traffic, or when a new feature causes your mature application to fail. In this collection of essays and interviews, web veterans such as Theo Schlossnagle, Baron Schwartz, and Alistair Croll offer insights into this evolving field. You'll hear stories from the trenches, told by builders of some of the biggest sites on the Web, about what it takes to help a site thrive.
- Learn the skills needed in web operations, and why they're gained through experience rather than schooling
- Understand why it's important to gather metrics from both your application and infrastructure
- Consider common approaches to database architectures and the pitfalls that come with increasing scale
- Learn how to handle the human side of outages and degradations
- Find out how one company avoided disaster after a huge traffic deluge
- Discover how to work out what went wrong after a problem occurs, and how to prevent it from happening again
Contributors include:
John Allspaw
Heather Champ
Michael Christian
Richard Cook
Alistair Croll
Patrick Debois
Eric Florenzano
Paul Hammond
Justin Huff
Adam Jacob
Jacob Loomis
Matt Massie
Brian Moon
Anoop Nagwani
Sean Power
Eric Ries
Theo Schlossnagle
Baron Schwartz
Andrew Shafer
Table of contents
- Web Operations: Keeping the Data on Time
- SPECIAL OFFER: Upgrade this ebook with O’Reilly
- Foreword
- Preface
- 1. Web Operations: The Career
- 2. How Picnik Uses Cloud Computing: Lessons Learned
- 3. Infrastructure and Application Metrics
- Time Resolution and Retention Concerns
- Locality of Metrics Collection and Storage
- Layers of Metrics
- Providing Context for Anomaly Detection and Alerts
- Log Lines Are Metrics, Too
- Correlation with Change Management and Incident Timelines
- Making Metrics Available to Your Alerting Mechanisms
- Using Metrics to Guide Load-Feedback Mechanisms
- A Metrics Collection System, Illustrated: Ganglia
- Background
- A Quick Introduction to Ganglia
- The need to keep collection and aggregation costs low
- The need to automatically discover new nodes and metrics
- The need to match network transport with your metrics collection task
- The need to implicitly prioritize cluster metrics
- The need to aggregate and organize metrics once they're collected
- The need to provide convenient interfaces for creating new metrics and pulling out existing metrics for correlation against other data
- Conclusion
- 4. Continuous Deployment
- 5. Infrastructure As Code
- 6. Monitoring
- 7. How Complex Systems Fail
- How Complex Systems Fail
- (Being a Short Treatise on the Nature of Failure; How Failure Is Evaluated; How Failure Is Attributed to Proximate Cause; and the Resulting New Understanding of Patient Safety)
- Complex systems are intrinsically hazardous systems
- Complex systems are heavily and successfully defended against failure
- Catastrophe requires multiple failures–single-point failures are not enough
- Complex systems contain changing mixtures of failures latent within them
- Complex systems run in degraded mode
- Catastrophe is always just around the corner
- Post-accident attribution to a "root cause" is fundamentally wrong
- Hindsight biases post-accident assessments of human performance
- Human operators have dual roles: as producers and as defenders against failure
- All practitioner actions are gambles
- Actions at the sharp end resolve all ambiguity
- Human practitioners are the adaptable element of complex systems
- Human expertise in complex systems is constantly changing
- Change introduces new forms of failure
- Views of "cause" limit the effectiveness of defenses against future events
- Safety is a characteristic of systems and not of their components
- People continuously create safety
- Failure-free operations require experience with failure
- As It Pertains Specifically to Web Operations
- It will be difficult to tell that the system has failed
- It will be difficult to tell what has failed
- Meaningful response will be delayed
- Communications will be strained and tempers will flare
- Maintenance will be a major source of new failures
- Recovery from backup is itself difficult and potentially dangerous
- Create test procedures that front-line people can use to verify system status
- Manage operations on a daily basis
- Control maintenance
- Assess performance at regular intervals
- Be a (unique) customer
- Further Reading
- 8. Community Management and Web Operations
- 9. Dealing with Unexpected Traffic Spikes
- 10. Dev and Ops Collaboration and Cooperation
- 11. How Your Visitors Feel: User-Facing Metrics
- Why Collect User-Facing Metrics?
- What Makes a Site Slow?
- Measuring Delay
- Building an SLA
- Visitor Outcomes: Analytics
- Other Metrics Marketing Cares About
- How User Experience Affects Web Ops
- The Future of Web Monitoring
- Conclusion
- 12. Relational Database Strategy and Tactics for the Web
- 13. How to Make Failure Beautiful: The Art and Science of Postmortems
- 14. Storage
- 15. Nonrelational Databases
- 16. Agile Infrastructure
- Agile Infrastructure
- So, What's the Problem?
- Talk Does Not Cook Rice
- The infrastructure is an application
- Version control: The foundation of sanity
- Configuration management and automated deployments
- Monitoring
- Dev-test-prod life cycle, continuous integration, and disaster recovery
- Radiate information
- Reflective process improvement
- Incremental changes and refactoring
- The simplest thing that could work
- Separation of concerns
- Technical debt
- Continuous deployment
- Pairing
- Managing flow
- Communities of Interest and Practice
- Trading Zones and Apologies
- Conclusion
- 17. Things That Go Bump in the Night (and How to Sleep Through Them)
- A. Contributors
- Index
- About the Authors
- Colophon
- SPECIAL OFFER: Upgrade this ebook with O’Reilly
Product information
- Title: Web Operations
- Author(s):
- Release date: June 2010
- Publisher(s): O'Reilly Media, Inc.
- ISBN: 9781449394158