DEVOPS FOUNDATION
DevOps and chaos engineering
Chaos Engineering is like intentionally breaking things to make your systems stronger.
- The Core Idea: Instead of waiting for unexpected failures to happen in your live systems, you deliberately introduce controlled "chaos" (like simulating server crashes or network outages) to see how your systems react.
- Learning from Failure: By observing how your systems behave under stress, you can identify weaknesses, improve your response plans, and build more resilient infrastructure.
- Netflix's "Chaos Monkey": Netflix pioneered this with their "Chaos Monkey" tool, which randomly terminated instances in their production environment to force their engineers to build systems that could withstand such failures.
- Beyond Simple Failures: Chaos Engineering goes beyond simple server crashes. It involves complex scenarios, like network disruptions, data center outages, and even human intervention tests ("Game Days") to evaluate how your teams respond to real-world incidents.
- Key Benefits:
- Improved Resilience: Systems become more robust and can better withstand unexpected failures.
- Faster Recovery: Reduced downtime and quicker resolution of incidents.
- Enhanced Team Preparedness: Improved incident response capabilities and better coordination among teams.
- Continuous Learning: Promotes a culture of learning and continuous improvement within the organization.
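To make the Chaos Monkey idea concrete, here is a minimal sketch of the pattern in Python with boto3, assuming AWS credentials are configured and that only instances carrying a hypothetical chaos-eligible tag are fair game; a real experiment would add guardrails such as blast-radius limits and scheduling:

```python
import random

import boto3  # assumes AWS credentials are configured in the environment

def terminate_random_instance(tag_key: str = "chaos-eligible") -> str | None:
    """Chaos Monkey-style experiment: terminate one randomly chosen,
    opted-in instance so teams must build services that tolerate the loss."""
    ec2 = boto3.client("ec2")
    resp = ec2.describe_instances(
        Filters=[{"Name": "tag-key", "Values": [tag_key]},
                 {"Name": "instance-state-name", "Values": ["running"]}])
    instances = [i["InstanceId"]
                 for r in resp["Reservations"] for i in r["Instances"]]
    if not instances:
        return None  # nothing opted in; no experiment today
    victim = random.choice(instances)
    ec2.terminate_instances(InstanceIds=[victim])
    return victim
```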
In simpler terms, DevOps is about breaking down the barriers between software developers and the operations teams that manage the systems their applications run on.
- Instead of working in isolation, these teams (developers, operations, and others) collaborate closely throughout the entire software lifecycle.
- This collaboration leads to faster and more frequent software releases, higher quality software, and improved operational efficiency.
- Key principles include collaboration, automation, continuous improvement, shared responsibility, and a customer-centric focus.
- Common practices include continuous integration, continuous delivery, infrastructure as code, and microservices.
Why is DevOps important?
- Faster delivery: Get software to users quicker.
- Improved quality: Reduce bugs and improve reliability.
- Increased efficiency: Automate tasks and free up teams for more strategic work.
- Enhanced collaboration: Break down silos and improve communication.
- Improved reliability: Reduce downtime and improve the overall user experience.
DevOps Core Values: CAMS
- Culture:
- DevOps is fundamentally about people and culture, not just technology.
- Focus on changing human behavior and breaking down silos between teams (development and operations).
- Build a culture of collaboration, communication, and shared understanding.
- Automation:
- Automation is crucial for efficiency and speed, but it's not the sole focus of DevOps.
- Automate tasks to reduce manual work, improve consistency, and accelerate delivery.
- Create a fabric of automation that supports your systems and applications.
- Measurement:
- Track key metrics to understand performance and the impact of changes.
- Focus on metrics that measure outcomes, such as deployment frequency, lead time for changes, and customer satisfaction.
- Avoid focusing solely on vanity metrics or incentivizing the wrong behaviors.
- Sharing:
- Collaboration and knowledge sharing are essential for continuous improvement.
- Share information through documentation, pair programming, peer reviews, and open communication.
- Foster a culture of transparency and learning within and across teams.
Key Takeaways:
- DevOps is a human-centric approach that requires a cultural shift within an organization.
- Automation is a critical enabler of DevOps, but it should be guided by cultural and business goals.
- Measurement provides valuable insights for continuous improvement, but it's important to choose the right metrics and avoid unintended consequences.
- Sharing knowledge and best practices is essential for team growth and organizational success.
DevOps Guiding Principles: The Three Ways
1. Systems Thinking and the Principles of Flow:
- Focus on the Whole: This principle emphasizes understanding the entire system, not just individual parts. Optimizing one component without considering its impact on the whole can create unintended consequences (like a bottleneck shifting elsewhere).
- Value Stream: The focus is on the flow of value from idea to customer delivery. Identifying and removing bottlenecks within this flow is crucial for efficiency.
- Collaboration: Breaking down silos between teams (development, operations, etc.) is essential for smooth value flow.
2. Amplifying Feedback Loops:
- Fast Feedback: Quick feedback allows for rapid identification and correction of issues.
- Reduced Waste: Early detection of problems minimizes wasted effort and time.
- Continuous Improvement: Feedback loops drive continuous improvement by highlighting areas for optimization.
3. Culture of Continuous Experimentation and Learning:
- Learning from Mistakes: Encourage experimentation and learning from both successes and failures.
- "Fail Fast" Mentality: Embrace experimentation and rapid iteration, even if it means occasional failures.
- Continuous Learning: Foster a culture of ongoing skill development and knowledge sharing within the team.
In essence:
- Systems Thinking: See the big picture and optimize the entire value stream.
- Amplifying Feedback Loops: Get quick feedback to improve and minimize waste.
- Continuous Experimentation: Learn from doing, embrace change, and foster a culture of learning.
DevOps practice playbook:
- No Single "Playbook" like Agile: While Agile has structured methods (Scrum, Extreme Programming), DevOps lacks a strict, one-size-fits-all approach.
- Five Core Pillars: The course focuses on five key areas crucial for a successful DevOps implementation:
- Culture: Fostering a collaborative, learning environment where teams can experiment and grow.
- Process: Adopting Agile and Lean principles like small batches, feedback loops, and limited work in progress.
- Infrastructure as Code: Treating infrastructure like software, automating its creation and management.
- Continuous Delivery: Automating software releases through frequent, small changes.
- Site Reliability Engineering: Building and operating reliable systems with a strong focus on observability and automation.
- Interdependence of Pillars: These pillars are interconnected. Success in one area depends on progress in others. For example, Continuous Delivery alone won't improve business performance without strong Site Reliability Engineering.
- Gradual Improvement: The key is to gradually improve all five pillars simultaneously, avoiding an unbalanced approach.
- Self-Assessment: The speaker encourages you to assess your organization's current standing on each pillar as a starting point.
Key Takeaways:
- DevOps is a journey, not a destination.
- Focus on building a strong foundation by addressing all five pillars.
- Continuous improvement is essential.
This section of the DevOps course emphasizes that "People over Process over Tools" should be the guiding principle when selecting DevOps tools.
- Focus on People First:
- Identify who will use the tools and ensure they have the necessary skills and support.
- Prioritize tools that facilitate collaboration among the entire team.
- Define Processes:
- Determine the specific workflow and desired outcomes before choosing tools.
- Select tools that align with and support your defined processes.
- Choose the Right Tools:
- KISS Principle: Keep it Simple, Stupid. Avoid unnecessary complexity by selecting only the essential tools.
- Tool Integration: Ensure that tools work well together and can be easily integrated into your existing workflows.
- Dynamic Adaptation: Choose tools that can adapt to changes in your infrastructure and environment.
Why We Need a DevOps Culture:
The Problem:
- Siloed Teams: Traditionally, IT departments are often fragmented. Developers, testers, operations, security, and database teams work in isolation, throwing work over "walls" to the next team without proper communication or collaboration.
- Slow Delivery: This "over-the-wall" approach leads to slow and inefficient processes. For example, getting a new server can take weeks due to bureaucratic procedures, even if the actual server provisioning takes minutes.
- Business Frustration: Businesses are increasingly tech-savvy and impatient with these delays. They seek faster delivery and are frustrated by inefficient IT processes.
The Solution: DevOps Culture
- Breaking Down Silos: DevOps aims to break down these silos by fostering collaboration and communication between all teams involved in software development and delivery.
- Continuous Improvement: DevOps emphasizes continuous improvement through feedback loops and a focus on learning from mistakes.
- Three Key Areas: To create a DevOps culture, organizations must focus on:
- Communication: Open and transparent communication between all stakeholders.
- Collaboration: Teamwork and shared responsibility across all teams.
- Continuous Learning: A culture of learning and improvement, where teams constantly seek to optimize their processes.
In Essence: DevOps is about creating a more agile and efficient IT organization that can deliver value to the business faster and more reliably. It requires a shift in mindset and a commitment to breaking down traditional barriers between teams.
Key Takeaways:
- Silos are detrimental: They hinder collaboration, slow down delivery, and frustrate both IT teams and the business.
- DevOps promotes collaboration: It encourages communication and shared responsibility across all teams.
- Continuous improvement is crucial: DevOps is an iterative process that emphasizes continuous learning and optimization.
Communication and trust are the bedrock of successful DevOps. Without them, even the best technical practices will fail.
How to Improve Communication:
- Establish Clear Channels: Define specific channels for different types of communication (e.g., project updates, incident reports, customer information).
- Develop Communication Processes: Create clear guidelines on who needs to communicate what to whom and when, especially during critical events.
- Foster a Culture of Transparency: Encourage open and honest communication, even about potential problems.
- Build Trust:
- Assume Good Faith: People are generally trying to do their best, even if their actions seem counterproductive.
- Share Context: Provide ample information about your team's work, challenges, and goals.
- Be Curious and Respectful: Understand other teams' perspectives and work towards shared objectives.
Overcoming Communication Barriers:
- Invest in Communication Skills: Read books, attend workshops, and practice effective communication techniques.
- Address Misunderstandings: Actively seek to understand the root cause of conflicts and work together to find solutions.
- Minimize Unnecessary Restrictions: Avoid excessive restrictions on information access unless absolutely necessary.
Benefits of Effective Communication:
- Improved Collaboration: Better teamwork, reduced conflicts, and increased efficiency.
- Increased Trust: Stronger relationships, improved morale, and a more positive work environment.
- Enhanced Innovation: A culture of trust allows for greater risk-taking and experimentation.
- Better Decision-Making: Access to accurate and timely information leads to more informed decisions.
DevOps emphasizes continuous learning and experimentation. This means constantly improving your skills and taking calculated risks to learn from the results.
- Kaizen: This Japanese concept, meaning "continuous improvement," is central to DevOps. It's about making small, consistent changes to improve processes over time. This is similar to the Lean manufacturing principles used by Toyota.
- Gemba: This Japanese word translates to "the real place." In DevOps, it means going directly to the source of the problem or the place where value is created instead of relying on reports or assumptions. This could involve observing the actual work being done, interacting with users, or examining the code directly.
- Plan-Do-Check-Act (PDCA) Cycle: This simple yet powerful cycle is a core element of kaizen.
- Plan: Define what you want to achieve and how you will do it.
- Do: Execute the plan.
- Check: Measure the results and analyze the outcomes.
- Act: Make adjustments based on the results and use these findings to inform the next cycle.
- Building People: The PDCA cycle not only drives improvements but also helps individuals develop critical thinking and problem-solving skills.
DevOps builds upon Agile, but they are not the same.
- Agile: Focuses on how software is developed – breaking down projects into smaller iterations, collaborating closely within development teams, and getting frequent feedback.
- DevOps: Extends Agile by including operations teams (those who manage the servers, networks, etc.). This ensures that the software can be smoothly deployed, monitored, and maintained in a real-world environment.
Key differences:
- Scope: Agile primarily focuses on the development process, while DevOps considers the entire software lifecycle, from development to operations.
- Collaboration: Agile emphasizes collaboration within development teams, while DevOps extends this to include operations teams.
- Focus: Agile focuses on delivering working software, while DevOps also emphasizes the importance of building and maintaining reliable systems.
Why is this important?
- Faster Delivery: By breaking down work into smaller iterations and collaborating closely with operations, teams can release software more frequently and respond quickly to market changes.
- Improved Quality: Continuous feedback and collaboration help to identify and fix issues early in the development process, leading to higher quality software.
- Increased Efficiency: Automation and streamlined processes improve the overall efficiency of the software delivery pipeline.
In essence:
DevOps is like an upgraded version of Agile that recognizes the crucial role of operations in the success of any software project.
Lean, the second building block of DevOps.
- Lean's Core: It's about systematically eliminating waste in any process.
- Origins: Born in manufacturing (Toyota), it revolutionized industries, then moved to product development (Lean Startup), and finally to software (Lean Software Development).
- Key Concepts:
- Focus on Value: Identify activities that truly add value to the end product/service.
- Eliminate Waste (Muda):
- Type 1: Necessary but non-value-adding.
- Type 2: Completely unnecessary.
- Reduce Irregularity (Mura): Minimize delays and wait times.
- Prevent Overburden (Muri): Avoid fatigue and breakdowns.
- Software Wastes: Bugs, delays, unnecessary features, and even some management activities.
- Lean Techniques:
- Kaizen: Continuous improvement.
- Value Stream Mapping: Analyze the entire process to identify waste.
- Visual Management: Use tools like Kanban boards to track progress.
- Work-in-Progress Limits: Prevent starting too many tasks at once.
- DevOps and Lean:
- The CAMS values were extended with Lean to form CALMS (Culture, Automation, Lean, Measurement, Sharing).
- Lean is crucial for successful DevOps implementations.
- Google research confirms the link between Lean practices and better software delivery.
DevOps Process Building Block: Visible Ops Change Control
- The Problem:
- Frequent system outages are often caused by poorly managed changes (80% of the time).
- Traditional IT Service Management (ITSM) approaches, like ITIL, often lead to overly complex and slow change control processes. This involves extensive documentation, approval by committees (Change Advisory Board - CAB), and excessive delays.
- These slow processes hinder agility and empower those least qualified (upper management) to make critical change decisions.
- The Visible Ops Solution:
- Focuses on lightweight, fast, and scalable change control.
- Emphasizes:
- Peer Reviews: Most changes are reviewed by another technologist within the team; high-risk changes are escalated for broader review.
- Small Changes: Smaller changes are easier to review, test, and roll back if necessary.
- Early Testing: Continuous integration with automated testing provides immediate feedback on changes.
- Automated Safeguards: Security and other safeguards are integrated into the development process.
Infrastructure as Code (IaC) means using code to define and manage your computer systems instead of doing it manually.
Imagine building a house. Traditionally, you'd hire workers, buy materials, and have them build it brick by brick. IaC is like having a blueprint (the code) and a machine that automatically builds the house based on that blueprint.
Here's a breakdown:
- Traditional Approach: Manually setting up servers, configuring networks, and installing software. This is slow, error-prone, and difficult to reproduce consistently.
- IaC Approach: Writing code (like scripts) to automate these tasks. This makes infrastructure more consistent, reliable, and easier to manage.
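As one illustration of the IaC approach, here is a minimal sketch using Pulumi's Python SDK (Terraform, CloudFormation, and similar tools fill the same role); the AMI ID is a placeholder and an AWS account is assumed:

```python
import pulumi
import pulumi_aws as aws

# The "blueprint": declare the desired server once; the tool builds and
# rebuilds it identically in every environment.
web = aws.ec2.Instance(
    "web-server",
    ami="ami-0123456789abcdef0",  # placeholder AMI ID
    instance_type="t3.micro",
    tags={"Name": "web-server", "env": "dev"},
)

pulumi.export("public_ip", web.public_ip)
```

Because the definition lives in code, it can be reviewed, version-controlled, and reapplied to recreate the environment on demand.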
Key Benefits of IaC:
- Speed and Efficiency: Automating tasks saves time and reduces manual effort.
- Consistency: IaC ensures that all systems are configured identically, minimizing inconsistencies and reducing errors.
- Reproducibility: Easily recreate and rebuild infrastructure in different environments (e.g., development, testing, production).
- Version Control: Track changes to your infrastructure code, allowing you to easily revert to previous versions if needed.
- Scalability: Easily scale your infrastructure up or down as needed by modifying your code.
Think of it like this: Instead of treating servers as unique individuals ("pets"), you treat them as a group ("cattle"). You manage them collectively using code, making it easier to scale, maintain, and update your infrastructure.
Key takeaway: IaC is a fundamental DevOps practice that brings the power of software development to infrastructure management, leading to faster deployments, improved reliability, and greater efficiency.
DevOps Applications of Infrastructure as Code
This excerpt explains how DevOps principles are applied to managing and configuring IT infrastructure. Here's a breakdown:
Key Concepts
- Configuration Management: The process of ensuring your systems and software are in the desired state.
- Infrastructure as Code (IaC): Automating configuration management through code, making it repeatable and reliable.
Three Core Components of IaC in DevOps:
- Provisioning: Preparing servers (virtual or physical) for use, including installing operating systems and configuring basic settings.
- Deployment: Installing and upgrading application software on provisioned systems.
- Orchestration: Coordinating operations across multiple systems, such as automated failover and rolling deployments.
Two Approaches to Configuration Management:
- Imperative (Procedural): Defines a series of commands to achieve a specific state (e.g., a script).
- Declarative (Functional): Defines the desired state, and the tool figures out how to achieve it.
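A small sketch of the contrast, with hypothetical targets; the imperative version scripts the steps in order, while the declarative version only states the goal and leaves the steps to the tool:

```python
import subprocess

# Imperative: spell out HOW, step by step (order matters, no state check,
# and the script assumes root on a Debian-style host).
def install_nginx_imperative() -> None:
    subprocess.run(["apt-get", "update"], check=True)
    subprocess.run(["apt-get", "install", "-y", "nginx"], check=True)

# Declarative: state WHAT the end state should be; a CM tool (Puppet,
# Ansible, etc.) compares it to reality and plans the steps itself.
DESIRED_STATE = {
    "package": {"name": "nginx", "state": "present"},
    "service": {"name": "nginx", "state": "running", "enabled": True},
}
```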
Important Considerations:
- Idempotency: The ability to run a process repeatedly and always end in the same desired state; once the system has converged, repeat runs change nothing (see the sketch after this list).
- Self-Service: Allowing users to initiate configuration management processes independently.
- Drift: The divergence of the actual system state from the desired state, often caused by manual changes or unexpected behavior.
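A minimal sketch of idempotency, using a hypothetical config file: the first run converges the system, and every later run is a no-op:

```python
from pathlib import Path

def ensure_line(path: str, line: str) -> bool:
    """Idempotently ensure `line` exists in the file; returns True only
    when a change was actually made."""
    p = Path(path)
    lines = p.read_text().splitlines() if p.exists() else []
    if line in lines:
        return False  # already converged: do nothing
    p.write_text("\n".join(lines + [line]) + "\n")
    return True

print(ensure_line("app.conf", "max_connections=100"))  # True: file changed
print(ensure_line("app.conf", "max_connections=100"))  # False: no-op
```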
In essence: DevOps leverages IaC to automate and streamline infrastructure management, leading to faster deployments, improved reliability, and increased efficiency.
Key Takeaways:
- IaC is a cornerstone of DevOps practices.
- Understanding provisioning, deployment, and orchestration is crucial.
- Choosing between imperative and declarative approaches depends on your specific needs and preferences.
- Idempotency, self-service, and drift detection are essential for successful IaC implementation.
Evolution of configuration management (CM) in DevOps.
- Early Days (1990s):
- Separate Dev and Ops teams.
- Tools like Ghost for simple cloning and large suites like Tivoli for enterprises.
- Limited collaboration and sharing between teams.
- Rise of Infrastructure as Code (2000s):
- Tools like CFEngine, Puppet, and Chef emerged.
- Focused on managing system configurations declaratively (defining the desired state).
- Limited application deployment capabilities.
- "Golden Image or Foil Ball" concept introduced by Luke Kanies: Minimal base images + declarative CM for flexibility.
- Cloud Computing and Growing Challenges:
- Increased demand for automated server provisioning due to dynamic cloud environments.
- Orchestration limitations: Traditional tools like Puppet and Chef lacked strong orchestration capabilities.
- Focus on system administration, not application deployment.
- Emergence of New Tools (2010s):
- Ansible and SaltStack: Introduced push mechanisms for orchestrated deployments.
- Emphasis on workflows and automation.
- Rise of self-service orchestration and runbook tools (e.g., Rundeck).
- Improved infrastructure provisioning capabilities.
Key Takeaways:
- Configuration management has evolved significantly, moving from manual processes to automated, declarative approaches.
- Collaboration and sharing between teams are crucial for successful CM.
- Modern CM tools address not only system configuration but also application deployment and orchestration.
- The focus has shifted towards more flexible and dynamic approaches to manage complex IT environments.
Six Practices for Continuous Integration:
This passage outlines six key practices for successful continuous integration (CI):
- Fast Builds: Builds should be quick – ideally under five minutes. Long build times discourage frequent builds, leading to delays and increased work in progress.
- Small Commits: Commit small, focused changes to the codebase. This makes it easier to identify and isolate issues, and improves code review efficiency.
- Address Broken Builds Immediately: Broken builds obstruct the entire team. Establish a culture where broken builds are addressed promptly, potentially even halting other work until the issue is resolved.
- Trunk-Based Development: Favor a trunk-based development approach where developers work directly on the main branch (trunk) and integrate changes frequently. This minimizes conflicts and ensures the trunk always reflects the latest code. For larger features, use feature flags to control their visibility rather than long-lived branches (see the flag sketch after this list).
- Reliable Tests: Ensure automated tests are reliable and consistent. Flaky tests erode trust in the CI system and hinder effective debugging.
- Clear Build Outputs: Each build should produce a clear status (pass/fail), a detailed log of test results, and an installable artifact. This provides transparency, aids in troubleshooting, and ensures the integrity of releases.
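Here is a minimal sketch of the feature-flag technique mentioned above, reading flags from environment variables (the flag and function names are hypothetical; real systems often use a flag service such as LaunchDarkly):

```python
import os

def flag_enabled(name: str) -> bool:
    """Look up a feature flag; defaults to off so unfinished code stays dark."""
    return os.environ.get(f"FF_{name.upper()}", "off") == "on"

def checkout(items: list) -> str:
    # The new flow is merged to trunk early but hidden until the flag flips.
    if flag_enabled("new_checkout"):
        return f"new checkout: {len(items)} items"
    return f"legacy checkout: {len(items)} items"

if __name__ == "__main__":
    print(checkout(["book", "pen"]))  # legacy unless FF_NEW_CHECKOUT=on
```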
Continuous Delivery: Five Key Practices
Continuous Delivery (CD) is about automating the process of releasing software, making it faster and more reliable. Think of it like ordering food online – you tap a few buttons, and your order is processed, prepared, and delivered. In CD, you automate the steps to build, test, and deploy your software.
Here are five essential practices for successful CD:
- Immutable Artifacts:
- Build your software once and use the same "package" (like a ZIP file or a Docker container) for all environments (testing, staging, production).
- This ensures consistency and makes debugging easier.
- Treat these packages as unchangeable to maintain trust and traceability.
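A small sketch of the build-once idea: package the artifact a single time and stamp it with a content hash, so the exact same bytes are promoted through every environment (paths and names are hypothetical):

```python
import hashlib
import shutil
from pathlib import Path

def package(build_dir: str, version: str) -> Path:
    """Build one immutable artifact; the hash in the filename makes it easy
    to verify that test, staging, and production got identical bytes."""
    archive = Path(shutil.make_archive(f"app-{version}", "zip", build_dir))
    digest = hashlib.sha256(archive.read_bytes()).hexdigest()[:12]
    return archive.rename(f"app-{version}-{digest}.zip")
```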
- Identical Pre-Production Environment:
- Create a testing environment that closely mirrors your production environment, including all the same software, hardware, and configurations.
- This helps identify and fix issues before they reach your real users.
- Automated Testing and Feedback:
- Thoroughly test your software at every stage (build, deployment, pre-production).
- Automate tests as much as possible to speed up the process and reduce human error.
- If any test fails, stop the pipeline and fix the issue before proceeding.
- Immutable Deployments:
- Ensure that every deployment produces the exact same result, regardless of how many times you run it.
- Use techniques like Docker containers or configuration management tools to achieve this consistency.
- Focus on Overall Flow:
- Prioritize the smooth and efficient flow of the entire software delivery process, even if it means temporarily slowing down individual developers.
- Encourage collaboration and teamwork to quickly resolve any issues that arise.
Key Takeaways:
- Continuous Delivery is about automating and streamlining the software release process.
- Consistency, automation, and a focus on the overall flow are crucial for success.
- By implementing these practices, you can deliver software faster, more reliably, and with higher quality.
Crucial role of Quality Assurance (QA) in DevOps, particularly in achieving Continuous Integration (CI) and Continuous Delivery (CD).
- The Catch: While CI/CD promises faster deployments, reduced bugs, and improved collaboration, it's impossible without a robust testing strategy.
- Automation is Key: Manual testing is slow and unreliable. Automating tests at every stage of the pipeline is essential for efficient and effective CI/CD.
- QA's Role: QA professionals shift from manual testers to test designers and developers, working closely with developers to write and integrate tests.
- Types of Testing:
- Unit Tests: Test individual functions or components of code (a minimal example follows this list).
- Code Hygiene Tests: Check for code quality and adherence to best practices using linters and formatters.
- Integration Tests: Verify how different parts of the application work together.
- Acceptance/End-to-End Tests: Simulate user interactions to test the entire application.
- Test-Driven Development (TDD) & Behavior-Driven Development (BDD): These approaches encourage writing tests before writing the actual code, ensuring that the code meets specific requirements.
- Handling Slow Tests: Run slow tests in parallel, schedule them for off-peak hours, or consider the risk-reward of waiting for slow tests on every release.
- Other Important Tests: Include infrastructure, performance, and security testing in your overall testing strategy.
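As an example of the first category, here is a minimal pytest-style unit test against a toy function (both hypothetical); CI would run it on every commit:

```python
# test_cart.py -- run with `pytest`
def add_item(cart: dict, sku: str, qty: int = 1) -> dict:
    """Toy function under test: accumulate quantities per SKU."""
    cart[sku] = cart.get(sku, 0) + qty
    return cart

def test_add_item_accumulates_quantity():
    cart = add_item({}, "sku-1")
    cart = add_item(cart, "sku-1", qty=2)
    assert cart["sku-1"] == 3

def test_add_item_handles_new_sku():
    assert add_item({}, "sku-2")["sku-2"] == 1
```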
Key Takeaways:
- QA is fundamental to successful CI/CD.
- Automation is crucial for efficient testing.
- QA professionals play a vital role in designing and developing tests.
- A comprehensive testing strategy includes various types of tests.
- TDD and BDD promote a test-first approach to development.
- Effective strategies are needed to handle slow tests.
Continuous Deployment, the next logical step after Continuous Delivery.
Key Points:
- What it is: Continuous Deployment means automatically releasing code to production as soon as it passes all tests.
- Comfort Level: Some organizations may not be comfortable with fully automated deployments and may require manual approvals or staged rollouts.
- Importance of CI/CD Foundation: If you have a strong Continuous Integration and Continuous Delivery pipeline, you're better positioned to safely implement Continuous Deployment.
- Automation in the Pipeline: Approvals and even manual steps can be integrated into your automated pipeline.
- Feature Flags: These allow you to deploy new code to production but control its visibility to users, enabling gradual rollouts and A/B testing.
- The "If you stay ready" Principle: Being prepared for deployment minimizes delays and allows for faster responses.
- Release Stage: This stage involves releasing the artifact, notifying stakeholders, and finally deploying to production.
- Production Release Challenges: Production releases often require significant engineering work to automate safely, especially for running services with live users and data.
- Importance of Consistency: Release procedures in production must be mirrored in the test environment to ensure consistent testing.
- Opinionated Systems: For successful Continuous Deployment, establish clear and well-defined procedures that are easy to follow.
- Production Release Patterns (two are sketched in code after this list):
- Rolling Deployment: Upgrading systems one by one.
- Blue-Green Deployment: Shifting traffic from the old system (blue) to the new system (green).
- Canary Deployment: Deploying to a single system and monitoring for issues before wider rollout.
- A/B Deployment: Releasing features to subsets of users for testing and gradual rollout.
- Collaboration: Close collaboration between development, operations, and infrastructure teams is crucial for successful deployment.
- Real-World Example: Signal Sciences' "Deployer" tool, which enabled rapid and reliable deployments with a focus on automation and user experience.
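A toy sketch of two of those patterns, blue-green and canary, reduced to traffic-routing decisions (a real setup would do this in a load balancer or service mesh; the names are illustrative):

```python
import random

LIVE_COLOR = "blue"  # blue-green: flip this one value to cut over

def blue_green_target() -> str:
    """All traffic goes to the live color; rollback is flipping it back."""
    return LIVE_COLOR

def canary_target(canary_fraction: float = 0.05) -> str:
    """Send a small slice of traffic to the new build and watch for errors."""
    return "canary" if random.random() < canary_fraction else "stable"

if __name__ == "__main__":
    hits = [canary_target() for _ in range(1000)]
    print(hits.count("canary"), "of 1000 requests hit the canary")
```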
Understanding the CI toolchain:
Imagine an onion with multiple layers. When building a CI toolchain, instead of thinking from left to right like a pipeline, consider it from the outside in, like peeling an onion.
The outermost layer is Deployment:
- This refers to how your system will be delivered and used.
- Consider factors like containers, system images, or installers.
- Decisions here, like A/B testing (using feature flagging tools) or rolling deployments (using orchestration tools), affect other stages of the pipeline.
Next layer is Artifact Repository:
- This stores the built software (artifact) in various formats.
- General options include Artifactory or Nexus, while specific options include cloud provider repositories or language-specific ones like bit.dev.
- A simple solution is to zip the artifact and upload it to Amazon S3.
Inner layer is Build and Test:
- Build system performs the build and initiates later stages.
- Popular options include Jenkins (open-source), CloudBees and CircleCI (SaaS-based), or GitHub Actions.
- Testing ensures everything works as expected. There are different categories of testing tools:
- Unit testing (code-specific tests, like Golang's go test)
- Code hygiene (enforces coding standards, like ESLint for JavaScript)
- Integration testing (ensures the artifact works with the system, like Pytest for Python)
- Acceptance testing (often UI-based, like Selenium)
- Other testing areas to consider include infrastructure, performance, and security.
The core layer is Version Control:
- This is where code changes are tracked and managed.
- Most organizations use Git (like GitHub, GitLab, Bitbucket).
By tracing a code change through this layered structure, you can measure your overall cycle time (how fast code moves from development to production). This is a crucial metric to monitor and improve.
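A tiny sketch of that measurement: cycle time as the elapsed time from commit to production deploy, given two ISO 8601 timestamps (the values shown are hypothetical):

```python
from datetime import datetime

def cycle_time_hours(committed_at: str, deployed_at: str) -> float:
    """Lead time for a change, in hours, from commit to production."""
    delta = (datetime.fromisoformat(deployed_at)
             - datetime.fromisoformat(committed_at))
    return delta.total_seconds() / 3600

print(cycle_time_hours("2024-05-01T09:00:00", "2024-05-01T15:30:00"))  # 6.5
```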
Site Reliability Engineering (SRE) is the application of software engineering principles to ensure the reliable and consistent operation of IT systems.
Here's a breakdown:
- Reliability in IT: This means the system consistently performs its intended function, including factors like availability (uptime), performance, and security.
- SRE as Operations: SRE focuses on the operational aspects of DevOps, like monitoring, managing, and fixing issues in production environments.
- Engineering for Reliability: SRE emphasizes building reliability into systems from the ground up, rather than trying to fix problems after they occur.
- Key SRE Practices:
- Building for Reliability: Designing systems to be resilient and maintainable.
- Operational Feedback: Observing system behavior, responding to incidents, and using that information to improve the system.
Benefits of SRE:
- Reduced Change Failure Rate: Fewer production issues caused by software changes.
- Faster Service Restoration: Quicker resolution of production problems.
- Improved Uptime and Performance: Meeting service level objectives (SLOs) for availability and performance.
Key Takeaways:
- SRE is a crucial part of DevOps, focusing on the operational aspects of software delivery.
- Building reliability into systems from the beginning is essential.
- Continuous monitoring and feedback from production are critical for improving system reliability.
By implementing SRE practices, organizations can improve their software delivery processes, enhance system reliability, and ultimately deliver better services to their customers.
The Secret to Reliable Software: Design It Right From the Start
When people talk about keeping software running smoothly (operations), they often think about fixing problems after they happen (production). But actually, the key to reliability lies in how you design the software in the first place.
As a developer, you play a big role in building dependable applications. Here are three important resources to help you design software that works well even when things go wrong:
- Design Patterns for Reliability: Software engineers have created pre-built solutions for common problems, like the ones in the famous "Gang of Four Design Patterns" book. There's another book specifically focused on reliability called "Release It!" by Michael Nygard. This book teaches you how to design applications to avoid failures and how to handle them when they do occur.
- Integration Points: The Weakest Link: According to the book "Release It!", the biggest cause of software problems is the connections between different parts of the system (integration points). If one part fails, it can bring down everything else (cascading failure). This is especially true in modern software with many connections (microservices).
- Circuit Breaker: Stopping the Chain Reaction: To prevent cascading failures, a technique called a Circuit Breaker can be used. It monitors integration points and if there are too many problems, it temporarily stops using that connection to avoid overloading the failing system. This gives the system time to recover. Libraries like Resilience4j can simplify using Circuit Breakers in your code.
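Since Resilience4j is a Java library, here is a language-neutral sketch of the same pattern in Python, with illustrative thresholds; it is a teaching aid, not production code:

```python
import time

class CircuitBreaker:
    """Minimal circuit-breaker sketch: fail fast while a dependency is sick."""

    def __init__(self, max_failures: int = 3, reset_after: float = 30.0):
        self.max_failures = max_failures  # failures before the circuit opens
        self.reset_after = reset_after    # seconds before a half-open retry
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()  # trip: stop the cascade
            raise
        self.failures = 0  # success closes the circuit
        return result
```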
- The Twelve-Factor App: A Manifesto for Reliable Services: This set of guidelines (at 12factor.net) outlines how to design software that's easy to deploy and maintain. For example, one rule states that configurations should be stored separately from the code and managed through environment variables. This makes your software more flexible and less prone to errors.
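A minimal sketch of that config rule: the code reads settings from environment variables (the variable names are illustrative), so the same artifact runs unchanged in every environment:

```python
import os

# Twelve-factor config: settings live in the environment, not in the code,
# so dev, staging, and production differ only in their env vars.
DATABASE_URL = os.environ.get("DATABASE_URL", "sqlite:///dev.db")
DEBUG = os.environ.get("APP_DEBUG", "false").lower() == "true"
PORT = int(os.environ.get("PORT", "8080"))
```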
- Learning from the Experts: Martin Fowler is a well-respected software engineer who writes about various software design concepts. He's a great resource to learn more about building reliable software from a practical perspective.
Remember: Reliable software starts with good design. Take some time to explore these resources and see how they can help you build stronger, more dependable applications.
Key Takeaways from "Building for Reliability: Practice"
- Systems Fail:
- All systems, even complex ones, are prone to failures.
- "Codes often written with the assumption that failure of the underlying systems is if not impossible, at least very unusual."
- "Individual components are failing all the time."
- "How Complex Systems Fail":
- This framework emphasizes that changes introduce new failure modes.
- Complex systems always operate in a degraded state.
- Focus on Resilience, Not Perfection:
- "Pushing more and more money at highly available systems is a losing game."
- Resilience: "The intrinsic ability of a system to maintain or regain a dynamically stable state."
- Key Resilience Techniques:
- Redundancy: Running multiple copies of components.
- Load Balancing: Distributing traffic across healthy systems.
- Auto Scaling: Dynamically adjusting resources based on demand.
- Failover and Recovery: Automated mechanisms to handle failures.
- Sociotechnical Systems:
- People are integral to system behavior.
- "All systems are sociotechnical."
- SRE Role:
- Tool Development: "SREs should be spending at least half of their time developing tools."
- Runbook Creation: Documenting procedures for safe system intervention.
- Developer Involvement: "You write it, you run it."
- Importance of Developer Involvement:
- Developers are responsible for their code's behavior in production.
- They must learn to use debugging and performance monitoring tools in real-time environments.
Rephrased for Better Understanding:
This excerpt emphasizes that building reliable systems requires a shift in perspective. Instead of striving for absolute perfection (which is unrealistic and costly), we should focus on building systems that can withstand and recover from inevitable failures.
- Accept the Reality of Failures: Acknowledge that even the most well-designed systems will experience issues.
- Embrace Resilience: Design systems with mechanisms to handle failures gracefully (redundancy, load balancing, auto-scaling).
- Recognize the Human Factor: Acknowledge that people play a crucial role in both causing and resolving system issues.
- Promote Collaboration: Foster close collaboration between developers and operations teams.
- Continuous Improvement: Continuously improve system reliability through monitoring, tool development, and ongoing learning.
Operational Feedback: Observability
- Observability: The ability to understand the internal state of a system by examining its external outputs (metrics, logs). Essentially, "can we really tell what's going on?" by looking at the data.
- Five Areas of Observability:
- Synthetic Checks (Health Checks): Proactively testing system functionality by simulating user interactions. Simple "is it working?" checks (a probe sketch follows this list).
- System & Application Metrics: Monitoring key performance indicators (CPU, memory) and collecting custom application metrics (function call times, login counts).
- End-User Performance:
- Application Performance Monitoring (APM): Monitoring performance at the code level (function execution times, API call durations).
- Real User Monitoring (RUM): Capturing user interactions and performance from the user's perspective.
- System & Application Logs: Detailed text-based records of events, providing valuable insights into system behavior.
- Security Monitoring: Detecting and responding to security threats by analyzing logs, metrics, and looking for suspicious activity.
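As a sketch of the first area, here is a minimal synthetic check in Python, probing a URL and recording latency (the endpoint is hypothetical; real setups use hosted probes or tools like Nagios):

```python
import time
import urllib.request

def synthetic_check(url: str, timeout: float = 5.0) -> dict:
    """Simple "is it working?" probe from the outside, as a user would see it."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            ok = 200 <= resp.status < 300
    except OSError:
        ok = False
    return {"url": url, "ok": ok,
            "latency_s": round(time.monotonic() - start, 3)}

print(synthetic_check("https://example.com/health"))  # hypothetical endpoint
```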
- Benefits of Observability:
- Improved Production Support: Faster identification and resolution of issues.
- Enhanced Development: Gaining insights into real-world usage patterns to improve the product.
- Key Actions:
- Instrument your systems: Implement monitoring tools and collect relevant data (a metrics sketch follows this list).
- Collaborate with developers: Encourage them to contribute to observability by improving custom metrics and logging.
- Analyze data: Use the collected data to make informed decisions about system performance and product improvements.
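A minimal instrumentation sketch using the prometheus_client library (the metric name and the simulated event loop are illustrative):

```python
import random
import time

from prometheus_client import Counter, start_http_server

LOGINS = Counter("app_logins_total", "Number of successful user logins")

if __name__ == "__main__":
    start_http_server(8000)        # metrics exposed at :8000/metrics
    while True:
        if random.random() < 0.3:  # stands in for a real login event
            LOGINS.inc()
        time.sleep(1)
```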
Handling incidents in a DevOps environment.
- Key Idea 1: Incidents are Inevitable: Even with the best design, development, and testing, systems will inevitably fail. This is a normal part of operating complex systems.
- Key Idea 2: Incident Response is a Skill: Handling incidents requires specialized skills:
- Troubleshooting: Diagnosing and fixing the problem.
- Automation: Using tools to speed up the process and ensure safe actions.
- Communication: Coordinating with teams, stakeholders, and users.
- Key Idea 3: Incident Management Process: A well-defined process is crucial for effective incident response, similar to how fire departments use the Incident Command System (ICS). This process guides how incidents are detected, reported, and handled.
- Key Idea 4: Learning from Failures: Postmortems (incident retrospectives) are not about finding blame, but about identifying system weaknesses and improving processes.
- No Single Root Cause: Incidents often stem from multiple factors, such as testing deficiencies, monitoring gaps, and process flaws.
- Avoid Blame: Focus on understanding why decisions were made, even if they led to the incident.
- Transparency: Communicate openly with stakeholders during and after incidents to build trust.
Important Lines:
- "Things are still going to break." (Acknowledges the inevitability of incidents)
- "Incident response is an activity that needs to be practiced." (Highlights the importance of training and experience)
- "There is no single root cause." (Emphasizes the systemic nature of incidents)
- "You have to put aside common cognitive biases and examine the system from the practitioner's point of view." (Focus on understanding the system, not just individual actions)
- "Building trust via transparency builds goodwill." (Emphasizes the importance of open communication)
DevOps SRE toolchain
- Building for Reliability (Difficult to generalize):
- Focuses on code and libraries for building resilient applications.
- Requires collaboration between Dev and Ops during design.
- Examples: Java's Resilience4j library.
- Operational Feedback (More Standardized):
- Involves tools for monitoring and responding to incidents.
- Wide range of options available:
- SaaS offerings (Datadog, Honeycomb, Sumo Logic).
- Open-source tools (Nagios, Grafana, Prometheus).
- Commercial software (SolarWinds, Splunk).
- Don't try to over-design your monitoring upfront.
- Use a "build-measure-learn" cycle to identify your specific needs.
- Start with a basic monitoring stack and iterate based on insights.
- Focus on collecting valuable information to troubleshoot issues effectively.
- Monitoring data should be shared with developers, product managers, and business stakeholders.
- Custom visualizations can improve understanding and communication across teams.
- Everyone benefits from insights from production applications.
- Tools exist to automate workflows and manage on-call schedules (PagerDuty, VictorOps, Opsgenie).
- Runbook automation tools help with routine and emergency tasks (Rundeck, Ansible Tower, StackStorm).
- Communicate outages to users effectively using services like Atlassian Statuspage or Status.io.
DevSecOps: Making your systems more secure the DevOps way
- The Problem:
- Historical Friction: Traditional security teams often operate in silos, leading to conflicts with development and operations teams. This stems from different priorities, communication gaps, and a lack of collaboration.
- Resource Imbalance: Severe understaffing of security teams creates a significant burden and hinders effective security implementation.
- "Throwing Over the Wall" Approach: Security teams often receive applications late in the development cycle, leading to reactive measures and delays.
- DevSecOps as a Solution:
- Bridging the Gap: DevSecOps aims to integrate security seamlessly into the DevOps process by fostering collaboration and breaking down silos.
- Core Principles:
- Culture: Emphasizes collaboration and shared responsibility. Security must not hinder development; it should be an enabler.
- Automation: Automates security tasks throughout the development lifecycle, including early integration of security tools.
- Measurement: Establishes clear, measurable security goals and tracks progress.
- Sharing: Promotes knowledge sharing and collaboration between security, development, and operations teams.
- Shifting Left:
- Early Security Integration: Involves incorporating security checks and tools early in the development process, such as within the development environment (IDE) and continuous integration (CI) systems.
- Benefits: Early detection and resolution of security issues, reducing the cost and time of remediation.
- Potential Pitfalls: Can become overly burdensome for developers if not carefully implemented and aligned with the overall development process.
- Building a Strong Security Culture:
- Security Champions: Empowering individuals within development and operations teams to act as security advocates.
- Training and Education: Providing training and resources to improve security awareness and knowledge within the organization.
- Collaboration: Fostering open communication and collaboration between all teams involved in the software development lifecycle.
- Key Message: DevSecOps is not just about adding security to DevOps; it's about fundamentally changing how security is approached and integrated into the entire software development process.
Chaos Engineering
Imagine you're building a bridge. You wouldn't just build it and hope it holds. You'd test it by applying stress, like heavy loads or strong winds. Chaos Engineering is like that for software systems.
Instead of waiting for things to break in production (which can be disastrous), you intentionally introduce controlled failures. This helps you:
- Find weaknesses: Discover hidden problems you didn't know existed.
- Improve response: Train your team on how to react to emergencies effectively.
- Build a more robust system: Make your software more resistant to unexpected problems.
Think of it as "controlled destruction" for learning and improvement.
Important Points to Remember:
- Chaos Engineering is NOT about recklessly breaking things. It's about carefully designed experiments with clear objectives.
- Human element is crucial. Test how your team responds to incidents, not just the technology itself.
- Continuous learning is key. Analyze the results of your experiments to identify areas for improvement.
By embracing Chaos Engineering, you can build more resilient and reliable systems, improve your team's response capabilities, and ultimately deliver a better user experience.
MLOps: Leveraging DevOps to run ML systems
- MLOps bridges the gap between Machine Learning (ML) and DevOps. It's not just about deploying software; it involves managing data, models, and the intricate interplay between data scientists, developers, and operations teams.
- Data Scientists are Key: Unlike traditional DevOps where developers are the primary workload generators, data scientists play a crucial role in MLOps. Their work is highly dependent on specialized hardware (HPC clusters) and requires close collaboration with operations teams.
- Beyond Software: MLOps extends beyond traditional software deployment. It encompasses:
- Data Management: Versioning, managing, and moving massive datasets.
- Model Management: Versioning, deploying, and managing ML models (algorithms).
- HPC Cluster Management: Managing high-performance computing resources for intensive model training.
- Inference App Management: Building and managing applications that allow users to interact with the trained models.
- Unique Challenges:
- Data-driven Feedback Loops: AI models constantly evolve, requiring continuous monitoring and adjustments based on user input and changing data.
- Model Drift: Detecting and mitigating changes in model performance over time due to evolving data distributions (a crude drift-score sketch follows this list).
- Specialized Tools and Technologies: MLOps requires specialized tools and technologies beyond traditional DevOps, such as those for managing massive datasets and HPC clusters.
- Core DevOps Principles Apply: While MLOps presents unique challenges, core DevOps principles like automation, measurement, and continuous improvement remain fundamental to its success.
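A crude sketch of a drift signal, comparing live feature values against the training distribution (real MLOps stacks use statistical tests such as KS or PSI via tools like Evidently; the numbers are made up):

```python
import statistics

def drift_score(train: list, live: list) -> float:
    """Mean shift of one feature, measured in training standard deviations;
    a persistently high score is a cue to investigate or retrain."""
    mu = statistics.mean(train)
    sigma = statistics.stdev(train)
    return abs(statistics.mean(live) - mu) / sigma if sigma else 0.0

print(drift_score(train=[1.0, 1.2, 0.9, 1.1], live=[1.9, 2.1, 2.0]))
```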
In Summary:
MLOps is a critical discipline that combines the best of DevOps with specialized considerations for managing data, models, and the unique demands of machine learning workloads. It requires close collaboration between data scientists, developers, and operations teams to successfully deliver and maintain AI-powered solutions.