
Top 100 SRE Interview Questions & Answers

Prepare for your Site Reliability Engineering (SRE) interview with this guide to common SRE interview questions and answers.


SRE Interview Questions

1. Core SRE Concepts

  1. What is the primary goal of SRE?
    Answer: SRE ensures scalable, reliable systems by applying software engineering principles to operations. Key goals include automating repetitive tasks (toil reduction), defining SLIs/SLOs, and balancing innovation with reliability using error budgets.
  2. Explain the concept of an “error budget.”
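    Answer: An error budget quantifies how much unreliability a service may accumulate before it violates its SLO; it equals 100% minus the SLO target. For example, a 99.9% uptime SLO allows 0.1% of the year, about 8.76 hours, of downtime. While budget remains, teams can keep shipping features; once it is spent, they prioritize reliability improvements.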
  3. What’s the difference between SLI, SLO, and SLA?
    Answer:
  • SLI (Service Level Indicator): A measurable metric (e.g., request latency, error rate).
  • SLO (Service Level Objective): The target value for an SLI (e.g., 99.95% uptime).
  • SLA (Service Level Agreement): A contractual commitment to customers, typically with penalties (e.g., service credits) if the agreed service levels are not met.
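The error budget arithmetic from question 2 can be checked with a few lines of Python. This is a minimal sketch, not a standard library API: the function names `error_budget` and `budget_remaining` are made up for illustration, and it assumes a purely time-based availability SLO measured over one year.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def error_budget(slo: float, period_hours: float = HOURS_PER_YEAR) -> float:
    """Return the allowed downtime in hours for an availability SLO.

    `slo` is a fraction, e.g. 0.999 for "99.9% uptime".
    """
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    return (1.0 - slo) * period_hours


def budget_remaining(slo: float, downtime_hours: float,
                     period_hours: float = HOURS_PER_YEAR) -> float:
    """Return the unspent share of the error budget (negative if overspent)."""
    return 1.0 - downtime_hours / error_budget(slo, period_hours)


print(f"{error_budget(0.999):.2f} hours of downtime allowed per year")
print(f"{budget_remaining(0.999, downtime_hours=2.19):.0%} of the budget left")
```

A team could compare the remaining share against a threshold of its own choosing to decide whether to keep releasing features or to pause for reliability work.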

4. How do SREs reduce “toil”?
Answer: Automating repetitive tasks (e.g., deployments, backups) using tools like Kubernetes or Ansible. For example, replacing manual server scaling with auto-scaling groups.

5. What is the role of blameless post-mortems?
Answer: To identify systemic issues (not individual blame) and implement preventive measures.

Example: If a deployment fails due to missing tests, the fix might involve improving CI/CD pipelines.

6. What are the “Four Golden Signals” of monitoring?
Answer:

  • Latency: Time to serve requests.
  • Traffic: Request volume (e.g., queries per second).
  • Errors: Rate of failed requests.
  • Saturation: Resource utilization (e.g., CPU, memory).
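As a rough illustration of the four signals listed above, the sketch below summarizes one observation window of request records. The `Request` record, the `golden_signals` function, and the choice of a nearest-rank 95th percentile for latency are all assumptions made for this example; real monitoring systems compute these from metrics rather than from in-memory lists.

```python
import math
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Request:
    latency_ms: float  # time taken to serve the request
    ok: bool           # False if the request failed


def golden_signals(requests: List[Request], window_s: float,
                   cpu_utilization: float) -> Dict[str, float]:
    """Summarize one observation window into the four golden signals."""
    if not requests or window_s <= 0:
        raise ValueError("need at least one request and a positive window")
    latencies = sorted(r.latency_ms for r in requests)
    # Nearest-rank 95th percentile: smallest value covering 95% of requests.
    rank = max(1, math.ceil(0.95 * len(latencies)))
    failed = sum(1 for r in requests if not r.ok)
    return {
        "latency_p95_ms": latencies[rank - 1],     # Latency
        "traffic_rps": len(requests) / window_s,   # Traffic
        "error_rate": failed / len(requests),      # Errors
        "saturation": cpu_utilization,             # Saturation (0.0 to 1.0)
    }
```

Saturation is passed in here because it comes from the host or container rather than from request logs.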

7. How do SREs handle capacity planning?
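Answer: Analyzing historical usage data to predict resource needs and provisioning ahead of demand (e.g., adding nodes to a Kubernetes cluster before an expected traffic peak). Metrics stored in Prometheus can be extrapolated to forecast usage trends (e.g., with PromQL's predict_linear() function).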
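One simple way to predict resource needs from historical data is a straight-line fit. The sketch below is an assumption-laden illustration, not how any particular tool does it: `fit_line` and `forecast` are hypothetical helper names, the samples are assumed to be evenly spaced, and real capacity planning usually also has to account for seasonality and growth spurts.

```python
from typing import Sequence, Tuple


def fit_line(samples: Sequence[float]) -> Tuple[float, float]:
    """Least-squares fit of y = intercept + slope * x for x = 0, 1, 2, ..."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(samples))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def forecast(samples: Sequence[float], steps_ahead: int) -> float:
    """Extrapolate the fitted trend `steps_ahead` intervals past the last sample."""
    slope, intercept = fit_line(samples)
    return intercept + slope * (len(samples) - 1 + steps_ahead)


# Hypothetical daily peak CPU cores used over four days.
daily_peak_cores = [10.0, 20.0, 30.0, 40.0]
print(forecast(daily_peak_cores, steps_ahead=2))
```

Comparing the forecast against current cluster capacity gives a rough lead time for adding nodes.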

8. Explain “Chaos Engineering.”
Answer: Proactively testing system resilience by simulating failures (e.g., shutting down nodes with Chaos Monkey).

9. SRE vs. DevOps: Key differences?
Answer:

  • DevOps focuses on cultural collaboration between dev and ops teams.
  • SRE applies engineering rigor to operations (e.g., SLOs, error budgets).

10. What is “MTTR” and “MTBF”?
Answer:

  • MTTR (Mean Time to Recover): Average time to resolve incidents.
  • MTBF (Mean Time Between Failures): Average time between system failures.
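Both metrics can be derived from a list of incidents. The sketch below assumes each incident is a `(start_hour, end_hour)` pair within one observation period and uses the common definitions: MTTR as total downtime divided by the number of incidents, and MTBF as total uptime divided by the number of failures. The function names are illustrative only.

```python
from typing import List, Tuple

Incident = Tuple[float, float]  # (start_hour, end_hour) of one outage


def mttr(incidents: List[Incident]) -> float:
    """Mean time to recover: average outage duration in hours."""
    if not incidents:
        raise ValueError("no incidents recorded")
    return sum(end - start for start, end in incidents) / len(incidents)


def mtbf(incidents: List[Incident], period_hours: float) -> float:
    """Mean time between failures: uptime in the period per failure."""
    if not incidents:
        raise ValueError("no incidents recorded")
    downtime = sum(end - start for start, end in incidents)
    return (period_hours - downtime) / len(incidents)


# Two hypothetical outages in a 100-hour window: 1 hour and 3 hours long.
outages = [(10.0, 11.0), (50.0, 53.0)]
print(mttr(outages), mtbf(outages, period_hours=100.0))
```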

12. What is the role of SRE in bridging Development and Operations?
Answer: SRE merges software engineering with infrastructure management to ensure systems are reliable, scalable, and cost-effective. Key tasks include automating deployments and defining SLOs.

13. What are the key responsibilities of an SRE?
Answer:

  • Define and monitor SLIs/SLOs.
  • Automate toil (e.g., CI/CD pipelines).
  • Conduct blameless post-mortems.
  • Optimize cloud resource usage.

2. Linux/Unix

  1. How do you troubleshoot high CPU usage?
    Answer:
    1. Use top or htop to identify resource-heavy processes.
    2. Profile with strace or perf. Example: A Java app with high CPU might need garbage collection tuning.
  2. What is the difference between kill -9 and kill -15?
    Answer:
    • kill -15 (SIGTERM): Requests graceful shutdown.
    • kill -9 (SIGKILL): Forces immediate termination.
  3. How to check open ports on a Linux server?
    Answer: Use ss -tuln (or the older netstat -tuln) to list listening TCP and UDP sockets. Example: ss -tuln | grep ':443' shows whether anything is listening on the HTTPS port.
  4. Explain inode in Linux.
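    Answer: An inode is the on-disk structure that stores a file's metadata (permissions, ownership, timestamps, size) and pointers to its data blocks, but not its name. Use df -i to check inode usage.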
  5. What is a zombie process?
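    Answer: A process that has terminated but still has an entry in the process table because its parent has not yet read its exit status. It cannot be killed; it disappears once the parent reaps it with wait(), or once the parent exits and init adopts and reaps it.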
  6. List common Linux signals.
    Answer:
    • SIGHUP (1): Hangup; by convention, many daemons reload their configuration on it.
    • SIGINT (2): Interrupt process (Ctrl+C).
    • SIGKILL (9): Force termination.
    • SIGTERM (15): Graceful shutdown.
  7. How to find files modified in the last 7 days?
    Answer:
    bash
    find /path -type f -mtime -7
  8. What does lsof do?
    Answer: Lists open files and the processes that hold them. Example: lsof /var/log/syslog shows which processes have that log file open.
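To make the kill -15 versus kill -9 distinction from question 2 concrete, here is a sketch of how a Python service might react to SIGTERM. The `GracefulShutdown` class and `serve` function are hypothetical names for this example. Note that there is no equivalent for SIGKILL: a process cannot catch it, which is why kill -9 leaves no chance to clean up.

```python
import signal


class GracefulShutdown:
    """Remembers that SIGTERM arrived so the main loop can exit cleanly."""

    def __init__(self) -> None:
        self.requested = False

    def handle(self, signum, frame) -> None:
        # Keep the handler tiny: set a flag and let the main loop clean up.
        self.requested = True

    def install(self) -> None:
        # Must be called from the main thread. SIGKILL cannot be registered.
        signal.signal(signal.SIGTERM, self.handle)


def serve(shutdown: GracefulShutdown, do_work) -> None:
    """Call do_work() repeatedly until a shutdown has been requested."""
    while not shutdown.requested:
        do_work()
    # Reached only after SIGTERM: flush buffers, close connections, etc.
```

A service would call `install()` once at startup and then pass the same object to `serve()`, so that `kill -15 <pid>` ends the loop after the current unit of work.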

3. Networking

  1. TCP vs. UDP: Key differences?
    Answer:
    • TCP: Reliable, connection-oriented (e.g., HTTP).
    • UDP: Unreliable, connectionless (e.g., VoIP).
  2. Describe the three-way handshake.
    Answer:
    1. Client sends SYN.
    2. Server responds with SYN-ACK.
    3. Client sends ACK.
  3. What is BGP?
    Answer: Border Gateway Protocol routes traffic between autonomous systems (e.g., internet backbone).
  4. How does HTTPS work?
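    Answer: Encrypts HTTP traffic with TLS:
    1. Client and server negotiate the TLS version and cipher suite, and the server sends its certificate.
    2. Client verifies the certificate against its trusted certificate authorities.
    3. Both sides run a key exchange to derive shared symmetric session keys, which encrypt the rest of the traffic.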
  5. What is a CDN?
    Answer: Content Delivery Network caches static assets globally to reduce latency (e.g., Cloudflare).
  6. What is DHCP, and why is it used?
    Answer: Dynamically assigns IP addresses to devices, reducing manual configuration errors.
  7. SNAT vs. DNAT
    Answer:
    • SNAT changes the source IP (e.g., private to public IP).
    • DNAT changes the destination IP (e.g., routing traffic to a backend server).
  8. What is ICMP?
    Answer: Internet Control Message Protocol (used by ping for connectivity checks).
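The three-way handshake from question 2 happens inside the operating system whenever a program opens a TCP connection. As a small sketch, the hypothetical helper `is_port_open` below treats a successful connect as evidence that the SYN, SYN-ACK, ACK exchange completed; the one-second timeout is an arbitrary choice for the example.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port can be established.

    create_connection() only returns once the kernel has completed the
    three-way handshake, so success means something is listening there.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

For example, `is_port_open("127.0.0.1", 443)` reports whether a local process accepts connections on the HTTPS port. UDP has no handshake, so the same check does not apply to it.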

4. Cloud Computing (AWS/GCP/Azure)

  1. What is AWS Lambda?
    Answer: Serverless compute service for event-driven code (e.g., processing S3 uploads).
  2. Explain Azure Blob Storage.
    Answer: Scalable object storage for unstructured data (images, logs).
  3. What is GCP BigQuery?
    Answer: Serverless data warehouse for SQL queries on large datasets.
  4. How does AWS Auto Scaling work?
    Answer: Automatically adds or removes EC2 instances in an Auto Scaling group based on scaling policies tied to demand (e.g., a target CPU utilization).
  5. What is Azure Availability Set?
    Answer: Distributes VMs across fault domains and update domains so that a hardware failure or planned maintenance does not take down all instances at once.
  6. What is IAM?
    Answer: Identity and Access Management controls permissions for cloud resources.
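For the Lambda example in question 1 (processing S3 uploads), a handler might look like the sketch below. It assumes the function is triggered by S3 event notifications and only reads the bucket name and object key from each record; what to do with each object afterwards is left out. The return value's shape is an arbitrary choice for illustration.

```python
from urllib.parse import unquote_plus


def lambda_handler(event, context):
    """Collect the bucket and key of every object in an S3 event notification."""
    uploaded = []
    for record in event.get("Records", []):
        s3_info = record["s3"]
        bucket = s3_info["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(s3_info["object"]["key"])
        uploaded.append({"bucket": bucket, "key": key})
    return {"processed": len(uploaded), "objects": uploaded}
```

Because the handler is a plain function, it can be exercised locally by passing a hand-written event dictionary and `None` for the context.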

5. Docker

  1. Docker Image vs. Container
    Answer:
    • Image: Template with app code and dependencies.
    • Container: Running instance of an image.
  2. How to reduce Docker image size?
    Answer: Use multi-stage builds and Alpine Linux base images.
  3. What are Docker volumes?
    Answer: Persistent storage for containers (e.g., databases).
  4. How to secure Docker containers?
    Answer:
    • Run containers as non-root users.
    • Scan images for vulnerabilities (e.g., Trivy).
    • Enable Docker Content Trust.

6. Kubernetes

  1. What is a Pod?
    Answer: The smallest deployable unit in Kubernetes, hosting one or more containers.
  2. Deployment vs. StatefulSet
    Answer:
    • Deployment: Manages stateless apps (e.g., web servers).
    • StatefulSet: Manages stateful apps (e.g., databases).
  3. How to scale a Deployment?
    Answer:
    bash
    kubectl scale deployment/myapp --replicas=5
  4. What is a ConfigMap?
    Answer: Stores non-sensitive configuration data (e.g., environment variables).
  5. Explain Horizontal Pod Autoscaler (HPA).
    Answer: Scales pods based on CPU/memory usage or custom metrics.
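The scaling decision behind the HPA in question 5 follows a ratio rule described in the Kubernetes documentation: desired replicas = ceil(current replicas × current metric / target metric). The sketch below reproduces that rule in Python. The function name is made up, the 0.1 tolerance is assumed to match the default, and the sketch leaves out clamping to minReplicas/maxReplicas and stabilization windows.

```python
import math


def desired_replicas(current_replicas: int, current_metric: float,
                     target_metric: float, tolerance: float = 0.1) -> int:
    """Return the replica count suggested by the HPA ratio rule."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    ratio = current_metric / target_metric
    # Skip scaling when the metric is already close enough to the target.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)


# 4 pods averaging 90% CPU against a 60% target.
print(desired_replicas(4, current_metric=90, target_metric=60))
```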

7. Monitoring & Logging

  1. Prometheus vs. Grafana
    Answer:
    • Prometheus: Collects and stores metrics.
    • Grafana: Visualizes metrics via dashboards.
  2. What is an APM tool?
    Answer: Application Performance Monitoring (e.g., New Relic) tracks latency, errors, and throughput.
  3. ELK Stack Components
    Answer:
    • Elasticsearch: Search/analytics engine.
    • Logstash: Data processing pipeline.
    • Kibana: Visualization tool.
  4. What is Jaeger?
    Answer: Distributed tracing tool for microservices (e.g., tracking request flows).

8. CI/CD

  1. What is a canary deployment?
    Answer: Roll out changes to a small user subset before full deployment to minimize risk.
  2. GitLab CI vs. Jenkins
    Answer:
    • GitLab CI: Integrated with GitLab.
    • Jenkins: Standalone, plugin-driven automation server.
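The canary deployment from question 1 needs some rule for choosing the "small user subset". One common approach, sketched here under the assumption that every request carries a stable user ID, is to hash the ID into one of 100 buckets so the same user always gets the same version. The function name `in_canary` and the use of SHA-256 are illustrative choices, not a prescribed method.

```python
import hashlib


def in_canary(user_id: str, percent: float) -> bool:
    """Return True if this user should be routed to the canary release.

    The hash makes the decision deterministic per user, so raising
    `percent` only adds users to the canary and never reshuffles them.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # 0..99
    return bucket < percent
```

Rolling out then means raising `percent` step by step (for example 1, 10, 50, 100) while watching error rates, and setting it back to 0 to roll back.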

9. Scripting & Automation

  1. Write a script to count “ERROR” lines in a log file.
    Answer:
    bash
    grep -c "ERROR" app.log
  2. How to parse JSON in Python?
    Answer:
    python
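    import json
    json_string = '{"service": "api", "status": "ok"}'  # sample input
    data = json.loads(json_string)  # parse JSON text into a dict
    print(data["status"])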

10. Incident Management

  1. Steps to resolve an outage?
    Answer: Detect → Acknowledge → Diagnose → Fix → Post-mortem → Prevent recurrence.
  2. What is a “playbook”?
    Answer: Step-by-step guide for common incidents (e.g., database failure).

11. Security & Compliance

  1. Principle of Least Privilege
    Answer: Grant minimal permissions required for users/roles.
  2. How to secure SSH?
    Answer: Disable root login, use SSH keys, and enable 2FA.

12. Coding Challenges

  1. Find top 5 IPs in a log file.
    Answer:
    bash
    awk '{print $1}' access.log | sort | uniq -c | sort -nr | head -5
  2. Check if a string is a palindrome (Python).
    Answer:
    python
    def is_palindrome(s):
        return s == s[::-1]
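For comparison with the awk pipeline in challenge 1, the same count can be done in Python with `collections.Counter`. This sketch assumes, like the awk version, that the client IP is the first whitespace-separated field of each log line; `top_ips` is a hypothetical helper name.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def top_ips(lines: Iterable[str], n: int = 5) -> List[Tuple[str, int]]:
    """Return the n most frequent first fields (client IPs) with their counts."""
    counts = Counter(line.split()[0] for line in lines if line.strip())
    return counts.most_common(n)


sample = [
    "10.0.0.1 GET /index.html",
    "10.0.0.2 GET /about.html",
    "10.0.0.1 GET /contact.html",
]
print(top_ips(sample, n=2))
```

For a real file, passing the open file object works as well, since it yields one line at a time: `top_ips(open("access.log"))`.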
