Curriculum
OUTLINE

Module Name                                   Module Type  Duration
Linux-1                                       Core         3 weeks
Linux-2                                       Core         4 weeks
DevOps1                                       Core         3 months
DevOps2                                       Core         3 weeks
AWS1                                          Core         1.5 months
AWS2                                          Core         3 months
No-Code Automation for Ops Workflows          Core         2 weeks
Engineering in the Age of AI                  Core         1 week
Python as the AI Control Plane                Core         2 weeks
Observability for AI and Operations           Core         2 weeks
Anomaly Detection Systems                     Core         2 weeks
Predictive Operations and Failure Prevention  Core         1 week
Security Foundations for AI and Automation    Core         2 weeks
AI Observability and Model Health             Core         2 weeks
ML-Driven Optimization                        Core         2 weeks
Generative AI for Engineers                   Core         2 weeks
Retrieval-Augmented Generation (RAG)          Core         2 weeks
Agentic AI Systems                            Core         2 weeks
MLOps on Kubernetes                           Core         2 weeks
AI Platform Security and FinOps               Core         2 weeks
AWS AI/ML Foundations                         Core         2 weeks
Amazon SageMaker Deep Dive                    Core         1 week
Generative AI on AWS                          Core         2 weeks
AWS MLOps and Architecture                    Core         1 week
AI Threat Models                              Core         1 week
Governance and Responsible AI                 Core         1 week
LLM Gateway and Proxy Patterns                Core         1 week
Dataset Operations (DataOps)                  Elective     1 week
Industry-Grade Capstone Project               Core         1 week

Curriculum
Deep dive

Program Timeline

Linux-1
  • Linux Essentials
  • Linux - File Management
  • Linux - Process Management
  • Linux - Networks
  • Linux - Text Processing
  • Shell scripting introduction
  • Fundamentals of Shell Scripting
Linux-2
  • Shell scripting walkthrough
  • Shell scripting for DevOps - file management
  • Shell scripting for DevOps - process management
  • Shell scripting for DevOps - filesystem management
  • Shell scripting for DevOps - backup
  • Shell scripting for DevOps - Networking
  • Mock interview demo by instructor
  • Git introduction
  • Git in DevOps - Git hooks, pull requests, branching
DevOps1
  • Docker introduction
  • Docker image and Docker containers
  • Building your own Docker image
  • Images and Containers Deep dive
  • Docker volume
  • Docker networking deep dive
  • Hardening the Docker environment
  • Demo: implementing industry-standard practices (e.g. image hardening and other security standards)
  • Introduction to Orchestration
  • Kubernetes architecture
  • Kubernetes Pod deep dive
  • Kubernetes ReplicaSet and Deployment deep dive, sidecar containers
  • Kubernetes networking
  • Scaling options in Kubernetes - HPA, CA and VPA - comparison with Karpenter
  • Introduction to scaling in Kubernetes, network policies
  • Kubernetes security - RBAC
  • Kubernetes volumes
  • Kubernetes ingress
  • Init Containers, Static Pods, Scheduling
  • Kubernetes Jobs, Probes, admission controllers
  • CRD, CNI
  • ConfigMaps and Secrets
  • Troubleshooting
  • Introduction to Observability: Prometheus and Grafana setup
  • Prometheus Monitoring Configuration
  • Advanced Monitoring Techniques
  • Getting Started with Grafana
  • Variables, Data Sources, and Persistent Dashboards in Grafana
  • Scaling Grafana: Best Practices, Real-World Use Cases, and Performance Tuning
  • Deploying your graph in Grafana on a Kubernetes cluster
  • Advanced observability using New Relic
  • Project: deploying a microservice in a Kubernetes cluster
DevOps2
  • Introduction to DevOps
  • Introduction to CI/CD
  • Jenkins introduction, architecture, jobs
  • Project: CI/CD on Jenkins using simple job
  • Jenkins pipeline syntax walkthrough
  • Project: CI/CD on Jenkins using pipelines
  • Jenkins advanced - add nodes, RBAC
  • Project by engineers: end-to-end CI/CD on Jenkins to industry standard
  • Introduction to GitHub Actions
  • Secrets, Events, and Workflow Optimization
  • Advanced Techniques and Best Practices
  • Efficient Workflows and Security Practices
  • Project by engineers: end-to-end CI/CD on GitHub Actions to industry standard
  • Advanced ArgoCD Usage and Configuration
  • Advanced Features and Integration
  • ArgoCD-Jenkins Integration
  • Project Demo
  • IaC fundamentals and Ansible fundamentals
  • Ansible Playbooks & Modules
  • Templates, Variables & Facts
  • Roles & Advanced Topics
  • Ansible Vault & Best Practices
  • Introduction to SRE concepts
  • Ansible Project
AWS1
  • Cloud Basics & Evolution
  • AWS Compute & EC2 Intro
  • EC2 Deep Dive
  • EC2 Hosting & Load Balancing
  • AWS IAM Basics
  • AWS Storage & CloudFront
  • S3 Deep Dive
  • AWS CLI & Boto Intro
  • AWS VPC & Networking Basics
  • Advanced AWS Networking
  • AWS Observability & CloudWatch
  • CloudWatch & CloudTrail Advanced
  • Docker App on EC2 Project
  • AWS Lambda & API Gateway
  • End-to-End Project
AWS2
  • AWS ECS & ECR Intro
  • ECS Fundamentals
  • EKS Basics
  • EKS Advanced
  • Microservices on AWS Project
  • AWS CI/CD & Pipelines
  • Querying AWS Services
  • Serverless on AWS
  • AWS CloudFormation
  • AWS Security: Identity & Network
  • AWS Security: Data & Detection
  • AWS Security Hands-on
  • AWS Control Tower
  • AWS RDS (SQL Databases)
  • DynamoDB (NoSQL)
  • AWS Glue & Athena
  • AWS Migration Services
  • Generative AI on AWS
  • AWS Services Deep Dive
  • Project Architecture Review
  • Kubernetes AI Project
  • AWS Backup
  • DevOps Cert Discussion
  • Terraform & IaC Intro
  • Terraform Language Deep Dive
  • Terraform State & Modules
  • Terraform Dependencies
  • Advanced Terraform Workspaces
  • Terraform Cluster Project
  • Deploying an app - Project
No-Code Automation for Ops Workflows
  • Building no-code workflows (using tools like Zapier, n8n, or internal AI agents) to connect Jira tickets to GitHub Actions or Slack alerts.
  • Intent-Based Infrastructure
  • The AI-SRE & Incident Response
  • Event-Driven No-Code Ops
  • AI Governance & Guardrails (implementing Policy-as-Code (OPA) and "Human-in-the-Loop" gates to ensure autonomous agents operate within security and budget limits)
  • GPU FinOps & Future-Proofing
Engineering in the Age of AI
  • DevOps in the Age of AI: Course Overview
  • Why Traditional Automation Fails
  • Deterministic vs Probabilistic Systems
  • AI Workloads vs Traditional Workloads
  • Training vs Inference: What DevOps Needs to Know
  • The AI Lifecycle and Its DevOps Impact
  • Mapping AI Lifecycle Stages to a DevOps Pipeline
  • Real Production Incidents: Where AI Intervenes
  • Classical Automation vs AI Decision Logic
  • Case Study: AI-Integrated Incident Response
  • The Future of the DevOps Role
  • Module Review and Assessment
Python as the AI Control Plane
  • Python for DevOps Engineers: Beyond Scripting
  • Infrastructure Introspection with boto3
  • System Profiling with psutil and subprocess
  • API-Driven Infrastructure Intelligence
  • Using GitHub Copilot for Ops Coding
  • Orchestration Scripting with Python
  • Building a Live System Profiler
  • Python Automation for Infrastructure Inspection
  • Collecting Historical Ops Data for ML
  • Building a Python-Based Orchestration Layer
  • Connecting Python to LLM Reasoning
  • Module Review and Assessment
Observability for AI and Operations
  • Metrics, Logs and Traces as ML Signals
  • Feature Extraction from Telemetry Data
  • Data Drift vs Signal Noise
  • Observability for Humans vs Machines
  • Designing ML-Ready Observability Pipelines
  • Building Features from Metrics and Logs
  • Baseline Telemetry Health Checks
  • Machine-Readable Observability Layer: Design
  • Machine-Readable Observability Layer: Build
  • OpenTelemetry for AI Workloads
  • Validating Observability Data Quality
  • Module Review and Assessment
Anomaly Detection Systems
  • From Observability to Intelligence
  • Behavioral Baselines: Theory and Application
  • Infrastructure Anomaly Detection
  • Network and Log Anomaly Detection
  • Feature Engineering for Ops Data
  • Alert Fatigue vs Intelligent Alerting
  • Building a Real-Time Anomaly Detection Service
  • Detecting Traffic and Log Anomalies
  • Creating Behavioral Fingerprints of Services
  • Building an Alert Scoring and Suppression Engine
  • Tuning Detection Thresholds in Production
  • Module Review and Assessment
Predictive Operations and Failure Prevention
  • From Detection to Prevention: The Shift
  • Predictive Monitoring Fundamentals
  • Silent Failure Detection
  • Trend-Based Alerting Design
  • Failure Forecasting with Time-Series Models
  • Building Predictive Alerting Models
  • Forecasting Saturation and Outage Risks
  • Creating an Early-Warning System
  • Failure Prevention Simulation
  • SLO Design for Predictive Systems
  • Integrating Predictions into Operations Workflow
  • Module Review and Assessment
Security Foundations for AI and Automation
  • Security Before Agents: The Foundational Principle
  • Why AI Amplifies Blast Radius
  • Read-Only vs Action-Taking Systems
  • Identity, Permissions and Least Privilege
  • Guardrails as First-Class Architecture
  • Designing Read-Only AI Tooling Boundaries
  • Building Permission-Scoped AI Tools
  • Simulating Privilege Escalation Risks
  • Implementing Approval Gates
  • Zero-Trust Design for AI Systems
  • Audit Logging for Autonomous Systems
  • Module Review and Assessment
AI Observability and Model Health
  • What is Data Drift and Why It Matters
  • Concept Drift: When Models Silently Fail
  • ML Performance Decay Over Time
  • Trust Degradation in Production AI
  • Implementing Drift Detection Pipelines
  • Monitoring Live ML Model Performance
  • Building an ML Health Dashboard
  • Simulating Silent ML Failure in Production
  • Retraining Triggers and Automation
  • Model Health SLOs
  • Integrating Model Health into Incident Response
  • Module Review and Assessment
ML-Driven Optimization
  • From Reactive to Intelligent Infrastructure
  • Predictive Autoscaling: Concepts and Design
  • Kubernetes Right-Sizing with ML
  • Cost-Aware ML Systems
  • Building an ML-Based Autoscaler
  • Predictive Resource Optimization in Practice
  • Designing a Cost Forecasting Engine
  • Tuning Workloads with ML Recommendations
  • Benchmarking ML vs Threshold-Based Scaling
  • FinOps Fundamentals for Platform Engineers
  • Cost Reporting and Governance
  • Module Review and Assessment
Generative AI for Engineers
  • Foundation Models and the AWS Bedrock Ecosystem
  • Infra-Aware Prompt Engineering
  • Validation and Grounding Strategies for GenAI
  • Building an Engineering Assistant with Bedrock
  • Generating Terraform with LLMs Safely: prompt engineering for Infrastructure-as-Code (IaC); using LLMs to generate complex Terraform modules and Kubernetes manifests
  • Generating Helm Charts and Bash with LLMs
  • Implementing Response Validation Pipelines
  • Detecting Hallucinated Infrastructure Output
  • Safe GenAI in CI/CD Pipelines
  • Production Checklist for LLM-Generated IaC
  • Prompt Templates for DevOps Use Cases
  • Module Review and Assessment
Retrieval-Augmented Generation (RAG)
  • Why RAG: Enterprise AI That Knows Your Systems
  • Embeddings: How They Work
  • Vector Databases: Selection and Setup
  • Indexing Logs for Semantic Search
  • Indexing Runbooks and Incident Tickets
  • Building a Private RAG System
  • Implementing Semantic Search on Operational Data
  • Building an AI Troubleshooting Assistant
  • RAG Pipeline Evaluation and Quality
  • Keeping the Knowledge Base Current
  • RAG vs Fine-Tuning: Choosing the Right Approach
  • Module Review and Assessment
Agentic AI Systems
  • What is Controlled Autonomy
  • Tool-Using AI: How Agents Execute Actions
  • Guardrails and Policy Enforcement for Agents
  • Autonomous vs Semi-Autonomous Workflows
  • Multi-Agent Architecture Patterns
  • Building an AI Agent That Executes CLI Tools
  • Creating a Guarded Remediation Agent
  • Building AI-Driven Incident Resolution Workflows
  • Designing a Multi-Agent DevOps System
  • Agent Testing and Failure Mode Analysis
  • Human-in-the-Loop Checkpoints for Production Agents
  • Module Review and Assessment
MLOps on Kubernetes
  • Introduction to the MLOps Stack
  • Kubeflow: Pipeline Design and Execution
  • Ray: Distributed ML Workloads
  • MLflow: Experiment Tracking and Model Registry
  • KServe: Inference Platform Deployment
  • GPU Partitioning with MIG
  • GPU Health Observability in Production
  • Spot Instance Checkpointing for Training
  • Building Kubeflow Pipelines End-to-End
  • Deploying Inference Platforms with KServe
  • Tracking Experiments Across Training Runs
  • Module Review and Assessment
AI Platform Security and FinOps
  • GPU Quota Governance: Design and Enforcement
  • Zero-Trust Networking for AI Workloads
  • Cost Governance Fundamentals
  • Preventing GPU Resource Abuse
  • Building AI Cost Attribution Dashboards
  • Simulating and Containing Cost-Shock Scenarios
  • Policy-Based Access Control for AI Platforms
  • FinOps Maturity Model for AI Infrastructure
  • Cost Reporting for ML Teams
  • Security Audit for AI Platforms
  • Billing Alerts and Automated Cost Controls
  • Module Review and Assessment
AWS AI/ML Foundations
  • The AWS AI Ecosystem: Service Overview
  • Shared Responsibility Model for AI Workloads
  • Selecting the Right AWS AI Service
  • Security Baseline for AWS AI Deployments
  • Governance and Compliance on AWS AI
  • Designing a Reference Architecture for Secure AI
  • IAM Design for ML Workloads
  • CloudTrail and Audit Logging for AI
  • AWS Well-Architected Framework Applied to AI
  • Cost Estimation for AWS AI Workloads
  • Networking for AI Services on AWS
  • Module Review and Assessment
Amazon SageMaker Deep Dive
  • SageMaker Architecture and Core Concepts
  • Configuring and Running Training Jobs
  • Hyperparameter Tuning Strategies
  • SageMaker Endpoints: Deployment Options
  • Monitoring Endpoint Performance with CloudWatch
  • SageMaker Studio: The ML IDE
  • Data Processing with SageMaker Processing Jobs
  • Model Versioning and the Model Registry
  • Multi-Model Endpoints and Cost Optimization
  • SageMaker Security and Network Isolation
  • Integrating SageMaker into Application Architectures
  • Module Review and Assessment
Generative AI on AWS
  • AWS Bedrock: Foundation Models and Access Patterns
  • Building GenAI Applications with Bedrock
  • Bedrock Knowledge Bases for RAG
  • Implementing RAG on AWS
  • Bedrock Agents: Design and Configuration
  • Tool Use with Bedrock Agents
  • Serverless Inference Architecture for GenAI
  • Bedrock Security and Data Privacy
  • Cost Optimization for Bedrock Workloads
  • Monitoring and Evaluation of GenAI on AWS
  • Prompt Management and Versioning on AWS
  • Module Review and Assessment
AWS MLOps and Architecture
  • CI/CD for ML: The SageMaker Pipelines Approach
  • Building an End-to-End SageMaker Pipeline
  • Automated Drift Detection with Model Monitor
  • Retraining Triggers and Automation
  • Multi-Account AI Platform Design
  • AWS Organizations for ML Governance
  • CloudFormation for AI Infrastructure
  • Model Approval Workflows and Human Gates
  • Cross-Account Model Deployment
  • Compliance and Audit for Multi-Account AI
  • Cost Governance Across Accounts
  • Module Review and Assessment
AI Threat Models
  • The AI Threat Landscape: OWASP LLM Top 10
  • Model Abuse: Attack Vectors and Defences
  • Data Poisoning: Detection and Prevention
  • Pipeline Compromise: Security Architecture
  • PII Masking Gateway: Design
  • PII Masking Gateway: Implementation
  • Model Weight Security and Integrity Verification
  • Adversarial Prompt Injection: Simulation
  • Defending Against Prompt Injection
  • AI-Specific Threat Modeling Methodology
  • Security Testing for AI Pipelines
  • Setting Guardrails (Policy-as-Code)
  • Module Review and Assessment
Governance and Responsible AI
  • Human-in-the-Loop: Design Principles
  • Designing Approval Workflows for Autonomous Actions
  • Output Validation Layers: Design and Implementation
  • Compliance Frameworks for AI Infrastructure
  • Mapping AI Decisions to Regulatory Requirements
  • Audit Trail Design for AI Systems
  • Risk Assessment for AI Deployments
  • Responsible AI Principles in Practice
  • Documentation Standards for AI Systems
  • Incident Response for AI Governance Failures
  • Internal Governance Structures for AI
  • Module Review and Assessment
LLM Gateway and Proxy Patterns
  • The AI Gateway Pattern: Architecture and Rationale
  • Request Routing Across Foundation Models
  • Rate Limiting for LLM APIs
  • Fallback Routing for High Availability
  • Semantic Caching: How It Works
  • Implementing Semantic Caching with Redis
  • PII Filtering at the Gateway Layer
  • Building an LLM Gateway with AWS API Gateway
  • Testing Gateway Reliability and Failover
  • Gateway Observability and Alerting
  • Cost Monitoring for LLM API Usage
  • Module Review and Assessment
Dataset Operations (DataOps)
  • Introduction to DataOps for AI Systems
  • Feature Store Architecture and Use Cases
  • Integrating Feast into an ML Pipeline
  • Object Storage Performance Tuning
  • Data Validation with Great Expectations
  • Parquet and Columnar Storage Optimization
  • Data Quality Monitoring for ML Pipelines
  • Versioning Datasets and Features
  • Building a Data Validation Pipeline
  • Data Lineage and Governance
  • DataOps in Multi-Team Environments
  • Module Review and Assessment
Industry-Grade Capstone Project
  • Capstone Brief: Requirements and Architecture Scope
  • System Design: AI-Powered Observability Platform
  • Building the Observability Platform
  • System Design: Predictive Failure Prevention Engine
  • Building the Failure Prevention Engine
  • System Design: Autonomous Incident Resolution System
  • Building the Incident Resolution Agent
  • Secure Kubernetes Platform Design and Build
  • AWS Enterprise AI Architecture Design
  • Threat Modeling and Security Review
  • Cost Engineering Analysis and Optimization
  • Architecture Defense and Production Readiness Review
  • Basic Analysis Before Forming Prompts
  • Tactics for Identifying Root Causes Through Well-Framed Prompts