Milad Jahandideh - Site Reliability Engineer

About

Site Reliability Engineer and Tech Lead with 7+ years of experience designing, operating, and scaling high-availability cloud infrastructure serving thousands of users. At ArvanCloud, I lead SRE initiatives across a large-scale IaaS platform built on OpenStack, Ceph, and Kubernetes, owning reliability, observability, and infrastructure automation across distributed systems.

I combine deep technical execution with engineering leadership: defining SLOs, driving incident culture, and guiding teams to build reliable systems at scale. My background spans private cloud infrastructure, container orchestration, network engineering, and full-cycle DevOps automation.

Experience

Site Reliability Engineer / Tech Lead

ArvanCloud.ir · Full-time

Nov 2020 – Present

ArvanCloud is a leading Iranian cloud provider delivering Infrastructure as a Service (IaaS) at scale, built on OpenStack, Ceph, and Kubernetes.

Lead the SRE chapter for ArvanCloud's IaaS platform, defining SLOs, owning incident response processes, and establishing reliability standards across core infrastructure services.
Architected and delivered a VPC project connecting three OpenStack clusters via VXLAN overlays and BGP EVPN routing using OVN and Open vSwitch, enabling secure and unified inter-cluster networking at scale; published an open-source OVN/OVS CLI cheatsheet on GitHub.
Designed and scaled the observability stack using Prometheus, Grafana, and custom alerting rules, significantly reducing MTTR and improving incident detection across distributed systems.
Deployed and operated multiple production Kubernetes clusters, managing dozens of microservices via Helm charts and GitOps workflows with ArgoCD.
Integrated Ceph RBD with OpenStack Cinder for persistent block storage and deployed Ceph CSI for Kubernetes persistent volume provisioning; published an open-source Ceph CLI cheatsheet on GitHub.
Implemented Load Balancer as a Service (LBaaS) using OpenStack Octavia, extending self-service networking capabilities for cloud tenants.
Built and maintained CI/CD pipelines with GitLab CI and standardized Infrastructure-as-Code practices with Ansible and Terraform, enabling consistent and automated deployments.
Participate in on-call rotations, lead incident response, and author post-mortems to drive systemic reliability improvements.
Maintain operational documentation including architecture diagrams, runbooks, and on-call playbooks to support team scaling and knowledge transfer.

Linux System Administrator

Mahsan.co · Full-time

Dec 2018 – Nov 2020

Administered and maintained a large-scale Linux server environment, supporting deployments across hundreds of servers for a defense-sector organization.
Deployed VMware ESXi virtualization infrastructure, enabling isolated development and testing environments for engineering teams.
Eliminated manual toil by automating infrastructure operations with Ansible and shell scripts.
Containerized and migrated monolithic applications to LXC, improving resource efficiency and deployment repeatability.
Built a centralized logging platform using the ELK Stack to aggregate and analyze logs from thousands of servers, enabling proactive issue detection.
Configured Zabbix monitoring with alerting, enabling proactive capacity tracking and server health visibility.

Embedded Systems Developer

Adeeco · Full-time

Dec 2017 – Nov 2018

Developed firmware for industrial embedded systems using AVR and ARM microcontrollers.
Designed and built hardware-software integrated solutions for automation and control applications.