Site Reliability Engineer / Tech Lead
Nov 2020 – Present
Infrastructure as a Service (IaaS) with OpenStack, Ceph, and Kubernetes.
- Participate in project planning, task estimation, and prioritization to maintain alignment with organizational goals.
- Collaborate with cross-functional teams to design and implement scalable, resilient infrastructure solutions.
- Improve system reliability by building monitoring and alerting stacks with Prometheus, Grafana, and custom alerting rules.
- Promote SRE best practices, driving a culture of reliability, automation, and continuous improvement.
- Design and maintain CI/CD pipelines using GitLab CI, improving deployment speed and consistency.
- Contribute to Infrastructure-as-Code initiatives using Ansible, enabling repeatable and automated deployments.
- Implement Load Balancer as a Service (LBaaS) with OpenStack Octavia.
- Deploy, operate, and optimize multiple Kubernetes clusters.
- Manage dozens of microservices on Kubernetes using Helm charts, ensuring efficient rollouts and lifecycle management.
- Led a VPC project connecting three OpenStack clusters over VXLAN overlays with BGP EVPN routing, using OVN and Open vSwitch to provide unified networking and secure inter-cluster communication.