Dhruv Chavda photo

👋 Hello! I'm Dhruv

I'm a CKA- and ICA-certified DevOps/SRE/Platform Engineer with over 4 years of hands-on experience in Kubernetes, Docker, and AWS cloud services. I specialize in Istio, K8s Operators, optimizing workflows, and implementing cost-saving strategies.

About me 🤷🏻‍♂️

I’m a DevOps Engineer with a Bachelor's degree from Dhirubhai Ambani Institute of Information and Communication Technology (DA-IICT). Originally from Porbandar, Gujarat, I am currently based in Pune. I am passionate about automation, optimization, and enhancing system reliability. Outside of work, I enjoy swimming 🏊🏻, hitting the gym 🏋🏻, playing the guitar 🎸, and diving into self-help and mythology books 📚.

Feel free to explore my Blogs, check out my Certifications on Credly, or Book a 1-on-1 on Topmate 😄.

Skills 🧑🏻‍💻

Languages

  • Python
  • Bash - Zsh
  • Go
  • C++

Cloud Tech

  • Amazon Web Services
  • DigitalOcean
  • Kubernetes
  • Docker
  • Istio (Service Mesh)
  • Cloudflare

Automations

  • Helm
  • Terraform
  • AWS CloudFormation

CI | CD

  • GitLab Pipelines
  • Jenkins
  • GitLeaks

Logging and Monitoring

  • Prometheus
  • Alertmanager
  • Grafana
  • PagerDuty
  • Datadog
  • Sumo Logic
  • Loki (Log Aggregation)

Soft Skills

  • Flexibility and Adaptability
  • Teamwork
  • Proactivity
  • Critical Thinking
  • Problem Solving
  • Always Learning

Tools & Systems

  • Linux
  • Git
  • GitHub
  • Slack Workflows
  • Jira
  • WordPress
  • Apache Airflow
  • Infisical
  • Microsoft Entra ID

Experience 🚀

  1. Senior DevOps Engineer

    ZZAZZ AI

    • Kubernetes Architecture & Observability:
      - Owned and architected a secure Kubernetes platform from scratch across 2 environments, supporting ~150 microservices and serving ~15K RPS per cluster; introduced RBAC, network policies, namespace isolation, and multi-region clusters for GDPR compliance and release safety.
      - Drove an OSS-first platform strategy over vendor tooling across observability, CI/CD, certificate management, and orchestration; adopted Helm, cert-manager, Kubernetes Operators, and the LGTM stack, cutting licensing costs and avoiding vendor lock-in.
      - Built the platform-wide observability stack from the ground up using Prometheus, Grafana, Alertmanager, and Loki; delivered team-specific alerting and dashboards covering nodes, pods, workloads, compute, and networking.
      - Standardized deployment patterns for 150+ services with reusable Helm templates, reducing setup overhead and powering scalable self-service delivery.
    • Self-Hosted GitLab & CI/CD at Scale:
      - Established and operationalized self-hosted GitLab for 50+ users, ~300 projects, and 20k+ pipeline runs per month; transferred 250+ repositories from Bitbucket with zero data loss and added DR automation with RPO ~24h / RTO ~1h.
      - Streamlined CI/CD on Kubernetes-based runners, cutting container image build times by 60–70%, adding GitLeaks security scans, and introducing deployment guardrails such as central kill switches for safer releases.
    • ZTNA & Infrastructure Security:
      - Architected centralized identity and access management by rolling out SSO and SCIM across 45+ tools and applications, standardizing authentication and automating user lifecycle management organization-wide.
      - Strengthened platform access controls by rolling out Cloudflare ZTNA across 500+ droplets, onboarding staging and production services behind private-network access patterns.
      - Enabled safer developer access by adding Cloudflared + ZTNA authenticated test endpoints for Kubernetes services and rolling out Cloudflare WARP for 35+ developers and contractors.
      - Unified multi-cloud operations across DigitalOcean, AWS, and Cloudflare with OAuth-enabled tooling and resource tagging for governance and cost accountability.
    • Argus (AI Ops Chatbot):
      - Developed Argus, an internal AI Ops chatbot on Kubernetes, Prometheus, and Grafana MCP servers, reducing MTTR by 75% and unlocking self-serve cluster debugging.
      - Launched an admin dashboard for chatbot monitoring, token cost analytics, and usage insights.
    • Cloudflare DNS & Workers:
      - Simplified ingress and edge routing with dynamic DNS allocation and a single shared load balancer, eliminating 60+ load balancers and enabling faster, standardized ingress creation.
      - Transitioned 15+ frontend services to Cloudflare Workers and Pages, strengthening edge delivery and standardizing end-to-end CI/CD automation for frontend releases.

    Also served as a 24x7 on-call engineer across cloud, platform, and infrastructure, supporting developer enablement, production incident response, and end-to-end platform ownership.

  2. Software Development Engineer-2 (DevOps)

    Mindtickle

    • Istio Service Mesh:
      - Reduced inter-AZ data transfer costs by 70% through Istio locality-aware traffic management and expanded mesh visibility with Grafana dashboards for both control plane and data plane observability.
      - Delivered zero-downtime Istio upgrades in production by introducing a canary upgrade strategy and transitioning mesh deployments to Helm-based version control.
      - Operated Istio service mesh across multiple clusters serving 500+ microservices, strengthening traffic control and production reliability for platform-managed services.
    • Cost-Saving Initiatives:
      - Replatformed Kubernetes workloads onto AWS Graviton-based instances through ARM image build and deployment pipelines, generating $270,000 in savings.
      - Automated EC2 and EBS cleanup to remove redundant capacity, driving $100,000+ annual savings.
    • Isolated Sandbox Testing:
      - Designed an isolated sandbox testing model for Kubernetes-native applications based on Shift Left Testing, driving a 40% reduction in error rates before production rollout.
      - Used Kubernetes Operators and Istio reliability features to make the solution repeatable and production-like, with plans to open-source the approach.
    • Chaos Testing Enhancements:
      - Engineered an in-house chaos testing mechanism using Kubernetes Operators and Istio fault injection to validate resiliency across 50+ microservices.
      - Extended resilience testing to infrastructure level with AWS Fault Injection Simulator on clusters spanning 150+ nodes.
    • Repository Migration:
      - Developed migration and validation tooling to move 1000+ repositories from GitHub to GitLab across 5 language ecosystems, saving around $40,000 annually.
    • Slack Workflows and Automation:
      - Automated production escalation setup with a Slack workflow that integrated Jira, PagerDuty, and Google Meet, reducing manual coordination during incidents.
      - Implemented a custom Slackbot for access management and routine developer tasks; cut manual workload by 30% and saved the team 50+ hours monthly.
    • DevOps Help-Desk:
      - Developed an internal portal to manage and route developer on-call requests, supporting 400+ requests/month through Jira Service Management and Slack integrations.
      - Introduced SLAs and reporting for recurring issue analysis, streamlining support prioritization and reducing tickets by 10% month-over-month.

  3. Platform Architect (12,000+ Sites)

    Freelance Project

    • Platform Architecture & Cost Optimization:
      - Designed and delivered a Kubernetes-based hosting platform for 12,000+ internal websites, cutting hosting costs by ~$25k/month, from ~$30k to ~$5k.
      - Engineered the platform's database layer, observability, security, and access controls end-to-end, with automated DNS, SSL certificate management, and monitoring at 12,000-site scale.
      - Optimized performance under varying traffic with Redis and local caching, plus autoscaling.
    • Automation & Operational Tooling:
      - Developed automation tooling for bulk content push and theme/plugin management at scale using custom scripts and WP-CLI.
      - Exposed API endpoints to automate operational website actions such as article publishing and plugin-driven workflows across the fleet.
      - Coordinated delivery for demo and client-critical sites, handling access, plugin procurement, timelines, and end-to-end execution.

Stay in Touch 👋🏻