SSV

Hi, my name is

Sowmya Sree Velpula.

I build reliable systems at scale.

Developer-turned-SRE with 6+ years of experience building and operating production-grade distributed systems. Currently ensuring platform reliability at Solidigm (SK Hynix), managing infrastructure, observability, and incident response across mission-critical environments.

01. About Me

I started my career as a software engineer at a high-growth startup, building backend services and RESTful APIs with Node.js and PostgreSQL. Shipping features under tight deadlines gave me a strong sense of ownership and drew me toward reliability engineering.

Today, I combine strong backend engineering skills in Node.js, Java, and Python with deep operational expertise to ensure systems stay up, perform well, and scale gracefully. I specialize in CI/CD automation, incident management, capacity planning, and production readiness for event-driven microservices.

I hold a Master of Science (Summa Cum Laude) from the Virginia Institute of Science and Technology and a B.Tech in Computer Science from JNTUH, Hyderabad.

Infrastructure at Scale

Managing multi-tenant enterprise platforms on AWS with Docker, Kubernetes, and Terraform across production environments.

Full-Stack Observability

Building monitoring strategies with Splunk, Dynatrace, Prometheus, and Grafana covering 50+ API endpoints with SLI/SLO tracking.

AWS Certified

AWS DevOps Associate certified with deep expertise across EC2, ECS, Lambda, RDS, S3, SQS, SNS, and CloudWatch.

02. Where I've Worked

Site Reliability Engineer @ Solidigm

June 2024 - Present · San Jose, California

  • Own end-to-end production reliability for a multi-tenant enterprise platform, managing infrastructure provisioning, CI/CD pipelines, observability, and incident response.
  • Architected a comprehensive monitoring strategy using Splunk, Dynatrace, CloudWatch, Prometheus, and Grafana with 20+ dashboards covering API latency (p50/p95/p99) and error budgets.
  • Designed zero-downtime deployment pipelines using GitHub Actions, reducing failed deployments by 70% and cutting deployment time from 25 to 8 minutes.
  • Automated operational toil using AWS Lambda, EventBridge, and Python/Bash scripts, eliminating 15+ hours/week of manual effort.
  • Led incident management for production outages, reducing recurring incidents by 55% through preventive automation and blameless post-mortems.
AWS · Docker · Kubernetes · Terraform · Splunk · Dynatrace · Node.js · Python

03. Things I've Built

Enterprise Observability Platform

Solidigm

Architected a comprehensive monitoring strategy with 20+ dashboards covering infrastructure health, API latency (p50/p95/p99), error budgets, and resource utilization. Integrated Splunk, Dynatrace, CloudWatch, Prometheus, and Grafana for full-stack visibility across a multi-tenant platform.

Splunk · Dynatrace · Prometheus · Grafana · CloudWatch · Python
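
For readers unfamiliar with error budgets, here is a minimal sketch of the arithmetic behind such a dashboard panel. It is an illustration only, not the platform's code: the function name, the 99.9% SLO target, and the request counts are all assumptions.

```python
def error_budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Return the fraction of the error budget still unspent (negative if overspent).

    Hypothetical helper for illustration; slo_target is a ratio such as 0.999.
    """
    # The budget is the number of failures the SLO tolerates in this window.
    allowed_failures = (1 - slo_target) * total_requests
    if allowed_failures == 0:
        # A 100% target (or an empty window) leaves no budget to spend.
        return 0.0
    return 1 - failed_requests / allowed_failures


# Assumed numbers: a 99.9% SLO over 1,000,000 requests tolerates about 1,000 failures.
remaining = error_budget_remaining(0.999, 1_000_000, 250)
print(f"{remaining:.0%} of the error budget remains")
```

With these assumed inputs, 250 failures spend a quarter of the budget, so roughly three quarters remain for the rest of the window.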

Zero-Downtime CI/CD Pipeline

Solidigm

Designed deployment pipelines using GitHub Actions with parallel jobs, health checks, atomic release switching, and automated rollback. Reduced failed deployments by 70% and cut deployment time from 25 minutes to 8 minutes with blue-green and canary strategies.

GitHub Actions · Docker · AWS ECS · Terraform · Nginx

Event-Driven Order Pipeline

ValueLabs

Engineered a Kafka-based order processing pipeline handling 50K+ daily orders with 8 topic partitions and 3 consumer groups. Implemented dead-letter queues, exactly-once delivery semantics, and a Redis caching layer that cut p99 latency from 450ms to 85ms.

Apache Kafka · Redis · Node.js · PostgreSQL · AWS SQS

Serverless Notification Service

ValueLabs

Built a high-throughput notification microservice processing 100K+ daily events via AWS SQS FIFO queues with exponential backoff retry logic. Implemented fan-out dispatch using SNS with topic filtering across email (SES), SMS, and push channels.

AWS Lambda · SQS · SNS · SES · EventBridge · Python
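
The exponential backoff retry pattern mentioned above can be sketched roughly as follows. This is a generic illustration, not the service's actual code: `backoff_delays`, `retry`, and the base and cap values are hypothetical, and a production version would usually add random jitter and retry only on transient errors.

```python
import time


def backoff_delays(retries: int, base: float = 0.5, cap: float = 30.0) -> list[float]:
    """Delays in seconds before each retry: base * 2**n, capped at `cap`."""
    return [min(cap, base * (2 ** n)) for n in range(retries)]


def retry(operation, max_attempts: int = 5, base: float = 0.5, cap: float = 30.0, sleep=time.sleep):
    """Call `operation` until it succeeds or `max_attempts` is exhausted.

    `sleep` is injectable so the wait can be stubbed out in tests.
    """
    delays = backoff_delays(max_attempts - 1, base, cap)
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            sleep(delays[attempt])


if __name__ == "__main__":
    # Assumed values for illustration: four retries starting at 1s, capped at 5s.
    print(backoff_delays(4, base=1.0, cap=5.0))
```

Capping the delay keeps a long outage from pushing waits out indefinitely, and messages that still fail after the final attempt are the ones a dead-letter queue would typically collect.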

Infrastructure as Code Platform

Solidigm

Engineered repeatable provisioning of EC2, RDS, S3, Lambda, EventBridge, and IAM resources using Terraform and AWS CDK. Maintained environment parity across dev/staging/prod with least-privilege IAM access controls and automated compliance checks.

Terraform · AWS CDK · CloudFormation · IAM · Python · Bash

Production Toil Automation

Solidigm

Automated 15+ hours/week of operational toil using AWS Lambda and EventBridge for nightly maintenance: usage quota recalculation, stale session cleanup, metrics aggregation, and certificate rotation across production environments.

AWS Lambda · EventBridge · Python · Bash · CloudWatch

04. Skills & Technologies

Languages

  • Python
  • Node.js
  • JavaScript
  • Java
  • Go
  • Bash

Cloud & Infra

  • AWS
  • Docker
  • Kubernetes (EKS)
  • Terraform
  • Linux
  • Nginx

Observability

  • Splunk
  • Dynatrace
  • Prometheus
  • Grafana
  • Datadog
  • OpenTelemetry

Backend & APIs

  • Express.js
  • Spring Boot
  • GraphQL
  • REST APIs
  • Prisma
  • Sequelize

Databases

  • PostgreSQL
  • MySQL
  • MongoDB
  • Redis

CI/CD & DevOps

  • GitHub Actions
  • Jenkins
  • GitOps
  • SonarQube
  • Blue-Green
  • Canary

SRE Practices

  • SLI/SLO/SLA
  • Incident Mgmt
  • RCA
  • Runbooks
  • On-Call
  • Error Budgets

Messaging

  • Apache Kafka
  • AWS SQS/SNS
  • EventBridge
  • Dead-Letter Queues

05. What's Next?

Get In Touch

I'm always open to discussing new opportunities, interesting projects, or just connecting with fellow engineers. Whether you have a question or just want to say hi, feel free to reach out.

470-530-5179