Hi, my name is

Sowmya Sree Velpula.

I build reliable systems at scale.

Developer-turned-SRE with 5+ years of experience building and operating production-grade distributed systems across AWS, Docker, Kubernetes, and Linux infrastructure. Currently ensuring platform reliability at Solidigm (SK Hynix), combining strong backend engineering skills with deep operational expertise in observability, CI/CD automation, and incident management.


01. About Me

I started my career as a software engineer at a high-growth startup, building backend services and RESTful APIs with Node.js and PostgreSQL. Shipping features under tight deadlines gave me an ownership mindset and drew me toward reliability engineering.

Today, I combine strong backend engineering skills in Node.js, Java, and Python with deep operational expertise to ensure systems stay up, perform well, and scale gracefully. I specialize in CI/CD automation, incident management, capacity planning, and production readiness for event-driven microservices.

I hold a Master of Science (Summa Cum Laude) from the Virginia Institute of Science and Technology and a B.Tech in Computer Science from JNTUH, Hyderabad.

Infrastructure at Scale

Managing multi-tenant enterprise platforms on AWS with Docker, Kubernetes, and Terraform across production environments.

Full-Stack Observability

Building monitoring strategies with Splunk, Dynatrace, Prometheus, and Grafana covering 50+ API endpoints with SLI/SLO tracking.

AWS Certified

AWS Certified DevOps Engineer - Professional, with deep expertise across EC2, ECS, Lambda, RDS, S3, SQS, SNS, and CloudWatch.

02. Where I've Worked

Site Reliability Engineer @ Solidigm

June 2024 - Present · San Jose, California

Own end-to-end production reliability for a multi-tenant enterprise platform, managing infrastructure provisioning, CI/CD pipelines, observability, and incident response.
Architected a comprehensive monitoring strategy using Splunk, Dynatrace, CloudWatch, Prometheus, and Grafana with 20+ dashboards covering API latency (p50/p95/p99) and error budgets.
Designed zero-downtime deployment pipelines using GitHub Actions, reducing failed deployments by 70% and cutting deployment time from 25 to 8 minutes.
Automated operational toil using AWS Lambda, EventBridge, and Python/Bash scripts, eliminating 15+ hours/week of manual effort.
Led incident management for production outages, reducing recurring incidents by 55% through preventive automation and blameless post-mortems.

AWS · Docker · Kubernetes · Terraform · Splunk · Dynatrace · Node.js · Python

03. Things I've Built

Enterprise Observability Platform

Solidigm

Architected a comprehensive monitoring strategy with 20+ dashboards covering infrastructure health, API latency (p50/p95/p99), error budgets, and resource utilization. Integrated Splunk, Dynatrace, CloudWatch, Prometheus, and Grafana for full-stack visibility across a multi-tenant platform.

Splunk · Dynatrace · Prometheus · Grafana · CloudWatch · Python

Zero-Downtime CI/CD Pipeline

Solidigm

Designed deployment pipelines using GitHub Actions with parallel jobs, health checks, atomic release switching, and automated rollback. Reduced failed deployments by 70% and cut deployment time from 25 minutes to 8 minutes with blue-green and canary strategies.

GitHub Actions · Docker · AWS ECS · Terraform · Nginx

Event-Driven Order Pipeline

ValueLabs

Engineered a Kafka-based order processing pipeline handling 50K+ daily orders with 8 topic partitions and 3 consumer groups. Implemented dead-letter queues, exactly-once delivery semantics, and a Redis caching layer that cut p99 latency from 450ms to 85ms.

Apache Kafka · Redis · Node.js · PostgreSQL · AWS SQS

Serverless Notification Service

ValueLabs

Built a high-throughput notification microservice processing 100K+ daily events via AWS SQS FIFO queues with exponential backoff retry logic. Implemented fan-out dispatch using SNS with subscription filter policies across email (SES), SMS, and push channels.

AWS Lambda · SQS · SNS · SES · EventBridge · Python

Infrastructure as Code Platform

Solidigm

Engineered repeatable provisioning of EC2, RDS, S3, Lambda, EventBridge, and IAM resources using Terraform and AWS CDK. Maintained environment parity across dev/staging/prod with least-privilege IAM access controls and automated compliance checks.

Terraform · AWS CDK · CloudFormation · IAM · Python · Bash

Production Toil Automation

Solidigm

Automated 15+ hours/week of operational toil using AWS Lambda and EventBridge for nightly maintenance: usage quota recalculation, stale session cleanup, metrics aggregation, and certificate rotation across production environments.

AWS Lambda · EventBridge · Python · Bash · CloudWatch

04. Skills & Technologies

Languages

Python
Node.js
JavaScript
Java
Go
Bash

Backend & APIs

Express.js
Spring Boot
RESTful APIs
GraphQL (Apollo Server)
Prisma ORM
Sequelize

Cloud & IaC

AWS
GCP
Linux (Ubuntu, RHEL)
Terraform
AWS CDK
CloudFormation
Ansible

Containers

Docker
Kubernetes (EKS)
Helm
Nginx
ALB
Auto-Scaling

Observability

Splunk
Dynatrace
Prometheus
Grafana
Datadog
CloudWatch
ELK Stack
OpenTelemetry

CI/CD & Release

GitHub Actions
Jenkins
GitOps
CodeQL
SonarQube
Blue-Green
Canary
Rollback Strategies

Data & Messaging

PostgreSQL
MySQL
MongoDB
Redis
Apache Kafka
SQS/SNS
EventBridge
DLQs

SRE Practices

SLI/SLO/SLA
Incident Mgmt
RCA
Runbooks
On-Call
Capacity Planning
Post-Mortems
Error Budgets
PagerDuty
ServiceNow
JIRA

05. What's Next?

Get In Touch

I'm always open to discussing new opportunities, interesting projects, or just connecting with fellow engineers. Whether you have a question or just want to say hi, feel free to reach out.


571-293-2228