Lucas Saboya

Staff | Lead Site Reliability Engineer

Experienced Site Reliability Engineer and DevOps specialist with a 15-year IT career that includes success in enhancing DevOps practices to streamline deployment processes, fortify system reliability, and bolster security. Passionate champion for fostering DevOps cultures, reducing failure rates for software releases, and creating seamless experiences for development teams. Effective bilingual communicator with experience in remote roles.


Work Experience

Co-founder

TrustD Solutions | Nov. 2019 - Present

TrustD Solutions is an SRE and DevOps consultancy founded to bridge the gap between Brazilian workforces and US demand, providing expertise and best practices for public cloud operations, cost optimization, and system reliability.

  • Refactored SRE strategy, introduced tooling, and staffed a DevOps team for a health-tech startup.
  • Designed and implemented the company’s full public cloud presence to include runtime infrastructure and CI/CD pipelines for a fintech startup.
  • Provided DevOps and SRE consulting services, including the design and implementation of underlying infrastructure, for a software development consultancy firm.

Staff Site Reliability Engineer

Aetion | Nov. 2019 - Present

Aetion is a healthcare analytics company that delivers real-world evidence for the manufacturers, purchasers, and regulators of medical treatments and technologies. Recruited by a former coworker as a Senior SRE focused on the HITRUST certification effort; earned promotion to Staff SRE with responsibility for project leadership and mentoring team members.

  • Contributed to successful HITRUST certification efforts with oversight for AWS cloud infrastructure hardening.
  • Migrated workloads from AWS ECS to EKS to reduce spending on runtime costs and enabled deployments of Aetion’s platform to client public cloud accounts.
  • Spearheaded Terraform Cloud adoption for AWS infrastructure.
  • Maintained a DRY codebase to enable SREs to quickly and reliably provision new environments.
  • Accelerated CI/CD pipeline execution and reduced feedback loops for developers.
  • Authored Ansible playbooks, Python scripts, and Golang applications to automate manual, repetitive, and error-prone tasks (e.g. user access provisioning, client infrastructure provisioning, security patching, secret rotation).

Site Reliability Engineer

Gainbridge.life | Aug. 2019 - Nov. 2019

Gainbridge.life is an insurtech company focused on improving the process of buying and managing annuities and insurance policies. Engaged to lead an SRE team through digital transformation efforts.

  • Guided efforts to refactor CI/CD pipelines to enable the development team’s migration from microservices to a monolith while maintaining system availability.
  • Leveraged AWS-managed services including EKS, MSK, RDS, and SSM.
  • Executed a runtime platform migration from kOps-managed Kubernetes clusters to EKS.
  • Oversaw implementation and adoption of Terraform Cloud and Terraform IaC.
  • Introduced observability and telemetry tooling to improve understanding of runtimes in both monolith and microservices environments.
  • Right-sized Kubernetes clusters, node pools, and resource requirements to avoid idle compute resources within clusters while leveraging AWS Spot Instances. These efforts led to a ~50% reduction in infrastructure costs.

Site Reliability Engineer

Forestry.io | Jan. 2019 - Aug. 2019

Forestry.io builds a content management system for static site generators. Recruited to design, build, and maintain a new infrastructure architecture to support a planned feature that required running several jobs in parallel and rendering previews of static site changes on the fly.

  • Developed operating infrastructure on multiple cloud providers to expedite development workflows via automation.
  • Leveraged Docker containers to package apps and migrated workloads across cloud providers to optimize performance, improve availability, and reduce cloud costs.
  • Leveraged DigitalOcean’s Managed Kubernetes cluster and GKE clusters. Terraformed Kubernetes clusters on GCP.
  • Packaged workloads as Docker images.
  • Built CI/CD pipelines to deploy workloads.

Site Reliability Engineer

Kairos AR Inc. | Jan. 2018 - Jan. 2019

Kairos AR builds facial recognition and emotion analysis software powered by AI and ML. Hired as the company’s first SRE to build and maintain all underlying runtime infrastructure and CI/CD pipelines.

  • Guided a successful migration from AWS to GCP and deployed replicas of the Kairos API and its underlying dependencies to AliCloud.
  • Built, deployed, and maintained an MLOps pipeline to support Kairos’ latest API improvements and features.
  • Created Grafana dashboards for monitoring and alerting on Aerospike deployments; the dashboards remain the most downloaded today.
  • Deployed TensorFlow-based workloads in Kubernetes with GPUs as schedulable units.
  • Terraformed Kubernetes clusters on GCP and AWS.
  • Built CI/CD pipelines on Jenkins and Travis CI to deploy workloads.

Site Reliability Engineer

Anki Inc. | Aug. 2016 - Jan. 2018

Anki was a robotics and AI company best known for consumer-friendly products (Anki Drive & Overdrive, Cozmo robot). Joined to refactor and improve Anki’s main website and e-commerce platform’s CI/CD pipeline, and transitioned to the platform team focused on API infrastructure.

  • Built and maintained CI/CD pipeline operations using Docker, Ansible, Chef, OpsWorks, Jenkins, and TeamCity.
  • Delivered playbooks for provisioning infrastructure for data science and robotics teams.
  • Built and maintained Ansible roles and modules to drive Anki cloud infrastructure on AWS.
  • Refactored CI/CD pipelines used to deploy Anki’s storefront and website.

Site Reliability Engineer

HE:labs | Jan. 2016 - Aug. 2016

HE:labs is a technology consulting firm focused on digital transformation and software development. Hired into a remote role on an Agile team, with work focused on automation to build, package, and deploy Rails apps.

  • Partnered with multiple clients in building cloud provider automation to enable faster, highly available, and cheaper solutions to roll out workloads.
  • Built Ansible playbooks & automation tools in Python and Ruby.

Support Engineer

iFactory Solutions | Jan. 2015 - Dec. 2015

iFactory is a digital agency specializing in web design, development, and digital marketing. Joined a team responsible for supporting OpenText CEM deployments and AWS for US-based clients.

  • Co-authored support, deployment, and maintenance best practices.

Computer Networks Analyst

Instituto Federal do Ceará (IFCE) | Jan. 2008 - Jan. 2015

Served as a member of the Computer Security Incident Response Team (CSIRT), mitigating IFCE’s network security issues, managing the Rectory firewalls, and providing level II networking and security support to technical staff at other campuses.

  • IPv6 network deployment
  • VMware vSphere deployment