Job Description
Your success is a train ride away!
As we move America’s workforce toward the future, Amtrak connects businesses and communities across the country. We employ more than 20,000 diverse, energetic professionals in a variety of career fields throughout the United States. The safety of our passengers, our employees, the public and our operating environment is our priority, and the success of our railroad is due to our employees.
Are you ready to join our team?
Our values of “Do the Right Thing, Excel Together and Put Customers First” are at the heart of what matters most to us, and our Core Capabilities, “Building Trust, Accountability, Effective Communication, Customer Focus, and Proactive Safety & Security” are what every employee needs to know and do to be most impactful at Amtrak. By living the Amtrak values, focusing on our capabilities, and actively embracing and fostering diverse ideas, backgrounds, and perspectives, together we will honor our past and make Amtrak a company of the future.
JOB SUMMARY:
At Amtrak, the Principal DevOps Engineer is a technical leader responsible for ensuring the resilience, scalability, and security of our digital platforms. This role combines software engineering, systems engineering, and a deep operational mindset to improve reliability through automation, observability, and proactive incident response. The successful candidate will drive architectural decisions around SLOs, error budgets, infrastructure as code, and deployment strategies while mentoring engineers and standardizing practices across teams. They will collaborate cross-functionally to implement scalable solutions that align with our goals for service health, security, and development velocity.
ESSENTIAL FUNCTIONS:
1) CI/CD & Release
• Architect progressive delivery (canary/blue-green/feature flags) for DevSecOps CI/CD pipelines.
• Automate rollback/fail-forward and release evidence capture.
• Standardize quality gates (tests, perf/chaos pre-prod).
2) Platform (IaC, Cloud, Containers)
• Publish hardened base images and golden IaC modules with guardrails.
• Enforce Kubernetes RBAC, network policies, quotas, and secrets standards.
• Design multi-env promotion workflows with policy checks.
3) Observability, SLOs & Incidents
• Establish SLOs/error budgets; drive cross-team reliability improvements.
• Bake runbooks into alerts; add synthetic/load tests to pipelines.
• Lead major incidents; land systemic fixes (not just patches).
4) Security & Compliance
• Enforce short-lived creds, zero-trust patterns, and attestation/signing.
• Automate compliance checks and evidence collection.
• Partner with security on threat-modeling for platform changes.
5) Automation & Tooling
• Create internal libraries/CLIs with telemetry and docs.
• Measure automation ROI (time saved, error-rate drop).
• Orchestrate complex workflows (e.g., Step Functions/Argo Workflows).
6) Platform DX, Docs & Collaboration
• Own a platform capability end-to-end (roadmap, SLAs, upgrades).
• Drive adoption of best practices across multiple teams.
• Write ADRs and decision logs that clarify trade-offs.
7) Networking, Data Resilience & FinOps
• Define/validate RPO/RTO; automate restore drills and reports.
• Tune critical paths for latency/throughput and cost.
• Forecast impacts of migrations; deliver measurable cost/perf wins.
MINIMUM QUALIFICATIONS:
• Bachelor’s degree in Computer Science, Engineering, or related technical discipline.
• At least 5 years of experience in DevOps, SRE, or Platform Engineering roles with leadership experience in automation and infrastructure reliability.
• 3+ years hands-on experience in high-availability production environments with cloud-native security and observability tooling.
PREFERRED QUALIFICATIONS:
• Master’s degree in Computer Science or equivalent.
• Certifications: AWS DevOps Engineer Pro, Terraform Associate, CKA, or SRE-focused credentials.
• Experience with developer portals (e.g., Backstage), service mesh (e.g., Istio), and security tooling (e.g., Vault, Falco, Trivy).
• Knowledge of DORA metrics, reliability KPIs, and engineering effectiveness measurement frameworks.
• Background in regulated environments (e.g., PCI, HIPAA, FedRAMP) with experience implementing security automation at scale.
KNOWLEDGE, SKILLS and ABILITIES:
Deep expertise in AWS (or equivalent cloud platform), especially in compute, networking, IAM, and monitoring.
Proficiency in Terraform, AWS CDK, CloudFormation, Docker, and Linux systems.
Experience with “pipelines as code” and setting up CI/CD with GitHub Actions, AWS CodeBuild/CodePipeline, and Jenkins.
Experience implementing and managing CI/CD systems with security gates and rollback logic.
Strong scripting skills in Python, Go, or Bash for automation and tooling.
In-depth understanding of SRE practices including incident response, SLO/SLA management, chaos engineering, and capacity modeling.
F...