KEY JOB RESPONSIBILITIES
- Provide support for cluster and node management, ensuring smooth operation of LLM infrastructure.
- Continuously improve and automate cluster upgrades, capacity management, and maintenance.
- Develop automation tools to enhance operational excellence.
- Work on operations- and maintenance-driven coding projects, primarily using Ruby, Rails, Java, Python, shell scripts, and web technologies.
- Apply hands-on experience with Kubernetes and expertise in various AWS services.
- Drive company-wide campaigns with support and engineering teams and see them through to completion.
- Participate in design and code reviews, identify bottlenecks, troubleshoot, and thoroughly research root causes to resolve defects.
BASIC QUALIFICATIONS
- 3+ years of experience in systems administration, including networking, storage systems, operating systems, and hands-on systems engineering.
- Proficiency in at least one modern programming language such as Python, Ruby, Golang, Java, C++, C#, or Rust.
- Experience with Linux/Unix systems.
- Experience with CI/CD pipelines and build processes.
PREFERRED QUALIFICATIONS
- Experience working with distributed systems at scale.
Our inclusive culture empowers Amazon employees to deliver the best results for our customers. If you have a disability and need workplace accommodations during the application, interview, or onboarding process, please visit this link for more information. If your country or region isn't listed, please contact your recruiting partner.
Amazon is an equal opportunity employer and does not discriminate based on veteran status, disability, or other legally protected categories.