SantaClaraRecruiter Since 2001
the smart solution for Santa Clara jobs

Site Reliability Engineering Manager - NeMo LLM Service

Company: NVIDIA Corporation
Location: Santa Clara
Posted on: May 28, 2023

Job Description:

Locations: US, CA, Santa Clara; US, Remote
Time type: Full time
Posted: 4 Days Ago
Job requisition ID: JR1965885

NVIDIA is the leading artificial intelligence computing company and paving the way with innovations in generative AI, conversational AI, supercomputing, gaming and visualization. NVIDIA gives research institutions, cloud providers, large companies and start-ups the power and flexibility to develop and deploy breakthrough artificial intelligence systems.

As the Manager of Site Reliability, you will establish an enthusiastic and dedicated SRE team serving the forefront of the latest science and technology trends. Working together with the NeMo development team, you will build and run large-scale, fault-tolerant systems and services able to run in any cloud. Are you passionate about infrastructure and looking for complex meaningful issues? Are you ready to run the next generation of cloud services, design and code innovative solutions that address the needs of a whole organization? Then we are excited to have a motivated person like you!

What You Will Be Doing:

The NeMo Service team is responsible for building and deploying Generative AI services, including large language models and BioNeMo - our drug discovery cloud service. You will apply engineering leadership and deep knowledge of infrastructure and software development at scale to own the operation, adoption, and evolution of these services. You will lead by example, mentor the SRE and engineering teams, and establish credibility through quality technical execution, including hands-on contributions to code and automation to keep things running smoothly.

What We Need To See:

  • 5+ overall years of demonstrated ability in site reliability and technical operations leadership
  • BSCS or BSEE or equivalent experience
  • Experience building large and geographically dispersed infrastructure supporting business-critical cloud & on-premises services
  • 3+ years of people management and team leadership experience, including headcount planning and developing strong and motivated teams
  • Experience running AI/ML operations through CI/CD pipelines
  • Experience designing and implementing CI/CD back-end services.
  • Strong programming skills in Go. Python proficiency.
  • Excellent debugging and troubleshooting skills.

Ways To Stand Out From The Crowd:

    • Excellent understanding of Kubernetes and one or more public clouds
    • Ability to reason and choose the best possible algorithm to meet scaling and availability challenges.
    • Skilled at decomposing complex requirements into simple tasks and reusing available solutions to implement most of them.
    • You can design simple and reliable systems that can work without much support.
    • Strong cloud management foundation.
    • Proven record of delivering solutions using Agile process and methodologies.

The base salary range is $216,000 - $333,500. Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. You will also be eligible for equity and benefits.

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.

About Us:

NVIDIA pioneered accelerated computing to tackle challenges no one else can solve. Our work in AI and the metaverse is transforming the world's largest industries and profoundly impacting society.

Keywords: NVIDIA Corporation, Santa Clara, Site Reliability Engineering Manager - NeMo LLM Service, Engineering, Santa Clara, California
