HPC-AIML Ops Engineer

Date: 18-Jan-2023

Location: UNIV ADMIN, Kent Ridge Campus, SG

Company: National University of Singapore

About NUS IT

NUS Information Technology is the cornerstone of reliable, high-performance and secure IT solutions and effective IT governance for the campus. Here at NUS IT, we aim to transform NUS into a borderless computing community with knowledge at its fingertips by enhancing the use of effective applications and services for teaching and learning.

We drive a culture that is forward-looking. With a strong passion for IT, our people are always striving to improve, push boundaries and innovate with a "can-do" attitude. We embrace collaboration, open communication and knowledge sharing. If you see yourself thriving in a dynamic environment and breaking new ground with innovative ideas, you will find yourself at home in NUS IT.

As part of our team, you can look forward to an empowered work environment that allows you to take charge of your own career path. We provide competitive remuneration as well as flexible work arrangements to enable your growth and development. We pride ourselves on our diverse workforce and are committed to transforming NUS into a leading global University shaping the future.

https://nusit.nus.edu.sg/ 

Job Description

The Research Computing Group provides advanced computing infrastructure for compute-intensive research at NUS, including high-performance computing systems, HPC-AI computing systems, project data storage, high-speed interconnect, commercial scientific software applications, and other scientific support services.


As the host of many AI/ML-enabling technologies and relevant expertise, the group has taken an active role in helping research stakeholders implement AI/ML techniques to accelerate their research discovery. This service is also extended to university administration stakeholders, allowing them to gain actionable insights and derive new business value from their data collections.


We are looking for an HPC and AI/ML Operations Engineer to take on a technical role administering the HPC and AI/ML infrastructure. This includes establishing processes, developing task automation, maintaining SLAs, and liaising with vendors and other NUS IT teams to ensure operational continuity. In addition, the candidate will support the initiative to operationalize AI/ML projects under a common infrastructure platform.


At the Research Computing Group, team members have some freedom to pursue independent research and work interests. The group has access to many advanced computing resources, both on-premises and in the cloud, that can be used for experimentation. Working with researchers, the group is also actively involved in cutting-edge fields of research and emerging technologies such as quantum computing, digital twins, knowledge graphs, and others.

Duties & Responsibilities

  • Provide operational leadership on HPC and AI/ML Infrastructure.
  • Establish, maintain, and monitor management processes on the HPC and AI/ML Infrastructure.
  • Liaise with and manage vendors and other teams to meet internal operational SLAs and ensure operational continuity.
  • Support new research computing infrastructure development and operational matters.
  • Stay informed about, and test, recommended practices in HPC, AI/ML, and general IT operations frameworks.
  • Provide technical support and user training for REC’s scientific computing resources and services.
  • Contribute to community engagement activities such as technical writing and organizing meetups, conferences, and events.

Qualifications

  • Degree in a field with a quantitative focus (computer science, data science, statistics, mathematics, physics, engineering, or others).
  • At least 3 years of management experience and 5 years of relevant HPC and AI/ML infrastructure experience.
  • Familiarity with Linux system administration and high-performance computing (HPC) environments.
  • Familiarity with AI/ML infrastructure administration, particularly GPU and container management.
  • Demonstrated experience providing technical leadership in HPC and AI/ML infrastructure environments.
  • Excellent written and oral communication skills.
  • Demonstrated strong organizational skills and the ability to handle multiple projects at the same time.

Covid-19 Message

At NUS, the health and safety of our staff and students is one of our utmost priorities, and COVID-19 vaccination supports our commitment to ensure the safety of our community and to make NUS as safe and welcoming as possible. Many of our roles require a significant amount of physical interaction with students, staff and members of the public. Even for job roles that may be performed remotely, there will be instances where on-campus presence is required.

In accordance with Singapore's legal requirements, unvaccinated workers will not be able to work on the NUS premises with effect from 15 January 2022. As such, job applicants will need to be fully COVID-19 vaccinated to secure successful employment with NUS.

More Information

Location: Kent Ridge Campus

Organization: NUS Information Technology

Department: Research Computing

Employee Referral Eligible: Yes

Job requisition ID: 15281