HPC-AI Lead Engineer
Date: 14 May 2023
Location: UNIV ADMIN, Kent Ridge Campus, SG
Company: National University of Singapore
About NUS IT
NUS Information Technology is the cornerstone of reliable, high-performance and secure IT solutions and effective IT governance for the campus. Here at NUS IT, we aim to transform NUS into a borderless computing community with knowledge at its fingertips by enhancing the use of effective applications and services for teaching and learning.
We drive a culture that is forward-looking. With a strong passion for IT, our people are always striving to improve, push boundaries and innovate with a "can-do" attitude. We embrace collaboration, open communication and knowledge sharing. If you see yourself thriving in a dynamic environment and breaking new ground with innovative ideas, you will find yourself at home in NUS IT.
As part of our team, you can look forward to an empowered work environment that allows you to take charge of your own career path. We provide competitive remuneration as well as flexible work arrangements to enable your growth and development. We pride ourselves on our diverse workforce and are committed to transforming NUS into a leading global University shaping the future.
https://nusit.nus.edu.sg/
Job Description
The Research Computing Group provides advanced computing infrastructure and services for compute-intensive research at NUS, including high-performance computing systems, GPU clusters, research data storage, high-speed interconnects, cloud access, commercial scientific software applications, and other research support services.
We are seeking an experienced HPC (High-Performance Computing) Engineer to lead our team responsible for the design, deployment, and operation of our HPC infrastructure. The successful candidate will work closely with cross-functional teams to ensure optimal performance and utilization of NUS HPC resources to support business and research needs.
At the Research Computing Group, team members have some freedom to pursue independent research and work interests. The group has access to many advanced computing resources, both on-premises and in the cloud, that can be used to run experiments. Working with researchers, the group is also actively involved in cutting-edge fields of research and emerging technologies such as green HPC, quantum computing, digital twins, knowledge graphs, and others.
Duties & Responsibilities
- Lead the administration and operation of our HPC infrastructure (both on-premise and cloud), including hardware, software, and networking components.
- Assist with the development plan for the HPC infrastructure, together with the lead architect and management team.
- Develop, implement, and document standard operating procedures and best practices for HPC operations, including system monitoring, performance tuning, and security.
- Ensure that operational processes and practices adhere to prevailing institutional policies and governance principles.
- Ensure that the availability and performance of the HPC infrastructure meet the predefined SLAs and/or standards, and take corrective action when the metrics are not met.
- Conduct regular operational reviews with the managed service vendors.
- Work closely with cross-functional teams to ensure optimal performance and utilization of HPC resources.
- Provide technical guidance and mentoring to other engineers on the HPC team.
- Serve as a point of contact for system-related issues outside of working hours.
- Identify and implement new technologies and processes to improve HPC efficiency and scalability.
- Contribute to community engagement activities such as technical writing and organizing meetups, conferences, and events.
Qualifications
- Bachelor's or Master's degree in Computer Science, Engineering, or a related field.
- 5+ years of experience in HPC operations and administration, preferably with 3+ years of experience managing an operational team.
- Strong knowledge of HPC technologies, including high-performance computing hardware, parallel file systems, job schedulers, and cluster management software.
- Proficient in scripting and programming languages such as Python, Bash, and Perl.
- Experience with Linux operating systems and system administration.
- Excellent analytical and problem-solving skills.
- Strong written and verbal communication skills.
- Ability to work in a team environment and lead other engineers.
Covid-19 Message
At NUS, the health and safety of our staff and students is one of our utmost priorities, and COVID-19 vaccination supports our commitment to ensure the safety of our community and to make NUS as safe and welcoming as possible. Many of our roles require a significant amount of physical interaction with students, staff and members of the public. Even for job roles that may be performed remotely, there will be instances where on-campus presence is required.
In accordance with Singapore's legal requirements, unvaccinated workers will not be able to work on the NUS premises with effect from 15 January 2022. As such, job applicants will need to be fully COVID-19 vaccinated to secure successful employment with NUS.
More Information
Location: Kent Ridge Campus
Organization: NUS Information Technology
Department: Research Computing
Employee Referral Eligible: Yes
Job requisition ID: 15281