About NUS IT
NUS Information Technology is the cornerstone of reliable, high-performance and secure IT solutions and effective IT governance for the campus. Here at NUS IT, we aim to transform NUS into a borderless computing community providing knowledge at its fingertips by enhancing the use of effective applications and services for teaching and learning.
We drive a culture that is forward-looking. With a strong passion for IT, our people are always striving to improve, push boundaries and innovate with a "can-do" attitude. We embrace collaboration, open communication and knowledge sharing. If you see yourself thriving in a dynamic environment and breaking new ground with innovative ideas, you will find yourself at home in NUS IT.
As part of our team, you can look forward to an empowered work environment that allows you to take charge of your own career path. We provide competitive remuneration as well as flexible work arrangements to enable your growth and development. We pride ourselves on our diverse workforce and are committed to transforming NUS into a leading global University shaping the future.
Job Description
The role is for an experienced HPC (High-Performance Computing) Architect or Engineer to take the lead in designing, deploying, enhancing, and managing our HPC infrastructure. This infrastructure includes GPU clusters (B200/H200/A40), a liquid-cooled CPU cluster, and a cloud-based HPC system, with future expansions planned to support cutting-edge NUS research.
Duties and Responsibilities
- Lead the administration and operation of our HPC infrastructure (on-premises and/or cloud), including hardware, software, and networking components.
- Lead the development of HPC infrastructure together with the lead architect and the management team.
- Perform project management activities and vendor management on strategic HPC infrastructure projects.
- Develop, implement, and document standard operating procedures and best practices for HPC operations, including system monitoring, performance tuning, and security.
- Ensure that operational processes and practices adhere to prevailing institutional policies and governance principles.
- Ensure that the HPC infrastructure's availability and performance meet the predefined SLAs and/or standards. Take corrective action if the metrics are not met.
- Conduct regular operational reviews with the managed service vendors.
- Work closely with cross-functional teams to ensure optimal performance and utilization of HPC resources.
- Provide technical guidance and mentoring to other engineers on the HPC team.
- Serve as a point of contact for system-related issues outside working hours.
- Identify and implement new technologies and processes to improve HPC efficiency and scalability.
- Support and engage with the research community (e.g., principal investigators, research institutes, and strategic stakeholders).
- Contribute to community engagement activities such as technical writing and organizing meetups, conferences, and events.
Qualifications
- Bachelor's or Master's degree in Computer Science, Engineering, or a related field.
- 5+ years of experience in HPC operations and administration, preferably with 3+ years of experience managing an operational team and/or HPC projects.
- Strong knowledge of HPC technologies, including high-performance computing hardware, parallel file systems, job schedulers, and cluster management software.
- Proficient in scripting and programming languages such as Python, Bash, and Perl.
- Experience with Linux operating systems and system administration.
- Experience in administering an HPC workload scheduler, e.g. PBS Pro, Slurm, or SGE.
- Experience in administering a parallel file system, e.g. GPFS or Lustre.
- Familiarity with data centre technology, specifically direct-to-chip liquid cooling, is a plus.
- Excellent analytical and problem-solving skills.
- Strong written and verbal communication skills.
- Ability to work in a team environment and lead other engineers.