Job Description
Prof. Yang Zhang's Lab at the School of Computing, National University of Singapore (NUS) is seeking a skilled and adaptable System Administrator with expertise in Red Hat Linux and UNIX environments. The primary responsibility of this role is to oversee the construction and management of a High-Performance Computing (HPC) cluster comprising 4,500 CPU/GPU cores, BeeGFS storage, and InfiniBand interconnects. The ideal candidate will play a critical role in ensuring the stability, security, and optimal performance of the system through proficient management and troubleshooting of the HPC infrastructure.
Role & Responsibilities
• Oversee the management, monitoring, and maintenance of the HPC cluster and online server systems operating on Red Hat Linux and UNIX platforms.
• Perform routine system administration tasks, including user management, access control, file system maintenance, and system backups.
• Continuously monitor system performance and resource utilization, proactively identifying and resolving bottlenecks to maintain optimal responsiveness.
• Troubleshoot and resolve hardware, software, and network issues, collaborating with cross-functional teams when necessary.
• Implement and enforce security protocols to protect systems from unauthorized access, vulnerabilities, and cyber threats.
• Plan and execute system patches, updates, and upgrades, ensuring a secure and up-to-date computing environment.
• Investigate and respond to system alerts and incidents, performing root cause analysis and implementing preventive measures.
• Maintain comprehensive documentation of system configurations, procedures, and troubleshooting steps for internal reference and knowledge sharing.
• Provide technical support to end-users, assisting with system-related inquiries, issue resolution, and training as needed.
• Participate in capacity planning and scalability assessments to ensure system resources align with both current and future requirements.
• Work closely with vendors and third-party service providers to manage hardware and software procurement, maintenance, and support contracts.
• Undertake any additional responsibilities related to the procurement, updating, and maintenance of the HPC cluster and network infrastructure.
How to Apply
Interested candidates should submit their CV along with a brief description of their experience and interest to Prof. Yang Zhang via email at zhang@nus.edu.sg.
Job Requirements
• A Bachelor’s degree or higher in Computer Science, Information Technology, or a related field. Relevant certifications such as Red Hat Certified Engineer (RHCE) are a plus.
• Familiarity with artificial intelligence (AI) and machine learning is not mandatory but would be advantageous.
• Expertise in CPU/GPU architectures, BeeGFS storage systems, and InfiniBand interconnects is highly desirable.
• Proven experience as a System Administrator, with a strong focus on Red Hat Linux and UNIX environments.
• Proficiency in shell scripting and automation to optimize system administration workflows.
• In-depth understanding of networking concepts, protocols, and troubleshooting techniques in a mixed-platform environment.
• Experience with Linux job schedulers such as SLURM, PBS, or similar is strongly preferred.
• Familiarity with system monitoring and management tools, including Nagios, Zabbix, and Ansible.
• Strong analytical and problem-solving skills, with the ability to diagnose and resolve complex technical issues efficiently.
• Excellent communication skills, both written and verbal, to collaborate effectively with technical and non-technical stakeholders.
• Meticulous attention to detail, with a commitment to maintaining comprehensive documentation and accurate records.
• Adaptability and teamwork skills, with the ability to work efficiently in a dynamic, evolving technological landscape.
More Information
Location: Kent Ridge Campus
Organization: School of Computing
Department: Department of Computer Science
Employee Referral Eligible: No
Job requisition ID: 28282