Job Description
Job Title: Research Assistant (System Admin), Yang Zhang Lab
Posting Start Date: 20/03/2025

Prof. Yang Zhang's Lab at the School of Computing, National University of Singapore (NUS), is seeking a skilled and adaptable System Administrator with expertise in Red Hat Linux and UNIX environments. The primary responsibility of this role is to oversee the construction and management of a High-Performance Computing (HPC) cluster comprising 4,500 CPU/GPU cores, BeeGFS storage, and InfiniBand interconnects. The successful candidate will play a critical role in keeping the system stable, secure, and performing well by managing and troubleshooting the HPC infrastructure.

Role & Responsibilities
•    Oversee the management, monitoring, and maintenance of the HPC cluster and online server systems operating on Red Hat Linux and UNIX platforms.
•    Perform routine system administration tasks, including user management, access control, file system maintenance, and system backups.
•    Continuously monitor system performance and resource utilization, proactively identifying and resolving bottlenecks to maintain optimal responsiveness.
•    Troubleshoot and resolve hardware, software, and network issues, collaborating with cross-functional teams when necessary.
•    Implement and enforce security protocols to protect systems from unauthorized access, vulnerabilities, and cyber threats.
•    Plan and execute system patches, updates, and upgrades, ensuring a secure and up-to-date computing environment.
•    Investigate and respond to system alerts and incidents, performing root cause analysis and implementing preventive measures.
•    Maintain comprehensive documentation of system configurations, procedures, and troubleshooting steps for internal reference and knowledge sharing.
•    Provide technical support to end-users, assisting with system-related inquiries, issue resolution, and training as needed.
•    Participate in capacity planning and scalability assessments to ensure system resources align with both current and future requirements.
•    Work closely with vendors and third-party service providers to manage hardware and software procurement, maintenance, and support contracts.
•    Take on additional responsibilities related to procuring, updating, and maintaining the HPC cluster and network infrastructure.

Apply:

Interested candidates should submit their CV along with a brief description of their experience and interest to Prof. Yang Zhang via email at zhang@nus.edu.sg.

Job Requirements

•    A Bachelor’s degree or higher in Computer Science, Information Technology, or a related field. Relevant certifications such as Red Hat Certified Engineer (RHCE) are a plus.
•    Familiarity with artificial intelligence (AI) and machine learning is not mandatory but would be advantageous.
•    Expertise in CPU/GPU architectures, BeeGFS storage systems, and InfiniBand interconnects is highly desirable.
•    Proven experience as a System Administrator, with a strong focus on Red Hat Linux and UNIX environments.
•    Proficiency in shell scripting and automation to optimize system administration workflows.
•    In-depth understanding of networking concepts, protocols, and troubleshooting techniques in a mixed-platform environment.
•    Experience with Linux job schedulers such as SLURM, PBS, or similar is strongly preferred, as it is critical to the role.
•    Familiarity with system monitoring and management tools, including Nagios, Zabbix, and Ansible.
•    Strong analytical and problem-solving skills, with the ability to diagnose and resolve complex technical issues efficiently.
•    Excellent communication skills, both written and verbal, to collaborate effectively with technical and non-technical stakeholders.
•    Meticulous attention to detail, with a commitment to maintaining comprehensive documentation and accurate records.
•    Adaptability and teamwork skills, with the ability to work efficiently in a dynamic, evolving technological landscape.

More Information

Location: Kent Ridge Campus

Organization: School of Computing

Department: Department of Computer Science

Employee Referral Eligible: No

Job requisition ID: 28282