Overview
Responsibilities
Required
– identify hardware issues on all systems and open support cases with upstream vendors
– this does not mean waiting until something fails; it requires parsing logs and monitoring data to identify issues proactively rather than reactively
– amber light walk
– primary interactions with IBM, NetApp
– see routine hardware issues through to resolution,
e.g. for a disk or controller failure, confirm that a replacement is requested, arrives, is installed, and
the system returns to optimal
– work with on-site technicians as needed
– patch and maintain base operating systems (RHEL and CentOS) as needed
– read release notes, determine any impact of upgrades
– implement the new baseline for OS/kernel/MOFED/GPFS (and the Lustre version for data
transfer nodes) across all systems, as determined by NERSC engineers
– document date of change and systems involved, any issues encountered
– maintain up-to-date ESS and GPFS software and firmware on IBM ESS
– read release notes, determine any impact
– document date of change and systems involved, any issues encountered
– rack and cable both new and existing equipment
– primarily intra-rack cabling and routine hardware swap
– larger-scale integration responsibilities shared with other groups
– participate in on-call 24/7 responsibilities
– tier 1/triage of issues reported via Nagios
– one-week rotation among 3-4 other individuals
– average
Optional
– identify areas for routine process optimization and implement solutions
– automation of common tasks, contributing to Nagios monitoring infrastructure
– develop scripts and tools and contribute them to the internal GitLab repository
– contribute to integration and implementation planning for future system upgrades
and deployments
– assist with debugging integration between GPFS and Cray Data Virtualization Service
(DVS)
– provided by Dice
To Apply: https://www.jobg8.com/Traffic.aspx?QPoIpR8FdV7aDvRSzoWm3Qf