Overview

Responsibilities

Required

– identify hardware issues on all systems and open support cases with upstream vendors

– this does not mean waiting until something fails; it requires parsing logs and monitoring information to identify issues proactively rather than reactively

– amber light walk

– primary interactions with IBM, NetApp

– see routine hardware issues through to resolution, e.g. for a disk or controller failure, confirm that a replacement is requested, arrives, and is installed, and that the system returns to optimal

– work with on-site technicians as needed

– patch and maintain base operating systems for RHEL and CentOS as needed

– read release notes, determine any impact of upgrades

– implement new baseline for OS/kernel/MOFED/GPFS (and Lustre version for data transfer nodes) as determined by NERSC engineers across all systems

– document date of change and systems involved, any issues encountered

– maintain up-to-date ESS and GPFS software and firmware on IBM ESS

– read release notes, determine any impact

– document date of change and systems involved, any issues encountered

– rack and cable both new and existing equipment

– primarily intra-rack cabling and routine hardware swap

– larger-scale integration responsibilities shared with other groups

– participate in on-call 24/7 responsibilities

– tier 1/triage of issues reported via Nagios

– one-week rotation among 3-4 other individuals

– average

Optional

– identify areas for routine process optimization and implement solutions

– automation of common tasks, contributing to Nagios monitoring infrastructure

– develop scripts and tools and contribute them to the internal GitLab repository

– contribute to integration and implementation planning for future system upgrades and deployments

– assist with debugging integration between GPFS and Cray Data Virtualization Service (DVS)

To Apply: https://www.jobg8.com/Traffic.aspx?QPoIpR8FdV7aDvRSzoWm3Qf