An overhaul of the HPC network and storage environments and the creation of a high-speed data transfer node will require an HPC cluster outage in the coming weeks. The outage will begin on Monday, July 29, at 8 a.m. and is expected to end at 5 p.m. on Friday, August 10. During this outage, all URC clusters (Copperhead, Sidewinder, Taipan, Steelhead, Hammerhead, and Titan) will be unavailable.
To Plan for this Outage
Don’t submit jobs with wall time requests that extend beyond the start of this outage. If you do, the job won’t get scheduled. Why? Because the batch scheduler will be configured not to run any job that would overlap the outage dates. Lowering your wall time request to better reflect the actual run time your job requires will reduce the impact of this restriction.
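As a rough illustration of the wall time rule above, the following Python sketch checks whether a requested wall time would end before the outage begins. It is not part of the scheduler configuration. The function names are hypothetical, the year 2019 is an assumption (the notice gives only the month and day), and the check assumes the job starts immediately, so a job that waits in the queue has less time available than shown.

```python
from datetime import datetime, timedelta

# Outage start from the notice: Monday, July 29, 8 a.m.
# The year is an assumption; the notice does not state it.
OUTAGE_START = datetime(2019, 7, 29, 8, 0)


def max_walltime(now, outage_start=OUTAGE_START):
    """Longest wall time that still ends before the outage, or None if it has begun."""
    remaining = outage_start - now
    return remaining if remaining > timedelta(0) else None


def fits_before_outage(now, requested, outage_start=OUTAGE_START):
    """True if a job starting at `now` with wall time `requested` ends by the outage start."""
    return now + requested <= outage_start


def format_walltime(delta):
    """Render a timedelta as HH:MM:SS, a common wall time request format."""
    total = int(delta.total_seconds())
    hours, rem = divmod(total, 3600)
    minutes, seconds = divmod(rem, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


if __name__ == "__main__":
    now = datetime.now()
    limit = max_walltime(now)
    if limit is None:
        print("The outage window has already started.")
    else:
        print("Largest wall time that ends before the outage:", format_walltime(limit))
```

Running the script prints the largest wall time you could request right now under these assumptions. A request at or below that value, matched to what the job actually needs, is the most likely to be scheduled.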
Plan to resubmit any previously submitted jobs that won’t finish before the outage. Why? Because the outage includes an upgrade of the job scheduling software, jobs remaining in the queue cannot be preserved. You may receive an error message regarding these jobs. After August 10, you must resubmit any jobs that were scheduled to finish within the outage window.
About the Upgrades
The current HPC network environment is made up of two components: the network equipment (Cisco) and the InfiniBand (IB) equipment. Both are at least five years old, and the Cisco equipment will reach end-of-support in December 2019. The goal is to replace both components this year, dramatically increasing both the speed and performance of the cluster. The network speed will go from 1Gbps/10Gbps to 40Gbps, while the IB speed will go from 40Gbps to 100Gbps. This speed increase should reduce job completion times, because storage can be accessed more quickly.
The current HPC storage environment is made up of two storage systems: NFS and Lustre. Lustre is the primary storage system for the cluster, and NFS is the central storage system for project space, user home space, and work involving many small files. This upgrade will increase the speed and size of the Lustre environment, as well as the speed of the NFS environment. Lustre capacity will increase from 1PB to 2PB of total storage, while NFS speed will increase from 10Gbps to 40Gbps.
Data Transfer DMZ
The current HPC environment does not allow for incoming high-speed data transfer, which presents problems when researchers have large datasets that need to move into the cluster. This upgrade includes the creation of a high-speed data transfer demilitarized zone (DMZ) for the research cluster. The DMZ will provide a dedicated 10Gb data transfer channel, which can be increased as needed, for researchers to bring data into the cluster.
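To give a feel for what these link speeds mean in practice, the short Python sketch below estimates the ideal time to move a dataset at the rates mentioned in this notice. It is back-of-the-envelope arithmetic only. It assumes decimal units (1 TB = 10^12 bytes) and a fully saturated link with no protocol, disk, or contention overhead, so real transfers will take longer. The function name and the 1 TB example size are hypothetical.

```python
def transfer_hours(size_tb, link_gbps):
    """Ideal time in hours to move `size_tb` terabytes over a `link_gbps` link.

    Assumes 1 TB = 1e12 bytes and that the link runs at full line rate.
    """
    bits = size_tb * 1e12 * 8            # dataset size in bits
    seconds = bits / (link_gbps * 1e9)   # line rate in bits per second
    return seconds / 3600


# Link speeds taken from the notice: 1, 10, 40, and 100 Gbps.
for gbps in (1, 10, 40, 100):
    print(f"{gbps:>3} Gbps: {transfer_hours(1, gbps):.2f} hours per TB")
```

Under these assumptions, 1 TB takes a little over two hours at 1Gbps and roughly 13 minutes over a 10Gb channel.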
It is possible that some of the cluster services may be available before August 10. We will send updates to all URC users as online capabilities return.
If you have any questions, contact URC Support at email@example.com.