System Outages

ICDS engineers have updateed and expanded the outage protocol to improve recovery time and expand testing.

The outage workflow has been updated to make use of serviceNOW and provide tracking.

  • changes are throughly documented
  • includes a review by leadership to understand potential impacts
  • system test plan that includes client-submitted use cases

Outage WorkFlow Diagram

Post Outage Test

Post outage ICDS Engineers go through a series of test to show basic connectivity and functionaloty of services (i.e. job submission, SLURM, OOD portal, Globus access, science gateways). This will be followed by application test detailed below.

Includes sample test for the following applications:

  • C
  • alltest (MPI network latency mapping test)
  • bash / SLURM submission testing
  • comsol
  • cpp
  • fluent
  • fortran
  • gpu nbody
  • gromacs
  • java
  • julia
  • mathematica
  • matlab
  • py_array
  • python
  • r
  • starccm

Includes user test for the following applications:

  • Ansys Fluent job
  • MPI fluid solver
  • Gaussian
  • OpenFOAM
  • COMSOL
  • MATLAB sine_wave

Your input is valuable. At the conclusion of every outage, ICDS engineers run extensive use case tests to ensure that the system will work as expected. If your team runs your own post outage tests or if you have ideas for tests you’d like ICDS engineers to run, please let us know.

Planned Outage 2025-05-14

Outage Duration

  • Planned May 14, 2025 17:00 -- May 15, 2025 17:00
  • Actual May 14, 2025 17:00 -- May 15, 2025 17:03

Plan of Action

  • STORAGE: troubleshoot power redundancy configuration on RC group storage Complete

  • STORAGE: continue to troubleshoot RDMA timeout issues on RC group storage vendor logs collected, complete for now, nodes using TCP

  • STORAGE: Globus software update from 5.4.80 to 5.4.85. Complete

  • NETWORK: resolve hardware error on Interconnect Switch Complete

  • SCHEDULER: Slurm Update from 24.05.4 to 24.05.8 Complete

  • Operating System Image and Package updates Complete

  • Workflow: update symlink at /storage/icds/tools/sw/firefox to point to updated firefox. Complete

  • Cluster Admin Node Updates Complete

  • Re-sync the software stack between RC and RR Complete

  • License Updates: MATLAB, COMSOL, Mathematica, tecplot Complete

Known issues:

  • Schrodinger license manager fails to load Investigating

  • Mathematica license shows as expiring May 30 Investigating

ServiceNow Form