taki Back Online After Unscheduled Downtime

To all taki users,

At 12:30 PM, portions of the taki cluster environment became unavailable.

The management system attempted to correct the issue by selectively killing administrative processes until, at 12:56 PM, the entire system became unresponsive.

After resetting the management system and ensuring various processes restarted appropriately, we were able to bring the system back online at around 2 PM.

While no compute hardware was rebooted, this unscheduled downtime likely resulted in some loss of state on running jobs. Please inspect the output of any jobs that completed during this time for obvious errors, or simply rerun them.
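As a starting point for that check, the sketch below shows one way to list your jobs that overlapped the outage window and flag the ones that did not finish cleanly. It is only an illustration, not an official tool: the function names are made up for this example, the window end of 2:00 PM is an approximation of the "around 2 PM" recovery time, and it assumes `sacct` is on your path and Slurm accounting is enabled for your account.

```python
import subprocess

# Outage window from this notice: September 28, 2021, 12:30 PM to about 2 PM.
# The end time is an approximation.
OUTAGE_START = "2021-09-28T12:30"
OUTAGE_END = "2021-09-28T14:00"


def build_sacct_command(start=OUTAGE_START, end=OUTAGE_END):
    """Build an sacct call that lists your jobs overlapping the window."""
    return [
        "sacct",
        "--starttime", start,
        "--endtime", end,
        "--allocations",   # one line per job, without individual job steps
        "--parsable2",     # pipe-delimited fields, no trailing delimiter
        "--noheader",
        "--format", "JobID,JobName,State,ExitCode",
    ]


def parse_sacct_output(text):
    """Turn pipe-delimited sacct output into a list of dicts.

    Assumes job names do not themselves contain a '|' character.
    """
    jobs = []
    for line in text.splitlines():
        if not line.strip():
            continue
        job_id, name, state, exit_code = line.split("|")
        jobs.append(
            {"job_id": job_id, "name": name, "state": state, "exit_code": exit_code}
        )
    return jobs


def suspicious_jobs(jobs):
    """Keep jobs whose state is not COMPLETED or whose exit code is not 0:0.

    Jobs that are still running are included too, since they may have
    lost state during the outage.
    """
    return [
        job for job in jobs
        if job["state"] != "COMPLETED" or job["exit_code"] != "0:0"
    ]


def list_suspicious_jobs():
    """Run sacct on the cluster and return the flagged jobs."""
    result = subprocess.run(
        build_sacct_command(), capture_output=True, text=True, check=True
    )
    return suspicious_jobs(parse_sacct_output(result.stdout))
```

Calling `list_suspicious_jobs()` from a login node should return the jobs worth a closer look. A job that reports COMPLETED with exit code 0:0 may still have produced incomplete output, so this does not replace inspecting the output files themselves.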

Because the management system was unavailable for more than an hour, attempts to access the cluster or use SLURM during this time likely failed. We will resolve tickets related to this outage with a link to this message.

If you have experienced any other issues during this time, or have any other questions, comments, or concerns, please submit a descriptive help request via the form found at the following link.


Roy Prouty
HPC System Administrator

Posted: September 28, 2021, 3:15 PM