Problem with LOTUS batch scheduler - SLURM
Posted on February 11, 2021 (Last modified on October 19, 2023)
13:30 11th Feb - update 1
There is currently an issue with the LOTUS batch scheduler - SLURM. This has manifested as unusually long pending times for jobs, slow responses when querying SLURM and occasional failed job submissions. We are aware of the issue and are working on a resolution. Please do not email the helpdesk about issues related to SLURM. We will update you when we know more.
We apologise for the inconvenience caused.
JASMIN team
16:30 11th Feb - update 2
The team is still working on resolving the issue. Unfortunately, we cannot provide an estimated time frame for a fix. It is unlikely that this will be resolved before 5pm today. We will provide a further update tomorrow morning.
10:00 12th Feb - update 3
There are far fewer jobs in the queue and jobs appear to be processing normally. We have found quite a few nodes with jobs stuck in the completing state, which we believe are contributing to the problem. The team is working its way through these.
You can now submit jobs as normal - but the service should be considered at risk. We will keep a close eye on the situation and if the problem recurs we may ask you to stop submitting jobs again.
Thank you for your patience whilst we resolve this issue.