In most cases, a job stays unscheduled because the resources you requested do not match the resources currently available.
A common situation is that you have asked for resources which are simply not available at present (but might become available in the future, once other jobs end).
You can verify your situation with
oarstat -fj <jobid>, in particular look at the
For example, have you checked whether an upcoming reservation prevents your (potentially long) job from running? A job cannot start if its walltime would overlap a reservation on the same resources. Check the Gantt charts of Chaos & Gaia to find that out:
One way to significantly improve the scheduling of your jobs is to minimize their resource usage:
- Do you really need all those nodes?
- Do you really need all those cores?
- Do you really need those specific (bigmem, bigsmp, GPU) resources?
- Do you really need all that walltime?
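For the last point, checkpointing is an elegant solution: the application periodically saves its state, so a long computation can be split into several shorter jobs that each resume where the previous one stopped. Depending on your application, this mechanism may be trivial to add to your workflow. See the following resources for more details: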
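
To make the idea concrete, here is a minimal sketch of application-level checkpointing in Python. It is an illustration only, not the method recommended by the cluster documentation: the function names (`load_checkpoint`, `save_checkpoint`, `run`), the JSON state file and the `stop_after` parameter (which merely simulates a job hitting its walltime) are all assumptions made for this example.

```python
import json
import os
import tempfile


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists yet."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"next_step": 0, "total": 0}


def save_checkpoint(path, state):
    """Write the state to a temporary file, then rename it over `path`.

    The rename is atomic, so a job killed mid-write leaves the previous
    checkpoint intact instead of a half-written file.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def run(path, total_steps, checkpoint_every=100, stop_after=None):
    """Sum the integers 0..total_steps-1, resuming from `path` if it exists.

    `stop_after` (hypothetical, for demonstration) interrupts this run after
    that many steps, as if the job had reached its walltime limit.
    Returns the final sum, or None if the run was interrupted.
    """
    state = load_checkpoint(path)
    done_this_run = 0
    while state["next_step"] < total_steps:
        if stop_after is not None and done_this_run >= stop_after:
            return None  # simulated kill: work since the last checkpoint is lost
        state["total"] += state["next_step"]  # the "real" work of one step
        state["next_step"] += 1
        done_this_run += 1
        if state["next_step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state["total"]
```

With this structure, each submitted job simply calls `run` with the same checkpoint path: the first job starts from scratch, and every later job picks up from the last saved step. You can then request a short walltime per job instead of one long walltime for the whole computation, at the cost of redoing the few steps performed since the last checkpoint.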