The thing with Lustre is that it has the ability, directly in Lustre, to report I/O on that file system per Slurm job, as reported through a Lustre extension to the proc filesystem. The problem has always been: "Should there not be a standard way to do this in the kernel?" Kernel is the operative term. If there can be some standardization there, it's possible to make this work with tools like LSF in a way that is sustainable. Maintaining one-offs for Lustre is not scalable.
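For context, here is a minimal sketch of what consuming that per-job data could look like, assuming Lustre jobstats has been enabled (e.g. `lctl set_param jobid_var=SLURM_JOB_ID`) and that this runs on a Lustre server where job_stats files are readable under /proc. The glob path and the exact stats layout are assumptions and vary by Lustre version, so treat this as illustrative only:

```python
import glob
import re
from collections import defaultdict

# Assumed location of OST job_stats files on an OSS node; varies by version.
JOB_STATS_GLOB = "/proc/fs/lustre/obdfilter/*/job_stats"

JOB_RE = re.compile(r"job_id:\s*(\S+)")
BYTES_RE = re.compile(r"(read_bytes|write_bytes):.*?sum:\s*(\d+)")

def collect_job_io():
    """Return {job_id: {"read_bytes": n, "write_bytes": n}} summed over OSTs."""
    totals = defaultdict(lambda: {"read_bytes": 0, "write_bytes": 0})
    for path in glob.glob(JOB_STATS_GLOB):
        with open(path) as fh:
            job_id = None
            for line in fh:
                m = JOB_RE.search(line)
                if m:
                    job_id = m.group(1)
                    continue
                m = BYTES_RE.search(line)
                if m and job_id is not None:
                    totals[job_id][m.group(1)] += int(m.group(2))
    return dict(totals)

if __name__ == "__main__":
    for job, io in sorted(collect_job_io().items()):
        print(f"job {job}: read={io['read_bytes']} write={io['write_bytes']}")
```

The point stands, though: this only works because Lustre implements it; nothing equivalent exists at the kernel level for arbitrary file systems.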
This has been previously discussed in an older enhancement request.
The main challenge is that accurately tracking disk usage and I/O statistics per job is not practical in the general case.
For example, a job can read or write to any directory or file, local or remote. I/O to a local block device can be trapped (with some overhead), but a read of a remote file may be served out of cache, so it never generates network I/O. In the example from Grid Engine's documentation, it reports usage only within a TMPDIR dynamically created for the job (a sketch of that approach follows below).
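A minimal sketch of that TMPDIR-scoped approach, assuming the scheduler exports the per-job scratch directory in the TMPDIR environment variable (as Grid Engine does). Note it measures occupancy under that one directory at scan time, not cumulative I/O, which is exactly its limitation:

```python
import os

def tmpdir_usage_bytes(tmpdir: str) -> int:
    """Sum apparent file sizes under the job's TMPDIR, skipping symlinks."""
    total = 0
    for root, _dirs, files in os.walk(tmpdir):
        for name in files:
            try:
                total += os.lstat(os.path.join(root, name)).st_size
            except OSError:
                pass  # file removed mid-scan; jobs mutate TMPDIR constantly
    return total

# Usage from a job wrapper, polled periodically:
# print(tmpdir_usage_bytes(os.environ["TMPDIR"]))
```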
Similarly, while cgroups do provide I/O accounting (the io controller's io.stat file in cgroup v2), it is only correct for block devices, i.e. local disks; it does not report accurate usage for shared (NFS) file systems.
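To make that limitation concrete, here is a small sketch that parses a cgroup v2 io.stat file; the job cgroup path is a placeholder, since schedulers place jobs in their own cgroup hierarchies. Every line keys on a block device major:minor number, which is precisely why NFS traffic, being network I/O rather than block I/O, never shows up here:

```python
import os

def read_cgroup_io_stat(cgroup_dir: str) -> dict:
    """Parse cgroup v2 io.stat, where each line looks like:
    8:0 rbytes=1459200 wbytes=314773504 rios=192 wios=353 dbytes=0 dios=0
    Returns {"8:0": {"rbytes": ..., "wbytes": ..., ...}, ...}."""
    stats = {}
    with open(os.path.join(cgroup_dir, "io.stat")) as fh:
        for line in fh:
            if not line.strip():
                continue
            device, *fields = line.split()
            stats[device] = {k: int(v) for k, v in (f.split("=") for f in fields)}
    return stats

# Hypothetical job cgroup path; real paths depend on the scheduler's setup:
# print(read_cgroup_io_stat("/sys/fs/cgroup/lsf/job.12345"))
```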
If you are using IBM Storage Scale as your file system, then there is an LSF integration that will report true I/O from the job to the backend storage.