The systemd unit file as specified (and provided in LSF 10.1 fix 432732) is not compatible with running MPI jobs over InfiniBand: it leaves the memlock limit at 64 KB, rather than the much larger (or, typically, unlimited) value required to register memory for RDMA buffers.
Here's the text of my support request reporting the problem, including workaround:
Problem Details
.
Product or Service: Spectrum LSF Standard Edition A.1.0
Component ID: 5725G8201
.
Operating System: Linux
.
Problem title
hostsetup on systemd-controlled hosts disallows MPI over InfiniBand (low memlock)
.
Problem description and business impact
Description:
When using hostsetup on SLES12.1 (and presumably other systemd-enabled
distros), the following lsfd.service file is created:
---
[Unit]
Description=IBM Spectrum LSF
After=network.target nfs.service autofs.service gpfs.service
[Service]
Type=forking
ExecStart=/10.1/linux3.10-glibc2.17-x86_64/etc/lsf_daemons start
ExecStop=/10.1/linux3.10-glibc2.17-x86_64/etc/lsf_daemons stop
KillMode=none
[Install]
WantedBy=multi-user.target
---
When LSF is started using 'systemctl start lsfd', the ludicrously low
default memlock limit of 64 KB is used, which is then the value inherited
by sbatchd. All jobs started on the host then inherit that as a limit.
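One way to confirm which limit sbatchd actually inherited is to read it from /proc. This is a minimal sketch, assuming a Linux /proc filesystem; the function name memlock_soft_limit is illustrative and not part of LSF, and matching the daemon by the process name sbatchd is an assumption:

```shell
# Print the soft "Max locked memory" limit from a /proc/<pid>/limits-style
# file. The kernel reports this value in bytes, or as "unlimited".
memlock_soft_limit() {
    awk '/^Max locked memory/ { print $4 }' "$1"
}

# Example (assumes the daemon's process name is sbatchd):
#   memlock_soft_limit "/proc/$(pgrep -o -x sbatchd)/limits"
```

The value is in bytes, so a 64 KB limit appears as 65536. 'systemctl show lsfd -p LimitMEMLOCK' reports the limit systemd applies to the unit itself.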
Problem:
When starting a multi-node job on these nodes that uses MPI over
InfiniBand (Ansys Fluent in this case), the following error is observed:
1494948637 | fluent_mpi.17.0.0: Rank 0:1: MPI_Init: ibv_create_cq() failed 4
1494948637 | fluent_mpi.17.0.0: Rank 0:1: MPI_Init: Can't initialize RDMA device
1494948637 | fluent_mpi.17.0.0: Rank 0:1: MPI_Init: Internal Error: Cannot initialize RDMA protocol
The RDMA protocol cannot be initialised because it requires that memory
be pinned for use as RDMA buffers. In general it is advised to allow
the maximum value for locked memory to be unlimited and to allow the
MPI implementation to ensure that a sensible amount is actually
registered.
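Because jobs inherit the limit from sbatchd, a job script can check it before launching MPI and fail with a clearer message than the one above. This is a rough sketch, assuming bash, where 'ulimit -l' prints the soft limit in KB or "unlimited"; the function name and the optional minimum are hypothetical, and a suitable minimum depends on the MPI implementation:

```shell
# Return 0 if the given `ulimit -l` value (KB, or "unlimited") is acceptable.
# $1: value printed by `ulimit -l`
# $2: optional minimum in KB (hypothetical); if omitted, only "unlimited" passes
memlock_is_sufficient() {
    limit=$1
    min_kb=${2:-}
    [ "$limit" = "unlimited" ] && return 0
    [ -n "$min_kb" ] && [ "$limit" -ge "$min_kb" ] 2>/dev/null
}

# Example, near the top of a job script:
#   memlock_is_sufficient "$(ulimit -l)" || {
#       echo "memlock limit is $(ulimit -l) KB; RDMA is likely to fail" >&2
#       exit 1
#   }
```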
The Workaround:
Modify hostsetup to insert the line 'LimitMEMLOCK=infinity', as below:
cat > $_tmp_service_file << EOF
[Unit]
Description=IBM Spectrum LSF
After=network.target nfs.service autofs.service gpfs.service
[Service]
LimitMEMLOCK=infinity
Type=forking
ExecStart=${LSF_SERVERDIR}/lsf_daemons start
ExecStop=${LSF_SERVERDIR}/lsf_daemons stop
KillMode=none
[Install]
WantedBy=multi-user.target
EOF
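An alternative that leaves hostsetup untouched is a systemd drop-in for the generated unit. This is not what the workaround above does; it is a sketch of the standard systemd override mechanism, assuming the unit is named lsfd.service as above. The function name and the file name memlock.conf are arbitrary choices:

```shell
# Write a drop-in that sets LimitMEMLOCK=infinity for a systemd unit.
# $1: drop-in directory, e.g. /etc/systemd/system/lsfd.service.d
write_memlock_dropin() {
    dropin_dir=$1
    mkdir -p "$dropin_dir" || return 1
    cat > "$dropin_dir/memlock.conf" << 'EOF'
[Service]
LimitMEMLOCK=infinity
EOF
}

# Example (as root):
#   write_memlock_dropin /etc/systemd/system/lsfd.service.d
#   systemctl daemon-reload
#   systemctl restart lsfd
```

The daemons have to be restarted for the new limit to apply, since a process keeps the limits it was started with.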
The Fix:
Offer an option to insert the line, or not, based on usage requirements.
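As an illustration of what such an option could look like, here is a sketch of a generator that emits the same unit file and adds the line only on request. The function name render_lsfd_unit and its y/n second argument are hypothetical; hostsetup has no such option today as far as this request is aware:

```shell
# Emit the lsfd unit file on stdout.
# $1: directory containing lsf_daemons (LSF_SERVERDIR in hostsetup)
# $2: "y" to include LimitMEMLOCK=infinity (hypothetical option, default "n")
render_lsfd_unit() {
    serverdir=$1
    unlimited_memlock=${2:-n}
    cat << EOF
[Unit]
Description=IBM Spectrum LSF
After=network.target nfs.service autofs.service gpfs.service
[Service]
EOF
    if [ "$unlimited_memlock" = "y" ]; then
        echo "LimitMEMLOCK=infinity"
    fi
    cat << EOF
Type=forking
ExecStart=${serverdir}/lsf_daemons start
ExecStop=${serverdir}/lsf_daemons stop
KillMode=none
[Install]
WantedBy=multi-user.target
EOF
}

# Example:
#   render_lsfd_unit "$LSF_SERVERDIR" y > "$_tmp_service_file"
```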
The Impact:
For me, none, as I have fixed my hostsetup. For anyone who hasn't
figured it out, failure of all multi-node jobs that use RDMA.
Apologies if that has already been fixed in a subsequent patch.
This enhancement is available for LSF 9.1.3. Please download it from Fix Central:
http://www-933.ibm.com/support/fixcentral/swg/selectFixes?parent=Platform%2BComputing&product=ibm/Other+software/Platform+LSF&release=All&platform=All&function=fixId&fixids=lsf-9.1.3-build431869&includeSupersedes=0
Fix ID: lsf-9.1.3-build431869
Changing this to the correct status in the all-customer view.
We will address this in a future fix pack.
Creating a new RFE based on Community RFE #89685 in product Platform LSF.