IBM has suggested implementing the SD Bypass patch to resolve the ongoing issue in the Citi grid LNPRD cluster. Details of the issue, from IBM's investigation, are below:
We have analyzed the data we collected yesterday and found that the delay in the authentication process occurs mostly in OS syscalls. Below are snippets from both the LND and SW grids.
SW grid (master host gridmstsw30p):
1. strace
23933 20:04:20.437828 write(3, "2021-06-21 20:04:20.437 GMT DEBUG [23585:139771402884864] sd.ssmManager.SdSsmManager - SsmManager::sdkConnect(): Entered:...for app name: symping7.2\n", 150) = 150 <0.000014>
-> SD thread PID=23933 logs the sdkConnect message before starting the authentication and authorization of the job submission user.
23933 20:04:20.444712 poll([{fd=26, events=POLLIN}], 1, 15000
23933 20:04:20.580480 <... poll resumed>) = 1 ([{fd=26, revents=POLLIN}]) <0.135755>
-> A delay of about 136 ms occurred during the poll() syscall while SD was waiting on file descriptor 26.
23933 20:04:20.584180 write(3, "2021-06-21 20:04:20.584 GMT DEBUG [23585:139771402884864] servers.common.resmngr.VEMResManagerBase - VEMResManagerBase::checkRbacUserPermissionInternal() : cache kit key=Admin#/SymTesting/Symping72#SOAM_APP_LOGIN cache size=489\n", 228) = 228 <0.000022>
-> Logged by SD as soon as the authentication and authorization process is done.
2. lsof
sd 23585 symadmp 26u IPv4 28674654 0t0 TCP gridmstsw30p.nam.nsroot.net:47690->gmwgtdcpsp07p.nam.nsroot.net:44443 (ESTABLISHED)
-> In SD's lsof output, file descriptor 26u is the TCP socket between SD and Siteminder. This indicates SD was waiting for data to arrive on FD 26 from Siteminder port 44443. (poll() with POLLIN blocks until data is ready to read on the given FD, or until the timeout expires.)
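The POLLIN behaviour described above can be reproduced in isolation. The following is a minimal sketch, not SD or Siteminder code: it assumes a Linux host (select.poll is not available on Windows), uses a local socket pair as a stand-in for the SD<->Siteminder TCP connection, and the function name wait_for_reply and the 0.1 s delay are made up for illustration. The 15000 ms timeout mirrors the value seen in the strace output.

```python
import select
import socket
import threading
import time


def wait_for_reply(delay_s, timeout_ms=15000):
    """Block in poll() until the peer sends data; return (ready, elapsed_seconds)."""
    # Hypothetical stand-in for the SD<->Siteminder socket.
    client, server = socket.socketpair()

    def slow_peer():
        time.sleep(delay_s)        # simulated server-side processing delay
        server.sendall(b"reply")

    peer = threading.Thread(target=slow_peer)
    peer.start()

    poller = select.poll()
    poller.register(client.fileno(), select.POLLIN)

    start = time.monotonic()
    events = poller.poll(timeout_ms)   # returns when data is readable or on timeout
    elapsed = time.monotonic() - start

    peer.join()
    ready = bool(events) and bool(events[0][1] & select.POLLIN)
    client.close()
    server.close()
    return ready, elapsed


if __name__ == "__main__":
    ready, elapsed = wait_for_reply(0.1)
    print(f"POLLIN ready={ready} after {elapsed:.3f}s")
```

The time spent inside poll() tracks how long the peer takes to answer, which is why the poll() durations in the strace output are read here as time spent waiting for Siteminder.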
LND grid (master host gridmstln30p):
1. strace
25048 19:18:56.989822 write(3, "2021-06-21 19:18:56.989 GMT DEBUG [24709:139991709902592] sd.ssmManager.SdSsmManager - SsmManager::sdkConnect(): Entered:...for app name: symping7.2\n", 150) = 150 <0.000016>
-> SD thread PID=25048 logs the sdkConnect message before starting the authentication and authorization of the job submission user.
25048 19:18:56.989893 futex(0x34bf760, FUTEX_WAKE_PRIVATE, 1) = 1 <0.000012>
25048 19:18:56.989940 futex(0x3bf4750, FUTEX_WAIT_PRIVATE, 2, NULL
25047 19:18:57.645953 futex(0x34bf760, FUTEX_WAKE_PRIVATE, 1) = 1 <0.000012>
25047 19:18:57.646001 futex(0x3bf4750, FUTEX_WAKE_PRIVATE, 1
25048 19:18:57.646037 <... futex resumed>) = 0 <0.656088>
25048 19:18:57.650643 poll([{fd=26, events=POLLIN}], 1, 15000
25048 19:18:57.777387 <... poll resumed>) = 1 ([{fd=26, revents=POLLIN}]) <0.126732>
25048 19:18:57.780711 poll([{fd=25, events=POLLIN}], 1, 15000
25048 19:18:57.894638 <... poll resumed>) = 1 ([{fd=25, revents=POLLIN}]) <0.113916>
25048 19:18:57.897883 poll([{fd=26, events=POLLIN}], 1, 15000
25048 19:18:58.024320 <... poll resumed>) = 1 ([{fd=26, revents=POLLIN}]) <0.126426>
-> Delays occurred during three sequential poll() syscalls (about 127 ms, 114 ms and 126 ms), with SD waiting on file descriptors 26 and 25.
25048 19:18:58.027782 write(3, "2021-06-21 19:18:58.027 GMT DEBUG [24709:139991709902592] servers.common.resmngr.VEMResManagerBase - VEMResManagerBase::checkRbacUserPermissionInternal() : cache kit key=Admin#/SymTesting/Symping72#SOAM_APP_LOGIN cache size=167\n", 228
-> Logged by SD as soon as the authentication and authorization process is done.
2. lsof
sd 24709 symadmp 25u IPv4 136235231 0t0 TCP gridmstln30p.eur.nsroot.net:53580->gmwmwdcpsp07p.nam.nsroot.net:44443 (ESTABLISHED)
sd 24709 symadmp 26u IPv4 133881256 0t0 TCP gridmstln30p.eur.nsroot.net:40578->gmwgtdcpsp07p.nam.nsroot.net:44443 (ESTABLISHED)
-> In SD's lsof output, these FDs are TCP sockets between SD and Siteminder. SD contacted two different Siteminder servers.
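The manual correlation above (slow poll() calls in strace, matched to remote endpoints through lsof) can be scripted for larger captures. The following is a rough sketch under stated assumptions: it expects strace lines in the "PID time call <duration>" layout shown above and lsof lines in the default column layout, and the helper names (slow_syscalls, fd_peers, wait_per_peer) and the 0.1 s threshold are illustrative choices, not part of any IBM or Citi tooling. The sample data is copied from the LND excerpts.

```python
import re
from collections import defaultdict

DURATION_RE = re.compile(r"<(\d+\.\d+)>\s*$")   # trailing "<seconds>" on completed calls
NAME_RE = re.compile(r"^(?:<\.\.\. (\w+) resumed>|(\w+)\()")
FD_RE = re.compile(r"fd=(\d+)")


def slow_syscalls(lines, threshold_s=0.1):
    """Return (pid, syscall, fd or None, seconds) for calls at or above threshold_s."""
    slow = []
    for line in lines:
        parts = line.split(None, 2)              # pid, timestamp, rest of line
        if len(parts) < 3:
            continue
        pid, _timestamp, rest = parts
        dur, name = DURATION_RE.search(rest), NAME_RE.match(rest)
        if not dur or not name:
            continue                             # unfinished call or unrecognised line
        seconds = float(dur.group(1))
        if seconds >= threshold_s:
            fd = FD_RE.search(rest)
            slow.append((int(pid), name.group(1) or name.group(2),
                         int(fd.group(1)) if fd else None, seconds))
    return slow


def fd_peers(lsof_lines):
    """Map numeric FD -> remote "host:port" for connected sockets in lsof output."""
    peers = {}
    for line in lsof_lines:
        fields = line.split()
        if len(fields) < 9 or "->" not in fields[8]:
            continue
        fd = re.match(r"\d+", fields[3])         # "26u" -> "26"
        if fd:
            peers[int(fd.group())] = fields[8].split("->", 1)[1]
    return peers


def wait_per_peer(strace_lines, lsof_lines, threshold_s=0.1):
    """Sum slow poll() time per remote endpoint."""
    peers = fd_peers(lsof_lines)
    totals = defaultdict(float)
    for _pid, name, fd, seconds in slow_syscalls(strace_lines, threshold_s):
        if name == "poll" and fd in peers:
            totals[peers[fd]] += seconds
    return dict(totals)


STRACE = """\
25048 19:18:57.646037 <... futex resumed>) = 0 <0.656088>
25048 19:18:57.650643 poll([{fd=26, events=POLLIN}], 1, 15000
25048 19:18:57.777387 <... poll resumed>) = 1 ([{fd=26, revents=POLLIN}]) <0.126732>
25048 19:18:57.780711 poll([{fd=25, events=POLLIN}], 1, 15000
25048 19:18:57.894638 <... poll resumed>) = 1 ([{fd=25, revents=POLLIN}]) <0.113916>
25048 19:18:57.897883 poll([{fd=26, events=POLLIN}], 1, 15000
25048 19:18:58.024320 <... poll resumed>) = 1 ([{fd=26, revents=POLLIN}]) <0.126426>
""".splitlines()

LSOF = """\
sd 24709 symadmp 25u IPv4 136235231 0t0 TCP gridmstln30p.eur.nsroot.net:53580->gmwmwdcpsp07p.nam.nsroot.net:44443 (ESTABLISHED)
sd 24709 symadmp 26u IPv4 133881256 0t0 TCP gridmstln30p.eur.nsroot.net:40578->gmwgtdcpsp07p.nam.nsroot.net:44443 (ESTABLISHED)
""".splitlines()

if __name__ == "__main__":
    for peer, seconds in wait_per_peer(STRACE, LSOF).items():
        print(f"{peer}: {seconds:.6f}s in poll()")
```

On the LND sample this attributes roughly 0.253 s of poll() time to gmwgtdcpsp07p (FD 26, two calls) and roughly 0.114 s to gmwmwdcpsp07p (FD 25, one call). Note that slow_syscalls also reports the 0.656 s futex wait; it is left out of the per-peer totals because it is not tied to a socket.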
• We observe that the delays are related to Siteminder authentication.
• In the LND grid case, the SD thread had to contact two different Siteminder servers during the authentication phase. Please turn off the second Siteminder server (we believe it is gmwmwdcpsp07p, based on comparison with the SW case), as this may avoid the attempt to send the user query to the second Siteminder server.
• Alternatively, check with the Siteminder team whether the two Siteminder servers are in sync in terms of user database (i.e. if user A cannot be queried in Siteminder A, the query may go on to Siteminder B).
• The Siteminder plugin code in Symphony is owned by Citi; we suggest investigating the plugin code to remediate the delays.
The required enhancement is available on IBM Fix Center: sym-7.2.0.2-build600609-citi