Removal of an unnecessary system crash due to a must-complete or latch-holding agent - RC 00F3040A

Today, DB2 intentionally crashes the whole subsystem with abend code 00F3040A when an abending agent is in a so-called must-complete state or is holding a latch. We have been working case TS006198341, in which our system crashed again for this reason. During the discussion with IBM support in this case, we did not receive any information that could explain this strange DB2 behavior. Currently, when the situation described above occurs, DB2 will crash "to protect the integrity and stability of the entire group". We have also been told that DB2 still holds to the philosophy that "failing fast is better than a slow, grinding death." Our business model and applications often run extremely long-running transactions, sometimes longer than a few days and most often at least a few hours. I believe this philosophy does not take into account the cost of restart, retained-lock processing, and so on in such a system, where a crash kills hundreds or more well-behaved agents for no reason.

We have also had it confirmed in the case that during restart, DB2 needs access to the logs, structures, and so on to perform the restart and backout. It will commit, or abort and roll back, the agents that were executing at the time of the abend. We were told this is all documented at the following links:

https://www.ibm.com/docs/en/db2-for-zos/12?topic=msc-heuristic-decisions-about-whether-commit-abort-indoubt-thread

https://www.ibm.com/docs/en/db2-for-zos/12?topic=consistency-after-termination-failure

https://www.ibm.com/docs/en/db2-for-zos/12?topic=db2-restart-options-after-abend

However, based on the full description and the restart logic, it is clear that DB2 does not gain a single byte of new information from the termination. It is standard in mainframe shops that when such an abend happens, System Automation, NetView, ARM, or another automated solution immediately restarts the subsystem without any human intervention, and DB2 is always able to fully clean up, recover, restart, commit, abort, free the latch, and do everything else necessary to resolve the cause of the crash.

Proposed solution:

Based on that, we believe this crash is not necessary, and that exactly the same logic that is applied after a crash and restart should be applied without it to recover from the problem. It should be enough to terminate only the agent that is causing the havoc, not the whole subsystem. There is no reason to kill thousands of innocent agents because one is bad. As mentioned before, the crash does not bring anything new to the table in terms of the information DB2 needs to proceed with the recovery and restart actions. It must also be much faster to run the termination and recovery procedure for one bad agent than for 1000 of them, which is what happens in the crash/restart scenario.

Needed By

Yesterday (Let's go already!)

Admin

Janet Figone

Aug 10, 2021

Thank you for submitting this Aha! Ideas. Db2 development has reviewed it and provided the following:
When an agent fails due to an abend or it is cancelled, Db2 does have code to clean up, including when latches are held such as page latches. There are certain scenarios where this cleanup is not possible, e.g. when the latch holder has gone or if we are in the middle of a process that cannot be rectified. The reason why crashing Db2, or Db2 being recycled, solves this issue is that the temporary inconsistency that the latch protects against exists in memory only, so by restarting Db2 the in-memory structures and chains are all thrown away.
For this reason there are situations where failing Db2 is unavoidable. It is not done lightly and we continue to make improvements in terms of Db2 resiliency. But this requirement cannot be met wholesale.
Therefore, unfortunately, we will not be implementing this Idea. We do encourage you to continue submitting your enhancement requests, as we review each one as a potential candidate for Db2 for z/OS.
Sincerely,
The Db2 for z/OS Development team
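
The explanation in the reply above (a latch protects a temporary, in-memory-only inconsistency, and a restart throws those structures away) can be pictured with a toy model. The sketch below is purely illustrative and is not how Db2 is implemented: `ToySubsystem`, its `chain` and `chain_length` fields, and the `fail_midway` flag are all hypothetical names invented for this example. It assumes a durable log that only ever contains completed changes and a volatile chain that can be left half-updated if an agent dies while holding the latch.

```python
class ToySubsystem:
    """Toy model only (hypothetical, not Db2 internals):
    a durable log plus a volatile in-memory chain guarded by one latch."""

    def __init__(self, log=None):
        self.log = list(log or [])      # stands in for durable, committed state
        self.restart()

    def restart(self):
        # Throw away every in-memory structure and rebuild it from the log.
        self.chain = list(self.log)
        self.chain_length = len(self.chain)
        self.latch_holder = None

    def consistent(self):
        # Consistent means: nobody holds the latch and the counter matches the chain.
        return self.latch_holder is None and self.chain_length == len(self.chain)

    def insert(self, agent, value, fail_midway=False):
        self.latch_holder = agent       # take the latch
        self.chain.append(value)        # step 1 of a two-step in-memory update
        if fail_midway:
            # The agent dies here: latch still held, update half done.
            raise RuntimeError(f"{agent} abended while holding the latch")
        self.chain_length += 1          # step 2 of the in-memory update
        self.log.append(value)          # make the change durable
        self.latch_holder = None        # release the latch


db = ToySubsystem()
db.insert("agent-1", "row-a")           # completes normally
try:
    db.insert("agent-2", "row-b", fail_midway=True)
except RuntimeError:
    pass                                # in-memory state is now inconsistent
damaged = not db.consistent()
db.restart()                            # discard memory, rebuild from the log
```

In this toy, `restart()` repairs the damage only because nothing half-done ever reached the log. Whether an equivalent repair could be scoped to a single agent without discarding all in-memory state is exactly the point on which the idea and the development team's reply disagree, and the sketch does not settle it.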
