IBM Data and AI Ideas Portal for Customers




Status Not under consideration
Workspace Db2 for z/OS
Created by Guest
Created on Aug 2, 2021

Removal of an unnecessary system crash due to a must-complete or latch-holding agent - RC 00F3040A

Today, DB2 intentionally crashes the whole subsystem with abend code 00F3040A when an abending agent is in a so-called must-complete state or is holding a latch. We have been running case TS006198341, in which our system crashed again for this reason. In the discussion with IBM support on this case, we did not receive any information that could explain this strange DB2 behavior. Currently, when such a situation occurs, DB2 will crash "to protect the integrity and stability of the entire group". We have also been told that DB2 still holds to the philosophy that "failing fast is better than a slow, grinding death." Our business model and applications often run extremely long-running transactions, sometimes longer than a few days in duration, and most often at least a few hours. I believe this philosophy does not take into account how costly restart, retained-lock processing, etc. are in such a system, where a crash kills hundreds or more well-behaved agents without reason.

It was also confirmed in the case that during restart, DB2 needs access to logs, structures, etc. to perform the restart and backout. It will commit, or abort and roll back, the agents that were executing at the time of the abend. We were told that this is all documented at the following links:

https://www.ibm.com/docs/en/db2-for-zos/12?topic=msc-heuristic-decisions-about-whether-commit-abort-indoubt-thread


https://www.ibm.com/docs/en/db2-for-zos/12?topic=consistency-after-termination-failure

https://www.ibm.com/docs/en/db2-for-zos/12?topic=db2-restart-options-after-abend

However, based on the full description of the restart logic, it is clear that the termination does not give DB2 a single byte of "new" information. It is standard in mainframe shops that when such an abend happens, System Automation, NetView, ARM, or any other automated solution immediately restarts the subsystem without any human intervention, and DB2 is always able to fully clean up, recover, restart, commit, abort, free the latch, and do all that is necessary to resolve the cause of the crash.

Proposed solution:

Based on that, we believe that this crash is not necessary and that exactly the same logic that is applied after crash/restart should be applied without it to recover from the trouble. It should be enough to terminate only the agent that is causing the havoc, not the whole subsystem. There is no reason to kill thousands of innocent agents only because one is bad. As mentioned before, the crash does not bring anything new to the table in terms of information for DB2 on how to proceed with the recovery and restart actions in such a case. It must also be much faster to conduct a termination/recovery procedure for just one bad agent than to do the same for 1000 of them, which is the case in the crash/restart scenario.
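The cost argument above can be illustrated with a toy model. This is only a sketch and does not represent DB2 internals: the `UndoRecord` type, the `backout` function, and the `build_workload` helper are hypothetical names invented for the illustration, and the "log" is a plain Python list. It compares how many undo records must be processed when backing out one agent versus backing out every in-flight agent, as happens after a subsystem restart.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UndoRecord:
    agent: str   # hypothetical agent identifier
    key: str     # item that was changed
    before: int  # value of the item before the change


def backout(data, log, agents):
    """Undo the logged changes of the given agents, newest first.

    Returns the number of undo records that were applied.
    """
    undone = 0
    for rec in reversed(log):
        if rec.agent in agents:
            data[rec.key] = rec.before
            undone += 1
    return undone


def build_workload(n_agents, updates_per_agent):
    """Create n_agents in-flight agents, each updating its own row."""
    data = {}
    log = []
    for a in range(n_agents):
        agent, key = f"agent{a}", f"row{a}"
        data[key] = 0
        for i in range(updates_per_agent):
            log.append(UndoRecord(agent, key, data[key]))
            data[key] = i + 1
    return data, log


if __name__ == "__main__":
    data, log = build_workload(1000, 5)
    one = backout(dict(data), log, {"agent7"})
    everyone = backout(dict(data), log, {r.agent for r in log})
    print(f"undo records for one agent: {one}, for all agents: {everyone}")
```

In this model the work grows with the number of agents that have to be backed out. It deliberately ignores the case where shared in-memory state is left inconsistent, which the model cannot express.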



Needed By Yesterday (Let's go already!)
  • Admin
    Janet Figone
    Aug 10, 2021

    Thank you for submitting this Aha! idea. Db2 development has reviewed it and provided the following:

    When an agent fails due to an abend or is cancelled, Db2 does have code to clean up, including when latches such as page latches are held. There are certain scenarios where this cleanup is not possible, e.g. when the latch holder has gone or when we are in the middle of a process that cannot be rectified. The reason why crashing Db2, or recycling Db2, solves this issue is that the temporary inconsistency that the latch protects against exists in memory only, so by restarting Db2 the in-memory structures and chains are all thrown away.

    For this reason there are situations where failing Db2 is unavoidable. It is not done lightly and we continue to make improvements in terms of Db2 resiliency. But this requirement cannot be met wholesale.

    Therefore, unfortunately, we will not be implementing this Idea. We do encourage you to continue submitting your enhancement requests, as we review each one as a potential candidate for Db2 for z/OS.

    Sincerely,

    The Db2 for z/OS Development team
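
The development team's explanation can be pictured with a small toy model. It is a sketch under loud assumptions, not a description of Db2 code: the `ToySubsystem` class, its methods, and the returned strings are all hypothetical. A latch guards an in-memory chain; if the latch holder abends in the middle of an update, the chain is left half-changed and the only repair this model knows is to discard the in-memory copy and rebuild it from the durable copy, which stands in for a restart.

```python
class ToySubsystem:
    """Hypothetical model: a durable copy plus a latch-protected in-memory chain."""

    def __init__(self, durable_rows):
        self.durable = list(durable_rows)  # stands in for logs and disk
        self.chain = list(durable_rows)    # in-memory structure
        self.latch_holder = None
        self.chain_consistent = True

    def begin_update(self, agent):
        # The agent takes the latch and starts changing the chain.
        self.latch_holder = agent
        self.chain_consistent = False

    def finish_update(self, agent, row):
        # The update completes; the chain is consistent again.
        self.chain.append(row)
        self.durable.append(row)
        self.chain_consistent = True
        self.latch_holder = None

    def agent_abend(self, agent):
        """Return how this model recovers from the agent's failure."""
        if self.latch_holder != agent or self.chain_consistent:
            self.latch_holder = None if self.latch_holder == agent else self.latch_holder
            return "agent cleaned up"
        # Holder died mid-update: the in-memory chain cannot be trusted.
        return "restart required"

    def restart(self):
        # Throw away in-memory state and rebuild it from the durable copy.
        self.chain = list(self.durable)
        self.latch_holder = None
        self.chain_consistent = True


if __name__ == "__main__":
    sub = ToySubsystem(["r1", "r2"])
    print(sub.agent_abend("A"))  # agent holds no latch
    sub.begin_update("B")
    print(sub.agent_abend("B"))  # agent failed mid-update
    sub.restart()
    print(sub.chain_consistent, sub.chain)
```

The model only captures the distinction drawn in the reply: failures outside a latch-protected update can be handled per agent, while a failure in the middle of one leaves an inconsistency that exists in memory only.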