IBM Data and AI Ideas Portal for Customers



Status: Not under consideration
Workspace: Db2 for z/OS
Created by: Guest
Created on: Aug 2, 2021

Removal of an unnecessary system crash due to a must-complete or latch-holding agent - RC 00F3040A

Today, DB2 intentionally crashes the system, by design, with abend code 00F3040A when an abending agent is in a so-called must-complete state or is holding a latch. We have been running case TS006198341, in which our system crashed again for this reason. In the discussion with IBM support on this case, we did not receive any information that could explain this strange DB2 behavior. Currently, when the situation described above occurs, DB2 crashes "to protect the integrity and stability of the entire group". We have also been told that DB2 "still holds to the philosophy that 'failing fast is better than a slow, grinding death.'" Our business model and applications often run extremely long-running transactions, sometimes lasting more than a few days and most often at least a few hours. I believe this philosophy does not take into account that a system like ours, with its restart and retained-lock processing, is poorly served by it when such a crash kills hundreds or more well-behaved agents without a reason.

The case also confirmed that during restart DB2 needs access to logs, structures, etc. to perform the restart and backout. It will commit, or abort and roll back, the agents that were executing at the time of the abend. We were told that this is all documented at the following links:

https://www.ibm.com/docs/en/db2-for-zos/12?topic=msc-heuristic-decisions-about-whether-commit-abort-indoubt-thread


https://www.ibm.com/docs/en/db2-for-zos/12?topic=consistency-after-termination-failure

https://www.ibm.com/docs/en/db2-for-zos/12?topic=db2-restart-options-after-abend

However, based on the full description of the restart logic, it is clear that DB2 gains not a single byte of "new" information from the termination. It is standard in mainframe shops that when such an abend happens, System Automation, NetView, ARM, or another automated solution immediately restarts the subsystem without any human intervention, and DB2 is always able to fully clean up, recover, restart, commit, abort, free the latch, and do all that is necessary to resolve the cause of the crash.

Proposed solution:

Based on that, we believe that this crash is not necessary and that exactly the same logic that is applied after a crash and restart should be applied without one to recover from the problem. It should be enough to terminate only the agent that is causing the havoc, not the whole subsystem. There is no reason to kill thousands of innocent agents only because one is bad. As mentioned before, a crash does not bring anything new to the table in terms of information for DB2 on how to proceed with the recovery and restart actions in such a case. It must also be much faster to conduct a termination and recovery procedure for just one bad agent than for 1000 of them, which is the case in the crash and restart scenario.



Needed By Yesterday (Let's go already!)
  • Admin
    Janet Figone
    Aug 10, 2021

    Thank you for submitting this Aha! Idea. Db2 development has reviewed it and provided the following:

    When an agent fails due to an abend or is cancelled, Db2 does have code to clean up, including when latches such as page latches are held. There are certain scenarios where this cleanup is not possible, e.g. when the latch holder is gone or when we are in the middle of a process that cannot be rectified. The reason why crashing Db2, or recycling it, solves this issue is that the temporary inconsistency that the latch protects against exists in memory only, so restarting Db2 throws away all of the in-memory structures and chains.

    For this reason there are situations where failing Db2 is unavoidable. It is not done lightly and we continue to make improvements in terms of Db2 resiliency. But this requirement cannot be met wholesale.

    Therefore, unfortunately, we will not be implementing this Idea. We do encourage you to continue submitting your enhancement requests, as we review each one as a potential candidate for Db2 for z/OS.

    Sincerely,

    The Db2 for z/OS Development team
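
The development team's point, that a latch protects a temporary inconsistency which exists in memory only, can be illustrated with a toy model. The sketch below is not Db2 code and does not reflect Db2 internals. The names `Chain`, `AgentAbend`, `restart`, and `durable_log` are all hypothetical and chosen only for illustration. It assumes a structure whose two fields must be updated together under a latch, and an agent that fails between the two updates while still holding the latch.

```python
import threading


class AgentAbend(Exception):
    """Simulated failure of an agent while it holds the latch (hypothetical)."""


class Chain:
    """Toy in-memory structure: a list of items plus a cached count.

    Invariant: count == len(items). The latch covers the window in which
    the two fields are updated one after the other.
    """

    def __init__(self):
        self.latch = threading.Lock()
        self.items = []
        self.count = 0

    def add(self, item, fail_midway=False):
        self.latch.acquire()
        self.items.append(item)      # step 1 of the protected update
        if fail_midway:
            # The agent dies between step 1 and step 2, latch still held.
            raise AgentAbend(item)
        self.count += 1              # step 2 of the protected update
        self.latch.release()

    def is_consistent(self):
        return self.count == len(self.items)


def restart(durable_log):
    """Discard the in-memory structure and rebuild it from durable records."""
    fresh = Chain()
    for item in durable_log:
        fresh.add(item)
    return fresh


durable_log = ["a", "b"]             # stands in for committed, logged work
chain = restart(durable_log)

try:
    chain.add("c", fail_midway=True)
except AgentAbend:
    pass

# State left behind by the failed agent.
broken_consistent = chain.is_consistent()
latch_stuck = chain.latch.locked()
print("after failure: consistent =", broken_consistent, ", latch held =", latch_stuck)

# A restart throws the damaged in-memory state away and rebuilds it.
chain = restart(durable_log)
print("after restart: consistent =", chain.is_consistent(),
      ", latch held =", chain.latch.locked())
```

In this model the half-applied update and the held latch live only in the discarded object, so a rebuild from the durable records restores a consistent state. Whether an equivalent repair can be made without a restart in any particular real scenario is exactly the question the idea raises, and the sketch does not settle it.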