Removal of an unnecessary system crash due to a must-complete or latch-holding agent - RC 00F3040A
Today, DB2 intentionally crashes the whole subsystem with abend code 00F3040A when an abending agent is in a so-called must-complete state or is holding a latch. We have been running case TS006198341, in which our system crashed again for this reason. In the discussion with IBM support on this case, we did not receive any information that could explain this strange DB2 behavior. Currently, when the situation described above occurs, DB2 crashes "to protect the integrity and stability of the entire group". We have also been told that DB2 still holds to the philosophy that "failing fast is better than a slow, grinding death." Our business model and applications often run extremely long transactions, sometimes longer than a few days and most often at least a few hours. I believe this philosophy does not take into account how costly restart, retained-lock processing, and similar work are in such a system, where the crash kills hundreds or more well-behaved agents without a reason.
The case also confirmed that during restart, DB2 needs access to the logs, structures, and so on to perform the restart and backout. It will commit, or abort and roll back, the agents that were executing at the time of the abend. We were told this is all documented at the following links:
However, based on the full description and the restart logic, it is clear that the termination does not give DB2 a single byte of "new" information. It is standard in mainframe shops that when such an abend happens, System Automation, NetView, ARM, or any other automated solution immediately restarts the subsystem without human intervention, and DB2 is always able to fully clean up, recover, restart, commit, abort, free the latch, and do everything else necessary to resolve the cause of the crash.
Based on that, we believe this crash is not necessary, and that exactly the same logic that is applied after a crash and restart should be applied without the crash to recover from the problem. It should be enough to terminate only the agent that is causing the havoc, not the whole subsystem. There is no reason to kill thousands of innocent agents only because one is bad. As mentioned before, the crash does not bring anything new to the table in terms of information DB2 needs to proceed with the recovery and restart actions. It must also be much faster to run the termination and recovery procedure for one bad agent than to run the same procedure for 1000 of them, which is what happens in the crash and restart scenario.