How my Oracle 19c Real Application Cluster crashed
I wanted to share an interesting situation in which one of our clusters crashed in quite a strange manner.
We have an application doing various things on an Oracle 19c RAC environment, and almost all of those operations pass through a single global sequence.
As you would expect, if for some reason this sequence stopped working, the whole application would die within minutes due to connection pool exhaustion.
And now for the interesting part: we were woken up during the night by a high-priority "application down" call.
When checking the sessions, it was apparent that we had a latch/mutex situation: all sessions (~200) were blocked on this particular sequence with the event "enq: SV - contention". Additionally, the SQL_EXEC_START times were not moving, which indicated a locking situation rather than just a high volume of transactions and concurrency.
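A check like the one above can be done with a query along these lines (a sketch only; the event name is from our case, and GV$SESSION is the standard RAC-wide session view):

```sql
-- Count active sessions per wait event across all RAC instances,
-- looking for a pile-up on the sequence enqueue
SELECT inst_id, event, COUNT(*) AS sessions
FROM   gv$session
WHERE  status = 'ACTIVE'
AND    event  = 'enq: SV - contention'
GROUP  BY inst_id, event;
```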
When I tried to manually select the next value from the sequence, my session hung on the same wait event, which confirmed the diagnosis.
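The manual check is a plain NEXTVAL fetch; the sequence name below is a placeholder, not the real one from our system:

```sql
-- APP_GLOBAL_SEQ is a hypothetical name for the global sequence;
-- in the incident this statement simply hung on "enq: SV - contention"
SELECT app_global_seq.NEXTVAL FROM dual;
```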
Then we continued to troubleshoot this as a normal locking issue by ordering the active sessions by SQL_EXEC_START, with the difference that there was no easy way to find the root of the locks, as BLOCKING_SESSION was not populated.
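A sketch of that ordering, again against GV$SESSION (exact filters are an assumption; in our case BLOCKING_SESSION came back NULL for every row, which is what made this unusual):

```sql
-- Oldest active user sessions first; the top row is the best
-- candidate for the root of the hang when BLOCKING_SESSION is NULL
SELECT inst_id, sid, serial#, event,
       blocking_session, sql_exec_start, seconds_in_wait
FROM   gv$session
WHERE  status = 'ACTIVE'
AND    type   = 'USER'
ORDER  BY sql_exec_start;
```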
So we started to check session by session, and the one with the longest wait time (the oldest SQL_EXEC_START) was actually blocked on a different event: "gc current request".
Now, this event normally indicates that the current instance is waiting for a block to be sent from another node, which should take a matter of milliseconds. The strange thing was that the WAIT_TIME was ~40 minutes, approximately as long as the application had been unavailable.
At this point it all started to smell like a bug, as this wait event is not expected to last more than a second.
We took a full hang analysis dump and then killed this particular session waiting on "gc current request", which released the contention and made the application available again.
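For reference, the two steps look roughly like this; the sid/serial#/instance values are placeholders, and the hanganalyze level is just a common starting point, not necessarily what we used:

```sql
-- 1) Cluster-wide hang analysis dump first (run as SYSDBA):
--    SQL> oradebug setmypid
--    SQL> oradebug -g all hanganalyze 3

-- 2) Then kill the stuck session; format is 'sid,serial#,@inst_id'
ALTER SYSTEM KILL SESSION '1234,56789,@2' IMMEDIATE;
```

Taking the dump before the kill matters: once the session is gone, the evidence Oracle Support needs to match the hang to a known bug is gone with it.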
Subsequently, Oracle confirmed that we were hitting bug 32245850 "TXTSDAN : DML OPERATIONS HUNG ON "GC CURRENT REQUEST" WAITS".
Our DBs have now been patched for the bug above and everything looks good so far, but we will keep you posted if any issues arise from this.
As a final note: be careful with GC-related waits in your application, as 19c seems prone to bugs around cluster wait events. During the investigation with Oracle, there were around five similar bugs that could have matched the behavior we experienced.
Additionally, never forget to take a hang analysis dump in such a situation.