Monday, July 8, 2013

Oracle CRS is not starting: "has a disk HB, but no network HB, DHB has rcfg..." in ocssd log

Problems in the interconnect network or with an interconnect interface can prevent CRS from starting.

If you run the CRS check, you will see the following:
[root@<node2> <node2>]# /u01/app/11.2.0/grid/bin/crsctl check crs
CRS-4638: Oracle High Availability Services is online
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4530: Communications failure contacting Cluster Synchronization Services daemon
CRS-4534: Cannot communicate with Event Manager
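
If you want to script this check, a small helper can pick out the failure lines from the output above. This is only a sketch: the function name crs_failures is made up for illustration, and it only recognizes the three CRS codes shown in this post.

```shell
#!/bin/sh
# Hypothetical helper: print the "cannot communicate" lines from
# `crsctl check crs` output read on stdin.
# Exit status 0 means at least one failure line was found (grep semantics).
crs_failures() {
    grep -E '^CRS-(4535|4530|4534):'
}

# Example (run as root on the node that will not start):
#   /u01/app/11.2.0/grid/bin/crsctl check crs | crs_failures && echo "CRS stack is not fully up"
```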


The CRS alert log shows the following
(/u01/app/11.2.0/grid/log/<node2>/alert<node2>.log):
.
.
.
2013-06-03 06:56:50.778
[/u01/app/11.2.0/grid/bin/cssdagent(13124)]CRS-5818:Aborted command 'start' for resource 'ora.cssd'.
Details at (:CRSAGF00113:) {0:28:4} in /u01/app/11.2.0/grid/log/<node2>/agent/ohasd/oracssdagent_root/oracssdagent_root.log.
2013-06-03 06:56:50.779
[cssd(13138)]CRS-1656:The CSS daemon is terminating due to a fatal error; Details at (:CSSSC00012:) in /u01/app/11.2.0/grid/log/<node2>/cssd/ocssd.log
.
.
.


The ocssd log shows the following
(/u01/app/11.2.0/grid/log/<node2>/cssd/ocssd.log); here it is complaining about the node1 interconnect:


2013-06-03 06:56:50.814: [    CSSD][3190012224]clssnmvDHBValidateNCopy: node 1, <node1>, has a disk HB, but no network HB, DHB has rcfg 216823918, wrtcnt, 48655493,
LATS 5387994, lastSeqNo 48655492, uniqueness 1365009957, timestamp 1370231810/927159338

2013-06-03 06:56:51.822: [    CSSD][3190012224]clssnmvDHBValidateNCopy: node 1, <node1>, has a disk HB, but no network HB, DHB has rcfg 216823918, wrtcnt, 48655494,
LATS 5389004, lastSeqNo 48655493, uniqueness 1365009957, timestamp 1370231811/927160338

2013-06-03 06:56:52.862: [    CSSD][3190012224]clssnmvDHBValidateNCopy: node 1, <node1>, has a disk HB, but no network HB, DHB has rcfg 216823918, wrtcnt, 48655495,
LATS 5390044, lastSeqNo 48655494, uniqueness 1365009957, timestamp 1370231812/927161338
.
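
In a long ocssd.log it helps to pull out just which node these messages point at. The sketch below assumes the message format shown above; nodes_missing_network_hb is a hypothetical name, not an Oracle tool.

```shell
#!/bin/sh
# Hypothetical helper: read ocssd.log text on stdin and print one
# "<node number> <node name>" line for each distinct node reported as
# having a disk heartbeat but no network heartbeat.
nodes_missing_network_hb() {
    sed -n 's/.*clssnmvDHBValidateNCopy: node \([0-9][0-9]*\), \([^,]*\), has a disk HB, but no network HB.*/\1 \2/p' |
        sort -u
}

# Example (replace the <node2> placeholder with the real host name first):
#   nodes_missing_network_hb < /u01/app/11.2.0/grid/log/<node2>/cssd/ocssd.log
```

The node printed is the one whose interconnect the failing node cannot hear, which matters for the fix described next.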


First check that ping and SSH work between the nodes over the interconnect interface.
If they do not, there is a problem in the network connection between the cluster nodes. Fix that problem and CRS will start correctly.
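
The ping part of that check can be sketched as below. It assumes the Linux iputils ping (the -I option selects the outgoing interface); check_interconnect and the PING_CMD override are hypothetical names used only for illustration, and the peer address is whatever private IP your other node uses.

```shell
#!/bin/sh
# Hypothetical helper: ping a peer's private (interconnect) address
# through a specific local interface.
# PING_CMD is an assumed override so the function can be dry-run.
PING_CMD=${PING_CMD:-ping}

check_interconnect() {
    iface=$1   # local interconnect interface, e.g. eth1
    peer=$2    # peer node's private IP or private host name
    # -c 3: send three probes; -I: send them out through the given interface
    if "$PING_CMD" -c 3 -I "$iface" "$peer" >/dev/null 2>&1; then
        echo "interconnect OK: $peer reachable via $iface"
    else
        echo "interconnect FAILED: $peer not reachable via $iface"
        return 1
    fi
}

# Example: check_interconnect eth1 <private IP of the other node>
```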

If ping and SSH do work between the nodes over the interconnect interface but the ocssd log still complains about the interconnect heartbeat (no network HB), then the interconnect interface is jammed. You can try restarting it to fix the problem. NOTE: it is usually the interconnect interface on the working node that needs the restart, that is, the node named in the ocssd.log message (node1 in the log above). For example, if CRS is not starting on node2, restart the interconnect interface on node1:
[root@<node1> <node1>]# ifdown eth1
[root@<node1> <node1>]# ifup eth1
Then check that eth1 looks OK:
[root@<node1> <node1>]# ifconfig


After restarting the interface or fixing the network problem, check that the clusterware on <node2> starts again:
[root@<node2> <node2>]# /u01/app/11.2.0/grid/bin/crsctl check crs

NOTE: If the clusterware keeps trying to connect over the interconnect for long enough without success, it writes this message to its alert log:
[ohasd(7773)]CRS-2771:Maximum restart attempts reached for resource 'ora.cssd'; will not restart.

If this error occurs, you need to kill the CRS processes manually or reboot <node2> so that the clusterware tries to start again (cssd start).
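
To find out quickly whether a node has already given up, you can search its alert log for that message. A minimal sketch, assuming the message text shown above; cssd_restarts_exhausted is a made-up name.

```shell
#!/bin/sh
# Hypothetical helper: succeed (exit 0) if the alert log text on stdin
# contains the CRS-2771 "maximum restart attempts" message for ora.cssd.
cssd_restarts_exhausted() {
    grep -q "CRS-2771:.*'ora\.cssd'"
}

# Example (replace the <node2> placeholders with the real host name first):
#   cssd_restarts_exhausted < /u01/app/11.2.0/grid/log/<node2>/alert<node2>.log \
#       && echo "ora.cssd will not restart on its own"
```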

8 comments:

  1. Thanks for this guide. I just encountered this issue and your solution helped me solved the problem. Thanks.

  2. thx for solution... what is the reason behind this problem? can u pls explain?

  3. Thanks. Just resolved my DR cluster issue with this..

  4. Thanks, It worked, We have only 2 nodes, Restarting other node fixed the issue but didn't purpose of having High Availability is gone.

  5. Hi I have same issue..hbt ips are pinging each other server and services got rebooted but no luck... any other way to resolve?
