ConfD User Community

Debugging HA failures


#1

Hello folks,
I have a setup with one master and one standby ConfD node in synchronous replication mode.
In general, replication works, but for some kinds of configuration data it breaks with the error below in the logs:

<INFO> 7-Jul-2016::08:54:57.369 standby confd[6541]: confd HA_INFO_NOMASTER: ERR_HA_CLOSED
<INFO> 7-Jul-2016::08:54:57.377 standby confd[6541]: confd HA_INFO_IS_NONE
"Internal error: Component terminated (application_controller) ({application_terminated,cdb,shutdown})\n"

I am not getting much information related to this by enabling CONFD_TRACE or CONFD_PROTO_TRACE.

Is there any configuration option that I can set to get more debug information about such failures?

Thanks.


#2

As described in the ConfD UG Chapter 12.9, High Availability - Cluster Events, the CONFD_HA_INFO_NOMASTER message indicates that a slave node has lost contact with the master and is now in HA state CONFD_HA_STATE_NONE. This can happen either because the socket was closed or because the slave has not received any live ticks from the master.
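If it helps while digging, here is a quick Python sketch (my own throwaway helper, not part of any ConfD tooling) that pulls the HA cluster-event lines out of confd.log so you can see exactly when the slave dropped to state NONE and why. The pattern is based on the log lines quoted above; adjust it if your log prefix differs.

```python
import re

# Matches HA cluster-event lines as they appear in confd.log, e.g.:
#   <INFO> 7-Jul-2016::08:54:57.369 standby confd[6541]: confd HA_INFO_NOMASTER: ERR_HA_CLOSED
# Group 1 is the event name, group 2 the optional error detail after the colon.
HA_EVENT_RE = re.compile(r"confd (HA_INFO_\w+)(?::\s*(\S+))?")

def find_ha_events(log_lines):
    """Return (event, detail) pairs for every HA cluster event in the log."""
    events = []
    for line in log_lines:
        m = HA_EVENT_RE.search(line)
        if m:
            events.append((m.group(1), m.group(2)))
    return events

sample = [
    "<INFO> 7-Jul-2016::08:54:57.369 standby confd[6541]: confd HA_INFO_NOMASTER: ERR_HA_CLOSED",
    "<INFO> 7-Jul-2016::08:54:57.377 standby confd[6541]: confd HA_INFO_IS_NONE",
]
print(find_ha_events(sample))
# [('HA_INFO_NOMASTER', 'ERR_HA_CLOSED'), ('HA_INFO_IS_NONE', None)]
```

The supported way to get these events programmatically is the events socket in the C API (the CONFD_NOTIF_HA_INFO notification class, see confd_events(3)); the script above is just for post-mortem log reading.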


#3

Thanks Waitai.
As I said, the error occurs only for a particular set of configuration items. In standalone mode, there is no issue when that configuration is set. It's only in HA mode that the master, or sometimes the standby, fails. For other types of configuration I do not see this issue, thereby ruling out the 'live ticks' explanation.

No amount of tracing is currently enough to debug this issue, as no relevant error message is being logged (even if it were the live-ticks issue).


#4

Hi arunmu,
Can you describe for what type of configuration items you see this error?
Thanks,
Conny


#5

Hi Conny,

Pasting the devel log from when the supervisor termination occurred:
12-Jul-2016::11:23:24.265 work-vm confd[2137]: confd commit progress db=running usid=16 thandle=261: entering validate phase for running…
12-Jul-2016::11:23:24.266 work-vm confd[2137]: confd commit progress db=running usid=16 thandle=261: validate: grabbing transaction lock…
12-Jul-2016::11:23:24.266 work-vm confd[2137]: confd commit progress db=running usid=16 thandle=261: ok
12-Jul-2016::11:23:24.266 work-vm confd[2137]: confd commit progress db=running usid=16 thandle=261: validate: creating rollback file…
12-Jul-2016::11:23:24.291 work-vm confd[2137]: confd commit progress db=running usid=16 thandle=261: ok
12-Jul-2016::11:23:24.292 work-vm confd[2137]: confd commit progress db=running usid=16 thandle=261: validate: run transforms and transaction hooks…
12-Jul-2016::11:23:24.292 work-vm confd[2137]: confd commit progress db=running usid=16 thandle=261: ok
12-Jul-2016::11:23:24.292 work-vm confd[2137]: confd commit progress db=running usid=16 thandle=261: validate: pre validate…
12-Jul-2016::11:23:24.292 work-vm confd[2137]: confd commit progress db=running usid=16 thandle=261: ok
12-Jul-2016::11:23:24.292 work-vm confd[2137]: confd commit progress db=running usid=16 thandle=261: validate: run validation over the change set…
12-Jul-2016::11:23:24.294 work-vm confd[2137]: confd commit progress db=running usid=16 thandle=261: validate: validation over the change set done
12-Jul-2016::11:23:24.294 work-vm confd[2137]: confd commit progress db=running usid=16 thandle=261: validate: run dependency-triggered validation…
12-Jul-2016::11:23:24.295 work-vm confd[2137]: confd commit progress db=running usid=16 thandle=261: validate: dependency-triggered validation done
12-Jul-2016::11:23:24.295 work-vm confd[2137]: confd commit progress db=running usid=16 thandle=261: entering write phase for running…
12-Jul-2016::11:23:24.295 work-vm confd[2137]: confd commit progress db=running usid=16 thandle=261: cdb: write_start
12-Jul-2016::11:23:24.295 work-vm confd[2137]: devel-cdb distributing last chunk of transaction changeset (#1, 1399 bytes in total)
12-Jul-2016::11:23:24.297 work-vm confd[2137]: confd commit progress db=running usid=16 thandle=261: entering prepare phase for running…
12-Jul-2016::11:23:24.297 work-vm confd[2137]: devel-cdb distributing write_done to slaves
12-Jul-2016::11:23:24.297 work-vm confd[2137]: confd commit progress db=running usid=16 thandle=261: cdb: prepare
12-Jul-2016::11:23:24.297 work-vm confd[2137]: confd commit progress db=running usid=16 thandle=261: cdb: delivering prepare subscription notifications at prio 0
12-Jul-2016::11:23:24.319 work-vm confd[2137]: confd commit progress db=running usid=16 thandle=261: cdb: all prepare subscription notifications acknowledged
12-Jul-2016::11:23:24.320 work-vm confd[2137]: confd commit progress db=running usid=16 thandle=261: entering commit phase for running…
12-Jul-2016::11:23:24.320 work-vm confd[2137]: confd commit progress db=running usid=16 thandle=261: cdb: commit
12-Jul-2016::11:23:24.320 work-vm confd[2137]: devel-cdb distributing commit to slaves
12-Jul-2016::11:23:24.322 work-vm confd[2137]: confd commit progress db=running usid=16 thandle=261: entering abort phase for running…
12-Jul-2016::11:23:24.323 work-vm confd[2137]: confd Socket connection from "cdb client" rejected - probably due to ConfD being in phase0 or service not enabled in config file
12-Jul-2016::11:23:24.324 work-vm confd[2137]: confd commit progress db=running usid=16 thandle=261: cdb: abort
12-Jul-2016::11:23:24.338 work-vm confd[2137]: devel-c close_trans request daemon id: 0 session id: 259
12-Jul-2016::11:23:24.339 work-vm confd[2137]: devel-c close_trans succeeded daemon id: 0 session id: 259

I do not see any particular config node in the error. But the log does show CDB aborting, and it also says ConfD is in phase 0, even though it was transitioned from phase 0 to phase 2 during bootup. Note that the first few configuration commits go through without any issue.
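To make the phase sequence easier to see, I condensed the commit-progress lines with this small Python helper (nothing ConfD-specific, just a regex over the devel log above). A healthy transaction ends at the commit phase, so a transaction handle whose history ends in abort pinpoints the failing commit:

```python
import re

# Matches "thandle=<N>: entering <phase> phase" lines from the devel log.
PHASE_RE = re.compile(r"thandle=(\d+): entering (\w+) phase")

def phase_history(log_lines):
    """Map each transaction handle to the ordered list of phases it entered."""
    history = {}
    for line in log_lines:
        m = PHASE_RE.search(line)
        if m:
            history.setdefault(m.group(1), []).append(m.group(2))
    return history

# Condensed from the devel log pasted above.
devel_log = [
    "... thandle=261: entering validate phase for running...",
    "... thandle=261: entering write phase for running...",
    "... thandle=261: entering prepare phase for running...",
    "... thandle=261: entering commit phase for running...",
    "... thandle=261: entering abort phase for running...",
]
print(phase_history(devel_log))
# {'261': ['validate', 'write', 'prepare', 'commit', 'abort']}
```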


#6

Hi Arun,

We need to try to reproduce your issue to be able to figure out what is going on.

Can you show us the YANG for the configuration changes and describe the changes that result in the error?

It would also be helpful if you could provide these logs from the slave / standby node:

  • ConfD
  • Error
  • Developer (you need to set developerLogLevel to trace)
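Something like the following fragment in confd.conf should enable all three (I'm writing this from memory of the confd.conf(5) man page, so please double-check the element names against your ConfD version; the file names are just examples):

```xml
<!-- goes inside <confdConfig> in confd.conf -->
<logs>
  <confdLog>
    <enabled>true</enabled>
    <file><enabled>true</enabled><name>./confd.log</name></file>
  </confdLog>
  <errorLog>
    <enabled>true</enabled>
    <filename>./confderr.log</filename>
  </errorLog>
  <developerLog>
    <enabled>true</enabled>
    <file><enabled>true</enabled><name>./devel.log</name></file>
  </developerLog>
  <developerLogLevel>trace</developerLogLevel>
</logs>
```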

Thanks


#7

Hi Conny,
We identified the main issue as a problem in our HA manager (HAMgr) when using an NFS mount for CDB, which resulted in a race between the master and the standby. Correcting that seems to have resolved the issue.

A big thanks for your time.


#8

It looks like we are facing a similar issue. Can you briefly describe how you resolved it?

rgds
Balaji