Confd crashes after aborted commit

Hello all,
we are running into an issue with confd 7.7.6.
Sometimes, after commit with GUI (it is using JSON API), the commit fails (error: db is locked). If we then commit again (any commit), confd crashes. After it restarts, it works OK.
Is there a way to see what could be the issue after first failed commit that causes second commit to crash confd?

If you are asking about how to check for locks, you can inspect the output of confd --status or use MAAPI functions that deal with locks. If you are asking about the crash, if you enable the error log (/confdConfig/logs/errorLog in confd.conf), there might be something in it; if it is, confd --printlog might be of a limited help, see man confd.

Hi,

I enabled and checked all the logs and did not found anything useful.
After further testing, these steps reproduce the error:
Prerequisit: have application that spends a lot of time reading from confd outside of callbacks (a couple of seconds).

Steps:

  • modify something that will trigger the application
  • start confirmed commit
  • have the application read from confd
  • during readout, confirm the commit

The confirm fails with “database locked by session none”.
In this state, readout and normal commit work, but if we issue another confirmed commit, confd closes all the connections and resets itself.

Some more information - confd crashes when it is applying transaction (got it from progress trace):

2022 Dec 5 04:09:38.836 authpriv.info confd[1836]: audit user: admin/30 CLI ‘commit confirmed 10’
2022 Dec 5 04:09:38.836 daemon.debug confd[1836]: devel-c chk_cmd_access request daemon id: 4
2022 Dec 5 04:09:38.836 daemon.debug confd[1836]: devel-c chk_cmd_access succeeded daemon id: 4
2022 Dec 5 04:09:38.836 daemon.debug confd[1836]: confd progress usid=30 tid=82458 datastore=candidate context=cli applying transaction
2022 Dec 5 04:09:38.836 daemon.debug confd[1836]: confd progress usid=30 tid=82458 datastore=candidate context=cli entering validate phase
2022 Dec 5 04:09:38.836 daemon.debug confd[1836]: confd progress usid=30 tid=82458 datastore=candidate context=cli grabbing transaction lock
2022 Dec 5 04:09:38.836 daemon.debug confd[1836]: confd progress usid=30 tid=82458 datastore=candidate context=cli grabbing transaction lock (0.000 s)
2022 Dec 5 04:09:38.836 daemon.debug confd[1836]: devel-c new_trans request daemon id: 0 thandle: 82458
2022 Dec 5 04:09:38.984 daemon.debug confd[1836]: devel-c new_trans succeeded daemon id: 0 session id: 56 worker id: 5
2022 Dec 5 04:09:38.984 daemon.debug confd[1836]: confd progress usid=30 tid=82458 datastore=candidate context=cli leaving validate phase (0.016 s)
2022 Dec 5 04:09:38.984 daemon.debug confd[1836]: confd progress usid=30 tid=82458 datastore=candidate context=cli entering write-start phase
2022 Dec 5 04:09:38.984 daemon.debug confd[1836]: confd progress usid=30 tid=82458 datastore=candidate context=cli leaving write-start phase (0.026 s)
2022 Dec 5 04:09:38.984 daemon.debug confd[1836]: confd progress usid=30 tid=82458 datastore=candidate context=cli entering prepare phase
2022 Dec 5 04:09:38.984 daemon.debug confd[1836]: confd progress usid=30 tid=82458 datastore=candidate context=cli subsystem=com.metaswitch.ymmconfd.DataProvider prepare
2022 Dec 5 04:09:38.984 daemon.debug confd[1836]: devel-c prepare request daemon id: 0 session id: 56
2022 Dec 5 04:09:38.984 daemon.debug-confd[1836]: devel-c prepare succeeded daemon id: 0 session id: 56
2022 Dec 5 04:09:38.984 daemon.debug-confd[1836]: confd progress usid=30 tid=82458 datastore=candidate context=cli leaving prepare phase (0.081 s)
2022 Dec 5 04:09:38.984 daemon.debug-confd[1836]: confd progress usid=30 tid=82458 datastore=candidate context=cli entering commit phase
2022 Dec 5 04:09:39.134 daemon.debug-confd[1836]: confd progress usid=30 tid=82458 datastore=candidate context=cli subsystem=com.metaswitch.ymmconfd.DataProvider commit
2022 Dec 5 04:09:39.134 daemon.debug-confd[1836]: devel-c commit request daemon id: 0 session id: 56
2022 Dec 5 04:09:39.134 daemon.debug-confd[1836]: devel-c commit succeeded daemon id: 0 session id: 56
2022 Dec 5 04:09:39.134 daemon.debug-confd[1836]: confd progress usid=30 tid=82458 datastore=candidate context=cli leaving commit phase (0.106 s)
2022 Dec 5 04:09:39.134 daemon.debug-confd[1836]: confd progress usid=30 tid=82458 datastore=candidate context=cli applying transaction (0.231 s)

next printouts are from daemons that are reporting “ConfD closed connection”.

I’m afraid we can’t help you here. The log seems ok, I am not able to spot any suspicious entries. Since it looks like a Confd bug is involved, do you think you can create a small self-contained example that would reproduce the problem for ConfD developers?

We are working in that direction, unfortunately it seems that this is triggered by combination of requests from different daemons, so it is difficult to track down the ‘root’ sequence that triggers this.

Some additional observations:

  • it happened that we got the database in a state where we had read lock on running database without any client association. This caused ‘database locked by session none’ error, but did not cause confd to reset. Is there a sequence of commands to confd that would cause the lock to stay after session ends?
    Aditionally, is it necessary to call cdb_end_session() before calling cdb_close()? Userguide states that cdb_end_session() should be called before cdb_close(), but it also states that “it is very important that we remember to either cdb_end_session() or cdb_close() once we have read what we wish to read”, which would imply that it is not necessary to call cdb_end_session().
  • Is there a minimum time that should pass after ‘commit confirmed x’ is issued, before we call ‘commit’ to confirm? Currently the error is ‘best’ reproduced if we confirm immediately after commiting.
  • Should there be some wait time for confirmation commit, if the database is locked? All timeouts in confd configuration are set to non - zero value, but if the database is ‘locked by session none’, it fails immediately?

We recreated the bug with these steps:

  • remove all daemons that we can and the system still runs
  • create a new yang with single container with a single leaf that is not used anywhere
  • use netconf to readout unrelated data from confd in a loop (readout every 0.1 seconds)
  • use a script to change the value in new container using confirmed commit

After some time of this running, we get this output:

(config)# commit confirmed 1
Warning: The configuration will be reverted if you exit the CLI without
performing the commit operation within 1 minutes.
(config)# commit
Aborted: the configuration database is locked by session none

If we continue, the next confirmed commit fails:

(config)# confdtest testVar 2
(config)# commit confirmed 1
Error: internal error
Connection to localhost closed.

When you see something like - “the configuration database is locked by session none” - it is typically an internal “job” that is holding a lock on CDB. For example:

  • You use HA and an out of sync/new standby node connects to the active node (aka slave/master) it will trigger a sync.
  • You have a CDB subscriber that’s slow to respond in some cases.
  • You have a southbound client (CDB, MAAPI etc) that for some reason holds a lock on CDB.
  • Etc.

You should not see a crash when a transaction is aborted, but you can try to increase the confdConfig/commitRetryTimeout in confd.conf for debug purposes to try to find what cause the delay, especially if you have triggered a deadlock

For example, if you try with commitRetryTimeout PT600S and you still get a “the configuration database is locked by session none” response then there could be a deadlock due to, for example, a CDB subscriber application triggering itself by committing something. If you only increase the commitRetryTimeout to PT10S and you no longer observe the issue, you check the developer log (developerLogLevel trace) for more detailed information. Also, as suggested by @mvf , the errorLog will likely be helpful.

Posting this here, maybe it will help someone down the line :grinning:.

We were able to replicate the behaviour with tail-f example ‘modifications’:

  • we added call to ‘external callback function’ in ‘process_modifications’ function that creates a new thread that executes another function and detached it so that ‘callback’ can finish and subscription ends
  • in new thread, we open a socket to confd, start session, wait two seconds and thend end session and close socket

Then the failure can be replicated:

  • start exampe confd with ‘make all start’
  • connect to confd in another cli with ‘make cli-c’
  • modify something in ‘config exclusive’ mode
  • issue ‘commit confirmed 10; commit’ so that ‘commit’ executes when thread has session opened and db is locked
  • you get confirmation for ‘commit confirmed 10’ and ‘Aborted: the configuration database is locked by session none’ error for ‘commit’

After specified timeout, db does not revert.
After this the modification you made looks permanent, you can also do other modifications and do normal commits, everything looks OK.
But if you issue another ‘commit confirmed 10’ (or some other timeout), you get internal error. And all your modifications are gone.