Hello all,
we are running into an issue with confd 7.7.6.
Sometimes a commit from the GUI (which uses the JSON API) fails (error: db is locked). If we then commit again (any commit), confd crashes. After it restarts, it works OK.
Is there a way to see what goes wrong after the first failed commit that causes the second commit to crash confd?
If you are asking how to check for locks, you can inspect the output of confd --status or use the MAAPI functions that deal with locks. If you are asking about the crash: if you enable the error log (/confdConfig/logs/errorLog in confd.conf), there might be something in it; if there is, confd --printlog might be of limited help, see man confd.
I enabled and checked all the logs and did not find anything useful.
After further testing, these steps reproduce the error:
Prerequisite: have an application that spends a lot of time (a couple of seconds) reading from confd outside of callbacks.
Steps:
modify something that will trigger the application
start confirmed commit
have the application read from confd
during readout, confirm the commit
The confirm fails with “database locked by session none”.
In this state, readout and normal commit work, but if we issue another confirmed commit, confd closes all the connections and resets itself.
I’m afraid we can’t help you here. The log seems OK; I am not able to spot any suspicious entries. Since it looks like a ConfD bug is involved, do you think you can create a small self-contained example that would reproduce the problem for ConfD developers?
We are working in that direction. Unfortunately, it seems that this is triggered by a combination of requests from different daemons, so it is difficult to track down the ‘root’ sequence that triggers it.
Some additional observations:
We once got the database into a state where there was a read lock on the running database without any client association. This caused the ‘database locked by session none’ error, but did not cause confd to reset. Is there a sequence of commands to confd that would cause the lock to stay after the session ends?
Additionally, is it necessary to call cdb_end_session() before calling cdb_close()? The User Guide states that cdb_end_session() should be called before cdb_close(), but it also states that “it is very important that we remember to either cdb_end_session() or cdb_close() once we have read what we wish to read”, which would imply that it is not necessary to call cdb_end_session().
Is there a minimum time that should pass after ‘commit confirmed x’ is issued, before we call ‘commit’ to confirm? Currently the error is ‘best’ reproduced if we confirm immediately after committing.
Should there be some wait time for the confirming commit if the database is locked? All timeouts in the confd configuration are set to non-zero values, but if the database is ‘locked by session none’, the commit fails immediately.
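While the cause is unclear, one client-side workaround is to retry the confirming commit a few times when it is rejected because of a lock. Below is a minimal sketch of that idea, not anything ConfD provides: do_commit is a hypothetical callable supplied by the caller (for example a wrapper around whatever CLI, NETCONF or JSON request sends the commit), and DatabaseLockedError is a made-up exception that the wrapper is assumed to raise when the reply contains the “database is locked” error.

```python
import time


class DatabaseLockedError(Exception):
    """Hypothetical: raised by do_commit when ConfD reports a lock."""


def commit_with_retry(do_commit, attempts=5, delay=0.5, sleep=time.sleep):
    """Call do_commit() until it succeeds or the attempts run out.

    do_commit is a caller-supplied callable (an assumption of this sketch)
    that performs the confirming commit and raises DatabaseLockedError
    when the database is locked. Returns the number of attempts used.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            do_commit()
            return attempt
        except DatabaseLockedError:
            if attempt == attempts:
                # Still locked after the last try: probably not a
                # short-lived reader, so give up and report the error.
                raise
            sleep(delay)
```

If the lock is held by a reader that finishes within a couple of seconds, a retry like this should get through. If the lock is left behind with no owner, as in the observation above, every attempt fails and the last error is re-raised, which at least separates the two cases.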
remove all daemons that we can and the system still runs
create a new yang with single container with a single leaf that is not used anywhere
use netconf to readout unrelated data from confd in a loop (readout every 0.1 seconds)
use a script to change the value in new container using confirmed commit
After some time of this running, we get this output:
(config)# commit confirmed 1
Warning: The configuration will be reverted if you exit the CLI without
performing the commit operation within 1 minutes.
(config)# commit
Aborted: the configuration database is locked by session none
When you see something like - “the configuration database is locked by session none” - it is typically an internal “job” that is holding a lock on CDB. For example:
You use HA and an out-of-sync or new standby node (aka slave) connects to the active node (aka master), which triggers a sync.
You have a CDB subscriber that’s slow to respond in some cases.
You have a southbound client (CDB, MAAPI etc) that for some reason holds a lock on CDB.
Etc.
You should not see a crash when a transaction is aborted, but for debug purposes you can try to increase confdConfig/commitRetryTimeout in confd.conf to find out what causes the delay, especially if you have triggered a deadlock.
For example, if you try with commitRetryTimeout PT600S and you still get a “the configuration database is locked by session none” response, then there could be a deadlock due to, for example, a CDB subscriber application triggering itself by committing something. If you only increase the commitRetryTimeout to PT10S and you no longer observe the issue, you can check the developer log (developerLogLevel trace) for more detailed information. Also, as suggested by @mvf, the errorLog will likely be helpful.
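For reference, the settings mentioned above could look roughly like this in confd.conf. This is only a sketch showing the relevant elements, to be merged into your existing file; the log file names are placeholders, and the element layout should be checked against the confd.conf man page for your ConfD version.

```xml
<confdConfig xmlns="http://tail-f.com/ns/confd_cfg/1.0">
  <!-- Debug only: wait up to 10 minutes for the lock before aborting -->
  <commitRetryTimeout>PT600S</commitRetryTimeout>
  <logs>
    <developerLog>
      <enabled>true</enabled>
      <file>
        <enabled>true</enabled>
        <name>./devel.log</name>          <!-- placeholder path -->
      </file>
    </developerLog>
    <developerLogLevel>trace</developerLogLevel>
    <errorLog>
      <enabled>true</enabled>
      <filename>./confderr.log</filename> <!-- placeholder path -->
    </errorLog>
  </logs>
</confdConfig>
```

The error log is written in an internal format, so it has to be read with confd --printlog as mentioned earlier in the thread.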
Posting this here, maybe it will help someone down the line.
We were able to replicate the behaviour with tail-f example ‘modifications’:
we added a call to an ‘external callback function’ in the ‘process_modifications’ function; it creates a new thread that executes another function, and detaches it so that the ‘callback’ can finish and the subscription ends
in the new thread, we open a socket to confd, start a session, wait two seconds, and then end the session and close the socket
Then the failure can be replicated:
start the example confd with ‘make all start’
connect to confd in another cli with ‘make cli-c’
modify something in ‘config exclusive’ mode
issue ‘commit confirmed 10; commit’ so that ‘commit’ executes while the thread has its session open and the db is locked
you get confirmation for ‘commit confirmed 10’ and ‘Aborted: the configuration database is locked by session none’ error for ‘commit’
After the specified timeout, the db does not revert.
After this, the modification you made looks permanent; you can also make other modifications and do normal commits, and everything looks OK.
But if you issue another ‘commit confirmed 10’ (or use some other timeout), you get an internal error, and all your modifications are gone.