confd_ha_beslave() randomly fails

Please be precise - AFAIK the only ConfD Basic version that has been released from the 5.3 branch is 5.3.2. Is that what you are using? On both machines?

No, it is not the timeout, nor a firewall issue I believe, but rather as I wrote, the connection is closed because the HA subsystem on the master crashes - i.e. a bug.

It’s not relevant for the ConfD problem. If you don’t care about the output resulting from assert() calls in programs that you run, you can ignore it.

No, this is definitely a bug - but I need the exact versions that you use to evaluate it properly. I’ll probably ask you for some additional logging when I have that info.

confd --version gives 5.3 (on both machines).
We are in the middle of the transition from 5.3 to 5.4 basic, so for the moment I am still testing my functionality with 5.3.

I checked this issue with 5.4 basic on the same setup and it works :open_mouth:
Is there a bug fix related to this in 5.4 basic?

Thanks

OK, thanks.

The bug I am thinking of is not fixed in 5.4-basic (it is fixed in 6.0-basic) - so it seems you have found another bug, or at least a variant of the fixed one, and it would be good to shake it out. Please do the following on both machines with your 5.3 setup:

  1. Before starting the ConfD node, make sure the error log is enabled in confd.conf - i.e. in the <logs> section, include

    <errorLog>
      <enabled>true</enabled>
      <filename>/tmp/confderr.log</filename>
    </errorLog>

I believe such a clause is included in some confd.conf sample, but with ‘enabled’ set to “false” - make sure it is “true” if so.

  2. Run your test, producing the CONFD_ERR_HA_CLOSED and CONFD_ERR_BADSTATE errors (the sketch after this list shows where those errors surface on the application side).

  3. Take a “debug dump”, i.e. run

    confd --debug-dump /tmp/confd.dmp

  4. Tar/zip up the /tmp/confderr.log.* and /tmp/confd.dmp files from each machine, and attach them in an e-mail to per@tail-f.com - don’t post them here.
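
For reference, a minimal sketch of where those error codes surface on the application side - not your code, just the general shape of it. The node ids (“node0”, “node1”), the 192.168.1.1 master address and the “secret” HA token below are placeholders for whatever your application actually uses; the point is that when confd_ha_bemaster()/confd_ha_beslave() fail, confd_errno and confd_lasterr() give the detail to correlate with the error log and the debug dump:

    /* ha_diag.c - minimal sketch, not a complete application:
     * connect to the ConfD HA subsystem and report confd_errno /
     * confd_lasterr() when confd_ha_bemaster()/confd_ha_beslave() fail,
     * e.g. with CONFD_ERR_HA_CLOSED or CONFD_ERR_BADSTATE. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>

    #include <confd_lib.h>
    #include <confd_ha.h>

    static void check(const char *call, int ret)
    {
        if (ret != CONFD_OK) {
            fprintf(stderr, "%s failed: confd_errno=%d (%s)\n",
                    call, confd_errno, confd_lasterr());
            exit(1);
        }
    }

    int main(int argc, char **argv)
    {
        struct sockaddr_in addr;
        confd_value_t mynodeid;
        int sock;

        confd_init("ha_diag", stderr, CONFD_TRACE);

        memset(&addr, 0, sizeof(addr));
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = inet_addr("127.0.0.1");
        addr.sin_port = htons(CONFD_PORT);

        if ((sock = socket(PF_INET, SOCK_STREAM, 0)) < 0) {
            perror("socket");
            exit(1);
        }
        /* the HA token ("secret" here) must be the same on all nodes */
        check("confd_ha_connect",
              confd_ha_connect(sock, (struct sockaddr *)&addr,
                               sizeof(addr), "secret"));

        CONFD_SET_BUF(&mynodeid, (unsigned char *)"node0", 5);

        if (argc > 1 && strcmp(argv[1], "slave") == 0) {
            /* placeholder master: "node1" at 192.168.1.1 */
            struct confd_ha_node master;
            CONFD_SET_BUF(&master.nodeid, (unsigned char *)"node1", 5);
            master.af = AF_INET;
            inet_pton(AF_INET, "192.168.1.1", &master.addr.ip4);
            check("confd_ha_beslave",
                  confd_ha_beslave(sock, &mynodeid, &master, 1));
        } else {
            check("confd_ha_bemaster", confd_ha_bemaster(sock, &mynodeid));
        }
        printf("HA role change OK\n");
        return 0;
    }

When running something like this (or the corresponding calls in your own application) with the error log enabled, the confd_lasterr() text from the failing call is useful to match against the entries in confderr.log.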

@yogevc -

Answering before you ask :wink: … ConfD Basic 6.0 is in the process of being released and hopefully should be available on the DevNet download server soon.

I don’t know what to say :sweat:
I tried to reproduce the error but I can’t.

After I succeeded with 5.4 basic, I wanted to clean up everything I had done, so I compiled my schema once again and loaded my initial configuration (in the process of understanding what happened I had to delete some files; for example, I got an error about CDB corruption, and the solution was to delete confd/var/confd/cdb/*.cdb).

Now, I enabled the errorLog and started ConfD, but both confd_ha_bemaster() and confd_ha_beslave() work just fine…
I can’t reproduce the error for the moment.

This is the second time I have had this issue (I had it a couple of weeks ago and it disappeared then as well). I hope to run into it again.

Do you have any thoughts on how to reproduce it now? (Given what you said about the corrupted CDB and so on…)

Great :blush:

Thanks

I mean,
If I could get to a position where the CDB is corrupted again, maybe the error would be reproduced. How can I deliberately make that happen?

Thanks.

Be happy. :slight_smile:

Not really, but I can tell you the condition that triggered the known bug I was thinking of: a slave running version 5.1.x or earlier connecting to a master running 5.2.y or later, or vice versa. Since 5.3.2 was the first ConfD Basic release, it is not possible to trigger the bug with ConfD Basic alone - but apparently you have been using pre-Basic versions, such as 5.3.

Could it be the case that either the master or the slave in your problem setup was actually running 5.1.y or older? It’s easy to get confused when running multiple shells on multiple machines, and ‘confd --version’ actually reports a version number from the ‘confd’ shell script; it doesn’t ask the running daemon (‘confd --status’ does, though).

The corrupted CDB could perhaps be a weak indication that something like that was going on - it’s a highly unusual error unless you have been manually “messing” with the CDB files, but there might be some “large ConfD version jumps” that aren’t handled by the CDB code.

It’s not likely that having a “corrupted” CDB will trigger the problem though - as you noticed, you can’t even start ConfD in that case, far less tell it to be master or slave. If you have “messed” with the CDB files on the master while ConfD was running, it is likely to cause problems when a new slave connects, since that’s the only time except at startup that CDB actually reads the data file. But I believe those problems would occur at the slave, not the master. Anyway, for “messing”, you could try something like

    echo foo > $CONFD_DIR/var/confd/cdb/A.cdb

(this will obviously lose all your data!)