confd_ha_beslave() randomly fails

After setting a master successfully on one machine, I’m trying to set a slave on another machine.
The function confd_ha_beslave() randomly fails with the following error:
confd_errno: 17 (operation in wrong state)
For each of my processes (master and slave) I make the following ConfD API calls before calling the confd_ha_* functions:

  1. confd_init()
  2. confd_load_schemas()
  3. confd_ha_connect() (with the same secret token)
  4. confd_ha_beslave() / confd_ha_bemaster()

All of the nodes have a unique name.
The ConfD state is status: started.
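For reference, this is roughly what the sequence looks like in my code - a simplified sketch based on the ha/dummy example, where the application name, secret token, node names and the master address are just placeholders:

/* Simplified sketch of my call sequence (names, token and address are placeholders). */
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

#include <confd_lib.h>
#include <confd_ha.h>

int main(void)
{
    struct sockaddr_in confd_addr;   /* local ConfD IPC address */
    struct confd_ha_node master;
    confd_value_t nid, mid;
    int s;

    /* 1. confd_init() */
    confd_init("my-ha-app", stderr, CONFD_SILENT);

    memset(&confd_addr, 0, sizeof(confd_addr));
    confd_addr.sin_family = AF_INET;
    confd_addr.sin_addr.s_addr = inet_addr("127.0.0.1");
    confd_addr.sin_port = htons(CONFD_PORT);

    /* 2. confd_load_schemas() */
    if (confd_load_schemas((struct sockaddr *)&confd_addr,
                           sizeof(struct sockaddr_in)) != CONFD_OK)
        confd_fatal("confd_load_schemas() failed\n");

    /* 3. confd_ha_connect() with the shared secret token */
    if ((s = socket(PF_INET, SOCK_STREAM, 0)) < 0)
        confd_fatal("failed to open socket\n");
    if (confd_ha_connect(s, (struct sockaddr *)&confd_addr,
                         sizeof(struct sockaddr_in), "my-secret") != CONFD_OK)
        confd_fatal("confd_ha_connect() failed\n");

    /* 4. confd_ha_beslave() - this is the call that sometimes fails */
    CONFD_SET_BUF(&nid, (unsigned char *)"node_2", strlen("node_2"));
    CONFD_SET_BUF(&mid, (unsigned char *)"node_1", strlen("node_1"));
    master.nodeid = mid;
    master.af = AF_INET;
    inet_pton(AF_INET, "10.0.0.1", &master.addr.ip4);   /* master machine's IP */
    if (confd_ha_beslave(s, &nid, &master, 1) != CONFD_OK)
        confd_fatal("confd_ha_beslave() failed: %s (%d)\n",
                    confd_lasterr(), confd_errno);

    return 0;
}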

Sometimes it works, and sometimes it doesn’t. Is there something I’m doing wrong? Do I need to verify anything else?
Please let me know if you need further information.

Thanks!

Did you set HA to be enabled in your confd.conf configuration file?

You can also take a look at $CONFD_DIR/examples.confd/ha/dummy for a simple example of using the HA APIs.

  • Greg

I did enable it in the confd.conf. Sometimes the confd_ha_beslave() works.

I built my process using the ha dummy example.

Thanks

A couple of places to check for diagnostic messages:

a) API trace output. This is turned on either by setting the “debug” parameter of confd_init() to CONFD_TRACE or by calling confd_set_debug() (see the sketch below).

b) The log files. In particular, the developer log which, if enabled in confd.conf, is in the file devel.log (or whatever file name you set in confd.conf for it). The ConfD log (confd.log) might have something as well.
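For (a), it’s just the debug argument to confd_init(), or a later call to confd_set_debug(); a minimal sketch (the application name here is only a placeholder):

#include <stdio.h>
#include <confd_lib.h>

int main(void)
{
    /* either ask for trace output already at init time ... */
    confd_init("my-ha-app", stderr, CONFD_TRACE);

    /* ... or raise the level later, e.g. just before the failing call */
    confd_set_debug(CONFD_TRACE, stderr);
    return 0;
}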

Also, this sounds like it could be a race condition. Are you sure that confd_ha_bemaster() has completed successfully on the master ConfD daemon before calling confd_ha_beslave() on the slave ConfD daemon?

What are the reasons for getting this confd_errno 17 (operation in wrong state)?
I mean, on the ConfD side.

a) I’ll check it out; for the moment it’s CONFD_SILENT.

b) developerLog, confdLog and auditLog are enabled. I can see their output in the syslog. But there’s nothing when the confd_ha_beslave() fails…

Last one: I’m sure the other machine has finished becoming master with confd_ha_bemaster(). Also, when I check the ConfD status on the master machine, I can see the master state.

Thanks.

The confd_lib_ha(3) manual page says about confd_ha_beslave():

The function will fail with error CONFD_ERR_BADSTATE if ConfD is still in start phase 0.

(The same is true for confd_ha_bemaster().) Could this be the problem in your case? E.g. if you fire up your HA controller application before the ‘confd …’ startup command has completed. In the dummy example, it is all done sequentially in the Makefile.
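If it does turn out to be a start-phase race, one way to make your controller robust against it is to retry on CONFD_ERR_BADSTATE instead of asserting. A rough sketch, assuming ‘s’, ‘nid’ and ‘master’ are set up the same way as in the dummy example’s ctrl.c:

#include <unistd.h>
#include <confd_lib.h>
#include <confd_ha.h>

/* Rough sketch: retry while ConfD is still in start phase 0. */
static void beslave_with_retry(int s, confd_value_t *nid,
                               struct confd_ha_node *master)
{
    int tries = 0;

    while (confd_ha_beslave(s, nid, master, 1) != CONFD_OK) {
        if (confd_errno == CONFD_ERR_BADSTATE && ++tries < 30) {
            sleep(1);   /* ConfD has not reached start phase 1 yet */
            continue;
        }
        confd_fatal("confd_ha_beslave() failed: %s (%d)\n",
                    confd_lasterr(), confd_errno);
    }
}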

I don’t think this is the case here because, as I said earlier, I get this state using confd --status.

Thanks

OK, but do you get this result from ‘confd --status’ before you do the confd_ha_beslave() call? ConfD will continue the startup regardless of the rejected call, and if you check afterwards, it may well have completed the startup at that point. Could you share some details about how you start ConfD and your application?

Also, just to be clear: it is the state of the slave-to-be that is relevant here. If you are running a setup like in the dummy example, with all ConfD nodes on the same host, and non-default values for /confdConfig/confdIpcAddress/port in confd.conf for the slaves, a plain ‘confd --status’ will report the status of the master. To query the slaves, you need to use the environment variable $CONFD_IPC_PORT. E.g. in the example, you would do

env CONFD_IPC_PORT=4575 confd --status

to get the status of the ‘node1’ slave.

I get this from the master machine, before trying confd_ha_beslave():

cluster status:
mode: master
node id: node_1
connected slaves: 0

Only after this is true do I try to beslave.
The ConfD state on the slave machine is (before trying to be slave):

status: started
cluster status:
mode: none
node id: NOT SET

And it stays like this after trying to be slave, because the API call fails.

The nodes are on different computers, so I check the default port.

Thanks

What information do you need?

This is what I get from confd_init(…, CONFD_PROTO_TRACE):

TRACE Connected (ha) to ConfD

9-Jul-2015::06:36:06.580 32737/7f168db78740/12 SEND {2,#Bin<ha_node_2>,{#Bin<ha_node_1>,{10,169,64,206}},1}
9-Jul-2015::06:36:06.580 32737/7f168db78740/12 GOT {error,17}

This is clearly the state of the master. As I wrote, it is the state of the slave-to-be that is relevant: a node can neither become slave nor master before it has reached start phase 1. The state of the master can never cause the CONFD_ERR_BADSTATE on a slave - if the master hasn’t finished becoming master, you will just get an error about connection failure, since the master isn’t listening on the HA port.

Yeah, I just realized that I didn’t say anything about the ConfD state on the slave machine, so I edited my reply about this issue; please see above…

I have also noticed that, on the slave machine, if I try to become master there, it also fails with:

TRACE Connected (ha) to ConfD

9-Jul-2015::07:06:54.040 32800/7f1e5fb57740/12 SEND {1,#Bin< node_2>}
9-Jul-2015::07:06:54.040 32800/7f1e5fb57740/12 GOT {error,17}

And as I said before, the ConfD status before calling confd_ha_bemaster() is:

status: started
cluster status:
mode: none
node id: NOT SET

For starters, the complete command you give when starting ConfD, and its relation to the status check you are doing and the startup of your application. If you are doing this manually in the *nix shell, provide the relevant snippet of your session. If you are doing it all e.g. from a script, provide that.

I hate that feature :-), it makes it very difficult to carry on a conversation when the parties go back and change what they said - not to mention that the resulting thread becomes quite confusing.

Anyway, given the additional information -

a) What version of ConfD are you running?

b) If you stop and start the slave-to-be, and do nothing else before telling it to be slave, do you still get the error from confd_ha_beslave()? Or does it only occur when the node has been up and running for a while, perhaps with master/slave changes?

I understand your negative thoughts about the editing feature : )

Ok, I simplified it: I’m using the code in examples.confd/ha/dummy/ctrl.c on each machine in order to simulate my process (which is supposed to do HAFW), and I get the same error. But now I understand why I get confd_errno 17 - it happens after I first get confd_errno 26. I don’t know why confd_errno 26 happens.

I’ll write what happened:

  1. Setup: two machines, machine_1 (10.168.251.16) and machine_2 (10.168.251.18).
  2. On both of the machines the confd.conf:
    2.1) Enables confdLog, auditLog and developerLog to syslog.
    2.2) Enables high availability

<ha>
  <enabled>true</enabled>
  <ip>0.0.0.0</ip>
  <port>4569</port>
</ha>

  3. ConfD is stopped on both of the machines.
  4. I started it using the command confd (with the default confd.conf location, i.e. the file I changed). Everything looks fine, and the log line “ConfD started” appears in syslog on both machines.
  5. confd --status on both of the machines:

status: started
cluster status:
mode: none
node id: NOT SET

  6. On machine_1 (master), I do: ./ctrl master node_1
    Output on stderr:

TRACE Connected (ha) to ConfD

9-Jul-2015::09:26:28.279 22614/7f128b422740/3 SEND {1,#Bin<node_1>}
9-Jul-2015::09:26:28.280 22614/7f128b422740/3 GOT ok

Output on syslog:

confd[31598]: confd HA_INFO_IS_MASTER

ConfD state:

status: started
cluster status:
mode: master
node id: node_1
connected slaves: 0

  7. On machine_2 (slave), I do: ./ctrl slave node_2 node_1 10.168.251.16
    Output on stderr:

TRACE Connected (ha) to ConfD

9-Jul-2015::09:28:22.406 22618/7f0447f13740/3 SEND {2,#Bin<node_2>,{#Bin<node_1>,{10,168,251,16}},1}
9-Jul-2015::09:28:28.807 22618/7f0447f13740/3 GOT {error,26}
not good: error: No such file or directory

What is this “not good: error: No such file or directory”? Is it a ConfD internal error?

Output on syslog: nothing
Output on machine_1’s syslog:

confd[31598]: devel-cdb New slave transaction id 1436-367252-832646@ha_node_1 equals master - configuration db is up to date
confd[31598]: confd HA_INFO_IS_NONE

So, for some reason, the master is now dead.
ConfD statuses for both of the machines:

status: started
cluster status:
mode: none
node id: NOT SET

From now on, for every confd_ha_bemaster()/confd_ha_beslave() I try to call, I get confd_errno 17:

  8. I’ll try to set the master again with the command: ./ctrl master node_1
    Output on stderr:

TRACE Connected (ha) to ConfD

9-Jul-2015::09:34:08.051 22646/7f733781b740/3 SEND {1,#Bin<node_1>}
9-Jul-2015::09:34:08.053 22646/7f733781b740/3 GOT {error,17}
void bemaster(char**): Assertion `(confd_ha_bemaster(s, &nodeid)) == 0’ failed.
Aborted (core dumped)

No output in syslog.

  9. For confd_ha_beslave: ./ctrl slave node_1 node_2 10.168.251.18
    (I know I didn’t set machine_2 to be master; it doesn’t matter here)

TRACE Connected (ha) to ConfD

9-Jul-2015::09:43:54.043 22696/7fa1570be740/3 SEND {2,#Bin<node_1>,{#Bin<node_2>,{10,168,251,18}},1}
9-Jul-2015::09:43:54.044 22696/7fa1570be740/3 GOT {error,17}
not good: error: No such file or directory

No output in syslog.

What is this “not good” error? Is it a ConfD internal error?

→ edit : )
I want to remind you that this is kind of a random error.
I have another setup (another two machines) where the same commands work.

Thanks.

Very good description, thanks - please also report the ConfD version you are using, and please check it on both machines in this case. As you may realize, I’m looking for the possibility of a known bug here - it is present in most of the released versions of ConfD Basic, but I can’t see how it can be triggered there, i.e. it may be a different bug with similar symptoms. More comments inline below.

You can find all the confd_errno codes in the ERRORS section of the confd_lib_lib(3) man page - specifically:

CONFD_ERR_HA_CLOSED (26)
A remote HA node closed its connection to us, or there was a
timeout waiting for a sync response from the master during a call
of confd_ha_beslave().

This shouldn’t normally happen, but matches the bug perfectly - the HA subsystem on the master crashes on a certain connection from a slave, closing the connection. The subsystem is restarted by ConfD, but the restart is incomplete, resulting in the CONFD_ERR_BADSTATE error on all bemaster/beslave attempts on the used-to-be-master node, until it is completely restarted.

No - I’ve never seen it before, but I believe it is a problem with your OS. If you look into the ‘ctrl.c’ source, it (like most of the example code) has very simple error handling: basically it does an assert() on every API call to ensure that it returned CONFD_OK. If not, the assert() generates a SIGABRT and a core dump, as you can see in your next attempt:

Some Linuxes these days do more than just create a core.* file in the current directory, there is some daemon (e.g. abrtd(8) on Fedora) intercepting the core dumps, saving them in some central location with additional info. There is probably some problem with this functionality in your OS installation/config.
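As an aside, if you want the actual ConfD error text rather than just the assert() and a core dump, you can replace the example’s check with something along these lines (just a sketch, not part of the shipped example):

#include <stdio.h>
#include <stdlib.h>
#include <confd_lib.h>

/* Sketch: report the ConfD error instead of asserting. */
static void ok(int ret, const char *what)
{
    if (ret != CONFD_OK) {
        fprintf(stderr, "%s failed: %s (confd_errno %d)\n",
                what, confd_lasterr(), confd_errno);
        exit(1);
    }
}

/* usage: ok(confd_ha_bemaster(s, &nodeid), "confd_ha_bemaster"); */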

For the moment I’m still using 5.3. I’m about to move to 5.4 Basic.

Is there a parameter in confd.conf for this timeout? It takes only a couple of seconds to get this error.
Could it be something related to a firewall, or anything like that?

I forgot to say that I initialized confd_init() with CONFD_PROTO_TRACE. That was why I thought it was a ConfD internal message.
Can it somehow hint at the problem, or should I ignore it?

Do you think there is any parameter I should compare between the working setup and the non-working setup?

Thanks.