confd_ha_beslave() timeout

Hi, sometimes I get a timeout from confd_ha_beslave().
Error: confd_errno: 25 (Failed to connect to remote HA node)
The timeout happens after about a minute, so I believe (correct me if I’m wrong) it’s related to the confd.conf element <tickTimeout>PT20S</tickTimeout>, since 3 times 20 seconds is one minute. The user guide describes it as follows:

The tickTimeout is a duration indicating how often each slave must send a tick
message to the master indicating liveness. If the master has not received a tick from a slave within 3 times
the configured tick time, the slave is considered to be dead. Similarly, the master sends tick messages to
all the slaves. If a slave has not received any tick messages from the master within the 3 times the timeout,
the slave will consider the master dead and report accordingly

On the master side, I turned on the “traceproto” log level and I can’t see any new connection being made by the client.
On the client side, I get confd_errno 25.

The master node really is in master state (I can see that using confd --status).

What can I do?
Thanks

EDIT Got help from the experts. The connectTimeout has nothing to do with this. See post below.

What may be the reason for the timeout?
Is it OK to just retry confd_ha_beslave()? Or does the timeout mean something? Maybe I’m doing something wrong?

Got help from the experts. The connectTimeout has nothing to do with this. I edited the previous post.

The reason for the timeout is likely that you got a TCP socket connect timeout.
On a Mac this is usually set to 75 seconds. Maybe you have a firewall blocking the slave?

To view the timeout value on a Mac:

$ sysctl net.inet.tcp.keepinit
net.inet.tcp.keepinit: 75000

Linux:

# cat /proc/sys/net/ipv4/tcp_syn_retries
6

To change the timeout on a Mac:

$ sudo sysctl net.inet.tcp.keepinit=10000

Which gives a connect timeout ceiling of 10 seconds.

Linux:

# echo 7 > /proc/sys/net/ipv4/tcp_syn_retries

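The effective connect timeout ceiling depends on the kernel’s initial retransmission timeout, which doubles after each retry. With a 1-second initial timeout, 5 gives around 63 seconds and 6 gives around 127 seconds. With the 3-second initial timeout used by older kernels, 3 gives around 45 seconds, 4 gives around 90 seconds, and 5 gives around 190 seconds.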
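
How a tcp_syn_retries value turns into a number of seconds can be sketched as a sum of doubling wait intervals. This is a rough model rather than kernel code: it assumes pure exponential backoff starting from an initial retransmission timeout of 1 second (older kernels used 3 seconds), and newer kernels that send the first few retries at a fixed interval will give different figures. The function name and parameters are made up for illustration.

```python
def syn_timeout_ceiling(syn_retries, initial_rto=1.0):
    """Approximate seconds before connect() gives up on an unanswered SYN.

    Assumption: the first SYN waits initial_rto seconds and every
    retransmission doubles the wait (pure exponential backoff).
    """
    total = 0.0
    wait = initial_rto
    # One wait for the initial SYN plus one wait per retransmission.
    for _ in range(syn_retries + 1):
        total += wait
        wait *= 2
    return total


if __name__ == "__main__":
    for retries in range(3, 7):
        print(retries,
              syn_timeout_ceiling(retries),                    # 1 s initial timeout
              syn_timeout_ceiling(retries, initial_rto=3.0))   # 3 s initial timeout
```

Under this model a one-minute timeout matches 5 retries with a 1-second initial timeout.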

Thanks,
I’ll try to check that.

BTW, the flow is always:

  1. trying to be slave, confd_errno: 25 (Failed to connect to remote HA node)
  2. success

There is no firewall problem because the second attempt always succeeds.

Why do you think it’s related to the TCP socket connect timeout?

Because confd_ha_beslave() times out. This is the behaviour you get if you try to connect to a remote HA node that doesn’t respond because the TCP connection cannot be established.
It is easy to verify: change the timeout, as I described in my earlier reply, to something shorter than you currently have. If it is the TCP socket connect that times out, you will hit the timeout sooner.
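
If you would rather not change a system-wide setting, a plain TCP connect with its own short timeout tells you much the same thing. This is a minimal sketch using only the Python standard library, not anything from ConfD; probe_tcp is a made-up helper name, and the host and port you pass in are whatever your setup uses.

```python
import socket
import time


def probe_tcp(host, port, timeout_s=5.0):
    """Attempt one TCP connect and return (ok, elapsed_seconds, detail)."""
    start = time.monotonic()
    try:
        # create_connection applies timeout_s to the connect attempt itself.
        with socket.create_connection((host, port), timeout=timeout_s):
            return True, time.monotonic() - start, "connected"
    except socket.timeout:
        # No answer within timeout_s: typical of a dropped or unanswered SYN.
        return False, time.monotonic() - start, "timed out"
    except OSError as exc:
        # For example connection refused or no route to host.
        return False, time.monotonic() - start, "failed: %s" % exc
```

Calling probe_tcp(master_ip, ha_port, 5.0) from the slave machine, with the same address you pass to confd_ha_beslave() and the HA port from your confd.conf, should either connect quickly or report "timed out" after 5 seconds. A timeout here points at the network path or the address rather than at ConfD.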

Have you verified by changing the TCP socket connect timeout like I described in the previous reply?

Hi,

I did verify it, and indeed this parameter /proc/sys/net/ipv4/tcp_syn_retries modifies the timeout of confd_ha_beslave().

But still, it’s strange:

  1. Why does the call to confd_ha_beslave() fail and then succeed? Nothing changes between the two calls.
  2. At some point something happened and the call to confd_ha_beslave() kept failing forever. I can’t really say what caused this behavior, but the first failure returned immediately and the following failures were timeouts.

Are you running something like openswan on one of the machines?

Here is an example of a firewall causing a similar issue:

I think I found the problem.
It was in my code: sometimes confd_ha_beslave() was being called with the correct IP (of the master node), and sometimes not.
When the IP wasn’t correct, there really were about /proc/sys/net/ipv4/tcp_syn_retries attempts to start a connection before it gave up.

Sorry for the trouble, Thanks!