Control socket closed with error -2 and then ConfD client exits with "INTERNAL ERROR"

Hi tail-f support,
We ran into an issue where the control socket between ConfD and its client was closed for an unknown reason. The error code is -2.
Here is the log of ConfD:

217553 <CRIT> 29-Sep-2021::09:02:14.424 welktx08vzwcsmfaat-y-ec-x-002 confd[201]: - Daemon cpa timed out
217554 <INFO> 29-Sep-2021::09:02:15.181 welktx08vzwcsmfaat-y-ec-x-002 confd[201]: audit user: yp_internal_user/3824 terminated session (reason: normal)
217555 <INFO> 29-Sep-2021::09:02:16.541 welktx08vzwcsmfaat-y-ec-x-002 confd[201]: audit user: smf_admin/3826 assigned to groups: application-administrator
217556 <INFO> 29-Sep-2021::09:02:16.543 welktx08vzwcsmfaat-y-ec-x-002 confd[201]: audit user: smf_admin/3826 created new session via cli from 72.17.0.74:51917 with ssh
217557 <INFO> 29-Sep-2021::09:02:16.563 welktx08vzwcsmfaat-y-ec-x-002 confd[201]: audit user: smf_admin/3826 CLI 'autowizard false'
217558 <INFO> 29-Sep-2021::09:02:16.565 welktx08vzwcsmfaat-y-ec-x-002 confd[201]: audit user: smf_admin/3826 CLI done
217559 <INFO> 29-Sep-2021::09:02:16.614 welktx08vzwcsmfaat-y-ec-x-002 confd[201]: audit user: smf_admin/3826 CLI 'screen-width 512'
217560 <INFO> 29-Sep-2021::09:02:16.615 welktx08vzwcsmfaat-y-ec-x-002 confd[201]: audit user: smf_admin/3826 CLI done
217561 <INFO> 29-Sep-2021::09:02:16.661 welktx08vzwcsmfaat-y-ec-x-002 confd[201]: audit user: smf_admin/3826 CLI 'screen-length 0'
217562 <INFO> 29-Sep-2021::09:02:16.662 welktx08vzwcsmfaat-y-ec-x-002 confd[201]: audit user: smf_admin/3826 CLI done
217563 <INFO> 29-Sep-2021::09:02:16.713 welktx08vzwcsmfaat-y-ec-x-002 confd[201]: audit user: smf_admin/3826 CLI 'config'
217564 <INFO> 29-Sep-2021::09:02:16.718 welktx08vzwcsmfaat-y-ec-x-002 confd[201]: audit user: smf_admin/3826 CLI done
217565 <CRIT> 29-Sep-2021::09:02:16.878 welktx08vzwcsmfaat-y-ec-x-002 confd[201]: - Daemon cpaValidator died
217566 <CRIT> 29-Sep-2021::09:02:16.879 welktx08vzwcsmfaat-y-ec-x-002 confd[201]: - Daemon cpa died
217567 <CRIT> 29-Sep-2021::09:02:16.882 welktx08vzwcsmfaat-y-ec-x-002 confd[201]: - no registration found for callpoint cpa/get_elem of type=external
217568 <CRIT> 29-Sep-2021::09:02:16.883 welktx08vzwcsmfaat-y-ec-x-002 confd[201]: - no registration found for callpoint cpa/get_elem of type=external
217569 <CRIT> 29-Sep-2021::09:02:16.883 welktx08vzwcsmfaat-y-ec-x-002 confd[201]: - no registration found for callpoint cpa/exists of type=external
217570 <INFO> 29-Sep-2021::09:02:16.885 welktx08vzwcsmfaat-y-ec-x-002 confd[201]: audit user: smf_admin/3826 CLI 'cc-rule-handling action Keep-policys remove'
217571 <CRIT> 29-Sep-2021::09:02:16.886 welktx08vzwcsmfaat-y-ec-x-002 confd[201]: - no registration found for callpoint cpa/get_elem of type=external
217572 <CRIT> 29-Sep-2021::09:02:16.887 welktx08vzwcsmfaat-y-ec-x-002 confd[201]: - no registration found for callpoint cpa/get_elem of type=external
217573 <CRIT> 29-Sep-2021::09:02:16.887 welktx08vzwcsmfaat-y-ec-x-002 confd[201]: - no registration found for callpoint cpa/exists of type=external

And here is the client log:

19752 {"message":"Socket closed","ret":-2,"service_id":"confd-client","severity":"error","timestamp":"2021-09-29T09:02:14.925Z","version":"0.2.0"}
19753 {"db":"running","fd":31,"message":"Socket closed","ret":-2,"service_id":"confd-client","session_id":3804,"session_type":"cli","severity":"error","timestamp":"2021-09-29T09:02:14.925Z","tx":52662,"tx_mode":"READ_WRITE","version":"0.2.0"}
19754 {"func":"CheckExecutePhasedConfdStartup","message":"Going to start phase 1","service_id":"confd-client","severity":"info","timestamp":"2021-09-29T09:02:14.952Z","version":"0.2.0"}
19755 {"error":"failed to go to start phase 1","func":"CheckExecutePhasedConfdStartup","message":"go to start phase 1 failed, may already be in startphase 2","service_id":"confd-client","severity":"info","timestamp":"2021-09-29T09:02:14.953Z","version":"0.2.0"}
19756 {"func":"CheckExecutePhasedConfdStartup","message":"Clearing NACM cache","service_id":"confd-client","severity":"info","timestamp":"2021-09-29T09:02:14.953Z","version":"0.2.0"}
19757 {"func":"CheckExecutePhasedConfdStartup","message":"Going to start phase 2","service_id":"confd-client","severity":"info","timestamp":"2021-09-29T09:02:15.181Z","version":"0.2.0"}
19758 {"func":"CheckExecutePhasedConfdStartup","message":"Going to start phase 2 done","service_id":"confd-client","severity":"info","timestamp":"2021-09-29T09:02:15.181Z","version":"0.2.0"}
19759 {"category":"CM","facility":"log audit","message":"Session start","service_id":"confd-client","session_id":3826,"session_type":"cli","severity":"info","subject":"smf_admin","timestamp":"2021-09-29T09:02:16.542Z","version":"0.2.0"}
19760 INTERNAL ERROR: No trans found with tid=53216
19761
19762 Error on control socket request: internal error (18): No trans found with tid=53216
19763
19764 Ended with exit code 1.

When the ConfD client starts, it performs the steps below (a rough sketch of steps 1-5 follows the list):

  1. init ConfD library by calling confd_init
  2. init daemon by calling confd_init_daemon
  3. load schemas by calling confd_load_schemas
  4. create control socket which connects to ConfD
  5. register callbacks
  6. execute ConfD client phased startup
  7. the client polls the control socket connected to ConfD
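
For clarity, here is a simplified sketch of steps 1-5. The names, callback tables and handlers are placeholders (not our real code); error handling, the worker-socket binding in the trans init callback, and steps 6-7 are left out:

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <confd_lib.h>
#include <confd_dp.h>

/* Placeholder callback handlers; the real client registers its own
   handlers for the cpa callpoint. */
static int t_init(struct confd_trans_ctx *tctx);
static int get_elem(struct confd_trans_ctx *tctx, confd_hkeypath_t *kp);
static struct confd_trans_cbs trans_cbs;
static struct confd_data_cbs  data_cbs;

static struct confd_daemon_ctx *dctx;
static int ctlsock, workersock;

static int client_startup(struct sockaddr_in *addr)
{
    /* step 1: init the library */
    confd_init("cpa", stderr, CONFD_SILENT);

    /* step 3: load the schemas (only needed once per process) */
    if (confd_load_schemas((struct sockaddr *)addr, sizeof(*addr)) != CONFD_OK)
        return CONFD_ERR;

    /* step 2: init the daemon context */
    if ((dctx = confd_init_daemon("cpa")) == NULL)
        return CONFD_ERR;

    /* step 4: create the control socket (and a worker socket) and connect */
    ctlsock = socket(PF_INET, SOCK_STREAM, 0);
    if (confd_connect(dctx, ctlsock, CONTROL_SOCKET,
                      (struct sockaddr *)addr, sizeof(*addr)) != CONFD_OK)
        return CONFD_ERR;
    workersock = socket(PF_INET, SOCK_STREAM, 0);
    if (confd_connect(dctx, workersock, WORKER_SOCKET,
                      (struct sockaddr *)addr, sizeof(*addr)) != CONFD_OK)
        return CONFD_ERR;

    /* step 5: register callbacks; t_init() is expected to bind the worker
       socket to each transaction with confd_trans_set_fd() */
    trans_cbs.init = t_init;
    strcpy(data_cbs.callpoint, "cpa");
    data_cbs.get_elem = get_elem;
    if (confd_register_trans_cb(dctx, &trans_cbs) != CONFD_OK ||
        confd_register_data_cb(dctx, &data_cbs) != CONFD_OK)
        return CONFD_ERR;
    return confd_register_done(dctx);
    /* steps 6 (phased startup via MAAPI) and 7 (poll loop) not shown */
}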

ConfD reported that the client daemon cpa timed out (why?) and then sent CONFD_EOF to the client through the control socket. The client received it. The handling is:

  1. close the control socket
  2. release the daemon
  3. start from step 2 above.

The handling process seems OK. You can see in line 19759 that a new session was started. But just after that, an INTERNAL ERROR came and it caused the client to restart.
My question is: why does the internal error “No trans found with tid=53216” show up? Are there any steps missing in our handling of the control socket problem?

Thank you in advance.
BRs
Michael

The CONFD_EOF (-2) error is returned when the socket to ConfD has been closed.

For example when something like a timeout happens:

217553 <CRIT> 29-Sep-2021::09:02:14.424 welktx08vzwcsmfaat-y-ec-x-002 confd[201]: - Daemon cpa timed out
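
In a typical libconfd application the EOF shows up as the CONFD_EOF return value from confd_fd_ready() in the poll loop, your step 7. A minimal sketch, using the same dctx/ctlsock/workersock placeholder names as the sketch above:

#include <poll.h>

/* Sketch of the poll loop: CONFD_EOF from confd_fd_ready() means ConfD
   closed the socket, e.g. after "Daemon cpa timed out". */
static void poll_loop(void)
{
    struct pollfd fds[2];
    fds[0].fd = ctlsock;    fds[0].events = POLLIN;
    fds[1].fd = workersock; fds[1].events = POLLIN;

    for (;;) {
        int i;
        if (poll(fds, 2, -1) < 0)
            continue;
        for (i = 0; i < 2; i++) {
            int ret;
            if (!(fds[i].revents & (POLLIN | POLLHUP)))
                continue;
            ret = confd_fd_ready(dctx, fds[i].fd);
            if (ret == CONFD_EOF) {
                /* ConfD closed this socket: leave the loop and run the
                   application's reconnect handling */
                return;
            } else if (ret == CONFD_ERR && confd_errno != CONFD_ERR_EXTERNAL) {
                fprintf(stderr, "Error on socket request: %s (%d): %s\n",
                        confd_strerror(confd_errno), confd_errno,
                        confd_lasterr());
                return;
            }
        }
    }
}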

Hi cohult,
Thank you for the reply.
I would like to reproduce ConfD closing the control socket and see how the client recreates the daemon. Could you please give me some guidance on how to close the control socket on the ConfD server side?
BRs
Michael

A simple way to simulate an EOF (-2) on the socket is to set up port forwarding: have your client connect using, for example, port 4575 instead of 4565, and then use something like socat to do the forwarding. Here is an example of starting socat on the same host as ConfD, with your application running on the same or another host:

$ socat TCP-LISTEN:4575,fork TCP:127.0.0.1:4565

Start ConfD and connect your client through port 4575. Hit Ctrl-C to close socat (and the connection).

Hi cohult,
Really appreciate your great suggestion.
Using socat I can simulate an EOF on the control socket and thereby reproduce the INTERNAL ERROR “No trans found with tid”. This causes the ConfD client to restart. We don’t think this restart is expected behavior. Can it be avoided? Or can this internal error be avoided?
BRs
Michael

Hi,

ConfD does not need to be restarted, and ConfD does not restart because of an aborted transaction due to a data provider error. There is likely an application that tells ConfD to restart; check your code for that application.

In the log you provided, a data provider stopped responding, so a transaction failed. The “No trans found with tid” error comes from the libconfd client library that the application uses. The application should not use the transaction id of a transaction that has been aborted (due to the data provider application failing).

From the libconfd client library source code that comes with the ConfD releases:

$ grep -ne "No trans found with tid" libconfd/src/confd_lib.c
8732:        return confd_internal_error("No trans found with tid=%d\n",tid);
10145:            ret = confd_internal_error("No trans found with tid=%d\n",tid);
10179:            ret = confd_internal_error("No trans found with tid=%d\n",tid);
10213:            ret = confd_internal_error("No trans found with tid=%d\n",tid);
10247:            ret = confd_internal_error("No trans found with tid=%d\n",tid);
10282:            ret = confd_internal_error("No trans found with tid=%d\n",tid);
10316:            ret = confd_internal_error("No trans found with tid=%d\n",tid);
10391:            ret = confd_internal_error("No trans found with tid=%d\n",tid);
10439:            ret = confd_internal_error("No trans found with tid=%d\n",tid);

Hi cohult,
Thanks for your explanation.
To avoid misunderstandings, I prepared a picture below to show the related components.


To simulate EOF, I run the command “/tmp/socat TCP-LISTEN:4575,reuseaddr,fork TCP:127.0.0.1:4565”, interrupt it with Ctrl-C, and then run the command again immediately.
Meanwhile, in the CLI there is a NETCONF get on /nacm running every 1 s.

As I described at the beginning of this post, the ConfD daemon in the application is already recreated after receiving EOF. The old transaction should be cleared in the application, right?
BRs
Michael

Hi Michael,

Thank you for the picture. Very helpful.

When the ConfD client starts, it performs the steps below:

  1. init ConfD library by calling confd_init
  2. init daemon by calling confd_init_daemon
  3. load schemas by calling confd_load_schemas
  4. create control socket which connects to ConfD
  5. register callbacks
  6. execute ConfD client phased startup
  7. the client polls the control socket connected to ConfD

ConfD reported that the client daemon cpa timed out (why?) and then sent CONFD_EOF to the client through the control socket. The client received it. The handling is:

  1. close the control socket
  2. release the daemon
  3. start from step 2 above.

The handling process seems OK. You can see in line 19759 that a new session was started. But just after that, an INTERNAL ERROR came and it caused the client to restart.

  • You do not need to load the schema again. So move step 3, load schema, to before step 2, init daemon.
  • When you receive an EOF on the control socket, you need to close the worker socket too. You likely see the libconfd error because you are, for example, replying to something over a worker socket after the control socket has been closed and the transaction has therefore been aborted.
  • Before reconnecting, also make sure you call ‘confd_release_daemon()’ before you call ‘confd_init_daemon()’ again. A rough sketch of this handling follows below.
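
Something like this, reusing the placeholder names, callback tables and includes from the startup sketch earlier in the thread (simplified, with minimal error handling):

/* Sketch: tear down and rebuild the daemon after CONFD_EOF on the
   control socket. */
static int reconnect_after_eof(struct sockaddr_in *addr)
{
    /* close the control socket AND the worker socket(s) */
    close(ctlsock);
    close(workersock);

    /* release the old daemon context before creating a new one;
       confd_load_schemas() does not need to be called again */
    confd_release_daemon(dctx);
    if ((dctx = confd_init_daemon("cpa")) == NULL)
        return CONFD_ERR;

    /* reconnect the control and worker sockets */
    ctlsock = socket(PF_INET, SOCK_STREAM, 0);
    if (confd_connect(dctx, ctlsock, CONTROL_SOCKET,
                      (struct sockaddr *)addr, sizeof(*addr)) != CONFD_OK)
        return CONFD_ERR;
    workersock = socket(PF_INET, SOCK_STREAM, 0);
    if (confd_connect(dctx, workersock, WORKER_SOCKET,
                      (struct sockaddr *)addr, sizeof(*addr)) != CONFD_OK)
        return CONFD_ERR;

    /* re-register the callbacks on the new daemon context */
    if (confd_register_trans_cb(dctx, &trans_cbs) != CONFD_OK ||
        confd_register_data_cb(dctx, &data_cbs) != CONFD_OK)
        return CONFD_ERR;
    return confd_register_done(dctx);
}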

Best regards,
Conny

Hi Conny,
Thank you very much for the 3 suggestions.
Actually, we have already done the 1st and 3rd ones. I made a mistake when describing the recreation process. Now it should be as below:

  1. close the control socket
  2. close the worker socket (TODO)
  3. release the daemon and then create the daemon
  4. start from step 4 above

So you think the internal error “No trans found” is caused by the leftover worker sockets, right? I will give it a try.
BRs
Michael

Hi @cohult ,
I’m trying to handle the control socket timeout error, and I followed the steps mentioned in the comment above.


When the ConfD client starts, it performs the steps below:

1. init ConfD library by calling confd_init
2. init daemon by calling confd_init_daemon
3. load schemas by calling confd_load_schemas
4. create control socket which connects to ConfD
5. register callbacks
6. execute ConfD client phased startup
7. the client polls the control socket connected to ConfD

The handling process seems OK. You can see in line 19759 that a new session was started. But just after that, an INTERNAL ERROR came and it caused the client to restart.
ConfD reported that the client daemon cpa timed out (why?) and then sent CONFD_EOF to the client through the control socket. The client received it. The handling is:

1. close the control socket
2. release the daemon
3. start from step 2 above.

But I’m facing issues with the register-callbacks step:
If I register callbacks after daemon init and socket re-opening, the system doesn’t come up since it detects duplicate registrations.
If I skip registering callbacks, the system comes up fine, but later when data is queried the query fails because the callback functions are not registered.
Could you help me figure out whether I’m missing anything?
Let us know if I need to do anything specific to de-register old callbacks (I’m releasing the daemon as suggested, closing all old control and worker sockets, and re-opening them).
Thanks a lot in advance.

Regards,
Usha

I believe you are looking for something like the confd_set_daemon_flags() function's CONFD_DAEMON_FLAG_REG_REPLACE_DISCONNECT flag. See the confd_lib_dp(3) man page for details and this old topic:
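
For reference, the flag is set on the daemon context right after confd_init_daemon() and before the control socket is connected and the callbacks are registered. A small sketch, reusing the placeholder daemon name from earlier:

/* Sketch: ask ConfD to replace a stale registration (and disconnect the
   daemon that holds it) instead of rejecting the new registration. */
struct confd_daemon_ctx *dctx = confd_init_daemon("cpa");
if (dctx == NULL)
    confd_fatal("failed to init daemon\n");
confd_set_daemon_flags(dctx, CONFD_DAEMON_FLAG_REG_REPLACE_DISCONNECT);
/* ... then confd_connect() the control/worker sockets, confd_register_*()
   and confd_register_done() as before ... */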

Thanks for your inputs @cohult.
I tried setting the suggested CONFD_DAEMON_FLAG_REG_REPLACE_DISCONNECT flag in both daemons, and we are still seeing the duplicate registration issue.
Is there any other way to solve this issue?
Thanks,
Usha

That is too little information. You need the developer log with the log level set to trace on the ConfD side, and the application-side libconfd log set to CONFD_TRACE, to figure out what is happening.
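
A sketch of enabling the client-side trace; the trace file path is just an example:

/* Sketch: the third argument to confd_init() sets the libconfd debug level;
   CONFD_TRACE logs every library call and its result to the given stream.
   On the ConfD side, raise the developer log level to trace in confd.conf. */
FILE *trace_log = fopen("/tmp/confd_client.trace", "w");   /* example path */
confd_init("cpa", trace_log ? trace_log : stderr, CONFD_TRACE);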

Also,

That is not enough. You need to close the worker socket, as I wrote a couple of years ago above in this thread: