Notification to ConfD is not working

Hi all,
I’m facing a random issue where, after reboot, only the first notification is getting send & the rest of the notifications that follows it are not send.

The below is the code snippet where the notification is send to ConfD,.

status = confd_notification_send(m_LiveContext, &EventTime, values, NUM_OF_TAGS);
if (status != CONFD_OK)
{
		DEBUG(netConfTraceP, <<"SendNotification failed for keyPath: " << keyPath);
}
else
{
		DEBUG(netConfTraceP, <<"SendNotification successful for keyPath: " << keyPath);
}

For the first notification, I’m able to see that “SendNotification successful for keyPath” is printed in the logs. From second notification onwards, I see it is printing “SendNotification failed for keyPath” which means the confd_notification_send is not returning CONFD_OK.

I’m initializing the ConfD library as below:`

pConfdTracesFile = fopen("/var/log/confd_traces.log", "w");
/* Initialize confd configuration daemon context */
confd_init("ConfdAdaptor", pConfdTracesFile , CONFD_PROTO_TRACE);

Though I’m using CONFD_PROTO_TRACE, I’m not able to find anything useful in confd_traces.log except for the extra debug that comes from second notification onwards:

The below is the output of confd_traces.log after sending the notifications:

TRACE NOTIFICATION_SEND NETCONF //first notification

17-Jul-2020::19:27:08.200 5513/f2eff470/28 SEND op=15 isrel=0 th=-1 {170,‘NETCONF’,undefined,{19,{2020,7,17,19,27,8,200140,0,0}},0,{hxml,[{[1913099140|1458797552],start},{269073225,#Bin</
/object=slot//slot=(3)…>},{1821525234,{44,{601620947,951616458}}},{941725206,#Bin},{190171445,{19,{2020,7,17,19,27,5,0,0,0}}},{1194414666,{28,6}},{1518107580,#Bin},{[993414199|1079952141],{28,5}},{[993414199|1241890965],{12,2007}},{[993414199|2043370731],#Bin},{[1913099140|1458797552],stop}]}}

TRACE NOTIFICATION_SEND NETCONF //second notification
DEBUG Socket to ConfD is closed
TRACE NOTIFICATION_SEND NETCONF //third notification
DEBUG Socket to ConfD is closed

Can anyone explain what could be the reason for socket to be closed? OR suggest ways to enable more logs to understand the issue?

This means that the socket was closed from the ConfD side, not that the application’s socket was closed. It should be due to either a) ConfD closed the specific sockets for your daemon due to “bad behavior”, typically time out for a request response, or b) ConfD was terminated or crashed.

a) is prominently logged in both confd.log and devel.log, and b) should be obvious from the fact that ConfD is no longer running (unless you restart it, in which case you will see a new PID in confd.log and the normal startup logging).

Hi @per, Thanks for replying back!
Here is the snapshot of confd_devel.log after reboot:

20-Jul-2020::09:04:26.286 NPT-1050i-Sukeshni confd[1232]: devel-c Daemon closed (name: ConfdNotification, daemon id: 2)
20-Jul-2020::09:04:34.409 NPT-1050i-Sukeshni confd[1232]: devel-c Daemon ‘ConfdAdaptor’ replaced registration for validation point vpBgpNeighbors
20-Jul-2020::09:04:34.409 NPT-1050i-Sukeshni confd[1232]: devel-c Daemon ‘ConfdAdaptor’ replaced registration for validation point vpBgpRoot
20-Jul-2020::09:04:34.410 NPT-1050i-Sukeshni confd[1232]: devel-c Daemon ‘ConfdAdaptor’ replaced registration for validation point vpBgpNeighbors
20-Jul-2020::09:04:34.410 NPT-1050i-Sukeshni confd[1232]: devel-c Daemon ‘ConfdAdaptor’ replaced registration for validation point vpBgpRoot
20-Jul-2020::09:04:34.411 NPT-1050i-Sukeshni confd[1232]: devel-c Daemon ‘ConfdAdaptor’ replaced registration for validation point vpBgpNeighbors
20-Jul-2020::09:04:34.411 NPT-1050i-Sukeshni confd[1232]: devel-c Daemon ‘ConfdAdaptor’ replaced registration for validation point vpBgpRoot

This was printed just after reboot. When the notifications are send, there is nothing printed in this log as well as in confd.log. How can we resolve this issue?

Did you check the other possibility I mentioned, that ConfD is no longer running? If it crashed, there won’t necessarily be anything written to those logs.

Hi @per, The ConfD is not crashed. It is running. I’m able to perform configuration changes from netconf client.

Sorry, I can’t think of a case where ConfD would continue to run, but close the sockets for a specific daemon without any traces in the logs. But hm, maybe this is such a trace:

Is this your notification sender? There is no sign in your log snippet of that daemon connecting again. Could it be that your successful sending happened before that log entry, and the unsuccessful ones afterwards? (The time stamps obviously can’t be correlated since they are from different dates.)

“Daemon closed” means that the daemon closed its control socket (intentionally or by terminating), which isn’t per se an error for a daemon without registered callbacks. ConfD then closes all the connections to the daemon and forgets about it. Note that while notifications are only sent on worker socket(s), the daemon must still keep its control socket connection in order to be considered “alive and present” by ConfD.

Hi @per,
Using tcpdump on CONFD_PORT (4565), I see the following dump when notifications are send successfully:

23:45:42.799164 IP localhost.localdomain.60664 > localhost.localdomain.4565: Flags [S], seq 1616365006, win 14600, options [mss 1460,sackOK,TS val 268118 ecr 0,nop,wscale 9], length 0
23:45:42.799224 IP localhost.localdomain.4565 > localhost.localdomain.60664: Flags [S.], seq 3051796305, ack 1616365007, win 14480, options [mss 1460,sackOK,TS val 268118 ecr 268118,nop,wscale 9], length 0
23:45:42.799262 IP localhost.localdomain.60664 > localhost.localdomain.4565: Flags [.], ack 1, win 29, options [nop,nop,TS val 268118 ecr 268118], length 0
23:45:42.800077 IP localhost.localdomain.60664 > localhost.localdomain.4565: Flags [P.], seq 1:10, ack 1, win 29, options [nop,nop,TS val 268119 ecr 268118], length 9
23:45:42.800122 IP localhost.localdomain.4565 > localhost.localdomain.60664: Flags [.], ack 10, win 29, options [nop,nop,TS val 268119 ecr 268119], length 0
23:45:42.800299 IP localhost.localdomain.4565 > localhost.localdomain.60664: Flags [P.], seq 1:10, ack 10, win 29, options [nop,nop,TS val 268119 ecr 268119], length 9
23:45:42.800345 IP localhost.localdomain.60664 > localhost.localdomain.4565: Flags [.], ack 10, win 29, options [nop,nop,TS val 268119 ecr 268119], length 0
23:45:42.800537 IP localhost.localdomain.60664 > localhost.localdomain.4565: Flags [P.], seq 10:75, ack 10, win 29, options [nop,nop,TS val 268119 ecr 268119], length 65
23:45:42.802318 IP localhost.localdomain.4565 > localhost.localdomain.60664: Flags [P.], seq 10:59, ack 75, win 29, options [nop,nop,TS val 268119 ecr 268119], length 49
23:45:42.802650 IP localhost.localdomain.60664 > localhost.localdomain.4565: Flags [F.], seq 75, ack 59, win 29, options [nop,nop,TS val 268119 ecr 268119], length 0
23:45:42.802757 IP localhost.localdomain.4565 > localhost.localdomain.60664: Flags [F.], seq 59, ack 76, win 29, options [nop,nop,TS val 268119 ecr 268119], length 0
23:45:42.802791 IP localhost.localdomain.60664 > localhost.localdomain.4565: Flags [.], ack 60, win 29, options [nop,nop,TS val 268119 ecr 268119], length 0
23:45:45.038659 IP localhost.localdomain.60119 > localhost.localdomain.4565: Flags [P.], seq 156646442:156646799, ack 1613957451, win 29, options [nop,nop,TS val 268678 ecr 4294950514], length 357
23:45:45.038713 IP localhost.localdomain.4565 > localhost.localdomain.60119: Flags [.], ack 357, win 31, options [nop,nop,TS val 268678 ecr 268678], length 0

And the following dump when notifications are getting failed to send:

23:00:31.659098 IP localhost.localdomain.37463 > localhost.localdomain.4565: Flags [S], seq 486974673, win 14600, options [mss 1460,sackOK,TS val 21317665 ecr 0,nop,wscale 9], length 0
23:00:31.659170 IP localhost.localdomain.4565 > localhost.localdomain.37463: Flags [S.], seq 2167486249, ack 486974674, win 14480, options [mss 1460,sackOK,TS val 21317665 ecr 21317665,nop,wscale 9], length 0
23:00:31.659289 IP localhost.localdomain.37463 > localhost.localdomain.4565: Flags [.], ack 1, win 29, options [nop,nop,TS val 21317665 ecr 21317665], length 0
23:00:31.659923 IP localhost.localdomain.37463 > localhost.localdomain.4565: Flags [P.], seq 1:10, ack 1, win 29, options [nop,nop,TS val 21317665 ecr 21317665], length 9
23:00:31.659978 IP localhost.localdomain.4565 > localhost.localdomain.37463: Flags [.], ack 10, win 29, options [nop,nop,TS val 21317665 ecr 21317665], length 0
23:00:31.660190 IP localhost.localdomain.4565 > localhost.localdomain.37463: Flags [P.], seq 1:10, ack 10, win 29, options [nop,nop,TS val 21317665 ecr 21317665], length 9
23:00:31.660580 IP localhost.localdomain.37463 > localhost.localdomain.4565: Flags [.], ack 10, win 29, options [nop,nop,TS val 21317665 ecr 21317665], length 0
23:00:31.660739 IP localhost.localdomain.37463 > localhost.localdomain.4565: Flags [P.], seq 10:75, ack 10, win 29, options [nop,nop,TS val 21317665 ecr 21317665], length 65
23:00:31.662417 IP localhost.localdomain.4565 > localhost.localdomain.37463: Flags [P.], seq 10:59, ack 75, win 29, options [nop,nop,TS val 21317666 ecr 21317665], length 49
23:00:31.662949 IP localhost.localdomain.37463 > localhost.localdomain.4565: Flags [F.], seq 75, ack 59, win 29, options [nop,nop,TS val 21317666 ecr 21317666], length 0
23:00:31.663056 IP localhost.localdomain.4565 > localhost.localdomain.37463: Flags [F.], seq 59, ack 76, win 29, options [nop,nop,TS val 21317666 ecr 21317666], length 0
23:00:31.663272 IP localhost.localdomain.37463 > localhost.localdomain.4565: Flags [.], ack 60, win 29, options [nop,nop,TS val 21317666 ecr 21317666], length 0

Please suggest if you can understand something from these packet dumps. Thanks.

Did you read my previous message? Could it explain what you’re seeing?

Hi @per,
Regarding your question,

Is this your notification sender?

Yes. ConfdNotification is the notification sender.

Could it be that your successful sending happened before that log entry, and the unsuccessful ones afterwards?

The entry Daemon closed is coming before the first notification after reboot is send.

Did you read my previous message? Could it explain what you’re seeing?

In the tcpdump when notifications are send correctly, there are 14 packets & in the tcpdump when notifications are getting failed, there are 12 packets. Also, the port number for the last 2 packets in the 14 packets are different compared to the first 12. So, the last 2 must be related to control socket, right?

Hm, that’s strange, since (as I wrote) ConfD closes all connections to your daemon when that message is logged (but see below). However you do close the daemon’s control socket at that point, right? As I also wrote, you mustn’t do that.

Yes, that seems likely - but is also strange, since if a daemon is only a notification sender, and doesn’t register the replay callbacks, there shouldn’t be any control socket communication after the initial confd_register_notification_stream() + confd_register_done() calls.

And there is something else that is very strange with the traces - it seems you connect() a worker socket for each notification, i.e. confd_connect() - confd_notification_send() - close(). This is definitely not the intent, in fact it isn’t even supported by the interface - you give the worker socket fd that you intend to use in the struct confd_notification_stream_cbs that you pass to confd_register_notification_stream(), and that socket will then be used for all confd_notification_send() calls - i.e. it must obviously remain connected.

This per-notification socket connect might also be the reason you are able to send even a single notification despite having closed the control socket - ConfD may treat it as a new daemon connection, where the worker socket connect manages to arrive at the ConfD side before the control socket connect, and process the notification while the control socket connect is still pending. But (presumably) when the control socket connect never happens, it won’t work to send additional notifications.

At this point I have to suggest that you study the implementation of a notification sender in notifier_builtin_replay_store.c in the netconf_notifications example, and adjust your implementation to match what the example does - there are just to many issues with your implementation to address one-by-one in this forum.

Thanks Per for the help and suggestion.
Issue was resolved. The issue was with the ordering of the socket opening. In the code , worker socket was opened first then the control socket which lead to issues.