We use the confd7.5.5.1 and hit this issue.
There have been some sessions that send rpc success and receive the success response and also receive the changed data notification. All works well.
At some point, a new session sends and gets timeout. and later, it re-sends and gets failure like below:
<?xml version="1.0" encoding="UTF-8"?><rpc xmlns="urn:ietf:params:xml:ns:netconf:base:1.0" message-id="206">
<create-subscription
xmlns="urn:ietf:params:xml:ns:netconf:notification:1.0">
<stream>exa-events</stream>
</create-subscription>
</rpc>
<rpc-reply xmlns="urn:ietf:params:xml:ns:netconf:base:1.0" message-id="206"><rpc-error>
<error-type>protocol</error-type>
<error-tag>operation-failed</error-tag>
<error-severity>error</error-severity>
<error-message xml:lang="en">A subscription is already active for this session</error-message></rpc-error>
</rpc-reply>
I look into the confd_netconf.log and found every timeout, there is something handling in other session. Don’t know whether this will impact the response.
Please refer to the below snippet of confd_netconf.log:
The confd took around 3m to respond to the rpc while processing the ont-arrival notification**
why does create-subscription rpc get stuck when there is something handling in other session?
How to cancel subscription and then re-create subscription? now the NBI re-sends and gets failure and then stop the following process. It’s a critical issue since it stops and blocks all the following process.
You may have hit a known (mis)feature: when a NETCONF session A subscribes to notifications and another session B sends a RPC that should cause a notification to be sent to A, B does not get a reply until the notification has been sent. If the session A is processing another long-running request, the reply to B is blocked. Is it possible that this is what you see?
There are some attempts to fix this scenario, but the forthcoming release 7.8 does not contain the fix yet, so you might need work around that, e.g. to split the session that causes the delay (in your log that would probably be session 12) to two, one that has these long-running requests, another that subscribes to events.
@mvf, Sounds like what’s your description is similar with what I have seen, but not same. Our case is:
When a NETCONF session A subscribes to notifications and our EDP has some data change like onts’ arrival or departure, so our EDP generates relative external events like “ont-arrival”, “ont-departure” to the confd, and then confd tries to send these notifications to all sessions like B, C ,etc, which has subscribed the notification.
In this case, session A will be blocked for several minutes. But it’s not always blocked in the same case.
Please take a look at the below logs, between receiving create-subscription and replying create-subscription, there are lots of notifications, but it’s very quickly.
I suggest you take a look at other logs, like devel.log (possibly with developerLogLevel set to trace), to inspect what is going on in case you wait so long for a reply. You may also try NETCONF trace log; but if there is indeed a problem with large messages sent between ConfD and one of the daemons, the trace log will also become quickly very large.
I see if a session sends “create-subscription”, the confd sends rpc-reply and also sends lots of notifications to this session even at that time there is no data changes from our EDP. I don’t know how the confd gets these data changes or events raised earlier by our EDP. Is there any callback to our EDP to retrieve that? I don’t find that.
Thanks.
That might well explain the problem you see. ConfD supports notification replay, it can be either built-in (which is probably your case) or custom. If configured, when a NETCONF client sends a create-subscription message with parameter startTime, the server is supposed to immediately start streaming notifications that has been created since that startTime value; see the RFC section. If your replay buffer is large, this becomes just another instance of the problem with NETCONF sessions blocked due to notifications.
You have a couple of configuration options:
Turn of notifications replay altogether (for your stream). In confd.conf, just modify replaySupport to false for your stream - right now, you obviously have this there:
<replaySupport>true</replaySupport>
Limit the replay size; there are two parameters affecting that, maxSize and maxFiles. I’m not really familiar with how they work, you might need to experiment to find reasonable values.
Implement your own custom replay where you can limit replays as you please.
When creating a subscription, either leave out startTime paramenter (without the paramenter the NETCONF server does not do any replay) or limit it to a not-so-distant past so that the number of notifications is small.
The feature is well documented in the ConfD user guide, and the NETCONF notification example shows how to use it.
@mvf , thanks a lot for your elaboration. I got understanding of notification replay now. Yes, there are a couple of configuration options on our system:
Sorry to reply so late, for I was busy with other features and now I am back to this issue.
One question here:
when a NETCONF client A sends a create-subscription message with parameter , the CONFD will read the history notifications from /tmp/confd/event-store, right?
If at the same time, another session B triggers a notification, does CONFD write this notification into /tmp/confd/event-store and lock the notification to cause session A stuck? Is it possible ?
Thanks.
Saw your closed Tail-f support ticket. In that log you had a socket failure on a send notification to a NETCONF notification subscriber, so ConfD is likely waiting for that notification to finish sending before replying to the create-subscription that asks for a replay of notifications in the built-in replay store.
Because of the flow control in SSH, ConfD uses /confdConfig/netconf/writeTimeout to terminate the irresponsive session so that ConfD doesn’t get blocked. It means if the notification subscriber is not handling notifications, ConfD will not be able to write to the SSH connection anymore, then this timeout kicks in and drops the subscription session.
You can use the examples.confd/netconf_notifications example, set /confdConfig/netconf/writeTimeout to “PT4S”, then suspended the NC subscriber and keep sending notifications. You will likely eventually get something like the following in netconf.log:
@cohult , @mvf , I see. Thank your elaboration of this socket failure case.
We hit the “create-subscription” stuck issue in 2 cases so far. One case shows there is A session socket failure, and meanwhile, B session stuck. The other case shows there are lots of notifications like lots of onts arrival and departure posted by our EDP to confd, which will trigger confd to send these notifications to all subscribers, and at this time, a session “create-subscription” stuck.
So can I understand these two cases have the same root cause? I mean the root cause is just as what you said " ConfD is likely waiting for that notification to finish sending before replying to the create-subscription that asks for a replay of notifications in the built-in replay store."
If so, here are another 3 questions:
1#: In the case of confd receiving notifications from our EDP, Does confd send them to all subscribers before save them in the built-in replay store? or Does confd save them in the built-in replay store first and then send them to all subscribers?
2#: In the case of confd receiving “create-subscription” with “startTime” parameter, what’s the reply sequence? Does confd send “rpc-reply”, "sending notifications that have been created since that startTime value ", and then “sending notification replayComplete”? or other order?
3# In the case of confd receiving “create-subscription” with “startTime” parameter,
Our issue now is that the netconf client doesn’t receive “rpc-replay” for this “create-subscription” with “startTime” parameter on time in these two cases. I was wondering whether confd can send the “rpc-reply” immediately when confd receives in these two cases? , and can be sent later till confd finishes sending notifications or completes saving notifications in the built-in replay store.
1#
When a notification has been sent it is stored in the replay store.
2#
rpc-reply
replay notifications
replayComplete
3#
ConfD does send the rpc-reply to the create-subscription before sending the replay notifications. In the log you provided earlier however, there is another session that notifications are being replayed to, which is why there are notifications being sent between the create-subscription and the rpc-reply.
@cohult , thank your confirmation.
Regarding 3#, our issue is why the rpc-reply is stuck 3m and is not sent immediately when confd receives the “create-subscription” which causes the netconf client 11 timeout. Based on the pervious comments, My understanding is:
There are notifications sent by our EDP to confd and confd sends these notifications to other subscribers like session 12, so confd has to wait for these notifications sent completely and save them to the built-in replay store, and then sent “rpc-reply” to session 11. So the “rpc-reply” is stuck. Is my understanding correct? If correct, my question is whether “rpc-reply” can be sent immediately rather than stuck? Thanks.
@cohult , I’m a bit confused now. My understand is /confdConfig/netconf/writeTimeout only affect a netconf client which can’t complete within the writeTimeout, like in my case , it only impacts on session 13. But now the question is session 14 in below log is stuck for more than 4m . Do you mean this writeTimeout impacts on all netconf clients? Does session 14 also need to wait for session 13 notification completion?
Please looking at the below logs, Before and after confd sends “rpc-reply”, there are lots of notifications sent to session 12. So how to explain why session 11 “create-subscription” is stuck for 3m in this case?
There is no /confdConfig/netconf/writeTimeout value in our device, which means we use the default value 0s.
/confdConfig/netconf/writeTimeout (xs:duration) [PT0S]
Maximum time for a write operation towards a client to complete. If the time is exceeded, the
NETCONF session is terminated. The default value is 0, which means no timeout.
Modification of this value will only affect connections that are established after the modification has
been done
Yes, once the notification from session 13 is sent or fail to be sent the rpc-reply for session 14 is sent as the log you provided read.
ConfD is likely working on sending a notification that has a log entry before the “got rpc:” log entry that you did not include in the log you provided.
“0s” means no timeout, i.e. ConfD will wait for the notification send until successful or an error such as a TCP socket close due to failing to write occurs.
@cohult , there are notifications sent to session 12 at 10:28:28.368 before “session 11 :got rpc create-subscription” at 10:29:05.460 . But after that notification, there are some netconf request/reply for this same session 12. So that notification should be sent completely, right? so why session 11 “create-subscription” still get stuck for 3m? I can’t understand this case and don’t know how to explain it to our customer.
Again, not “stuck”- waiting for the notification at 10:28:28.375 to be sent over the TCP connection to the client.
The notifications are handled by a different “process” than the <edit-config> <get>, etc. operations.
That is why you do not see any further notifications being sent or replies to create subscription requests while a notification is being sent until it is successfully sent or a timeout on the SSH or TCP level occurs as the NETCONF level write timeout is set to infinite PT0S in confd.conf.
@cohult , Got it now. Thanks a lot.
So how shall we address such issue? so for, this issue doesn’t re-occur any more. If it re-occurs, what information do we need to collect to look into further what’s the reason causing confd to wait notification completion?
Thanks.
As our support suggested, enable more ConfD logs. E.g. developer, audit, error logs.
Here the TCP connection is failing, which is a network issue. Why the TCP connection fail is outside the scope of ConfD since the TCP connection is setup by the underlying OS, e.g. Linux sockets. See the Linux documentation for your Linux version on how to debug such network issues.
If you want to give up on a failing connection faster or verify that the issue indeed is a write timeout you can set the timeout on the NETCONF level using /confdConfig/netconf/writeTimeout to, for example, PT60S.
There is no any information in confd logs when hitting this issue. Setting the timeout on the NETCONF level using /confdConfig/netconf/writeTimeout after it reoccurs, may not help since the issue does not always happen.
If there are no logs, not even a “failed to write to closed socket”, that likely means that the TCP packets was sent successfully before a timeout or other TCP connection failure occurred.
Why not set a confdConfig/netconf/writeTimeout that you always have in your configuration that is slightly shorter than the time you, on average, observe it takes for the suspected notification TCP/SSH level timeout or other connection failure to occur?