Create-subscription rpc gets stuck

jhu · April 24, 2022, 6:40am

We use the confd7.5.5.1 and hit this issue.
There have been some sessions that send rpc success and receive the success response and also receive the changed data notification. All works well.
At some point, a new session sends and gets timeout. and later, it re-sends and gets failure like below:


<?xml version="1.0" encoding="UTF-8"?><rpc xmlns="urn:ietf:params:xml:ns:netconf:base:1.0" message-id="206">

<create-subscription

xmlns="urn:ietf:params:xml:ns:netconf:notification:1.0">

<stream>exa-events</stream>

</create-subscription>

</rpc>

<rpc-reply xmlns="urn:ietf:params:xml:ns:netconf:base:1.0" message-id="206"><rpc-error>

<error-type>protocol</error-type>

<error-tag>operation-failed</error-tag>

<error-severity>error</error-severity>

<error-message xml:lang="en">A subscription is already active for this session</error-message></rpc-error>

</rpc-reply>

I look into the confd_netconf.log and found every timeout, there is something handling in other session. Don’t know whether this will impact the response.

Please refer to the below snippet of confd_netconf.log:
The confd took around 3m to respond to the rpc while processing the ont-arrival notification**

....
<INFO> 23-Apr-2022::10:25:47.059 eolus confd[9732]: netconf id=12 got rpc: {urn:ietf:params:xml:ns:netconf:notification:1.0}create-subscription attrs: message-id="m-62642771afa4e6201529724f-0"
<INFO> 23-Apr-2022::10:25:47.063 eolus confd[9732]: netconf id=12 create-subscription stream='exa-events' attrs: message-id="m-62642771afa4e6201529724f-0"
<INFO> 23-Apr-2022::10:25:47.067 eolus confd[9732]: netconf id=12 sending rpc-reply, attrs: message-id="m-62642771afa4e6201529724f-0"
...
...
<INFO> 23-Apr-2022::10:29:05.460 eolus confd[9732]: **netconf id=11** got rpc: {urn:ietf:params:xml:ns:netconf:notification:1.0}**create-subscription** attrs: message-id="201"
<INFO> 23-Apr-2022::10:29:05.463 eolus confd[9732]: netconf id=11 create-subscription stream='exa-events' attrs: message-id="201"
<INFO> 23-Apr-2022::10:32:09.556 eolus confd[9732]: netconf id=12 sending notification {http://www.calix.com/ns/exa/gpon-interface-base}ont-arrival
<INFO> 23-Apr-2022::10:32:09.563 eolus confd[9732]: netconf id=11 sending rpc-reply, attrs: message-id="201"

<INFO> 23-Apr-2022::10:32:09.565 eolus confd[9732]: netconf id=12 sending notification {http://www.calix.com/ns/exa/gpon-interface-base}ont-arrival

There are 2 questions:

why does create-subscription rpc get stuck when there is something handling in other session?
How to cancel subscription and then re-create subscription? now the NBI re-sends and gets failure and then stop the following process. It’s a critical issue since it stops and blocks all the following process.

Please help to address it. Thanks.

mvf · April 26, 2022, 10:51am

You may have hit a known (mis)feature: when a NETCONF session A subscribes to notifications and another session B sends a RPC that should cause a notification to be sent to A, B does not get a reply until the notification has been sent. If the session A is processing another long-running request, the reply to B is blocked. Is it possible that this is what you see?

There are some attempts to fix this scenario, but the forthcoming release 7.8 does not contain the fix yet, so you might need work around that, e.g. to split the session that causes the delay (in your log that would probably be session 12) to two, one that has these long-running requests, another that subscribes to events.

jhu · April 27, 2022, 2:39am

@mvf, Sounds like what’s your description is similar with what I have seen, but not same. Our case is:
When a NETCONF session A subscribes to notifications and our EDP has some data change like onts’ arrival or departure, so our EDP generates relative external events like “ont-arrival”, “ont-departure” to the confd, and then confd tries to send these notifications to all sessions like B, C ,etc, which has subscribed the notification.
In this case, session A will be blocked for several minutes. But it’s not always blocked in the same case.
Please take a look at the below logs, between receiving create-subscription and replying create-subscription, there are lots of notifications, but it’s very quickly.

<INFO> 25-Apr-2022::19:23:47.394 NAPd-E3-2G-023 confd[9897]: netconf id=38 sending notification {http://www.calix.com/ns/exa/gpon-interface-base}ont-arrival
<INFO> 25-Apr-2022::19:23:47.405 NAPd-E3-2G-023 confd[9897]: netconf id=38 sending notification {http://www.calix.com/ns/exa/gpon-interface-base}ont-arrival
<INFO> 25-Apr-2022::19:23:47.410 NAPd-E3-2G-023 confd[9897]: netconf id=39 got rpc: {urn:ietf:params:xml:ns:netconf:notification:1.0}create-subscription attrs: message-id="199"
<INFO> 25-Apr-2022::19:23:47.424 NAPd-E3-2G-023 confd[9897]: netconf id=38 sending notification {http://www.calix.com/ns/exa/gpon-interface-base}ont-link
<INFO> 25-Apr-2022::19:23:47.436 NAPd-E3-2G-023 confd[9897]: netconf id=38 sending notification {http://www.calix.com/ns/exa/gpon-interface-base}ont-arrival
<INFO> 25-Apr-2022::19:23:47.446 NAPd-E3-2G-023 confd[9897]: netconf id=38 sending notification {http://www.calix.com/ns/exa/gpon-interface-base}ont-arrival
<INFO> 25-Apr-2022::19:23:47.456 NAPd-E3-2G-023 confd[9897]: netconf id=38 sending notification {http://www.calix.com/ns/exa/gpon-interface-base}ont-link
<INFO> 25-Apr-2022::19:23:47.465 NAPd-E3-2G-023 confd[9897]: netconf id=38 sending notification {http://www.calix.com/ns/exa/gpon-interface-base}ont-arrival
<INFO> 25-Apr-2022::19:23:47.478 NAPd-E3-2G-023 confd[9897]: netconf id=38 sending notification {http://www.calix.com/ns/exa/gpon-interface-base}ont-arrival
<INFO> 25-Apr-2022::19:23:47.490 NAPd-E3-2G-023 confd[9897]: netconf id=11 sending notification {http://www.calix.com/ns/exa/gpon-interface-base}ont-arrival
<INFO> 25-Apr-2022::19:23:47.497 NAPd-E3-2G-023 confd[9897]: netconf id=12 sending notification {http://www.calix.com/ns/exa/gpon-interface-base}ont-arrival
<INFO> 25-Apr-2022::19:23:47.501 NAPd-E3-2G-023 confd[9897]: netconf id=38 sending notification {http://www.calix.com/ns/exa/gpon-interface-base}ont-link
<INFO> 25-Apr-2022::19:23:47.510 NAPd-E3-2G-023 confd[9897]: netconf id=38 sending notification {http://www.calix.com/ns/exa/gpon-interface-base}ont-arrival
<INFO> 25-Apr-2022::19:23:47.512 NAPd-E3-2G-023 confd[9897]: netconf id=12 got rpc: {urn:ietf:params:xml:ns:netconf:base:1.0}get attrs: message-id="m-62642771afa4e6201529724f-1663"
<INFO> 25-Apr-2022::19:23:47.515 NAPd-E3-2G-023 confd[9897]: netconf id=12 get attrs: message-id="m-62642771afa4e6201529724f-1663"
<INFO> 25-Apr-2022::19:23:47.525 NAPd-E3-2G-023 confd[9897]: netconf id=12 sending rpc-reply, attrs: message-id="m-62642771afa4e6201529724f-1663"
<INFO> 25-Apr-2022::19:23:47.526 NAPd-E3-2G-023 confd[9897]: netconf id=38 sending notification {http://www.calix.com/ns/exa/gpon-interface-base}ont-arrival
<INFO> 25-Apr-2022::19:23:47.540 NAPd-E3-2G-023 confd[9897]: netconf id=38 sending notification {http://www.calix.com/ns/exa/gpon-interface-base}ont-link
<INFO> 25-Apr-2022::19:23:47.548 NAPd-E3-2G-023 confd[9897]: netconf id=38 sending notification {http://www.calix.com/ns/exa/gpon-interface-base}ont-arrival
<INFO> 25-Apr-2022::19:23:47.558 NAPd-E3-2G-023 confd[9897]: netconf id=38 sending notification {http://www.calix.com/ns/exa/gpon-interface-base}ont-arrival
<INFO> 25-Apr-2022::19:23:47.567 NAPd-E3-2G-023 confd[9897]: netconf id=38 sending notification {http://www.calix.com/ns/exa/gpon-interface-base}ont-link
<INFO> 25-Apr-2022::19:23:47.582 NAPd-E3-2G-023 confd[9897]: netconf id=38 sending notification {http://www.calix.com/ns/exa/gpon-interface-base}ont-arrival
<INFO> 25-Apr-2022::19:23:47.602 NAPd-E3-2G-023 confd[9897]: netconf id=38 sending notification {http://www.calix.com/ns/exa/gpon-interface-base}ont-arrival
<INFO> 25-Apr-2022::19:23:47.619 NAPd-E3-2G-023 confd[9897]: netconf id=38 sending notification {http://www.calix.com/ns/exa/gpon-interface-base}ont-link
<INFO> 25-Apr-2022::19:23:47.635 NAPd-E3-2G-023 confd[9897]: netconf id=38 sending notification {http://www.calix.com/ns/exa/gpon-interface-base}ont-arrival
<INFO> 25-Apr-2022::19:23:47.651 NAPd-E3-2G-023 confd[9897]: netconf id=38 sending notification {http://www.calix.com/ns/exa/gpon-interface-base}ont-arrival
<INFO> 25-Apr-2022::19:23:47.663 NAPd-E3-2G-023 confd[9897]: netconf id=38 sending notification {http://www.calix.com/ns/exa/gpon-interface-base}ont-link
<INFO> 25-Apr-2022::19:23:47.675 NAPd-E3-2G-023 confd[9897]: netconf id=38 sending notification {http://www.calix.com/ns/exa/gpon-interface-base}ont-arrival
<INFO> 25-Apr-2022::19:23:47.687 NAPd-E3-2G-023 confd[9897]: netconf id=38 sending notification {http://www.calix.com/ns/exa/gpon-interface-base}ont-arrival
<INFO> 25-Apr-2022::19:23:47.699 NAPd-E3-2G-023 confd[9897]: netconf id=38 sending notification {http://www.calix.com/ns/exa/gpon-interface-base}ont-link
<INFO> 25-Apr-2022::19:23:47.710 NAPd-E3-2G-023 confd[9897]: netconf id=38 sending notification {http://www.calix.com/ns/exa/gpon-interface-base}ont-arrival
<INFO> 25-Apr-2022::19:23:47.724 NAPd-E3-2G-023 confd[9897]: netconf id=38 sending notification {http://www.calix.com/ns/exa/gpon-interface-base}ont-arrival
<INFO> 25-Apr-2022::19:23:47.735 NAPd-E3-2G-023 confd[9897]: netconf id=38 sending notification {http://www.calix.com/ns/exa/gpon-interface-base}ont-link
<INFO> 25-Apr-2022::19:23:47.756 NAPd-E3-2G-023 confd[9897]: netconf id=38 sending notification {http://www.calix.com/ns/exa/gpon-interface-base}ont-arrival
<INFO> 25-Apr-2022::19:23:47.771 NAPd-E3-2G-023 confd[9897]: netconf id=38 sending notification {http://www.calix.com/ns/exa/gpon-interface-base}ont-arrival
<INFO> 25-Apr-2022::19:23:47.788 NAPd-E3-2G-023 confd[9897]: netconf id=38 sending notification {http://www.calix.com/ns/exa/gpon-interface-base}ont-link
<INFO> 25-Apr-2022::19:23:47.800 NAPd-E3-2G-023 confd[9897]: netconf id=38 sending notification {http://www.calix.com/ns/exa/gpon-interface-base}ont-arrival
<INFO> 25-Apr-2022::19:23:47.812 NAPd-E3-2G-023 confd[9897]: netconf id=38 sending notification {http://www.calix.com/ns/exa/gpon-interface-base}ont-arrival
<INFO> 25-Apr-2022::19:23:47.823 NAPd-E3-2G-023 confd[9897]: netconf id=38 sending notification {http://www.calix.com/ns/exa/gpon-interface-base}ont-link
<INFO> 25-Apr-2022::19:23:47.833 NAPd-E3-2G-023 confd[9897]: netconf id=38 sending notification {http://www.calix.com/ns/exa/gpon-interface-base}ont-arrival
<INFO> 25-Apr-2022::19:23:47.844 NAPd-E3-2G-023 confd[9897]: netconf id=38 sending notification {http://www.calix.com/ns/exa/gpon-interface-base}ont-arrival
<INFO> 25-Apr-2022::19:23:47.856 NAPd-E3-2G-023 confd[9897]: netconf id=38 sending notification {http://www.calix.com/ns/exa/gpon-interface-base}ont-link
<INFO> 25-Apr-2022::19:23:47.867 NAPd-E3-2G-023 confd[9897]: netconf id=38 sending notification {http://www.calix.com/ns/exa/gpon-interface-base}ont-arrival
<INFO> 25-Apr-2022::19:23:47.879 NAPd-E3-2G-023 confd[9897]: netconf id=38 sending notification {http://www.calix.com/ns/exa/gpon-interface-base}ont-arrival
<INFO> 25-Apr-2022::19:23:47.889 NAPd-E3-2G-023 confd[9897]: netconf id=38 sending notification {http://www.calix.com/ns/exa/gpon-interface-base}ont-link
<INFO> 25-Apr-2022::19:23:47.899 NAPd-E3-2G-023 confd[9897]: netconf id=38 sending notification {http://www.calix.com/ns/exa/gpon-interface-base}ont-arrival
<INFO> 25-Apr-2022::19:23:47.907 NAPd-E3-2G-023 confd[9897]: netconf id=38 sending notification {http://www.calix.com/ns/exa/gpon-interface-base}ont-arrival
<INFO> 25-Apr-2022::19:23:47.915 NAPd-E3-2G-023 confd[9897]: netconf id=38 sending notification {http://www.calix.com/ns/exa/gpon-interface-base}ont-link
<INFO> 25-Apr-2022::19:23:47.922 NAPd-E3-2G-023 confd[9897]: netconf id=38 sending notification {http://www.calix.com/ns/exa/gpon-interface-base}ont-arrival
<INFO> 25-Apr-2022::19:23:47.930 NAPd-E3-2G-023 confd[9897]: netconf id=38 sending notification {http://www.calix.com/ns/exa/gpon-interface-base}ont-arrival
<INFO> 25-Apr-2022::19:23:47.937 NAPd-E3-2G-023 confd[9897]: netconf id=38 sending notification {http://www.calix.com/ns/exa/gpon-interface-base}ont-link
<INFO> 25-Apr-2022::19:23:47.945 NAPd-E3-2G-023 confd[9897]: netconf id=38 sending notification {http://www.calix.com/ns/exa/gpon-interface-base}ont-arrival
<INFO> 25-Apr-2022::19:23:47.958 NAPd-E3-2G-023 confd[9897]: netconf id=38 sending notification {http://www.calix.com/ns/exa/gpon-interface-base}ont-arrival
<INFO> 25-Apr-2022::19:23:47.965 NAPd-E3-2G-023 confd[9897]: netconf id=38 sending notification {http://www.calix.com/ns/exa/gpon-interface-base}ont-link
<INFO> 25-Apr-2022::19:23:47.972 NAPd-E3-2G-023 confd[9897]: netconf id=38 sending notification {http://www.calix.com/ns/exa/gpon-interface-base}ont-arrival
<INFO> 25-Apr-2022::19:23:47.984 NAPd-E3-2G-023 confd[9897]: netconf id=38 sending notification {http://www.calix.com/ns/exa/gpon-interface-base}ont-arrival
<INFO> 25-Apr-2022::19:23:47.994 NAPd-E3-2G-023 confd[9897]: netconf id=38 sending notification {http://www.calix.com/ns/exa/gpon-interface-base}ont-link
<INFO> 25-Apr-2022::19:23:48.005 NAPd-E3-2G-023 confd[9897]: netconf id=39 create-subscription stream='exa-events' attrs: message-id="199"
<INFO> 25-Apr-2022::19:23:48.008 NAPd-E3-2G-023 confd[9897]: netconf id=38 sending notification {http://www.calix.com/ns/exa/gpon-interface-base}ont-arrival
<INFO> 25-Apr-2022::19:23:48.020 NAPd-E3-2G-023 confd[9897]: netconf id=39 sending rpc-reply, attrs: message-id="199"
<INFO> 25-Apr-2022::19:23:48.024 NAPd-E3-2G-023 confd[9897]: netconf id=38 sending notification {http://www.calix.com/ns/exa/gpon-interface-base}ont-arrival
<INFO> 25-Apr-2022::19:23:48.040 NAPd-E3-2G-023 confd[9897]: netconf id=38 sending notification {http://www.calix.com/ns/exa/gpon-interface-base}ont-link
<INFO> 25-Apr-2022::19:23:48.062 NAPd-E3-2G-023 confd[9897]: netconf id=38 sending notification {http://www.calix.com/ns/exa/gpon-interface-base}ont-arrival
<INFO> 25-Apr-2022::19:23:48.070 NAPd-E3-2G-023 confd[9897]: netconf id=39 sending notification {http://www.calix.com/ns/exa/base}environment-door
...
...

so there is no long-running requests from the netconf session, right? any idea?
Thanks a lot.

mvf · April 27, 2022, 12:55pm

I suggest you take a look at other logs, like devel.log (possibly with developerLogLevel set to trace), to inspect what is going on in case you wait so long for a reply. You may also try NETCONF trace log; but if there is indeed a problem with large messages sent between ConfD and one of the daemons, the trace log will also become quickly very large.

jhu · April 28, 2022, 1:58am

Thanks @mvf , there is no any logs in the confd_developer.log.
Could you help me to understand what’s the meaning of “notification replay”?

...
<INFO> 23-Apr-2022::10:32:21.379 eolus confd[9732]: netconf id=11 sending notification {http://www.calix.com/ns/exa/gpon-interface-base}ont-arrival
<INFO> 23-Apr-2022::10:32:21.391 eolus confd[9732]: netconf id=11 sending notification {http://www.calix.com/ns/exa/gpon-interface-base}ont-arrival
<INFO> 23-Apr-2022::10:32:21.399 eolus confd[9732]: netconf id=11 sending notification {http://www.calix.com/ns/exa/gpon-interface-base}ont-arrival
<INFO> 23-Apr-2022::10:32:21.406 eolus confd[9732]: netconf id=11 sending notification {http://www.calix.com/ns/exa/gpon-interface-base}ont-arrival
<INFO> 23-Apr-2022::10:32:21.409 eolus confd[9732]: netconf id=11 sending notification replayComplete
..

I see if a session sends “create-subscription”, the confd sends rpc-reply and also sends lots of notifications to this session even at that time there is no data changes from our EDP. I don’t know how the confd gets these data changes or events raised earlier by our EDP. Is there any callback to our EDP to retrieve that? I don’t find that.
Thanks.

mvf · April 28, 2022, 10:24am

That might well explain the problem you see. ConfD supports notification replay, it can be either built-in (which is probably your case) or custom. If configured, when a NETCONF client sends a create-subscription message with parameter startTime, the server is supposed to immediately start streaming notifications that has been created since that startTime value; see the RFC section. If your replay buffer is large, this becomes just another instance of the problem with NETCONF sessions blocked due to notifications.

You have a couple of configuration options:

Turn of notifications replay altogether (for your stream). In confd.conf, just modify replaySupport to false for your stream - right now, you obviously have this there:
```
  <replaySupport>true</replaySupport>
```
Limit the replay size; there are two parameters affecting that, maxSize and maxFiles. I’m not really familiar with how they work, you might need to experiment to find reasonable values.
Implement your own custom replay where you can limit replays as you please.
When creating a subscription, either leave out startTime paramenter (without the paramenter the NETCONF server does not do any replay) or limit it to a not-so-distant past so that the number of notifications is small.

The feature is well documented in the ConfD user guide, and the NETCONF notification example shows how to use it.

jhu · July 4, 2022, 2:50am

@mvf , thanks a lot for your elaboration. I got understanding of notification replay now. Yes, there are a couple of configuration options on our system:

 <notifications>
    <eventStreams>
      <stream>
        <name>exa-events</name>
        <description>EXA notifications</description>
        <replaySupport>true</replaySupport>
        <builtinReplayStore>
          <enabled>true</enabled>
          <dir>/tmp/confd/event-store</dir>
          <maxSize>1M</maxSize>
          <maxFiles>10</maxFiles>
        </builtinReplayStore>
      </stream>
    </eventStreams>
  </notifications>

Sorry to reply so late, for I was busy with other features and now I am back to this issue.
One question here:
when a NETCONF client A sends a create-subscription message with parameter , the CONFD will read the history notifications from /tmp/confd/event-store, right?
If at the same time, another session B triggers a notification, does CONFD write this notification into /tmp/confd/event-store and lock the notification to cause session A stuck? Is it possible ?
Thanks.

cohult · July 5, 2022, 1:33pm

Depends on the <startTime> parameter in the <create-subscription>. See RFC 5277 - NETCONF Event Notifications

Saw your closed Tail-f support ticket. In that log you had a socket failure on a send notification to a NETCONF notification subscriber, so ConfD is likely waiting for that notification to finish sending before replying to the create-subscription that asks for a replay of notifications in the built-in replay store.

Because of the flow control in SSH, ConfD uses /confdConfig/netconf/writeTimeout to terminate the irresponsive session so that ConfD doesn’t get blocked. It means if the notification subscriber is not handling notifications, ConfD will not be able to write to the SSH connection anymore, then this timeout kicks in and drops the subscription session.

You can use the examples.confd/netconf_notifications example, set /confdConfig/netconf/writeTimeout to “PT4S”, then suspended the NC subscriber and keep sending notifications. You will likely eventually get something like the following in netconf.log:

<INFO> 9-Oct-2018::21:03:01.007 Workstation confd[31119]: netconf id=13 sending notification {http://tail-f.com/ns/test/notif}linkUp
<INFO> 9-Oct-2018::21:03:05.010 Workstation confd[31119]: netconf id=13 failed to write to closed socket

jhu · July 8, 2022, 2:50am

@cohult , @mvf , I see. Thank your elaboration of this socket failure case.
We hit the “create-subscription” stuck issue in 2 cases so far. One case shows there is A session socket failure, and meanwhile, B session stuck. The other case shows there are lots of notifications like lots of onts arrival and departure posted by our EDP to confd, which will trigger confd to send these notifications to all subscribers, and at this time, a session “create-subscription” stuck.
So can I understand these two cases have the same root cause? I mean the root cause is just as what you said " ConfD is likely waiting for that notification to finish sending before replying to the create-subscription that asks for a replay of notifications in the built-in replay store."

If so, here are another 3 questions:
1#: In the case of confd receiving notifications from our EDP, Does confd send them to all subscribers before save them in the built-in replay store? or Does confd save them in the built-in replay store first and then send them to all subscribers?
2#: In the case of confd receiving “create-subscription” with “startTime” parameter, what’s the reply sequence? Does confd send “rpc-reply”, "sending notifications that have been created since that startTime value ", and then “sending notification replayComplete”? or other order?
3# In the case of confd receiving “create-subscription” with “startTime” parameter,
Our issue now is that the netconf client doesn’t receive “rpc-replay” for this “create-subscription” with “startTime” parameter on time in these two cases. I was wondering whether confd can send the “rpc-reply” immediately when confd receives in these two cases? , and can be sent later till confd finishes sending notifications or completes saving notifications in the built-in replay store.

Much appreciated your patient help.

cohult · July 9, 2022, 11:53am

1#
When a notification has been sent it is stored in the replay store.

2#

rpc-reply
replay notifications
replayComplete

3#
ConfD does send the rpc-reply to the create-subscription before sending the replay notifications. In the log you provided earlier however, there is another session that notifications are being replayed to, which is why there are notifications being sent between the create-subscription and the rpc-reply.

jhu · July 11, 2022, 1:44am

@cohult , thank your confirmation.
Regarding 3#, our issue is why the rpc-reply is stuck 3m and is not sent immediately when confd receives the “create-subscription” which causes the netconf client 11 timeout. Based on the pervious comments, My understanding is：
There are notifications sent by our EDP to confd and confd sends these notifications to other subscribers like session 12, so confd has to wait for these notifications sent completely and save them to the built-in replay store, and then sent “rpc-reply” to session 11. So the “rpc-reply” is stuck. Is my understanding correct? If correct, my question is whether “rpc-reply” can be sent immediately rather than stuck? Thanks.

jhu:

<INFO> 23-Apr-2022::10:29:05.460 eolus confd[9732]: **netconf id=11** got rpc: {urn:ietf:params:xml:ns:netconf:notification:1.0}**create-subscription** attrs: message-id="201"
<INFO> 23-Apr-2022::10:29:05.463 eolus confd[9732]: netconf id=11 create-subscription stream='exa-events' attrs: message-id="201"
<INFO> 23-Apr-2022::10:32:09.556 eolus confd[9732]: netconf id=12 sending notification {http://www.calix.com/ns/exa/gpon-interface-base}ont-arrival
<INFO> 23-Apr-2022::10:32:09.563 eolus confd[9732]: netconf id=11 sending rpc-reply, attrs: message-id="201"

cohult · July 11, 2022, 7:14pm

So the “rpc-reply” is stuck. Is my understanding correct?

No, not “stuck”, waiting for the write timeout that you configure in confd.conf under /confdConfig/netconf/writeTimeout as I wrote previously.

jhu · July 13, 2022, 1:36am

@cohult , I’m a bit confused now. My understand is /confdConfig/netconf/writeTimeout only affect a netconf client which can’t complete within the writeTimeout, like in my case , it only impacts on session 13. But now the question is session 14 in below log is stuck for more than 4m . Do you mean this writeTimeout impacts on all netconf clients? Does session 14 also need to wait for session 13 notification completion?

<INFO> 18-Apr-2022::22:44:16.016 eolus confd[9548]: netconf id=14 create-subscription stream='exa-events' attrs: message-id="199"
<INFO> 18-Apr-2022::22:48:51.635 eolus confd[9548]: netconf id=13 failed to write to closed socket
<INFO> 18-Apr-2022::22:48:51.636 eolus confd[9548]: netconf id=13 failed to write to closed socket
<INFO> 18-Apr-2022::22:48:51.645 eolus confd[9548]: netconf id=14 sending rpc-reply, attrs: message-id="199"

Please looking at the below logs, Before and after confd sends “rpc-reply”, there are lots of notifications sent to session 12. So how to explain why session 11 “create-subscription” is stuck for 3m in this case?

<INFO> 23-Apr-2022::10:29:05.460 eolus confd[9732]: netconf id=11 got rpc: {urn:ietf:params:xml:ns:netconf:notification:1.0}create-subscription attrs: message-id="201"
<INFO> 23-Apr-2022::10:29:05.463 eolus confd[9732]: netconf id=11 create-subscription stream='exa-events' attrs: message-id="201"
<INFO> 23-Apr-2022::10:32:09.556 eolus confd[9732]: netconf id=12 sending notification {http://www.calix.com/ns/exa/gpon-interface-base}ont-arrival
<INFO> 23-Apr-2022::10:32:09.563 eolus confd[9732]: netconf id=11 sending rpc-reply, attrs: message-id="201"
<INFO> 23-Apr-2022::10:32:09.565 eolus confd[9732]: netconf id=12 sending notification {http://www.calix.com/ns/exa/gpon-interface-base}ont-arrival
<INFO> 23-Apr-2022::10:32:09.571 eolus confd[9732]: netconf id=11 got rpc: {urn:ietf:params:xml:ns:netconf:base:1.0}edit-config attrs: message-id="202"
<INFO> 23-Apr-2022::10:32:09.577 eolus confd[9732]: netconf id=12 sending notification {http://www.calix.com/ns/exa/gpon-interface-base}ont-link
<INFO> 23-Apr-2022::10:32:09.589 eolus confd[9732]: netconf id=12 sending notification {http://www.calix.com/ns/exa/gpon-interface-base}ont-arrival
<INFO> 23-Apr-2022::10:32:09.599 eolus confd[9732]: netconf id=12 sending notification {http://www.calix.com/ns/exa/gpon-interface-base}ont-arrival
<INFO> 23-Apr-2022::10:32:09.614 eolus confd[9732]: netconf id=12 sending notification {http://www.calix.com/ns/exa/gpon-interface-base}ont-arrival

There is no /confdConfig/netconf/writeTimeout value in our device, which means we use the default value 0s.

/confdConfig/netconf/writeTimeout (xs:duration) [PT0S]
Maximum time for a write operation towards a client to complete. If the time is exceeded, the
NETCONF session is terminated. The default value is 0, which means no timeout.
Modification of this value will only affect connections that are established after the modification has
been done

Please help to clarify again . Thanks a lot.

cohult · July 13, 2022, 2:54pm

Yes, once the notification from session 13 is sent or fail to be sent the rpc-reply for session 14 is sent as the log you provided read.

ConfD is likely working on sending a notification that has a log entry before the “got rpc:” log entry that you did not include in the log you provided.

“0s” means no timeout, i.e. ConfD will wait for the notification send until successful or an error such as a TCP socket close due to failing to write occurs.

jhu · July 14, 2022, 1:18am

@cohult , there are notifications sent to session 12 at 10:28:28.368 before “session 11 :got rpc create-subscription” at 10:29:05.460 . But after that notification, there are some netconf request/reply for this same session 12. So that notification should be sent completely, right? so why session 11 “create-subscription” still get stuck for 3m? I can’t understand this case and don’t know how to explain it to our customer.

23-Apr-2022::10:28:27.879 eolus confd[9732]: netconf id=12 sending rpc-reply, attrs: message-id=“m-62642771afa4e6201529724f-177”
23-Apr-2022::10:28:27.996 eolus confd[9732]: netconf id=12 got rpc: {urn:ietf:params:xml:ns:netconf:base:1.0}get attrs: message-id=“m-62642771afa4e6201529724f-178”
23-Apr-2022::10:28:27.999 eolus confd[9732]: netconf id=12 get attrs: message-id=“m-62642771afa4e6201529724f-178”
23-Apr-2022::10:28:28.011 eolus confd[9732]: netconf id=12 sending rpc-reply, attrs: message-id=“m-62642771afa4e6201529724f-178”
23-Apr-2022::10:28:28.092 eolus confd[9732]: netconf id=12 got rpc: {urn:ietf:params:xml:ns:netconf:base:1.0}get attrs: message-id=“m-62642771afa4e6201529724f-179”
23-Apr-2022::10:28:28.095 eolus confd[9732]: netconf id=12 get attrs: message-id=“m-62642771afa4e6201529724f-179”
23-Apr-2022::10:28:28.100 eolus confd[9732]: netconf id=12 sending rpc-reply, attrs: message-id=“m-62642771afa4e6201529724f-179”
23-Apr-2022::10:28:28.368 eolus confd[9732]: netconf id=12 sending notification {http://www.calix.com/ns/exa/gpon-interface-base}ont-arrival
23-Apr-2022::10:28:28.375 eolus confd[9732]: netconf id=12 sending notification {http://www.calix.com/ns/exa/gpon-interface-base}ont-link
23-Apr-2022::10:28:28.388 eolus confd[9732]: netconf id=12 got rpc: {urn:ietf:params:xml:ns:netconf:base:1.0}get attrs: message-id=“m-62642771afa4e6201529724f-180”
23-Apr-2022::10:28:28.390 eolus confd[9732]: netconf id=12 get attrs: message-id=“m-62642771afa4e6201529724f-180”
23-Apr-2022::10:28:28.397 eolus confd[9732]: netconf id=12 sending rpc-reply, attrs: message-id=“m-62642771afa4e6201529724f-180”
23-Apr-2022::10:28:28.548 eolus confd[9732]: netconf id=12 got rpc: {urn:ietf:params:xml:ns:netconf:base:1.0}get attrs: message-id=“m-62642771afa4e6201529724f-181”
23-Apr-2022::10:28:28.551 eolus confd[9732]: netconf id=12 get attrs: message-id=“m-62642771afa4e6201529724f-181”
23-Apr-2022::10:28:28.559 eolus confd[9732]: netconf id=12 sending rpc-reply, attrs: message-id=“m-62642771afa4e6201529724f-181”
23-Apr-2022::10:28:28.795 eolus confd[9732]: netconf id=12 got rpc: {urn:ietf:params:xml:ns:netconf:base:1.0}get attrs: message-id=“m-62642771afa4e6201529724f-182”
23-Apr-2022::10:28:28.797 eolus confd[9732]: netconf id=12 get attrs: message-id=“m-62642771afa4e6201529724f-182”
23-Apr-2022::10:28:28.801 eolus confd[9732]: netconf id=12 sending rpc-reply, attrs: message-id=“m-62642771afa4e6201529724f-182”
23-Apr-2022::10:28:29.019 eolus confd[9732]: netconf id=12 got rpc: {urn:ietf:params:xml:ns:netconf:base:1.0}get attrs: message-id=“m-62642771afa4e6201529724f-183”
23-Apr-2022::10:28:29.021 eolus confd[9732]: netconf id=12 get attrs: message-id=“m-62642771afa4e6201529724f-183”
23-Apr-2022::10:28:29.026 eolus confd[9732]: netconf id=12 sending rpc-reply, attrs: message-id=“m-62642771afa4e6201529724f-183”
23-Apr-2022::10:28:29.126 eolus confd[9732]: netconf id=12 got rpc: {urn:ietf:params:xml:ns:netconf:base:1.0}get attrs: message-id=“m-62642771afa4e6201529724f-184”
23-Apr-2022::10:28:29.128 eolus confd[9732]: netconf id=12 get attrs: message-id=“m-62642771afa4e6201529724f-184”
23-Apr-2022::10:28:29.134 eolus confd[9732]: netconf id=12 sending rpc-reply, attrs: message-id=“m-62642771afa4e6201529724f-184”
23-Apr-2022::10:28:29.326 eolus confd[9732]: netconf id=12 got rpc: {urn:ietf:params:xml:ns:netconf:base:1.0}get attrs: message-id=“m-62642771afa4e6201529724f-185”
23-Apr-2022::10:28:29.329 eolus confd[9732]: netconf id=12 get attrs: message-id=“m-62642771afa4e6201529724f-185”
23-Apr-2022::10:28:29.333 eolus confd[9732]: netconf id=12 sending rpc-reply, attrs: message-id=“m-62642771afa4e6201529724f-185”
23-Apr-2022::10:28:29.559 eolus confd[9732]: netconf id=12 got rpc: {urn:ietf:params:xml:ns:netconf:base:1.0}get attrs: message-id=“m-62642771afa4e6201529724f-186”
23-Apr-2022::10:28:29.562 eolus confd[9732]: netconf id=12 get attrs: message-id=“m-62642771afa4e6201529724f-186”
23-Apr-2022::10:28:29.564 eolus confd[9732]: netconf id=12 sending rpc-reply, attrs: message-id=“m-62642771afa4e6201529724f-186”
23-Apr-2022::10:29:05.102 eolus confd[9732]: netconf id=11 got rpc: {http://www.calix.com/ns/exa/base}show-alarm-instances-active-range attrs: message-id=“198”
23-Apr-2022::10:29:05.127 eolus confd[9732]: netconf id=11 action name={‘http://www.calix.com/ns/exa/base’}'show-alarm-instances-active-range’ instance-identifier=, attrs: message-id=“198”
23-Apr-2022::10:29:05.175 eolus confd[9732]: netconf id=11 sending rpc-reply, attrs: message-id=“198”
23-Apr-2022::10:29:05.198 eolus confd[9732]: netconf id=11 got rpc: {urn:ietf:params:xml:ns:netconf:base:1.0}edit-config attrs: message-id=“199”
23-Apr-2022::10:29:05.259 eolus confd[9732]: netconf id=11 edit-config target=running attrs: message-id=“199”
23-Apr-2022::10:29:05.335 eolus confd[9732]: netconf id=11 sending rpc-reply, attrs: message-id=“199”
23-Apr-2022::10:29:05.354 eolus confd[9732]: netconf id=11 got rpc: {urn:ietf:params:xml:ns:netconf:base:1.0}edit-config attrs: message-id=“200”
23-Apr-2022::10:29:05.422 eolus confd[9732]: netconf id=11 edit-config target=running attrs: message-id=“200”
23-Apr-2022::10:29:05.450 eolus confd[9732]: netconf id=11 sending rpc-reply, attrs: message-id=“200”
23-Apr-2022::10:29:05.460 eolus confd[9732]: netconf id=11 got rpc: {urn:ietf:params:xml:ns:netconf:notification:1.0}create-subscription attrs: message-id=“201”
23-Apr-2022::10:29:05.463 eolus confd[9732]: netconf id=11 create-subscription stream=‘exa-events’ attrs: message-id=“201”
23-Apr-2022::10:32:09.556 eolus confd[9732]: netconf id=12 sending notification {http://www.calix.com/ns/exa/gpon-interface-base}ont-arrival
23-Apr-2022::10:32:09.563 eolus confd[9732]: netconf id=11 sending rpc-reply, attrs: message-id=“201”

cohult · July 15, 2022, 3:07pm

Again, not “stuck”- waiting for the notification at 10:28:28.375 to be sent over the TCP connection to the client.

The notifications are handled by a different “process” than the <edit-config> <get>, etc. operations.

That is why you do not see any further notifications being sent or replies to create subscription requests while a notification is being sent until it is successfully sent or a timeout on the SSH or TCP level occurs as the NETCONF level write timeout is set to infinite PT0S in confd.conf.

jhu · July 18, 2022, 7:42am

@cohult , Got it now. Thanks a lot.
So how shall we address such issue? so for, this issue doesn’t re-occur any more. If it re-occurs, what information do we need to collect to look into further what’s the reason causing confd to wait notification completion?
Thanks.

cohult · July 18, 2022, 9:58am

As our support suggested, enable more ConfD logs. E.g. developer, audit, error logs.

Here the TCP connection is failing, which is a network issue. Why the TCP connection fail is outside the scope of ConfD since the TCP connection is setup by the underlying OS, e.g. Linux sockets. See the Linux documentation for your Linux version on how to debug such network issues.

If you want to give up on a failing connection faster or verify that the issue indeed is a write timeout you can set the timeout on the NETCONF level using /confdConfig/netconf/writeTimeout to, for example, PT60S.

jhu · July 19, 2022, 3:02am

There is no any information in confd logs when hitting this issue. Setting the timeout on the NETCONF level using /confdConfig/netconf/writeTimeout after it reoccurs, may not help since the issue does not always happen.

cohult · July 19, 2022, 9:26am

If there are no logs, not even a “failed to write to closed socket”, that likely means that the TCP packets was sent successfully before a timeout or other TCP connection failure occurred.

Why not set a confdConfig/netconf/writeTimeout that you always have in your configuration that is slightly shorter than the time you, on average, observe it takes for the suspected notification TCP/SSH level timeout or other connection failure to occur?