Failed get operation because of External_timeout

We are coming across this error and I was wondering if you have some feedback as to why this kind of error can occur.
The callpoint and callback are present for all the leaves in the YANG model, and it works fine under normal conditions, but this error occurred due to some abrupt behavior. (days is the first leaf in the container.)

It looks like the control socket suddenly timed out in the middle of the session; please provide some input as to why this could occur.

> May 13 13:46:04 orion confd[690]: netconf id=21 sending rpc-reply, attrs: message-id="m-0"
> May 13 13:46:04 orion confd[690]: netconf id=21 got rpc: {urn:ietf:params:xml:ns:netconf:base:1.0}get attrs: message-id="m-0"
> May 13 13:46:04 orion confd[690]: netconf id=21 get attrs: message-id="m-0"
> May 13 13:46:04 orion confd[690]: netconf id=21 sending rpc-reply, attrs: message-id="m-0"
> May 13 13:46:08 orion confd[690]: netconf id=21 ssh transport closed
> May 13 13:46:08 orion confd[690]: audit user: admin/0 Logged out ssh <local> user
> May 13 13:46:11 orion confd[690]: audit user: admin/0 logged in over ssh from 192.168.100.3 with authmeth:password
> May 13 13:46:11 orion confd[690]: devel-c Control socket request timed out daemon bristol_daemon id 0
> May 13 13:46:11 orion confd[690]: - Daemon bristol_daemon timed out
> May 13 13:46:11 orion confd[690]: devel-c new_usess error {external_timeout, ""}
> May 13 13:46:11 orion confd[690]: devel-c new_usess error {external_timeout, ""}
> May 13 13:46:11 orion confd[690]: devel-c new_usess error {external_timeout, ""}
> May 13 13:46:11 orion confd[690]: devel-c close_trans/"bristol_daemon" error {external_timeout, ""}
> May 13 13:46:11 orion confd[690]: devel-c get_elem error {external_timeout, ""} for callpoint 'system-status' path /bristol100:system/status/system-uptime/days
> May 13 13:46:11 orion confd[690]: devel-c get_elem error {external_timeout, ""} for callpoint 'system-status' path /bristol100:system/status/system-uptime/days
> May 13 13:46:11 orion confd[690]: devel-c get_elem error {external_timeout, ""} for callpoint 'system-status' path /bristol100:system/status/system-uptime/days
> May 13 13:46:11 orion confd[690]: devel-c get_elem error {external_timeout, ""} for callpoint 'system-status' path /bristol100:system/status/system-uptime/days

Thanks Pavan

Have you looked at your data provider and checked whether it is getting the request on the control socket? The data provider is most likely still up and running, since ConfD thinks there is still a registered data provider for the callpoint, but if your data provider is not able to process the messages coming in over the IPC, that would lead to these timeouts.

All the callbacks that are invoked via these sockets are subject to timeouts configured in confd.conf, see confd.conf(5). The callbacks invoked via the control socket must generate a reply back to ConfD within the time configured for /confdConfig/capi/newSessionTimeout, and the callbacks invoked via a worker socket within the time configured for /confdConfig/capi/queryTimeout. If either timeout is exceeded, the daemon will be considered dead, and ConfD will disconnect it by closing the control and worker sockets. If you have not altered the newSessionTimeout parameter, requests via the control socket will time out after 30 seconds.
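For reference, the corresponding confd.conf entries look roughly like the sketch below. The values shown are just the defaults as I recall them, so check confd.conf(5) for the exact element names, xs:duration syntax, and defaults before copying anything:

```xml
<confdConfig xmlns="http://tail-f.com/ns/confd_cfg/1.0">
  <capi>
    <!-- Reply deadline for callbacks invoked via the control socket,
         e.g. the new_usess callbacks seen in the log above -->
    <newSessionTimeout>PT30S</newSessionTimeout>
    <!-- Reply deadline for callbacks invoked via a worker socket,
         e.g. get_elem for the 'system-status' callpoint -->
    <queryTimeout>PT120S</queryTimeout>
  </capi>
</confdConfig>
```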

To get more information you should increase the verbosity of the developer log, i.e. set the developer log level to trace in your confd.conf file. You can also change the debug level parameter passed to confd_init(): the debug level can be increased by setting it to CONFD_DEBUG or CONFD_PROTO_TRACE, the latter being the most verbose.
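A minimal sketch of the confd_init() part (the daemon name is just taken from your log; the rest of the daemon setup is omitted):

```c
#include <stdio.h>
#include <confd_lib.h>

int main(void)
{
    /* Debug output goes to the given stream (stderr here; a log file
     * also works). Verbosity increases in the order
     * CONFD_SILENT < CONFD_DEBUG < CONFD_TRACE < CONFD_PROTO_TRACE.  */
    confd_init("bristol_daemon", stderr, CONFD_PROTO_TRACE);

    /* ... create control/worker sockets, confd_init_daemon(),
     *     confd_connect(), register callpoints, etc. ...             */
    return 0;
}
```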

If you have the control and worker socket(s) in the same thread, you can cause a deadlock when invoking certain functions, or your callback functions may be busy doing something and fail to respond in time. A typical case is when you, for example, create a new MAAPI user session in one of the worker socket callbacks; see the sketch below.
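To illustrate the single-threaded case, here is a rough sketch of the typical event loop (variable names and setup are placeholders, not your code):

```c
/* Single-threaded event loop: one poll() serves both the control socket
 * and the worker socket, as in the basic ConfD examples.               */
struct pollfd set[2];
set[0].fd = ctlsock;    set[0].events = POLLIN;
set[1].fd = workersock; set[1].events = POLLIN;

while (1) {
    poll(set, 2, -1);

    if (set[0].revents & POLLIN)   /* control socket: new_usess, init, ... */
        confd_fd_ready(dctx, ctlsock);
    if (set[1].revents & POLLIN)   /* worker socket: get_elem, get_next, ... */
        confd_fd_ready(dctx, workersock);
}
```

Roughly, the risk is this: if a callback invoked from confd_fd_ready(dctx, workersock) makes a blocking call that requires ConfD to talk back to this same daemon - starting a new MAAPI user session is the classic example - the single thread is stuck inside the worker callback and never returns to poll() to read the control socket. The pending control socket request (for example a new_usess callback) is then never answered, and ConfD gives up after newSessionTimeout, which is consistent with the "Control socket request timed out" and new_usess external_timeout entries in your log.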

After further investigation it looks like we are not hitting the ConfD callback timeout.

In our case we have the control and worker socket in the same thread, so we are most likely hitting the deadlock case. Could you please explain in more detail why this deadlock can occur, or point me to some other forum discussion where this is explained further? Is there any way to avoid this deadlock or resolve this issue?

Our main concern is that if we don’t understand why this deadlock happens, we cannot resolve it by using multithreading.

This issue is very difficult to reproduce; it only happens once in a while on a random device, so it is hard to trace using debug logs.

Thanks
Pavan

You can refer to Chapter 6.7, The Protocol and a Library Threads Discussion, of the ConfD User Guide for more information on single-threaded vs. multi-threaded implementation for your data provider application.
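Purely as an illustration of the multi-threaded arrangement discussed there (not code from the User Guide), one common pattern is to serve the control socket and the worker socket from different threads, so that a busy or blocked data callback cannot hold up control socket requests such as new_usess:

```c
#include <poll.h>
#include <pthread.h>
#include <confd_lib.h>
#include <confd_dp.h>

/* Daemon context and sockets, set up in main() exactly as in the
 * single-threaded case (confd_init_daemon(), confd_connect(), ...). */
static struct confd_daemon_ctx *dctx;
static int ctlsock, workersock;

/* Worker socket loop: data callbacks (get_elem, get_next, ...) run here. */
static void *worker_loop(void *arg)
{
    struct pollfd set = { .fd = workersock, .events = POLLIN };
    for (;;) {
        if (poll(&set, 1, -1) > 0 && (set.revents & POLLIN))
            confd_fd_ready(dctx, workersock);
    }
    return NULL;
}

int main(void)
{
    /* ... confd_init(), socket setup, callpoint registration ... */

    pthread_t tid;
    pthread_create(&tid, NULL, worker_loop, NULL);

    /* Control socket loop: new_usess, transaction init, etc. are answered
     * here even while a data callback in worker_loop() is busy.          */
    struct pollfd set = { .fd = ctlsock, .events = POLLIN };
    for (;;) {
        if (poll(&set, 1, -1) > 0 && (set.revents & POLLIN))
            confd_fd_ready(dctx, ctlsock);
    }
}
```

See the chapter referenced above for the rules that apply when the library is used from multiple threads.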