Callpoint daemon times out and ConfD closes the control socket

venkonan · April 8, 2020, 4:36pm

Hi Team,

I have a scenario where the control socket is closed and the client is not aware of it, so it fails to re-register.

We have registered a daemon for operational data using the dataprovider APIs. During application startup the daemon is initialized and starts listening for opdata callbacks. When an external timeout occurs, for example while querying data from an external database, ConfD closes the socket, an exception is thrown to the client, and the client re-registers the daemon. So far no issues.
But for some reason (no clue why), under high CPU load ConfD closes the control socket (reporting an external timeout for a new session) and the client cannot determine whether the daemon is alive or dead.
Could you please let us know how to determine the status of the daemon and re-register it in this case?
Also, is there any limit on the number of callpoints processed at a time (for the same or different callpoints)?

Below is the exception we got from ConfD. We have code that catches ConfException and re-registers, but we got a RuntimeException instead.

java.lang.RuntimeException: Error in DpTrans worker
at com.tailf.dp.DpTrans.run(DpTrans.java:387)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
at com.tailf.dp.DpThread.run(DpThread.java:41)
Caused by: com.tailf.conf.ConfException: unexpected end of file
at com.tailf.conf.ConfInternal.readFill(ConfInternal.java:414)
at com.tailf.conf.ConfInternal.termRead(ConfInternal.java:184)
at com.tailf.conf.ConfInternal.termRead(ConfInternal.java:112)
at com.tailf.dp.DpTrans.read(DpTrans.java:424)
at com.tailf.dp.DpTrans.run(DpTrans.java:374)
… 4 more
Caused by: java.io.EOFException

Thanks,
-Venkat

cohult · April 8, 2020, 6:21pm

Hi,

See examples.confd/intro/java/9-threads/ThrDaemon.java main() and createDaemon() functions for an example.

Best regards

venkonan · April 9, 2020, 5:16am

Hi,

Can you please let me know why ConfD throws java.lang.RuntimeException: Error in DpTrans worker?
As I already mentioned, we have re-registration code in place for ConfException and IOException while listening (dp.read()). But in this case ConfD throws a RuntimeException, and hence we fail to re-register.

Thanks,
-Venkat

mvf · April 9, 2020, 9:23am

External timeout for new session most likely means that your Dp instance fails to call Dp.read() before newSessionTimeout (defaults to 15 seconds) runs out. If that happens, ConfD closes all related sockets and running workers can expect all kinds of exceptions. Also, since your Dp.read() is not running at this moment, or is not currently waiting on the control socket event, you do not get the exception immediately; you would get it in the subsequent call to Dp.read() though.
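To illustrate the point about the next read failing once ConfD has closed the sockets, here is a minimal sketch of a read loop that re-registers after any failure. It deliberately does not use the ConfD classes: `DaemonSession` and `DaemonFactory` are hypothetical stand-ins for "a registered Dp instance" and "the code that opens the control socket and registers the callbacks", and `IOException` stands in for whatever checked exceptions the real read call declares. Catching unchecked exceptions as well is an assumption about what you want, so that an unexpected error cannot end the loop silently.

```java
import java.io.IOException;

final class ReRegisteringLoop {
    /**
     * Runs the read loop and re-registers after every failure, at most
     * maxRegistrations times. Returns the number of registrations performed.
     */
    static int run(DaemonFactory factory, int maxRegistrations, long backoffMillis) {
        int registrations = 0;
        while (registrations < maxRegistrations) {
            registrations++;
            try (DaemonSession session = factory.register()) {
                while (true) {
                    session.read(); // fails once the control socket is gone
                }
            } catch (IOException | RuntimeException e) {
                System.err.println("daemon lost (" + e + "), re-registering");
                try {
                    Thread.sleep(backoffMillis); // avoid a tight retry loop
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    return registrations;
                }
            }
        }
        return registrations;
    }

    public static void main(String[] args) {
        // Fake session for demonstration: the third read() fails, as if
        // the control socket had been closed by the peer.
        DaemonFactory fake = () -> new DaemonSession() {
            private int reads = 0;
            @Override public void read() throws IOException {
                if (++reads == 3) throw new IOException("control socket closed");
            }
            @Override public void close() { }
        };
        System.out.println("registrations: " + run(fake, 2, 100));
    }
}

/** Hypothetical stand-in for a registered daemon; not part of the ConfD API. */
interface DaemonSession extends AutoCloseable {
    /** Blocks for one control socket event. */
    void read() throws IOException;
    @Override void close();
}

/** Hypothetical: opens the control socket and registers the callbacks. */
interface DaemonFactory {
    DaemonSession register() throws IOException;
}
```

In a real daemon the factory would wrap your existing initialization code, and the loop would typically have no upper bound on registrations; the bound here only keeps the demo finite.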

venkonan · April 9, 2020, 6:30pm

We got a RuntimeException, which the method signature of read() does not declare.
The client expects only ConfException and IOException.

mvf · April 9, 2020, 8:11pm

But the RuntimeException you show does not come from dp.read(), or does it? The message you posted says:

java.lang.RuntimeException: Error in DpTrans worker

So it is in the worker thread, not in the main thread.
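The distinction matters because a try/catch in the main thread never sees an exception thrown in another thread. The following plain-Java sketch (no ConfD classes involved; the thread and its error message are simulated) shows one way the main thread can still learn about such a failure, using `Thread.setUncaughtExceptionHandler`. Whether the exception from the ConfD worker actually reaches an uncaught-exception handler depends on the library internals, so treat this as an assumption to verify against the source.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

final class WorkerFailureDemo {
    /**
     * Starts a worker that fails and returns the Throwable seen by the
     * handler, or null if nothing arrived within five seconds.
     */
    static Throwable runFailingWorker() {
        AtomicReference<Throwable> failure = new AtomicReference<>();
        CountDownLatch done = new CountDownLatch(1);

        Thread worker = new Thread(() -> {
            // Simulated failure; the message only mimics the one in this thread.
            throw new RuntimeException("Error in DpTrans worker");
        }, "worker");

        // The handler runs on the dying worker thread and hands the error over.
        worker.setUncaughtExceptionHandler((thread, error) -> {
            failure.set(error);
            done.countDown();
        });

        worker.start(); // returns normally: the exception does not surface here
        try {
            done.await(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return failure.get();
    }

    public static void main(String[] args) {
        System.out.println("main thread learned about: " + runFailingWorker());
    }
}
```

If you do not create the threads yourself, `Thread.setDefaultUncaughtExceptionHandler` is the JVM-wide variant of the same mechanism; the handler could set a flag that your main loop checks before deciding to re-register.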

venkonan · April 10, 2020, 4:24am

Yes, it comes from the DpTrans worker. When there is an error in a worker thread, the control socket gets closed, right?

cohult · April 10, 2020, 9:57am

You have the source code, so perhaps you will be able to demystify & understand the reason for the ConfD Java API throwing the runtime exception “Error in DpTrans worker” if you search for that string in confd-api-src/src/com/tailf/dp/DpTrans.java

$ cd $CONFD_DIR/java/jar/
$ mkdir confd-api-src && cd confd-api-src && jar -xvf ../confd-src-7.x.x.jar && cd -
$ cat confd-api-src/src/com/tailf/dp/DpTrans.java

mvf · April 14, 2020, 12:09pm

And that was my point: since the control socket is closed, your subsequent call to dp.read() would throw an IOException, and, as you write elsewhere, your re-registration code should trigger. Do you observe something different?

Note that this is not the only approach; if things go so wrong it might be a bit safer to simply let the whole application terminate and have a watchdog restart it.
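For the watchdog variant, a minimal sketch could look like the following. It is only an illustration: the `App` interface and the restart limit are hypothetical choices, and in practice an existing supervisor (an init system, a container runtime) may be a better fit than a hand-written one. The supervised application is expected to exit with a non-zero status when it detects an unrecoverable error; exit status 0 is treated as a deliberate shutdown.

```java
import java.io.IOException;

final class Watchdog {
    /** Hypothetical abstraction: start the application and block until it exits. */
    interface App {
        int runUntilExit() throws IOException, InterruptedException;
    }

    /**
     * Starts the application at most maxStarts times, restarting it after
     * every non-zero exit. Returns how many times it was started.
     */
    static int supervise(App app, int maxStarts, long pauseMillis) {
        int starts = 0;
        while (starts < maxStarts) {
            starts++;
            int exit;
            try {
                exit = app.runUntilExit();
            } catch (IOException e) {
                exit = -1; // the process could not be started at all
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return starts;
            }
            if (exit == 0) {
                return starts; // clean shutdown, do not restart
            }
            System.err.println("application exited with " + exit + ", restarting");
            try {
                Thread.sleep(pauseMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return starts;
            }
        }
        return starts;
    }

    public static void main(String[] args) {
        if (args.length == 0) {
            System.err.println("usage: java Watchdog.java <command> [args...]");
            return;
        }
        // Runs the given command as a child process, sharing stdin/stdout/stderr.
        supervise(() -> new ProcessBuilder(args).inheritIO().start().waitFor(), 5, 1000);
    }
}
```

With this approach the daemon itself stays simple: on any unexpected exception it logs the error and exits, and the registration code only ever runs on a fresh start.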