Confd status becomes 'failed' after 2-3 mins since it started

Recently I saw a very weird behavior of confd. Everytime I start it or restart it, at begining it works well. I can use confd_cli to connect to it and do interactive set/show anything.
But after around 2-3 mins, it will stop and the status becomes failed.
If I run

systemctl status confd -l
● confd.service - Tail-f Conf-d daemon
   Loaded: loaded (/usr/lib/systemd/system/confd.service; disabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Sun 2019-07-14 16:02:11 EDT; 20min ago
  Process: 28889 ExecStop=/opt/confd/bin/confd --stop (code=exited, status=1/FAILURE)
  Process: 22435 ExecStart=/opt/confd/bin/confd (code=exited, status=0/SUCCESS)
 Main PID: 22439 (code=exited, status=25)

Jul 14 15:50:35 TPC-D8-03-007.phaedrus.sandvine.com confd[22439]: devel-cdb Compacted CDB journal file: 1 ms (104 nodes in memory, disk size 1.66 KiB -> 1.61 KiB)
Jul 14 15:50:35 TPC-D8-03-007.phaedrus.sandvine.com confd[22439]: confd embedded apps in phase1:
                                                                          []
Jul 14 15:50:35 TPC-D8-03-007.phaedrus.sandvine.com confd[22439]: - Enabling SNMP log
Jul 14 15:50:35 TPC-D8-03-007.phaedrus.sandvine.com confd[22439]: - Changing destination of daemon log to undefined
Jul 14 15:50:35 TPC-D8-03-007.phaedrus.sandvine.com systemd[1]: Started Tail-f Conf-d daemon.
Jul 14 16:02:10 TPC-D8-03-007.phaedrus.sandvine.com systemd[1]: confd.service: main process exited, code=exited, status=25/n/a
Jul 14 16:02:11 TPC-D8-03-007.phaedrus.sandvine.com confd[28889]: connection refused (stop)
Jul 14 16:02:11 TPC-D8-03-007.phaedrus.sandvine.com systemd[1]: confd.service: control process exited, code=exited status=1
Jul 14 16:02:11 TPC-D8-03-007.phaedrus.sandvine.com systemd[1]: Unit confd.service entered failed state.
Jul 14 16:02:11 TPC-D8-03-007.phaedrus.sandvine.com systemd[1]: confd.service failed.

I enabled confd.log, I didn’t see any errors or something there.
I checked error code and I found
#define ENOTTY 25 /* Not a typewriter */

In computing, “Not a typewriter” or ENOTTY is an error code defined in the errno.h found on many Unix systems. This code is now used to indicate that an invalid ioctl (input/output control) number was specified in an ioctl system call.

Any idea the confd keep stoping and the status become failed ? What I can do in order to investigate further ?

Thanks.

I am assuming you are running on a x86_64 machine.
What does the ConfD developer log with log level set to “trace” say?

Looks like this kind of issue is often due to a missing Type “forking” configuration in the confd.service file. A description can be found in the man page for systemd.service.
Also, the $CONFD_DIR/bin/confd is just a script so you need to point to the ConfD exceutable located at $CONFD_DIR/lib/confd/erts/bin/confd

[root@localhost]# cat /etc/systemd/system/confd.service
[Unit]
Description=ConfD service
After=some.target

[Service]
ExecStart=/etc/init.d/confd start
Type=forking

[Install]
WantedBy=multi-user.target

[root@localhost]# systemctl status confd.service
confd.service - ConfD service
   Loaded: loaded (/etc/systemd/system/confd.service; enabled)
   Active: active (running) since Thu 2019-07-15 10:01:43 CEST; 2min 18s ago
 Main PID: 126155 (confd)
   CGroup: /system.slice/confd.service
           └─126155 /home/tailf/confd/lib/confd/erts/bin/confd -K false -B -A0 -MHe true -- -root /home/tailf/confd/lib...

Jul 15 10:01 :43 localhost.localdomain confd[126155]: - Loading file /home/tailf/confd/etc/confd/confd.ccl
Jul 15 10:01 :43 localhost.localdomain confd[126155]: - Starting to listen for Internal IPC on 127.0.0.1:4565
Jul 15 10:01 :43 localhost.localdomain confd[126155]: devel-cdb Loaded schema file: /home/tailf/confd/var/confd/cdb/C.cdb...m 7.2)
Jul 15 10:01 :43 localhost.localdomain confd[126155]: devel-cdb Loaded configuration from /home/tailf/confd/var/confd/cdb....000s)
Jul 15 10:01 :43 localhost.localdomain confd[126155]: - Starting to listen for NETCONF SSH on 0.0.0.0:2022
Jul 15 10:01 :43 localhost.localdomain confd[126155]: - Starting to listen for NETCONF TCP on 127.0.0.1:2023
Jul 15 10:01 :43 localhost.localdomain confd[126155]: - Starting to listen for CLI SSH on 0.0.0.0:2024
Jul 15 10:01 :43 localhost.localdomain confd[126150]: Starting Tail-f ConfD: confd.
Jul 15 10:01 :43 localhost.localdomain systemd[1]: Started ConfD service.
Jul 15 10:01 :43 localhost.localdomain confd[126155]: - ConfD started vsn: 7.2
Hint: Some lines were ellipsized, use -l to show in full.
[root@localhost]# ps -ef | grep -i confd
tailf   3287      1  0 Jul15 ?        00:00:00 /usr/libexec/gconfd-2
root     126155      1  0 10:01 ?        00:00:00 /home/tailf/confd/lib/confd/erts/bin/confd -K false -B -A0 -MHe true -- -root /home/tailf/confd/lib/confd -progname confd -- -home / -- -smp disable -code_path_cache -boot confd -delayed-detach -noshell -noinput -shutdown_time 30000 -conffile /home/tailf/confd/etc/confd/confd.conf -max_fds 1024 -detached-fd 4
root     126270 125471  0 10:01 pts/1    00:00:00 grep --color=auto -i confd

I did enable devel log before but not in trace level, I didn’t see anything.
This my confd.service

  1 [Unit]
  2 Description=Tail-f Conf-d daemon
  3 Wants=network.target
  4 Before=network.target
  5 ConditionPathExists=/opt/confd/etc/confd/confd.conf
  6 
  7 [Service]
  8 Type=forking
  9 ExecStart=/opt/confd/bin/confd
 10 ExecStop=/opt/confd/bin/confd --stop
 11 ExecReload=/opt/confd/bin/confd --reload
 12 Restart=on-abort
 13 
 14 [Install]
 15 WantedBy=multi-user.target

The weird thing is now the issue recovered. I don’t think I do any operations which may make the confd working properly.
The only thing I remember I do is I use “lsof -p” to check the FD that confd process opened and I saw 65 which is consistent.

I will keep monitoring and if the issue appear again, I will enable the trace in devel.log.

Now the issue appear again!
The only thing I did which I think trigger the issue is

  1. I compiled the new yang files and try to restart confd
  2. confd restart fail with upgrade error.
  3. I delete *.cdb files in below

/opt/confd/var/confd/cdb
-rw-r–r-- 1 root root 1653 Jul 16 11:35 A.cdb
-rw-r–r-- 1 root root 5442812 Jul 16 11:35 C.cdb
-rw-r–r-- 1 root root 4 Jul 16 11:35 O.cdb

  1. The confd can be restarted, but I saw below

systemctl status confd
● confd.service - Tail-f Conf-d daemon
Loaded: loaded (/usr/lib/systemd/system/confd.service; disabled; vendor preset: disabled)
Active: active (running) since Tue 2019-07-16 11:35:18 EDT; 23s ago
Process: 15984 ExecStop=/opt/confd/bin/confd --stop (code=exited, status=1/FAILURE)
Process: 16274 ExecStart=/opt/confd/bin/confd (code=exited, status=0/SUCCESS)
Main PID: 16278 (confd)

I check the fd of the process, I can see the socket number keep increasing

confd   16278 root   86u  IPv4 6262698       0t0      TCP localhost:down->localhost:59062 (ESTABLISHED)
confd   16278 root   87u  IPv4 6262701       0t0      TCP localhost:down->localhost:59064 (ESTABLISHED)
confd   16278 root   88u  IPv4 6262704       0t0      TCP localhost:down->localhost:59066 (ESTABLISHED)
confd   16278 root   89u  IPv4 6264240       0t0      TCP localhost:down->localhost:59068 (ESTABLISHED)
confd   16278 root   90u  IPv4 6262707       0t0      TCP localhost:down->localhost:59070 (ESTABLISHED)
confd   16278 root   91u  IPv4 6264243       0t0      TCP localhost:down->localhost:59072 (ESTABLISHED)
confd   16278 root   92u  IPv4 6262710       0t0      TCP localhost:down->localhost:59074 (ESTABLISHED)
confd   16278 root   93u  IPv4 6262713       0t0      TCP localhost:down->localhost:59076 (ESTABLISHED)
confd   16278 root   94u  IPv4 6262716       0t0      TCP localhost:down->localhost:59080 (ESTABLISHED)
confd   16278 root   95u  IPv4 6262719       0t0      TCP localhost:down->localhost:59082 (ESTABLISHED)
confd   16278 root   96u  IPv4 6264246       0t0      TCP localhost:down->localhost:59084 (ESTABLISHED)
confd   16278 root   97u  IPv4 6264248       0t0      TCP localhost:down->localhost:59086 (ESTABLISHED)
confd   16278 root   98u  IPv4 6264250       0t0      TCP localhost:down->localhost:59088 (ESTABLISHED)
confd   16278 root   99u  IPv4 6264252       0t0      TCP localhost:down->localhost:59090 (ESTABLISHED)
confd   16278 root  100u  IPv4 6264254       0t0      TCP localhost:down->localhost:59092 (ESTABLISHED)

Why confd keep open new socket ?

Note that you are still pointing to the script, not the real ConfD binary. You need to change that.
“$CONFD_DIR/bin/confd is just a script so you need to point to the ConfD exceutable located at $CONFD_DIR/lib/confd/erts/bin/confd”

Thanks. So I update the confd.service as below

  1 [Unit]
  2 Description=Tail-f Conf-d daemon
  3 Wants=network.target
  4 Before=network.target
  5 ConditionPathExists=/opt/confd/etc/confd/confd.conf
  6 
  7 [Service]
  8 Type=forking
  9 ExecStart=/opt/confd/lib/confd/erts/bin/confd
 10 ExecStop=/opt/confd/lib/confd/erts/bin/confd --stop
 11 ExecReload=/opt/confd/lib/confd/erts/bin/confd --reload
 12 Restart=on-abort
 13 
 14 [Install]
 15 WantedBy=multi-user.target

But now I can start confd

systemctl daemon-reload;systemctl restart confd
Job for confd.service failed because the control process exited with error code. See “systemctl status confd.service” and “journalctl -xe” for details.

status confd.service                                             
● confd.service - Tail-f Conf-d daemon
   Loaded: loaded (/usr/lib/systemd/system/confd.service; disabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Tue 2019-07-16 15:08:21 EDT; 38s ago
  Process: 21362 ExecStart=/opt/confd/lib/confd/erts/bin/confd (code=exited, status=1/FAILURE)
 Main PID: 16960 (code=exited, status=25)

Jul 16 15:08:21 TPC-D8-03-007.phaedrus.sandvine.com systemd[1]: Starting Tail-f Conf-d daemon...
Jul 16 15:08:21 TPC-D8-03-007.phaedrus.sandvine.com confd[21362]: Environment variable BINDIR is not set
Jul 16 15:08:21 TPC-D8-03-007.phaedrus.sandvine.com systemd[1]: confd.service: control process exited, code=exited status=1
Jul 16 15:08:21 TPC-D8-03-007.phaedrus.sandvine.com systemd[1]: Failed to start Tail-f Conf-d daemon.
Jul 16 15:08:21 TPC-D8-03-007.phaedrus.sandvine.com systemd[1]: Unit confd.service entered failed state.
Jul 16 15:08:21 TPC-D8-03-007.phaedrus.sandvine.com systemd[1]: confd.service failed.

Looked into this again and I gave you some bad advice. Turns out you do want to use the confd script after all. Apologize for that.

Working example:

  1. Install ConfD under /opt/confd
  2. Put something like this script under /etc/init.d/
$ pwd
/etc/init.d
$ cat confd
#!/bin/sh
. /opt/confd/confdrc

case "$1" in
  start)
    echo -n "Starting Tail-f ConfD: "
    start-stop-daemon --start -x "/opt/confd/bin/confd"
    echo "confd"
    ;;
esac
exit 0
  1. The confd.service:
$ cat /etc/systemd/system/confd.service
[Unit]
Description=ConfD service
After=some.target

[Service]
ExecStart=/etc/init.d/confd start
Type=forking

[Install]
WantedBy=multi-user.target

Thanks. I will try this later.
In current case, I am 99% sure that the confd stop because running out of fd since the opening sockets keep increasing, as I shown in the previous post.
Then I try lsof again in the confd process before it stopped:

lsof -p 13269 | grep -i ipv4
confd   13269 root   39u  IPv4 9829452       0t0      TCP localhost:4565 (LISTEN)
confd   13269 root   42u  IPv4 9829189       0t0      TCP *:down (LISTEN)
confd   13269 root   43u  IPv4 9829197       0t0      TCP *:xinuexpansion4 (LISTEN)

4565 port make senses which is used for confd ipc. What is another two listening socket for ?
Ok, I figure out thru adding “-P” into lsof. down is 2022, which is configured for netconf ssh.

Ok, I think I figured out. It is an authentication issue which I mess the aaa configuration. I will keep watching to see whether the problem fully fixed.

Thanks.