Recently I saw a very weird behavior of confd. Everytime I start it or restart it, at begining it works well. I can use confd_cli to connect to it and do interactive set/show anything.
But after around 2-3 mins, it will stop and the status becomes failed.
If I run
systemctl status confd -l
â— confd.service - Tail-f Conf-d daemon
Loaded: loaded (/usr/lib/systemd/system/confd.service; disabled; vendor preset: disabled)
Active: failed (Result: exit-code) since Sun 2019-07-14 16:02:11 EDT; 20min ago
Process: 28889 ExecStop=/opt/confd/bin/confd --stop (code=exited, status=1/FAILURE)
Process: 22435 ExecStart=/opt/confd/bin/confd (code=exited, status=0/SUCCESS)
Main PID: 22439 (code=exited, status=25)
Jul 14 15:50:35 TPC-D8-03-007.phaedrus.sandvine.com confd[22439]: devel-cdb Compacted CDB journal file: 1 ms (104 nodes in memory, disk size 1.66 KiB -> 1.61 KiB)
Jul 14 15:50:35 TPC-D8-03-007.phaedrus.sandvine.com confd[22439]: confd embedded apps in phase1:
[]
Jul 14 15:50:35 TPC-D8-03-007.phaedrus.sandvine.com confd[22439]: - Enabling SNMP log
Jul 14 15:50:35 TPC-D8-03-007.phaedrus.sandvine.com confd[22439]: - Changing destination of daemon log to undefined
Jul 14 15:50:35 TPC-D8-03-007.phaedrus.sandvine.com systemd[1]: Started Tail-f Conf-d daemon.
Jul 14 16:02:10 TPC-D8-03-007.phaedrus.sandvine.com systemd[1]: confd.service: main process exited, code=exited, status=25/n/a
Jul 14 16:02:11 TPC-D8-03-007.phaedrus.sandvine.com confd[28889]: connection refused (stop)
Jul 14 16:02:11 TPC-D8-03-007.phaedrus.sandvine.com systemd[1]: confd.service: control process exited, code=exited status=1
Jul 14 16:02:11 TPC-D8-03-007.phaedrus.sandvine.com systemd[1]: Unit confd.service entered failed state.
Jul 14 16:02:11 TPC-D8-03-007.phaedrus.sandvine.com systemd[1]: confd.service failed.
I enabled confd.log, I didn’t see any errors or something there.
I checked error code and I found #define ENOTTY 25 /* Not a typewriter */
In computing, “Not a typewriter” or ENOTTY is an error code defined in the errno.h found on many Unix systems. This code is now used to indicate that an invalid ioctl (input/output control) number was specified in an ioctl system call.
Any idea the confd keep stoping and the status become failed ? What I can do in order to investigate further ?
Looks like this kind of issue is often due to a missing Type “forking” configuration in the confd.service file. A description can be found in the man page for systemd.service.
Also, the $CONFD_DIR/bin/confd is just a script so you need to point to the ConfD exceutable located at $CONFD_DIR/lib/confd/erts/bin/confd
[root@localhost]# cat /etc/systemd/system/confd.service
[Unit]
Description=ConfD service
After=some.target
[Service]
ExecStart=/etc/init.d/confd start
Type=forking
[Install]
WantedBy=multi-user.target
[root@localhost]# systemctl status confd.service
confd.service - ConfD service
Loaded: loaded (/etc/systemd/system/confd.service; enabled)
Active: active (running) since Thu 2019-07-15 10:01:43 CEST; 2min 18s ago
Main PID: 126155 (confd)
CGroup: /system.slice/confd.service
└─126155 /home/tailf/confd/lib/confd/erts/bin/confd -K false -B -A0 -MHe true -- -root /home/tailf/confd/lib...
Jul 15 10:01 :43 localhost.localdomain confd[126155]: - Loading file /home/tailf/confd/etc/confd/confd.ccl
Jul 15 10:01 :43 localhost.localdomain confd[126155]: - Starting to listen for Internal IPC on 127.0.0.1:4565
Jul 15 10:01 :43 localhost.localdomain confd[126155]: devel-cdb Loaded schema file: /home/tailf/confd/var/confd/cdb/C.cdb...m 7.2)
Jul 15 10:01 :43 localhost.localdomain confd[126155]: devel-cdb Loaded configuration from /home/tailf/confd/var/confd/cdb....000s)
Jul 15 10:01 :43 localhost.localdomain confd[126155]: - Starting to listen for NETCONF SSH on 0.0.0.0:2022
Jul 15 10:01 :43 localhost.localdomain confd[126155]: - Starting to listen for NETCONF TCP on 127.0.0.1:2023
Jul 15 10:01 :43 localhost.localdomain confd[126155]: - Starting to listen for CLI SSH on 0.0.0.0:2024
Jul 15 10:01 :43 localhost.localdomain confd[126150]: Starting Tail-f ConfD: confd.
Jul 15 10:01 :43 localhost.localdomain systemd[1]: Started ConfD service.
Jul 15 10:01 :43 localhost.localdomain confd[126155]: - ConfD started vsn: 7.2
Hint: Some lines were ellipsized, use -l to show in full.
[root@localhost]# ps -ef | grep -i confd
tailf 3287 1 0 Jul15 ? 00:00:00 /usr/libexec/gconfd-2
root 126155 1 0 10:01 ? 00:00:00 /home/tailf/confd/lib/confd/erts/bin/confd -K false -B -A0 -MHe true -- -root /home/tailf/confd/lib/confd -progname confd -- -home / -- -smp disable -code_path_cache -boot confd -delayed-detach -noshell -noinput -shutdown_time 30000 -conffile /home/tailf/confd/etc/confd/confd.conf -max_fds 1024 -detached-fd 4
root 126270 125471 0 10:01 pts/1 00:00:00 grep --color=auto -i confd
The weird thing is now the issue recovered. I don’t think I do any operations which may make the confd working properly.
The only thing I remember I do is I use “lsof -p” to check the FD that confd process opened and I saw 65 which is consistent.
I will keep monitoring and if the issue appear again, I will enable the trace in devel.log.
Note that you are still pointing to the script, not the real ConfD binary. You need to change that.
“$CONFD_DIR/bin/confd is just a script so you need to point to the ConfD exceutable located at $CONFD_DIR/lib/confd/erts/bin/confd”
systemctl daemon-reload;systemctl restart confd
Job for confd.service failed because the control process exited with error code. See “systemctl status confd.service” and “journalctl -xe” for details.
status confd.service
â— confd.service - Tail-f Conf-d daemon
Loaded: loaded (/usr/lib/systemd/system/confd.service; disabled; vendor preset: disabled)
Active: failed (Result: exit-code) since Tue 2019-07-16 15:08:21 EDT; 38s ago
Process: 21362 ExecStart=/opt/confd/lib/confd/erts/bin/confd (code=exited, status=1/FAILURE)
Main PID: 16960 (code=exited, status=25)
Jul 16 15:08:21 TPC-D8-03-007.phaedrus.sandvine.com systemd[1]: Starting Tail-f Conf-d daemon...
Jul 16 15:08:21 TPC-D8-03-007.phaedrus.sandvine.com confd[21362]: Environment variable BINDIR is not set
Jul 16 15:08:21 TPC-D8-03-007.phaedrus.sandvine.com systemd[1]: confd.service: control process exited, code=exited status=1
Jul 16 15:08:21 TPC-D8-03-007.phaedrus.sandvine.com systemd[1]: Failed to start Tail-f Conf-d daemon.
Jul 16 15:08:21 TPC-D8-03-007.phaedrus.sandvine.com systemd[1]: Unit confd.service entered failed state.
Jul 16 15:08:21 TPC-D8-03-007.phaedrus.sandvine.com systemd[1]: confd.service failed.
Thanks. I will try this later.
In current case, I am 99% sure that the confd stop because running out of fd since the opening sockets keep increasing, as I shown in the previous post.
Then I try lsof again in the confd process before it stopped:
4565 port make senses which is used for confd ipc. What is another two listening socket for ?
Ok, I figure out thru adding “-P” into lsof. down is 2022, which is configured for netconf ssh.
Ok, I think I figured out. It is an authentication issue which I mess the aaa configuration. I will keep watching to see whether the problem fully fixed.