Core dumped when running multiple confdc commands in parallel

hzpfly · October 27, 2020, 5:46am

The following 11 confdc commands are run in parallel:

/opt/confd/bin/confdc --yangpath /opt/adp/yang -c /tmp/yangArchive320787179/company-complete-ext.yang --use-description -o /models-from-db/fxs/company-complete-ext.fxs --yangpath /tmp/yangArchive320787179 --no-features &
/opt/confd/bin/confdc --yangpath /opt/adp/yang -c /tmp/yangArchive749676220/company-alice.yang --use-description -o /models-from-db/fxs/company-alice.fxs --yangpath /tmp/yangArchive749676220 --no-features &
/opt/confd/bin/confdc --yangpath /opt/adp/yang -c /tmp/yangArchive368855886/company-alice.yang --use-description -o /models-from-db/fxs/company-alice.fxs --yangpath /tmp/yangArchive368855886 &
/opt/confd/bin/confdc --yangpath /opt/adp/yang -c /tmp/yangArchive320787179/company-grouping.yang --use-description -o /models-from-db/fxs/company-grouping.fxs --yangpath /tmp/yangArchive320787179 &
/opt/confd/bin/confdc --yangpath /opt/adp/yang -c /tmp/yangArchive368855886/company-bob.yang --use-description -o /models-from-db/fxs/company-bob.fxs --yangpath /tmp/yangArchive368855886 --no-features &
/opt/confd/bin/confdc --yangpath /opt/adp/yang -c /tmp/yangArchive320787179/company-complete-ext.yang --use-description -o /models-from-db/fxs/company-complete-ext.fxs --yangpath /tmp/yangArchive320787179 --no-features &
/opt/confd/bin/confdc --yangpath /opt/adp/yang -c /tmp/yangArchive368855886/company-bob.yang --use-description -o /models-from-db/fxs/company-bob.fxs --yangpath /tmp/yangArchive368855886 --no-features &
/opt/confd/bin/confdc --yangpath /opt/adp/yang -c /tmp/yangArchive749676220/company-bob.yang --use-description -o /models-from-db/fxs/company-bob.fxs --yangpath /tmp/yangArchive749676220 &
/opt/confd/bin/confdc --yangpath /opt/adp/yang -c /tmp/yangArchive368855886/company-alice.yang --use-description -o /models-from-db/fxs/company-alice.fxs --yangpath /tmp/yangArchive368855886 &
/opt/confd/bin/confdc --yangpath /opt/adp/yang -c /tmp/yangArchive320787179/company-complete-ext-ext.yang --use-description -o /models-from-db/fxs/company-complete-ext-ext.fxs --yangpath /tmp/yangArchive320787179 --no-features &
/opt/confd/bin/confdc --yangpath /opt/adp/yang -c /tmp/yangArchive320787179/company-complete.yang --use-description -o /models-from-db/fxs/company-complete.fxs --yangpath /tmp/yangArchive320787179 --deviation /tmp/yangArchive320787179/company-complete-ext.yang --deviation /tmp/yangArchive320787179/company-complete-ext-ext.yang -F company-complete:implemented,some_cool_feature &

The result is as follows:

bash-4.4$ Failed to create dirty io scheduler thread 8, error = 11

[3] Killed /opt/confd/bin/confdc --yangpath /opt/adp/yang -c /tmp/yangArchive368855886/company-alice.yang --use-description -o /models-from-db/fxs/company-alice.fxs --yangpath /tmp/yangArchive368855886
bash-4.4$
[1] Killed /opt/confd/bin/confdc --yangpath /opt/adp/yang -c /tmp/yangArchive320787179/company-complete-ext.yang --use-description -o /models-from-db/fxs/company-complete-ext.fxs --yangpath /tmp/yangArchive320787179 --no-features
[4] Aborted (core dumped) /opt/confd/bin/confdc --yangpath /opt/adp/yang -c /tmp/yangArchive320787179/company-grouping.yang --use-description -o /models-from-db/fxs/company-grouping.fxs --yangpath /tmp/yangArchive320787179
[5] Killed /opt/confd/bin/confdc --yangpath /opt/adp/yang -c /tmp/yangArchive368855886/company-bob.yang --use-description -o /models-from-db/fxs/company-bob.fxs --yangpath /tmp/yangArchive368855886 --no-features
bash-4.4$
[7] Done /opt/confd/bin/confdc --yangpath /opt/adp/yang -c /tmp/yangArchive368855886/company-bob.yang --use-description -o /models-from-db/fxs/company-bob.fxs --yangpath /tmp/yangArchive368855886 --no-features
bash-4.4$
[2] Done /opt/confd/bin/confdc --yangpath /opt/adp/yang -c /tmp/yangArchive749676220/company-alice.yang --use-description -o /models-from-db/fxs/company-alice.fxs --yangpath /tmp/yangArchive749676220 --no-features
[6] Done /opt/confd/bin/confdc --yangpath /opt/adp/yang -c /tmp/yangArchive320787179/company-complete-ext.yang --use-description -o /models-from-db/fxs/company-complete-ext.fxs --yangpath /tmp/yangArchive320787179 --no-features
[8] Done /opt/confd/bin/confdc --yangpath /opt/adp/yang -c /tmp/yangArchive749676220/company-bob.yang --use-description -o /models-from-db/fxs/company-bob.fxs --yangpath /tmp/yangArchive749676220
[9] Done /opt/confd/bin/confdc --yangpath /opt/adp/yang -c /tmp/yangArchive368855886/company-alice.yang --use-description -o /models-from-db/fxs/company-alice.fxs --yangpath /tmp/yangArchive368855886
[10]- Done /opt/confd/bin/confdc --yangpath /opt/adp/yang -c /tmp/yangArchive320787179/company-complete-ext-ext.yang --use-description -o /models-from-db/fxs/company-complete-ext-ext.fxs --yangpath /tmp/yangArchive320787179 --no-features
[11]+ Done /opt/confd/bin/confdc --yangpath /opt/adp/yang -c /tmp/yangArchive320787179/company-complete.yang --use-description -o /models-from-db/fxs/company-complete.fxs --yangpath /tmp/yangArchive320787179 --deviation /tmp/yangArchive320787179/company-complete-ext.yang --deviation /tmp/yangArchive320787179/company-complete-ext-ext.yang -F company-complete:implemented,some_cool_feature

This looks like an IO scheduler error. Other errors, such as “Failed to create dirty cpu scheduler thread 3, error = 11” and “sys/common/erl_poll.c:440:wake_poller(): Failed to write to wakeup pipe fd=10: enomem (12)”, also appeared in other attempts. ConfD 7.4 is used:

bash-4.4$ /opt/confd/bin/confdc --version
confd-7.4
bash-4.4$ /opt/confd/lib/confd/erts/bin/confd.smp -V
Erlang (SMP,ASYNC_THREADS) (BEAM) emulator version 10.7.1

In the confdc script of ConfD 7.3, we found the following lines, which were removed in 7.4:

if [ -f ${BINDIR}/confd ]; then
    smp="-smp disable"
else
    smp="-smp enable"
fi

Does this mean SMP Erlang is run by default?
The node on which the Docker container runs has 8 cores:

$ grep processor /proc/cpuinfo | wc -l
8

And the resources allocated to this container by Kubernetes:

    limits:
        cpu: 1000m
        memory: 250Mi
    requests:
        cpu: 50m
        memory: 50Mi

Could you please help us run confdc in parallel? How many confdc commands can run in parallel? Is that determined by the number of physical cores, or by the CPU or memory limits set by Kubernetes?

eaksu · October 27, 2020, 9:33am

Hello Michael,

ConfD 7.4 uses Erlang/OTP 22, and the non-SMP VM was removed in OTP 21, so SMP is always enabled.

Each parallel confdc command starts its own Erlang VM instance, and it seems that some of these instances are being starved and cannot get resources from the underlying system. Basically, they cannot get scheduled.

I am not knowledgeable about Kubernetes, but I would try specifying only requests, without limits, and run it again. If that is not possible, check how your Kubernetes and Docker configuration limits the use of schedulers and the number of OS threads that can be scheduled.

Another option is to try limits: cpu: 8000m
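As a starting point for that check, here is a minimal sketch that prints the limits most likely to affect thread creation inside the container. On Linux, error 11 is EAGAIN, which thread creation returns when a resource limit is reached, and bash prints `Killed` for jobs that received SIGKILL, which is consistent with (but does not prove) the memory limit being exceeded. The cgroup paths below are the usual v1 and v2 mount points and are an assumption; they may differ in your image, and files that do not exist are skipped. The `show` helper is hypothetical.

```shell
#!/bin/bash
# Print limits that commonly cap thread creation inside a container.
# Both cgroup v1 and v2 locations are probed; missing files are skipped.

show() {
    # $1 = label, $2 = file to print if it is readable
    if [ -r "$2" ]; then
        printf '%-28s %s\n' "$1" "$(cat "$2")"
    fi
}

printf '%-28s %s\n' "CPUs visible (nproc)" "$(nproc)"
printf '%-28s %s\n' "max user processes" "$(ulimit -u)"
show "kernel threads-max"       /proc/sys/kernel/threads-max
show "cgroup v1 pids.max"       /sys/fs/cgroup/pids/pids.max
show "cgroup v2 pids.max"       /sys/fs/cgroup/pids.max
show "cgroup v1 cpu quota (us)" /sys/fs/cgroup/cpu/cpu.cfs_quota_us
show "cgroup v2 cpu.max"        /sys/fs/cgroup/cpu.max
show "cgroup v1 memory limit"   /sys/fs/cgroup/memory/memory.limit_in_bytes
show "cgroup v2 memory.max"     /sys/fs/cgroup/memory.max
```

Running it once inside the container and once on the node should show which of these values Kubernetes or Docker has tightened.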

Best regards,
Erdem

hzpfly · October 30, 2020, 1:32am

Thank you for your reply. I tried limits: cpu: 4000m, and 4 confdc commands ran in parallel successfully.
My question is: can we run more confdc commands in parallel on 4 cores?

eaksu · November 2, 2020, 8:35am

I do not think there is a limit on how many parallel confdc commands you can run. Every confdc command starts an Erlang VM, and each Erlang VM checks how many cores it is running on. With the default VM configuration, each VM tries to create one CPU-bound dirty scheduler per core (they all see 4 cores in your last setup; this is a simplified explanation, as the count is only implicitly related to the number of cores) and, also by default, 10 IO-bound dirty schedulers. In your failing case, one error message said that a dirty IO scheduler could not be created.
My guess is that, since you were limiting the availability of the cores to 1/8, some of the Erlang VMs were unable to start the OS threads needed to create Erlang’s schedulers on some of the cores.
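To put rough numbers on this, the arithmetic can be sketched as below. It assumes the defaults described above (one dirty CPU scheduler per visible core and 10 dirty IO schedulers) plus one normal scheduler per visible core, which is the Erlang default but is not stated above. Auxiliary VM threads are ignored, so the result is only a lower bound, and `estimate_threads` is a hypothetical helper name.

```shell
#!/bin/bash
# Lower-bound estimate of OS threads started by several concurrent Erlang VMs.
# Assumed per-VM defaults: cores normal schedulers + cores dirty CPU schedulers
# + 10 dirty IO schedulers. Auxiliary threads are not counted.

estimate_threads() {
    # $1 = number of VMs, $2 = cores each VM detects, $3 = dirty IO schedulers
    local vms=$1 cores=$2 dirty_io=${3:-10}
    echo $(( vms * (cores + cores + dirty_io) ))
}

estimate_threads 11 4   # 11 confdc runs, 4 cores seen  -> 198
estimate_threads 4 4    # 4 confdc runs, 4 cores seen   -> 72
```

The core count each VM actually detects depends on the container setup, so it is worth checking inside the container before relying on these figures.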
Depending on your use case, there may be alternative approaches to achieve what you want. One way may be to limit the number of schedulers; see the Erlang documentation on this topic: http://erlang.org/doc/man/erl.html#+SDio but we do not support this at the moment. I would recommend keeping enough system resources available to the Erlang VMs. Otherwise (just brainstorming here), you may try to re-run the failing commands after some of the others finish successfully, like a map-reduce+*(remap-reduce).
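The re-run idea could look roughly like the sketch below: start at most a fixed number of commands at a time, collect the ones that exit non-zero (including those killed by a signal), and retry them one at a time. This is only an illustration, not something confdc provides. The function names and the command-file format (one command per line, without the trailing `&`) are assumptions, and `wait -n` requires bash 4.3 or later.

```shell
#!/bin/bash
# Sketch: run commands with bounded parallelism, then retry failures serially.
# Function names are hypothetical. Requires bash 4.3+ for `wait -n`.

run_batch() {
    # $1: file with one command per line
    # $2: maximum number of concurrent jobs
    # $3: file that collects the commands that exited non-zero
    local cmd_file=$1 max_jobs=$2 failed_file=$3 running=0 cmd
    : > "$failed_file"
    while IFS= read -r cmd; do
        [ -z "$cmd" ] && continue
        if [ "$running" -ge "$max_jobs" ]; then
            wait -n                      # block until one job exits
            running=$((running - 1))
        fi
        # Record the command line if it fails (including death by signal).
        ( bash -c "$cmd" </dev/null || printf '%s\n' "$cmd" >>"$failed_file" ) &
        running=$((running + 1))
    done <"$cmd_file"
    wait                                 # drain the remaining jobs
}

run_with_retry() {
    # $1: command file, $2: maximum number of concurrent jobs
    local first_failed second_failed status=0
    first_failed=$(mktemp)
    second_failed=$(mktemp)
    run_batch "$1" "$2" "$first_failed"
    if [ -s "$first_failed" ]; then
        run_batch "$first_failed" 1 "$second_failed"   # retry one at a time
        if [ -s "$second_failed" ]; then
            echo "still failing:" >&2
            cat "$second_failed" >&2
            status=1
        fi
    fi
    rm -f "$first_failed" "$second_failed"
    return "$status"
}
```

For example, with the 11 confdc command lines saved in a file such as commands.txt (a hypothetical name), `run_with_retry commands.txt 4` would keep at most 4 Erlang VMs alive at once, matching the concurrency that worked with limits: cpu: 4000m.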

I hope this information is helpful.

Best regards,
Erdem