Why Linux OS takes over 5 minutes during boot-up process

In this article, I would show you how to find which systemd unit might lead the boot-up times through `systemd-analyze`[1] command.

Why Linux OS takes over 5 minutes during boot-up process
Photo by Ben White / Unsplash

Problem

As the following output of journalctl command, we can view the network unit takes around 5 minutes between initial the LSB and failed to start:

  • Jun 01 15:59:45: Starting LSB: Bring up/down networking...
  • Jun 01 16:04:45: Failed to start LSB: Bring up/down networking.
$ journalctl --unit=network
-- Logs begin at Tue 2021-06-01 15:59:40 UTC, end at Tue 2021-06-01 16:05:37 UTC. --
Jun 01 15:59:45 ip-172-31-29-127.eu-west-1.compute.internal systemd[1]: Starting LSB: Bring up/down networking...
Jun 01 15:59:45 ip-172-31-29-127.eu-west-1.compute.internal network[666]: Bringing up loopback interface:  [  OK  ]
Jun 01 15:59:45 ip-172-31-29-127.eu-west-1.compute.internal network[666]: Bringing up interface eth0:
Jun 01 15:59:45 ip-172-31-29-127.eu-west-1.compute.internal dhclient[789]: DHCPREQUEST on eth0 to 255.255.255.255 port 67 (xid=0x7487dbae)
Jun 01 15:59:45 ip-172-31-29-127.eu-west-1.compute.internal dhclient[789]: DHCPACK from 172.31.16.1 (xid=0x7487dbae)
Jun 01 15:59:47 ip-172-31-29-127.eu-west-1.compute.internal NET[833]: /usr/sbin/dhclient-script : updated /etc/resolv.conf
Jun 01 15:59:47 ip-172-31-29-127.eu-west-1.compute.internal network[666]: Determining IP information for eth0... done.
Jun 01 15:59:48 ip-172-31-29-127.eu-west-1.compute.internal dhclient[856]: XMT: Solicit on eth0, interval 1020ms.
Jun 01 15:59:49 ip-172-31-29-127.eu-west-1.compute.internal dhclient[856]: XMT: Solicit on eth0, interval 2120ms.
Jun 01 15:59:51 ip-172-31-29-127.eu-west-1.compute.internal dhclient[856]: XMT: Solicit on eth0, interval 4450ms.
Jun 01 15:59:55 ip-172-31-29-127.eu-west-1.compute.internal dhclient[856]: XMT: Solicit on eth0, interval 9080ms.
Jun 01 16:00:05 ip-172-31-29-127.eu-west-1.compute.internal dhclient[856]: XMT: Solicit on eth0, interval 18980ms.
Jun 01 16:00:24 ip-172-31-29-127.eu-west-1.compute.internal dhclient[856]: XMT: Solicit on eth0, interval 37130ms.
Jun 01 16:01:01 ip-172-31-29-127.eu-west-1.compute.internal dhclient[856]: XMT: Solicit on eth0, interval 77810ms.
Jun 01 16:02:19 ip-172-31-29-127.eu-west-1.compute.internal dhclient[856]: XMT: Solicit on eth0, interval 113620ms.
Jun 01 16:04:12 ip-172-31-29-127.eu-west-1.compute.internal dhclient[856]: XMT: Solicit on eth0, interval 35520ms.
Jun 01 16:04:45 ip-172-31-29-127.eu-west-1.compute.internal systemd[1]: network.service start operation timed out. Terminating.
Jun 01 16:04:45 ip-172-31-29-127.eu-west-1.compute.internal systemd[1]: Failed to start LSB: Bring up/down networking.
Jun 01 16:04:45 ip-172-31-29-127.eu-west-1.compute.internal systemd[1]: Unit network.service entered failed state.
Jun 01 16:04:45 ip-172-31-29-127.eu-west-1.compute.internal systemd[1]: network.service failed.
Jun 01 16:04:48 ip-172-31-29-127.eu-west-1.compute.internal network[666]: Determining IPv6 information for eth0... failed.
Jun 01 16:04:48 ip-172-31-29-127.eu-west-1.compute.internal network[666]: WARN      : [/etc/sysconfig/network-scripts/ifup-eth] Unable to obtain IPv6 DHCP address eth0.

Find the systemd unit that take a long time

In this case, we can use systemd-analyze blame command to show a list of all running units, ordered by the time they took to initialize.

$ sudo systemd-analyze blame
  5min 115ms network.service
      3.866s dev-xvda1.device
      3.101s cloud-init-local.service
      1.206s cloud-init.service

From the output message of systemd-analyze blame, we can know the network.service take around 5 min. Moreover, we also can check the configuration of network.service that defined the TimeoutSec=5min by default[2].

$ sudo systemctl edit --full network.service
...
...
[Service]
Type=forking
Restart=no
TimeoutSec=5min
IgnoreSIGPIPE=no
KillMode=process
GuessMainPID=no
RemainAfterExit=yes
ExecStart=/etc/rc.d/init.d/network start
ExecStop=/etc/rc.d/init.d/network stop

This means that network.service might be waiting for something and cause the timeout. Therefore, we should check the networking configurations.

$ cat /etc/sysconfig/network-scripts/ifcfg-ens5 

# Created by cloud-init on instance boot automatically, do not edit.
#
BOOTPROTO=dhcp
DEVICE=ens5
DHCPV6C=yes
IPV6INIT=yes
ONBOOT=yes
TYPE=Ethernet
USERCTL=no

From this configuration of the interface ens5, we can know IPv6 address is enabled. However, the configurations in /etc/sysconfig/network-scripts are created by cloud-init on instance boot automatically.

By checking the configurations in /etc/cloud/cloud.cfg.d/, we can confirm the IPv6 address is enabled in cloud-init.

...
    network:
      version: 1
      config:
      - type: physical
        name: eth0
        subnets:
          - type: dhcp
          - type: dhcp6

In the meantime, I can view the VPC did not be enabled the IPv6 support[3].

Summary

If we enable the IPv6 at the Linux OS level, but we don't enable IPv6 support for your VPC and resources. During the boot-up process, network.unit would like to retrieve IPv6 IP address and wait for DHCP server for 5 min. After the timeout period, the systemd.unit would continue the process, and cause a slow boot-up time over 5 minutes.

References

  1. https://www.freedesktop.org/software/systemd/man/systemd-analyze.html
  2. https://www.freedesktop.org/software/systemd/man/systemd.service.html
  3. Get started with IPv6 for Amazon VPC - https://docs.aws.amazon.com/vpc/latest/userguide/get-started-IPv6.html