What is watchdog in linux

Christian’s Blog

Linux, programming, hacking, electronics, Python… These are the things I love.

Using the Watchdog Timer in Linux

The Software Watchdog

First, build the Linux kernel with watchdog support. The full guide is located here:

After a reboot with the new kernel there should be a /dev/watchdog file:

Next: you will need to install a watchdog daemon:

List the files that get installed by the watchdog package:

This looks interesting: /usr/lib/systemd/system/watchdog.service is a systemd service file.

Starting and stopping the watchdog:

The watchdog starts automatically as soon as you open /dev/watchdog. To stop the watchdog, you need to:

  • Write the character V into /dev/watchdog. This step exists so that the watchdog cannot be stopped accidentally.
  • Close the file /dev/watchdog. This only stops the watchdog if your kernel was not compiled with the CONFIG_WATCHDOG_NOWAYOUT option; when that option is enabled, the watchdog cannot be stopped at all.
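The two steps above can be sketched in Python as follows. This is only a minimal sketch: the helper name disarm is hypothetical, and it assumes the caller already holds the open file descriptor of the watchdog device and that the driver honours the V character.

```python
import os


def disarm(fd):
    """Stop the watchdog: write the character V, then close the device.

    fd is the file descriptor of the already opened watchdog device.
    With CONFIG_WATCHDOG_NOWAYOUT the timer keeps running regardless.
    """
    try:
        os.write(fd, b"V")  # announce that the following close is intentional
    finally:
        os.close(fd)  # the close directly after V is what stops the timer
```

Because the function only needs a file descriptor, it can be tried on an ordinary temporary file before pointing it at the real device.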

After the watchdog has been enabled, you have to reset the watchdog timer at least once every 60 seconds; otherwise your system is rebooted. The watchdog daemon resets the timer for you as long as none of its tests fails.
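What the daemon does here can be approximated by a small keepalive loop. The sketch below is not the daemon's code: feed_loop, the healthy callback and the injectable sleep function are hypothetical names chosen for illustration, and the 10 second interval is just a value comfortably below the 60 second timeout.

```python
import os
import time


def feed_loop(fd, healthy, interval=10.0, sleep=time.sleep):
    """Write to the watchdog device for as long as healthy() returns True.

    fd       -- file descriptor of the already opened watchdog device
    healthy  -- hypothetical callable standing in for the daemon's tests
    interval -- seconds between writes; must stay well below the timeout
    """
    while healthy():
        os.write(fd, b"\0")  # any write resets the timer
        sleep(interval)
    # After returning, nothing resets the timer any more, so the system
    # reboots once the timeout has expired.
```

Passing the sleep function as a parameter makes it possible to exercise the loop against a regular file without actually waiting.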

The watchdog daemon supports the following tests to check the system status:

  • Is the process table full?
  • Is there enough free memory?
  • Are some files accessible?
  • Have some files changed within a given interval?
  • Is the average work load too high?
  • Has a file table overflow occurred?
  • Is a process still running? The process is specified by a pid file.
  • Do some IP addresses answer to ping?
  • Do network interfaces receive traffic?
  • Is the temperature too high? (Temperature data not always available.)
  • Execute a user defined command to do arbitrary tests.
  • Execute one or more test/repair commands found in /etc/watchdog.d. These commands are called with the argument test or repair.

The configuration file should be self-explanatory:

Now we will enable the watchdog daemon. At the moment it should be disabled:

For testing purposes I’ve added the following to my /etc/watchdog.conf:

So when my WiFi connection gets lost my system should reboot.

Start the watchdog daemon:

OK, the watchdog daemon fails to start, so I will have to use the IP address instead. The ping option of watchdog only supports numeric IPv4 addresses:

In general you are safer pinging your router: packets to a remote host can get lost or delayed, Google’s IP may change, or your IP may get blocked if you send ping requests to Google 24/7.

Now disconnect the WiFi and voilà, after at most 60 seconds the system reboots:

Later we can enable the watchdog on boot when everything is working correctly:

The Hardware Watchdog

The software watchdog module is, of course, no protection against a kernel fault but hardware watchdog support is coming for the iMX233-OLinuXino.

Have a look at chapter 23 of the iMX233 Reference Manual (17.5 MB):

23.7 Watchdog Reset Function

The watchdog reset is a CPU-configurable device. It is programmed by software to generate a chip-wide reset after HW_RTC_WATCHDOG milliseconds. The watchdog generates this reset if software does not rewrite this register before this time elapses.

The watchdog timer decrements the register value once for every tick of the 1-kHz clock supplied from the RTC analog section (see Figure 23-1). The reset generated by the watchdog timer has no effect on the values retained in the master registers of the real-time clock seconds counter, alarm, or persistent registers (analog persistent storage).

The watchdog timer is initially disabled and set to count 4,294,967,295 milliseconds before generating a watchdog reset.

The watchdog timer does not run when the chip is in its powered-down state. Therefore, there is no master/shadow register pairing for the watchdog timer, and it must be reprogrammed after cycling power or resetting the block.
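The initial count quoted above is the largest value that fits into 32 bits. A quick calculation shows how long that is at one tick per millisecond:

```python
from datetime import timedelta

# Initial watchdog count from the manual: all 32 bits set.
initial_ms = 2**32 - 1
print(initial_ms)                          # 4294967295
print(timedelta(milliseconds=initial_ms))  # 49 days, 17:02:47.295000
```

So a freshly reset but unconfigured timer would take almost 50 days to expire, which is why software has to program a much smaller value before the watchdog becomes useful.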

I’ve seen a kernel option ( Freescale STMP3XXX & i.MX23/28 watchdog ) on newer kernels and also some log messages:

Now I have 3 watchdog devices:

But which is the hardware watchdog?

So by default the hardware watchdog timer gets assigned to /dev/watchdog, which makes sense. I haven’t yet tested whether the hardware watchdog timer works on the OLinuXino, but I think it does.

Watchdog (Linux)

Introduction

A watchdog on Linux is usually exported as a character device, /dev/watchdog. A simple API allows opening the device to enable the watchdog. Writing to it triggers (resets) the watchdog timer, and if the device is not cleanly closed, the watchdog will reboot the system.

However, a newer, more feature-rich API using ioctl is available too. We provide a small sample application called watchdog-test. For more information, see: http://git.toradex.com/cgit/linux-toradex.git/tree/Documentation/watchdog/watchdog-api.txt
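As a rough illustration of the ioctl API, the sketch below builds three request numbers from linux/watchdog.h and wraps two of them. It is not the watchdog-test application. The helper names are made up for this example, and the request encoding is the generic one used on x86 and ARM; some other architectures encode the direction bits differently, so treat the numbers as an assumption for your platform.

```python
import fcntl
import struct

# Direction bits of the generic Linux ioctl encoding (assumption: x86/ARM).
_IOC_WRITE, _IOC_READ = 1, 2


def _ioc(direction, nr, size=struct.calcsize("i")):
    """Build a request number for ioctl type 'W' carrying one C int."""
    return (direction << 30) | (size << 16) | (ord("W") << 8) | nr


WDIOC_KEEPALIVE = _ioc(_IOC_READ, 5)
WDIOC_SETTIMEOUT = _ioc(_IOC_READ | _IOC_WRITE, 6)
WDIOC_GETTIMEOUT = _ioc(_IOC_READ, 7)


def get_timeout(fd):
    """Return the current timeout in seconds of an open watchdog device."""
    buf = fcntl.ioctl(fd, WDIOC_GETTIMEOUT, struct.pack("i", 0))
    return struct.unpack("i", buf)[0]


def keepalive(fd):
    """Reset the watchdog timer through the ioctl interface."""
    fcntl.ioctl(fd, WDIOC_KEEPALIVE, struct.pack("i", 0))


print(hex(WDIOC_KEEPALIVE))  # 0x80045705
```

Keep in mind that opening /dev/watchdog to obtain the file descriptor already starts the watchdog.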

The kernel configuration option WATCHDOG_NOWAYOUT ("Disable watchdog shutdown on close") gives the userspace application no way to disable the watchdog. Once the device has been opened, the application has to trigger the watchdog forever. If the application closes the device (even if it was a "clean close" with the magic character sent), the watchdog will not be disabled. The watchdog also cannot be disabled using the "WDIOC_SETOPTIONS" ioctl. However, if this option is compiled into the kernel, it can be disabled again using the "nowayout" parameter. In our BSP, this option is not set by default.

Hardware Support

See information about specific Toradex modules in this section.

i.MX 6, i.MX 6ULL, i.MX 8M Mini and Vybrid based Modules

The NXP/Freescale i.MX6 and Vybrid SoC watchdog is the same hardware as in the i.MX2. Because the i.MX2 was released earlier, the driver is called imx2-wdt. The watchdog driver creates one device under /dev/watchdog. By default, the watchdog resets the system after 60 seconds.

Parameter   Description
nowayout    Watchdog cannot be stopped once started
timeout     Watchdog timeout in seconds (default=60)

Those parameters can be configured in the U-Boot environment bootargs, which is used to pass commands to the Linux kernel, with the driver name prepended (imx2-wdt):

Note: The respective boot environment needs to be configured on U-Boot since it passes some arguments to the Linux kernel using the bootargs environment variable.
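On the kernel command line, parameters of a built-in driver are normally written as driver.parameter=value. The sketch below only formats such strings; the helper name bootarg and the values 30 and 1 are illustrative assumptions rather than settings taken from the Toradex documentation.

```python
def bootarg(driver, parameter, value):
    """Format one driver parameter for the kernel command line."""
    return f"{driver}.{parameter}={value}"


# Illustrative values only: a 30 second timeout that cannot be stopped.
args = [bootarg("imx2-wdt", "timeout", 30), bootarg("imx2-wdt", "nowayout", 1)]
print(" ".join(args))  # imx2-wdt.timeout=30 imx2-wdt.nowayout=1
```

The resulting string is what would be appended to the bootargs variable in the U-Boot environment.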

i.MX 7 based Modules

On the NXP i.MX 7 based modules we make use of the PMIC watchdog; the driver is called rn5t618-wdt. The watchdog driver creates one device under /dev/watchdog. By default, the watchdog resets the system after 128 seconds.

Parameter   Description
nowayout    Watchdog cannot be stopped once started
timeout     Watchdog timeout in seconds (default=128, possible values=1, 8, 32, 128)

Those parameters can be configured in the U-Boot environment bootargs, which is used to pass commands to the Linux kernel, with the driver name prepended (rn5t618-wdt):

Note: The respective boot environment needs to be configured on U-Boot since it passes some arguments to the Linux kernel using the bootargs environment variable.

Readout Reset Reason

Usually the reset reason of the module is shown in either the U-Boot or the Linux kernel console messages, so the user can tell whether the SoM was reset by the watchdog. However, the Colibri iMX7 SoM is a special case due to its design and an erratum (e10574) in the SoC.

To tell whether the SoM was reset by a normal power cycle or by the watchdog, the user should read out PMIC register 0x0A with the following commands in U-Boot:

The different values of this register show the different states of the module:

The description of the register is also explained in the following image.

NVIDIA Tegra based Modules

NVIDIA Tegra T20/T30/TK1 based modules have a built-in hardware watchdog. When the watchdog isn’t triggered within the defined period of time (by default 60 seconds on T20/T30 and 80 seconds on TK1) the system will reset itself completely (all cores).

Images 20130305 (T20) and 20130820 (T30) and older contain the official NVIDIA watchdog driver. This driver behaves differently from other Linux watchdog drivers (e.g. the PXA, aka SA1100, one). The driver resets the watchdog by itself; neither writing to /dev/watchdog nor doing a WDIOC_KEEPALIVE ioctl on it changes anything. The userspace device (/dev/watchdog) was therefore not really useful.

By default, our updated driver now behaves as described in the official Watchdog API.

The NVIDIA Tegra T20 based modules support one watchdog, available under /dev/watchdog. If this watchdog is used by the kernel level heartbeat (TEGRA_WATCHDOG_ENABLE_HEARTBEAT), it cannot be used from userspace.

The NVIDIA Tegra T30 based modules support four watchdogs on the hardware side. However, one watchdog is used kernel internally for suspend mode. Three watchdogs are exported to userspace (/dev/watchdog2). If the kernel level heartbeat (TEGRA_WATCHDOG_ENABLE_HEARTBEAT) is enabled, the first watchdog cannot be used from userspace.

The NVIDIA Tegra TK1 based modules support one watchdog available under /dev/watchdog0.

The driver has two parameters:

Parameter   Description
nowayout    Watchdog cannot be stopped once started
heartbeat   Watchdog timeout in seconds (default=60 on T20/T30, default=80 on TK1)

Those parameters can be configured in the U-Boot environment bootargs, which is used to pass commands to the Linux kernel, with the driver name prepended (tegra_wdt):

Note: The respective boot environment needs to be configured on U-Boot since it passes some arguments to the Linux kernel using the bootargs environment variable.

If reboot was due to a watchdog timeout, you will find the following message on next boot:

Note: On T20/T30 the option CONFIG_TEGRA_WATCHDOG_ENABLE_ON_PROBE was renamed to TEGRA_WATCHDOG_ENABLE_HEARTBEAT, which re-enables the watchdog reset by the driver itself (kernel level heartbeat).

Note: When using the kernel level heartbeat, the kernel will not necessarily reboot when a kernel panic occurs, since interrupts might still be handled. In order to reboot on kernel panic, use the command line option panic= or the sysctl.conf option "kernel.panic = ", in both cases followed by a timeout in seconds.

Software support

This section has information about generic watchdog support on Linux.

Systemd

Systemd supports hardware watchdogs through the system configuration options RuntimeWatchdogSec and ShutdownWatchdogSec (in /etc/systemd/system.conf). For more information see this blog article about Watchdog support for systemd.

C example

A simple C example which keeps the watchdog fed. After this program gets closed or killed, the system will reboot after the watchdog timeout has expired.

watchdog(8) — Linux man page

watchdog — a software watchdog daemon

Synopsis

Description

The Linux kernel can reset the system if serious problems are detected. This can be implemented via special watchdog hardware, or via a slightly less reliable software-only watchdog inside the kernel. Either way, there needs to be a daemon that tells the kernel the system is working fine. If the daemon stops doing that, the system is reset.

watchdog is such a daemon. It opens /dev/watchdog, and keeps writing to it often enough to keep the kernel from resetting, at least once per minute. Each write delays the reboot time another minute. After a minute of inactivity the watchdog hardware will cause the reset. In the case of the software watchdog the ability to reboot will depend on the state of the machines and interrupts.

The watchdog daemon can be stopped without causing a reboot if the device /dev/watchdog is closed correctly, unless your kernel is compiled with the CONFIG_WATCHDOG_NOWAYOUT option enabled.

Tests

The watchdog daemon does several tests to check the system status:

  • Is the process table full?
  • Is there enough free memory?
  • Are some files accessible?
  • Have some files changed within a given interval?
  • Is the average work load too high?
  • Has a file table overflow occurred?
  • Is a process still running? The process is specified by a pid file.
  • Do some IP addresses answer to ping?
  • Do network interfaces receive traffic?
  • Is the temperature too high? (Temperature data not always available.)
  • Execute a user defined command to do arbitrary tests.
  • Execute one or more test/repair commands found in /etc/watchdog.d. These commands are called with the argument test or repair.

If any of these checks fail watchdog will cause a shutdown. Should any of these tests except the user defined binary last longer than one minute the machine will be rebooted, too.

Options

Available command line options are the following:

  • -v, --verbose: Set verbose mode. Only implemented if compiled with the SYSLOG feature. This mode logs several pieces of information to LOG_DAEMON with priority LOG_INFO. This is useful if you want to see exactly what happened until the watchdog rebooted the system. Currently it logs the temperature (if available), the load average, the change date of the files it checks and how often it went to sleep.
  • -s, --sync: Try to synchronize the filesystem every time the process is awake. Note that the system is rebooted if for any reason the synchronizing lasts longer than a minute.
  • -b, --softboot: Soft-boot the system if an error occurs during the main loop, e.g. if a given file is not accessible via the stat(2) call. Note that this does not apply to the opening of /dev/watchdog and /proc/loadavg, which are opened before the main loop starts.
  • -f, --force: Force the usage of the interval given or the maximal load average given in the config file.
  • -c config-file, --config-file config-file: Use config-file as the configuration file instead of the default /etc/watchdog.conf.
  • -q, --no-action: Do not reboot or halt the machine. This is for testing purposes. All checks are executed and the results are logged as usual, but no action is taken. Your hardware card or the kernel software watchdog driver is not enabled either. Temperature checking is also disabled since this triggers the hardware watchdog on some cards.

Function

After watchdog starts, it puts itself into the background and then tries all checks specified in its configuration file in turn. Between any two tests it writes to the kernel device to prevent a reset. After finishing all tests watchdog goes to sleep for some time. The kernel driver expects a write to the watchdog device every minute; otherwise the system will be reset. By default watchdog sleeps for only 10 seconds, so it triggers the device early enough.

Under high system load watchdog might be swapped out of memory and may fail to make it back in time. Under these circumstances the Linux kernel will reset the machine. To avoid unnecessary reboots, make sure you have the variable realtime set to yes in the configuration file watchdog.conf. This adds real time support to watchdog: it will lock itself into memory and there should be no problem even under the highest of loads.

You can also specify a maximal allowed load average. Once this load average is reached the system is rebooted. You may specify maximal load averages for 1 minute, 5 minutes or 15 minutes. By default this test is disabled. Be careful not to set this parameter too low. To set a value less than the predefined minimal value of 2, you have to use the -f option.
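The load test can be pictured roughly as follows. This is a sketch, not the daemon's implementation: load_too_high and its parameters are hypothetical names, the limits in the example are arbitrary, and it assumes that a limit counts as violated only once the load is strictly above it.

```python
def load_too_high(load, max1=None, max5=None, max15=None):
    """Return True if any configured load average limit is exceeded.

    load is a (1 min, 5 min, 15 min) tuple, as returned by os.getloadavg().
    A limit of None disables that test, mirroring the default.
    """
    limits = (max1, max5, max15)
    return any(lim is not None and avg > lim for avg, lim in zip(load, limits))


# Arbitrary example values: only the 1 minute limit is exceeded.
print(load_too_high((25.3, 10.0, 4.2), max1=24, max5=18, max15=12))  # True
```

On a live system the first argument would come from os.getloadavg() or from parsing /proc/loadavg.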

You can also specify a minimal amount of virtual memory you want to have available as free. As soon as more virtual memory is used action is taken by watchdog. Note, however, that watchdog does not distinguish between different types of memory usage. It just checks for free virtual memory.

If you have a watchdog card with temperature sensor you can specify the maximal allowed temperature. Once this temperature is reached the system is halted. The default value is 120. There is no unit conversion so make sure you use the same unit as your hardware. watchdog will issue warnings once the temperature reaches 90%, 95% and 98% of this temperature.
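For the default limit of 120, the warning levels work out as shown below. The function name warning_levels is just a label for this small calculation.

```python
def warning_levels(max_temperature, percents=(90, 95, 98)):
    """Temperatures at which warnings are issued, in the hardware's own unit."""
    return [max_temperature * p / 100 for p in percents]


print(warning_levels(120))  # [108.0, 114.0, 117.6]
```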

When using file mode watchdog will try to stat(2) the given files. Errors returned by stat will not cause a reboot. For a reboot the stat call has to last at least one minute. This may happen if the file is located on an NFS mounted filesystem. If your system relies on an NFS mounted filesystem you might try this option. However, in such a case the sync option may not work if the NFS server is not answering.

watchdog can read the pid from a pid file and see whether the process still exists. If not, action is taken by watchdog. So you can for instance restart the server from your repair-binary.
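A pid file check of this kind can be sketched in a few lines. The function name process_alive is hypothetical, and the sketch relies on the usual POSIX convention that sending signal 0 only tests whether a process exists.

```python
import os


def process_alive(pidfile):
    """Return True if the process named in pidfile still exists."""
    try:
        with open(pidfile) as f:
            pid = int(f.read().strip())
    except (OSError, ValueError):
        return False  # no pid file, or no valid pid in it
    if pid <= 0:
        return False  # 0 and negative values address process groups
    try:
        os.kill(pid, 0)  # signal 0: existence check only, nothing is sent
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True
```

A repair binary could then restart the server whenever this check reports False.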

watchdog will try periodically to fork itself to see whether the process table is full. This process will leave a zombie process until watchdog wakes up again and catches it; this is harmless, don’t worry about it.

In ping mode watchdog tries to ping the given IP addresses. These addresses do not have to be a single machine. It is possible to ping to a broadcast address instead to see if at least one machine in a subnet is still alive.

Do not use this broadcast ping unless your MIS person a) knows about it and b) has given you explicit permission to use it!

watchdog will send out three ping packets and wait up to interval seconds for the reply, with interval being the time it goes to sleep between two times triggering the watchdog device. Thus an unreachable network will not cause a hard reset but a soft reboot.

You can also test passively for an unreachable network by just monitoring a given interface for traffic. If no traffic arrives the network is considered unreachable causing a soft reboot or action from the repair binary.

watchdog can run an external command for user-defined tests. A return code other than 0 means an error occurred and watchdog should react. If the external command is killed by an uncaught signal, this is considered an error by watchdog too. The command may take longer than the time slice defined for the kernel device without a problem. However, error messages are generated into the syslog facility. If you have enabled softboot on error the machine will be rebooted if the binary doesn’t exit in half the time watchdog sleeps between two tries triggering the kernel device.
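The two error conditions for an external command (nonzero return code, death by signal) can be illustrated with Python's subprocess module. The helper check_failed is a hypothetical name, and the sketch deliberately ignores the timing rules described above.

```python
import subprocess
import sys


def check_failed(command):
    """Run a test command and report whether it signalled an error."""
    result = subprocess.run(command)
    # subprocess reports a negative returncode when the child was killed
    # by a signal, so one comparison covers both error cases.
    return result.returncode != 0


# Example with a stand-in command that exits with status 3.
print(check_failed([sys.executable, "-c", "import sys; sys.exit(3)"]))  # True
```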

If you specify a repair binary it will be started instead of shutting down the system. If this binary is not able to fix the problem watchdog will still cause a reboot afterwards.

If the machine is halted an email is sent to notify a human that the machine is going down. Starting with version 4.4 watchdog will also notify the human in charge if the machine is rebooted.

Soft Reboot

A soft reboot (i.e. controlled shutdown and reboot) is initiated for every error that is found. Since there might be no more processes available, watchdog does it all by itself. That means:

  1. Kill all processes with SIGTERM.
  2. After a short pause kill all remaining processes with SIGKILL.
  3. Record a shutdown entry in wtmp.
  4. Save the random seed from /dev/urandom. If the device is nonexistent or there is no filename for saving, this step is skipped.
  5. Turn off accounting.
  6. Turn off quota and swap.
  7. Unmount all partitions except the root partition.
  8. Remount the root partition read-only.
  9. Shut down all network interfaces.

Check Binary

If the return code of the check binary is not zero watchdog will assume an error and reboot the system. Be careful with this if you are using the real-time properties of watchdog since watchdog will wait for the return of this binary before proceeding. A positive exit code is interpreted as a system error code (see errno.h for details). Negative values are special to watchdog:

  • -1: Reboot the system. This is not exactly an error message but a command to watchdog. If the return code is -1 watchdog will not try to run a shutdown script instead.
  • -2: Reset the system. This is not exactly an error message but a command to watchdog. If the return code is -2 watchdog will simply refuse to write the kernel device again.
  • -3: Maximum load average exceeded.
  • -4: The temperature inside is too high.
  • -5: /proc/loadavg contains no (or not enough) data.
  • -6: The given file was not changed in the given interval.
  • -7: /proc/meminfo contains invalid data.
  • -8: Child process was killed by a signal.
  • -9: Child process did not return in time.
  • -10: Free for personal use.

Repair Binary

The repair binary is started with one parameter: the error number that caused watchdog to initiate the reboot process. After trying to repair the system the binary should exit with 0 if the system was successfully repaired and thus there is no need to reboot anymore. A return value other than 0 tells watchdog to reboot. The return code of the repair binary should be the error number of the error causing watchdog to reboot. Be careful with this if you are using the real-time properties since watchdog will wait for the return of this binary before proceeding.

Test Directory

Executables placed in the test directory are discovered by watchdog on startup and are automatically executed. They are bounded time-wise by the test-timeout directive in watchdog.conf.

These executables are called with either "test" as the first argument (if a test is being performed) or "repair" as the first argument (if a repair for a previously failed "test" operation is being performed).

As with test binaries and repair binaries, the expected exit code for a successful test or repair operation is always zero.

If an executable’s test operation fails, the same executable is automatically called with the "repair" argument as well as the return code of the previously failed test operation.

For example, if the following execution returns 42:

/etc/watchdog.d/my-test test

the watchdog daemon will attempt to repair the problem by calling:

/etc/watchdog.d/my-test repair 42

This enables administrators and application developers to make intelligent test/repair commands. If the "repair" operation is not required (or is not likely to succeed), it is important that the author of the command return a non-zero value so the machine will still reboot as expected.

Note that the watchdog daemon may interpret and act upon any of the reserved return codes noted in the Check Binary section prior to calling a given command in "repair" mode.
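A command for the test directory could be structured along these lines. Everything specific in this sketch is an assumption made for illustration: the functions service_ok and restart_service are placeholders, and 42 is simply the return code from the example above.

```python
import sys

SERVICE_DOWN = 42  # illustrative error code, as in the example above


def service_ok():
    """Hypothetical health check; replace with a real probe."""
    return True


def restart_service():
    """Hypothetical repair action; return True if it worked."""
    return False


def main(argv):
    mode = argv[1] if len(argv) > 1 else ""
    if mode == "test":
        return 0 if service_ok() else SERVICE_DOWN
    if mode == "repair":
        # argv[2] carries the return code of the previously failed test.
        failed_code = int(argv[2]) if len(argv) > 2 else 1
        # Non-zero on failure, so the machine still reboots as expected.
        return 0 if restart_service() else failed_code
    return 1  # unknown invocation: report an error


# An installed script would end with: sys.exit(main(sys.argv))
```

Returning the original error code from a failed repair keeps the behaviour in line with the rule that a repair command which cannot fix the problem must exit with a non-zero value.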

Bugs

None known so far.

Authors

The original code is an example written by Alan Cox, the author of the kernel driver. All additions were written by Michael Meskes. Johnie Ingram had the idea of testing the load average. He also took over the Debian specific work. Dave Cinege brought up some hardware watchdog issues and helped testing this stuff.
