Memory Monitor
About the memory limitations in the EVE system
EVE uses cgroups to limit the memory usage of different components of the system. In particular, EVE creates a dedicated cgroup "eve" to limit the memory usage of its main components. This cgroup is further divided into several sub-cgroups, including "eve/services", and, in particular, "eve/services/pillar".
So, a usual memory hierarchy in the EVE system looks like this:
$ tree /sys/fs/cgroup/memory/eve -d
/sys/fs/cgroup/memory/eve <--- is set from dom0_mem kernel argument, usually 800Mb (or 8GB for ZFS)
├── containerd <-------------- is set from ctrd_mem kernel argument, usually 400Mb
└── services <---------------- is set from eve_mem kernel argument, usually 650Mb
├── <some services>
├── pillar <-------------- is set from eve_mem kernel argument, usually 650Mb
├── <some other services>
└── xen-tools
These limits are set within pkg/dom0-ztools/rootfs/etc/init.d/010-eve-cgroup
script. The "eve" cgroup limit is set according to the dom0_mem kernel
argument, and the "pillar" cgroup (along with the parent "services" cgroup
and all other sub-cgroups of "services" cgroup) limit is set according the
eve_mem kernel argument.
In its turn, the "pillar" cgroup encapsulates several processes, the most
important of which is zedbox:
$ for pid in $(cat /sys/fs/cgroup/memory/eve/services/pillar/cgroup.procs); do echo "CMD: $(cat /proc/$pid/cmdline | tr '\0' ' ')"; done
CMD: /bin/sh /init.sh
CMD: /bin/sh /opt/zededa/bin/device-steps.sh
CMD: /opt/zededa/bin/zedbox <--- zedbox process
CMD: /bin/sh /opt/zededa/bin/device-steps.sh
CMD: cat
CMD: dhcpcd: eth0 [ip4] [ip6]
CMD: dhcpcd: eth1 [ip4] [ip6]
CMD: /usr/sbin/ntpd -p pool.ntp.org
CMD: /opt/zededa/bin/dnsmasq -C /run/zedrouter/dnsmasq.bn1.conf
CMD: sleep 300
The other processes in the "pillar" cgroup are DHCP clients, NTP daemon, DNS server, and some other services.
RSS, cache, and the cgroup memory usage
Each cgroup contains a file that shows the memory usage of the cgroup:
/sys/fs/cgroup/memory/*/memory.usage_in_bytes. The same information is
used by the cgroup events subsystem to trigger events. The memory usage of
the cgroup is calculated as the sum of the memory usage of all processes
in the cgroup.
It is important to note that this value also includes cached memory. Cache is usually easy to reclaim, so it is not considered a problem if the cgroup memory usage is close to the limit. The problem arises when the Resident Set Size (RSS) of the process is close to the limit.
Unfortunately, the cgroup memory threshold events are triggered based on the
cgroup memory usage, which includes cache. During the threshold events handling,
we take into account the cache size, not to report false positives.
However, for better results we also monitor the RSS of the zedbox process
separately in a dedicated thread, reading the /proc/<zedbox_pid>/statm file
every 5 seconds.
Process Stall Information (PSI)
The PSI metrics are a set of metrics that can be used to detect when the system is under memory pressure. They are tracked and exposed by the kernel. The metrics show how much time the system is stalled on memory reclaim.
The PSI metrics are available in the /proc/pressure. In fact the PSI
metrics are available for CPU, memory, and IO, but in the scope of the
memory monitor we are interested in the memory PSI metrics.
Format of the PSI metrics
The PSI metrics are available in the /proc/pressure/memory file. The file
contains the following fields:
some avg10 avg60 avg300 total
full avg10 avg60 avg300 total
The some line shows the percentage of time at least one process in the system
is stalled on memory reclaim for the last 10, 60, and 300 seconds. The full
line shows the same, but for the whole system. The total field shows the
total number of the corresponding stall events.
Additional information on the PSI metrics in EVE
More information about the PSI metrics can be found in the kernel documentation.
Also, EVE provides a tool psi-collector that can be used to collect the PSI
statistics. Documentation on the psi-collector tool can be found in the
psi-collector README.
Another tool that can be used to visualize the PSI metrics is psi-visualizer.
The documentation on the psi-visualizer tool can be found in the
psi-visualizer README.
What the tool is monitoring?
The memory monitor is designed to track and respond to specific memory-related events related to the zedbox process or the entire EVE system.
Here are the specific memory events that are being monitored:
- zedbox process memory usage: The memory usage of the zedbox process is monitored. If the Resident Set Size (RSS) of the Zedbox process exceeds a predefined threshold, a handler script is triggered. The RSS is checked every 5 seconds.
- eve cgroup memory usage: The memory usage of the eve cgroup is monitored in two ways:
- Threshold Event: If the memory usage of the eve cgroup exceeds a certain percentage (98% by default) of its memory limit, a handler script is triggered.
- Pressure Event: If the memory pressure level of the EVE cgroup reaches the "medium" level, a handler script is triggered. Memory pressure level is a measure of how the system is reclaiming memory from the cgroup. We set the level to "medium", not "low", to avoid triggering the event too often, for example, when the system reclaims cache memory.
- Pillar Cgroup Memory Usage: The memory usage of the pillar cgroup is also monitored in two ways:
- Threshold Event: If the memory usage of the pillar cgroup exceeds a predefined threshold in bytes, a handler script is triggered. Here we exclude cache from the memory usage calculation.
- Pressure Event: If the memory pressure level of the pillar cgroup reaches the "low" level, a handler script is triggered.
- PSI Metrics: The PSI metrics are monitored. If the
full avg10value of the PSI metrics exceeds a predefined threshold (90% by default), a handler script is triggered.
The full_avg10 metric is particularly useful for detecting situations where
all processes are stalled due to memory pressure, making it an effective
indicator of imminent OOM, especially in cases of rapid memory exhaustion
("fast" OOMs). We focus on full_avg10 because some_avg10 can spike when
only some processes are experiencing memory pressure, which doesn’t
necessarily lead to an OOM. Using some_avg10 would be too sensitive and
could result in false alarms, so we prioritize full_avg10 for more accurate
detection. Longer intervals like 60- and 300-second averages are not as
reactive and may miss fast-approaching OOM conditions.
The handler script that is triggered in response to these events performs various actions to log and manage memory usage. It can, for example, dump memory allocation sites, trigger a heap dump, and log memory usage details.
The tool integration
The memory monitor is started as a daemon on the EVE system. It runs in the background and monitors the memory usage. It's started in a dedicated container "memory-monitor" that is created by the containerd service.
The monitor binary and the handler script are located in the container version
of the /sbin/ directory. The handler script is executed by the monitor binary
when a memory event is triggered and expected to be in the same directory.
The default configuration of the memory monitor is stored in the container version
of the /etc/ directory.
Note, that these files are not seen in the host system, as they are located in the container filesystem.
The container and its file system is available for investigation with the following command:
eve enter memory-monitor
The tool than creates the output files in /persist/memory-monitor/output.
The persistent storage is mounted to the container, so the output files are
available in the host system.
Enabling memory monitor
By default, the memory monitor is disabled: its container starts automatically
with the EVE boot but remains paused. To activate memory monitoring, set the
global configuration option memory-monitor.enabled to true. When Pillar
receives this update, it simply resumes the paused container. Conversely, if
you need to disable it again, setting the option to false will pause the
container.
This setup allows you to toggle memory monitoring on and off, which can be especially useful if you suspect that the memory monitor's handler script might be consuming excessive CPU or memory resources.
As the memory monitor is a new component in the EVE system, and very few real-world testing was done with it, is important to be on the safe side, so we disable the memory monitor by default. It can be enabled by the user if needed.
Internals of the build and startup process
To deploy the memory monitor to the EVE system, we create a corresponding
container image. The container image is built using the Dockerfile. In the
Dockerfile, we copy the necessary files to the container image and build the
tool using the make dist command. The Makefile can be also used for building
the tool locally, for example, to test it on a local setup.
Later the container image is used by LinuxKit to deliver it to the EVE rootfs.
We start the container in the "services" section of LinuxKit configuration.
It means that the LinuxKit will start the container automatically when the EVE
system boots. And it will expect the process in the container to run as
foreground process and do not return control. Otherwise, the service will be
considered failed and the container will be stopped. To achieve this, the memory
monitor binary is started with the -f flag.
Apparmor profile
There is an idea to use the apparmor profile for the memory monitor handler. It
should add a layer of security to the handler script that is executed by the
memory monitor by the system() call. The apparmor profile is not implemented
yet, but its template is located in the sbin.memory-monitor-handler file. It's
also copied to the dist directory, but not to the final container image.
Note on the memory monitor and the memory cgroups
During the initialization the tool removes itself from the current memory cgroup and moves to the root one, so it does not affect the memory usage of the eve cgroup.
Before the tool starts the handling script, part of which is to dump the memory usage of the pillar, accessing the internal golang debugging http server, it temporarily increases the memory limit of the pillar cgroup by 50Mb. This is done to avoid the situation when the memory usage of the pillar cgroup is close to the limit, and the handler script cannot run because of the lack of memory. After the handler script finishes, the memory limit of the pillar cgroup is restored to the original value.
Run as a standalone tool on older versions of EVE
On the older versions of EVE, the memory monitor is not integrated into the system as a container. But it can be run as a standalone tool.
To run it, you need to build the tool using the Makefile:
$ make dist
It will create a dist directory with the memory monitor binary, handler script, and configuration file. To run the memory monitor, you need to copy the dist directory to the EVE system and run the binary:
$ ./memory-monitor
The memory monitor will run daemonized in the background. To stop it, you can
send the SIGTERM signal to the process:
$ pkill -SIGTERM memory-monitor
The tool will create the output files in the /persist/memory-monitor/output.
Output of the memory monitor
The memory events handling results are logged to the
/persist/memory-monitor/output directory. The output directory is created
automatically if it does not exist.
The output directory contains:
- Subdirectory for the last event triggered. The name of the subdirectory is the timestamp of the event. The subdirectory contains:
event_info.txt: A text file that contains metadata about the event. It includes the type of the event and the EVE version that was running when the event was triggered.heap_pillar.out: A heap dump of the zedbox process. It's collected using the built-in go tool. It's a binary file that can be analyzed using thepproftool. How to analyze the heap dump is described in the next section.memstats_pillar.out: A memory usage report of the pillar cgroup. It contains the total memory usage of the pillar cgroup according to the cgroup itself, including the cache and the total Resident Set Size (RSS) according to the smaps file of all processes in the cgroup. It also contains the RSS of all processes in the pillar cgroup.memstat_eve.out: A memory usage report of the EVE cgroup. It contains the total memory usage of the EVE cgroup according to the cgroup itself, including the cache and the total Resident Set Size (RSS) according to the smaps file. It also contains the RSS and per-mapping details of all the processes in the EVE cgroup.zedbox: A symlink to the zedbox binary that was used to collect the heap dump.- Tar archives of the previous event directories. The archives are created
automatically by the handler script when a new event is triggered. The
archives are created to save disk space (each event directory can be quite
large, ~5-15Mb, while the archive is usually ~1Mb). We keep not more than
100 Mb of archives: the oldest archives are deleted by the handler script
when the total size of the archives exceeds 100 Mb.
The archive does not contain the
zedboxsymlink. events.log: A log file that contains a timestamped list of all memory events, archives for which are still present in the output directory.memory-monitor-handler.log: A log file that contains the output of the handler script. It is useful for debugging the handler script if it fails. It's cleared if the handler script run is successful.
Logs of the memory monitor
The memory monitor logs its output to the syslog. Unfortunately, the syslog
logs are not shown with the logread command. Nevertheless, you can see them
in the logs files in /persist/newlogd directory. The memory monitor logs
are prefixed with the memory-monitor tag.
How to analyze the heap dump
The heap dump (heap_pillar.out) is a binary file that can be analyzed using
go tool pprof. This tool is a part of the Go toolchain, so it is not
necessary to install anything extra to use it. To run it, first copy the
heap_pillar.out file and the zedbox binary to the same directory on your
local machine. Then, run the following command:
$ go tool pprof /path/to/zedbox /path/to/output/<event_timestamp>/heap_pillar.out
It will start the pprof tool in the interactive mode. You can use the top
command to see the top memory allocations. The top command will show you the
top memory allocations in the heap dump. You can also use other commands to
analyze the heap dump.
Another useful command is list. It shows the source code that corresponds to
a function with a given as a regular expression name. The name can be taken from
the top command output. Do not forget to escape special characters in the
function name (for example, ., (, *, etc) with a backslash.
To make it possible to see the source code of the memory allocations, you need
to have the source code of the zedbox binary. You can provide the path to the
source code to the pprof tool using the -source_path and -trim_path flags.
The first flag is used to provide the path to the source code, and the second
flag is used to trim the path embedded in the binary as they correspond to the
path on the build machine (container) that originally had the source code under
the /pillar/ directory. Also, don't forget to use the source code of the
version of the zedbox binary that was used to collect the heap dump.
$ go tool pprof -source_path ../ -trim_path / /path/to/zedbox /path/to/output/<event_timestamp>/heap_pillar.out
The pprof tool can also generate an interactive graph of the memory
allocations. To do this, you can run the following command:
$ go tool pprof -source_path ../ -trim_path / -http=:8080 -call_tree -nodefraction=0 -lines /path/to/zedbox /path/to/output/<event_timestamp>/heap_pillar.out
Configuration of the memory monitor
The memory monitor is configured using the memory-monitor.conf file.
The default configuration file is located in the /etc/ directory of the
container image. But it can be overridden by the user by placing the
configuration file in the /persist/memory-monitor/ directory.
To copy the default configuration file to the /persist/memory-monitor/ for
editing, run the following command:
eve exec memory-monitor cp /etc/memory-monitor.conf /persist/memory-monitor/
To use the custom configuration values, the user should restart the memory
monitor tool by sending the SIGHUP signal to the memory monitor process. It
can be done by running the following command:
pkill -SIGHUP /sbin/memory-monitor
It can be also done by the eve cli command:
eve memory-monitor-update-config
The configuration file should contain the following fields:
CGROUP_PILLAR_THRESHOLD_MB=<threshold in MB>
CGROUP_EVE_THRESHOLD_PERCENT=<threshold in percent>
PROC_ZEDBOX_THRESHOLD_MB=<threshold in MB>
PSI_THRESHOLD_PERCENT=<threshold in percent>
The fields are:
CGROUP_PILLAR_THRESHOLD_MB: The threshold in megabytes for the memory usage of the Pillar cgroup. If the memory usage of the Pillar cgroup exceeds this threshold (we exclude cache), the handler script is triggered.CGROUP_EVE_THRESHOLD_PERCENT: The threshold for the memory usage of the EVE cgroup. The threshold then is calculated as a percentage of the memory limit of the eve cgroup, that is read from/sys/fs/cgroup/memory/eve/memory.limit_in_bytesin runtime.PROC_ZEDBOX_THRESHOLD_MB: The threshold in megabytes for the Resident Set Size (RSS) of the zedbox process. It will be compared every 5 second to the RSS of the zedbox process, read from the/proc/<zedbox_pid>/statmfile.PSI_THRESHOLD_PERCENT: The threshold for thefull avg10value of the PSI metrics. See the PSI section for more information.
If some of the fields are missing in the configuration file, the default values will be used. The default values are:
CGROUP_PILLAR_THRESHOLD_MB=400
CGROUP_EVE_THRESHOLD_PERCENT=98
PROC_ZEDBOX_THRESHOLD_MB=200
PSI_THRESHOLD_PERCENT=90
Makefile targets
You can find help on the Makefile targets by running:
$ make help
Worth mentioning that the Makefile has a list of targets that can be used to
build, deploy, get and analyze the memory monitor output in a local setup with
an instance of the EVE system running in a VM, accessible via SSH by local_eve
alias. All these targets are prefixed with local_.
Pressure tool
There is a tool pressure that can be used to allocate memory in the EVE system
to trigger the memory monitor events. The tool can be built using the Makefile:
$ make pressure
It will create a pressure binary in the bin directory. The binary can be
copied to the EVE system and run from any directory.
The tool takes amount of memory to allocate in MB as an argument.
For example, to allocate 100Mb of memory in the EVE system, run the following command:
pressure 100
It will allocate 100Mb of memory and release it after user presses Enter.
It's worth mentioning that if you run the tool from an ssh session, it will allocate memory in the ssh session cgroup, that is exactly a part of the eve cgroup. So, it will trigger the eve cgroup memory event.