v1.44.0

netdata/netdata

Release date: 2023-12-07 02:15:03

Latest netdata/netdata release: v1.45.3 (2024-04-12 21:07:03)

True to our schedule, this is another great Netdata release!

Important: Stay informed about upcoming changes and potential deprecations by reviewing the deprecation notice sections. This will help you plan for any necessary adjustments to ensure a smooth transition.

Netdata Growth

Release Summary

Release Highlights

Netdata beats Prometheus in all aspects

We tested Netdata and Prometheus at scale, both ingesting 2.7 million metrics per second. On the same workload, Netdata vs Prometheus needs:

Read the full performance comparison between Netdata and Prometheus.

To achieve these astonishing results, we made the following changes to Netdata since the previous release:

New SLOTS streaming protocol

A new streaming protocol allows Netdata children and parents to share a common index of the streamed metrics, so parents can receive metrics without consulting hash tables. This reduces the overall overhead on parents by about 30%, without increasing the overhead on children (the children just number each metric).

The new protocol, called SLOTS, is automatically selected when both the child and the parent support it.
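The idea of a shared index can be illustrated with a minimal sketch. It is not Netdata's wire protocol: the class names, the tuple layout, and the use of a dictionary on the child side are all assumptions made for brevity. It only shows why a parent that receives a slot number can resolve a metric with a list index instead of a hash lookup.

```python
from typing import Dict, List, Optional, Tuple


class Child:
    """Sender side (hypothetical): numbers each metric the first time it is streamed."""

    def __init__(self) -> None:
        self._slots: Dict[str, int] = {}

    def encode(self, metric: str, value: float) -> Tuple[int, Optional[str], float]:
        slot = self._slots.get(metric)
        if slot is None:
            # First sample of this metric: allocate the next slot and send
            # the name along so the parent can learn the mapping.
            slot = len(self._slots)
            self._slots[metric] = slot
            return slot, metric, value
        # Later samples carry only the slot number.
        return slot, None, value


class Parent:
    """Receiver side (hypothetical): resolves a slot with a list index."""

    def __init__(self) -> None:
        self._names: List[str] = []
        self.latest: List[float] = []

    def receive(self, slot: int, name: Optional[str], value: float) -> str:
        if name is not None:
            # Slots are announced in order, so a new slot extends the lists.
            assert slot == len(self._names)
            self._names.append(name)
            self.latest.append(value)
        else:
            # Known slot: direct index, no hashing of the metric name.
            self.latest[slot] = value
        return self._names[slot]
```

In this sketch the parent never hashes a metric name after the first announcement, which is the kind of saving the paragraph above describes.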

Streaming compression algorithms

Streaming now supports multiple compression algorithms. Previous Netdata releases supported only LZ4, which is known for its speed and average compression ratio. This release adds support for ZSTD, GZIP, and BROTLI.

ZSTD provides the best balance between compression ratio and CPU consumption, and therefore it is now the default.

The selection order of the compression algorithms can be configured on parents, in the [API] section of stream.conf, by setting compression algorithms order = zstd lz4 brotli gzip.

If you need to save as much bandwidth as possible at the expense of CPU utilization, set this so that brotli or gzip appear first in the list, before zstd and lz4.

This also means that parents can now use a different compression order for each API key, so API keys can be assigned based on the location of the child. For example, children on billable egress bandwidth can use an API key that prefers the best compression (brotli, gzip), while children on non-billable egress bandwidth can use an API key that prefers the lowest CPU utilization (zstd, lz4).
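A per-key preference order can be modelled as follows. This is a sketch under assumptions: the API key names are invented, and the rule "pick the first algorithm in the parent's order that the child also supports" is a plausible reading of the text, not a description of Netdata's actual negotiation code.

```python
from typing import Dict, Iterable, List, Optional


def parse_order(value: str) -> List[str]:
    """Split a space-separated order such as 'zstd lz4 brotli gzip'."""
    return value.split()


def pick_algorithm(parent_order: List[str], child_supported: Iterable[str]) -> Optional[str]:
    """Return the first algorithm the parent prefers that the child supports."""
    supported = set(child_supported)
    for algorithm in parent_order:
        if algorithm in supported:
            return algorithm
    return None  # no common algorithm: stream uncompressed


# Hypothetical API keys, one per class of child.
ORDER_BY_API_KEY: Dict[str, List[str]] = {
    "key-billable-egress": parse_order("brotli gzip zstd lz4"),
    "key-local-network": parse_order("zstd lz4 brotli gzip"),
}
```

With these orders, a child that supports all four algorithms would get brotli under the first key and zstd under the second, while an older child that only supports lz4 would get lz4 under either key.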

Gorilla compression beta

Gorilla compression is a time series data compression technique, developed by Facebook for their time series database, Gorilla. It's particularly efficient for compressing data that changes incrementally over time, which is a common characteristic of time series data.
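The core trick behind Gorilla-style value compression is to XOR each value with the previous one, so that values that barely change leave very few meaningful bits to store. The sketch below shows only that XOR step on 64-bit floats, as described for Facebook's Gorilla; it does not do the final bit packing, and it should not be read as Netdata's adaptation, whose sample format and details may differ. All function names are illustrative.

```python
import struct
from typing import List


def float_to_bits(x: float) -> int:
    """Reinterpret a float64 as an unsigned 64-bit integer."""
    return struct.unpack(">Q", struct.pack(">d", x))[0]


def bits_to_float(b: int) -> float:
    """Reinterpret an unsigned 64-bit integer as a float64."""
    return struct.unpack(">d", struct.pack(">Q", b))[0]


def xor_encode(values: List[float]) -> List[int]:
    """XOR each value's bit pattern with the previous one (first is XORed with 0)."""
    out: List[int] = []
    prev = 0
    for v in values:
        bits = float_to_bits(v)
        out.append(bits ^ prev)
        prev = bits
    return out


def xor_decode(deltas: List[int]) -> List[float]:
    """Invert xor_encode by accumulating the XORs."""
    out: List[float] = []
    prev = 0
    for d in deltas:
        prev ^= d
        out.append(bits_to_float(prev))
    return out


def meaningful_bits(delta: int) -> int:
    """Bits left after dropping the leading and trailing zeros of a 64-bit word."""
    if delta == 0:
        return 0
    trailing_zeros = (delta & -delta).bit_length() - 1
    return delta.bit_length() - trailing_zeros
```

A repeated value XORs to zero, which a real encoder can store as a single bit; a small change typically leaves a short run of meaningful bits surrounded by zeros, which is where the memory savings come from.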

This release of Netdata includes an adaptation of Gorilla compression which, once enabled, reduces Netdata's memory usage by an additional 30%.

This was not ready when we compared Netdata and Prometheus, so the benefits of Gorilla compression were not accounted for in the comparison. With Gorilla compression enabled, Netdata uses 70%+ less memory than Prometheus.

To try Gorilla compression, edit netdata.conf and, in the [db] section, set dbengine page type = gorilla.

Keep in mind that enabling Gorilla compression changes the dbengine file format to Gorilla-compressed metrics. This version of Netdata can read Gorilla-compressed data from dbengine even if Gorilla compression is not enabled, but previous versions of Netdata cannot read it. So enable Gorilla only if you don't plan to switch back to a previous version of Netdata.

Our plan is to have Gorilla compression enabled by default in the next release of Netdata.

systemd-journal logs

Our systemd-journal.plugin was already considerably faster (10x) than journalctl, but it was still slow when the journal database is huge (e.g. at journal centralization points, where hundreds or thousands of nodes push their logs).

In this release, we introduce several changes to allow the plugin to work promptly in such environments.

Sampling and estimations

The biggest performance issue with systemd-journal logs is the query performance when dealing with huge logs databases.

To overcome this performance issue and provide prompt responses to queries, Netdata now uses the following strategy:

  1. The latest 500k log entries read from journal files work like before: we read all of them and all the values for all their fields, so that we can have accurate histograms and counters per field value at the filters.
  2. Once we hit the 500k log entries limit on a single query, we turn on sampling and estimations.
  3. Sampling distributes 500k more log entries to all the journal files to be read, so that the total log entries queried for their field values will be 1M. This means that if we have to read 100 files, 10k log entries per file will be sampled and 10k log entries more will be unsampled. Since files are usually spread over time, this provides a good sample across time.
  4. When the sampling threshold is hit, Netdata continues reading more log entries without querying the values of the fields. These log entries appear as [unsampled] at the histogram. We know these log entries are there, but the value counters on the field filters do not include them.
  5. When the [unsampled] threshold is hit, and we have read more than 1% of each file, Netdata estimates the number of entries that will be read from the file and skips the rest of it. This estimation appears as [estimated] in the histogram.

The above process allows Netdata to provide a histogram of the logs in a timely manner, even when the number of log entries in the visible timeframe is several dozen million.
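The five steps can be read as a per-entry decision. The sketch below encodes that decision under stated assumptions: the function and field names are invented, the per-file budgets are passed in as parameters rather than derived, and the real plugin tracks more state than this. The thresholds (500k entries read in full, more than 1% of a file read before estimating) are the ones quoted in the list.

```python
from dataclasses import dataclass

FULL_READ_LIMIT = 500_000   # step 1: entries read with all their field values
MIN_FILE_FRACTION = 0.01    # step 5: "more than 1% of each file"


@dataclass
class FileProgress:
    """Hypothetical per-file counters kept while a query runs."""
    sampled: int = 0            # entries of this file read with field values
    unsampled: int = 0          # entries of this file read without field values
    fraction_read: float = 0.0  # share of this file scanned so far (0.0 to 1.0)


def next_action(entries_read: int, progress: FileProgress,
                sampled_budget: int, unsampled_budget: int) -> str:
    """Decide how the next log entry of a file is handled."""
    if entries_read < FULL_READ_LIMIT:
        return "full"        # steps 1-2: below the limit, read everything
    if progress.sampled < sampled_budget:
        return "sampled"     # step 3: still within this file's sample budget
    if progress.unsampled < unsampled_budget:
        return "unsampled"   # step 4: count the entry, skip its field values
    if progress.fraction_read > MIN_FILE_FRACTION:
        return "estimate"    # step 5: estimate the remainder and skip the file
    return "unsampled"       # too little of the file seen to estimate yet
```

Using the example from the list (100 files, 10k sampled and 10k unsampled entries per file), a file that has used both budgets and has been scanned past 1% would be estimated, which corresponds to the [estimated] portion of the histogram.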

A similar process is commonly used by log management systems, including Grafana Loki and Elasticsearch. However, Netdata takes a much bigger sample of the data (other systems usually sample only a few thousand log entries, while Netdata usually samples more than a million), and the histogram shows exactly which parts were sampled and which were estimated.

[Image: [unsampled] and [estimated] in the histogram of a systemd-journal system that collects about 10k nginx log entries per second]

Read more about journals query performance.

journals scan

On busy logs centralization servers, the number of journal files available in /var/log/journal/remote can grow significantly, slowing down directory listing (even ls -l is very slow on them).

To overcome this issue, Netdata now uses inotify events and sorts the files to be scanned from the latest to the oldest.

These changes allow Netdata to present the logs user interface for the most recent journals, immediately after a Netdata restart, while the journals database is scanned in the background.
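The ordering part of this change can be sketched in a few lines. This is an illustration only: it uses the file modification time as the notion of "latest" and a plain directory scan instead of inotify, both of which are assumptions, and the function name is invented.

```python
import os
from typing import List


def journal_files_newest_first(directory: str) -> List[str]:
    """Return the .journal files in a directory, most recently modified first."""
    entries = []
    with os.scandir(directory) as it:
        for entry in it:
            if entry.is_file() and entry.name.endswith(".journal"):
                entries.append((entry.stat().st_mtime, entry.path))
    # Newest first, so the most recent logs can be served while the
    # older files are still being scanned.
    entries.sort(reverse=True)
    return [path for _, path in entries]
```

Processing the returned list in order means a user interface backed by it could answer queries about recent logs before the whole directory has been indexed.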

Logs UI is now available when using Netdata docker images

We switched Netdata docker images from Alpine Linux to Debian, so that libsystemd will be available inside the docker image, allowing systemd-journal.plugin to be compiled and shipped with Netdata docker images.

Using Netdata docker images, Netdata can now query the host system journal files, while running inside the container.

MESSAGE_ID support

systemd-journal has a nice feature where certain events of common interest are given a specific MESSAGE_ID. Several such MESSAGE_IDs have been assigned to track common events, like coredumps, unit start/stop events, VM start/stop events, time changes, etc. In total, we found more than 50 unique events that are tracked this way.

This version of systemd-journal.plugin automatically tracks and annotates these MESSAGE_IDs using their names, allowing quick spotting of events of common interest.

This feature is available at the MESSAGE_ID field filter, at the right side of the dashboard.

log2journal, a new tool in your quiver for managing logs

log2journal is a new utility allowing the conversion of log files into structured systemd-journal log entries. This is currently in beta.

The utility allows processing logs like this:

tail -F /var/log/nginx/access.log |\
   log2journal -c nginx-combined |\
   systemd-cat-native

The above builds a basic pipeline for converting the access.log of an Nginx web server into structured log entries in the local systemd-journal.
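To make "structured log entries" concrete, here is a sketch of the kind of transformation such a pipeline performs on one line of nginx's combined log format. It is not log2journal's implementation or its output: the NGINX_ prefix, the field names, and both function names are assumptions, and the output is simply KEY=VALUE lines with a blank line closing each entry, in the style of the Journal Export Format.

```python
import re
from typing import Dict, Optional

# nginx "combined" format:
# $remote_addr - $remote_user [$time_local] "$request" $status
# $body_bytes_sent "$http_referer" "$http_user_agent"
COMBINED = re.compile(
    r'(?P<remote_addr>\S+) - (?P<remote_user>\S+) \[(?P<time_local>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<body_bytes_sent>\d+) '
    r'"(?P<http_referer>[^"]*)" "(?P<http_user_agent>[^"]*)"'
)


def to_journal_fields(line: str, prefix: str = "NGINX_") -> Optional[Dict[str, str]]:
    """Turn one access-log line into a dict of upper-case journal-style fields."""
    match = COMBINED.match(line)
    if match is None:
        return None  # not a combined-format line
    fields = {prefix + key.upper(): value for key, value in match.groupdict().items()}
    fields["MESSAGE"] = line  # keep the original text as the human-readable message
    return fields


def to_export_format(fields: Dict[str, str]) -> str:
    """Serialize fields as KEY=VALUE lines followed by a blank line.

    Only valid for values without newlines; multi-line values would need
    the binary form of the export format, which this sketch omits.
    """
    return "".join(f"{key}={value}\n" for key, value in fields.items()) + "\n"
```

Once every request is a set of named fields rather than a free-text line, fields such as the status code or the remote address become filterable in a journal viewer.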

Read more here.

[Image: structured nginx logs in systemd-journal]

Netdata now logs to systemd-journal

The logging layer of Netdata has been rewritten, so that Netdata logs now go to the systemd-journal, in a namespace called netdata.

The obvious outcome is that you can now monitor Netdata logs using Netdata's systemd-journal.plugin user interface and, thanks to journal namespaces, this does not pollute the system logs. But this is just the beginning...

Netdata utilizes the MESSAGE_ID feature of systemd-journal to register:

This means that the systemd-journal.plugin user interface, and journalctl can now be used to list all such events uniformly.

[Image: screenshot of Netdata alert transitions in systemd-journal]

All Netdata logs are now structured. Netdata can also log in json or logfmt formats. We introduced a lot of new fields to track every aspect of Netdata, in a uniform and consistent way. Read more here.

Furthermore, we introduced a new tool called systemd-cat-native allowing any application or shell script to send structured logs to systemd-journal. Read more here.

Functions, power up your troubleshooting toolkit!

Several new Functions have been added to help us in our troubleshooting journeys. On top of processes, streaming and systemd-journal, we are leveraging the wide range of collectors and metrics Netdata has to bring data in a different visual representation.

The updated list can be found in our documentation here. Below is a summary of the currently available functions, along with the CLI tools each one relates to:

| Function | Description | Alternative to CLI tools | plugin - module |
|---|---|---|---|
| block-devices | Disk I/O activity for all block devices, offering insights into both data transfer volume and operation performance. | iostat | proc |
| containers-vms | Insights into the resource utilization of containers and QEMU virtual machines: CPU usage, memory consumption, disk I/O, and network traffic. | docker stats, systemd-cgtop | cgroups |
| ipmi-sensors | Readings and status of IPMI sensors. | ipmi-sensors | freeipmi |
| mount-points | Disk usage for each mount point, including used and available space, both in terms of percentage and actual bytes, as well as used and available inode counts. | df | diskspace |
| network interfaces | Network traffic, packet drop rates, interface states, MTU, speed, and duplex mode for all network interfaces. | bmon, bwm-ng | proc |
| processes | Real-time information about the system's resource usage, including CPU utilization, memory consumption, and disk IO for every running process. | top, htop | apps |
| systemd-journal | Viewing, exploring and analyzing systemd journal logs. | journalctl | systemd-journal |
| systemd-list-units | Information about all systemd units, including their active state, description, whether or not they are enabled, and more. | systemctl list-units | systemd-journal |
| systemd-services | System resource utilization for all running systemd services: CPU, memory, and disk IO. | systemd-cgtop | cgroups |
| streaming | Comprehensive overview of all Netdata children instances, offering detailed information about their status, replication completion time, and many more. | | |

In the short term, we will keep adding more (hopefully) helpful Functions. In the longer term, we plan to expand this functionality to potentially allow taking and storing snapshots of the results, triggered either by alerts or by a periodic schedule.

If you have suggestions, we have a running GitHub Discussion open here.

New Alert Notification Integrations to Netdata Cloud

We've been working on adding more Alert Notification Integrations to Netdata Cloud and recently added the following new ones:

The full list of Alert Notification Integrations from Netdata Cloud can be found on our documentation here.

Acknowledgments

Contributions

Collectors

Improvements
  • Add more cases for megacli adapter degraded state (python.d/megacli) (#16522, @ClaraCrazy)
  • Improve estimations accuracy (systemd-journal.plugin) (#16467, @ktsaou)
  • Implement estimations (systemd-journal.plugin) (#16445, @ktsaou)
  • Improve startup time (systemd-journal.plugin) (#16443, @ktsaou)
  • Implement sampling (systemd-journal.plugin) (#16433, @ktsaou)
  • Add cgroup current pids metric (cgroups.plugin) (#16369, @ilyam8)
  • Add Ipmi-sensors function (freeipmi.plugin) (#16363, @ilyam8)
  • Add UPS status code metric (charts.d/apcupsd) (#16361, @thomasbeaudry)
  • Add Mount-points function (diskspace.plugin) (#16345, @ilyam8)
  • Add Block-devices function (proc/diskstats) (#16338, @ilyam8)
  • Add UsedBy field to Network-interfaces function (proc/proc_net_dev) (#16337, @ilyam8)
  • Add various improvements to Network-interfaces function (proc/proc_net_dev) (#16336, @ilyam8)
  • Add Network-interfaces function (proc/proc_net_dev) (#16334, @ilyam8)
  • Add Systemd-list-units function (systemd-journal.plugin) (#16318, @ktsaou)
  • Add Containers-vms function (cgroups.plugin) (#16314, @ktsaou)
  • Add UPS selftest status metric (charts.d/apcupsd) (#16286, @thomasbeaudry)
  • Add a configuration option to set private cleanup timeout (statsd.plugin) (#16269, @MrZammler)
  • Add container_device label to network interfaces (cgroups.plugin) (#16261, @ilyam8)
  • Add selecting multiple sources support (systemd-journal.plugin) (#16252, @ktsaou)
  • Add total LBAs written/read metrics (python.d/smartd_log) (#16245, @watsonbox)
  • Add Erlang to apps_groups.conf (apps.plugin) (#16231, @andyundso)
  • Add support for Proxmox vms/containers name resolution in Docker (cgroups.plugin) (#16193, @ilyam8)
  • Add nested JSON support to log parser (go.d/weblog) (#1416, @ilyam8)
Bug fixes

  • Fix configuration loading (charts.d.plugin) (#16471, @ilyam8)
  • Fix an issue where systemd-journal would stop trying different socket paths after the first failure (systemd-journal.plugin) (#16458, @ktsaou)
  • Fix parsing PD without NCQ status (python.d/adaptec_raid) (#16400, @ilyam8)
  • Fix Systemd-list-units function expiration time (#16393, @ilyam8)
  • Fix lack of system.net when running inside LXC (#16364, @ilyam8)
  • Fix memory leak in Systemd-list-units function (systemd-journal.plugin) (#16333, @ktsaou)
  • Fix server status parsing and add MAINT status chart (python.d/haproxy) (#16253, @seniorquico)
Other

  • Skip timestamp when logging to journald (python.d.plugin) (#16516, @ilyam8)
  • Mute stock jobs logging during check() (python.d.plugin) (#16515, @ilyam8)
  • Improve performance of the plugin (systemd-journal.plugin) (#16509, @ktsaou)
  • Don't create runtime disk config by default (proc/diskspace, proc/diskstats) (#16503, @ilyam8)
  • Don't create runtime device config by default (proc/proc_net_dev) (#16501, @ilyam8)
  • Disable netdata monitoring section by default (#16480, @MrZammler)
  • Change apps oom and net charts order (ebpf.plugin) (#16395, @thiagoftsm)
  • Fix "differ in signedness" warn in cgroups plugin (#16391, @ilyam8)
  • Fix throttle_duration chart context (cgroups.plugin) (#16367, @ilyam8)
  • Hide summary columns in network and block devices functions (proc/diskstats, proc/proc_net_dev) (#16347, @ktsaou)
  • Fix crash when a container has no CPU/mem metrics in Containers-vms function (cgroups.plugin) (#16331, @ilyam8)
  • Add tcp v6 connect calls to Ebpf_socket function (ebpf.plugin) (#16316, @thiagoftsm)
  • Update journal sources once per minute (systemd-journal.plugin) (#16298, @ktsaou)
  • Minor updates and cleanup (systemd-journal.plugin) (#16267, @ktsaou)
  • Stop using deprecated distutils module (python.d.plugin) (#16259, @MrZammler)
  • Remove charts.d/nut (#16230, @ilyam8)
  • Don't log an error opening cgroup.procs/tasks if it does not exist (cgroups.plugin) (#16196, @ilyam8)
  • Improve exposing metrics by creating a chart for each app group (ebpf.plugin) (#16139, @thiagoftsm)
  • Skip timestamp when logging to journald (go.d.plugin) (#1418, @ilyam8)
  • Replace logger with structured logger (go.d.plugin) (#1418, @ilyam8)
  • Use SHOW REPLICA STATUS for MySQL v8.0.22+ (go.d/mysql) (#1392, @vobruba-martin)
  • Use performance_schema instead of information_schema for MySQL v8.0.22+ (go.d/mysql) (#1390, @vobruba-martin)

Packaging/Installation

Documentation

Other Notable Changes

Improvements
Bug Fixes
Other

Deprecation notice

Changed in this release

In accordance with our previous deprecation notice, the following items in this release have been changed:

Other unannounced changes:

Will be changed in the next release

Netdata Release Meetup

Join the Netdata team on the 11th of December at 16:30 UTC for the Netdata Release Meetup.

Together we’ll cover:

RSVP now - we look forward to meeting you.

Support options

As we grow, we stay committed to providing the best support ever seen from an open-source solution. Should you encounter an issue with any of the changes made in this release or any feature in the Netdata Agent, feel free to contact us through one of the following channels:

1. integrations.js 4.24MB
2. netdata-aarch64-latest.gz.run 52.39MB
3. netdata-aarch64-v1.44.0.gz.run 52.39MB
4. netdata-armv7l-latest.gz.run 57.18MB
5. netdata-armv7l-v1.44.0.gz.run 57.18MB
6. netdata-latest.gz.run 61.2MB
7. netdata-latest.tar.gz 45.23MB
8. netdata-ppc64le-latest.gz.run 52.64MB
9. netdata-ppc64le-v1.44.0.gz.run 52.64MB
10. netdata-v1.44.0.gz.run 61.2MB
11. netdata-v1.44.0.tar.gz 45.23MB
12. netdata-x86_64-latest.gz.run 61.2MB
13. netdata-x86_64-v1.44.0.gz.run 61.2MB
14. sha256sums.txt 1.29KB
