Logs and metrics made easy with OVH

For several years I have been running a mail server for myself and my family, plus some other services. The lack of monitoring had never really been an issue until recently.

After migrating to a new VPS, the clamd (antivirus) service would regularly be killed by the kernel. Nothing in the logs indicated the cause of the issue; I needed to know what resources were available in the minutes before it occurred. I knew Nagios by name, and through my work I am familiar with Kibana/Elasticsearch and Grafana. Those solutions would work, but they require a significant amount of time to set up and possibly more resources than I have.

After some research I discovered that my provider, OVH, offers all the tools needed to ingest and visualise logs and metrics (https://www.ovh.com/fr/data-platforms/, in French only; the metrics platform is not available on the UK website). It is a state-of-the-art stack, similar to what I use at HMRC. Not only is it available, it is also affordable for personal use, at around 2€/month, with no additional software to maintain on the server: you send your logs/metrics and OVH does everything for you. The platform is not limited to servers hosted at OVH; it can also receive your logs and metrics from outside their network.

Logs

Logs are paid per use, starting at 0.99€/month, with no fixed monthly fee. There are also several options, such as Kibana, that are out of scope for a personal use case (i.e. keeping things as cheap as possible).

Creating the account is free, and two locations are available at the time of writing: Canada and France. Log streams can then be purchased individually within the newly created account.

Once the account is created you need to set a password and create a data stream. The websocket option lets you watch the logs in real time; since I have not noticed any significant ingestion delay so far, its use is rather limited. The indexation option gives you access to Graylog; I am not sure what use a data stream would be without it. The retention setting really matters if you want a fixed monthly fee, as it cannot be changed afterwards: adjust the retention time so that all the logs fit within the maximum storage you are willing to pay for. I also chose to pause indexation when the stream reaches its maximum storage, to avoid any surprises.

The Logs Data Platform supports several log formats; the list below is taken from their website (https://docs.ovh.com/gb/en/logs-data-platform/quick-start):

  • GELF: this is the native log format used by Graylog. This JSON format allows you to send logs really easily. See: http://docs.graylog.org/en/latest/pages/gelf.html. The GELF input only accepts a null (\0) delimiter; a minimal sending example is shown after this list.
  • LTSV: this simple format is very efficient and still human readable. You can learn more about it on the LTSV website. LTSV has two inputs that accept either a line delimiter or a null delimiter.
  • RFC 5424: this format is commonly used by log utilities such as syslog. It is extensible enough to let you send all your data. More information about it can be found in RFC 5424.
  • Cap'n Proto: the most efficient log format. This is a binary format that allows you to maintain a low footprint and high speed. For more information, check out the official website: Cap'n Proto.
  • Beats: a secure and reliable protocol used by the Beats family in the Elasticsearch ecosystem (e.g. Filebeat, Metricbeat, Winlogbeat).
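
To illustrate the GELF input, a single message can be pushed by hand. This is a minimal sketch, not OVH's official example: the entry point, port and stream token are placeholders to replace with the values shown for your stream in the OVH manager, and I am assuming the token is passed as the additional field _X-OVH-TOKEN.

# send a single GELF message over TLS, terminated by the mandatory null byte
printf '{"version":"1.1","host":"myvps","short_message":"hello from GELF","level":6,"_X-OVH-TOKEN":"<stream-token>"}\0' \
    | openssl s_client -quiet -connect <ldp-entry-point>:<gelf-tls-port>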

I could not find a way to make journald, the default logging service on Fedora, forward logs to the OVH Logs Data Platform. I therefore installed syslog-ng and used the RFC 5424 protocol, which is what OVH describes in their guide, making my life easier (https://docs.ovh.com/gb/en/logs-data-platform/how-to-log-your-linux/). One issue I did find is that the default time resolution is one second, which is far from enough to get a reliable timeline of your logs. I added the option frac-digits(6); to the options section of the syslog-ng configuration file to get sub-second resolution.

# /etc/syslog-ng/syslog-ng.conf
options {
    flush_lines (0);
    time_reopen (10);
    log_fifo_size (1000);
    chain_hostnames (off);
    use_dns (no);
    use_fqdn (no);
    create_dirs (no);
    keep_hostname (yes);
    frac-digits(6);
};

System logs are not the only logs I was interested in. I am running nginx to host this blog, my genealogy site, a Seafile instance and a webmail. nginx can push logs to syslog-ng over a socket, configured as follows in syslog-ng:

# /etc/syslog-ng/conf.d/listen.conf
source s_network {
    network(
        ip("127.0.0.1")
        transport("udp")
    );
};

The s_network source is then routed to the correct destination, with ovhPaaSLogs defined as in the OVH guide (https://docs.ovh.com/gb/en/logs-data-platform/how-to-log-your-linux/):

log {
     source(s_network);
     destination(ovhPaaSLogs);
};
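
For reference, this is roughly what the ovhPaaSLogs destination looks like. It is only a sketch: the entry point, port and especially the exact RFC 5424 template carrying the X-OVH-TOKEN come from the OVH guide linked above and from your stream's settings, so treat every value below as a placeholder.

# /etc/syslog-ng/conf.d/ovh.conf -- placeholder values, see the OVH guide
destination ovhPaaSLogs {
    network("<ldp-entry-point>"        # the entry point shown for your stream
        port(6514)                     # RFC 5424 over TLS
        transport("tls")
        tls(peer-verify(required-trusted) ca-dir("/etc/ssl/certs"))
        # the stream token travels in the RFC 5424 structured data
        template("<${LEVEL_NUM}>1 ${ISODATE} ${HOST} ${PROGRAM} ${PID} - [X-OVH-TOKEN=\"<stream-token>\"] ${MSG}\n")
    );
};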

In nginx I then use access_log syslog:server=127.0.0.1 main; and tweaked the log format to include $http_host:

    log_format  main  '$remote_addr - $remote_user [$time_local] ($http_host) "$request" '
                      '$status $body_bytes_sent "$http_referer" '
                      '"$http_user_agent" "$http_x_forwarded_for"';
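
With this format, a (made-up) access log line looks as follows; the ($http_host) field is what makes it possible to tell the different virtual hosts apart:

203.0.113.7 - - [12/Mar/2021:10:15:32 +0000] (blog.example.org) "GET /index.html HTTP/1.1" 200 5124 "-" "Mozilla/5.0" "-"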

The nginx logs are now available to search in Graylog.

Metrics

Like logs, metrics can also be sent to OVH. However, this offering is only available on the French version of their website (https://www.ovh.com/fr/data-platforms/metrics/). Again, you are not limited to a single protocol; a large variety is available.

In the case of metrics, my choice maps directly to what I use at work: Graphite for ingestion and Grafana for visualisation.

As with logs, the service is pay-as-you-go, starting with a free trial limited to 10 series, one point per 5-minute period and one month of retention. This is of really limited use, so I quickly chose their first paid option: 100 series with one year of retention at 0.99€/month. It also has a limit of 288,000 points per day; with 100 series, that is one data point per series every 30 seconds (86400 s / 30 s = 2880 points per series per day, times 100 series = 288,000).

To gather and push the metrics I chose collectd (https://collectd.org/). It is available in Fedora's repositories, and a large number of plugins is already available to collect CPU usage, load, I/O, memory, nginx, MySQL and so on; a minimal configuration excerpt is shown below. Some plugins, like MySQL, create a large number of series (about 60 per database), so I cannot use them without moving to a larger, more expensive plan.
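
As an illustration, a plugin selection staying within the 100-series budget could look like this excerpt; it is a sketch, with the 30-second interval matching the one point per 30 s computed above:

# /etc/collectd.conf (excerpt)
Interval 30            # one data point per series every 30 seconds

LoadPlugin cpu         # per-state CPU usage
LoadPlugin load        # 1/5/15 minute load averages
LoadPlugin memory      # used/free/cached memory
LoadPlugin swap
LoadPlugin df          # filesystem usage
LoadPlugin interface   # network traffic per interface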

To push the metrics to Graphite, the write_graphite plugin is configured as follows:

LoadPlugin write_graphite
<Plugin write_graphite>
        <Node "graphite">
                Host "graphite.gra1.metrics.ovh.net"
                Port "2003"
                Protocol "tcp"
                LogSendErrors true
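                # OVHTOKEN stands for the write token of the metrics
                # account, prepended to every metric name sent to OVH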
                Prefix "OVHTOKEN@."
                Postfix ""
                StoreRates true
                AlwaysAppendDS false
                EscapeCharacter "_"
                ReconnectInterval 120
        </Node>
</Plugin>
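
Before wiring up collectd, ingestion can be checked by pushing a single point by hand with Graphite's plaintext protocol. A sketch, with OVHTOKEN again standing for the write token used in the Prefix above:

# push one test value; the point should show up as test.value
printf 'OVHTOKEN@.test.value 42 %s\n' "$(date +%s)" \
    | ncat graphite.gra1.metrics.ovh.net 2003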

The platform is not restricted to metrics sent from within OVH: I am also sending metrics from my media centre at home using the same collectd setup. The ReconnectInterval option was added because metrics would stop being received after a while (https://collectd.org/documentation/manpages/collectd.conf.5.shtml#plugin_write_graphite).

After configuring the Graphite data source in the Grafana instance provided by OVH and building a few visualisations, the end result is finally here.

Using Elasticsearch as a data source in Grafana

It is possible to use Elasticsearch as a data source in Grafana, and the Elasticsearch instance from OVH is no exception. It is however necessary to create an alias or an index beforehand. An index being a paid option, I created an alias and added the data stream to that alias.

The Elasticsearch access details contain the information needed to add it as a source: the first part is the URL to use and the second part is the name of the index.

Below is an example querying the nginx logs from Grafana, using an Elasticsearch query as an annotation (OOM kills).
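
The annotation itself is nothing more than a Lucene query against the indexed system logs, something along these lines (the exact wording of the kernel message depends on the kernel version):

message:"Out of memory: Killed process"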

Alerting

Given the amount of data now available, it is possible to make good use of it by setting up some alerting. The solution built into OVH is rather limited: it is possible to trigger an alert based on logs, but not on metrics. An alert can be based on the number of messages, the result of an aggregation on a field, or the content of a field (https://docs.ovh.com/gb/en/logs-data-platform/alerting/#configuring-a-message-count-alert-condition_1). The alert, however, is only sent to the account email address, which cannot be changed; it is not possible to push the alert to a system like PagerDuty.

To interact with PagerDuty or a similar service, you need to install ElastAlert (https://docs.ovh.com/gb/en/logs-data-platform/elastalert/).
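
As an illustration, a minimal ElastAlert rule pushing OOM kills to PagerDuty could look like the sketch below. The alias name, query and service key are placeholders, and the Elasticsearch connection settings live in ElastAlert's own configuration file as described in the OVH guide.

# rules/oom_kill.yaml -- minimal sketch, all values are placeholders
name: OOM kill on the VPS
type: frequency            # fire when num_events occur within timeframe
index: my_alias            # the alias created earlier
num_events: 1
timeframe:
  minutes: 5
filter:
- query_string:
    query: 'message:"Out of memory: Killed process"'
alert:
- pagerduty
pagerduty_service_key: "<pagerduty-integration-key>"
pagerduty_client_name: "ElastAlert"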

Conclusion

The goal of collecting these logs and metrics was to help me find why clamd was regularly killed. It did help: I quickly noticed that every time freshclam updated the virus database, the subsequent reload of clamd caused a sharp increase in memory consumption.

The reason is that clamd, to avoid a brief period of unavailability, first loads the new database while keeping the old one in use. In my case this requires about 2 GB of memory during the reload, which is too much. See https://blog.clamav.net/2020/09/clamav-01030-released.html for reference. Setting the ConcurrentDatabaseReload option to no fixes the problem by restoring the old behaviour.
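
On Fedora this is a one-line change in the clamd configuration (the path may differ on other distributions):

# /etc/clamd.d/scan.conf
# load the new database in place instead of keeping two full copies in memory
ConcurrentDatabaseReload no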
