Swarm Telemetry Install Guide
- 1 Environment Prerequisites
- 2 Configuration
- 2.1 VMware Network Configuration
- 2.2 Time Synchronization
- 2.3 Prometheus Master Configuration
- 2.4 Gateway Node Exporter Configuration
- 2.5 SCS Node Exporter Configuration
- 2.6 Elasticsearch Exporter Configuration
- 2.7 Prometheus Retention Time
- 2.8 Prometheus Security
- 2.9 Grafana Configuration
- 2.10 Alertmanager Configuration
- 3 Dashboards on Grafana
- 4 General Advice Around Defining New Alerts
Important
The Swarm Telemetry VM allows quick deployment of a single instance Prometheus/Grafana installation in combination with Swarm 15 and higher.
Environment Prerequisites
The following infrastructure services are needed in the customer environment:
Swarm 15 solution stack installed and configured, with nodeExporter enabled and nodeExporterFrequency set to 120 (this value is in seconds; do not set it lower). The corresponding Swarm metrics settings are:
metrics.nodeExporterFrequency = 120
metrics.enableNodeExporter = True
For deploying a telemetry solution, use the following software versions:
Node_exporter 1.6.0
Prometheus 2.45.0
Grafana server 9.3.2
Elasticsearch_exporter 1.5.0
DHCP server for first-time boot and configuration.
DNS server (recommended if you do not want to see IP addresses in your dashboards; alternatively, configure static entries in /etc/hosts on this VM). This step is optional.
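If you use static entries instead of DNS, the /etc/hosts additions might look like this (the addresses and hostnames are placeholders; substitute your own):

```
# /etc/hosts on the Telemetry VM -- example entries only
172.29.1.11   swarm-node1
172.29.1.12   swarm-node2
172.29.1.13   swarm-node3
```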
Configuration
VMware Network Configuration
Verify that the VM can reach the Swarm storage nodes directly via port 9100 before proceeding with configuring Grafana and Prometheus.
By default, the VM uses a single NIC configured with DHCP.
For a deployed dual-network SCS/Swarm configuration, select the appropriate "storage vlan" for the second virtual network card.
Boot the VM and configure the second virtual network card inside the OS.
Edit /etc/sysconfig/network-scripts/ifcfg-ens160 and modify/add the following to it:
ONBOOT=yes
NETMASK=255.255.255.0 (match the netmask of your storage VLAN)
IPADDR=Storage VLAN IP (pick one from a third-party range to avoid conflicts)
BOOTPROTO=none
GATEWAY=SCS IP (usually, this is the gateway in the Swarm VLAN)
Type the following to enable it:
ifdown ens160
ifup ens160
Verify the new IP is coming up correctly with "ip a".
Important
Sometimes, CentOS 7 renames interfaces. If this happens, rename the matching /etc/sysconfig/network-scripts/ifcfg-xxxx file to the new name displayed by "ip a", and update the NAME and DEVICE parameters inside that file accordingly.
The second network device is currently hardcoded to 172.29.x.x, so change it to fit the Swarm storage network.
It is recommended to assign a static IP for the Swarm storage network facing nic.
Time Synchronization
Prometheus requires correct time synchronization to work and to present data to Grafana.
The following is already applied on the SwarmTelemetry VM, but is mentioned here in case you need to reapply it.
timedatectl set-timezone UTC
Edit /etc/chrony.conf and add server 172.29.0.3 iburst (set to your SCS IP), if missing.
Prometheus Master Configuration
Tell Prometheus to collect metrics from Swarm storage nodes.
Inside the /etc/prometheus/prometheus.yml file, modify the scrape job that lists the Swarm storage nodes.
Verify that the targets are changed to match your Swarm storage node IPs.
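A sketch of what that section can look like, under the existing scrape_configs section (the job name and addresses are examples, not values from your environment):

```yaml
scrape_configs:
  - job_name: 'swarm'              # example name
    scrape_interval: 120s          # matches metrics.nodeExporterFrequency = 120
    static_configs:
      - targets: ['172.29.1.11:9100', '172.29.1.12:9100', '172.29.1.13:9100']
```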
If a Content Gateway is available in the deployment, add that Content Gateway to prometheus.yml as follows:
Use a human-friendly job_name; it is displayed in the gateway dashboard.
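As a sketch (the job name and IP are examples; per this guide, the Gateway metrics component listens on the default port 9100):

```yaml
  - job_name: 'swarm-gateway'      # choose a human-friendly name
    scrape_interval: 30s
    static_configs:
      - targets: ['172.29.0.5:9100']   # example Content Gateway IP
```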
Modify the swarmUI template in /etc/prometheus/alertmanager/template/basic-email.tmpl. This template generates the HTML email, which shows a button linking to the chosen URL.
Change the URL to point to your Swarm UI.
Modify the gateway job name in /etc/prometheus/alertmanager/alertmanager.yml. It must match what you chose in prometheus.yml.
Modify the gateway job name in /etc/prometheus/alert.rules.yml.
To restart the service, type "systemctl restart prometheus".
To enable it across reboots, type "systemctl enable prometheus".
Open a browser and go to http://YourVMIP:9090/targets to test if Prometheus is up. This page shows which targets it is currently collecting metrics for and if they are reachable.
The same test can be run from a terminal, for example with "curl -s http://YourVMIP:9090/targets".
Gateway Node Exporter Configuration
Starting with Swarm 15.3, the gateway dashboard requires the node_exporter service to be running on the gateways.
Configure the systemd service to listen on port 9095, because the Gateway metrics component already uses the default port 9100.
Put the node_exporter Go binary in the /usr/local/bin directory.
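A sketch of such a systemd unit file, assuming the binary is at /usr/local/bin/node_exporter (the unit path and User are assumptions; adjust to your conventions):

```ini
# /etc/systemd/system/node_exporter.service -- sketch only
[Unit]
Description=Prometheus Node Exporter
After=network-online.target

[Service]
User=nobody
ExecStart=/usr/local/bin/node_exporter --web.listen-address=:9095

[Install]
WantedBy=multi-user.target
```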
Enable and configure the service.
Add a job definition for it in the Prometheus master configuration file.
Example:
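A sketch of the corresponding scrape job (the job name and IP are examples):

```yaml
  - job_name: 'gateway-nodeexporter'   # example name
    scrape_interval: 30s
    static_configs:
      - targets: ['172.29.0.5:9095']   # gateway IP, node_exporter port chosen above
```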
SCS Node Exporter Configuration
Starting with Swarm 15.3, the SCS requires the node_exporter service to monitor partition capacity information, which is shown at the end of the Swarm Node View dashboard.
Use the same systemd unit as for the gateway, but with the default listen port of 9100. SCS 1.5.1 adds a firewall rule for port 9100 on the Swarm storage network.
Put the node_exporter golang binary in the /usr/local/bin directory.
Enable and configure the service.
Add a job definition for it in the Prometheus master configuration file.
Example:
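A sketch of the corresponding scrape job (the job name is an example; the IP should be your SCS address on the Swarm storage network):

```yaml
  - job_name: 'scs-nodeexporter'   # example name
    scrape_interval: 30s
    static_configs:
      - targets: ['172.29.0.3:9100']   # example SCS IP
```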
Elasticsearch Exporter Configuration
The Swarm Search v7 Dashboard requires a new elasticsearch_exporter service that runs locally on the Telemetry VM.
Modify the systemd unit to point it at the IP address of your Elasticsearch nodes.
Edit /usr/lib/systemd/system/elasticsearch_exporter.service if the Elasticsearch node IP is different.
The --uri flag needs to point at the IP address of one of your Elasticsearch nodes; the exporter auto-discovers the other nodes from the cluster metrics.
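As a sketch (the IP is a placeholder; note that in recent elasticsearch_exporter releases the flag is spelled --es.uri):

```ini
# Relevant lines of /usr/lib/systemd/system/elasticsearch_exporter.service -- sketch only
[Service]
# Point the URI at one of your Elasticsearch nodes (example IP shown)
ExecStart=/usr/local/bin/elasticsearch_exporter --es.uri=http://172.29.0.20:9200
```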
The new elasticsearch_exporter needs its own scrape job; it replaces the old method of scraping metrics from Elasticsearch nodes via plugins.
If the job is missing, add it to /etc/prometheus/prometheus.yml:
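A sketch of the job, assuming the exporter runs locally on its default port 9114 (the job name is an example):

```yaml
  - job_name: 'elasticsearch'      # example name
    scrape_interval: 30s
    static_configs:
      - targets: ['localhost:9114']    # elasticsearch_exporter default listen port
```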
Verify the Elasticsearch exporter is running and configured to start on a reboot.
Prometheus Retention Time
By default, Prometheus keeps metrics for 15 days. To change how long metrics are kept, follow the instructions below:
Edit the /root/prometheus.service file and set the desired retention time for the collected metrics via the --storage.tsdb.retention.time flag (for example, --storage.tsdb.retention.time=30d; 30 days is sufficient for POCs and demos).
The rule of thumb is 600 MB of disk space per Swarm node for 30 days of retention. This VM comes with a dedicated 50 GB vmdk partition for Prometheus, which can handle up to 32 chassis for 30 days.
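The sizing rule can be sanity-checked with quick shell arithmetic using the numbers above:

```shell
# ~600 MB per Swarm node per 30 days of retention
nodes=32
per_node_mb=600
total_mb=$((nodes * per_node_mb))
echo "${total_mb} MB"    # prints "19200 MB", i.e. ~19 GB of the 50 GB partition
```

This leaves ample headroom on the 50 GB partition for TSDB overhead and growth.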
If the retention time is modified, commit the change (for example, with "systemctl daemon-reload" followed by "systemctl restart prometheus"):
Prometheus Security
It may be desirable to restrict the Prometheus server to allow queries only from the local host, since grafana-server runs on the same VM. To do so, edit the prometheus.service file and add the flag --web.listen-address=127.0.0.1:9090.
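The relevant part of the unit file might then look like this (the binary path is an assumption; keep your existing flags and add the listen address):

```ini
# Sketch of the relevant lines in prometheus.service
[Service]
ExecStart=/usr/local/bin/prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.retention.time=30d \
  --web.listen-address=127.0.0.1:9090
```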
Grafana Configuration
Modify the /etc/grafana/grafana.ini file to set the IP address and port the server should listen on. By default, it binds to all local IPs on port 80.
Review the admin_password parameter.
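The relevant grafana.ini settings look roughly like this (the values shown are examples):

```ini
# /etc/grafana/grafana.ini -- sketch only
[server]
# leaving http_addr empty binds to all interfaces
http_addr =
http_port = 80

[security]
admin_password = <choose-a-strong-password>
```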
Grafana has several authentication options, including Google OAuth and LDAP; the default is basic HTTP auth. See https://docs.grafana.org/ for more details.
To start the service, type "service grafana-server start" or "systemctl start grafana-server".
To enable it for reboots, type "systemctl enable grafana-server".
Alertmanager Configuration
Four alerts are defined in /etc/prometheus/alert.rules.yml:
Service_down: Triggered if any Swarm storage node is down for more than 30 minutes.
Gateway_down: Triggered if the cloudgateway service is down for more than 2 minutes.
Elasticsearch_cluster_state: Triggered if the cluster state is changed to "red" after 5 minutes.
Swarm_volume_missing: Triggered if the reported drive count decreases within 10 minutes.
The /etc/prometheus/prometheus.yml file contains a section that points to the alertmanager service on port 9093, as well as which alert.rules.yml file to use.
Where alerts are sent is defined in /etc/prometheus/alertmanager/alertmanager.yml.
By default, the route is disabled because it requires environment-specific input (SMTP server, user, password, etc.). An example of a working route that emails alerts via Gmail:
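A sketch of such a route (all addresses and the password are placeholders; Gmail requires an app password for SMTP authentication):

```yaml
# alertmanager.yml -- Gmail example, sketch only
global:
  smtp_smarthost: 'smtp.gmail.com:587'
  smtp_from: 'swarmtelemetry@example.com'
  smtp_auth_username: 'swarmtelemetry@example.com'
  smtp_auth_password: '<google-app-password>'
route:
  receiver: 'email-notice'
receivers:
  - name: 'email-notice'
    email_configs:
      - to: 'storage-admins@example.com'
```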
Example configuration for a local SMTP relay in the enterprise environment.
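A sketch for a local relay (the relay host and addresses are placeholders; internal relays often accept unauthenticated mail on port 25 without TLS):

```yaml
# alertmanager.yml -- local SMTP relay example, sketch only
global:
  smtp_smarthost: 'smtprelay.example.local:25'
  smtp_from: 'swarmtelemetry@example.local'
  smtp_require_tls: false
route:
  receiver: 'email-notice'
receivers:
  - name: 'email-notice'
    email_configs:
      - to: 'storage-admins@example.local'
```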
Once the configuration is completed, restart the alertmanager.
Verify that alertmanager.yml has the correct syntax with "amtool check-config /etc/prometheus/alertmanager/alertmanager.yml" (amtool ships with Alertmanager). A successful check reports SUCCESS along with a summary of the routes, templates, and receivers found.
To show a list of active alerts, query the alertmanager with amtool, for example "amtool --alertmanager.url=http://localhost:9093 alert".
To show which alert route is enabled, run "amtool --alertmanager.url=http://localhost:9093 config routes show".
Example Email Alert:
The easiest way to trigger an alert for testing purposes is to shut down one gateway.
Dashboards on Grafana
Dashboard ID | Dashboard Name |
---|---|
16545 | DataCore Swarm AlertManager v15 |
16546 | DataCore Swarm Gateway v7 |
16547 | DataCore Swarm Node View |
16548 | DataCore Swarm System Monitoring v15 |
17057 | DataCore Swarm Search v7 |
19456 | DataCore Swarm Health Processor v1 |
© DataCore Software Corporation. · https://www.datacore.com · All rights reserved.