Important

The Swarm Telemetry VM allows quick deployment of a single instance Prometheus/Grafana installation in combination with Swarm 15 and higher.

Environment Prerequisites

The following infrastructure services are needed in the customer environment: 

  • Swarm 15 solution stack installed and configured, with nodeExporter enabled and nodeExporterFrequency set to 120 (this number is in seconds; do not set the interval too short).

  • Enabled Swarm metrics:

    1. metrics.nodeExporterFrequency = 120

    2. metrics.enableNodeExporter = True

  • For deploying a telemetry solution, use the following software versions:

    1. Node_exporter 1.6.0

    2. Prometheus 2.45.0

    3. Grafana server 9.3.2

    4. Elasticsearch_exporter 1.5.0

  • DHCP server for first-time boot and configuration.

  • DNS server (recommended if you do not want to see IP addresses in your dashboards, but this can also be solved by configuring static entries in /etc/hosts on this VM). This step is optional.

Configuration

VMware Network Configuration

  1. Verify that the VM can reach the Swarm storage nodes directly via port 9100 before proceeding with configuring Grafana and Prometheus.

  2. By default, the VM uses a single NIC configured with DHCP.

  3. For a deployed dual-network SCS/Swarm configuration, select the appropriate "storage vlan" for the second virtual network card.

  4. Boot the VM and configure the second virtual network card inside the OS.

  5. Edit /etc/sysconfig/network-scripts/ifcfg-ens160 and modify/add the following to it:

    Code Block
    ONBOOT=yes
    # Match the netmask of your storage VLAN
    NETMASK=255.255.255.0
    # Storage VLAN IP, picked from the 3rd party range to avoid conflicts
    IPADDR=<Storage VLAN IP>
    BOOTPROTO=none
    # Usually, this is the gateway in the Swarm VLAN
    GATEWAY=<SCS IP>

  6. Type the following to enable it:

    Code Block
    ifdown ens160
    ifup ens160

  7. Verify the new IP is coming up correctly with "ip a".
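
The reachability check in step 1 can be run from the VM's shell before touching Prometheus. The following is a minimal sketch, assuming curl is installed and that the storage nodes serve metrics under /metrics (the Prometheus default scrape path) on port 9100. The function name check_exporter and the IP addresses in the usage comment are placeholders, not part of the product.

```shell
#!/usr/bin/env bash
# check_exporter HOST [PORT]: returns 0 if an HTTP GET of /metrics succeeds.
# Hypothetical helper; the /metrics path is an assumption (Prometheus default).
check_exporter() {
    local host=$1 port=${2:-9100}
    # -s: silent, -f: fail on HTTP errors, --max-time: give up after 5 seconds
    curl -sf --max-time 5 -o /dev/null "http://${host}:${port}/metrics"
}

# Example usage (replace with your own storage node IPs):
#   for node in 10.10.10.84 10.10.10.85 10.10.10.86; do
#       check_exporter "$node" && echo "$node ok" || echo "$node unreachable"
#   done
```

A failure here usually points at routing or VLAN selection on the second network card rather than at Prometheus itself.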

Important

Sometimes, CentOS 7 renames interfaces. If this happens, rename the matching /etc/sysconfig/network-scripts/ifcfg-xxxx files to the new name displayed by "ip a".

Also, update the "NAME" and "DEVICE" parameters inside the ifcfg-xxxx file.

The second network device is currently hardcoded to 172.29.x.x, so change it to fit the Swarm storage network.

Note

It is recommended to assign a static IP for the Swarm storage network facing NIC.

Time Synchronization

  1. Prometheus requires correct time synchronization for it to work and present data to Grafana.

  2. The following is already applied on the SwarmTelemetry VM, but is mentioned here in case you need to reapply it.

    Code Block
    timedatectl set-timezone UTC

  3. Edit /etc/chrony.conf and add server 172.29.0.3 iburst (set to your SCS IP), if missing.

    Code Block
    systemctl stop chronyd
    hwclock --systohc
    systemctl start chronyd
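
After chronyd is restarted, it can be useful to confirm that the clock is reported as synchronized. The sketch below is a hypothetical helper that inspects timedatectl output. It assumes the label is "NTP synchronized" on CentOS 7 and "System clock synchronized" on newer systemd releases, so it accepts both.

```shell
#!/usr/bin/env bash
# is_time_synced: reads `timedatectl` output on stdin and returns 0 if it
# reports a synchronized clock. Both label variants are accepted (assumption:
# "NTP synchronized" on CentOS 7, "System clock synchronized" on newer systemd).
is_time_synced() {
    grep -Eq '(NTP|System clock) synchronized: yes'
}

# Example usage on the VM:
#   timedatectl | is_time_synced && echo "clock is synchronized"
#   chronyc tracking    # shows the current offset from the configured server
```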

Prometheus Master Configuration

  1. Tell Prometheus to collect metrics from Swarm storage nodes.

  2. The /etc/prometheus/prometheus.yml file contains a list of Swarm nodes. Modify the following section:

    Code Block
    - job_name: 'swarm'
      scrape_interval: 30s
      static_configs:
        - targets: ['10.10.10.84:9100','10.10.10.85:9100','10.10.10.86:9100']

  3. Verify that targets are changed to match your Swarm storage node IPs.

Note

Host names can be used even in the absence of a DNS server. First, modify /etc/hosts with the desired names for each Swarm storage node and then use those names in the configuration file. This is recommended to avoid showing IP addresses on potentially public dashboards.

  4. If a Content Gateway is available in the deployment, add that Content Gateway to prometheus.yml as follows:

    Code Block
    - job_name: 'swarmcontentgateway'
      scrape_interval: 30s
      static_configs:
        - targets: ['10.10.10.20:9100','10.10.10.21:9100']
      relabel_configs:
        - source_labels: [__address__]
          regex: "([^:]+):\\d+"
          target_label: instance

Note

In case of multiple Content Gateways, add them to the targets list.

  5. Use a human-friendly job_name, as it is displayed in the gateway dashboard.

  6. Modify the swarmUI template in /etc/prometheus/alertmanager/template/basic-email.tmpl. This is used for the HTML email template, which shows a button to the chosen URL.

  7. Change the URL part of the definition to match your environment:

    Code Block
    {{ define "__swarmuiURL" }}https://172.30.10.222:91/_admin/storage/{{ end }}

  8. Modify the gateway job name in /etc/prometheus/alertmanager/alertmanager.yml. It must match what you chose in prometheus.yml.

    Code Block
    routes:
      - match:
          job: swarmcontentgateway

  9. Modify the gateway job name in /etc/prometheus/alert.rules.yml.

    Code Block
    - alert: gateway_down
      expr: up{job="swarmcontentgateway"} == 0
  10. To restart the service, type:

    Code Block
    systemctl restart prometheus

  11. To enable it for reboots, type:

    Code Block
    systemctl enable prometheus

  12. Open a browser and go to http://YourVMIP:9090/targets to test if Prometheus is up. This page shows which targets it is currently collecting metrics for and if they are reachable.

    Use the below command to test the same from a terminal:

    Code Block
    curl YOURVMIP:9090/api/v1/targets
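
The JSON returned by the targets endpoint is long, so a quick summary can help. The sketch below counts targets by their reported health using plain grep. The function name count_health is hypothetical, and a JSON-aware tool such as jq would be more robust if it is available on the VM.

```shell
#!/usr/bin/env bash
# count_health STATE: reads the JSON from /api/v1/targets on stdin and prints
# how many targets report the given health ("up", "down", or "unknown").
# Text-based sketch; it only looks at the "health" fields in the response.
count_health() {
    local state=$1
    grep -Eo '"health"[[:space:]]*:[[:space:]]*"[a-z]+"' \
        | grep -c "\"${state}\"\$" || true
}

# Example usage:
#   curl -s YOURVMIP:9090/api/v1/targets | count_health down
```

A non-zero count for "down" means at least one configured target is not being scraped successfully.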

Gateway Node Exporter Configuration

  1. Starting from Swarm 15.3, the gateway dashboard requires the node_exporter service to run on the gateways.

  2. Configure the systemd service to listen on port 9095 because the Gateway metrics component uses the default port 9100.

  3. Put the node_exporter golang binary in the /usr/local/bin directory. (Example: systemd config file for the node exporter)

    Code Block
    [Unit]
    Description=Node Exporter
    Wants=network-online.target
    After=network-online.target
    
    [Service]
    User=root
    Group=root
    Type=simple
    ExecStart=/usr/local/bin/node_exporter --web.listen-address=:9095 --collector.diskstats.ignored-devices=^(ram|loop|fd|(h|s|v|xv)d[a-z])\\d+$ --collector.filesystem.ignored-mount-points=^/(dev|proc|sys|var/lib/docker)($|/) --collector.filesystem.ignored-fs-types=^/(autofs|binfmt_misc|cgroup|configfs|debugfs|devpts|devtmpfs|fusectl|hugetlbfs|mqueue|overlay|proc|procfs|pstore|rpc_pipefs|securityfs|sysfs|tracefs)($|/) --collector.meminfo_numa --collector.ntp --collector.processes --collector.tcpstat --no-collector.nfs --no-collector.nfsd --no-collector.xfs --no-collector.zfs --no-collector.infiniband --no-collector.vmstat --no-collector.textfile --collector.conntrack --collector.qdisc --collector.netclass
    
    [Install]
    WantedBy=multi-user.target
    
  4. Enable and start the service.

    Code Block
    systemctl enable node_exporter
    systemctl start node_exporter

  5. Add a job definition for it in the Prometheus master configuration file.

Example:

Code Block
- job_name: 'gateway-node-exporter'
  scrape_interval: 30s
  static_configs:
    - targets: ['10.10.10.20:9095']
  relabel_configs:
    - source_labels: [__address__]
      regex: "([^:]+):\\d+"
      target_label: instance
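
The relabel_configs block used in the gateway jobs rewrites the instance label. As a rough illustration (this is sed, not Prometheus itself), the sketch below applies an equivalent expression: Prometheus anchors the regex at both ends and, with the default replacement of $1, keeps only the first capture group, so the port is dropped. The function name strip_port is hypothetical, and [0-9] stands in for \d because POSIX extended regular expressions have no \d.

```shell
#!/usr/bin/env bash
# strip_port ADDRESS: mimics the relabel rule regex "([^:]+):\d+" -> instance.
strip_port() {
    # ^...$ reproduces Prometheus' implicit anchoring; \1 plays the role of $1.
    printf '%s\n' "$1" | sed -E 's/^([^:]+):[0-9]+$/\1/'
}

strip_port '10.10.10.20:9095'   # prints 10.10.10.20
strip_port 'gateway1:9100'      # prints gateway1
```

An address without a port does not match and is printed unchanged, which mirrors Prometheus leaving the label untouched when the regex does not match.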

SCS Node Exporter Configuration

  1. Starting from Swarm 15.3, the SCS requires the node_exporter service to monitor partition capacity information which is exposed at the end of the Swarm Node View dashboard.

  2. Use the same systemd script as provided for the gateway, with the default listen port of 9100. SCS 1.5.1 has been modified to add a firewall rule for port 9100 on the Swarm storage network.

  3. Put the node_exporter golang binary in the /usr/local/bin directory.

  4. Enable and start the service.

    Code Block
    systemctl enable node_exporter
    systemctl start node_exporter

  5. Add a job definition for it in the Prometheus master configuration file.

Example:

Code Block
- job_name: 'scs-node-exporter'
  scrape_interval: 30s
  static_configs:
    - targets: ['10.10.10.2:9100']
  relabel_configs:
    - source_labels: [__address__]
      regex: "([^:]+):\\d+"
      target_label: instance

Elasticsearch Exporter Configuration

  1. The Swarm Search v7 Dashboard requires a new elasticsearch_exporter service that runs locally on the Telemetry VM.

  2. Modify the systemd script to specify the IP address of your Elasticsearch nodes.

  3. Modify /usr/lib/systemd/system/elasticsearch_exporter.service if the Elasticsearch node IP is different.

  4. The --uri needs to point at the IP address of one of your Elasticsearch nodes. It will auto-discover other nodes from the metrics.

  5. The new elasticsearch_exporter needs its own job and replaces the old way of scraping metrics from Elasticsearch nodes via plugins.

  6. Add the job if it is missing in /etc/prometheus/prometheus.yml using the following definition:

    Code Block
    - job_name: 'elasticsearch'
      scrape_interval: 30s
      static_configs:
        - targets: ['127.0.0.1:9114']
      relabel_configs:
        - source_labels: [__address__]
          regex: "([^:]+):\\d+"
          target_label: instance

  7. Verify the Elasticsearch exporter is running and configured to start on a reboot.

    Code Block
    systemctl enable elasticsearch_exporter
    systemctl start elasticsearch_exporter
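
Once the exporter is running, its own metrics page gives a quick view of cluster health. The sketch below is a hypothetical helper that extracts the current cluster color. It assumes the exporter exposes a gauge named elasticsearch_cluster_health_status with a "color" label, which is how the exporter is understood to report health; check the actual /metrics output on your VM before relying on it.

```shell
#!/usr/bin/env bash
# cluster_color: reads exporter metrics text on stdin and prints the color
# (green, yellow, or red) whose elasticsearch_cluster_health_status gauge is 1.
# Assumption: metric name and "color" label as exposed by elasticsearch_exporter.
cluster_color() {
    sed -nE 's/^elasticsearch_cluster_health_status[{].*color="([a-z]+)".*[}] 1$/\1/p'
}

# Example usage on the Telemetry VM (port 9114 as in the job definition):
#   curl -s http://127.0.0.1:9114/metrics | cluster_color
```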

Prometheus Retention Time

  1. By default, Prometheus keeps metrics for 15 days (this can be changed, for example to 30 days).

  2. To change the duration, follow the below instructions:

    1. Edit the /root/prometheus.service file and select your default retention time for the collected metrics.

      Tip

      30 days is more than enough for POCs and demos. Modify the --storage.tsdb.retention.time=30d flag to your new desired retention time.

    2. The rule of thumb is 600MB of disk space for 30 days per Swarm node. This VM comes with a 50 GB dedicated vmdk partition for Prometheus. (This means Swarm can handle up to 32 chassis for 30 days).

    3. If the retention time is modified, then commit the change:

      Code Block
      cp /root/prometheus.service /usr/lib/systemd/system
      systemctl daemon-reload
      promtool check config /etc/prometheus/prometheus.yml
      systemctl restart prometheus
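
The rule of thumb above can be turned into a quick estimate when choosing a retention time. The sketch below is a hypothetical helper that assumes disk usage scales linearly with both the number of nodes and the number of days, which is a simplification; actual usage depends on the number of time series each node exposes.

```shell
#!/usr/bin/env bash
# estimate_mb NODES DAYS: rough Prometheus disk estimate in MB, scaling the
# 600 MB per node per 30 days rule of thumb linearly (an assumption).
estimate_mb() {
    local nodes=$1 days=$2
    echo $(( nodes * 600 * days / 30 ))
}

estimate_mb 10 30   # prints 6000
estimate_mb 10 90   # prints 18000
```

Compare the result against the 50 GB partition before raising --storage.tsdb.retention.time.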

Prometheus Security

It may be desirable to restrict the Prometheus server to only allow queries from the local host since the grafana-server is running on the same VM. To do so, edit the prometheus.service file and add the flag --web.listen-address=127.0.0.1:9090.

Warning

If Prometheus binds only to localhost, the Prometheus built-in UI on port 9090 cannot be accessed remotely.

Grafana Configuration 

  1. Modify the /etc/grafana/grafana.ini file to set the IP address the server should listen on. By default, it binds to all local IPs on port 80.

  2. Review the admin_password parameter.

Note

The default admin password is "datacore" for Grafana.

  3. Grafana has several authentication options including google-auth / oAuth / ldap. The default option is basic http auth. See https://docs.grafana.org/ for more details.

  4. To start the service, type "service grafana-server start" or "systemctl start grafana-server".

  5. To enable it for reboots, type "systemctl enable grafana-server".

Alertmanager Configuration

  1. Four alerts are defined in /etc/prometheus/alert.rules.yml:

    1. Service_down: Triggered if any Swarm storage node is down for more than 30 minutes.

    2. Gateway_down: Triggered if the cloudgateway service is down for more than 2 minutes.

    3. Elasticsearch_cluster_state: Triggered if the cluster state is changed to "red" after 5 minutes.

    4. Swarm_volume_missing: Triggered if the reported drive count decreases within 10 minutes.

  2. The /etc/prometheus/prometheus.yml file contains a section that points to the alertmanager service on port 9093 as well as which alert.rules.yml file to use.

  3. The configuration for where to send alerts is defined in /etc/prometheus/alertmanager/alertmanager.yml.

  4. By default, the route is disabled as it requires manual input from the environment (smtp server, user, pass, etc.). An example of a working route to email alerts via Gmail:


Code Block
- name: 'swarmtelemetry' 
  email_configs: 
  - to: swarmtelemetry@gmail.com 
    from: swarmtelemetry@gmail.com 
    smarthost: smtp.gmail.com:587 
    auth_username: swarmtelemetry@gmail.com 
    auth_identity: swarmtelemetry@gmail.com 
    auth_password: YOURGMAILPASSWORD or APPPASSWORD
    send_resolved: true

Note

Configure the alertmanager for the swarmtelemetry and gatewaytelemetry routes. These are defined separately as they use their own custom email templates.

Warning

Prometheus alertmanager does not support SMTP NTLM authentication, hence, you cannot use it to send authenticated emails directly to Microsoft Exchange. Alternatively, configure the smarthost to connect to localhost:25 without authentication, where the default CentOS postfix server is running. It will know how to send the email to your corporate relay (auto-discovered via DNS). You will need to add require_tls: false to the email definition config section in alertmanager.yml.

Example configuration for a local SMTP relay in the enterprise environment.

Code Block
- name: 'emailchannel' 
  email_configs: 
  - to: admin@acme.com 
    from: swarmtelemetry@acme.com
    smarthost: smtp.acme.com:25 
    require_tls: false
    send_resolved: true

  5. Once the configuration is completed, restart the alertmanager.

    Code Block
    systemctl restart alertmanager

  6. Verify the alertmanager.yml has the correct syntax using the below command:

    Code Block
    amtool check-config /etc/prometheus/alertmanager/alertmanager.yml

    This returns the following output:

    Code Block
    Checking '/etc/prometheus/alertmanager/alertmanager.yml'  SUCCESS
    Found:
     - global config
     - route
     - 1 inhibit rules
     - 2 receivers
     - 1 templates
      SUCCESS

  7. Run the below command to show a list of active alerts.

    Code Block
    amtool alert

  8. Run the below command to show which alert route is enabled:

    Code Block
    amtool config routes show
    Routing tree:
    └── default-route  receiver: disabled

    Example Email Alert:

    [Image: example email alert]

  9. The easiest way to trigger an alert for testing purposes is to shut down one gateway.

Important

If you are aware of an alert and know that it will take several days or weeks to resolve, then silence alerts via the alert manager GUI on port 9093.

Dashboards on Grafana

Dashboard ID    Dashboard Name
16545           DataCore Swarm AlertManager v15
16546           DataCore Swarm Gateway v7
16547           DataCore Swarm Node View
16548           DataCore Swarm System Monitoring v15
17057           DataCore Swarm Search v7
19456           DataCore Swarm Health Processor v1

General Advice Around Defining New Alerts

  • Pages should be urgent, important, actionable, and real.

  • They represent either ongoing or imminent problems with your service.

  • Err on the side of removing noisy alerts; over-monitoring is a harder problem to solve than under-monitoring.

  • Classify the problem into one of the following:

    • Availability and basic functionality

    • Latency

    • Correctness (completeness, freshness, and durability of data)

    • Feature-specific problems

  • Symptoms are a better way to capture more problems comprehensively and robustly with less effort.

  • Include cause-based information in symptom-based pages or on dashboards, but avoid alerting directly on causes.

  • The further up your serving stack you go, the more distinct problems you catch in a single rule. But do not go so far that you cannot distinguish what is going on.

  • If you want a quiet on-call rotation, then use a system for dealing with things that need timely response, but are not imminently critical.