Using Grafana with Prometheus on wolf
Reference the README.md in the DAOS source tree: https://github.com/daos-stack/daos/tree/master/utils/grafana
Use the DAOS admin node to install and run Grafana.
In order to run Grafana against DAOS, the DAOS servers must have the telemetry port defined in daos_server.yml (9191); see the excerpt below.
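A minimal daos_server.yml excerpt showing the setting (the telemetry_port key name is taken from the DAOS admin guide; verify it against your DAOS version):

    # daos_server.yml (excerpt) -- expose engine telemetry for Prometheus to scrape
    telemetry_port: 9191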
Install Prometheus on the wolf DAOS admin node
    mkdir ~/prometheus
    dmg telemetry config -i /home/mjean/prometheus
    Downloading and installing Prometheus...
    fetching prometheus/prometheus v2.28.1
    Installed prometheus to /home/mjean/prometheus/prometheus
    Configuring Prometheus for DAOS monitoring...
    Wrote DAOS monitoring config to /home/mjean/.prometheus.yml
To collect data from the server nodes, they will need to be added to ~/.prometheus.yml.
For example, change

    - targets:
      - localhost:9191

to

    - targets:
      - wolf-118:9191
      - wolf-119:9191
      - wolf-120:9191
      - wolf-121:9191
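For reference, that target list sits inside a standard Prometheus scrape_configs block; the sketch below assumes the job name and scrape interval (check the file that dmg telemetry config actually generated):

    # ~/.prometheus.yml (excerpt) -- assumed layout; verify against the generated file
    scrape_configs:
      - job_name: daos        # assumed job name
        scrape_interval: 5s   # assumed interval
        static_configs:
          - targets:
            - wolf-118:9191
            - wolf-119:9191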
(mjmac) It's worth noting here that if you add the hosts to your ~/.daos_control.yml, or to some other control config file that you use with dmg -o /path/to/config.yml, they will automatically be added to your Prometheus configuration.
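For illustration, a ~/.daos_control.yml excerpt with the hosts listed (hostlist is the standard dmg control config key; the host names are this cluster's examples):

    # ~/.daos_control.yml (excerpt)
    hostlist:
      - wolf-118
      - wolf-119
      - wolf-120
      - wolf-121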
Starting Prometheus
Due to DAOS-8104, Prometheus should be started manually with:
    cd ~/prometheus
    ./prometheus --config.file=$HOME/.prometheus.yml
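To confirm it came up, Prometheus's standard readiness endpoint can be polled (9090 is its default web port):

    # returns a short "ready" message once startup completes
    curl -s http://localhost:9090/-/ready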
Only use the command below after DAOS-8104 is fixed:
    dmg telemetry run -i /home/mjean/prometheus
    Downloading and installing Prometheus...
    fetching prometheus/prometheus v2.28.1
    Installed prometheus to /home/mjean/prometheus/prometheus
    Configuring Prometheus for DAOS monitoring...
    Wrote DAOS monitoring config to /home/mjean/.prometheus.yml
    Starting Prometheus monitoring...
    level=info ts=2021-07-16T12:59:06.771Z caller=main.go:389 msg="No time or size retention was set so using the default time retention" duration=15d
    level=info ts=2021-07-16T12:59:06.771Z caller=main.go:443 msg="Starting Prometheus" version="(version=2.28.1, branch=HEAD, revision=b0944590a1c9a6b35dc5a696869f75f422b107a1)"
    level=info ts=2021-07-16T12:59:06.771Z caller=main.go:448 build_context="(go=go1.16.5, user=root@2915dd495090, date=20210701-15:20:10)"
    level=info ts=2021-07-16T12:59:06.771Z caller=main.go:449 host_details="(Linux 3.10.0-1160.24.1.el7.x86_64 #1 SMP Thu Apr 8 19:51:47 UTC 2021 x86_64 wolf-80.wolf.hpdd.intel.com wolf.hpdd.intel.com)"
    level=info ts=2021-07-16T12:59:06.771Z caller=main.go:450 fd_limits="(soft=1048576, hard=1048576)"
    level=info ts=2021-07-16T12:59:06.771Z caller=main.go:451 vm_limits="(soft=unlimited, hard=unlimited)"
    level=info ts=2021-07-16T12:59:06.775Z caller=web.go:541 component=web msg="Start listening for connections" address=0.0.0.0:9090
    level=info ts=2021-07-16T12:59:06.776Z caller=main.go:824 msg="Starting TSDB ..."
    level=info ts=2021-07-16T12:59:06.777Z caller=tls_config.go:191 component=web msg="TLS is disabled." http2=false
    level=info ts=2021-07-16T12:59:06.778Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1626298627558 maxt=1626350400000 ulid=01FANHRSJ125WPM5XQP3JN7548
    level=info ts=2021-07-16T12:59:06.779Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1626415202422 maxt=1626422400000 ulid=01FAQ8PQVY9T7Q6YHB3Q6T2N3H
    level=info ts=2021-07-16T12:59:06.780Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1626350400000 maxt=1626415200000 ulid=01FAQ8PR2JC5TWCEXPH51DKRNA
    level=info ts=2021-07-16T12:59:06.781Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1626422402421 maxt=1626429600000 ulid=01FAQFJF3YM6BH9GX6DM87M53F
    level=info ts=2021-07-16T12:59:06.805Z caller=head.go:780 component=tsdb msg="Replaying on-disk memory mappable chunks if any"
    level=info ts=2021-07-16T12:59:06.805Z caller=head.go:794 component=tsdb msg="On-disk memory mappable chunks replay completed" duration=49.128µs
    level=info ts=2021-07-16T12:59:06.805Z caller=head.go:800 component=tsdb msg="Replaying WAL, this may take a while"
    level=info ts=2021-07-16T12:59:06.809Z caller=head.go:826 component=tsdb msg="WAL checkpoint loaded"
    level=info ts=2021-07-16T12:59:06.888Z caller=head.go:854 component=tsdb msg="WAL segment loaded" segment=18 maxSegment=23
    level=info ts=2021-07-16T12:59:06.971Z caller=head.go:854 component=tsdb msg="WAL segment loaded" segment=19 maxSegment=23
    level=info ts=2021-07-16T12:59:07.047Z caller=head.go:854 component=tsdb msg="WAL segment loaded" segment=20 maxSegment=23
    level=info ts=2021-07-16T12:59:07.108Z caller=head.go:854 component=tsdb msg="WAL segment loaded" segment=21 maxSegment=23
    level=info ts=2021-07-16T12:59:07.119Z caller=head.go:854 component=tsdb msg="WAL segment loaded" segment=22 maxSegment=23
    level=info ts=2021-07-16T12:59:07.120Z caller=head.go:854 component=tsdb msg="WAL segment loaded" segment=23 maxSegment=23
    level=info ts=2021-07-16T12:59:07.120Z caller=head.go:860 component=tsdb msg="WAL replay completed" checkpoint_replay_duration=3.35109ms wal_replay_duration=311.254676ms total_replay_duration=314.690735ms
    level=warn ts=2021-07-16T12:59:07.122Z caller=main.go:849 fs_type=NFS_SUPER_MAGIC msg="This filesystem is not supported and may lead to data corruption and data loss. Please carefully read https://prometheus.io/docs/prometheus/latest/storage/ to learn more about supported filesystems."
    level=info ts=2021-07-16T12:59:07.122Z caller=main.go:854 msg="TSDB started"
    level=info ts=2021-07-16T12:59:07.122Z caller=main.go:981 msg="Loading configuration file" filename=/home/mjean/.prometheus.yml
    level=info ts=2021-07-16T12:59:07.124Z caller=main.go:1012 msg="Completed loading of configuration file" filename=/home/mjean/.prometheus.yml totalDuration=2.234767ms remote_storage=5.111µs web_handler=456ns query_engine=1.803µs scrape=1.283008ms scrape_sd=36.14µs notify=1.113µs notify_sd=2.39µs rules=2.893µs
    level=info ts=2021-07-16T12:59:07.124Z caller=main.go:796 msg="Server is ready to receive web requests."
    level=info ts=2021-07-16T13:00:02.439Z caller=compact.go:518 component=tsdb msg="write block" mint=1626429600000 maxt=1626436800000 ulid=01FAQPE1FGXW6QWCK6C40X2GQG duration=22.37061ms
    level=info ts=2021-07-16T13:00:02.450Z caller=head.go:967 component=tsdb msg="Head GC completed" duration=2.069077ms
    level=info ts=2021-07-16T13:00:02.452Z caller=checkpoint.go:97 component=tsdb msg="Creating checkpoint" from_segment=18 to_segment=20 mint=1626436800000
    level=info ts=2021-07-16T13:00:02.464Z caller=head.go:1064 component=tsdb msg="WAL checkpoint complete" first=18 last=20 duration=13.152763ms
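Once it is running, you can confirm that Prometheus is actually scraping the server nodes through its standard targets API (jq is assumed to be installed):

    # list each scrape target and its health; all DAOS servers should show "up"
    curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | {scrapeUrl, health}'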
Install Grafana on the wolf DAOS admin node
https://grafana.com/docs/grafana/latest/installation/
Download the package from the Grafana web site.
Downloads: https://grafana.com/grafana/download

For CentOS:

    wget https://dl.grafana.com/oss/release/grafana-8.0.6-1.x86_64.rpm
    sudo yum install grafana-8.0.6-1.x86_64.rpm

For SLES:

    wget https://dl.grafana.com/oss/release/grafana-8.0.6-1.x86_64.rpm
    sudo rpm -i --nodeps grafana-8.0.6-1.x86_64.rpm
After it is installed, start the Grafana service on the DAOS admin node:
    sudo systemctl daemon-reload
    sudo systemctl start grafana-server
    sudo systemctl status grafana-server

Configure the Grafana server to start at boot:

    sudo systemctl enable grafana-server
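As a quick sanity check that the service is up, Grafana's standard health endpoint can be queried (3000 is Grafana's default HTTP port):

    # should return a small JSON payload with "database": "ok"
    curl -s http://localhost:3000/api/health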
Note that Prometheus does not start on boot, so it will need to be manually restarted after a reboot; a workaround sketch follows.
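If you want Prometheus to survive reboots in the meantime, one option is a hand-written systemd unit along these lines (the paths and user match the install above, but the unit itself is a hypothetical sketch, not something dmg generates):

    # /etc/systemd/system/prometheus.service -- hypothetical unit; adjust User/paths
    [Unit]
    Description=Prometheus for DAOS monitoring
    Wants=network-online.target
    After=network-online.target

    [Service]
    User=mjean
    ExecStart=/home/mjean/prometheus/prometheus --config.file=/home/mjean/.prometheus.yml
    Restart=on-failure

    [Install]
    WantedBy=multi-user.target

Enable it the same way as Grafana: sudo systemctl daemon-reload && sudo systemctl enable --now prometheus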
Starting Grafana on a wolf DAOS admin node:
To monitor metrics on wolf while running Grafana from Windows, you will need to set up port forwarding. Install Firefox on the Windows machine and set up a manual proxy in its network settings; the proxy port must match the forwarded port in the PuTTY setup below.
Use PuTTY to set up the port forwarding/tunnel (an OpenSSH equivalent is sketched after these steps).
Open the connection to wolf and log in as you normally would.
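If you prefer a command-line client to PuTTY, an equivalent OpenSSH dynamic (SOCKS) tunnel might look like this; the local port 8080 is an arbitrary choice and must match the port in Firefox's manual proxy settings:

    # hypothetical example: open a SOCKS proxy on local port 8080 through wolf
    ssh -D 8080 <username>@wolf-xx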
Open a connection to the wolf node running the Grafana service using the Firefox browser.
Start up the Grafana dashboard using Firefox:
    http://wolf-xx:3000

Log in with admin/admin; it will prompt you to change the admin password.
Before adding the DAOS dashboard to Grafana, you will need to add the Prometheus data source. (Grafana should prompt you to add a data source on first login.)
NOTE: The Prometheus data source URL is http://wolf-xx:9090
Add the Prometheus data source.
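Alternatively, the data source can be provisioned from a file instead of through the UI, using Grafana's standard provisioning format (the path is Grafana's default provisioning location; wolf-xx is a placeholder):

    # /etc/grafana/provisioning/datasources/prometheus.yml
    apiVersion: 1
    datasources:
      - name: Prometheus
        type: prometheus
        access: proxy
        url: http://wolf-xx:9090
        isDefault: true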
Import the DAOS dashboard into Grafana:
Import DAOS-Grafana-Dashboard.json from GitHub: https://github.com/daos-stack/daos/tree/master/utils/grafana
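The import can also be scripted against Grafana's standard dashboard API rather than done in the UI; a sketch, assuming the JSON file has been downloaded to the current directory and the admin password is still admin:

    # hypothetical: wrap the dashboard JSON in the import payload and POST it
    curl -s -u admin:admin -H 'Content-Type: application/json' \
      -X POST http://wolf-xx:3000/api/dashboards/db \
      -d "{\"dashboard\": $(cat DAOS-Grafana-Dashboard.json), \"overwrite\": true}"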
Open the DAOS dashboard; it should start collecting metrics once the DAOS servers have started.
This is a snapshot of DAOS metrics captured while soak was running on a shared cluster with 8 servers (4 nodes) and 5 clients: