Using Grafana with Prometheus on wolf

Reference: README.md in the DAOS source tree, https://github.com/daos-stack/daos/tree/master/utils/grafana

Use the DAOS admin node to install and run Grafana.

Before running Grafana, make sure the DAOS server has a telemetry port defined in daos_server.yml (9191 in this setup).
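A quick way to confirm the telemetry port is configured and answering can be sketched in shell. This is only a sketch under assumptions not stated in this page: that the key is written as a top-level "telemetry_port: <number>" line, that the config file path is supplied by you, and that the endpoint serves metrics over plain HTTP at /metrics. The function names are made up for illustration.

```shell
#!/bin/sh
# Print the telemetry port found in a daos_server.yml-style file.
# Assumption: the key appears as a top-level "telemetry_port: <number>" line.
telemetry_port() {
    sed -n 's/^telemetry_port:[[:space:]]*\([0-9][0-9]*\).*$/\1/p' "$1" | head -n 1
}

# Succeed if HOST answers on PORT.
# Assumption: metrics are served over plain HTTP at /metrics.
telemetry_up() {
    curl --silent --fail --max-time 5 "http://$1:$2/metrics" >/dev/null
}

# Example (not run here; the config path is a placeholder):
# port=$(telemetry_port /path/to/daos_server.yml)
# telemetry_up wolf-118 "$port" && echo "telemetry reachable on port $port"
```

If the first function prints nothing, the port is probably not set in the file you pointed it at.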


Install Prometheus on wolf daos admin node


mkdir ~/prometheus
dmg telemetry config -i /home/mjean/prometheus
		Downloading and installing Prometheus...
		fetching prometheus/prometheus v2.28.1
		Installed prometheus to /home/mjean/prometheus/prometheus
		Configuring Prometheus for DAOS monitoring...
		Wrote DAOS monitoring config to /home/mjean/.prometheus.yml)



To collect data from the server nodes, add them as targets in ~/.prometheus.yml.

For example, change

  - targets:
    - localhost:9191

to
  - targets:
    - wolf-118:9191
    - wolf-119:9191
    - wolf-120:9191
    - wolf-121:9191
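For a longer host list, the targets block above can be generated rather than typed. The following is a minimal sketch; the function name daos_targets is hypothetical, and the output still has to be pasted into ~/.prometheus.yml by hand at the right place.

```shell
#!/bin/sh
# Print a Prometheus "targets" list for the given hosts.
# Usage: daos_targets PORT HOST...
daos_targets() {
    port=$1
    shift
    printf '  - targets:\n'
    for host in "$@"; do
        printf '    - %s:%s\n' "$host" "$port"
    done
}

# Example: reproduces the four-node block shown above.
# daos_targets 9191 wolf-118 wolf-119 wolf-120 wolf-121
```

The indentation matches the example above, so the output can replace the existing localhost entry directly.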


(mjmac) It's worth noting here that if you add the hosts to your ~/.daos_control.yml or some other control config file that you use with dmg -o /path/to/config.yml, they will automatically be added to your prometheus configuration.

Starting Prometheus


Due to DAOS-8104, Prometheus should be started manually from the install directory (~/prometheus) with:

./prometheus --config.file="$HOME/.prometheus.yml"
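When starting Prometheus by hand, it can help to run it in the background and wait until it reports ready before moving on to Grafana. The sketch below assumes Prometheus listens on its default port 9090 and uses its /-/ready endpoint. The helper name wait_ready and the log file name are made up for illustration.

```shell
#!/bin/sh
# Poll a URL until it answers successfully or the attempts run out.
# Usage: wait_ready URL [ATTEMPTS]   (one attempt per second, default 30)
wait_ready() {
    url=$1
    attempts=${2:-30}
    i=0
    while [ "$i" -lt "$attempts" ]; do
        if curl --silent --fail --max-time 2 "$url" >/dev/null 2>&1; then
            return 0
        fi
        i=$((i + 1))
        sleep 1
    done
    return 1
}

# Example (not run here): start Prometheus in the background, then wait.
# nohup ./prometheus --config.file="$HOME/.prometheus.yml" >prometheus.log 2>&1 &
# wait_ready http://localhost:9090/-/ready && echo "Prometheus is ready"
```

Because the process is started with nohup, it keeps running after the login shell exits, but it still will not come back after a reboot.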


Only use the command below after DAOS-8104 is fixed:

dmg telemetry run -i /home/mjean/prometheus
		Downloading and installing Prometheus...
		fetching prometheus/prometheus v2.28.1
		Installed prometheus to /home/mjean/prometheus/prometheus
		Configuring Prometheus for DAOS monitoring...
		Wrote DAOS monitoring config to /home/mjean/.prometheus.yml)
		Starting Prometheus monitoring...
		level=info ts=2021-07-16T12:59:06.771Z caller=main.go:389 msg="No time or size retention was set so using the default time retention" duration=15d
		level=info ts=2021-07-16T12:59:06.771Z caller=main.go:443 msg="Starting Prometheus" version="(version=2.28.1, branch=HEAD, revision=b0944590a1c9a6b35dc5a696869f75f422b107a1)"
		level=info ts=2021-07-16T12:59:06.771Z caller=main.go:448 build_context="(go=go1.16.5, user=root@2915dd495090, date=20210701-15:20:10)"
		level=info ts=2021-07-16T12:59:06.771Z caller=main.go:449 host_details="(Linux 3.10.0-1160.24.1.el7.x86_64 #1 SMP Thu Apr 8 19:51:47 UTC 2021 x86_64 wolf-80.wolf.hpdd.intel.com wolf.hpdd.intel.com)"
		level=info ts=2021-07-16T12:59:06.771Z caller=main.go:450 fd_limits="(soft=1048576, hard=1048576)"
		level=info ts=2021-07-16T12:59:06.771Z caller=main.go:451 vm_limits="(soft=unlimited, hard=unlimited)"
		level=info ts=2021-07-16T12:59:06.775Z caller=web.go:541 component=web msg="Start listening for connections" address=0.0.0.0:9090
		level=info ts=2021-07-16T12:59:06.776Z caller=main.go:824 msg="Starting TSDB ..."
		level=info ts=2021-07-16T12:59:06.777Z caller=tls_config.go:191 component=web msg="TLS is disabled." http2=false
		level=info ts=2021-07-16T12:59:06.778Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1626298627558 maxt=1626350400000 ulid=01FANHRSJ125WPM5XQP3JN7548
		level=info ts=2021-07-16T12:59:06.779Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1626415202422 maxt=1626422400000 ulid=01FAQ8PQVY9T7Q6YHB3Q6T2N3H
		level=info ts=2021-07-16T12:59:06.780Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1626350400000 maxt=1626415200000 ulid=01FAQ8PR2JC5TWCEXPH51DKRNA
		level=info ts=2021-07-16T12:59:06.781Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1626422402421 maxt=1626429600000 ulid=01FAQFJF3YM6BH9GX6DM87M53F
		level=info ts=2021-07-16T12:59:06.805Z caller=head.go:780 component=tsdb msg="Replaying on-disk memory mappable chunks if any"
		level=info ts=2021-07-16T12:59:06.805Z caller=head.go:794 component=tsdb msg="On-disk memory mappable chunks replay completed" duration=49.128µs
		level=info ts=2021-07-16T12:59:06.805Z caller=head.go:800 component=tsdb msg="Replaying WAL, this may take a while"
		level=info ts=2021-07-16T12:59:06.809Z caller=head.go:826 component=tsdb msg="WAL checkpoint loaded"
		level=info ts=2021-07-16T12:59:06.888Z caller=head.go:854 component=tsdb msg="WAL segment loaded" segment=18 maxSegment=23
		level=info ts=2021-07-16T12:59:06.971Z caller=head.go:854 component=tsdb msg="WAL segment loaded" segment=19 maxSegment=23
		level=info ts=2021-07-16T12:59:07.047Z caller=head.go:854 component=tsdb msg="WAL segment loaded" segment=20 maxSegment=23
		level=info ts=2021-07-16T12:59:07.108Z caller=head.go:854 component=tsdb msg="WAL segment loaded" segment=21 maxSegment=23
		level=info ts=2021-07-16T12:59:07.119Z caller=head.go:854 component=tsdb msg="WAL segment loaded" segment=22 maxSegment=23
		level=info ts=2021-07-16T12:59:07.120Z caller=head.go:854 component=tsdb msg="WAL segment loaded" segment=23 maxSegment=23
		level=info ts=2021-07-16T12:59:07.120Z caller=head.go:860 component=tsdb msg="WAL replay completed" checkpoint_replay_duration=3.35109ms wal_replay_duration=311.254676ms total_replay_duration=314.690735ms
		level=warn ts=2021-07-16T12:59:07.122Z caller=main.go:849 fs_type=NFS_SUPER_MAGIC msg="This filesystem is not supported and may lead to data corruption and data loss. Please carefully read https://prometheus.io/docs/prometheus/latest/storage/ to learn more about supported filesystems."
		level=info ts=2021-07-16T12:59:07.122Z caller=main.go:854 msg="TSDB started"
		level=info ts=2021-07-16T12:59:07.122Z caller=main.go:981 msg="Loading configuration file" filename=/home/mjean/.prometheus.yml
		level=info ts=2021-07-16T12:59:07.124Z caller=main.go:1012 msg="Completed loading of configuration file" filename=/home/mjean/.prometheus.yml totalDuration=2.234767ms remote_storage=5.111µs web_handler=456ns query_engine=1.803µs scrape=1.283008ms scrape_sd=36.14µs notify=1.113µs notify_sd=2.39µs rules=2.893µs
		level=info ts=2021-07-16T12:59:07.124Z caller=main.go:796 msg="Server is ready to receive web requests."
		level=info ts=2021-07-16T13:00:02.439Z caller=compact.go:518 component=tsdb msg="write block" mint=1626429600000 maxt=1626436800000 ulid=01FAQPE1FGXW6QWCK6C40X2GQG duration=22.37061ms
		level=info ts=2021-07-16T13:00:02.450Z caller=head.go:967 component=tsdb msg="Head GC completed" duration=2.069077ms
		level=info ts=2021-07-16T13:00:02.452Z caller=checkpoint.go:97 component=tsdb msg="Creating checkpoint" from_segment=18 to_segment=20 mint=1626436800000
		level=info ts=2021-07-16T13:00:02.464Z caller=head.go:1064 component=tsdb msg="WAL checkpoint complete" first=18 last=20 duration=13.152763ms
		


Install Grafana on wolf daos admin node 

https://grafana.com/docs/grafana/latest/installation/


Download the package from the Grafana web site:


	Downloads:
	https://grafana.com/grafana/download
	
	For CentOS:
	
	wget https://dl.grafana.com/oss/release/grafana-8.0.6-1.x86_64.rpm
	sudo yum install grafana-8.0.6-1.x86_64.rpm
	
	For SLES:
	
	wget https://dl.grafana.com/oss/release/grafana-8.0.6-1.x86_64.rpm
	sudo rpm -i --nodeps grafana-8.0.6-1.x86_64.rpm


After installation, start the Grafana service on the DAOS admin node:


	sudo systemctl daemon-reload
	sudo systemctl start grafana-server
	sudo systemctl status grafana-server
	
	Configure the Grafana server to start at boot:
	sudo systemctl enable grafana-server 

Prometheus does not start at boot, so it must be restarted manually after a reboot.


Starting Grafana on a wolf daos admin node:


To monitor metrics on wolf while running Grafana from Windows, you need to set up port forwarding. Install Firefox on the Windows machine and set up a manual proxy in its network settings. The proxy port must match the forwarded port in the PuTTY setup below.



Use PuTTY to set up the port forwarding tunnel




Open the connection to wolf and log in as you normally would.


Using Firefox, open a connection to the wolf node running the Grafana service.


Start the Grafana dashboard in Firefox:


 http://wolf-xx:3000

 Log in with admin/admin. Grafana will prompt you to change the admin password.


Before adding the DAOS dashboard to Grafana, you need to add a Prometheus data source. (Grafana should prompt you to add one initially.)


 NOTE: Prometheus uses wolf-xx:9090


Add the Prometheus data source
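The data source can also be added without the browser, through Grafana's HTTP API. The sketch below is an alternative to the UI steps, not the procedure this page was written around. It assumes basic authentication with the admin account is enabled, and NEW_PASSWORD and wolf-xx are placeholders. The function name is hypothetical.

```shell
#!/bin/sh
# Build the JSON body for Grafana's "create data source" API call.
# Usage: prometheus_datasource_json PROMETHEUS_URL
prometheus_datasource_json() {
    printf '{"name":"Prometheus","type":"prometheus","url":"%s","access":"proxy","isDefault":true}\n' "$1"
}

# Example (not run here): replace wolf-xx and NEW_PASSWORD with real values.
# prometheus_datasource_json http://wolf-xx:9090 |
#     curl --silent --fail -u admin:NEW_PASSWORD \
#          -H 'Content-Type: application/json' \
#          -X POST --data @- http://wolf-xx:3000/api/datasources
```

With "access" set to "proxy", the Grafana server itself contacts Prometheus, so the URL only needs to be reachable from the admin node, not from the Windows machine.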




Import Grafana metrics in the Prometheus dashboard






 Import DAOS-Grafana-Dashboard.json from GitHub: https://github.com/daos-stack/daos/tree/master/utils/grafana





Open the DAOS dashboard. It should start collecting metrics once the DAOS servers have started.
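If the dashboard stays empty, one thing worth checking is whether Prometheus can reach the DAOS servers at all. A rough sketch, assuming the Prometheus HTTP API is reachable at wolf-xx:9090 and returns its usual compact JSON; the helper name count_up_targets is made up for illustration.

```shell
#!/bin/sh
# Count scrape targets reported as healthy in a /api/v1/targets response
# read from standard input.
count_up_targets() {
    grep -o '"health":"up"' | wc -l | tr -d ' '
}

# Example (not run here):
# curl --silent http://wolf-xx:9090/api/v1/targets | count_up_targets
```

A count lower than the number of hosts listed in ~/.prometheus.yml suggests that some servers are down or that their telemetry port is not reachable.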

This is a snapshot of DAOS metrics while soak was running on a shared cluster with 8 servers (4 nodes) and 5 clients.