Introduction
The following instructions detail how to install DAOS and then set up and start DAOS servers and clients on two or more nodes. The sections include instructions for both openSUSE/SLES and CentOS. For more details, refer to the DAOS administration guide: https://daos-stack.github.io/admin/hardware/
Requirements
The following steps require two or more hosts, which will be divided into admin, client, and server roles. One node can serve as both the admin node and a client node. All nodes must have:
- sudo access configured
- password-less ssh configured
- pdsh installed (or some other means of running multiple remote commands in parallel)
In addition the server nodes should also have:
- an InfiniBand network adapter configured
- one or more NVMe devices
- IOMMU enabled (see https://daos-stack.github.io/admin/predeployment_check/#enable-iommu-optional)
For the use of the commands outlined on this page the following shell variables will need to be defined:
- ADMIN_NODE
- CLIENT_NODES
- SERVER_NODES
- ALL_NODES
For example, if one wanted to use node-1 as their admin node, node-2 and node-3 as client nodes, and node-[4-6] as their server nodes then these variables would be defined as:
ADMIN_NODE=node-1
CLIENT_NODES=node-2,node-3
SERVER_NODES=node-4,node-5,node-6
ALL_NODES=$ADMIN_NODE,$CLIENT_NODES,$SERVER_NODES
If a client node is also serving as an admin node then exclude $ADMIN_NODE from the ALL_NODES assignment to prevent duplication, e.g.
ALL_NODES=$CLIENT_NODES,$SERVER_NODES
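The duplicate-avoidance rule above can also be automated. The following is a minimal sketch, not part of the DAOS tooling: `join_unique_hosts` is a hypothetical helper name, and it assumes plain comma-separated host names (no pdsh range syntax such as node-[4-6]).

```shell
# Hypothetical helper: join comma-separated host lists into one list,
# dropping repeated names while keeping first-seen order.
join_unique_hosts() {
    local IFS=','
    local seen="," out="" host
    # Unquoted $* is split on commas because IFS is "," inside this function.
    for host in $*; do
        [ -z "$host" ] && continue
        case "$seen" in
            *",$host,"*) ;;   # already in the list, skip it
            *) seen="$seen$host,"; out="${out:+$out,}$host" ;;
        esac
    done
    printf '%s\n' "$out"
}

# Example: node-2 acts as both the admin node and a client node.
ADMIN_NODE=node-2
CLIENT_NODES=node-2,node-3
SERVER_NODES=node-4,node-5,node-6
ALL_NODES=$(join_unique_hosts "$ADMIN_NODE" "$CLIENT_NODES" "$SERVER_NODES")
echo "$ALL_NODES"
```

With this approach the same assignment works whether or not the admin node is also listed among the client nodes.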
RPM Installation
In this section the required RPMs will be installed on each of the nodes based upon their role. Admin and client nodes require the daos-client RPM, and server nodes require the daos-server RPM.
Configure access to the DAOS package repository at https://packages.daos.io/v1.2.
# For CentOS only
pdsh -w $ALL_NODES 'sudo wget -O /etc/yum.repos.d/daos-packages.repo https://packages.daos.io/v1.2/CentOS7/packages/x86_64/daos_packages.repo'
# For SUSE only
pdsh -w $ALL_NODES 'sudo zypper ar https://packages.daos.io/v1.2/Leap15/packages/x86_64/ daos_packages'
Import GPG key on all nodes:
pdsh -w $ALL_NODES 'sudo rpm --import https://packages.daos.io/RPM-GPG-KEY'
Perform the additional OS-specific steps:
# For CentOS only
pdsh -w $ALL_NODES 'sudo yum install -y epel-release'
# For SUSE only
pdsh -w $ALL_NODES 'sudo zypper --non-interactive refresh'
Install the DAOS server RPMs on the server nodes
# For CentOS only
pdsh -w $SERVER_NODES 'sudo yum install -y daos-server'
# For SUSE only
pdsh -w $SERVER_NODES 'sudo zypper install -y daos-server'
Install the DAOS client RPMs on the client and admin nodes
# For CentOS only
pdsh -w $ALL_NODES -x $SERVER_NODES 'sudo yum install -y daos-client'
# For SUSE only
pdsh -w $ALL_NODES -x $SERVER_NODES 'sudo zypper install -y daos-client'
(Optionally) Install the DAOS test RPMs on the client nodes - typically not required
# For CentOS only
pdsh -w $ALL_NODES -x $SERVER_NODES 'sudo yum install -y daos-tests'
# For SUSE only
pdsh -w $ALL_NODES -x $SERVER_NODES 'sudo zypper install -y daos-tests'
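Because every install step above comes in a CentOS and a SUSE variant, the choice can be wrapped in a small function. This is only a sketch: `daos_install_cmd` is a hypothetical name, and it assumes the OS is identified by the ID field of /etc/os-release (centos, opensuse-leap, or sles) and that all nodes run the same OS as the node where the command is built.

```shell
# Hypothetical helper: print the package install command for a given
# /etc/os-release ID and package name.
daos_install_cmd() {
    local os_id="$1" pkg="$2"
    case "$os_id" in
        centos)
            printf 'sudo yum install -y %s\n' "$pkg" ;;
        opensuse-leap|sles)
            printf 'sudo zypper install -y %s\n' "$pkg" ;;
        *)
            echo "unsupported OS id: $os_id" >&2
            return 1 ;;
    esac
}

# Possible usage on the admin node (assumes a homogeneous cluster):
#   . /etc/os-release
#   pdsh -w $SERVER_NODES "$(daos_install_cmd "$ID" daos-server)"
#   pdsh -w $ALL_NODES -x $SERVER_NODES "$(daos_install_cmd "$ID" daos-client)"
```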
Hardware Provisioning
In this section PMem (Intel(R) Optane(TM) persistent memory) and NVMe SSDs will be prepared and configured to be used by DAOS. PMem preparation is required once per DAOS installation.
On openSUSE Leap 15.2, update ipmctl to the latest package available from https://build.opensuse.org/package/binaries/hardware:nvdimm/ipmctl/openSUSE_Leap_15.2
Prepare the pmem devices on Server nodes
pdsh -w $SERVER_NODES daos_server storage prepare --scm-only
Preparing locally-attached SCM...
Memory allocation goals for SCM will be changed and namespaces modified, this will be a destructive operation. Please ensure namespaces are unmounted and locally attached SCM & NVMe devices are not in use. Please be patient as it may take several minutes and subsequent reboot maybe required.
Are you sure you want to continue? (yes/no)
yes
A reboot is required to process new SCM memory allocation goals.
Reboot the server nodes.
Rerun the prepare command:
pdsh -w $SERVER_NODES daos_server storage prepare --scm-only
Preparing locally-attached SCM...
SCM namespaces:
SCM Namespace Socket ID Capacity
------------- --------- --------
pmem0         0         3.2 TB
pmem1         0         3.2 TB
Prepare the NVMe devices on Server nodes
pdsh -w $SERVER_NODES daos_server storage prepare --nvme-only -u root
Preparing locally-attached NVMe storage...
Scan the available storage on the Server nodes
pdsh -w $SERVER_NODES daos_server storage scan
Scanning locally-attached storage...
NVMe PCI     Model               FW Revision Socket ID Capacity
--------     -----               ----------- --------- --------
0000:5e:00.0 INTEL SSDPE2KE016T8 VDV10170    0         1.6 TB
0000:5f:00.0 INTEL SSDPE2KE016T8 VDV10170    0         1.6 TB
0000:81:00.0 INTEL SSDPED1K750GA E2010475    1         750 GB
0000:da:00.0 INTEL SSDPED1K750GA E2010475    1         750 GB
SCM Namespace Socket ID Capacity
------------- --------- --------
pmem0         0         3.2 TB
pmem1         1         3.2 TB
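When deciding which NVMe devices to pair with which PMem namespace, it can help to list the devices by socket. The sketch below is an illustration only: `nvme_by_socket` is a hypothetical helper, and it assumes the column layout of the scan output shown above with no pdsh "host:" prefix on each line (for example, output captured on a single server).

```shell
# Hypothetical helper: read storage scan output on stdin and print
# "<socket-id> <pci-address>" for every NVMe row. The model name may
# contain spaces, so the socket ID is taken as the third field from the end.
nvme_by_socket() {
    awk '$1 ~ /^[0-9a-f]+:[0-9a-f]+:[0-9a-f]+\.[0-9a-f]+$/ { print $(NF - 2), $1 }'
}

# Sample rows copied from the scan output above.
scan_sample='NVMe PCI     Model               FW Revision Socket ID Capacity
0000:5e:00.0 INTEL SSDPE2KE016T8 VDV10170    0         1.6 TB
0000:81:00.0 INTEL SSDPED1K750GA E2010475    1         750 GB
pmem0         0         3.2 TB'

printf '%s\n' "$scan_sample" | nvme_by_socket
```

Header, separator, and SCM namespace rows are skipped because their first field is not a PCI address.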
Generate certificates
In this section certificates will be generated and installed for encrypting DAOS control plane communications.
Administrative nodes require the following certificate files:
- CA root certificate (daosCA.crt) owned by the current user
- Admin certificate (admin.crt) owned by the current user
- Admin key (admin.key) owned by the current user
Client nodes require the following certificate files:
- CA root certificate (daosCA.crt) owned by the current user
- Agent certificate (agent.crt) owned by the daos_agent user
- Agent key (agent.key) owned by the daos_agent user
Server nodes require the following certificate files:
- CA root certificate (daosCA.crt) owned by the daos_server user
- Server certificate (server.crt) owned by the daos_server user
- Server key (server.key) owned by the daos_server user
- A copy of the agent (client) certificate (agent.crt) owned by the daos_server user
See https://daos-stack.github.io/admin/deployment/#certificate-configuration for more information.
The following commands are run from the $ADMIN_NODE.
Generate a new set of certificates.
cd /tmp
/usr/lib64/daos/certgen/gen_certificates.sh
These files should be protected from unauthorized access and preserved for future use.
Copy the certificates to a common location on each node in order to be able to move them to the final location
pdsh -S -w $ALL_NODES -x $(hostname -s) scp -r $(hostname -s):/tmp/daosCA /tmp
If the /etc/daos/certs directory does not exist on the admin node, create it first:
pdsh -S -w $ADMIN_NODE sudo mkdir /etc/daos/certs
Copy the certificates to their default location (/etc/daos/certs) on the admin node
pdsh -S -w $ADMIN_NODE sudo cp /tmp/daosCA/certs/daosCA.crt /etc/daos/certs/.
pdsh -S -w $ADMIN_NODE sudo cp /tmp/daosCA/certs/admin.crt /etc/daos/certs/.
pdsh -S -w $ADMIN_NODE sudo cp /tmp/daosCA/certs/admin.key /etc/daos/certs/.
If the /etc/daos/certs directory does not exist on the client nodes, create it first:
pdsh -S -w $CLIENT_NODES sudo mkdir /etc/daos/certs
Copy the certificates to their default location (/etc/daos/certs) on each client node
pdsh -S -w $CLIENT_NODES sudo cp /tmp/daosCA/certs/daosCA.crt /etc/daos/certs/.
pdsh -S -w $CLIENT_NODES sudo cp /tmp/daosCA/certs/agent.crt /etc/daos/certs/.
pdsh -S -w $CLIENT_NODES sudo cp /tmp/daosCA/certs/agent.key /etc/daos/certs/.
Copy the certificates to their default location (/etc/daos/certs) on each server node
pdsh -S -w $SERVER_NODES sudo cp /tmp/daosCA/certs/daosCA.crt /etc/daos/certs/.
pdsh -S -w $SERVER_NODES sudo cp /tmp/daosCA/certs/server.crt /etc/daos/certs/.
pdsh -S -w $SERVER_NODES sudo cp /tmp/daosCA/certs/server.key /etc/daos/certs/.
pdsh -S -w $SERVER_NODES sudo cp /tmp/daosCA/certs/agent.crt /etc/daos/certs/clients/agent.crt
Set the ownership of the admin certificates on each admin node
pdsh -S -w $ADMIN_NODE sudo chown $USER:$USER /etc/daos/certs/daosCA.crt
pdsh -S -w $ADMIN_NODE sudo chown $USER:$USER /etc/daos/certs/admin.*
Set the ownership of the client certificates on each client node
pdsh -S -w $CLIENT_NODES sudo chown $USER:$USER /etc/daos/certs/daosCA.crt
pdsh -S -w $CLIENT_NODES sudo chown daos_agent:daos_agent /etc/daos/certs/agent.*
Set the ownership of the server certificates on each server node
pdsh -S -w $SERVER_NODES sudo chown daos_server:daos_server /etc/daos/certs/daosCA.crt
pdsh -S -w $SERVER_NODES sudo chown daos_server:daos_server /etc/daos/certs/server.*
pdsh -S -w $SERVER_NODES sudo chown daos_server:daos_server /etc/daos/certs/clients/agent.crt
pdsh -S -w $SERVER_NODES sudo chown daos_server:daos_server /etc/daos/certs/clients
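After the copy and ownership steps, a quick check that every expected file is in place can catch a missed command. The following is a rough sketch under stated assumptions: `required_certs` and `missing_certs` are hypothetical helper names, the file names follow the copy commands in this section (including clients/agent.crt on servers), and only file presence is checked, not ownership.

```shell
# Hypothetical helper: print the certificate files expected for a role,
# relative to the certificate directory.
required_certs() {
    case "$1" in
        admin)  echo "daosCA.crt admin.crt admin.key" ;;
        client) echo "daosCA.crt agent.crt agent.key" ;;
        server) echo "daosCA.crt server.crt server.key clients/agent.crt" ;;
        *)      echo "unknown role: $1" >&2; return 1 ;;
    esac
}

# Hypothetical helper: print each missing file for a role.
# Returns 0 if all files exist, 1 if any is missing, 2 for an unknown role.
missing_certs() {
    local role="$1" dir="${2:-/etc/daos/certs}" files f rc=0
    files=$(required_certs "$role") || return 2
    for f in $files; do
        if [ ! -e "$dir/$f" ]; then
            echo "$dir/$f"
            rc=1
        fi
    done
    return "$rc"
}

# Possible usage when run locally on a server node:
#   missing_certs server || echo "server certificates incomplete"
```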
Create Configuration Files
In this section the daos_server, daos_agent, and dmg command configuration files will be defined. Examples are available at https://github.com/daos-stack/daos/tree/release/1.2/utils/config/examples
First determine the addresses for the NVMe devices on the server nodes
pdsh -S -w $SERVER_NODES sudo lspci | grep -i nvme
Save the addresses of the NVMe devices to use with each DAOS server, e.g. "81:00.0", from each server node. This information will be used to populate the "bdev_list" server configuration parameter below.
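Turning the lspci addresses into a "bdev_list" value is a small text transformation. The sketch below is one possible way to do it, not an official tool: `bdev_list_from_lspci` is a hypothetical helper, the device descriptions in the sample are made up, and it assumes that addresses printed without a PCI domain belong to domain 0000.

```shell
# Hypothetical helper: read "lspci | grep -i nvme" lines on stdin (with or
# without a pdsh "host:" prefix) and print a bdev_list line in YAML syntax.
bdev_list_from_lspci() {
    awk '
        {
            # Skip a leading "host:" token added by pdsh, if present.
            addr = ($1 ~ /:$/) ? $2 : $1
            if (addr !~ /^([0-9a-f]+:)?[0-9a-f]+:[0-9a-f]+\.[0-9a-f]+$/) next
            # Addresses such as 81:00.0 lack the domain; assume 0000.
            if (split(addr, parts, ":") == 2) addr = "0000:" addr
            list = list (list == "" ? "" : ", ") "\"" addr "\""
        }
        END { print "bdev_list: [" list "]" }
    '
}

# Made-up sample input in the shape produced by the pdsh command above.
lspci_sample='node-4: 81:00.0 Non-Volatile memory controller: Example NVMe SSD
node-4: 83:00.0 Non-Volatile memory controller: Example NVMe SSD'

printf '%s\n' "$lspci_sample" | bdev_list_from_lspci
```

Note that this collects every address into one list. In a configuration with several engines, each engine's "bdev_list" should receive only the devices intended for that engine.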
Create a server configuration file by modifying the default /etc/daos/daos_server.yml file on the server nodes. Below is an example daos_server.yml. Copy the modified server YAML file to /etc/daos/daos_server.yml on all the server nodes.
More details on configuring the daos_server.yml file are available at Server configuration file details
name: daos_server
access_points: ['node-4']
port: 10001
transport_config:
  allow_insecure: false
  client_cert_dir: /etc/daos/certs/clients
  ca_cert: /etc/daos/certs/daosCA.crt
  cert: /etc/daos/certs/server.crt
  key: /etc/daos/certs/server.key
provider: ofi+verbs;ofi_rxm
socket_dir: /var/run/daos_server
nr_hugepages: 4096
control_log_mask: DEBUG
control_log_file: /tmp/daos_server.log
helper_log_file: /tmp/daos_admin.log
engines:
-
  targets: 8
  nr_xs_helpers: 0
  fabric_iface: ib0
  fabric_iface_port: 31416
  log_mask: INFO
  log_file: /tmp/daos_engine_0.log
  env_vars:
    - CRT_TIMEOUT=30
  scm_mount: /mnt/daos0
  scm_class: dcpm
  scm_list: [/dev/pmem0]
  bdev_class: nvme
  bdev_list: ["0000:81:00.0"]  # generate regular nvme.conf
-
  targets: 8
  nr_xs_helpers: 0
  fabric_iface: ib1
  fabric_iface_port: 31416
  log_mask: INFO
  log_file: /tmp/daos_engine_1.log
  env_vars:
    - CRT_TIMEOUT=30
  scm_mount: /mnt/daos1
  scm_class: dcpm
  scm_list: [/dev/pmem1]
  bdev_class: nvme
  bdev_list: ["0000:83:00.0"]  # generate regular nvme.conf
- Create an agent configuration file by modifying the default /etc/daos/daos_agent.yml file on the client nodes. Below is an example daos_agent.yml. Copy the modified agent YAML file to /etc/daos/daos_agent.yml on all the client nodes. More details on configuring the daos_agent.yml file are available at Agent configuration file details
name: daos_server
access_points: ['node-4']
port: 10001
transport_config:
  allow_insecure: false
  ca_cert: /etc/daos/certs/daosCA.crt
  cert: /etc/daos/certs/agent.crt
  key: /etc/daos/certs/agent.key
runtime_dir: /var/run/daos_agent
log_file: /tmp/daos_agent.log
- Create a dmg configuration file by modifying the default /etc/daos/daos_control.yml file on the admin node. Below is an example daos_control.yml. More details on configuring the daos_control.yml file are available at DMG configuration file details
name: daos_server
port: 10001
hostlist: ['node-4', 'node-5', 'node-6']
transport_config:
  allow_insecure: false
  ca_cert: /etc/daos/certs/daosCA.crt
  cert: /etc/daos/certs/admin.crt
  key: /etc/daos/certs/admin.key
Start DAOS servers
Start the DAOS engines on the server nodes
# start servers
pdsh -S -w $SERVER_NODES "sudo systemctl daemon-reload"
pdsh -S -w $SERVER_NODES "sudo systemctl start daos_server"
Check status and format storage
# check status
pdsh -S -w $SERVER_NODES "sudo systemctl status daos_server"
# if you see the following format messages (depending on the number of servers), proceed to storage format
wolf-179: May 05 22:21:03 wolf-179.wolf.hpdd.intel.com daos_server[37431]: Metadata format required on instance 0
# format storage
dmg storage format -l $SERVER_NODES --reformat
Verify that all servers have started
# system query from ADMIN_NODE
dmg system query -v
# all the server ranks should show the 'Joined' state
Rank UUID                                 Control Address  Fault Domain                  State  Reason
---- ----                                 ---------------  ------------                  -----  ------
0    604c4ffa-563a-49dc-b702-3c87293dbcf3 10.8.1.179:10001 /wolf-179.wolf.hpdd.intel.com Joined
1    f0791f98-4379-4ace-a083-6ca3ffa65756 10.8.1.179:10001 /wolf-179.wolf.hpdd.intel.com Joined
2    745d2a5b-46dd-42c5-b90a-d2e46e178b3e 10.8.1.189:10001 /wolf-189.wolf.hpdd.intel.com Joined
3    ba6a7800-3952-46ce-af92-bba9daa35048 10.8.1.189:10001 /wolf-189.wolf.hpdd.intel.com Joined
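The "all ranks Joined" check can be scripted instead of read by eye. This is a minimal sketch, assuming the six-column layout shown above (rank, UUID, control address, fault domain, state, reason); `count_not_joined` is a hypothetical helper name.

```shell
# Hypothetical helper: read "dmg system query -v" output on stdin and print
# how many ranks report a state other than Joined. Rank rows are recognized
# by a numeric first field; the state is the fifth field.
count_not_joined() {
    awk '$1 ~ /^[0-9]+$/ && $5 != "Joined" { n++ } END { print n + 0 }'
}

# Sample rows copied from the query output above.
query_sample='Rank UUID                                 Control Address  Fault Domain                  State  Reason
0    604c4ffa-563a-49dc-b702-3c87293dbcf3 10.8.1.179:10001 /wolf-179.wolf.hpdd.intel.com Joined
2    745d2a5b-46dd-42c5-b90a-d2e46e178b3e 10.8.1.189:10001 /wolf-189.wolf.hpdd.intel.com Joined'

printf '%s\n' "$query_sample" | count_not_joined
```

On the admin node this could be used as `dmg system query -v | count_not_joined`, treating any non-zero result as a sign that some servers have not started or joined yet.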
Start DAOS agents
Start the DAOS agents on the client nodes
# start agents
pdsh -S -w $CLIENT_NODES "sudo systemctl start daos_agent"
(Optional) Check daos_agent status
# check status
pdsh -S -w $CLIENT_NODES "cat /tmp/daos_agent.log"
# Sample output depending on number of client nodes
node-2: agent INFO 2021/05/05 22:38:46 DAOS Agent v1.2 (pid 47580) listening on /var/run/daos_agent/daos_agent.sock
node-3: agent INFO 2021/05/05 22:38:53 DAOS Agent v1.2 (pid 39135) listening on /var/run/daos_agent/daos_agent.sock
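With many client nodes, scanning the log output for a missing agent is tedious. The following sketch shows one way to automate it; `agents_not_listening` is a hypothetical helper, and it assumes pdsh-prefixed log lines as in the sample above and a plain comma-separated host list.

```shell
# Hypothetical helper: given a comma-separated host list as $1 and
# pdsh-prefixed agent log lines on stdin, print the hosts (comma-separated)
# that have no "listening on" line.
agents_not_listening() {
    local expected="$1" log host missing=""
    log=$(cat)
    local IFS=','
    # Unquoted $expected is split on commas because IFS is "," here.
    for host in $expected; do
        if ! printf '%s\n' "$log" | grep -q "^$host: .*listening on"; then
            missing="${missing:+$missing,}$host"
        fi
    done
    printf '%s\n' "$missing"
}

# Sample line copied from the agent log output above.
agent_log_sample='node-2: agent INFO 2021/05/05 22:38:46 DAOS Agent v1.2 (pid 47580) listening on /var/run/daos_agent/daos_agent.sock'

printf '%s\n' "$agent_log_sample" | agents_not_listening "node-2,node-3"
```

An empty result means every listed client reported a listening agent. Possible usage: `pdsh -S -w $CLIENT_NODES "cat /tmp/daos_agent.log" | agents_not_listening "$CLIENT_NODES"`.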