Creating multiple SCM namespaces per CPU socket

The DAOS 2.4 release will support creating multiple SCM namespaces per NUMA node using the daos_server storage prepare --scm-only command (see ticket https://daosio.atlassian.net/browse/DAOS-9876 ). This mini-guide explains how to do this manually with the ipmctl and ndctl tools.

Perform the following steps to create two SCM namespaces per socket on a dual-socket host:

  1. Ensure a clean slate where PMem modules are configured in memory mode and have not yet been set to AppDirect mode.

  2. Create one PMem AppDirect (interleaved) region for each socket.

  3. Create two PMem namespaces per PMem AppDirect region.

  4. Validate that the PMem namespaces have been created and are visible as Linux block devices.

Reset PMem into memory mode

To start from a clean slate, configure PMem in memory mode so that all existing AppDirect configuration is removed:

  1. Stop the DAOS server: dmg system stop ; systemctl stop daos_server

  2. Ensure that all PMem-based filesystems are unmounted: df |grep pmem should not show anything. Use umount on whatever is still mounted.

  3. Run daos_server storage prepare --scm-only --reset --force (this will remove the namespaces and set the PMem goal to MemoryMode)

  4. Reboot to apply the memory resource allocation goal changes in BIOS.

  5. ipmctl show -region

The output should be: There are no Regions defined in the system.
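The unmount check in step 2 can also be scripted. The following is a minimal sketch, not part of DAOS: pmem_mounts is a hypothetical helper that lists the mount points of any filesystem backed by a /dev/pmem* device, reading a mounts table in /proc/mounts format.

```shell
# Hypothetical helper (not part of DAOS): print the mount point of every
# filesystem whose device is /dev/pmem*. Reads a mounts table in
# /proc/mounts format; the path defaults to /proc/mounts.
pmem_mounts() {
    # Field 1 is the device, field 2 is the mount point.
    awk '$1 ~ /^\/dev\/pmem/ { print $2 }' "${1:-/proc/mounts}"
}

if [ -n "$(pmem_mounts)" ]; then
    echo "PMem filesystems still mounted:"
    pmem_mounts
else
    echo "No PMem filesystems mounted"
fi
```

Any mount point it prints still needs to be unmounted before running the reset in step 3.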

Create one PMem AppDirect region for each socket:

All the PMem modules attached to a specific socket will be combined into a region.

  1. ipmctl create -f -goal PersistentMemoryType=AppDirect

  2. Reboot to apply the memory resource allocation goal changes in BIOS.

  3. ipmctl show -d PersistentMemoryType,FreeCapacity -region

Output should be of the form (the FreeCapacity value depends on the quantity and capacity of the PMem modules; this example is from a CLX server with six 512GiB PMem modules per socket):

---ISetID=0x2aba7f4828ef2ccc---
PersistentMemoryType=AppDirect
FreeCapacity=3012.0 GiB
---ISetID=0x81187f4881f02ccc---
PersistentMemoryType=AppDirect
FreeCapacity=3012.0 GiB

Create two PMem namespaces per PMem AppDirect region:

Each region will be divided into two PMem namespaces of equal size (in this example, 3012 GiB / 2 = 1506 GiB). The namespace size should be 2 MiB aligned and a multiple of the interleave-width (see the ndctl create-namespace command help for more details).

Run commands to create two namespaces on each of the two AppDirect regions:

GIB=3012
let SIZE=$GIB*1024*1024*1024/2 ; echo $SIZE # in this example, the output should be 1617055186944
ndctl create-namespace --region 0 --size $SIZE
ndctl create-namespace --region 0 --size $SIZE
ndctl create-namespace --region 1 --size $SIZE
ndctl create-namespace --region 1 --size $SIZE

Each of the ndctl invocations should report the properties of the created namespace.
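The size arithmetic above can be sketched as a small reusable function. This is an illustration only: half_region_size is a hypothetical helper name, and the 2 MiB alignment unit is taken from the text above. It does not check the interleave-width requirement, which still has to be verified against the ndctl create-namespace help.

```shell
# Hypothetical helper: half of a region's capacity in bytes, rounded down
# to a multiple of the given alignment.
#   $1 = region FreeCapacity in GiB (whole number)
#   $2 = alignment in bytes
half_region_size() {
    echo $(( $1 * 1024 * 1024 * 1024 / 2 / $2 * $2 ))
}

ALIGN=$(( 2 * 1024 * 1024 ))              # 2 MiB
SIZE=$(half_region_size 3012 "$ALIGN")    # 3012 GiB region from the example
echo "$SIZE"                              # prints 1617055186944
```

For the 3012 GiB example, half the region is already 2 MiB aligned, so the rounding leaves the value unchanged. For other capacities, rounding down means a small amount of the region may remain unallocated.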

List PMem AppDirect region info:

After creating the namespaces, the PMem regions should have no free capacity left.

  1. ipmctl show -d PersistentMemoryType,FreeCapacity -region

Output should be of the form:

---ISetID=0x2aba7f4828ef2ccc---
PersistentMemoryType=AppDirect
FreeCapacity=0.0 GiB
---ISetID=0x81187f4881f02ccc---
PersistentMemoryType=AppDirect
FreeCapacity=0.0 GiB
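The free-capacity check can be automated by parsing the region listing. The sketch below assumes the output keeps the FreeCapacity=<value> GiB form shown above. regions_fully_allocated is a hypothetical helper, not an ipmctl feature, and the sample text it is fed here is copied from this guide rather than from a live system.

```shell
# Hypothetical helper: read an ipmctl region listing on stdin and succeed
# only if at least one FreeCapacity value is present and all of them are zero.
regions_fully_allocated() {
    awk '
        {
            for (i = 1; i <= NF; i++)
                if ($i ~ /^FreeCapacity=/) {
                    seen = 1
                    v = $i
                    sub(/^FreeCapacity=/, "", v)
                    if (v + 0 != 0) bad = 1
                }
        }
        END { exit (seen && !bad) ? 0 : 1 }
    '
}

# Sample listing copied from this guide. On a real host, pipe in the output of
#   ipmctl show -d PersistentMemoryType,FreeCapacity -region
sample='---ISetID=0x2aba7f4828ef2ccc---
PersistentMemoryType=AppDirect
FreeCapacity=0.0 GiB
---ISetID=0x81187f4881f02ccc---
PersistentMemoryType=AppDirect
FreeCapacity=0.0 GiB'

if printf '%s\n' "$sample" | regions_fully_allocated; then
    echo "All regions fully allocated"
else
    echo "Free capacity remains, or no regions were found"
fi
```

The helper deliberately fails when no FreeCapacity line is present, so an empty listing or an error message is not mistaken for success.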

List PMem namespaces:

New PMem namespace details should be available.

  1. ndctl list -N -v

Output should be of the form:

 

List the PMem block devices

Verify that the PMem namespaces are visible as Linux block devices:

  1. ls -al /dev/pmem*

  2. lsblk|grep -E "NAME|pmem"

On a 2-socket CLX server with six 128GiB PMem modules per socket (a smaller configuration than the 512GiB example used earlier), the output should be similar to this:

brw-rw---- 1 root disk 259, 0 Jun 22 13:12 /dev/pmem0
brw-rw---- 1 root disk 259, 1 Jun 22 13:12 /dev/pmem0.1
brw-rw---- 1 root disk 259, 2 Jun 22 13:13 /dev/pmem1
brw-rw---- 1 root disk 259, 3 Jun 22 13:13 /dev/pmem1.1

NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
pmem0 259:0 0 372.1G 0 disk
pmem1 259:2 0 372.1G 0 disk
pmem0.1 259:1 0 372.1G 0 disk
pmem1.1 259:3 0 372.1G 0 disk

Note that the naming of the two PMem devices per socket is not symmetric: they are named pmem0, pmem0.1 on the first socket, and pmem1, pmem1.1 on the second socket.
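To confirm the expected device count without reading the listing by eye, the lsblk output can be filtered and counted. This is a sketch under the naming pattern shown above (pmemN and pmemN.M). count_pmem_devices is a hypothetical helper, and the sample input is the lsblk output from this guide.

```shell
# Hypothetical helper: count lines on stdin whose first field is a pmem
# device name of the form pmemN or pmemN.M.
count_pmem_devices() {
    awk '$1 ~ /^pmem[0-9]+(\.[0-9]+)?$/ { n++ } END { print n + 0 }'
}

# Sample listing copied from this guide. On a real host, pipe in the output of
#   lsblk
sample='NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
pmem0 259:0 0 372.1G 0 disk
pmem1 259:2 0 372.1G 0 disk
pmem0.1 259:1 0 372.1G 0 disk
pmem1.1 259:3 0 372.1G 0 disk'

count=$(printf '%s\n' "$sample" | count_pmem_devices)
echo "Found $count PMem block devices"   # two sockets x two namespaces = 4
```

A count other than four on a dual-socket host indicates that one of the namespace creation commands did not succeed.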

 

This completes the manual creation of the PMem devices, which would normally be performed by the daos_server storage prepare --scm-only command. The PMem block devices can now be used in the daos_server.yml file to configure a total of four engines on two sockets.