New Logging and Error macros.

Now DAOS is a much larger established codebase it’s become apparent that the original logging macros are insufficient, and inconsistently used across the code, leading to poor user experience and difficulty in reporting or diagnosing issues.

The most commonly used debug levels are:

  • ERROR - This is an unexpected error, something that should not happen as it's either unexpected or a API violation.

  • WARNING - This is used much less and typically in cases where the code can take an alternative path and continue.

  • INFO - This is used for startup messages, configuration options applied etc. This will typically be quiet after the process has started but there are exceptions.

  • DEBUG - These are off by default, can be verbose and log progress through iterations. Various areas of the code are differently 'chatty' or verbose. DEBUG level logs are really intended for developers so are not covered by these new macros, they do not typically use the DF_RC macro and do not need to - it adds extra text to the source which is distracting when working on debugging or development.

The coding standards have always said that error logs should include the value of rc, and that it should be at the end of the line however this has been inconsistency applied. The addition of DF_RC to include errors strings as well as numbers was welcome but didn’t address the inconsistency issue, and means that the error messages take up more space in the source for largely boiler-plate text, in addition only some areas of the codebase have adopted them.

Some effort has been made to check for consistency in error output and looking for a standard format, however the depends on having tests that provoke error paths which is time consuming (although productive) and the fixes have involved adding boiler-plate text to the source - improving the quality of error reports for users, but making the code itself larger and harder to read.

The new error macros resolve this issue by taking the rc value as an option to the macro itself, this allows the boiler-plate text to go in the macro and the have it apply the formatting - other printf formatting can still be used. No whitespace, formatting or newlines should be included at the end of the new macros, this will be added automatically.

Preferred:

DL_ERROR(rc, "failed to create handle hash table"); DL_ERROR(rc, "Failed to replay %s", act_opc2str(act->ac_opc));

Old macros:

D_ERROR("failed to create handle hash table: "DF_RC"\n", DP_RC(rc)); D_ERROR("Failed to replay %s, "DF_RC"\n", act_opc2str(act->ac_opc), DP_RC(rc));

New macros, ERROR is shown here but they are all repeated for INFO and WARN.

  • DL_ERROR: Logs with an RC as enum daos_errno value

  • DHL_ERROR: Logs with an RC as enum daos_error value, and associated pointer/descriptor.

  • DS_ERROR: Logs with an RC as errno value

  • DHS_ERROR: Logs with an RC errno value and associated pointer/descriptor.

The system S variants should be used when a system error number is used, typically in the posix code or when interacting with the base OS.

The handle H variants should be used when there is an associated descriptor with which the error is associated, for example a rpc descriptor or hash table etc. For now these just include the pointer value in the logs after the function name although in the future we plan to be able to do more with this. We have the potential to auto-format messages based on the descriptor itself, for example include uuids/ranks etc.

DHS_INFO(dfuse_info, rc, "duns_resolve_path() on path");

Continuous integration checks.

As the new macros do not tolerate a newline at the end of the string but the old ones did (one will be added automatically if missing) and standard practice is to install one then there’s an obvious danger that developers will include one incorrectly, as such there is a new CI check to detect this and warn about them.

Logging without RC value

For cases where a RC value is not available there is no changes as a result of the new macros.

Further work

There are other macros used throughout the codebase for other reasons, the DFUSE macros are largely redundant as a result of these changes RPC_ERROR is similar to DL_ERROR however has some additional features which are not present yet, VOS_TX_LOG_FAIL also has some unique features which could be useful elsewhere.

Once a proportion of the code are using the new macros we can add metrics for error counts by error, and feed this into telemetry and test results etc. For areas of concern we could choose to proactively migrate code over to make use of this facility.

The use of the _Generic C macro tools mean that data should allow us to log data from and provided descriptor which could further reduce boilerplate code in the source.

Conclusion

Overall migrating to the new macros should improve both the user experience as well as making debugging and working with the codebase easier, the intention is that they should become the go-to macros that developers use when working on code. In time we should extend the tools to require their use for modified code in pull requests and look to making wholesale changes where appropiate.