FARGOS Development, LLC Utility Library

The FARGOS Development, LLC utility library and affiliated programs implement a variety of performance-sensitive capabilities and some novel functionality. These include routines for ultra-fast conversion of numeric data to-and-from ASCII, thread-safe buffers, O(1) rate counters, Bloom filters, fast tables, the famed IO_Processor, an embedded HTTP server, shared memory variables, unambiguous time representation and fast conversions, and an extremely powerful, hyper-efficient logging facility.

There are two opposing development approaches encountered in practice. One approach, either by design or lack of vision, pursues apparent expediency in the belief that everything will perform acceptably when the individual components are combined. Rarely has an organization been able to compress a pile of less-than-optimal pieces into a shining diamond of robustness and performance, but hope springs eternal.

The other approach is to attempt to have the absolute best performance from each component. In general, the combination of such best-of-breed components retains its good attributes.

The design of this library has pursued the latter approach: if a routine exists in the library, it is intended to be there for performance reasons or to provide novel functionality.


Shared Memory Variable Reference

Overview of Shared Memory Variable Functionality

Several of the FARGOS Development, LLC utilities exploit the SharedMemoryVariable class. A SharedMemoryVariable usually realizes a fundamental primitive type, such as an integer or floating-point number, or a string, but it also exposes the content in a shared memory segment from which it can be viewed or modified by external processes. If the shared memory segment is maintained as a memory-mapped file, the data also survives the termination of the process. Thus the facility provides the ability for non-intrusive monitoring, adjustments on the fly, and post-mortem analysis of statistics or the reason why a process was stopped. These benefits are obtained with overhead no greater than a pointer dereference.

The SharedMemoryVariableManager class is responsible for maintaining a collection of SharedMemoryVariables within a memory segment. It is possible to have more than one SharedMemoryVariableManager active within a process, but admittedly this usage is uncommon. There is always at least one SharedMemoryVariableManager object automatically constructed and it has the well-known name of DEFAULT_sharedMemoryVariableManager. Although automatically constructed, it does not allocate any space: the application must initialize this variable appropriately if the facilities are to be used.

As SharedMemoryVariables are constructed, they register themselves with their corresponding SharedMemoryVariableManager, which by default is the previously mentioned DEFAULT_sharedMemoryVariableManager object. If the SharedMemoryVariableManager being used has not yet been initialized, it remembers such registrations and will later revisit each of the SharedMemoryVariables when it has finally been given ownership of a memory segment. This removes any dependency upon order-of-initialization. The SharedMemoryVariableManager can be passed an existing unprepared memory region (via initializeSegment()) or told to construct one on its own (using createAndAttachStandardSegment()). It is also possible to access a previously prepared region by using attachExistingSegment(); this routine is used by external processes, such as smv_query, to access the shared memory segment and manipulate the existing data. There is also a low-level static routine, createSegment(), that can be used to create a memory-mapped file with a name of one's choice. The routine normally used is createAndAttachSegment(), which in turn calls createSegment(). Below is an example of usage in a main() routine, where allocation of the segment is controlled by a Boolean variable makeSMVsegment.

  if (makeSMVsegment) {
      size_t segLen = 4096 * ((enableLog == true) ? 4096 : 16);
      DEFAULT_sharedMemoryVariableManager.createAndAttachStandardSegment(argv[0], &segLen);
  }

There are two forms of SharedMemoryVariables: the smaller-sized class assumes that the corresponding SharedMemoryVariableManager will have been initialized before the SharedMemoryVariable is ever accessed. Note that "accessed" is not the same as "registered". The SharedMemoryVariable may be created and registered before the SharedMemoryVariableManager has memory available, but it cannot be read from or written to until the memory segment has been configured. This type of variable is appropriate when the developer knows the application will have created a segment as part of the program's startup. In contrast, it is not a good choice for use in libraries, where the developer has no control over the choices made by the user of the library. In those scenarios, the standalone subclass variants of SharedMemoryVariable should be used. They do consume more storage in the regular memory address space (usually 8 bytes per variable, or the maximum possible string length), but none in the shared memory segment. The significant benefit is that they are unaffected by the absence of an actual shared memory segment or early usage before the segment has been allocated.

Note: if in doubt, use the standalone variants. The standalone variants also have one extra capability that their regular counterparts lack: they can be initialized with a value.

Numeric Shared Memory Variables

For numeric types, the subclasses of interest are either SMV_Numeric<> or the standalone variant SMV_StandaloneNumeric<>. The template argument is the type of the variable. So to declare a simple counter, one simply does:

SMV_StandaloneNumeric<uint32_t> a32BitCounter("nameInSegment");

It can sometimes be convenient to use a macro to declare variables if the text name in the shared memory segment is to be the same as the variable name itself, as illustrated by this example:

#define DECLARE_SMV_COUNTER(name) SMV_Numeric<uint32_t> name(#name)
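
For instance, a hypothetical counter named packetsDropped could then be declared with a single line; the macro expansion gives the shared memory entry the same name:

DECLARE_SMV_COUNTER(packetsDropped);   // expands to SMV_Numeric<uint32_t> packetsDropped("packetsDropped")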
 

A numeric SharedMemoryVariable can be used much as a normal variable:

#include <utils/shared_vars/shared_variable.hpp>
#include <iostream>
SMV_StandaloneNumeric<uint32_t> a32BitCounter("counter");
SMV_StandaloneString<128> aString("aString");

int main()
{
   a32BitCounter += 1;
   std::cout << "counter=" << a32BitCounter << std::endl;
   ++a32BitCounter;
   std::cout << "counter=" << a32BitCounter << std::endl;
   std::cout << "counter=" << a32BitCounter++ << std::endl;
   std::cout << "counter=" << ++a32BitCounter << std::endl;
   aString = "hi";
   std::cout << "string=\"" << aString << "\"" << std::endl;
   return (0);
}

yields output similar to:

counter=1
counter=2
counter=2
counter=4
string="hi"

Strings are a slightly more complicated situation because the maximum amount of space for the potential content needs to be allocated ahead of time. The SMV_StandaloneString class and its peer SMV_String take a template argument that specifies the maximum possible length of the string. The first three standard variables illustrated below each allocate a maximum of 128 bytes for their respective string content:

   SMV_StandaloneString<128> stopReason("stopReason");
   SMV_StandaloneString<128> adminStopReason("adminStopReason");
   SMV_StandaloneString<128> adminProcessLabel("adminProcessLabel");
   SMV_StandaloneNumeric<uint32_t> OMEstopFlag("stopFlag");
 
Recommended Standard Variables for Process Administration

The description of recommended administrative variables is provided later in this document.

Hierarchical Shared Memory Variable Names

Each SharedMemoryVariable has a simple name, which can lead to collisions if one wants to create a collection of objects and track the statistics of each individual object. The prototypical example is a set of counters on a price feed: one would like to be able to distinguish between packets being received on feed A from those being received on feed B. Rather than waste space with long variable names, a SharedMemoryVariable can be parented under a SharedMemoryVariableNode object. For the purposes of naming, the SharedMemoryVariableNode acts much like a directory.

The SharedMemoryVariableNode can itself be parented under another SharedMemoryVariableNode, which allows for arbitrary nesting. This tends to lead to class structures that appear similar to this fragment from IO_Processor_Statistics:

class IO_Processor_Statistics {
   public:
   SharedMemoryVariableNode        namingNode;
   // normally exposed as statistics variables
   SMV_StandaloneNumeric<uint64_t> bytesRead;          // updated by consume routine
   SMV_StandaloneNumeric<uint64_t> bytesProcessed;     // updated by processBlock routine
   // normally exposed as statistics variables
   SMV_StandaloneNumeric<uint32_t> packetsRead;        // updated by consume routine
   SMV_StandaloneNumeric<uint32_t> packetsProcessed;   // updated by process routine
   SMV_StandaloneNumeric<uint32_t> productiveReadSpins;// updated by recvRoutine
   SMV_StandaloneNumeric<uint32_t> maxReadTimeout;     // adjustable, used by recvRoutine
   //! max unproductive read attempts before thread blocks
   SMV_StandaloneNumeric<uint32_t> readAttemptsBeforeBlocking; // adjustable, used by recvRoutine
 IO_Processor_Statistics(SharedMemoryVariableNode *parentNode) :
   namingNode("io_stats", parentNode),
   bytesRead("bytesRead", &namingNode),
   bytesProcessed("bytesProcessed", &namingNode),
   packetsRead("packetsRead", &namingNode), 
   packetsProcessed("packetsProcessed", &namingNode),
   productiveReadSpins("productiveReadSpins", &namingNode),
   maxReadTimeout("maxReadTimeout", &namingNode),
   readAttemptsBeforeBlocking("readsBeforeBlock", &namingNode)
   {
   }
};
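
As a usage sketch (the feed names and the use of a null parent pointer for a top-level node are illustrative assumptions, not verbatim library usage), two feed handlers could each be given their own naming node so that their identically named counters remain distinct:

SharedMemoryVariableNode feedA("feedA", nullptr);  // assumption: a null parent places the node at the top level
SharedMemoryVariableNode feedB("feedB", nullptr);
IO_Processor_Statistics statsA(&feedA);  // counters appear under feedA/io_stats/...
IO_Processor_Statistics statsB(&feedB);  // counters appear under feedB/io_stats/...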

Caution: SharedMemoryVariables should not be used in scenarios where the variables are repeatedly allocated and then deleted, as most of the BufferRegion managers do not handle deallocation.

Shared Memory Variable Utilities

smv_query

The smv_query utility is able to access a shared memory segment and display its content, or modify the content.

$ smv_query --help
usage: smv_query [-a | {-t -n -v} | -q} [-z] [--info] -f fileName [-f fileName] ... [varName[=val]] ...
-t = output type label
-n = output variable name
-v = output value
-a = output all (i.e., type, name, value)
-q = quiet, no extraneous information displayed
-z = do not output records with 0 or null string values
--info = display segment statistics
-f fileName = specifies file name of a segment
If a filename ends in ".smv", the -f option flag can be omitted
Variable names act as filters.
Modification of an existing value is possible using the assignment syntax
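
For example, a graceful stop could be requested from a process that honors the stopFlag convention (described later in this document) by assigning to the variable directly; the file name below is illustrative:

$ smv_query -f ./myAppName-20200715_003327-development-67246.smv stopFlag=1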

smvq

There is a convenience cover script, smvq, that is easier to use for many cases because it will automatically find the memory-mapped file of interest (using recent_stats):

$ smvq --help
usage: smvq [smv_query arguments] [-l N] [-component componentName] [{-date YYYYMMDD} | -today] [-host hostName] [-p dir] [-user userName] ..
 or smvq [smv_query arguments] {[-f] fileName} ...

An example using the -z flag, which ignores any values which are zero or the null string:

$ smvq -z
smv_query -f ./vista-20200709_045042-raspberrypi-14991.smv -z
/stdout/io_stats/bytesRead = uint64 5953
/stdout/io_stats/bytesProcessed = uint64 5953
/stdout/io_stats/packetsRead = uint32 25
/stdout/io_stats/packetsProcessed = uint32 25
/stderr/io_stats/bytesRead = uint64 1513
/stderr/io_stats/bytesProcessed = uint64 1513
/stderr/io_stats/packetsRead = uint32 10
/stderr/io_stats/packetsProcessed = uint32 10
/vista/vista_logMask = uint64 960
nameSpaceTotal = uint32 3
methodTotal = uint32 955
totalObjectsCreated = uint32 18508
totalObjectsDeleted = uint32 18460
totalThreadsDelayed = uint32 89534
totalThreadsAllowed = uint32 32
totalThreadsCreated = uint32 250161
totalThreadsDeleted = uint32 250146
slicesOnKernelThread-0 = uint32 573506
cpusAvailable = uint32 4
minWorkForMultiprocessing = uint32 3
maximumCPUs = uint32 128
totalTimeEventCalls = uint32 211946
totalWaitForIOcalls = uint32 211946
/oil2/oil2_logMask = uint64 960
IOmaxReadBuffer = uint32 65536
IOtruncReadBuffer = uint32 79185
classTotal = uint32 135
IOmaxVectors = uint32 128
IOtotalDescriptorsCreated = uint32 61
IOtotalDescriptorsDeleted = uint32 52
IOtotalFilesCreated = uint32 34
IOtotalFilesDeleted = uint32 29
IOtotalFileReads = uint32 33716
IOtotalFileWriteVectors = uint32 17
IOtotalFileSelectRead = uint32 33687
IOtotalSocketsCreated = uint32 27
IOtotalSocketsDeleted = uint32 23
IOtotalSocketsAccepted = uint32 23
IOtotalSocketRecvs = uint32 33
IOtotalSocketWriteVectors = uint32 230645
IOtotalSocketSelectRead = uint32 59
IOtotalSocketSelectWrite = uint32 78576
totalArrayDeepCopies = uint32 33932
totalSetDeepCopies = uint32 44
totalSetFastCompares = uint32 31
totalAssocDeepCopies = uint32 194
totalOIDsCreated = uint32 71035
totalOIDsDeleted = uint32 70971
processID = uint32 14991
bootTime = uint64 851775245
millisecondsUp = uint64 69118983
elapsedCPUticks = uint64 1327095961768
hostName = text "raspberrypi.shado.fargos.net"
debugFlag = uint32 128
stopReason = text "unexpected termination"
adminProcessLabel = text "FARGOSwebServer"
vista_major_version = uint32 8
vista_minor_version = uint32 2
vista_release_version = uint32 2
vista_os = text "Linux"
vista_cpu = text "aarch64"
app_logMask = uint64 960
io_logMask = uint64 896
coalesceIOVEC = uint32 35
slicesOnKernelThread-1 = uint32 147
slicesOnKernelThread-2 = uint32 70
slicesOnKernelThread-3 = uint32 22

recent_stats

There is a low-level utility script, recent_stats, which can be used to locate the most recent SMV file associated with an application. It is used by smvq and invokeAndRestart, for example, but can also be used by monitoring/management scripts that do not want to use smvq directly.

$ recent_stats --help
usage: [-c component] [{-d YYYYMMDD} | -today | -nodate] [{-t HHMMSS} | -notime] [{-host hostName} | -nohost] [-suffix fileSuffix=.smv] [{-p searchDir} ...] [-n maxWanted]

Recommended Administrative Variables

In contrast to the previously presented content, which described available functionality, this section prescribes policy: conventions intended to provide structure that yields several benefits, although they can be ignored if those benefits are truly viewed as not worthwhile.

As displayed previously in the Recommended Standard Variables for Process Administration figure, there are a few strongly recommended variables that each process should maintain.

adminProcessLabel
This is a text string, usually set from a ‑‑label argument passed on the command line. This allows the administrator to inject text that uniquely identifies the running process, both making it easier to spot in process status listings and making it far easier to determine which instance of a process was associated with the content of a specific memory-mapped file.
stopReason
This variable holds text to describe why a process was terminated. It is often preloaded with the string "unexpected termination", which will leave that value in place if the process crashes unexpectedly.
adminStopReason
Similar to stopReason, the adminStopReason variable is intended to be set by an external entity to describe why a process was stopped. A typical scenario is when a monitored process needs to be restarted to update its configuration and is intentionally stopped by the system administrator. In this situation, it is less than optimal for warnings to be triggered, with a resulting flood of messages to support staff and other relevant parties. The invokeAndRestart script has support for examining the adminStopReason variable when it reports on a terminated process.
stopFlag
This integer variable is used to provoke a graceful termination of a process. Many programs are structured so that they are performing a work loop at a high level. Checking the value of stopFlag at regular intervals and terminating gracefully if its value is nonzero allows a program to be terminated using a utility like smvq or stopProcess without being interrupted by a signal at an inopportune moment.

With these variables in place, typical usage will be similar to the fragments below. A signal handler is created to set the stopFlag and record that a signal was received in programStopReason:

#ifndef _WIN32
static void requestStop(int sig, siginfo_t *info, void *extra)
{
        switch (sig) {
        case SIGINT:
               programStopReason = "INT signal received";
               break;
        case SIGTERM:
                programStopReason = "TERM signal received";
                break;
        default:
                programStopReason = "signal received";
                break;
        }
        stopFlag = 1;
}
#endif

In main(), the argv list is examined for the ‑‑label argument and adminProcessLabel is set, programStopReason is given its initial value of "unexpected termination", and then a signal handler is installed to alter stopFlag:

    for (int i = 1; i < argc; i += 1) {
        if (commandOptionMatches("label", argv[i], true)) {
            adminProcessLabel = argv[i+1];
            i += 1;
            continue;
        }
        // ...
    }

    programStopReason = "unexpected termination";
#ifndef _WIN32
    struct sigaction stopSigData;
    stopSigData.sa_sigaction = requestStop;
    stopSigData.sa_flags = SA_SIGINFO;
    sigemptyset(&(stopSigData.sa_mask));
    sigaction(SIGTERM, &stopSigData, nullptr);
    sigaction(SIGINT, &stopSigData, nullptr);
#endif
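
With the handler installed, the high-level work loop can simply poll stopFlag at a convenient point. A minimal sketch, where doOneUnitOfWork() is a hypothetical application routine:

    while (stopFlag == 0) {   // reads the shared memory variable; external tools can set it nonzero
        doOneUnitOfWork();    // hypothetical: one bounded iteration of the application's work
    }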

The program might be doing some activity that causes it to intentionally terminate, such as reaching the end of the data stream it was processing. In that case, programStopReason gets updated to indicate the termination is expected:

if (OME_EXPECT_FALSE(orderedProcessor->isQueueEmpty())) {
        stopFlag = 1;
        LOG_CERR(info) << "EOF reached on data files" << LOG_ENDLINE;
        programStopReason = "Normal Exit";
        orderedProcessor->processWorkInProgress(stopFlag.getAddress());
}

Near the end of main(), an announcement as to why the program is intentionally terminating is output:

    LOG_CERR(info) << "Program is terminating, reason=\"" << programStopReason << "\"";
    const char *admin = (const char *) adminStopReason;
    if (admin[0] != '\0') { // not null
        logObj << " adminReason=\"" << admin << "\"";
    }
    logObj << LOG_ENDLINE;

While the examples above show LOG_CERR() being used, the termination announcements are normally written to the log file using something like LOG_WHEN(). Production services usually log to both places.

Process Invocation and Termination

invokeAndRestart

The invokeAndRestart convenience script is able to launch programs under the control of a variety of monitoring and profiling utilities, watch for program termination and inform relevant parties. It comes close to being a do-all cover script for starting long-running service daemons. It makes use of the stopReason, adminStopReason and adminProcessLabel variables when generating termination reports.

$ invokeAndRestart --help
usage: invokeAndRestart [{--OPTIONS [--delay secs] [--attempts count] [--logdir dirName] [--name adminName] [--[no]syslog] [--symlink [prefix]] [--uniquedir [prefix]] [{--strace | --ltrace | --gdb | --gdbserver | --valgrind | --callgrind | --drd | -dhat | --helgrind | --massif}] --command] cmd [arg ...]
The following are recognized between --OPTIONS and --command:
  --logdir - specifies the root log directory.  Defaults to ./
  --delay - specifies seconds between restart attempts. Defaults to 13
  --attempts - specifies distinct number of invocation attempts.  Defaults to 1
  --name - specifies log file name prefix; normally inferred from executable name
  --merge - if present, merges standard out and error into single file.
  --symlink - enables automatically maintaining generic name for most recent log
  --syslog - enables sending unexpected termination notices via syslog
  --uniquedir - enables keeping log files in a unique directory per invocation
  This is useful when the program dumps unqualified file names into
  the current working directory
  The following are mutually exclusive. They are recognized and stripped
  without the need for the --OPTIONS prefix:
  --strace - start via strace to trace system calls
  --ltrace - start via ltrace to trace library calls
  --gdb - start in the debugger
  --undergdb - start under control of the debugger
  --gdbserver - start using gdbserver mode to enable remote debugger
  NOTE: valgrind-related modes are selected with the mutually exclusive options
  below, but --gdbserver can also be used with the valgrind variants
  --valgrind - selects memcheck mode
  --callgrind - selects callgrind mode for function profiling
    Best used with kcachegrind for visualization
  --massif - selects massif mode to identify storage allocation hotspots
  --cachegrind - selects cachegrind mode for cache profiling (not frequently used)
  --drd - selects data race detector mode
  --dhat - selects dynamic heap analysis mode
  --helgrind - selects thread error detection mode
  See: http://www.valgrind.org/docs/manual/manual-core-adv.html#manual-core-adv.gdbserver-gdb
  Environment Variables:
  LIBC_FATAL_STDERR_ - default is 1. Set to 0 to inhibit the forcing of libc
    errors to standard error rather than the TTY (not a great idea)
  CORE_LIMIT_LIMIT - default is "unlimited". Set to 0 to inhibit creation of
    core files. [NOTE: stopProcess automatically inhibits core file creation]
  INVOKE_AVOIDS_SYSLOG - default is 1. Set to 0 to default to sending syslog
    messages after program termination. Overridden by --syslog or --nosyslog

stopProcess

Along with invokeAndRestart, there is a stopProcess command that makes use of the shared memory variable segments. It will set adminStopReason with the value provided by the ‑‑reason argument.

$ stopProcess --help
  usage: stopProcess [--display] [--schedstat] [{{--reason reasonMessage} | --eod | --halt | --test | --restart}] [--within microseconds] {{{-p pid} ...} | [{--name name}] [{--label adminLabel}]} ...
  --display just lists process ids
  --eod = End-Of-Day convenience reason
  --halt = halt of system convenience reason
  --test = testing system convenience reason
  --restart = restart convenience reason
  --schedstat = record per-task statistics before termination
  --name NAME = selects process by NAME
  --p PID = selects process by process id PID
  --label = selects process by administrative label
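
For example, using the administrative label set earlier via the ‑‑label command line option, an operator might stop a process for a configuration change with an invocation like:

$ stopProcess --reason "configuration update" --label FARGOSwebServer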

FARGOS Development Logging API Reference

Overview of the FARGOS Development Logging Facility

The FARGOS Development Logging API (henceforth referred to as "the Logging API") has been designed to be easy to use, yet extremely efficient and able to handle a variety of use cases with several novel capabilities. It can be overwhelming at first glance due to the rich amount of functionality, so it is easier to start by examining the high-level functionality before investigating the lower-level internals. There is some influence from the trace facility in IBM's RSCT product; that particular feature was oriented towards deployments in supercomputer clusters where administrators wanted to be guaranteed that log files would never exceed a certain size. Those trace files were also binary files, which used message catalogs to provide the templates needed to decode each recorded log line. While this does yield a significant reduction in the size of the log files, it is also fair to remember that the "I" in IBM stands for International and they were well aware that not everyone used English as the universal language. The main downside of such an approach is that an external utility is needed to render the traces readable.

Not to belabor the point, but the Logging API can truly be used in several different ways, some of which are not obvious. Because an IO_Processor can be used to separate the preparation of log messages from their actual storage or transmission, an application thread can be prevented from blocking on I/O (this can be enabled using the +logthread command line option). The use of memory-mapped files to back auxiliary data structures allows log messages that had not yet been written to be recovered (the extremely useful invokeAndRestart cover script will automatically use log_util to do such recovery). There are a plethora of convenience functions built upon very sophisticated, thread-aware facilities that do their best to minimize data movement and CPU cycles. It has even been used to perform inter-application communication between processes, for example to announce trade activity, without requiring data to be moved between address spaces.

In the presentation below, the phrase "by default" will appear frequently. In general, this means that the default is what most people want, but there exists a mechanism by which it can be overridden for an exotic special case. The Logging API provides functionality first-and-foremost, but there is a good deal of convenient policy layered on top of it. If required, one can usually bypass the conventions, access the very lowest layers, and implement alternate behavior.

Components and Log Levels

The Logging API presented here uses the concept of components to implement one form of granularity of control over logging output. The other form of control is logging levels, discussed later. Two components are always defined: one is named "app" and the other is named "io". The "app" component name is intended to be used in files associated with the application and "io" is normally associated with I/O streams or used by lazy authors of library components.

Note: best practice is to always create an appropriate component name for a library element rather than fall back on the catch-all "io". One should normally never use "app" in a library component, as it would take away the application writer's intended ability to use it as they see fit, although it can be done.

For convenience, interfaces are provided to many routines that assume a source file is associated with a given component, which eliminates the burden of having to specify the component of interest on every logging-related call. The THIS_COMPONENT macro is used to set the name of the desired default component for a given source file.

Note: never set the THIS_COMPONENT macro in a header file; if you think that is needed, you are probably using it incorrectly.

Declaring and Defining a Custom Component

Note: for the purposes of this discussion, the name customComponent will be used. Without loss of generality, it would be replaced with whatever name the developer desires. The name chosen must be suitable for use as a variable name—it is not a text string.

A new component must both be declared and defined—these are two distinct items. The declaration of a component is normally performed in a header file, which might subsequently be included by many other source files. The definition of a component is only performed once and thus should appear within a single source file.

Declaring a Component

The simplest way to declare a component is to use the DECLARE_STANDARD_COMPONENT_VARS() macro in the relevant header file:

#include <utils/logging/logging_api.hpp>

DECLARE_STANDARD_COMPONENT_VARS(customComponent);

and use DEFINE_STANDARD_COMPONENT_VARS(customComponent) and AUTO_REGISTER_COMPONENT(customComponent) in the single source file chosen to implement it:

#include <utils/logging/logging_api.hpp>

DEFINE_STANDARD_COMPONENT_VARS(customComponent);
AUTO_REGISTER_COMPONENT(customComponent);

The AUTO_REGISTER_COMPONENT(customComponent) macro registers the name of the component and the corresponding log mask variable with the command line option parser, enabling the +logmask and ‑logmask options to be handled.

Rather than have customComponent duplicated in several places, it can be preferable to define THIS_COMPONENT first and then use DEFINE_STANDARD_VARS_FOR_THIS_COMPONENT() and AUTO_REGISTER_THIS_COMPONENT(). This approach makes it easy to change the component name simply by altering the value of THIS_COMPONENT.

#define THIS_COMPONENT customComponent
DEFINE_STANDARD_VARS_FOR_THIS_COMPONENT();
AUTO_REGISTER_THIS_COMPONENT();

Remember that the component is only defined in one-and-only-one source file. The remaining source files in a typical library would include the header that declared the component, and simply define THIS_COMPONENT as customComponent. Thus a complete example using the LOG_CERR() macro to output on standard error would be:

#include <utils/logging/logging_api.hpp>

#define THIS_COMPONENT customComponent
DEFINE_STANDARD_VARS_FOR_THIS_COMPONENT();
AUTO_REGISTER_THIS_COMPONENT();

void aLibraryRoutine()
{
   LOG_CERR(error) << "Something went wrong!" << LOG_ENDLINE;
}

Logging Levels

The other unit of granularity provided by the Logging API is that of a logging level. By default, log levels are treated as being orthogonal rather than representing a strict hierarchy. Enabling or disabling a distinct logging level does not have any effect on the others. By default, the following level names are automatically declared:

trace
Intended for low-level tracing activity, such as function entry/exit. Normally inhibited.
debug
More detailed log information that would normally be inhibited.
info
Regular information, normally always output. One very notable exception is that of the io component; it is normally configured to have information messages inhibited.
warn
Warning messages, unexpected states that appear to be tolerable. Normally always output.
error
Error messages, states that indicate a serious problem. Normally always output.
fatal
Unrecoverable error or condition. Normally always output.
user
User-defined messages. This can get very complicated with the ability to define new level names, introduce a strict hierarchy of levels, etc.

Note: It is possible to make use of a conventional logging hierarchy and there are a range of userInfo1-userInfo6 and userDebug1-userDebug6 levels that are also defined for this purpose. Discussion on their usage is deferred until later.

Normally, output for the process is saved into a particular log file. Some sites like to make use of syslog to pick up and forward exceptional conditions, so there are several logging levels that trigger special-case treatment: the content can simultaneously be logged to the log file, written to standard error, or submitted to the syslog daemon, as illustrated in the sketch after the list below.

warningAndSyslog, errorAndSyslog, fatalAndSyslog
Send the text to both the log file and the syslog daemon.
warningAndStderr, errorAndStderr, fatalAndStderr
Send the text to both the log file and the standard error stream.
warningAndBroadcast, errorAndBroadcast, fatalAndBroadcast
Send the text to the log file, the syslog daemon, and the standard error stream.
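
A minimal sketch of how one of these levels might be used, assuming the special-case level names can be passed to the same macros as the standard levels (the message text is illustrative):

LOG_WHEN(errorAndSyslog) << "market data feed disconnected, retries exhausted" << LOG_ENDLINE;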

Initializing the Logging System and Creating a Log File

Command Line Arguments

Many of the logging facility's options can be manipulated via arguments passed in on the command line. If a program has options of its own to process, it can be convenient to first strip all of the logging-related options out of argv and adjust the count in argc using a call to LogMessageRecord::process_commandline_log_flags().

int main(int argc, const char *argv[])
{
    // parse and strip log-related arguments
    LogMessageRecord::process_commandline_log_flags(&default_LogSubsystemInfo,
        argc, argv, &argc, argv);

If the call to LogMessageRecord::process_commandline_log_flags() is not made, the log-related arguments will still be processed by an eventual call to LogManager::initializeLogSubsystem(). The PROCESS_COMMAND_LINE_FLAGS() convenience macro performs the call to LogMessageRecord::process_command_line().

Initializing the Logging System

While not mandatory, the DEFAULT_sharedMemoryVariableManager is normally initialized so that external control of the logging facility can be performed using commands like setLogMask. The logging system itself is initialized by a call to LogManager::initializeLogSubsystem(); this is done only once.

Logging Output Streams

A logging output stream is owned by a LogManager object. There are two LogManager objects created automatically: LogManager::STDOUT_LogManager is associated with standard output and LogManager::STDERR_LogManager is associated with standard error. The simplest way to access the two is via LOG_COUT(logLevel) and LOG_CERR(logLevel), respectively. For some applications, these may be all that are required, but many will want to have at least one log file of their own. There is provision for one log file to be considered the default log; this is maintained by the LogManager::DEFAULT_LogManager variable. It is possible to have many active log files, but that requires using APIs that do not assume the use of the default log file. Such usage is not particularly burdensome. For example, the implementation of LOG_CERR() is merely:

#define LOG_CERR(lvl) LOG_INTO(LogManager::STDERR_LogManager, lvl)

As previously noted, the THIS_COMPONENT macro is not intended to be defined within a header file, which poses an issue for inlined functions realized in header files. In those scenarios, the developer should fall back on specifying the appropriate component, often with one of the LOG_COMPONENT_ variants. This is illustrated in the fragment below, which uses the app component:

  LOG_COMPONENT_CERR(app,error) << "Segment is only " << segmentLen <<
      " but at least " << minSize <<
      " bytes would be required for file to hold " << totalRecords <<
      " of length " << metaRecordLength << LOG_ENDLINE;
  

A new log file is normally opened using LogManager::newStandardLogFile(), but a LogManager can be associated with an already open file descriptor.

Putting it all together leads to code similar to the following:

  LOG_COUT(info) << argv[0] << " begins execution [stdout]" << LOG_ENDLINE;
  LOG_CERR(info) << argv[0] << " begins execution [stderr]" << LOG_ENDLINE;
  {
      size_t segLen = 4096 * 4096; // pages to allocate for all variables...
      // Initialize default manager
      DEFAULT_sharedMemoryVariableManager.createAndAttachStandardSegment("myAppName", &segLen);
      LogManager::initializeLogSubsystem(&argc, argv, argc, argv); // parse and strip log-related arguments
      LogManager::newStandardLogFile("myAppName", FILENAME_WITH_EVERYTHING);
      LOG_WHEN(info) << argv[0] << " begins execution" << LOG_ENDLINE;
  }
  

Note: the usage of LOG_COUT(info), LOG_CERR(info) and LOG_WHEN(info) to announce the startup of the process is recommended best practice for production services. These help enable the content of the log files to be programmatically synchronized and merged if one needs to correlate activity that was unfortunately logged across different output streams; the mergeLogs utility is one example of an application that can be used for that purpose.

Using the Default Log Stream

As shown in the initialization example, LOG_WHEN() can be used to log output to the default LogManager; however, authors of library routines should not assume this will be initialized—the application author may have concluded they did not want a separate log file. Instead, IFLOG_WHEN() provides the additional check that the default log manager was created.

Often a particular log message will always be wanted, so having the ability to disable it provides no benefit or may actually be detrimental. As an example of the latter, an application that is creating an audit trail would be well advised to not permit the disabling of such log messages. For those scenarios, LOG_ALWAYS() and IFLOG_ALWAYS() will provide the desired functionality. In general, LOG_WHEN()/IFLOG_WHEN() are preferred to LOG_ALWAYS()/IFLOG_ALWAYS() based on the principle that the LOG_ALWAYS() variants take away functionality (i.e., the ability to inhibit the output of a message); however, developers should always be guided by the needs of the application.
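
A brief sketch contrasting the two (entryCount and orderId are hypothetical application variables):

// library diagnostic: emitted only if the default log exists and debug is enabled
IFLOG_WHEN(debug) << "cache warmed, entries=" << entryCount << LOG_ENDLINE;
// audit trail entry: emitted whenever the default log exists, regardless of the log mask
IFLOG_ALWAYS() << "order accepted, id=" << orderId << LOG_ENDLINE;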

Log Masks

Usually log output can be inhibited, although there are intentional exceptions such as the use of LOG_ALWAYS() mentioned above. Using the previously presented examples, our program will have produced initial output similar to:

$ myApp
2020/07/15 00:32:08.942220 139950130459840 t.cpp:20 INFO myApp begins execution [stdout]
2020/07/15 00:32:08.942259 139950130459840 t.cpp:21 INFO myApp begins execution [stderr]

It can be seen from the output above that the default output format prefixes each line of output with the date, time, thread id, source file name and line number, and message level. Each of these fields can be disabled by setting the logPrefixMask associated with the respective LogManager, such as via LogManager::setLogPrefixMask() or at initialization time via LogManager::initializeLogSubsystem(). Inhibiting the output of such information has its place, but best practice is to retain all of these fields, as information that was never generated can never be recovered. The mergeLogs utility has the ability to strip out any of those fields, so it is usually a better idea to filter a log file through mergeLogs rather than never generate the information in the first place.

Note: the filename that will be visible is influenced by the build system in use. In most projects, it is far better to provide a complete path name to the source file so that the file can be conveniently located afterwards. As a concrete example, using vim's gF sequence while the cursor is positioned within the filename:lineNumber block can open up the relevant source file at the point where the log message was generated. If the build system in use only feeds the compiler the name of the source file bereft of any directories in the path name, the information will not be available for output. The Logging API is sandbox-aware and will automatically strip the prefixes to the src directories of the current sandbox and associated backing trees. The GCOV_PREFIX_STRIP environment variable is also honored.

The prototype application under discussion creates a log file and shared memory variable segment with names that indicate the application, time of invocation, host upon which it executed, and process identifier. The default naming convention is designed to permit all log files to be collected in a single location without fear of name collisions. Examination of the shared memory variable segment shows:

$ smvq -z
smv_query -f ./myAppName-20200715_003327-development-67246.smv -z
/stdout/io_stats/bytesRead = uint64 159
/stdout/io_stats/bytesProcessed = uint64 159
/stdout/io_stats/packetsRead = uint32 2
/stdout/io_stats/packetsProcessed = uint32 2
/stderr/io_stats/bytesRead = uint64 89
/stderr/io_stats/bytesProcessed = uint64 89
/stderr/io_stats/packetsRead = uint32 1
/stderr/io_stats/packetsProcessed = uint32 1
/customComponent/customComponent_logMask = uint64 960
app_logMask = uint64 960
io_logMask = uint64 896
coalesceIOVEC = uint32 4
/LOG/myAppName/io_stats/bytesRead = uint64 80
/LOG/myAppName/io_stats/bytesProcessed = uint64 80
/LOG/myAppName/io_stats/packetsRead = uint32 1
/LOG/myAppName/io_stats/packetsProcessed = uint32 1

The customComponent_logMask, app_logMask and io_logMask variables expose the current mask for their respective components. These can be easily interpreted by the setLogMask command:

$ setLogMask
current customComponent logMask enables info warn error fatal
current app logMask enables info warn error fatal
current io logMask enables warn error fatal

One could disable informational messages for the customComponent by doing:

$ setLogMask customComponent -info
now customComponent logMask enables warn error fatal

Log masks can be enabled or disabled on an application's command line using the +logmask or ‑logmask options, respectively. A log mask is named using the convention [componentName.]levelName, where the initial componentName and period separator are optional; omitting them affects all components rather than naming a single one. Given the previous example, this invocation would generate no output because the informational messages were disabled:

$ myApp -logmask customComponent.info

Note: any expression associated with a conditional message is not evaluated if the message is inhibited. These lines result in a different value for intVal depending on whether or not +logmask customComponent.debug was specified on the command line:

 int intVal = 0;
 LOG_COUT(debug) << "val debug=" << ++intVal << LOG_ENDLINE;
 LOG_COUT(info) << "val info=" << intVal << LOG_ENDLINE;

Without it, one sees:

2020/07/15 14:47:01.608056 139623495235392 t.cpp:41 INFO val info=0

With +logmask customComponent.debug, one sees:

2020/07/15 14:47:32.520996 140024273564864 t.cpp:40 DEBUG val debug=1
2020/07/15 14:47:32.521005 140024273564864 t.cpp:41 INFO val info=1

In general, one should not take advantage of side-effects within conditional messages.

Using mergeLogs

The mergeLogs utility has the ability to follow files (somewhat similar to tail ‑f), watch directories for the appearance of new files that match patterns, and can efficiently merge the content of multiple files so that their lines are appropriately ordered by time. It can also skew the time in a file, which can be helpful if files from different hosts are to be merged and their clocks were not synchronized closely enough. As a trivial example, the three output streams generated by the prototype application will correctly be merged:

$ mergeLogs myAppName-20200715_003*67426.log stderr stdout
2020/07/15 00:33:27.772206 140374653130560 t.cpp:20 INFO a.out begins execution [stdout]
2020/07/15 00:33:27.772249 140374653130560 t.cpp:21 INFO a.out begins execution [stderr]
2020/07/15 00:33:27.788870 140374653130560 t.cpp:15 INFO Hello, world

Hint: skewing the clock can be used for more than just minor adjustments. For example, shifting the time of one file by 24 hours can allow comparison of behavior on different days.
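
Assuming the skew is specified in nanoseconds, as the usage text below indicates, shifting the records of one day's file forward by 24 hours might look like (file names are illustrative):

$ mergeLogs --skew 86400000000000 yesterday.log today.log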

$ mergeLogs --help
usage: mergeLogs [-deftTSF] [--interval seconds] { [--date {YYYYMMDD|YYYY/MM/DD}] [--dir dirName] [--skew nanoseconds] [--reorder N] {{--watch pattern}| {[--seek offset] fileName}} } ...
  -d = strip date
  -e = strip error level
  -f = strip filename
  -t = strip thread Id
  -T = strip timestamp
  -S = generate symbolic thread Ids (table dumped at end)
  -F or --follow = follow files
 NOTE: -deft strips all but time
  --stdin recognized as filename
  --date specifies default date for subsequent files
  --dir specifies directory for subsequent relative file names
  --watch = specifies pattern for files that may appear (used with --follow)
  --skew shifts logical record times in next file
  --seek skips indicated number of bytes in next file
  -i or --interval = specifies interval used to check for new content when -F flag was specified
  --reorder specifies the maximum distance between out-of-order lines in a log file

Logging Threads

One of the problems with traditional log output is corruption in the face of multiple threads attempting to generate output simultaneously. The Logging API addresses this issue by protecting each line of output from interference from another thread; however, this is done in such a way as to not cause a thread to block waiting for another to finish generating its output line. The default output format also includes the thread identifier.

Under some circumstances, it may be viewed as important to not permit a thread to be delayed waiting for its output to be committed to stable storage. Because the Logging API uses the IO_Processor class, the actual processing of the log messages can be performed by a background thread. This feature can be enabled via the +logthread command line option. Having a background output thread increases the possibility for there to be uncommitted log messages if the application crashes unexpectedly. The +logmap command line option exposes the internal state via an auxiliary memory-mapped file that will survive a crash. The unprocessed content, if any, can be retrieved using the log_util program; this is something done automatically by the super-useful invokeAndRestart cover script.

Logging Message Style

In small organizations, one often finds that no formal structure on messages has been imposed. That flexibility usually disappears in organizations that are producing products for an international customer base. For example, a minimal convention would be to enforce the output of a unique message identifier for every line of output. A classic convention was to use a 3-or-4-character prefix corresponding to a particular product or component and a 3-or-4-digit number allocated within that particular product offering (e.g., "XGM028"). One benefit is that the message would be readily identifiable regardless of the locale of the end user.

Small organizations doing bespoke work for themselves will typically forgo such formal structure. There is still a great benefit to be obtained from structuring the message content so that it can be parsed programmatically. It is strongly suggested that one adopt a style of outputting attribute=value elements, because this convention goes a long way towards making log output machine-readable. A program such as parseAttrVal can take such content and format and display it in a variety of ways, or even convert it into a CSV file.
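
As a sketch of the attribute=value style (the event name and variables are illustrative):

LOG_WHEN(info) << "event=fillReport symbol=" << symbol << " qty=" << quantity << " px=" << price << LOG_ENDLINE;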

$ parseAttrVal --help
usage: parseAttrVal [{‑‑csv | ‑‑flat | ‑‑perrow | ‑‑tree}] [‑‑path [n]] [‑‑separator C] [‑‑equals C] [{{‑‑quote C} | ‑‑noquote}] [{‑‑all | ‑‑struct | ‑‑ignore C}] [{‑‑findfirstvar C | ‑‑nofindfirstvar}] [‑‑noheader] [‑‑date] [‑‑time] {[‑f] fileName} ...]
   If no file specified, reads from stdin
  --csv = output as CSV file
  --flat = output as attribute/value pairs
  --perrow = flat, field separator is newline
  --tree = output as tree
  --json = output as JSON
  --jsonin = input is JSON (see json2line filter)
  --path = output variable name as full absolute path
  --path N = output variable with at most N elements
  --separator C = use C as field separator
  --lastseparator C = use C as field separator after last element
  --groupseparator C = use C as field separator after end of group
  --equals C = use C as name / value separator
  --quote C = use C as quote character around text values
  --noquote = do not quote text fields
  --ignore C = skip leading characters until first 'C'
  --all = skip no leading characters
  --structs = skip leading characters until first '['
  --findfirstvar C = start with first apparent attribute name found using assignment character C
  --nofindfirstvar = disable search for first attribute name
  --skip N = skip N characters before beginning parse
  --noheader = do not include CSV header line or variable names
  --date = also output date from log file line
  --time = also output time from log file line
  Default is: --csv --structs --findfirstvar = --path

More Efficient Logging

The examples presented to this point have all made use of some form of operator<<() with LOG_COUT(), LOG_CERR(), and LOG_WHEN(). This usage gives the appearance that items are being formatted and appended to a buffer, which will subsequently be written to a file descriptor. That is not really what is occurring. One of the paramount requirements for performance is to minimize data movement. The default output mechanism (and, yes, this means it can be overridden when there is a special case need) uses writev() or the equivalent to copy relevant data directly from the source without the need for an intermediate copy. This is illustrated by the strace output below:

753546 writev(1, [{iov_base="2020/07/15 13:43:39.268352 14040"..., iov_len=43},
 {iov_base="t.cpp", iov_len=5},
 {iov_base=":", iov_len=1},
 {iov_base="17", iov_len=2},
 {iov_base=" ", iov_len=1},
 {iov_base="INFO", iov_len=4},
 {iov_base=" ", iov_len=1},
 {iov_base="Hello, world\n", iov_len=13}], 8) = 70

This also has a positive effect on the data caches: frequently referenced, constant data, such as the filename, will already be present, and the avoidance of unnecessary writes helps inhibit the invalidation of cache lines and the corresponding interference it causes with other processors. There are a few convenience macros that allow one to pass blocks of memory that are to be given special treatment. These include: AS_TEXT_BUFFER(), AS_ESCAPED_TEXT_BUFFER(), AS_BINARY_BUFFER(), AS_HEXADECIMAL_BUFFER(). Thus a line like:

LOG_COUT(info) << "hex=" << AS_HEXADECIMAL_BUFFER("0123", 4) << LOG_ENDLINE;

yields output like:

hex=30313233

Developers required to handle internationalization will be well-aware that the convenience mechanisms shown so far are not a great match: there is no concept of a message template nor positional arguments. One would have to render the message first, then output it.

The Logging API's lower-level interfaces allow one to specify a message template and provide the corresponding arguments. The most commonly used are EMIT_CONDITIONAL_MESSAGE() and its unconditional equivalent ALWAYS_EMIT_MESSAGE(). For example:

    EMIT_CONDITIONAL_MESSAGE(info);
        LOG_MESSAGE_FORMAT("arg1=%d arg2=%d hex=%s\n");
        LOG_ARG(11);
        LOG_ARG(22);
        LOG_ARG(AS_HEXADECIMAL_BUFFER("0123", 4));
    AS_TIMESTAMPED_TEXT;

yields output into the log file similar to the line below:

2020/07/15 14:33:57.347014 140050054018880 t.cpp:39 INFO arg1=11 arg2=22 hex=30313233

The AS_TIMESTAMPED_TEXT clause above triggers the actual rendering of the content, while minimizing any data movement. Best-of-breed routines, such as uint_to_ascii() and float_to_ascii(), are used internally to convert native numeric types into ASCII characters. Time information is represented as a NanosecondTimePoint and converted with the aid of one or more TimePointCache objects, which, among other things, avoids repeated kernel calls to stat() on /etc/localtime.

There is an AS_BINARY_RECORD clause that can be used in scenarios where the data should not be rendered into ASCII. There are two common reasons for choosing to do this:

  1. Maximizing performance by eliminating computation and data movement: the minimal raw data is simply saved into the log file and no computation to render it into ASCII is performed.
  2. As a machine-readable interchange format.

Note that space savings is not one of the listed items. While there are scenarios where a binary log record is smaller than the corresponding textual output, this is definitely not something to be guaranteed. There are some opportunities for savings, such as the binary timestamp only requiring 8 bytes while the formatted date and time typically use up 27 characters and thread identifiers tend to be fairly long as well. But when a log line is finally rendered into ASCII, a lot of additional semantic information can be discarded, such as whether a data element was to be viewed as an integer, floating-point number, string, converted to hexadecimal, etc. Those attributes are retained in the binary log so that the data can be properly rendered later and thus do consume space.

The previous example could be reworked to become:

    EMIT_CONDITIONAL_MESSAGE(info);
        LOG_MESSAGE_FORMAT("arg1=%d arg2=%d hex=%s\n");
        LOG_ARG(11);
        LOG_ARG(22);
        LOG_ARG(AS_HEXADECIMAL_BUFFER("0123", 4));
    AS_BINARY_RECORD;

and the resulting log file would look similar to this hexadecimal output:

00000000: 314c ff00 9800 0000 7dbb 4006 8c0b 2216  1L......}.@...".
00000010: 4057 fa05 527f 0000 2a00 0000 4000 0000  @W..R...*...@...
00000020: 0300 0000 3000 0000 3600 0000 5000 0000  ....0...6...P...
00000030: 742e 6370 7000 6172 6731 3d25 6420 6172  t.cpp.arg1=%d ar
00000040: 6732 3d25 6420 6865 783d 2573 0a00 0000  g2=%d hex=%s....
00000050: 0400 0000 0400 0000 0800 0000 1000 0000  ................
00000060: 0b00 0000 0000 0000 0400 0000 0400 0000  ................
00000070: 0800 0000 1000 0000 1600 0000 0000 0000  ................
00000080: 9800 0000 0400 0000 0800 0000 1000 0000  ................
00000090: 3031 3233 0000 0000                      0123....

Feeding the file to mergeLogs will render it into ASCII:

$ mergeLogs *.log
2020/07/15 17:51:54.070809 139990264338240 t.cpp:42 INFO arg1=11 arg2=22 hex=30313233

Specialized Output Routines

The SPLIT_ACROSS_LINES() macro is a convenience cover for the LogMessageRecord::splitLongLogLine() routine. The AS_TEXT_BUFFER() macro can be used to denote a block of text for output that is not null-terminated. The near-identical AS_BINARY_BUFFER() macro can be used to denote a block of binary data. For situations in which the binary data should be displayed as hexadecimal characters, the AS_HEXADECIMAL_BUFFER() macro has the ability to generate hexadecimal output in either upper or lowercase characters. For binary records, only the original data is recorded and the expansion to hexadecimal occurs at the time of the final rendering. The related convenience macro AS_ESCAPED_TEXT_BUFFER() works much like AS_TEXT_BUFFER() but adds the capability to replace characters on output with a sequence of zero or more alternative characters, usually representing an escape sequence. The corresponding operator<<() for LogMessageRecord is clever enough to maximize the number of contiguous bytes that are added to the LogMessageRecord.
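
For example, a sketch of logging a non-null-terminated region received off the wire (recvBuffer and bytesReceived are hypothetical; the pointer-then-length argument order mirrors the AS_HEXADECIMAL_BUFFER() example shown earlier):

LOG_COUT(info) << "payload=\"" << AS_TEXT_BUFFER(recvBuffer, bytesReceived) << "\"" << LOG_ENDLINE;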

The LOG_STREAM_OBJECT() macro can be used to create a temporary LogManager object and associated LogMessageRecord that act as a std::ostringstream. The built-up contents of the buffer can be transferred using LOG_APPEND_BUFFER().

The FORMAT_LOG_BUFFER() macro can be used in a fashion somewhat similar to sprintf(); it formats the contents of a LogMessageRecord as text and returns a char pointer to the resulting string. The hidden scope that is introduced needs to be closed by the RELEASE_LOG_BUFFER() macro.

Custom Log Levels

A great deal of utility can be realized by the standard orthogonal log levels such as trace, debug, warn, error, etc. There are, however, scenarios in which a developer desires to create their own log levels. One reason is to declare an alternate log level name for a component, for example, assigning new component-specific labels for trace and debug log levels. A more complicated scenario involves developing a hierarchy of user-defined log levels, where enabling a high-level log level implies all of the log levels beneath it are also enabled; this in contrast to the normal orthogonal treatment of the standard log levels.

The setLogMask command recognizes user0 through user6 as denoting 7 orthogonal user-defined logging levels. It also recognizes level1 through level6 as denoting 6 hierarchical logging levels.

$ setLogMask ‑‑help
usage: setLogMask [‑‑value] [‑q] [‑c logComponentName] [logMaskName[,logMask2[,..]]] {+|‑}{trace|debug|info|warn|error|fatal|user0...user6|level1...level6} ... [smvq args]
  userN identifies user-specific equivalence class
  levelN identifies a user-specific level that includes all prior levels
  --value ignores current value and sets specified value
  -q inhibits displaying final value as expanded attributes
  -c NAME allows selecting most recent instance of a particular application
    For example:  -c app-instance1
 A comma-separated list of logMask names can be specified.
 If none are specified, all found in the segment are addressed

The class AutoRegisterLogLevelName provides a mechanism to automatically call LogMessageRecord::define_commandline_flag() to register a new logging level name. The DEFINE_COMPONENT_USER_LOG_MASK() macro is the high-level interface for defining custom log levels; it trivially combines the DECLARE_COMPONENT_USER_LOG_MASK() and AUTO_REGISTER_USER_LOG_MASK() macros and would be used in a single source file. The DECLARE_COMPONENT_USER_LOG_MASK() macro can be used in header files to permit the use of the log mask across multiple source files.


Self-Describing Meta Data

Many applications have need of a form of meta data to describe items of interest. From the finance world, a plethora of examples arise from all of the product listings provided by the exchanges. These can range from very simple cases, amounting to not much more than a table of fixed-size records with all fields filled in, to more complex scenarios that include optional fields, such as would be the case with multi-legged options.

An expensive way to approach these situations is to write custom programs that convert the data provided by a third party into some internal format to be used by a program of interest. In general, one expends the effort on deciphering the external data format and coercing it into a format more suitable for use by the application. A typical approach is to write a custom parser that consumes the format, decodes it, and populates internal records. A common deployment scenario arises when the external data file holds every possible record of interest, and multiple processes have been deployed on a server, each with an interest in a distinct subset, leading to a lot of redundant read activity. This can exacerbate the thundering herd effect as well as increase the amount of time required for recovery from an unexpected crash.

An alternative is to coerce the data into a form that can be directly mapped read-only into the address space of an application. The read-only aspect allows the same data to be shared among multiple processes. If the data is appropriately constructed, records can be found using binary search rather than a linear scan from the start of the file to the end. Numeric fields that may have been presented as ASCII text or some ad-hoc representation (CBOE fixed-point numbers come to mind) can be converted into the expected native representation.

The UniversalMetaData facilities described here support the immediate loading and direct use of the data, as well as making the format self-describing, which permits it to be parsed and interpreted by programs that were written before the new data format was known.

The convertCSVtoMetaData Program

Without loss of generality, the usage of these facilities will be motivated using a Comma-Separated-Value (CSV) file as the prototypical external data format. These are easy for a human to construct via a spreadsheet program like Excel, or they can be generated programmatically.

The convertCSVtoMetaData program can be used to consume a CSV-style file and turn it into a self-describing meta data file.

$ convertCSVtoMetaData --help
usage: convertCSVtoMetaData [‑‑cpp] [‑‑dense] [‑‑reorder] [‑o outputFile] [‑‑name structName] [‑‑magic magicNumber] [{‑‑delete srcColumn} ...] [{‑‑rename srcColumn destColumnName} ...] {[‑i] fileName} [...]
    --stdout recognized as output file name
    --cpp emits C++ header describing format
    --dense emits potentially unaligned data
    --reorder reorganizes record layout for best alignment
    --name allows structure name to be assigned, defaults to CSVmeta
    --magic allows assignment of magic number, defaults to METADATA
    --rename allows columns to be renamed
    --delete allows columns to be removed

The convertCSVtoMetaData program requires that the first line of the incoming CSV file provide the column headings. By default, these will be used as the field names, but they can be renamed using the ‑‑rename option or deleted altogether using the ‑‑delete option, as illustrated below.
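
For example, a hypothetical invocation that renames the anInt column to productId, drops the aBigSignedInt column, and writes the result to products.metadata (the new column name, structure name, and output file name here are purely illustrative) could look like:

$ convertCSVtoMetaData --name Product --rename anInt productId --delete aBigSignedInt -o products.metadata example.csv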

The convertCSVtoMetaData utility tries to be clever and infer the appropriate type for each output field. Field types that it detects currently include:

strings
Anything that does not appear to be a form of number
uint32
Unsigned integer that does not require a 64-bit representation
uint64
Unsigned integer that is too large to be represented with 32 bits
int32
Signed integer that does not require a 64-bit representation
int64
Signed integer that is too large to be represented with 32 bits
fixed-point
A decimal number that can be represented as 64-bit + precision value
double
A real number too large to be represented as a fixed-point value

With simple content, the desired result should be obtained with no additional work required. The next set of illustrations uses the following CSV file, named "example.csv":

Name,anInt,aSignedInt,aBigInt,aBigSignedInt,aFixedPoint
Row1,1,-1,5000000001,-5000000001,1.23
Row2,2,-2,5000000002,-5000000002,2.34

Invoking convertCSVtoMetaData example.csv yields a corresponding example.metadata file with content similar to that shown in the hexdump below:

00000000: 4d45 5441 4441 5441 1818 0600 3000 0000  METADATA....0...
00000010: 0200 0000 0000 0000 0112 0000 0800 0000  ................
00000020: 4e61 6d65 0000 0000 0000 0000 0000 0000  Name............
00000030: 0105 0800 0400 0000 616e 496e 7400 0000  ........anInt...
00000040: 0000 0000 0000 0000 0104 0c00 0400 0000  ................
00000050: 6153 6967 6e65 6449 6e74 0000 0000 0000  aSignedInt......
00000060: 0107 1000 0800 0000 6142 6967 496e 7400  ........aBigInt.
00000070: 0000 0000 0000 0000 0106 1800 0800 0000  ................
00000080: 6142 6967 5369 676e 6564 496e 7400 0000  aBigSignedInt...
00000090: 0109 2000 1000 0000 6146 6978 6564 506f  .. .....aFixedPo
000000a0: 696e 7400 0000 0000 526f 7731 0000 0000  int.....Row1....
000000b0: 0100 0000 ffff ffff 01f2 052a 0100 0000  ...........*....
000000c0: ff0d fad5 feff ffff 7b00 0000 0000 0000  ........{.......
000000d0: 0200 0000 0000 0000 526f 7732 0000 0000  ........Row2....
000000e0: 0200 0000 feff ffff 02f2 052a 0100 0000  ...........*....
000000f0: fe0d fad5 feff ffff ea00 0000 0000 0000  ................
00000100: 0200 0000 0000 0000                      ........

While this content is intentionally designed to be used directly by applications, it is not particularly easy for humans to read, nor is it convenient to use with conventional shell scripts. The dump_metadata utility can be used to render the format into a variety of textual forms. Note that the dump_metadata utility has no knowledge of the structure and content of the original CSV file; one of the features of the self-describing meta data format is that it provides the necessary information.

$ dump_metadata --help
usage: dump_metadata [{--struct|--csv|--info|--cpp}] [{[-f] fileName} ...]
--struct outputs in format suitable for processing by parseAttrVal
--csv outputs as a comma-separated value file
--info outputs a text summary of the field names and types
--cpp outputs a prototype structure describing the field layout
Default is to dump type, name, value on individual lines
with blank lines between records

The default output format displays the records with each field on a distinct line:

$ dump_metadata example.metadata
string Name = "Row1"
uint32_t anInt = 1
int32_t aSignedInt = -1
uint64_t aBigInt = 5000000001
int64_t aBigSignedInt = -5000000001
fixed aFixedPoint = 1.23
string Name = "Row2"
uint32_t anInt = 2
int32_t aSignedInt = -2
uint64_t aBigInt = 5000000002
int64_t aBigSignedInt = -5000000002
fixed aFixedPoint = 2.34

The original CSV file can be reconstituted using the ‑‑csv option:

$ dump_metadata --csv example.metadata
Name,anInt,aSignedInt,aBigInt,aBigSignedInt,aFixedPoint
"Row1",1,-1,5000000001,-5000000001,1.23
"Row2",2,-2,5000000002,-5000000002,2.34

A dense attribute/value structure format, capable of being parsed by parseAttrVal, will be output with the ‑‑struct option:

$ dump_metadata --struct example.metadata
[METADATA={ Name="Row1" anInt=1 aSignedInt=-1 aBigInt=5000000001 aBigSignedInt=-5000000001 aFixedPoint=1.23 }] [METADATA={ Name="Row2" anInt=2 aSignedInt=-2 aBigInt=5000000002 aBigSignedInt=-5000000002 aFixedPoint=2.34 }]

An example of feeding the output into parseAttrVal:

$ dump_metadata --struct example.metadata | parseAttrVal --tree
METADATA = {
Name = "Row1"
anInt = 1
aSignedInt = -1
aBigInt = 5000000001
aBigSignedInt = -5000000001
aFixedPoint = 1.23
}
METADATA = {
Name = "Row2"
anInt = 2
aSignedInt = -2
aBigInt = 5000000002
aBigSignedInt = -5000000002
aFixedPoint = 2.34
}

The ‑‑info option provides a summary of the record format:

$ dump_metadata --info example.metadata
   string Name
   uint32_t anInt
   int32_t aSignedInt
   uint64_t aBigInt
   int64_t aBigSignedInt
   fixed aFixedPoint
Record Length: 103
   Total fields: 6

Finally, the ‑‑cpp option outputs a struct declaration that corresponds to the field layout.

struct METADATA {
    char Name[8];
    uint32_t anInt;
    int32_t aSignedInt;
    uint64_t aBigInt;
    int64_t aBigSignedInt;
    fixed aFixedPoint;
};

All of the above were obtained without the need to write any custom parsers. With a little reflection, developers should realize that this provides a foundation upon which file formats can be extended without breaking already deployed applications. This can be critical in production environments where not every user of the data can be upgraded at the same time.

Optional Fields

The previous example had content that was derived from a CSV file for which each record was expected to have the same set of fields. Although a field might be zero-valued or a null string, it was still expected to be present. Not all records have the luxury of such a fixed collection of fields, so the self-describing meta data format supports the concept of optional field groups. By default, fields that are always intended to be present are treated as members of what will be called group 1. Subsequent groups (2, 3, etc.) identify optional collections of fields. Members of a particular field group are either all present or all absent. Members of a field group, however, are not required to be contiguous.

The convertCSVtoMetaData utility supports conditional field groups, but the header of the CSV file needs to be appropriately marked and each row of the CSV file is responsible for supplying content in the relevant columns. An optional field is identified in the CSV file by prefixing its name with "?" followed by the group id. As an illustration, the following CSV file defines two optional field groups:

"Name",anInt,"aSignedInt",aBigInt,aBigSignedInt,aFixedPoint,?2opField2a,?2optField2b,?3optField3
Row1,1,-1,5000000001,-5000000001,1.23
Row2,2,-2,5000000002,-5000000002,2.34,"2a","2b"
Row1,3,-3,5000000003,-5000000003,3.45,"3a","3b","3c"

This yields meta data records similar to:

$ dump_metadata --struct example2.metadata | parseAttrVal --tree
METADATA = {
Name = "Row1"
anInt = 1
aSignedInt = -1
aBigInt = 5000000001
aBigSignedInt = -5000000001
aFixedPoint = 1.23
}
METADATA = {
Name = "Row2"
anInt = 2
aSignedInt = -2
aBigInt = 5000000002
aBigSignedInt = -5000000002
aFixedPoint = 2.34
opField2a = "2a"
optField2b = "2b"
}
METADATA = {
Name = "Row1"
anInt = 3
aSignedInt = -3
aBigInt = 5000000003
aBigSignedInt = -5000000003
aFixedPoint = 3.45
opField2a = "3a"
optField2b = "3b"
optField3 = "3c"
}

Loading a Meta Data File

As the previously presented examples have shown, the self-describing meta data format enables quite a bit of functionality without requiring a new application to be written that is aware of the actual file format; however, more functionality is obtained by using the meta data APIs within an application. The interface for a particular meta data format is described by an instance of the templated MetaDataLoaderForFormat<RECORD_FORMAT> class. The RECORD_FORMAT is used to identify the implementation of three required data structures:

fieldDescriptionTable
Describes the field names and respective format for the record; this will be described in detail later.
totalFields
The total number of fields defined by the fieldDescriptionTable.
magicNumber
Defines the magic number associated with the file format.

The MetaDataLoaderForFormat<RECORD_FORMAT>::loadMetaDataHeader() routine is the lowest-level interface and has the following signature:

        static const UniversalMetaData_ReferenceFileHeader *loadMetaDataHeader(
            const char *fileName,
            const RECORD_CLASS **resultTable,
            unsigned char **retSegmentBase,
            size_t *retSegmentLen, size_t *retDataOffset)
    

For simple fixed-length records, the contents of the memory-mapped file can be treated as a read-only array of RECORD_FORMAT records with a base address of resultTable. When feasible, it can be optimal to just access the desired data directly from the array; however, this does impose a significant constraint: the RECORD_FORMAT must correspond byte-for-byte with the actual format of the meta data file. This precludes changing the file format without adjusting the definition of RECORD_FORMAT and recompiling the application. There are definitely deployment scenarios that will find such one-to-one alignment between application and file format acceptable, but it does impose a point of fragility if the file format needs to evolve.
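
As a minimal sketch of this direct-access style (not taken from the library's own examples), the following assumes that the CSVmeta_MetaDataRecord structure generated later in this document matches the file layout byte-for-byte, and that the fixed-length records are packed from the returned data offset to the end of the segment; the record-count computation is an illustrative assumption rather than a documented property. Class and parameter usage follow the signature quoted above.

const CSVmeta_MetaDataRecord *records = nullptr;
unsigned char *segmentBase = nullptr;
size_t segmentLen = 0;
size_t dataOffset = 0;

const UniversalMetaData_ReferenceFileHeader *header =
    MetaDataLoaderForFormat<CSVmeta_MetaDataRecord>::loadMetaDataHeader(
        "example.metadata", &records, &segmentBase, &segmentLen, &dataOffset);
if (header != nullptr) {
    // assumed: fixed-length records occupy the segment from dataOffset onward
    size_t totalRecords = (segmentLen - dataOffset) / sizeof(CSVmeta_MetaDataRecord);
    for (size_t i = 0; i < totalRecords; i += 1) {
        // fields can be read in place, e.g., records[i].anInt
    }
}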

In contrast, the templated loadAndConvertMetaData<>() routine provides a powerful facility to process and transform meta data files into new formats. The two template arguments are used to specify the output format and the source format. Because the meta data format is self-describing, the source format defaults to GENERIC_META_RECORD. One downside of using the default GENERIC_META_RECORD is that it does not specify a magic number, which intentionally inhibits the safety check for matching magic numbers.

template <typename TO_FORMAT, typename FROM_FORMAT=GENERIC_META_RECORD>
    TO_FORMAT *loadAndConvertMetaData(const char *fileName, uint32_t *recTotal,
        ConvertAndTransferFieldFP transferFunction=defaultConvertAndTransferField,
        void *userData=nullptr)
    

The loadAndConvertMetaData<>() function traverses the content of the meta data file and passes it to a function of type ConvertAndTransferFieldFP. There is a defaultConvertAndTransferField() function that can be used in many scenarios: given the presence of a field with a given name F, the defaultConvertAndTransferField() function will copy the content into a corresponding field of the same name F in the destination record, performing any required type conversion. For example, this permits the incoming data to be specified as a text string and converted as needed into numeric (integer or real) data.

There are two distinct scenarios to be addressed. One is when the meta data file has more fields than the TO_FORMAT supports; in this case, the content of the unrecognized fields is ignored. The other scenario is when the TO_FORMAT has more fields than the source data in the meta data file; in this case, the extra fields are normally left untouched, but they can be marked so that they are zero-filled as needed. If this feature is not exploited, the application should make provision to zero-fill them or assign useful default values when it creates each instance of a TO_FORMAT record.
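
As an illustrative sketch using the record type and file from the earlier examples, converting example.metadata into an array of CSVmeta_MetaDataRecord structures with the default transfer function might look like the following; the lifetime and release of the returned array are not addressed here.

uint32_t recordCount = 0;
// FROM_FORMAT defaults to GENERIC_META_RECORD, so only the destination format is named
CSVmeta_MetaDataRecord *records =
    loadAndConvertMetaData<CSVmeta_MetaDataRecord>("example.metadata", &recordCount);
if (records != nullptr) {
    for (uint32_t i = 0; i < recordCount; i += 1) {
        // every field present in both the file and CSVmeta_MetaDataRecord has been
        // copied (and converted as needed) by defaultConvertAndTransferField()
    }
}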

Describing the Layout of a Meta Data File

The MetaDataLoaderForFormat<>::fieldDescriptionTable is an array of UniversalMetaData_FieldDescription records. These describe the field group identifier, type, offset, length and name of each field in a meta data file's record. The convenience macro DESCRIBE_NAMED_FIELD_OF_CLASS() can be used to populate UniversalMetaData_FieldDescription records for regular fields and DESCRIBE_OPTIONAL_NAMED_FIELD_OF_CLASS() can be used to do the same for optional fields. The variant DESCRIBE_OPTIONAL_OR_ZERO_FILLED_NAMED_FIELD_OF_CLASS() adds the capability to zero-fill a destination field if no data was present in the source record.

There are simpler variants DESCRIBE_FIELD_OF_CLASS(), DESCRIBE_OPTIONAL_FIELD_OF_CLASS(), and DESCRIBE_OPTIONAL_OR_ZERO_FILLED_FIELD_OF_CLASS() that name the field using the name of the member variable.

Normally, a field description table is hand-crafted from careful reading of the file format's specification, but the convertCSVtoMetaData utility supports a ‑‑cpp option that will generate a C++ header and UniversalMetaData_FieldDescriptionTable:

/* WARNING: machine generated source created by convertCSVtoMetaData */
#ifndef _META_CSVmeta_HPP_
#define _META_CSVmeta_HPP_ "$Id: FARGOSutilsLibrary.html 505 2023-02-15 02:11:32Z geoff $"

#include <utils/metadata/UniversalMetaData.hpp>

struct CSVmeta_MetaDataRecord {
    char                             Name[8];
    uint32_t                         anInt;
    int32_t                          aSignedInt;
    uint64_t                         aBigInt;
    int64_t                          aBigSignedInt;
    FixedPointValue                  aFixedPoint;
}; /* end CSVmeta_MetaDataRecord */

template <typename STREAMTYPE> STREAMTYPE & operator<<(STREAMTYPE &os, const CSVmeta_MetaDataRecord &arg)
{
    os << "[CSVmeta_MetaDataRecord={";
    os << " Name=\"" << AS_TEXT_BUFFER(arg.Name, sizeof(arg.Name)) << "\"";
    os << " anInt=" << arg.anInt;
    os << " aSignedInt=" << arg.aSignedInt;
    os << " aBigInt=" << arg.aBigInt;
    os << " aBigSignedInt=" << arg.aBigSignedInt;
    os << " aFixedPoint=" << arg.aFixedPoint;
    os << " }]";
    return (os);
}

const struct UniversalMetaData_FieldDescription CSVmeta_fields[] = {
    DESCRIBE_FIELD_OF_CLASS(Name, SMV_TYPE_STRING, CSVmeta_MetaDataRecord),
    DESCRIBE_FIELD_OF_CLASS(anInt, SMV_TYPE_UINT32, CSVmeta_MetaDataRecord),
    DESCRIBE_FIELD_OF_CLASS(aSignedInt, SMV_TYPE_INT32, CSVmeta_MetaDataRecord),
    DESCRIBE_FIELD_OF_CLASS(aBigInt, SMV_TYPE_UINT64, CSVmeta_MetaDataRecord),
    DESCRIBE_FIELD_OF_CLASS(aBigSignedInt, SMV_TYPE_INT64, CSVmeta_MetaDataRecord),
    DESCRIBE_FIELD_OF_CLASS(aFixedPoint, SMV_TYPE_FIXED, CSVmeta_MetaDataRecord),
    {0, 0, 0, 0, {}, {}} // end of table marker
};

// link field descriptions to transformation filters
template<> const UniversalMetaData_FieldDescription *MetaDataForFeed<CSVmeta_MetaDataRecord>::fieldDescriptionTable = CSVmeta_fields;
template<> const uint32_t MetaDataForFeed<CSVmeta_MetaDataRecord>::totalFields = 6;
template<> const char MetaDataForFeed<CSVmeta_MetaDataRecord>::magicNumber[sizeof(((UniversalMetaData_ReferenceFileHeader *)(nullptr))->magicNumber) + 1] = {"METADATA"};

#endif