Self-describing Meta Data

Part of the FARGOS Development, LLC Utility Library

Many applications have need of a form of meta data to describe items of interest. From the finance world, a plethora of examples arise from all of the product listings provided by the exchanges. These can be very simple, amounting to not much more than a table of fixed-sized records with all fields filled in, to more complex scenarios that include optional fields, such as would be the case with multi-legged options.

An expensive way to approach these situations is to write custom programs that convert the data provided by a third party into some internal format used by the program of interest. In general, the effort goes into deciphering the external data format and coercing it into a form more suitable for the application; a typical approach is to write a custom parser that consumes the external format, decodes it and populates internal records. A common deployment scenario arises when the external data file holds every possible record of interest and multiple processes have been deployed on a server, each interested in a distinct subset, which leads to a lot of redundant read activity. This can exacerbate the thundering herd effect as well as increase the time required to recover from an unexpected crash.

An alternative is to coerce the data into a form that can be directly mapped read-only into the address space of an application. The read-only aspect allows the same data to be shared among multiple processes. If the data is appropriately constructed, records can be found using binary search rather than a linear stream from start-to-end-of-file. Numeric fields that may have been presented as ASCII text or some ad-hoc representation (CBOE fixed point numbers come to mind) can be converted into the expected native representation.
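
As a rough illustration of the lookup described above, the following sketch performs a binary search over a read-only array of fixed-size records. The ProductRecord layout and the findProduct() helper are hypothetical and not part of the library; in a real deployment the base pointer would come from a read-only memory mapping of the file rather than a local array.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical fixed-size record; a real layout would come from the file format.
struct ProductRecord {
    uint64_t productId;   // records are assumed to be sorted by this key
    int64_t  strikePrice;
};

// Binary search over a read-only array of records, such as one obtained by
// mapping a file read-only. Returns nullptr if the key is not present.
inline const ProductRecord *findProduct(const ProductRecord *base,
                                        size_t total, uint64_t wantedId)
{
    assert((base != nullptr) || (total == 0));
    const ProductRecord *end = base + total;
    const ProductRecord *pos = std::lower_bound(base, end, wantedId,
        [](const ProductRecord &rec, uint64_t key) {
            return rec.productId < key;
        });
    return ((pos != end) && (pos->productId == wantedId)) ? pos : nullptr;
}
```

Because the search only reads from the array, several processes can run it concurrently against the same shared mapping without coordination.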

The UniversalMetaData facilities described here support both the immediate loading and direct use of the data, as well as making the format self-describing, which permits it to be parsed and interpreted by programs that were written before the new data format was known.

The convertCSVtoMetaData Program

Without loss of generality, the usage of these facilities will be motivated using a Comma-Separated-Value (CSV) file as the prototypical external data format. Such files are easy for a human to construct with a spreadsheet program like Excel, and they can also be generated programmatically.

The convertCSVtoMetaData program can be used to consume a CSV-style file and turn it into a self-describing meta data file.

$ convertCSVtoMetaData --help
usage: convertCSVtoMetaData [--cpp] [--dense] [--reorder] [-o outputFile] [--name structName] [--magic magicNumber] [{--delete srcColumn} ...] [{--rename srcColumn destColumnName} ...] {[-i] fileName} [...]
    --stdout recognized as output file name
    --cpp emits C++ header describing format
    --dense emits potentially unaligned data
    --reorder reorganizes record layout for best alignment
    --name allows structure name to be assigned, defaults to CSVmeta
    --magic allows assignment of magic number, defaults to METADATA
    --rename allows columns to be renamed
    --delete allows columns to be removed

The convertCSVtoMetaData program requires that the first line of the incoming CSV file provides the column headings. By default, these will be used as the field names, but they can be renamed using the --rename option or deleted altogether using the --delete option.

The convertCSVtoMetaData utility tries to be clever and infer the appropriate type for each output field. Field types that it detects currently include:

strings: Anything that does not appear to be a form of number
uint32: Unsigned integer that does not require a 64-bit representation
uint64: Unsigned integer that is too large to be represented with 32 bits
int32: Signed integer that does not require a 64-bit representation
int64: Signed integer that is too large to be represented with 32 bits
fixed-point: A decimal number that can be represented as a 64-bit value plus a precision
double: A real number too large to be represented as a fixed-point value
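
The classification above can be approximated per CSV cell as in the following sketch. The inferCellType() function, the InferredType enumeration and the 18-digit cutoff for fixed-point values are assumptions made for illustration; the rules actually applied by convertCSVtoMetaData may differ, and a real converter would also have to reconcile the types seen across all rows of a column.

```cpp
#include <cctype>
#include <cerrno>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <string>

enum class InferredType { String, UInt32, UInt64, Int32, Int64, FixedPoint, Double };

// Classify one CSV cell using hypothetical rules (plain decimal text only).
inline InferredType inferCellType(const std::string &text)
{
    const bool negative = (!text.empty()) && (text[0] == '-');
    size_t digits = 0;
    bool sawPoint = false;
    for (size_t i = negative ? 1 : 0; i < text.size(); ++i) {
        const unsigned char c = static_cast<unsigned char>(text[i]);
        if (std::isdigit(c)) {
            ++digits;
        } else if ((c == '.') && !sawPoint) {
            sawPoint = true;
        } else {
            return InferredType::String; // not a plain decimal number
        }
    }
    if (digits == 0) {
        return InferredType::String; // empty, "-" or "." alone
    }
    if (sawPoint) {
        // Assumption: up to 18 digits fit in a scaled signed 64-bit integer.
        return (digits <= 18) ? InferredType::FixedPoint : InferredType::Double;
    }
    errno = 0;
    if (negative) {
        const long long v = std::strtoll(text.c_str(), nullptr, 10);
        if (errno == ERANGE) return InferredType::Double;
        return (v >= INT32_MIN) ? InferredType::Int32 : InferredType::Int64;
    }
    const unsigned long long v = std::strtoull(text.c_str(), nullptr, 10);
    if (errno == ERANGE) return InferredType::Double;
    return (v <= UINT32_MAX) ? InferredType::UInt32 : InferredType::UInt64;
}
```

Under these rules a cell such as 5000000001 is classified as an unsigned 64-bit integer, while -1 is classified as a signed 32-bit integer.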

With simple content, the desired result should be obtained with no additional work required. The next set of illustrations use the following CSV file, named without loss of generality as "example.csv":

Name,anInt,aSignedInt,aBigInt,aBigSignedInt,aFixedPoint
Row1,1,-1,5000000001,-5000000001,1.23
Row2,2,-2,5000000002,-5000000002,2.34

Invoking convertCSVtoMetaData example.csv yields a corresponding example.metadata file with content similar to that shown in the hexdump below:

00000000: 4d45 5441 4441 5441 1818 0600 3000 0000  METADATA....0...
00000010: 0200 0000 0000 0000 0112 0000 0800 0000  ................
00000020: 4e61 6d65 0000 0000 0000 0000 0000 0000  Name............
00000030: 0105 0800 0400 0000 616e 496e 7400 0000  ........anInt...
00000040: 0000 0000 0000 0000 0104 0c00 0400 0000  ................
00000050: 6153 6967 6e65 6449 6e74 0000 0000 0000  aSignedInt......
00000060: 0107 1000 0800 0000 6142 6967 496e 7400  ........aBigInt.
00000070: 0000 0000 0000 0000 0106 1800 0800 0000  ................
00000080: 6142 6967 5369 676e 6564 496e 7400 0000  aBigSignedInt...
00000090: 0109 2000 1000 0000 6146 6978 6564 506f  .. .....aFixedPo
000000a0: 696e 7400 0000 0000 526f 7731 0000 0000  int.....Row1....
000000b0: 0100 0000 ffff ffff 01f2 052a 0100 0000  ...........*....
000000c0: ff0d fad5 feff ffff 7b00 0000 0000 0000  ........{.......
000000d0: 0200 0000 0000 0000 526f 7732 0000 0000  ........Row2....
000000e0: 0200 0000 feff ffff 02f2 052a 0100 0000  ...........*....
000000f0: fe0d fad5 feff ffff ea00 0000 0000 0000  ................
00000100: 0200 0000 0000 0000                      ........
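
Reading the dump by eye, the first record appears to occupy the 48 bytes at offsets 0xa8 through 0xd7, stored in little-endian order, with the fixed-point field held as a 64-bit value followed by a precision. That interpretation is inferred from this one example rather than taken from a format specification. Under that assumption, the sketch below decodes those bytes by hand; the readLittleEndian() and decodeRecord() helpers and the DecodedRecord structure are hypothetical names.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Assemble a little-endian unsigned integer from 'len' bytes.
inline uint64_t readLittleEndian(const unsigned char *p, size_t len)
{
    uint64_t v = 0;
    for (size_t i = 0; i < len; ++i) {
        v |= static_cast<uint64_t>(p[i]) << (8 * i);
    }
    return v;
}

// The 48 bytes of the first record, copied from offsets 0xa8-0xd7 of the dump.
static const unsigned char firstRecord[48] = {
    0x52, 0x6f, 0x77, 0x31, 0x00, 0x00, 0x00, 0x00, // Name = "Row1"
    0x01, 0x00, 0x00, 0x00,                         // anInt
    0xff, 0xff, 0xff, 0xff,                         // aSignedInt
    0x01, 0xf2, 0x05, 0x2a, 0x01, 0x00, 0x00, 0x00, // aBigInt
    0xff, 0x0d, 0xfa, 0xd5, 0xfe, 0xff, 0xff, 0xff, // aBigSignedInt
    0x7b, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, // aFixedPoint value (123)
    0x02, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00  // aFixedPoint precision (2)
};

struct DecodedRecord {
    std::string name;
    uint32_t    anInt;
    int32_t     aSignedInt;
    uint64_t    aBigInt;
    int64_t     aBigSignedInt;
    int64_t     fixedValue;     // 123 with precision 2 represents 1.23
    uint64_t    fixedPrecision;
};

// Decode one record using the field offsets 0, 8, 12, 16, 24 and 32.
inline DecodedRecord decodeRecord(const unsigned char *rec)
{
    DecodedRecord out;
    size_t nameLen = 0;
    while ((nameLen < 8) && (rec[nameLen] != 0)) {
        ++nameLen; // the name is NUL-padded and may fill all 8 bytes
    }
    out.name.assign(reinterpret_cast<const char *>(rec), nameLen);
    out.anInt = static_cast<uint32_t>(readLittleEndian(rec + 8, 4));
    // Signed values rely on two's complement conversion.
    out.aSignedInt = static_cast<int32_t>(readLittleEndian(rec + 12, 4));
    out.aBigInt = readLittleEndian(rec + 16, 8);
    out.aBigSignedInt = static_cast<int64_t>(readLittleEndian(rec + 24, 8));
    out.fixedValue = static_cast<int64_t>(readLittleEndian(rec + 32, 8));
    out.fixedPrecision = readLittleEndian(rec + 40, 8);
    return out;
}
```

An application that maps the file would not normally decode byte by byte like this; the sketch only makes the correspondence between the dump and the CSV values explicit.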

While this content is intentionally designed to be used directly by applications, it is neither easy for humans to read nor convenient to use with conventional shell scripts. The dump_metadata utility can be used to render the format into a variety of textual forms. Note that the dump_metadata utility has no knowledge of the structure and content of the original CSV file; one of the features of the self-describing meta data format is that it provides the necessary information.

$ dump_metadata --help
usage: dump_metadata [{--struct|--csv|--info|--cpp}] [{[-f] fileName} ...]
  --struct outputs in format suitable for processing by parseAttrVal
  --csv outputs as a comma-separated value file
  --info outputs a text summary of the field names and types
  --cpp outputs a prototype structure describing the field layout
  Default is to dump type, name, value on individual lines
    with blank lines between records

The default output format displays each record with one field per line:

$ dump_metadata example.metadata
string Name = "Row1"
uint32_t anInt = 1
int32_t aSignedInt = -1
uint64_t aBigInt = 5000000001
int64_t aBigSignedInt = -5000000001
fixed aFixedPoint = 1.23

string Name = "Row2"
uint32_t anInt = 2
int32_t aSignedInt = -2
uint64_t aBigInt = 5000000002
int64_t aBigSignedInt = -5000000002
fixed aFixedPoint = 2.34

The original CSV file can be reconstituted using the --csv option:

$ dump_metadata --csv example.metadata
Name,anInt,aSignedInt,aBigInt,aBigSignedInt,aFixedPoint
"Row1",1,-1,5000000001,-5000000001,1.23
"Row2",2,-2,5000000002,-5000000002,2.34

A dense attribute/value structure format, capable of being parsed by parseAttrVal, will be output with the --struct option:

$ dump_metadata --struct example.metadata
[METADATA={ Name="Row1" anInt=1 aSignedInt=-1 aBigInt=5000000001 aBigSignedInt=-5000000001 aFixedPoint=1.23 }]
[METADATA={ Name="Row2" anInt=2 aSignedInt=-2 aBigInt=5000000002 aBigSignedInt=-5000000002 aFixedPoint=2.34 }]

An example of feeding the output into parseAttrVal:

$ dump_metadata --struct example.metadata | parseAttrVal --tree
METADATA = {
  Name = "Row1"
  anInt = 1
  aSignedInt = -1
  aBigInt = 5000000001
  aBigSignedInt = -5000000001
  aFixedPoint = 1.23
}
METADATA = {
  Name = "Row2"
  anInt = 2
  aSignedInt = -2
  aBigInt = 5000000002
  aBigSignedInt = -5000000002
  aFixedPoint = 2.34
}

The --info option provides a summary of the record format:

$ dump_metadata --info example.metadata
   string Name
   uint32_t anInt
   int32_t aSignedInt
   uint64_t aBigInt
   int64_t aBigSignedInt
   fixed aFixedPoint

Record Length: 103
   Total fields: 6

Finally, the --cpp option outputs a struct declaration that corresponds to the field layout.

struct METADATA {
    char Name[8];
    uint32_t anInt;
    int32_t aSignedInt;
    uint64_t aBigInt;
    int64_t aBigSignedInt;
    fixed aFixedPoint;
};

All of the above were obtained without the need to write any custom parsers. With a little reflection, developers should realize that this provides a foundation upon which file formats can be extended without breaking already deployed applications. This can be critical in production environments where not every user of the data can be upgraded at the same time.

Optional Fields

The previous example had content that was derived from a CSV file for which each record was expected to have the same set of fields. Although a field might be zero-valued or a null string, it was still expected to be present. Not all records have the luxury of such a fixed collection of fields, and the self-describing meta data format supports the concept of optional field groups. By default, fields that are always intended to be present are treated as members of what will be called group 1. Subsequent groups (2, 3, etc.) identify optional collections of fields. Members of a particular field group are either all present or none are. Members of a field group, however, are not required to be contiguous.

The convertCSVtoMetaData utility supports conditional field groups, but the header of the CSV file needs to be appropriately marked, and each row of the CSV file must place its content in the relevant columns. An optional field is identified in the CSV file by prefixing the name with "?" followed by the group Id. As an illustration, the following CSV file defines two optional field groups:

"Name",anInt,"aSignedInt",aBigInt,aBigSignedInt,aFixedPoint,?2opField2a,?2optField2b,?3optField3
Row1,1,-1,5000000001,-5000000001,1.23
Row2,2,-2,5000000002,-5000000002,2.34,"2a","2b"
Row1,3,-3,5000000003,-5000000003,3.45,"3a","3b","3c"

Converting this file (assumed here to have been saved as example2.csv) yields meta data records similar to:

$ dump_metadata --struct example2.metadata | parseAttrVal --tree
METADATA = {
  Name = "Row1"
  anInt = 1
  aSignedInt = -1
  aBigInt = 5000000001
  aBigSignedInt = -5000000001
  aFixedPoint = 1.23
}
METADATA = {
  Name = "Row2"
  anInt = 2
  aSignedInt = -2
  aBigInt = 5000000002
  aBigSignedInt = -5000000002
  aFixedPoint = 2.34
  opField2a = "2a"
  optField2b = "2b"
}
METADATA = {
  Name = "Row1"
  anInt = 3
  aSignedInt = -3
  aBigInt = 5000000003
  aBigSignedInt = -5000000003
  aFixedPoint = 3.45
  opField2a = "3a"
  optField2b = "3b"
  optField3 = "3c"
}
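
The "?" prefix convention for optional columns described above can be sketched as a small heading parser. The ColumnHeading structure and parseColumnHeading() function are hypothetical and only illustrate the naming convention; they are not the utility's implementation. The heading is assumed to have had any surrounding quotes removed already.

```cpp
#include <cctype>
#include <cstddef>
#include <cstdint>
#include <string>

struct ColumnHeading {
    uint32_t    group; // 1 = always present; 2, 3, ... = optional groups
    std::string name;  // field name with any "?<group>" prefix removed
};

// Split a CSV column heading such as "?2optField2b" into group 2 and
// the name "optField2b". Headings without a prefix belong to group 1.
inline ColumnHeading parseColumnHeading(const std::string &heading)
{
    ColumnHeading result{1, heading};
    if ((heading.size() < 2) || (heading[0] != '?')) {
        return result;
    }
    size_t pos = 1;
    uint32_t group = 0;
    while ((pos < heading.size()) &&
           std::isdigit(static_cast<unsigned char>(heading[pos]))) {
        group = (group * 10) + static_cast<uint32_t>(heading[pos] - '0');
        ++pos;
    }
    if (pos == 1) {
        return result; // "?" was not followed by a group Id
    }
    result.group = group;
    result.name = heading.substr(pos);
    return result;
}
```

One consequence of this simple approach is that a field name beginning with a digit would be ambiguous, since its leading digits would be absorbed into the group Id.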

Loading a Meta Data File

As the previously presented examples have shown, the self-describing meta data format enables quite a bit of functionality without requiring a new application that is aware of the actual file format; however, more functionality is obtained by using the meta data APIs within an application. The interface for a particular meta data format is described by an instance of the templated MetaDataLoaderForFormat<RECORD_FORMAT> class. The RECORD_FORMAT is used to identify the implementation of three required data structures:

fieldDescriptionTable: Describes the field names and respective format for the record; this will be described in detail later.
totalFields: The total number of fields defined by the fieldDescriptionTable.
magicNumber: Defines the magic number associated with the file format.

The MetaDataLoaderForFormat<RECORD_FORMAT>::loadMetaDataHeader() routine is the lowest-level interface and has the following signature:

        static const UniversalMetaData_ReferenceFileHeader *loadMetaDataHeader(
            const char *fileName,
            const RECORD_CLASS **resultTable,
            unsigned char **retSegmentBase,
            size_t *retSegmentLen, size_t *retDataOffset)

For simple fixed-length records, the contents of the memory-mapped file can be treated as a read-only array of RECORD_FORMAT records with a base address of resultTable. When feasible, it can be optimal to just access the desired data directly from the array; however, this does impose a significant constraint: the RECORD_FORMAT must correspond byte-for-byte with the actual format of the meta data file. This precludes changing the file format without adjusting the definition of RECORD_FORMAT and recompiling the application. There are definitely deployment scenarios that will find such one-to-one alignment between application and file format acceptable, but it does impose a point of fragility if the file format needs to evolve.
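
One way to make the byte-for-byte constraint described above explicit is to guard the record definition with compile-time checks. The sketch below uses hypothetical stand-in types (ExampleRecord, ExampleFixedPoint) that mirror the layout of the earlier example.csv illustration; the 16-byte fixed-point representation is an assumption based on that example, not the library's definition.

```cpp
#include <cstddef>
#include <cstdint>

// Stand-in for the library's fixed-point type (assumed to be 16 bytes).
struct ExampleFixedPoint {
    int64_t  value;
    uint64_t precision;
};

// Record that is intended to match the file layout byte-for-byte.
struct ExampleRecord {
    char              Name[8];
    uint32_t          anInt;
    int32_t           aSignedInt;
    uint64_t          aBigInt;
    int64_t           aBigSignedInt;
    ExampleFixedPoint aFixedPoint;
};

// Fail the build if the compiler's layout drifts from the expected one.
static_assert(offsetof(ExampleRecord, Name) == 0, "Name offset");
static_assert(offsetof(ExampleRecord, anInt) == 8, "anInt offset");
static_assert(offsetof(ExampleRecord, aSignedInt) == 12, "aSignedInt offset");
static_assert(offsetof(ExampleRecord, aBigInt) == 16, "aBigInt offset");
static_assert(offsetof(ExampleRecord, aBigSignedInt) == 24, "aBigSignedInt offset");
static_assert(offsetof(ExampleRecord, aFixedPoint) == 32, "aFixedPoint offset");
static_assert(sizeof(ExampleRecord) == 48, "record length");

// With a matching layout, a mapped table can be indexed like any array.
inline const ExampleRecord &recordAt(const ExampleRecord *table, size_t index)
{
    return table[index];
}
```

Checks like these catch accidental changes to the structure, but they cannot detect a file whose format has changed; that is the fragility the conversion interface is meant to address.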

In contrast, the templated loadAndConvertMetaData<>() routine provides a powerful facility to process and transform meta data files into new formats. The two template arguments are used to specify the output format and the source format. Because the meta data format is self-describing, the source format defaults to GENERIC_META_RECORD. One downside of using the default GENERIC_META_RECORD is that it does not specify a magic number, which intentionally inhibits the safety check for matching magic numbers.

template <typename TO_FORMAT, typename FROM_FORMAT=GENERIC_META_RECORD>
    TO_FORMAT *loadAndConvertMetaData(const char *fileName, uint32_t *recTotal,
        ConvertAndTransferFieldFP transferFunction=defaultConvertAndTransferField,
        void *userData=nullptr)

The loadAndConvertMetaData<>() function traverses the content of the meta data file and passes it to a function of type ConvertAndTransferFieldFP. There is a defaultConvertAndTransferField() function that can be used in many scenarios: given the presence of a field with a given name F, the defaultConvertAndTransferField() function will copy the content into a corresponding field of the same name F in the destination record, performing any required type conversion. For example, this permits the incoming data to be specified as a text string and converted as needed into numeric (integer or real) data.

There are two distinct scenarios to be addressed. One is when the meta data file has more fields than the TO_FORMAT supports; in this case, the content of the unrecognized fields is ignored. The other scenario is when the TO_FORMAT has more fields than the source data in the meta data file; in this case, the fields are normally left untouched, but they can be marked so they are zero-filled as needed. If this feature is not exploited, the application should make provision to zero-fill them or assign useful default values when it creates each instance of a TO_FORMAT record.
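
The name-based transfer and the two scenarios above can be illustrated with a self-contained sketch. Everything in it (DestField, SourceRecord, transferByName(), the Quote record) is a hypothetical stand-in that mimics the described behavior with text-valued source fields; it does not use or reproduce the library's ConvertAndTransferFieldFP interface.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <cstring>
#include <map>
#include <string>

enum class DestType { Int32, Double };

struct DestField {
    const char *name;         // matched against the source field name
    DestType    type;         // native type expected by the destination
    size_t      offset;       // byte offset within the destination record
    bool        zeroIfAbsent; // zero-fill when the source lacks this field
};

using SourceRecord = std::map<std::string, std::string>; // name -> text value

// Copy each destination field from the source field of the same name,
// converting text to the destination's native type.
inline void transferByName(const SourceRecord &src, const DestField *fields,
                           size_t totalFields, void *destRecord)
{
    unsigned char *base = static_cast<unsigned char *>(destRecord);
    for (size_t i = 0; i < totalFields; ++i) {
        const DestField &f = fields[i];
        const size_t len = (f.type == DestType::Int32) ? sizeof(int32_t)
                                                       : sizeof(double);
        const SourceRecord::const_iterator found = src.find(f.name);
        if (found == src.end()) {
            if (f.zeroIfAbsent) {
                std::memset(base + f.offset, 0, len);
            }
            continue; // otherwise the destination field is left untouched
        }
        if (f.type == DestType::Int32) {
            const int32_t v = static_cast<int32_t>(
                std::strtol(found->second.c_str(), nullptr, 10));
            std::memcpy(base + f.offset, &v, len);
        } else {
            const double v = std::strtod(found->second.c_str(), nullptr);
            std::memcpy(base + f.offset, &v, len);
        }
    }
    // Source fields with no destination of the same name are never
    // consulted, so they are effectively ignored.
}

struct Quote {
    int32_t quantity;
    double  price;
    int32_t lotSize;
};

static const DestField quoteFields[] = {
    {"quantity", DestType::Int32,  offsetof(Quote, quantity), false},
    {"price",    DestType::Double, offsetof(Quote, price),    false},
    {"lotSize",  DestType::Int32,  offsetof(Quote, lotSize),  true},
};
```

With a source record holding quantity, price and an unrelated venue field, the venue is dropped, lotSize is zero-filled because it is marked as such, and a missing price would keep whatever value the application had placed in the Quote beforehand.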

Describing the Layout of a Meta Data File

The MetaDataLoaderForFormat<>::fieldDescriptionTable is an array of UniversalMetaData_FieldDescription records. These describe the field group identifier, type, offset, length and name of each field in a meta data file's record. The convenience macro DESCRIBE_NAMED_FIELD_OF_CLASS() can be used to populate UniversalMetaData_FieldDescription records for regular fields and DESCRIBE_OPTIONAL_NAMED_FIELD_OF_CLASS() can be used to do the same for optional fields. The variant DESCRIBE_OPTIONAL_OR_ZERO_FILLED_NAMED_FIELD_OF_CLASS() adds the capability to zero-fill a destination field if no data was present in the source record.

There are simpler variants DESCRIBE_FIELD_OF_CLASS(), DESCRIBE_OPTIONAL_FIELD_OF_CLASS(), and DESCRIBE_OPTIONAL_OR_ZERO_FILLED_FIELD_OF_CLASS() that name the field using the name of the member variable.
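
To give a feel for how such convenience macros can work, the sketch below builds a description table from offsetof and sizeof. The ExampleFieldDescription structure, the EXAMPLE_ macros and the type codes are hypothetical; the library's actual UniversalMetaData_FieldDescription layout and macro definitions are not reproduced here and may differ.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical description record: group, type, offset, length and name.
struct ExampleFieldDescription {
    uint8_t     group;
    uint8_t     type;
    uint16_t    offset;
    uint32_t    length;
    const char *name;
};

enum ExampleFieldType : uint8_t {
    EXAMPLE_TYPE_STRING = 1,
    EXAMPLE_TYPE_UINT32 = 2
};

// Describe a group 1 (always present) field under an explicit label.
#define EXAMPLE_DESCRIBE_NAMED_FIELD(label, member, type, cls) \
    { 1, (type), static_cast<uint16_t>(offsetof(cls, member)), \
      static_cast<uint32_t>(sizeof(cls::member)), (label) }

// Simpler variant: the label is the stringized member name.
#define EXAMPLE_DESCRIBE_FIELD(member, type, cls) \
    EXAMPLE_DESCRIBE_NAMED_FIELD(#member, member, type, cls)

struct SampleRecord {
    char     Name[8];
    uint32_t anInt;
};

static const ExampleFieldDescription sampleFields[] = {
    EXAMPLE_DESCRIBE_FIELD(Name, EXAMPLE_TYPE_STRING, SampleRecord),
    EXAMPLE_DESCRIBE_FIELD(anInt, EXAMPLE_TYPE_UINT32, SampleRecord),
};
```

Deriving the offset and length from the structure itself keeps the table in step with the record definition, which is the main attraction of this style over hand-entered numbers.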

Normally, a field description table is hand-crafted from a careful reading of the file format's specification, but the convertCSVtoMetaData utility supports a --cpp option that will generate a C++ header containing the record structure and the corresponding table of UniversalMetaData_FieldDescription records:

/* WARNING: machine generated source created by convertCSVtoMetaData */
#ifndef _META_CSVmeta_HPP_
#define _META_CSVmeta_HPP_ "$Id: FARGOSutilsLibrary.html 505 2023-02-15 02:11:32Z geoff $"

#include <utils/metadata/UniversalMetaData.hpp>

struct CSVmeta_MetaDataRecord {
    char                             Name[8];
    uint32_t                         anInt;
    int32_t                          aSignedInt;
    uint64_t                         aBigInt;
    int64_t                          aBigSignedInt;
    FixedPointValue                  aFixedPoint;
}; /* end CSVmeta_MetaDataRecord */

template <typename STREAMTYPE> STREAMTYPE & operator<<(STREAMTYPE &os, const CSVmeta_MetaDataRecord &arg)
{
    os << "[CSVmeta_MetaDataRecord={";
    os << " Name=\"" << AS_TEXT_BUFFER(arg.Name, sizeof(arg.Name)) << "\"";
    os << " anInt=" << arg.anInt;
    os << " aSignedInt=" << arg.aSignedInt;
    os << " aBigInt=" << arg.aBigInt;
    os << " aBigSignedInt=" << arg.aBigSignedInt;
    os << " aFixedPoint=" << arg.aFixedPoint;
    os << " }]";
    return (os);
}

const struct UniversalMetaData_FieldDescription CSVmeta_fields[] = {
    DESCRIBE_FIELD_OF_CLASS(Name, SMV_TYPE_STRING, CSVmeta_MetaDataRecord),
    DESCRIBE_FIELD_OF_CLASS(anInt, SMV_TYPE_UINT32, CSVmeta_MetaDataRecord),
    DESCRIBE_FIELD_OF_CLASS(aSignedInt, SMV_TYPE_INT32, CSVmeta_MetaDataRecord),
    DESCRIBE_FIELD_OF_CLASS(aBigInt, SMV_TYPE_UINT64, CSVmeta_MetaDataRecord),
    DESCRIBE_FIELD_OF_CLASS(aBigSignedInt, SMV_TYPE_INT64, CSVmeta_MetaDataRecord),
    DESCRIBE_FIELD_OF_CLASS(aFixedPoint, SMV_TYPE_FIXED, CSVmeta_MetaDataRecord),
    {0, 0, 0, 0, {}, {}} // end of table marker
};

// link field descriptions to transformation filters
template<> const UniversalMetaData_FieldDescription *MetaDataForFeed<CSVmeta_MetaDataRecord>::fieldDescriptionTable = CSVmeta_fields;
template<> const uint32_t MetaDataForFeed<CSVmeta_MetaDataRecord>::totalFields = 6;
template<> const char MetaDataForFeed<CSVmeta_MetaDataRecord>::magicNumber[sizeof(((UniversalMetaData_ReferenceFileHeader *)(nullptr))->magicNumber) + 1] = {"METADATA"};

#endif

Last updated: 2024-04-12