Running QA Routines

Important

Make sure r2r-ctd is talking to Docker; see Docker in the Installing instructions.
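
A quick way to confirm that the Docker daemon is reachable (assuming the docker CLI is installed) is:

docker info

If this prints daemon details rather than a connection error, r2r-ctd should be able to start containers.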

Basic Usage

Given an R2R CTD breakout, run the QA routines by executing:

uvx r2r-ctd qa <path_to_breakout>

Multiple breakouts can be specified and they will be processed in sequence:

uvx r2r-ctd qa <path_to_breakout1> <path_to_breakout2>
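
If many breakouts live under a common parent directory, a shell glob can pass them all at once (the path here is a placeholder):

uvx r2r-ctd qa /path/to/breakouts/*/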

Important

Almost all crashes are considered bugs and should be reported/fixed.

With the exception of an invalid breakout structure, where the xmlt and manifest files are missing or malformed, the QA processing should not throw or crash on invalid input files; instead, the invalidity should be reported in the QA xml report itself.

If the xmlt and manifest files are malformed or missing, something has gone wrong on the R2R side that needs to be investigated.

Tip

It is always safe to interrupt/kill the python process with control + c and restart the QA process. There is significant caching of intermediate results, so the QA process should quickly catch up to where it left off.

Switches

Quiet -q

The verbosity of the logging can be controlled by adding one or more -q flags after r2r-ctd but before the qa subcommand.

uvx r2r-ctd -q qa <path_to_breakout>

With a single -q, only log messages of level INFO or greater are printed. Each -q reduces the log verbosity by one level from the default, which is DEBUG.

uvx r2r-ctd -qq qa <path_to_breakout>

Only log messages of level WARNING or greater are printed.

Skip CNV generation --no-gen-cnvs

Generating the cnv products is not necessary for the QA routines, and it is computationally expensive. Adding the --no-gen-cnvs flag skips generating these files:

uvx r2r-ctd qa --no-gen-cnvs <path_to_breakout>

Warning

In testing and development, the underlying Seabird software programs would occasionally fail to exit while producing the cnv products. There would be no open GUI windows, and I have been unable to find logs or debug information about what might be causing this.

It is safe to kill (control + c) and restart the QA process when this occurs. Kill the python program, not the docker container; the container should clean itself up when python exits.
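
If you want to confirm that no containers were left behind after killing the python process, you can list the running containers:

docker ps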

Breakout Structure

When R2R receives data from a cruise, it is split up into separate collections called “breakouts”. To be processed, a breakout is expected to be a directory with contents, not an archive such as a zip file. r2r-ctd does not interact with remote systems and makes no assumptions about how breakouts are obtained or where the QA results are put back.

The R2R CTD breakout must have the following structure, which almost follows the BagIt standard[1]. This section follows the nomenclature of the BagIt terminology section. A leading / here refers to the root of the breakout.

  • A /manifest-md5.txt payload manifest, containing a list of md5 file hashes and relative paths to the files corresponding to those hashes. Only md5 is supported by r2r-ctd at this time.

  • A /data payload directory containing the datafiles that will be checked.

  • A /qa tag directory containing, at a minimum, a *_qa.2.0.xmlt tag file that conforms to the R2R QA 2.0 schema. The prefix of this xml file is probably some combination of cruise name and breakout id; however, this is not too important, only that exactly one file matches this pattern.

While the BagIt spec requires all the actual content to be in the /data directory, r2r-ctd only uses the paths inside the manifest-md5.txt file and does not validate that the breakout conforms to the BagIt specification. The details of which cruise-specific files are looked for within the /data directory are in the API documentation: specifically, r2r_ctd.breakout.Breakout.stations_hex_paths for what is considered a station[2], and r2r_ctd.checks.check_three_files for what each station is expected to have.
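
Putting this together, a minimal breakout might look like the following sketch (the names here are placeholders; the actual data files depend on the cruise and instrument):

<path_to_breakout>/
├── manifest-md5.txt
├── data/
│   └── … the cruise data files, e.g. each station’s .hex raw cast and its companion files
└── qa/
    └── <prefix>_qa.2.0.xmlt

Each line of manifest-md5.txt pairs an md5 hash with a path relative to the breakout root, in the usual md5sum output format; for example (the hash and path here are made up):

d41d8cd98f00b204e9800998ecf8427e  data/station001.hex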

QA Template File: *_qa.2.0.xmlt

This xml file is the “template” that is both updated with the results of the QA routines and contains some of the metadata that the breakout files are tested against: specifically, the cruise start/end dates and the bounding box.

QA Results

Several result files are produced along with some processing state files. Everything r2r-ctd generates will be placed into a /proc directory[3]. Inside this /proc directory are several other directories:

  • /proc/nc has netCDF files containing all the “state” of the QA routines, including test results and derived files. These netCDF files are an implementation detail and their contents can be ignored unless things are going really wrong. These files can be safely deleted, but doing so removes the “cache” of the QA results for each cast (see the example after this list). Do not modify these files.

  • /proc/qa will have the qa results:

    • If the QA routines finished, a *_qa.2.0.xml file will be present (note the lack of a t in the file extension), updated with the results.

    • A *_ctd_metadata.geoCSV file should be present.

    • A /proc/qa/config directory containing the instrument configuration report text files.

  • /proc/products/r2rctd will have all the generated cnv files (2 per cast) if the --no-gen-cnvs switch was not provided.
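
For example, to discard the cached QA state and force a full re-run on the next invocation (the path is a placeholder), remove the nc directory:

rm -r <path_to_breakout>/proc/nc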

Presumably, the contents of /proc, excluding the nc sub-directory, can be rsync-ed back to the R2R server (without the --delete switch).
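
Such a transfer might look something like the following sketch, where the user, server, and destination path are all placeholders:

rsync -av --exclude 'nc/' <path_to_breakout>/proc/ user@r2r-server:/path/to/breakout/proc/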

Parallel Processing

Since docker provides reasonable process isolation for the Windows-based conversion tools, it is possible to have multiple container instances running the Seabird software in parallel. This is most simply done by opening multiple terminal sessions and running the basic usage commands above on a single breakout in each session. Within a single session you could also use something like xargs to parallelize, but the emitted log messages will be interleaved (muxed), making it difficult to follow what is going on.
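
For example, to run at most 4 breakouts at a time with xargs (the parent path is a placeholder, and this assumes the breakout paths contain no spaces):

printf '%s\n' /path/to/breakouts/*/ | xargs -n 1 -P 4 uvx r2r-ctd qa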

In general, you’ll want to limit the number of parallel processes to the number of physical cores in your CPU; on Apple ARM hardware, this is further limited to the number of performance cores your machine has. To see how many performance cores are present on an M-family Mac, you can use the system_profiler command:

system_profiler SPHardwareDataType

Look for the line that says Total Number of Cores:. In parentheses it should show the breakdown between performance and efficiency cores. For example, the baseline M4 MacBook Air has 10 cores but only 4 are performance cores, so the number of parallel processes should reasonably be kept to 4:

Total Number of Cores: 10 (4 performance and 6 efficiency)
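
On recent macOS versions you can also query the performance-core count directly; hw.perflevel0 refers to the performance cluster on Apple Silicon:

sysctl -n hw.perflevel0.physicalcpu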