Running QA Routines¶
Important
Make sure r2r-ctd can talk to docker; see the Docker section of the Installing instructions.
Basic Usage¶
Given an R2R CTD breakout, run the QA routines by executing:
uvx r2r-ctd qa <path_to_breakout>
Multiple breakouts can be specified and they will be processed in sequence:
uvx r2r-ctd qa <path_to_breakout1> <path_to_breakout2>
Important
Almost all crashes are considered bugs and should be reported/fixed.
The one exception is an invalid breakout structure, where the xmlt and manifest files are missing or malformed. Otherwise, the QA processing should not throw or crash on invalid input files; the invalidity should be reported in the QA xml report itself.
If the xmlt and manifest files are malformed or missing, something has gone wrong on the r2r side that needs to be investigated.
Tip
It is always safe to interrupt/kill the python process with a control + c and restart the QA process. There is significant caching of intermediate results and the QA process should quickly catch up to where it left off.
Switches¶
Quiet -q¶
The verbosity of the logging can be controlled by adding one or more -q flags after the r2r-ctd but before the qa subcommand.
uvx r2r-ctd -q qa <path_to_breakout>
Only prints log messages of level INFO or greater.
Each q reduces the log verbosity by one level from the default, which is DEBUG.
uvx r2r-ctd -qq qa <path_to_breakout>
Only prints log messages of level WARNING or greater.
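The arithmetic behind the -q flags can be sketched with the standard logging module. This only illustrates the level stepping described above and is not r2r-ctd's actual implementation; the helper name level_for_quiet_count and the cap at CRITICAL are assumptions.

```python
import logging


def level_for_quiet_count(quiet: int) -> int:
    """Return the logging threshold after `quiet` -q flags (hypothetical helper).

    The default is DEBUG and each -q raises the threshold by one standard
    level (steps of 10 in the logging module). Capping at CRITICAL is an
    assumption made for this sketch.
    """
    return min(logging.DEBUG + 10 * quiet, logging.CRITICAL)


# Show the level name for zero, one and two -q flags.
for count in range(3):
    print(count, logging.getLevelName(level_for_quiet_count(count)))
```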
Skip CNV generation --no-gen-cnvs¶
Generating the cnv products is not necessary for the QA routines and is computationally expensive.
Adding the --no-gen-cnvs switch will skip generating these files:
uvx r2r-ctd qa --no-gen-cnvs <path_to_breakout>
Warning
In testing and development, the underlying seabird software programs would occasionally not exit while producing the cnv products. There would be no open GUI windows and I have been unable to find logs or debug information about what might be causing this.
It is safe to kill (control + c) and restart the QA process when this occurs. Kill the python program, not the docker container; the container should clean itself up when python exits.
Control how closely to follow BagIt manifest validation spec --bag¶
The first release version of this software would only check what is in the manifest-md5.txt file, which was found to be less robust than we wanted. Some breakouts were found to have files but empty manifests; this software would treat such a breakout as empty and… crash. A stricter mode was implemented that can be controlled by the --bag switch value:
strict, any files in the /data directory and not in the manifest-md5.txt cause the manifest OK test to report failure.
flex, a reasonable set of file names are allowed to exist in /data and not in the manifest-md5.txt, see r2r_ctd.breakout.FLEX_FILES_OK for the list of filenames allowed.
manifest reverts to the original behavior where only paths in the manifest-md5.txt are checked and any extra files in /data are ignored.
The flex mode is the default.
Example:
Use strict bag mode:
uvx r2r-ctd qa --bag strict <path_to_breakout>
Use strict bag mode and skip generating CNV files:
uvx r2r-ctd qa --bag strict --no-gen-cnvs <path_to_breakout>
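The difference between the three --bag values can be sketched as a small function. This is an illustration under stated assumptions, not the r2r-ctd implementation: the function name unexpected_files is hypothetical, the flex_ok set stands in for r2r_ctd.breakout.FLEX_FILES_OK (its real contents are not reproduced here), and matching on the bare file name is a guess based on the description of that list.

```python
from pathlib import PurePosixPath


def unexpected_files(mode, data_files, manifest_paths, flex_ok):
    """Return the files under /data that would count against `mode`.

    data_files: relative paths found in the payload directory
    manifest_paths: relative paths listed in manifest-md5.txt
    flex_ok: file names tolerated in flex mode (placeholder for FLEX_FILES_OK)
    """
    extra = set(data_files) - set(manifest_paths)
    if mode == "manifest":
        # Original behavior: files not in the manifest are ignored.
        return set()
    if mode == "flex":
        # Tolerate extras whose file name is in the allowed set (assumption).
        return {p for p in extra if PurePosixPath(p).name not in flex_ok}
    if mode == "strict":
        # Any file missing from the manifest is a problem.
        return extra
    raise ValueError(f"unknown bag mode: {mode!r}")


# Placeholder inputs, purely for illustration.
data = ["data/cast1.hex", "data/notes.txt", "data/example_extra.txt"]
manifest = ["data/cast1.hex"]
flex_ok = {"example_extra.txt"}
for mode in ("strict", "flex", "manifest"):
    print(mode, sorted(unexpected_files(mode, data, manifest, flex_ok)))
```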
Breakout Structure¶
When R2R receives data from a cruise it will be split up into separate collections called “breakouts”.
To be processed, the breakout is expected to be a directory with contents, not an archive such as a zip file.
r2r-ctd does not interact with remote systems and makes no assumptions about how the breakouts are obtained or where the QA results are put back.
The R2R CTD Breakout must have the following structure and almost follows the BagIt standard[1].
This section will follow the nomenclature in the BagIt terminology section.
The starting / will refer here to the root of the breakout.
A /manifest-md5.txt payload manifest, containing a list of md5 file hashes and relative paths to the files corresponding to those hashes. Only md5 is supported by r2r-ctd at this time.
A /data payload directory containing the data files that will be checked.
A /qa tag directory containing at a minimum a *_qa.2.0.xmlt tag file that conforms to the R2R QA 2.0 Schema. The prefix of this xml file is probably some combination of cruise name and breakout id, however this is not too important, only that exactly one file matches this pattern.
While the BagIt spec requires all the actual content to be in the /data directory, r2r-ctd just uses the paths inside the manifest-md5.txt file and does not do any validation that this breakout conforms to the BagIt specification.
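For readers who want to check a payload manifest by hand, the following is a minimal sketch of an md5 comparison. It assumes each manifest line is a checksum followed by whitespace and a relative path, as in BagIt payload manifests; the function name verify_manifest is hypothetical and this is not how r2r-ctd itself performs the check.

```python
import hashlib
from pathlib import Path


def verify_manifest(breakout: Path) -> dict[str, bool]:
    """Map each path listed in manifest-md5.txt to whether its md5 matches.

    A listed file that does not exist is reported as False. A line without
    a path raises ValueError, since a malformed manifest needs investigating.
    """
    results = {}
    manifest = breakout / "manifest-md5.txt"
    for line in manifest.read_text().splitlines():
        if not line.strip():
            continue  # skip blank lines
        expected, rel_path = line.split(maxsplit=1)
        rel_path = rel_path.strip()
        target = breakout / rel_path
        if target.is_file():
            actual = hashlib.md5(target.read_bytes()).hexdigest()
            results[rel_path] = actual == expected.lower()
        else:
            results[rel_path] = False
    return results
```

Calling verify_manifest(Path("<path_to_breakout>")) returns a dictionary that can be scanned for False values before running the QA routines.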
The details of which cruise specific files are looked for within the /data directory are in the API documentation.
Specifically, see r2r_ctd.breakout.Breakout.stations_hex_paths for what is considered a station[2], and r2r_ctd.checks.check_three_files for what each station is expected to have.
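The required layout described in this section can be checked before starting a long QA run. The sketch below is a hypothetical pre-flight helper (breakout_problems is not part of r2r-ctd) that only looks for the three required pieces and the single template file; it does not validate their contents.

```python
from pathlib import Path


def breakout_problems(breakout: Path) -> list[str]:
    """Return a list of layout problems; an empty list means the layout looks OK."""
    if not breakout.is_dir():
        # Archives such as zip files are not accepted, only directories.
        return [f"{breakout} is not a directory"]
    problems = []
    if not (breakout / "manifest-md5.txt").is_file():
        problems.append("missing /manifest-md5.txt payload manifest")
    if not (breakout / "data").is_dir():
        problems.append("missing /data payload directory")
    qa_dir = breakout / "qa"
    templates = sorted(qa_dir.glob("*_qa.2.0.xmlt")) if qa_dir.is_dir() else []
    if len(templates) != 1:
        problems.append(
            f"expected exactly one /qa/*_qa.2.0.xmlt file, found {len(templates)}"
        )
    return problems
```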
QA Template File: *_qa.2.0.xmlt¶
This xml file is the “template” that will be updated with the results of the QA routines. It also contains some of the metadata that the breakout files are tested against, specifically the cruise start/end dates and the bounding box.
QA Results¶
Several result files are produced along with some processing state files.
Everything r2r-ctd generates will be placed into a /proc directory[3].
Inside this /proc directory are several other directories:
/proc/nc has netCDF files containing all the “state” of the QA routines, including test results and derived files. These netCDF files are an implementation detail and the contents can be ignored unless things are going really wrong. These files can be safely deleted, but doing so removes the “cache” of the QA results for each cast. Do not modify these files.
/proc/qa will have the qa results:
  If the QA routines finished, a *_qa.2.0.xml will be present (note the lack of t in the file extension), updated with results.
  A *_ctd_metdata.geoCSV file should be present.
  A /proc/qa/config directory containing the instrument configuration report text files.
/proc/products/r2rctd will have all the generated cnv files (2 per cast) if the --no-gen-cnvs switch was not provided.
A *_qa_map.html file will be generated. Open this in a browser to see the stations plotted, the bounding box and overall score color. This map is created using folium and needs an internet connection to view (but not create).
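Because the finished report only appears once the QA routines complete, its presence can serve as a simple "done" marker in scripts. This is a small sketch with a hypothetical helper name (qa_report_path); it assumes /proc sits directly under the breakout root as described above.

```python
from pathlib import Path
from typing import Optional


def qa_report_path(breakout: Path) -> Optional[Path]:
    """Return the finished *_qa.2.0.xml report under /proc/qa, or None.

    The glob pattern ends in .xml, so the *_qa.2.0.xmlt template
    is never mistaken for a finished report.
    """
    qa_dir = breakout / "proc" / "qa"
    reports = sorted(qa_dir.glob("*_qa.2.0.xml")) if qa_dir.is_dir() else []
    return reports[0] if reports else None
```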
Presumably, the contents of /proc excluding the nc sub-directory and map html can be rsync-ed back to the r2r server (without the --delete switch).
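One way such a transfer could be assembled is sketched below. The helper rsync_command and the destination string are hypothetical, and r2r-ctd itself performs no transfers; the sketch only builds an rsync argument list that excludes the nc sub-directory and the map html and omits --delete.

```python
import shlex
from pathlib import Path


def rsync_command(breakout, destination):
    """Build an rsync argument list for copying /proc results (hypothetical helper)."""
    proc = Path(breakout) / "proc"
    return [
        "rsync",
        "-av",                          # archive mode, verbose
        "--exclude", "/nc/",            # skip the netCDF cache at the top of /proc
        "--exclude", "*_qa_map.html",   # skip the folium map
        f"{proc}/",                     # trailing slash: copy the contents of proc
        destination,
    ]


# The destination below is a placeholder, not a real R2R server path.
cmd = rsync_command("path_to_breakout", "user@example.org:/some/remote/proc/")
print(shlex.join(cmd))
```

The list can be passed to subprocess.run once the destination has been replaced with a real one.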
Parallel Processing¶
Since docker provides reasonable process isolation for the Windows based conversion tools, it is possible to have multiple container instances running the Seabird software in parallel.
This is most simply done by having multiple terminal sessions open and running the basic usage commands above on a single breakout in each session.
In the same session you could also use something like xargs to parallelize, but the emitted log messages will be interleaved, making it difficult to follow what is going on.
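As an alternative to xargs, a short script can run one QA process per breakout and send each process's output to its own log file, so the logs stay readable. This is a sketch under assumptions: the helper names run_qa and run_all and the log directory are hypothetical, and breakout directories are assumed to have distinct names so the log files do not collide.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def run_qa(breakout, log_dir, command=("uvx", "r2r-ctd", "qa")):
    """Run the QA command on one breakout, writing its output to <name>.log."""
    log_path = Path(log_dir) / f"{Path(breakout).name}.log"
    with open(log_path, "w") as log:
        result = subprocess.run(
            [*command, str(breakout)], stdout=log, stderr=subprocess.STDOUT
        )
    return result.returncode


def run_all(breakouts, log_dir, max_workers=4, command=("uvx", "r2r-ctd", "qa")):
    """Run up to max_workers QA processes at once; return exit codes by breakout."""
    breakouts = [str(b) for b in breakouts]
    Path(log_dir).mkdir(parents=True, exist_ok=True)
    # Threads are enough here: each one just waits on an external process.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        codes = list(pool.map(lambda b: run_qa(b, log_dir, command), breakouts))
    return dict(zip(breakouts, codes))
```

For example, run_all(["<path_to_breakout1>", "<path_to_breakout2>"], "qa_logs", max_workers=2) would process two breakouts at a time. Since the QA process is safe to restart, a non-zero exit code can simply be followed by another run of the same breakout.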
In general, you’ll want to limit the number of parallel processes to the number of physical cores in your CPU. On Apple arm hardware, limit it further to the number of performance cores your machine has.
To see how many performance cores are present on an M-family mac, you can use the system_profiler command:
system_profiler SPHardwareDataType
Look for the line that says: Total Number of Cores:
In parentheses it should have the breakdown between performance and efficiency cores.
For example, the baseline M4 MacBook Air has 10 cores but only 4 are performance, so the number of parallel processes should reasonably be kept to 4:
Total Number of Cores: 10 (4 performance and 6 efficiency)
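If you would rather not read the number off by eye, the performance core count can be pulled out of that line with a regular expression. This is a small sketch; the helper name performance_cores is hypothetical, and it assumes the line has the same shape as the example above.

```python
import re


def performance_cores(profiler_output: str):
    """Return the performance core count from system_profiler text, or None."""
    match = re.search(
        r"Total Number of Cores:\s*\d+\s*\((\d+) performance", profiler_output
    )
    return int(match.group(1)) if match else None


sample = "Total Number of Cores: 10 (4 performance and 6 efficiency)"
print(performance_cores(sample))  # prints 4
```

On a mac, the text to parse can be obtained with subprocess.run(["system_profiler", "SPHardwareDataType"], capture_output=True, text=True).stdout. A result of None means the line did not include a performance/efficiency breakdown.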