PrePARE

PrePARE#

The PrePARE software tool is provided by PCMDI (Program for Climate Model Diagnosis and Intercomparison) to verify that CMIP6 files conform to the CMIP6 data protocol. The CMIP6 data protocol comprises requirements set out in different documents published by the CMIP6 WIP (Working Group on Climate Models Infrastructure Panel).

The Data request contains variable specifications (frequency, cell methods,..)
The Model output requirements specify the data format, structure and content.
The CMIP6 meta data standard is defined in Attributes, DRS, File names, directory structure, CV.
All participants have to be regsitered in this Registry for allowed models.

These documents are translated into .json formatted Controlled Vocabularies and tables readable by PrePARE and named cmip6-cmor-tables.

PrePARE performs 10 different tests which can be summarized by the following points:

Check for invariable and conditional required global attributes and valid values of those.
Are file names and paths conform to the project’s data reference syntax (DRS)?
Check for required variable attributes.
Coordinates: Some variables are requested on specific coordinates that need to be provided in the files in a compliant format.

In the following, we run PrePARE for a subset of CMIP6 pool data.

Preparation#

We will use the PrePARE binary in a shell but wrapped by this python notebook. We provide a conda environment which all levante users can use.

In order to use this environment as a kernel for jupyter notebooks, you can use ipykernel as shown in the next cell. Afterwards, reload your browser and select the new kernel for the quality assurance notebook.

%%bash
#The following line activates the source for working in a shell.
#source activate /work/bm0021/conda-envs/quality-assurance
#
#The following line installs a jupyter kernel for the conda environment
#python -m ipykernel install --user --name $kernelname --display-name="$kernelname"

Per default, shells inside a kernel are not started from the environment of the kernel. That means, the PrePARE executable is not found:

This can be changed either by using a helper script for the kernel as follows. You can also do that at the top of notebooks.

import sys
import os
newpath=f"{os.sep.join(sys.executable.split(os.sep)[:-1])}:{os.environ['PATH']}"
os.environ['PATH']=newpath
pp=!which PrePARE
pp=pp[0]

We also import some useful packages

# copy2 copies without errors
from shutil import copy2
# tqdm gives a progressbar for for loops
from tqdm import tqdm
import subprocess

Since the data standard evolves over time, we need to find the matching version for the datasests which should be tested. For that, we need git to checkout the corresponding version of the data standard tables, named cmip6-cmor-tables. You can clone the tables repository via:

import git 
import re
# The following clones the cmip6 cmor tables if not available:
working_path="./"
cmip6_cmor_tables_url="https://github.com/PCMDI/cmip6-cmor-tables.git"
if "cmip6-cmor-tables" not in os.listdir(working_path):
    git.Git(working_path).clone(cmip6_cmor_tables_url)

One table in the tables repository only contains the global attributes and no information about the variables: CMIP6_CV.json where ‘CV’ is for Controlled Vocabulary. In contrast to those tables which contain variables, only the recent version of the global attributes table is valid. This is because this file is mostly never changed but rather extended. Whenever we checkout a different version of the tables repository, we need to copy the recent global attributes CV into that version. Therefore, we copy this CV to a save place named recentCV.

recentCV = working_path+"CMIP6-CV-20210419.json"
copy2(working_path+"cmip6-cmor-tables/Tables/CMIP6_CV.json", recentCV)

'./CMIP6-CV-20210419.json'

Settings#

The following variables are important for PrePARE and will be defined:

logChunk will hold the results of PrePARE
cmip6-cmor-table-path is the directory for the input tables
exec is the executable which we will run in bash

prepareSetting = {
    "exec" : pp,
    #"logChunk":"/mnt/lustre01/work/bm0021/prepare-test/",
    "logChunk":"prepare-test",
    "cmip6-cmor-table-path" : working_path+"cmip6-cmor-tables/Tables"
}
!mkdir -p {prepareSetting["logChunk"]}

prepareSetting["exec"]

'/opt/conda/envs/datapoolservices/bin/PrePARE'

Initialization#

We read in the dataset list, load the git repository and copy the recent Controlled Vocabulary for required attributes.

g = git.Git(prepareSetting["cmip6-cmor-table-path"]) 
g.reset("--hard")
g.checkout("master")
copy2(prepareSetting["cmip6-cmor-table-path"]+"/CMIP6_CV.json", recentCV)

---------------------------------------------------------------------------
GitCommandError                           Traceback (most recent call last)
Cell In[8], line 3
      1 g = git.Git(prepareSetting["cmip6-cmor-table-path"]) 
      2 g.reset("--hard")
----> 3 g.checkout("master")
      4 copy2(prepareSetting["cmip6-cmor-table-path"]+"/CMIP6_CV.json", recentCV)

File /opt/conda/envs/datapoolservices/lib/python3.13/site-packages/git/cmd.py:986, in Git.__getattr__.<locals>.<lambda>(*args, **kwargs)
    984 if name.startswith("_"):
    985     return super().__getattribute__(name)
--> 986 return lambda *args, **kwargs: self._call_process(name, *args, **kwargs)

File /opt/conda/envs/datapoolservices/lib/python3.13/site-packages/git/cmd.py:1599, in Git._call_process(self, method, *args, **kwargs)
   1596 call.append(dashify(method))
   1597 call.extend(args_list)
-> 1599 return self.execute(call, **exec_kwargs)

File /opt/conda/envs/datapoolservices/lib/python3.13/site-packages/git/cmd.py:1389, in Git.execute(self, command, istream, with_extended_output, with_exceptions, as_process, output_stream, stdout_as_string, kill_after_timeout, with_stdout, universal_newlines, shell, env, max_chunk_size, strip_newline_in_stdout, **subprocess_kwargs)
   1386 # END handle debug printing
   1388 if with_exceptions and status != 0:
-> 1389     raise GitCommandError(redacted_command, status, stderr_value, stdout_value)
   1391 if isinstance(stdout_value, bytes) and stdout_as_string:  # Could also be output_stream.
   1392     stdout_value = safe_decode(stdout_value)

GitCommandError: Cmd('git') failed due to: exit code(1)
  cmdline: git checkout master
  stderr: 'error: pathspec 'master' did not match any file(s) known to git'

Assume, we want to test the dataset dset_id in directory trunk:

trunk="/work/ik1017/CMIP6/data/"
dset_id="CMIP6.ScenarioMIP.DKRZ.MPI-ESM1-2-HR.ssp370.r1i1p1f1.Amon.tas.gn.v20190710"

In order to find out the data standard version used for the creation of the files which should be tested, we need to retrieve the value from the global attribute data_specs_version from one file of the dataset. We assign a corresponding attribute data_specs_version to the dset_id and combine it in a dictionary.

dsets_to_test={dset_id :
               { "dset_path":trunk+'/'.join(dset_id.split('.')),
                 "data_specs_version":""
               }
             }

The function addSpecs will retrieve the specs attribute by using the bash tool ncdump -h showing the header of a file including all attributes.

def addSpecs(entry):
    print([os.path.join(entry["dset_path"],f) 
              for f in os.listdir(entry["dset_path"]) 
      ])
    try:
        fileinpath = [os.path.join(entry["dset_path"],f) 
                      for f in os.listdir(entry["dset_path"]) 
                      if os.path.isfile(
                          os.path.join(entry["dset_path"],f)
                      )]
    except:
        return ""
#    ncdump_exec="/sw/rhel6-x64/netcdf/netcdf_c-4.4.1.1-gcc48/bin/ncdump"
    dsv = !ncdump -h {fileinpath[0]} | grep data_specs_version | cut -d '"' -f 2
    dsv = ''.join(dsv)
    return dsv

And now, we apply it for all dsets in the dsets_to_test dictionary:

for dset, entry in dsets_to_test.items():
    print(dset)
    entry["data_specs_version"] = addSpecs(entry)

CMIP6.ScenarioMIP.DKRZ.MPI-ESM1-2-HR.ssp370.r1i1p1f1.Amon.tas.gn.v20190710

---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
Cell In[12], line 3
      1 for dset, entry in dsets_to_test.items():
      2     print(dset)
----> 3     entry["data_specs_version"] = addSpecs(entry)

Cell In[11], line 3, in addSpecs(entry)
      1 def addSpecs(entry):
      2     print([os.path.join(entry["dset_path"],f) 
----> 3               for f in os.listdir(entry["dset_path"]) 
      4       ])
      5     try:
      6         fileinpath = [os.path.join(entry["dset_path"],f) 
      7                       for f in os.listdir(entry["dset_path"]) 
      8                       if os.path.isfile(
      9                           os.path.join(entry["dset_path"],f)
     10                       )]

FileNotFoundError: [Errno 2] No such file or directory: '/work/ik1017/CMIP6/data/CMIP6/ScenarioMIP/DKRZ/MPI-ESM1-2-HR/ssp370/r1i1p1f1/Amon/tas/gn/v20190710'

Retrieving all versions of the cmip6-cmor-table repository#

We are using the tags of the version releases and reformat their values to be conform to the data_specs_version.

tags = reversed(g.tag("-n").split("\n"))
tagdict = {"data_specs_versions":[]}           
for tag in tags :
    tl = tag.split(" ", 1)[0]
    tllen = len(tl.split("."))
    if tllen > 3 :
        continue
    dsvnumber = tl.split(".")[tllen-1]
    dsvnumber = "".join(filter(str.isdigit, dsvnumber))
    dsv = "['01.00."+dsvnumber+"']"
    if dsv not in tagdict["data_specs_versions"] :
        tagdict["data_specs_versions"].append(dsv)
        tagdict[dsv]={"tag_label":tl,
                      "description":tag.split(" ",1)[1]}

print(tagdict['data_specs_versions'])

["['01.00.33']", "['01.00.32']", "['01.00.31']", "['01.00.30']", "['01.00.29']", "['01.00.28']", "['01.00.27']", "['01.00.24']", "['01.00.23']", "['01.00.22']", "['01.00.21']", "['01.00.20']", "['01.00.19']", "['01.00.18']", "['01.00.17']", "['01.00.16']", "['01.00.15']", "['01.00.14']", "['01.00.13']", "['01.00.12']", "['01.00.11']"]

Application#

We loop over the datasets to be checked. Note that for many different datasets from different sources, it might be helpful to loop over different data_specs_versions instead so that we checkout each cmip6-cmor-tables repository version only once.

For the PrePARE run itself, we define the function checkSubset where:

We skip datasets for which we do not have a corresponding table version
We define a unique logPath for each dataset we are going to test using the logChunk, data_specs_version and the dset_id. PrePARE is able to create own directories which we also exploit. If there is already data in it, we skip the test to avoid duplications.
Checkout the correct cmip6-cmor tables and overwrite the CV with the most recent one saved in the beginning of this script.
Run PrePARE with 8 parallel processes.

def checkSubset(dset_id, dsetatts):
    print(dsetatts)
    if not "['"+dsetatts["data_specs_version"]+"']" in tagdict["data_specs_versions"] :
        print("No matching tag for data_specs_version {}".format(dsetatts["data_specs_version"]))
        return
    logPath=prepareSetting["logChunk"]+"/"+dsetatts["data_specs_version"].split('.')[2].split("'")[0]+"/"+dset_id
    if os.path.exists(logPath) and len(os.listdir(logPath)) != 0:
        return
    tag2checkout = tagdict["['"+dsetatts["data_specs_version"]+"']"]["tag_label"]
    g.reset("--hard")
    g.checkout(tag2checkout)
    copy2(recentCV, prepareSetting["cmip6-cmor-table-path"]+"/CMIP6_CV.json")
    #
    a = subprocess.run("{0} -l {1} --all --table-path {2} {3}".format(
                           prepareSetting["exec"],
                           logPath,
                           prepareSetting["cmip6-cmor-table-path"],
                           dsetatts["dset_path"]),
                       capture_output=True, shell=True)
    print(a)

!rm -r prepare-test30

rm: cannot remove 'prepare-test30': No such file or directory

for dset_id, dsetatts in dsets_to_test.items() :
    checkSubset(dset_id, dsetatts)

{'dset_path': '/work/ik1017/CMIP6/data/CMIP6/ScenarioMIP/DKRZ/MPI-ESM1-2-HR/ssp370/r1i1p1f1/Amon/tas/gn/v20190710', 'data_specs_version': ''}
No matching tag for data_specs_version 

Results#

As we let PrePARE write logifles for each dataset, we have to collect the results to get an overview. Each logfile start with a description of

how many files were scanned
how many files had failed

Apparently, if 0 files have failed, the dataset (if we get one logfile per dataset) has passed the checks. The next lines are not clearly formulated so that we parse them. We can distinguish between two error categories. The maximal severity of the errors max_severity is determined with every new match of an error.

Critical errors
- if the filename or filepath is not conform to the data standard
- if the data structure could not be parsed
- are identified by error keywords filename, not understood, SKIPPED
Minor issues
- if a value of a required global attribute could not be found.
- are identified by error keywords CV Fail

errorSeverity=["Passed", "Minor Issue", "Major Issue"]
parsedict={"meta": ["filename", "creation_date", "dset_id", "specs_version"],
           "filenoDict":{"checked": 'files scanned: (\d+)',
                        "failed": 'with error\(s\): (\d+)'
                       },
           "errorDict":{"filename": 2,
                        "Warning" : 1,
                        "CV FAIL" : 1,
                        "Permission denied" : 2,
                        "not understood" : 2,
                        "SKIPPED" : 2},
          }

<>:3: SyntaxWarning: invalid escape sequence '\d'
<>:4: SyntaxWarning: invalid escape sequence '\('
<>:3: SyntaxWarning: invalid escape sequence '\d'
<>:4: SyntaxWarning: invalid escape sequence '\('
/tmp/ipykernel_313/86600722.py:3: SyntaxWarning: invalid escape sequence '\d'
  "filenoDict":{"checked": 'files scanned: (\d+)',
/tmp/ipykernel_313/86600722.py:4: SyntaxWarning: invalid escape sequence '\('
  "failed": 'with error\(s\): (\d+)'

We subdivide the parsing into two processes, parse_file and collect_errors. parse_file is executed if errors are detected in collect_errors. As an argument, we provide not only the path to the logfile but rather a dictionary that will be filled with all important metadata to assess the PrePARE results.

def collect_errors(dset_entry) :
    errors=[]
    max_severity=0
    for line in open(dset_entry["logfile_name"]):
        for errorKeyword in parsedict["errorDict"].keys() :
            match = re.findall(errorKeyword, line)
            if match:
                errors.append(errorKeyword)
                max_severity=max(max_severity,int(parsedict["errorDict"][errorKeyword]))
    dset_entry["errors"]=tuple(errors)
    dset_entry["max_severity"]=max_severity

def parse_file(dset_entry):
    checkedFiles=[]
    failedFiles=[]
    for line in open(dset_entry["logfile_name"]):
        match = re.search(parsedict["filenoDict"]["checked"], line)
        if match:
            checkedFiles.append(''.join(match.group(1)))
        match = re.search(parsedict["filenoDict"]["failed"], line)
        if match:
            failedFiles.append(''.join(match.group(1)))
    if not checkedFiles or not failedFiles :
        print(dset_entry["logfile_name"], checkedFiles, failedFiles)
    dset_entry["checked"]=int(checkedFiles[0])
    dset_entry["failed"]=int(failedFiles[0])
    dset_entry["passed"]=dset_entry["checked"]-dset_entry["failed"]
    if not dset_entry["failed"] == 0 :
        collect_errors(dset_entry)        

We finally collect all results in a dictionary prepare_dict where the dset_ids are the keys. For that, we loop over all logfiles.

prepare_dict = {}
specs_paths=os.listdir(prepareSetting["logChunk"])
for specs_path in tqdm(specs_paths):
    for dirpath, dirnames, logfile_names in os.walk(os.path.join(prepareSetting["logChunk"], specs_path)):
        for logfile_name in logfile_names :
            dset_entry = {"logfile_name":os.path.join(dirpath, logfile_name),
                          "creation_date":logfile_name.split(".")[0].split("-")[1],
                          "dset_id":dirpath[len(os.path.join(prepareSetting["logChunk"], specs_path))+1:],
                          "specs_version": "01.00."+specs_path}

            parse_file(dset_entry)
            prepare_dict[dset_entry["dset_id"]]=dset_entry

0it [00:00, ?it/s]

0it [00:00, ?it/s]

print(prepare_dict)

{}