## PrePARE

The [PrePARE](https://cmor.llnl.gov/mydoc_cmip6_validator/) software tool is provided by [PCMDI](https://pcmdi.llnl.gov/) (Program for Climate Model Diagnosis and Intercomparison) to verify that CMIP6 files conform to the CMIP6 data protocol. The protocol comprises requirements set out in several documents published by the CMIP6 WIP (WGCM Infrastructure Panel).

- The [Data request](https://cmip6dr.github.io/Data_Request_Home/) contains variable specifications (frequency, cell methods, ...).
- The [Model output requirements](https://goo.gl/neswPr) specify the data format, structure and content.
- The CMIP6 metadata standard is defined in [Attributes, DRS, File names, directory structure, CV](https://goo.gl/v1drZl).
- All participants have to be registered in this [Registry for allowed models](https://github.com/WCRP-CMIP/CMIP6_CVs).

These documents are translated into JSON-formatted **Controlled Vocabularies** and tables that PrePARE can read. They are published as [cmip6-cmor-tables](https://github.com/PCMDI/cmip6-cmor-tables).

PrePARE performs [10 different tests](https://goo.gl/NmuENr) which can be summarized by the following points:

1. Check for invariable and conditional **required global attributes** and for valid values of those.
2. Check that **file names and paths** conform to the project's data reference syntax (DRS).
3. Check for required **variable attributes**.
4. **Coordinates**: Some variables are requested on specific coordinates that need to be provided in the files in a compliant format.

In the following, we run `PrePARE` for a subset of CMIP6 pool data.

### Preparation

We will use the PrePARE binary in a shell, wrapped by this Python notebook. We provide a conda environment that all Levante users can use.

In order to use this environment as a kernel for Jupyter notebooks, you can use `ipykernel` as shown in the next cell. Afterwards, reload your browser and select the new kernel for the quality assurance notebook.

In [None]:
%%bash
#The following line activates the source for working in a shell.
#source activate /work/bm0021/conda-envs/quality-assurance
#
#The following line installs a jupyter kernel for the conda environment
#python -m ipykernel install --user --name $kernelname --display-name="$kernelname"

By default, shells started inside a kernel do **not** inherit the environment of the kernel. As a result, the PrePARE executable is not found.

This can be changed either with a helper script for the kernel or by prepending the directory of the kernel's Python executable to `PATH` at the top of the notebook, as follows:

In [None]:
import sys
import os
newpath=f"{os.sep.join(sys.executable.split(os.sep)[:-1])}:{os.environ['PATH']}"
os.environ['PATH']=newpath
pp=!which PrePARE
pp=pp[0]
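
The same lookup can be done in pure Python, which also works outside IPython. The following is a minimal sketch, assuming the executable lives next to the kernel's Python interpreter; `find_in_kernel_env` is a hypothetical helper name, not part of PrePARE.

```python
import os
import shutil
import sys

def find_in_kernel_env(name):
    """Return the full path of `name`, searching the kernel's bin directory first."""
    kernel_bin = os.path.dirname(sys.executable)
    search_path = os.pathsep.join([kernel_bin, os.environ.get("PATH", "")])
    return shutil.which(name, path=search_path)

prepare_path = find_in_kernel_env("PrePARE")
if prepare_path is None:
    print("PrePARE not found; check that the kernel uses the intended environment")
else:
    print(prepare_path)
```

Checking for `None` early gives a clearer message than a failing shell command later in the notebook.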

We also import some useful packages:

In [None]:
# copy2 copies a file together with its metadata
from shutil import copy2
# tqdm gives a progress bar for for loops
from tqdm import tqdm
import subprocess

Since the data standard evolves over time, we need to find the matching version for the datasets to be tested. For that, we use `git` to check out the corresponding version of the data standard tables, named [cmip6-cmor-tables](https://github.com/PCMDI/cmip6-cmor-tables). You can clone the tables repository via:

In [None]:
import git 
import re
# The following clones the cmip6 cmor tables if not available:
working_path="./"
cmip6_cmor_tables_url="https://github.com/PCMDI/cmip6-cmor-tables.git"
if "cmip6-cmor-tables" not in os.listdir(working_path):
    git.Git(working_path).clone(cmip6_cmor_tables_url)

One table in the tables repository contains only the global attributes and no information about the variables: `CMIP6_CV.json`, where 'CV' stands for Controlled Vocabulary. In contrast to the tables that contain variables, only the most recent version of the global attributes table is valid, because this file is rarely changed but rather extended. Whenever we check out a different version of the tables repository, we need to copy the recent global attributes CV into that version. Therefore, we copy this CV to a safe place named `recentCV`.

In [None]:
recentCV = working_path+"CMIP6-CV-20210419.json"
copy2(working_path+"cmip6-cmor-tables/Tables/CMIP6_CV.json", recentCV)

### Settings

The following variables are important for PrePARE and will be defined:

- `logChunk` will hold the results of PrePARE
- `cmip6-cmor-table-path` is the directory for the input tables
- `exec` is the executable which we will run in bash

In [None]:
prepareSetting = {
    "exec" : pp,
    #"logChunk":"/mnt/lustre01/work/bm0021/prepare-test/",
    "logChunk":"prepare-test",
    "cmip6-cmor-table-path" : working_path+"cmip6-cmor-tables/Tables"
}
!mkdir -p {prepareSetting["logChunk"]}

In [None]:
prepareSetting["exec"]

### Initialization

We load the git repository, reset it to the `master` branch and copy the most recent Controlled Vocabulary for required attributes.

In [None]:
g = git.Git(prepareSetting["cmip6-cmor-table-path"]) 
g.reset("--hard")
g.checkout("master")
copy2(prepareSetting["cmip6-cmor-table-path"]+"/CMIP6_CV.json", recentCV)

Assume we want to test the dataset `dset_id` in directory `trunk`:

In [None]:
trunk="/work/ik1017/CMIP6/data/"
dset_id="CMIP6.ScenarioMIP.DKRZ.MPI-ESM1-2-HR.ssp370.r1i1p1f1.Amon.tas.gn.v20190710"

In order to find out which data standard version was used to create the files under test, we retrieve the value of the global attribute `data_specs_version` from one file of the dataset. We store it under the key `data_specs_version`, together with the dataset path, in a *dictionary* entry for the `dset_id`.

In [None]:
dsets_to_test={dset_id :
               { "dset_path":trunk+'/'.join(dset_id.split('.')),
                 "data_specs_version":""
               }
             }
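
The dataset path is derived from the dataset ID by replacing the dots with path separators. The following is a minimal sketch of that mapping as a reusable function; `dset_path_from_id` is a hypothetical helper, and the example values are the ones used above.

```python
import os

def dset_path_from_id(trunk, dset_id):
    """Map a dot-separated dataset ID to its directory below `trunk`."""
    return os.path.join(trunk, *dset_id.split("."))

trunk = "/work/ik1017/CMIP6/data/"
dset_id = "CMIP6.ScenarioMIP.DKRZ.MPI-ESM1-2-HR.ssp370.r1i1p1f1.Amon.tas.gn.v20190710"
print(dset_path_from_id(trunk, dset_id))
```

Unlike plain string concatenation, `os.path.join` also gives a valid path when `trunk` has no trailing slash.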

The function `addSpecs` retrieves the specs attribute with the shell tool `ncdump -h`, which prints the header of a file including all attributes.

In [None]:
def addSpecs(entry):
    # Collect the regular files of the dataset directory;
    # return "" if the directory cannot be read or holds no files
    try:
        fileinpath = [os.path.join(entry["dset_path"],f) 
                      for f in os.listdir(entry["dset_path"]) 
                      if os.path.isfile(
                          os.path.join(entry["dset_path"],f)
                      )]
    except OSError:
        return ""
    if not fileinpath:
        return ""
    print(fileinpath)
    # Read data_specs_version from the header of the first file (IPython shell escape)
    dsv = !ncdump -h {fileinpath[0]} | grep data_specs_version | cut -d '"' -f 2
    dsv = ''.join(dsv)
    return dsv
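
Outside IPython, the `!` shell escape is not available. The following is a minimal sketch of the same lookup with `subprocess` and a regular expression, assuming `ncdump` is on the `PATH` and that the header contains a line of the form `:data_specs_version = "..." ;`. The helper names and the sample header are illustrative only.

```python
import re
import subprocess

def parse_data_specs_version(header_text):
    """Extract the data_specs_version value from `ncdump -h` output ("" if absent)."""
    match = re.search(r':data_specs_version\s*=\s*"([^"]*)"', header_text)
    return match.group(1) if match else ""

def read_data_specs_version(nc_file):
    """Run `ncdump -h` on one file and parse its data_specs_version attribute."""
    result = subprocess.run(["ncdump", "-h", nc_file],
                            capture_output=True, text=True, check=False)
    return parse_data_specs_version(result.stdout)

# Illustrative header fragment, not taken from a real file
sample_header = '\t\t:data_specs_version = "01.00.30" ;\n\t\t:frequency = "mon" ;\n'
print(parse_data_specs_version(sample_header))
```

Separating the parsing from the `ncdump` call makes it possible to test the parsing without any NetCDF file at hand.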

Now we apply it to all datasets in the `dsets_to_test` dictionary:

In [None]:
for dset, entry in dsets_to_test.items():
    print(dset)
    entry["data_specs_version"] = addSpecs(entry)

#### Retrieving all versions of the cmip6-cmor-table repository

We use the `tags` of the version releases and reformat their values to conform to the `data_specs_version`.

In [None]:
tags = reversed(g.tag("-n").split("\n"))
tagdict = {"data_specs_versions":[]}           
for tag in tags :
    tl = tag.split(" ", 1)[0]
    tllen = len(tl.split("."))
    if tllen > 3 :
        continue
    dsvnumber = tl.split(".")[tllen-1]
    dsvnumber = "".join(filter(str.isdigit, dsvnumber))
    dsv = "['01.00."+dsvnumber+"']"
    if dsv not in tagdict["data_specs_versions"] :
        tagdict["data_specs_versions"].append(dsv)
        tagdict[dsv]={"tag_label":tl,
                      "description":tag.split(" ",1)[1]}

In [None]:
print(tagdict['data_specs_versions'])
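
The conversion from a tag label to a `data_specs_version` can be isolated in a small function, which makes it easier to test. The following is a minimal sketch that mirrors the loop above but returns the plain version string without the surrounding `['...']`; `tag_to_dsv` and the example labels are hypothetical.

```python
def tag_to_dsv(tag_label):
    """Convert a release tag label to a data_specs_version string, or None if skipped."""
    parts = tag_label.split(".")
    if len(parts) > 3:
        return None  # labels with more than three components are ignored, as above
    # Keep only the digits of the last component
    digits = "".join(filter(str.isdigit, parts[-1]))
    return "01.00." + digits

# Hypothetical labels for illustration
for label in ["6.9.33", "6.2.0.5"]:
    print(label, "->", tag_to_dsv(label))
```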

### Application

We loop over the datasets to be checked.
Note that for many different datasets from different *sources*, it might be helpful to loop over the different `data_specs_version`s instead, so that we check out each `cmip6-cmor-tables` repository version only once.
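
Grouping the datasets by version first is one way to do that. The following is a minimal sketch, assuming each entry carries a `data_specs_version` key as in `dsets_to_test`; the function name and the sample IDs are hypothetical.

```python
from collections import defaultdict

def group_by_specs_version(dsets):
    """Return {data_specs_version: [dset_id, ...]} for a dsets_to_test-like dict."""
    groups = defaultdict(list)
    for dset_id, atts in dsets.items():
        groups[atts["data_specs_version"]].append(dset_id)
    return dict(groups)

# Invented sample entries
sample = {
    "dset.a": {"data_specs_version": "01.00.30"},
    "dset.b": {"data_specs_version": "01.00.27"},
    "dset.c": {"data_specs_version": "01.00.30"},
}
for version, ids in group_by_specs_version(sample).items():
    print(version, ids)
```

Each table version then needs to be checked out only once per group.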

For the `PrePARE` run itself, we define the function `checkSubset` where:

- We skip datasets for which we do not have a corresponding table version
- We define a unique `logPath` for each dataset to be tested from the `logChunk`, the `data_specs_version` and the `dset_id`. PrePARE creates missing directories itself, which we exploit. If the directory already contains data, we skip the test to avoid duplication.
- We check out the matching cmip6-cmor tables and overwrite the CV with the most recent one saved at the beginning of this notebook.
- Run PrePARE with 8 parallel processes.

In [None]:
def checkSubset(dset_id, dsetatts):
    print(dsetatts)
    dsv = dsetatts["data_specs_version"]
    # Keys in tagdict are stored in the form "['01.00.NN']"
    dsv_key = "['" + dsv + "']"
    if dsv_key not in tagdict["data_specs_versions"] :
        print("No matching tag for data_specs_version {}".format(dsv))
        return
    logPath = prepareSetting["logChunk"] + "/" + dsv.split('.')[2].split("'")[0] + "/" + dset_id
    if os.path.exists(logPath) and len(os.listdir(logPath)) != 0:
        return
    tag2checkout = tagdict[dsv_key]["tag_label"]
    g.reset("--hard")
    g.checkout(tag2checkout)
    copy2(recentCV, prepareSetting["cmip6-cmor-table-path"]+"/CMIP6_CV.json")
    #
    a = subprocess.run("{0} -l {1} --all --table-path {2} {3}".format(
                           prepareSetting["exec"],
                           logPath,
                           prepareSetting["cmip6-cmor-table-path"],
                           dsetatts["dset_path"]),
                       capture_output=True, shell=True)
    print(a)
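
Passing the command as a list avoids `shell=True` and quoting problems with unusual paths. The following is a minimal sketch using the same options as above (`-l`, `--all`, `--table-path`); `build_prepare_cmd` and `run_prepare` are hypothetical helpers, and the example paths are placeholders.

```python
import subprocess

def build_prepare_cmd(executable, log_path, table_path, dset_path):
    """Assemble the PrePARE call as an argument list (no shell involved)."""
    return [executable, "-l", log_path, "--all",
            "--table-path", table_path, dset_path]

def run_prepare(cmd):
    """Run the command and return the CompletedProcess with captured text output."""
    return subprocess.run(cmd, capture_output=True, text=True)

# Placeholder paths for illustration; run_prepare(cmd) would execute the check
cmd = build_prepare_cmd("PrePARE", "prepare-test/30/some.dset.id",
                        "./cmip6-cmor-tables/Tables", "/path/to/dataset")
print(cmd)
```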

In [None]:
!rm -r prepare-test30

In [None]:
for dset_id, dsetatts in dsets_to_test.items() :
    checkSubset(dset_id, dsetatts)

### Results

Since we let PrePARE write a logfile for each dataset, we have to collect the results to get an overview.
Each logfile starts with a summary of
- how many files were scanned
- how many files failed

If 0 files failed, the dataset (assuming one logfile per dataset) has passed the checks. The remaining lines do not follow a clear format, so we search them for error keywords. We distinguish between two error categories. The maximum severity of the errors, `max_severity`, is updated with every new match of an error keyword.

- Critical errors (labelled `Major Issue`)
    - if the filename or filepath does not conform to the data standard
    - if the data structure could not be parsed
    - are identified by the error keywords `filename`, `not understood`, `SKIPPED`, `Permission denied`
- Minor issues
    - if a value of a required global attribute could not be found.
    - are identified by the error keywords `CV FAIL`, `Warning`

In [None]:
errorSeverity=["Passed", "Minor Issue", "Major Issue"]
parsedict={"meta": ["filename", "creation_date", "dset_id", "specs_version"],
           # raw strings keep the regex escapes (\d, \() intact
           "filenoDict":{"checked": r'files scanned: (\d+)',
                        "failed": r'with error\(s\): (\d+)'
                       },
           "errorDict":{"filename": 2,
                        "Warning" : 1,
                        "CV FAIL" : 1,
                        "Permission denied" : 2,
                        "not understood" : 2,
                        "SKIPPED" : 2},
          }

We subdivide the parsing into two functions, `parse_file` and `collect_errors`. `collect_errors` is executed only if `parse_file` finds failed files. As an argument, we pass not just the path to the logfile but a dictionary that is filled with all metadata needed to assess the PrePARE results.

In [None]:
def collect_errors(dset_entry) :
    errors=[]
    max_severity=0
    # Scan every line of the logfile for the known error keywords
    with open(dset_entry["logfile_name"]) as logfile:
        for line in logfile:
            for errorKeyword, severity in parsedict["errorDict"].items() :
                if re.search(errorKeyword, line):
                    errors.append(errorKeyword)
                    max_severity=max(max_severity, int(severity))
    dset_entry["errors"]=tuple(errors)
    dset_entry["max_severity"]=max_severity
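
The severity logic can be tried without a logfile. The following is a minimal sketch that applies the keyword table to a list of lines; `classify` is a hypothetical helper, and the sample lines are made up for illustration and do not reproduce real PrePARE output.

```python
import re

ERROR_KEYWORDS = {"filename": 2, "Warning": 1, "CV FAIL": 1,
                  "Permission denied": 2, "not understood": 2, "SKIPPED": 2}
SEVERITY_LABELS = ["Passed", "Minor Issue", "Major Issue"]

def classify(lines, keywords=ERROR_KEYWORDS):
    """Return (matched keywords, maximum severity) for an iterable of log lines."""
    errors = []
    max_severity = 0
    for line in lines:
        for keyword, severity in keywords.items():
            # re.escape treats the keyword as literal text (an assumption of this sketch)
            if re.search(re.escape(keyword), line):
                errors.append(keyword)
                max_severity = max(max_severity, severity)
    return tuple(errors), max_severity

# Invented sample lines
sample_lines = ["Warning: something unusual", "CV FAIL for some attribute"]
errors, severity = classify(sample_lines)
print(errors, SEVERITY_LABELS[severity])
```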

In [None]:
def parse_file(dset_entry):
    checkedFiles=[]
    failedFiles=[]
    with open(dset_entry["logfile_name"]) as logfile:
        for line in logfile:
            match = re.search(parsedict["filenoDict"]["checked"], line)
            if match:
                checkedFiles.append(match.group(1))
            match = re.search(parsedict["filenoDict"]["failed"], line)
            if match:
                failedFiles.append(match.group(1))
    if not checkedFiles or not failedFiles :
        # Without the summary lines the counts cannot be determined
        print(dset_entry["logfile_name"], checkedFiles, failedFiles)
        return
    dset_entry["checked"]=int(checkedFiles[0])
    dset_entry["failed"]=int(failedFiles[0])
    dset_entry["passed"]=dset_entry["checked"]-dset_entry["failed"]
    if dset_entry["failed"] != 0 :
        collect_errors(dset_entry)
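
The two summary patterns can be checked in isolation. The following is a minimal sketch; `parse_counts` is a hypothetical helper, and the sample lines are invented to match the regular expressions and may differ from the actual wording of a PrePARE log.

```python
import re

CHECKED = re.compile(r"files scanned: (\d+)")
FAILED = re.compile(r"with error\(s\): (\d+)")

def parse_counts(lines):
    """Return (checked, failed) from the first matching lines, or None if one is missing."""
    checked = None
    failed = None
    for line in lines:
        if checked is None:
            match = CHECKED.search(line)
            if match:
                checked = int(match.group(1))
        if failed is None:
            match = FAILED.search(line)
            if match:
                failed = int(match.group(1))
    if checked is None or failed is None:
        return None
    return checked, failed

# Invented sample lines that merely contain the expected phrases
sample = ["Number of files scanned: 12", "Number of files with error(s): 3"]
print(parse_counts(sample))
```

Returning `None` for an incomplete summary lets the caller decide how to handle logfiles without counts.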

We finally collect all results in a dictionary `prepare_dict` where the `dset_id`s are the keys. For that, we loop over all logfiles.

In [None]:
prepare_dict = {}
specs_paths=os.listdir(prepareSetting["logChunk"])
for specs_path in tqdm(specs_paths):
    for dirpath, dirnames, logfile_names in os.walk(os.path.join(prepareSetting["logChunk"], specs_path)):
        for logfile_name in logfile_names :
            dset_entry = {"logfile_name":os.path.join(dirpath, logfile_name),
                          "creation_date":logfile_name.split(".")[0].split("-")[1],
                          "dset_id":dirpath[len(os.path.join(prepareSetting["logChunk"], specs_path))+1:],
                          "specs_version": "01.00."+specs_path}

            parse_file(dset_entry)
            prepare_dict[dset_entry["dset_id"]]=dset_entry

In [None]:
print(prepare_dict)
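
For an overview, the per-dataset results can be condensed into counts per severity label. The following is a minimal sketch, assuming entries shaped like those in `prepare_dict`; entries without a `max_severity` key (no failed files) are counted as passed, and the sample data is invented.

```python
from collections import Counter

SEVERITY_LABELS = ["Passed", "Minor Issue", "Major Issue"]

def summarize(results):
    """Count datasets per severity label for a prepare_dict-like dictionary."""
    counts = Counter()
    for entry in results.values():
        counts[SEVERITY_LABELS[entry.get("max_severity", 0)]] += 1
    return dict(counts)

# Invented sample results
sample_results = {
    "dset.a": {"checked": 3, "failed": 0},
    "dset.b": {"checked": 2, "failed": 1, "max_severity": 1},
    "dset.c": {"checked": 1, "failed": 1, "max_severity": 2},
}
print(summarize(sample_results))
```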