PrePARE#
The PrePARE software tool is provided by PCMDI (Program for Climate Model Diagnosis and Intercomparison) to verify that CMIP6 files conform to the CMIP6 data protocol. The CMIP6 data protocol comprises requirements set out in different documents published by the CMIP6 WIP (WGCM Infrastructure Panel).
The Data request contains variable specifications (frequency, cell methods, etc.).
The Model output requirements specify the data format, structure and content.
The CMIP6 metadata standard is defined in Attributes, DRS, file names, directory structure and CVs.
All participants have to be registered in the registry of allowed models.
These documents are translated into JSON-formatted Controlled Vocabularies and tables, named cmip6-cmor-tables, which PrePARE can read.
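For a first impression of these tables, one could inspect a single variable entry, for example for tas in the Amon table. This is a minimal sketch; it assumes the cmip6-cmor-tables repository has already been cloned as shown in the Preparation section below:
# Sketch: inspect the specification of the variable "tas" in the Amon table
# (assumes ./cmip6-cmor-tables has been cloned already, see below)
import json

with open("cmip6-cmor-tables/Tables/CMIP6_Amon.json") as f:
    amon_table = json.load(f)
# each variable entry lists e.g. frequency, cell_methods, units and dimensions
print(amon_table["variable_entry"]["tas"])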
PrePARE performs 10 different tests, which can be summarized as follows:
Check for invariant and conditionally required global attributes and their valid values.
Check whether file names and paths conform to the project’s Data Reference Syntax (DRS).
Check for required variable attributes.
Check coordinates: some variables are requested on specific coordinates, which need to be provided in the files in a compliant format.
In the following, we run PrePARE
for a subset of CMIP6 pool data.
Preparation#
We will use the PrePARE binary in a shell, wrapped by this Python notebook. We provide a conda environment which all Levante users can use.
In order to use this environment as a kernel for Jupyter notebooks, you can use ipykernel
as shown in the next cell. Afterwards, reload your browser and select the new kernel for the quality assurance notebook.
%%bash
#The following line activates the conda environment for working in a shell:
#source activate /work/bm0021/conda-envs/quality-assurance
#
#The following line installs a Jupyter kernel for the conda environment:
#python -m ipykernel install --user --name $kernelname --display-name="$kernelname"
By default, shells inside a kernel are not started from the environment of the kernel. That means the PrePARE executable is not found:
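A quick way to confirm this is to ask the shell for the executable; the following check is illustrative and returns nothing if PrePARE is not on the kernel's PATH:
# Returns an empty result if PrePARE is not on the kernel's PATH
!which PrePARE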
This can be changed either by using a helper script for the kernel or, as follows, directly at the top of a notebook.
import sys
import os
# Prepend the directory of the kernel's Python executable to the PATH
newpath=f"{os.sep.join(sys.executable.split(os.sep)[:-1])}:{os.environ['PATH']}"
os.environ['PATH']=newpath
# Now the PrePARE executable can be found
pp=!which PrePARE
pp=pp[0]
We also import some useful packages:
# copy2 copies a file including its metadata
from shutil import copy2
# tqdm provides a progress bar for loops
from tqdm import tqdm
import subprocess
Since the data standard evolves over time, we need to find the matching version for the datasets which should be tested. For that, we need git
to check out the corresponding version of the data standard tables, named cmip6-cmor-tables. You can clone the tables repository via:
import git
import re
# The following clones the cmip6 cmor tables if not available:
working_path="./"
cmip6_cmor_tables_url="https://github.com/PCMDI/cmip6-cmor-tables.git"
if "cmip6-cmor-tables" not in os.listdir(working_path):
git.Git(working_path).clone(cmip6_cmor_tables_url)
One table in the tables repository contains only the global attributes and no information about the variables: CMIP6_CV.json, where ‘CV’ stands for Controlled Vocabulary. In contrast to the tables which contain variables, only the most recent version of the global attributes table is valid. This is because this file is rarely changed but rather extended. Whenever we check out a different version of the tables repository, we need to copy the recent global attributes CV into that version. Therefore, we copy this CV to a safe place named recentCV.
recentCV = working_path+"CMIP6-CV-20210419.json"
copy2(working_path+"cmip6-cmor-tables/Tables/CMIP6_CV.json", recentCV)
'./CMIP6-CV-20210419.json'
Settings#
The following variables are important for PrePARE and will be defined:
logChunk will hold the results of PrePARE
cmip6-cmor-table-path is the directory for the input tables
exec is the executable which we will run in bash
prepareSetting = {
"exec" : pp,
#"logChunk":"/mnt/lustre01/work/bm0021/prepare-test/",
"logChunk":"prepare-test",
"cmip6-cmor-table-path" : working_path+"cmip6-cmor-tables/Tables"
}
!mkdir -p {prepareSetting["logChunk"]}
prepareSetting["exec"]
'/mnt/root_disk3/gitlab-runner/.conda/envs/mambaenv/bin/PrePARE'
Initialization#
We read in the dataset list, load the git repository and copy the most recent Controlled Vocabulary for required attributes.
g = git.Git(prepareSetting["cmip6-cmor-table-path"])
g.reset("--hard")
g.checkout("master")
copy2(prepareSetting["cmip6-cmor-table-path"]+"/CMIP6_CV.json", recentCV)
'./CMIP6-CV-20210419.json'
Assume we want to test the dataset dset_id in the directory trunk:
trunk="/work/ik1017/CMIP6/data/"
dset_id="CMIP6.ScenarioMIP.DKRZ.MPI-ESM1-2-HR.ssp370.r1i1p1f1.Amon.tas.gn.v20190710"
In order to find out which data standard version was used for the creation of the files to be tested, we retrieve the value of the global attribute data_specs_version from one file of the dataset. We assign a corresponding attribute data_specs_version to the dset_id and combine both in a dictionary.
dsets_to_test={dset_id :
{ "dset_path":trunk+'/'.join(dset_id.split('.')),
"data_specs_version":""
}
}
The function addSpecs retrieves the specs attribute by using the shell tool ncdump -h, which shows the header of a file including all attributes.
def addSpecs(entry):
    print([os.path.join(entry["dset_path"], f)
           for f in os.listdir(entry["dset_path"])
          ])
    try:
        fileinpath = [os.path.join(entry["dset_path"], f)
                      for f in os.listdir(entry["dset_path"])
                      if os.path.isfile(
                          os.path.join(entry["dset_path"], f)
                      )]
    except OSError:
        return ""
    # ncdump_exec="/sw/rhel6-x64/netcdf/netcdf_c-4.4.1.1-gcc48/bin/ncdump"
    dsv = !ncdump -h {fileinpath[0]} | grep data_specs_version | cut -d '"' -f 2
    dsv = ''.join(dsv)
    return dsv
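As an alternative to calling ncdump, the attribute could also be read with the netCDF4 Python library. The following is a minimal sketch; the helper addSpecs_nc is hypothetical and assumes the netCDF4 package is installed in the kernel environment:
# Sketch: read data_specs_version directly with the netCDF4 library
# (hypothetical helper, assumes the netCDF4 package is available)
import netCDF4

def addSpecs_nc(entry):
    ncfiles = [os.path.join(entry["dset_path"], f)
               for f in os.listdir(entry["dset_path"])
               if f.endswith(".nc")]
    if not ncfiles:
        return ""
    with netCDF4.Dataset(ncfiles[0]) as ds:
        # global attributes are exposed as Python attributes of the Dataset
        return getattr(ds, "data_specs_version", "")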
And now, we apply it to all datasets in the dsets_to_test dictionary:
for dset, entry in dsets_to_test.items():
    print(dset)
    entry["data_specs_version"] = addSpecs(entry)
CMIP6.ScenarioMIP.DKRZ.MPI-ESM1-2-HR.ssp370.r1i1p1f1.Amon.tas.gn.v20190710
['/work/ik1017/CMIP6/data/CMIP6/ScenarioMIP/DKRZ/MPI-ESM1-2-HR/ssp370/r1i1p1f1/Amon/tas/gn/v20190710/tas_Amon_MPI-ESM1-2-HR_ssp370_r1i1p1f1_gn_202001-202412.nc', '/work/ik1017/CMIP6/data/CMIP6/ScenarioMIP/DKRZ/MPI-ESM1-2-HR/ssp370/r1i1p1f1/Amon/tas/gn/v20190710/tas_Amon_MPI-ESM1-2-HR_ssp370_r1i1p1f1_gn_203501-203912.nc', '/work/ik1017/CMIP6/data/CMIP6/ScenarioMIP/DKRZ/MPI-ESM1-2-HR/ssp370/r1i1p1f1/Amon/tas/gn/v20190710/tas_Amon_MPI-ESM1-2-HR_ssp370_r1i1p1f1_gn_205001-205412.nc', '/work/ik1017/CMIP6/data/CMIP6/ScenarioMIP/DKRZ/MPI-ESM1-2-HR/ssp370/r1i1p1f1/Amon/tas/gn/v20190710/tas_Amon_MPI-ESM1-2-HR_ssp370_r1i1p1f1_gn_207001-207412.nc', '/work/ik1017/CMIP6/data/CMIP6/ScenarioMIP/DKRZ/MPI-ESM1-2-HR/ssp370/r1i1p1f1/Amon/tas/gn/v20190710/tas_Amon_MPI-ESM1-2-HR_ssp370_r1i1p1f1_gn_206501-206912.nc', '/work/ik1017/CMIP6/data/CMIP6/ScenarioMIP/DKRZ/MPI-ESM1-2-HR/ssp370/r1i1p1f1/Amon/tas/gn/v20190710/tas_Amon_MPI-ESM1-2-HR_ssp370_r1i1p1f1_gn_208001-208412.nc', '/work/ik1017/CMIP6/data/CMIP6/ScenarioMIP/DKRZ/MPI-ESM1-2-HR/ssp370/r1i1p1f1/Amon/tas/gn/v20190710/tas_Amon_MPI-ESM1-2-HR_ssp370_r1i1p1f1_gn_205501-205912.nc', '/work/ik1017/CMIP6/data/CMIP6/ScenarioMIP/DKRZ/MPI-ESM1-2-HR/ssp370/r1i1p1f1/Amon/tas/gn/v20190710/tas_Amon_MPI-ESM1-2-HR_ssp370_r1i1p1f1_gn_207501-207912.nc', '/work/ik1017/CMIP6/data/CMIP6/ScenarioMIP/DKRZ/MPI-ESM1-2-HR/ssp370/r1i1p1f1/Amon/tas/gn/v20190710/tas_Amon_MPI-ESM1-2-HR_ssp370_r1i1p1f1_gn_204501-204912.nc', '/work/ik1017/CMIP6/data/CMIP6/ScenarioMIP/DKRZ/MPI-ESM1-2-HR/ssp370/r1i1p1f1/Amon/tas/gn/v20190710/tas_Amon_MPI-ESM1-2-HR_ssp370_r1i1p1f1_gn_209501-209912.nc', '/work/ik1017/CMIP6/data/CMIP6/ScenarioMIP/DKRZ/MPI-ESM1-2-HR/ssp370/r1i1p1f1/Amon/tas/gn/v20190710/tas_Amon_MPI-ESM1-2-HR_ssp370_r1i1p1f1_gn_204001-204412.nc', '/work/ik1017/CMIP6/data/CMIP6/ScenarioMIP/DKRZ/MPI-ESM1-2-HR/ssp370/r1i1p1f1/Amon/tas/gn/v20190710/tas_Amon_MPI-ESM1-2-HR_ssp370_r1i1p1f1_gn_210001-210012.nc', '/work/ik1017/CMIP6/data/CMIP6/ScenarioMIP/DKRZ/MPI-ESM1-2-HR/ssp370/r1i1p1f1/Amon/tas/gn/v20190710/tas_Amon_MPI-ESM1-2-HR_ssp370_r1i1p1f1_gn_203001-203412.nc', '/work/ik1017/CMIP6/data/CMIP6/ScenarioMIP/DKRZ/MPI-ESM1-2-HR/ssp370/r1i1p1f1/Amon/tas/gn/v20190710/tas_Amon_MPI-ESM1-2-HR_ssp370_r1i1p1f1_gn_201501-201912.nc', '/work/ik1017/CMIP6/data/CMIP6/ScenarioMIP/DKRZ/MPI-ESM1-2-HR/ssp370/r1i1p1f1/Amon/tas/gn/v20190710/tas_Amon_MPI-ESM1-2-HR_ssp370_r1i1p1f1_gn_202501-202912.nc', '/work/ik1017/CMIP6/data/CMIP6/ScenarioMIP/DKRZ/MPI-ESM1-2-HR/ssp370/r1i1p1f1/Amon/tas/gn/v20190710/tas_Amon_MPI-ESM1-2-HR_ssp370_r1i1p1f1_gn_206001-206412.nc', '/work/ik1017/CMIP6/data/CMIP6/ScenarioMIP/DKRZ/MPI-ESM1-2-HR/ssp370/r1i1p1f1/Amon/tas/gn/v20190710/tas_Amon_MPI-ESM1-2-HR_ssp370_r1i1p1f1_gn_208501-208912.nc', '/work/ik1017/CMIP6/data/CMIP6/ScenarioMIP/DKRZ/MPI-ESM1-2-HR/ssp370/r1i1p1f1/Amon/tas/gn/v20190710/tas_Amon_MPI-ESM1-2-HR_ssp370_r1i1p1f1_gn_209001-209412.nc']
Retrieving all versions of the cmip6-cmor-tables repository#
We use the tags of the version releases and reformat their values so that they conform to the data_specs_version format.
# Tag labels come first on each line of `git tag -n`, the description follows
tags = reversed(g.tag("-n").split("\n"))
tagdict = {"data_specs_versions":[]}
for tag in tags :
    tl = tag.split(" ", 1)[0]
    tllen = len(tl.split("."))
    # skip tags that do not follow the release numbering scheme
    if tllen > 3 :
        continue
    # the last component of the tag label carries the data_specs_version number
    dsvnumber = tl.split(".")[tllen-1]
    dsvnumber = "".join(filter(str.isdigit, dsvnumber))
    dsv = "['01.00."+dsvnumber+"']"
    if dsv not in tagdict["data_specs_versions"] :
        tagdict["data_specs_versions"].append(dsv)
        tagdict[dsv]={"tag_label":tl,
                      "description":tag.split(" ",1)[1]}
print(tagdict['data_specs_versions'])
["['01.00.33']", "['01.00.32']", "['01.00.31']", "['01.00.30']", "['01.00.29']", "['01.00.28']", "['01.00.27']", "['01.00.24']", "['01.00.23']", "['01.00.22']", "['01.00.21']", "['01.00.20']", "['01.00.19']", "['01.00.18']", "['01.00.17']", "['01.00.16']", "['01.00.15']", "['01.00.14']", "['01.00.13']", "['01.00.12']", "['01.00.11']"]
Application#
We loop over the datasets to be checked.
Note that for many different datasets from different sources, it might be helpful to loop over the different data_specs_versions instead, so that we check out each cmip6-cmor-tables repository version only once (see the sketch below).
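A grouping step like the following sketch would enable such a loop; the helper group_by_specs is hypothetical and not used in the rest of this notebook:
# Hypothetical helper: group dataset entries by their data_specs_version
from collections import defaultdict

def group_by_specs(dsets):
    groups = defaultdict(dict)
    for dset_id, atts in dsets.items():
        groups[atts["data_specs_version"]][dset_id] = atts
    return groups

# for specs_version, group in group_by_specs(dsets_to_test).items():
#     # check out the matching cmip6-cmor-tables tag once,
#     # then run PrePARE for every dataset in `group`
#     ...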
For the PrePARE run itself, we define the function checkSubset, in which we
skip datasets for which we do not have a corresponding table version,
define a unique logPath for each dataset we are going to test, built from logChunk, the data_specs_version and the dset_id. PrePARE is able to create its own directories, which we exploit. If the log directory already contains data, we skip the test to avoid duplication,
check out the correct cmip6-cmor-tables version and overwrite the CV with the most recent one saved at the beginning of this script,
run PrePARE with 8 parallel processes.
def checkSubset(dset_id, dsetatts):
    print(dsetatts)
    if "['"+dsetatts["data_specs_version"]+"']" not in tagdict["data_specs_versions"] :
        print("No matching tag for data_specs_version {}".format(dsetatts["data_specs_version"]))
        return
    # e.g. prepare-test/30/<dset_id> for data_specs_version 01.00.30
    logPath=prepareSetting["logChunk"]+"/"+dsetatts["data_specs_version"].split('.')[2].split("'")[0]+"/"+dset_id
    if os.path.exists(logPath) and len(os.listdir(logPath)) != 0:
        return
    tag2checkout = tagdict["['"+dsetatts["data_specs_version"]+"']"]["tag_label"]
    g.reset("--hard")
    g.checkout(tag2checkout)
    copy2(recentCV, prepareSetting["cmip6-cmor-table-path"]+"/CMIP6_CV.json")
    #
    a = subprocess.run("{0} -l {1} --all --table-path {2} {3}".format(
            prepareSetting["exec"],
            logPath,
            prepareSetting["cmip6-cmor-table-path"],
            dsetatts["dset_path"]),
        capture_output=True, shell=True)
    print(a)
for dset_id, dsetatts in dsets_to_test.items() :
checkSubset(dset_id, dsetatts)
{'dset_path': '/work/ik1017/CMIP6/data/CMIP6/ScenarioMIP/DKRZ/MPI-ESM1-2-HR/ssp370/r1i1p1f1/Amon/tas/gn/v20190710', 'data_specs_version': '01.00.30'}
CompletedProcess(args='/mnt/root_disk3/gitlab-runner/.conda/envs/mambaenv/bin/PrePARE -l prepare-test/30/CMIP6.ScenarioMIP.DKRZ.MPI-ESM1-2-HR.ssp370.r1i1p1f1.Amon.tas.gn.v20190710 --all --table-path ./cmip6-cmor-tables/Tables /work/ik1017/CMIP6/data/CMIP6/ScenarioMIP/DKRZ/MPI-ESM1-2-HR/ssp370/r1i1p1f1/Amon/tas/gn/v20190710', returncode=0, stdout=b'\x1b[1;32m\rCheck netCDF file(s): \x1b[0m5% | 1/18 files\x1b[1;32m\rCheck netCDF file(s): \x1b[0m11% | 2/18 files\x1b[1;32m\rCheck netCDF file(s): \x1b[0m16% | 3/18 files\x1b[1;32m\rCheck netCDF file(s): \x1b[0m22% | 4/18 files\x1b[1;32m\rCheck netCDF file(s): \x1b[0m27% | 5/18 files\x1b[1;32m\rCheck netCDF file(s): \x1b[0m33% | 6/18 files\x1b[1;32m\rCheck netCDF file(s): \x1b[0m38% | 7/18 files\x1b[1;32m\rCheck netCDF file(s): \x1b[0m44% | 8/18 files\x1b[1;32m\rCheck netCDF file(s): \x1b[0m50% | 9/18 files\x1b[1;32m\rCheck netCDF file(s): \x1b[0m55% | 10/18 files\x1b[1;32m\rCheck netCDF file(s): \x1b[0m61% | 11/18 files\x1b[1;32m\rCheck netCDF file(s): \x1b[0m66% | 12/18 files\x1b[1;32m\rCheck netCDF file(s): \x1b[0m72% | 13/18 files\x1b[1;32m\rCheck netCDF file(s): \x1b[0m77% | 14/18 files\x1b[1;32m\rCheck netCDF file(s): \x1b[0m83% | 15/18 files\x1b[1;32m\rCheck netCDF file(s): \x1b[0m88% | 16/18 files\x1b[1;32m\rCheck netCDF file(s): \x1b[0m94% | 17/18 files\x1b[1;32m\rCheck netCDF file(s): \x1b[0m100% | 18/18 files\r\x1b[K\x1b[95m\nNumber of files scanned: 18\x1b[0m\x1b[1;32m\nNumber of file with error(s): 0\x1b[0m\n', stderr=b'')
Results#
As we let PrePARE write logfiles for each dataset, we have to collect the results to get an overview. Each logfile starts with a summary of
how many files were scanned
how many files had failed
If 0 files have failed, the dataset (given one logfile per dataset) has passed the checks. The subsequent lines do not follow a strict format, so we parse them for error keywords. We can distinguish between two error categories. The maximal severity of the errors, max_severity, is updated with every new match of an error keyword.
Critical errors
occur if the file name or file path does not conform to the data standard or if the data structure could not be parsed,
and are identified by the error keywords filename, not understood and SKIPPED.
Minor issues
occur if a value of a required global attribute could not be found,
and are identified by the error keyword CV FAIL.
errorSeverity=["Passed", "Minor Issue", "Major Issue"]
parsedict={"meta": ["filename", "creation_date", "dset_id", "specs_version"],
           "filenoDict":{"checked": r'files scanned: (\d+)',
                         "failed": r'with error\(s\): (\d+)'
                        },
           "errorDict":{"filename": 2,
                        "Warning" : 1,
                        "CV FAIL" : 1,
                        "Permission denied" : 2,
                        "not understood" : 2,
                        "SKIPPED" : 2},
          }
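As a small illustration of the severity lookup, consider the following purely hypothetical log line (not taken from an actual PrePARE log):
# Hypothetical log line for illustration only
sample_line = "CV FAIL: required attribute could not be found"
severity = max(
    (level for keyword, level in parsedict["errorDict"].items() if keyword in sample_line),
    default=0,
)
print(errorSeverity[severity])  # -> "Minor Issue"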
We subdivide the parsing into two functions, parse_file and collect_errors. collect_errors is executed if parse_file detects failed files. As an argument, we do not provide just the path to the logfile but rather a dictionary that will be filled with all the metadata needed to assess the PrePARE results.
def collect_errors(dset_entry) :
    errors=[]
    max_severity=0
    with open(dset_entry["logfile_name"]) as logfile:
        for line in logfile:
            for errorKeyword in parsedict["errorDict"].keys() :
                match = re.findall(errorKeyword, line)
                if match:
                    errors.append(errorKeyword)
                    max_severity=max(max_severity,int(parsedict["errorDict"][errorKeyword]))
    dset_entry["errors"]=tuple(errors)
    dset_entry["max_severity"]=max_severity
def parse_file(dset_entry):
    checkedFiles=[]
    failedFiles=[]
    with open(dset_entry["logfile_name"]) as logfile:
        for line in logfile:
            match = re.search(parsedict["filenoDict"]["checked"], line)
            if match:
                checkedFiles.append(''.join(match.group(1)))
            match = re.search(parsedict["filenoDict"]["failed"], line)
            if match:
                failedFiles.append(''.join(match.group(1)))
    if not checkedFiles or not failedFiles :
        print(dset_entry["logfile_name"], checkedFiles, failedFiles)
    dset_entry["checked"]=int(checkedFiles[0])
    dset_entry["failed"]=int(failedFiles[0])
    dset_entry["passed"]=dset_entry["checked"]-dset_entry["failed"]
    if dset_entry["failed"] != 0 :
        collect_errors(dset_entry)
We finally collect all results in a dictionary prepare_dict whose keys are the dset_ids. For that, we loop over all logfiles.
prepare_dict = {}
specs_paths=os.listdir(prepareSetting["logChunk"])
for specs_path in tqdm(specs_paths):
    for dirpath, dirnames, logfile_names in os.walk(os.path.join(prepareSetting["logChunk"], specs_path)):
        for logfile_name in logfile_names :
            dset_entry = {"logfile_name":os.path.join(dirpath, logfile_name),
                          "creation_date":logfile_name.split(".")[0].split("-")[1],
                          "dset_id":dirpath[len(os.path.join(prepareSetting["logChunk"], specs_path))+1:],
                          "specs_version": "01.00."+specs_path}
            parse_file(dset_entry)
            prepare_dict[dset_entry["dset_id"]]=dset_entry
0%| | 0/1 [00:00<?, ?it/s]
100%|██████████| 1/1 [00:00<00:00, 978.38it/s]
print(prepare_dict)
{'CMIP6.ScenarioMIP.DKRZ.MPI-ESM1-2-HR.ssp370.r1i1p1f1.Amon.tas.gn.v20190710': {'logfile_name': 'prepare-test/30/CMIP6.ScenarioMIP.DKRZ.MPI-ESM1-2-HR.ssp370.r1i1p1f1.Amon.tas.gn.v20190710/PrePARE-20230524-160258.log', 'creation_date': '20230524', 'dset_id': 'CMIP6.ScenarioMIP.DKRZ.MPI-ESM1-2-HR.ssp370.r1i1p1f1.Amon.tas.gn.v20190710', 'specs_version': '01.00.30', 'checked': 18, 'failed': 0, 'passed': 18}}
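To get a quick overview when many datasets have been checked, prepare_dict could be turned into a table, for example with pandas. This is a minimal sketch, assuming pandas is installed in the kernel environment; the status column uses the errorSeverity labels defined above:
# Sketch: summarize the collected results with pandas (assumes pandas is installed)
import pandas as pd

summary = pd.DataFrame.from_dict(prepare_dict, orient="index")
# Datasets without errors never get a "max_severity" entry, so add the column if missing
if "max_severity" not in summary.columns:
    summary["max_severity"] = 0
summary["status"] = summary["max_severity"].fillna(0).astype(int).map(lambda s: errorSeverity[s])
print(summary[["specs_version", "checked", "failed", "passed", "status"]])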