The DKRZ CMIP Data Pool#
This is a beginner-level demonstration notebook that introduces you to the Data Pool at DKRZ. Based on the example of the recent phase 6 of the Coupled Model Intercomparison Project (CMIP6), you will learn
how you benefit from the CMIP Data Pool (CDP)
how to approach CMIP data
how to use the python packages intake-esm, xarray and pandas to investigate the CMIP Data Pool
This notebook can be executed on DKRZ's jupyterhub platform. For a detailed introduction into jupyterhub and intake, we recommend the DKRZ tech talks
jupyterhub by Sofiane Bendoukha
intake by Aaron Spring
Customizing the code inside, however, only requires basic python knowledge.
Introduction#
The Scientific Steering Committee has thankfully granted 5 PB of disk space on the Lustre file system for the CMIP Data Pool for 2023. DKRZ has run and maintained this common storage place since 2016.
The DKRZ CMIP data pool contains often-needed flagship collections of climate model data, is hosted as part of the DKRZ data infrastructure and supports scientists in high-volume climate data collection, access and processing.
The notebook sources for the doc pages are available in this gitlab-repo
Important news and updates will be announced
on the DKRZ user portal
via a mailing list. Subscribe at cmip-data-poolATlists.dkrz.de
Highlighted CDP climate model data collections are:
CMIP6: As of May 2021, DKRZ provides Europe's largest data pool, with an amount of 4 PB for the recent phase of the Coupled Model Intercomparison Project
CORDEX: The size of data for the Coordinated Regional Downscaling Experiment is about 600TB over different projects.
CMIP5: The fifth phase of CMIP.
An example of a project which is also in the data pool, but not included in the term CMIP6:
ERA5: Reanalysis data from the European Centre for Medium-Range Weather Forecasts, produced by re-analysing and homogenising observation data.
from IPython.display import HTML, display, Markdown, IFrame
display(Markdown("Time series of three different data pool disk space measures. DKRZ has published about 1.5 PB, 2.5 PB are replicated data from other data nodes. An average CMIP6 dataset contains about 5 files and covers 4GB."))
IFrame(src="https://swift.dkrz.de/v1/dkrz_a44962e3ba914c309a7421573a6949a6/Pool-Statistics/pool-timeseries-hvplot.html",width="900",height="550",frameborder="0")
display(Markdown("We develop, prepare and provide [jupyter notebook demonstrations](https://gitlab.dkrz.de/data-infrastructure-services/tutorials-and-use-cases) <br> "
"- as tutorials for software packages and applications *starting from scratch* </br>"
"- for more frequent use cases like the plot of `tas` of one member of two experiments and simulated by the German ESMs."))
IFrame(src="https://swift.dkrz.de/v1/dkrz_a44962e3ba914c309a7421573a6949a6/plots/globalmean-yearlymean-tas.html",width="1000",height="650",frameborder="0")
Why do we host the CDP?#
The key benefit of the data pool is that the data is available on Lustre (/work), so that all DKRZ users with a current account have access. There is less need for local copies or data downloads.
Where can I find the data pool?#
The data pool can be accessed from different portals.
Server-side on the file system, e.g. under
/pool/data/CMIP6/data
All Levante users with a current account have permission to do that.
This is the fastest way to work with the data.
# Browse with Linux commands
ls /pool/data/CMIP6/data/ -x
echo ""
# For which MIPs did MPI-ESM1-2-XR produce data?
find /pool/data/CMIP6/data/ -maxdepth 3 -name MPI-ESM1-2-XR -type d
# Use the FreVA CMIP-ESMVal tool
module load cmip6-dicad/1.0
freva --databrowser --help
Web-based from remote:
Published data via the fail-safe Earth System Grid Federation data portal
Partly available in the Copernicus Climate Data Store
Regularly updated intake-esm catalogs made publicly available in the cloud or in
/pool/data/Catalogs
Understanding CMIP6 data#
The goal of CMIP6
In order to evaluate and compare climate models, a globally organized intercomparison project is periodically conducted. CMIP6 tackles three major questions:
How does the Earth system respond to forcing?
What are the origins and consequences of systematic model biases?
How can we assess future climate changes given internal climate variability, predictability, uncertainties and scenarios?
Metadata: Required Attributes and Controlled Vocabularies#
CDP data is self-descriptive as it contains extensive and controlled metadata. This metadata is prepared in the search facets of the data portals and catalogs.
Besides the technical requirements, the CMIP data standard defines required attributes in so-called Controlled Vocabularies (CVs). While some values are predefined, models and institutions have to be registered to become valid values of the corresponding attributes. For many attributes, both a short form with _id and a longer description exist.
Important required attributes:
activity_id: A CMIP6-endorsed MIP that investigates a specific research question. It defines experiments and requests data for them.
source_id: An ID for the Earth System Model used to produce the data.
experiment_id: The experiment which was conducted by the source_id.
member_id: The ensemble simulation member of the experiment_id. All members should be statistically equal.
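These required attributes appear as global attributes of every CMIP6 file. As a minimal sketch, the snippet below builds an in-memory xarray dataset carrying such attributes (the values are illustrative examples, not data read from the pool) and shows how they would be inspected via ds.attrs after opening a real file:

```python
import xarray as xr

# Illustrative only: example CV attribute values attached to an empty dataset.
ds = xr.Dataset(
    attrs={
        "activity_id": "CMIP",
        "source_id": "MPI-ESM1-2-HR",
        "experiment_id": "historical",
        "member_id": "r1i1p1f1",
    }
)

# For a real file, xr.open_dataset(path) exposes the same attributes:
print(ds.attrs["source_id"], ds.attrs["experiment_id"])
```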
Investigating the CMIP6 data pool with intake-esm#
Features
display catalogs as clearly structured tables inside jupyter notebooks for easy investigation
import intake
# The catalog is also available in the cloud: https://www.dkrz.de/s/intake
poolpath = "/pool/data/Catalogs/dkrz_cmip6_disk.json"
col = intake.open_esm_datastore(poolpath)
col.df.head()
Features
browse through the catalog and select your data without being on the pool file system
A pythonic, reproducible alternative to complex find commands or GUI searches. No need to deal with file systems and file names.
tas = col.search(experiment_id="historical", source_id="MPI-ESM1-2-HR", variable_id="tas", table_id="Amon", member_id="r1i1p1f1")
tas
Features
open climate data in an analysis-ready dictionary of xarray datasets
Forget about temporary merging and reformatting steps!
tas.to_dataset_dict()
Intake best practices:#
Intake can make your scripts reusable.
Instead of working with local copies or edited versions of files, always start from a globally defined catalog which everyone can access.
Save the subset of the catalog which you work on as a new catalog instead of a subset of files.
Check for new ingests by just repeating your script - it will open the most recent catalog.
Only load datasets with to_dataset_dict into xarray which do not exceed your memory limits.
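The "save your subset as a new catalog" practice can be sketched with plain pandas. The column names and file paths below are made up for illustration; intake-esm itself also offers a serialize method for writing catalog subsets, see its documentation:

```python
import pandas as pd

# A toy stand-in for the catalog's DataFrame (hypothetical columns/paths).
df = pd.DataFrame({
    "source_id": ["MPI-ESM1-2-HR", "MPI-ESM1-2-HR", "EC-Earth3"],
    "experiment_id": ["historical", "ssp585", "historical"],
    "uri": ["/pool/data/CMIP6/f1.nc", "/pool/data/CMIP6/f2.nc",
            "/pool/data/CMIP6/f3.nc"],
})

# Select the subset you actually work on ...
subset = df[(df["source_id"] == "MPI-ESM1-2-HR")
            & (df["experiment_id"] == "historical")]

# ... and persist the *catalog*, not copies of the data files.
subset.to_csv("my_subset_catalog.csv", index=False)
```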
Let's get an overview of the CMIP6 data pool by
finding the number of unique values of attributes
grouping and plotting the names and sizes of different entries
The resulting statistics show the percentage of file numbers.
unique_activities = col.unique("activity_id")
print(list(unique_activities["activity_id"].values()))
def pieplot(gbyelem):
    # Group by the element, sort and select the ten largest entries
    size = col.df.groupby([gbyelem]).size().sort_values(ascending=False)
    size10 = size.nlargest(10)
    # Sum up all other entries as the 10th entry
    size10[9] = sum(size[9:])
    size10.rename(index={size10.index.values[9]: 'all other'}, inplace=True)
    # Return a pie plot
    return size10.plot.pie(figsize=(18, 8), ylabel='', autopct='%.2f', fontsize=16)
pieplot("activity_id")
unique_sources = col.unique("source_id")
print("Number of unique earth system models in the CMIP6 data pool: "+str(list(unique_sources["source_id"].values())[0]))
pieplot("source_id")
unique_members=col.unique("member_id")
list(unique_members["member_id"].values())[1][0:3]
Data Reference Syntax#
An atomic dataset contains all files which cover the entire time span of a single variable of a single simulation. One dataset can therefore consist of multiple files.
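The "one dataset, many files" idea can be sketched with xarray: time chunks that are stored as separate files are concatenated along the time dimension. The data below is synthetic; on real files you would use xarray.open_mfdataset instead:

```python
import numpy as np
import xarray as xr

# Two synthetic "files" covering consecutive time spans of the same variable.
part1 = xr.Dataset({"tas": ("time", np.arange(3.0))}, coords={"time": [0, 1, 2]})
part2 = xr.Dataset({"tas": ("time", np.arange(3.0))}, coords={"time": [3, 4, 5]})

# Concatenating rebuilds the atomic dataset over the entire time span.
ds = xr.concat([part1, part2], dim="time")
print(ds.sizes["time"])
```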
The Data Reference Syntax (DRS) is a set of required attributes which uniquely identify and describe a dataset. The DRS usually includes all attributes used in the path template, so that both terms are used synonymously. For CMIP6, the DRS elements are arranged into a hierarchical path template:
CMIP6: mip_era/activity_id/institution_id/source_id/experiment_id/member_id/table_id/variable_id/grid_label/version
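The template can also be used to decode a path programmatically. The helper below is hypothetical (not part of any package) and assumes a path relative to the archive root; the version value is an illustrative example:

```python
# DRS elements in path order, as defined by the CMIP6 template above.
DRS_TEMPLATE = [
    "mip_era", "activity_id", "institution_id", "source_id",
    "experiment_id", "member_id", "table_id", "variable_id",
    "grid_label", "version",
]

def parse_drs(relpath):
    """Map the components of an archive-relative dataset path to DRS names."""
    return dict(zip(DRS_TEMPLATE, relpath.strip("/").split("/")))

example = "CMIP6/CMIP/MPI-M/MPI-ESM1-2-HR/historical/r1i1p1f1/Amon/tas/gn/v20190710"
print(parse_drs(example)["source_id"])
```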
Be careful when browsing through the CMIP6 data tree!
Unique in the CMIP6 data hierarchy:
experiment_id (occurs in only one activity_id)
variable_id in table_id: both combined represent the CMIP variable
Only one version of a dataset should be published
# Searching for the MIP which defines the experiment 'historical':
cat = col.search(experiment_id="historical")
cat.unique("activity_id")
# Searching for all tables which contain the variable 'tas':
cat = col.search(variable_id="tas")
cat.unique("table_id")
Not unique in the CMIP6 data hierarchy:
institution_id for a given source_id + experiment_id (+ member_id)
No requirements for member_id
# Searching for all institution_ids which use the model 'MPI-ESM1-2-HR' to produce 'ssp585' results:
cat = col.search(source_id="MPI-ESM1-2-HR", experiment_id="ssp585")
cat.unique("institution_id")
# Searching for all experiment_ids produced with ESM 'EC-Earth3' and as ensemble member 'r1i1p1f1':
cat = col.search(source_id="EC-Earth3", member_id="r1i1p1f1")
cat.unique("experiment_id")
# Searching for all valid ensemble member_ids produced with ESM 'EC-Earth3' for experiment 'abrupt-4xCO2'
cat = col.search(source_id="EC-Earth3", experiment_id="abrupt-4xCO2")
cat.unique("member_id")
Do not search for institution_id, table_id and member_id unless you are sure about what you are doing. Instead, begin your search with experiment_id, source_id and variable_id.
How can I find the variables I need?#
Search for the matching standard_name.
Most of the data in the data pool complies with the Climate and Forecast (CF) conventions. These define standard_names, which have to be assigned to variables as a variable attribute inside the data. As a reliable description of the variable, the standard_name is a bridge to the shorter variable identifier in the data, the so-called short_name. This short name is saved in the data catalogs, which can be searched.
Search for corresponding short_names in the CMIP6 data request.
For air_temperature, e.g., you get many results. Multiple definitions for one "physical" variable like air_temperature exist in CMIP; most are specific diagnostics of that variable, like tasmin and tasmax. Sometimes, there is output for a specific level given as its own variable, e.g. ta500. This can be the case if not all levels are requested for a specific frequency.
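The relation between one standard_name and several short names can be illustrated with a small lookup table. The entries below only cover the air_temperature variants mentioned above and are not a complete data-request listing:

```python
# Illustrative, incomplete mapping from a CF standard_name to CMIP6 short names.
CF_TO_SHORT = {
    "air_temperature": {
        "tas": "near-surface air temperature",
        "tasmin": "daily minimum near-surface air temperature",
        "tasmax": "daily maximum near-surface air temperature",
        "ta500": "air temperature at the 500 hPa level",
    },
}

def variants_for(standard_name):
    """Return the known short names for a CF standard_name, sorted."""
    return sorted(CF_TO_SHORT.get(standard_name, {}))

print(variants_for("air_temperature"))
```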
Best practice in ESGF
Search for the fitting mip_table.
Each mip_table is a combination of requirements for an output variable_id, including:
frequency
time cell methods (average or instantaneous)
vertical level (e.g. interpolated on pressure levels)
grid
realm (e.g. atmosphere model output or ocean model output)
These requirements are set according to the interests of the MIPs. Variables with similar requirements are collected in one MIP table, which can be identified by table_id.
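As a sketch, a few well-known MIP tables and the requirements they bundle could be represented like this (illustrative and incomplete; consult the CMIP6 data request for the authoritative definitions):

```python
# Illustrative subset: table_id mapped to (some of) its bundled requirements.
MIP_TABLES = {
    "Amon": {"frequency": "mon", "realm": "atmos"},  # monthly atmosphere
    "Omon": {"frequency": "mon", "realm": "ocean"},  # monthly ocean
    "day":  {"frequency": "day", "realm": "atmos"},  # daily atmosphere
}

def tables_with_frequency(freq):
    """Return the table_ids in the sketch that request the given frequency."""
    return sorted(t for t, req in MIP_TABLES.items() if req["frequency"] == freq)

print(tables_with_frequency("mon"))
```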
The data infrastructure for the DKRZ CDP#
In order to tackle the challenges of data provision and dissemination for a 4 PB repository, a state-of-the-art data infrastructure has been developed around that pool. In the following, we highlight three aspects of the data workflow.
You benefit from the DKRZ CDP because
its data is standardized and quality controlled
it is a curated, updated, published and catalogued data repository
it prevents data duplication and downloading into local workspaces, which is inefficient, expensive and a waste of storage resources
Data quality#
CMIP6 data is only available in a common and reliable data format
No adaptions needed for output of specific models
Makes data interoperable, enabling evaluation software products such as ESMValTool
CMIP6 data was quality controlled with PrePARE before being published
CMIP6 data is transparent about occurring errors
Search the errata data base for the origins of suspicious analysis results
If you find an error, please inform the modeling group, either via the contact in the citation or, if available, via the contact attribute in the file.
Data publication#
Extended documentation of the conducted simulations is provided in the ES-Doc data base
Persistent Identifiers (PIDs) ensure long-term web access to dataset information
Citation information and DOIs for all published datasets are easily retrievable
One method to retrieve a citation from the data is via the attribute further_info_url:
import xarray
random_file=xarray.open_dataset(cat.df["uri"][0])
random_file.attrs["further_info_url"]
When using data provided in the framework of the DKRZ CMIP Data Pool as basis for a publication, we ask you to add the following text to the Acknowledgements-Section:
"We acknowledge the CMIP6 community for providing the climate model data, retained and globally distributed in the framework of the ESGF. The CMIP6 data of this study were replicated and made available for this study by the DKRZ."
Contacts#
Requests for replication and retraction: data-poolATdkrz.de
News and updates will be announced
on the new DKRZ User Portal
via the mailing list cmip-data-poolATlists.dkrz.de
This notebook is a collaboration effort by the DM Data Infrastructure team.
Thank you for your attention!