CMIP6 storage

CMIP6 storage#

with panel, pandas and hvplot

The primary publication of national Earth System Model data at DKRZ takes the largest part of the CMIP Data Pool (CDP). Most of the data have been produced within the national CMIP Project DICAD and in the compute project RZ988.

DKRZ supports modeling groups in all steps of the data wokflow from preparation to publication. In order to track and display the effort for this data workflow, we run automated scripts (cronjobs) which capture the extent of the final product: the disk space usage of these groups in the data pool and update it daily. The resulting statistics are uploaded into a public and freely available swift storage.

In the following, we create responsive bar plots with pandas, pandas and hvplot for statistical Key Performance Indicators of the CDP.

German contribution and publication#

Here we present you statistics of DICAD contributions to the CDP. Datasets which were

created as part of DICAD and
have been primarily published at the DKRZ ESGF Node

are considered.

The statisctis are computed by grouping the measures by:

source_id: Earth System Models (ESM)s which have contributed to the CDP.
institution_id: Institutions which have conducted and submitted model simulations to the CDP.
publication type: How much data has been published and replicated at DKRZ ESGF node.

import warnings
warnings.filterwarnings('ignore')
kpis=["size [TB]", "filenumber","datasets"]

import panel as pn
pn.extension("tabulator")
import pandas as pd
sourcesumdf = pd.read_csv("https://swift.dkrz.de/v1/dkrz_a44962e3ba914c309a7421573a6949a6/Pool-Statistics/mistral-cmip6-allocation-by-source.csv.gz").sort_values("size", ascending=False)
allinstdf = pd.read_csv("https://swift.dkrz.de/v1/dkrz_a44962e3ba914c309a7421573a6949a6/Pool-Statistics/mistral-cmip6-allocation-by-dicad-institutes.csv.gz").sort_values("size", ascending=False)
allreplicadf = pd.read_csv("https://swift.dkrz.de/v1/dkrz_a44962e3ba914c309a7421573a6949a6/Pool-Statistics/mistral-cmip6-allocation-by-publicationType.csv.gz").sort_values("size", ascending=False)

plot_group=grouped_df.get_group("institution_id").sort_values("filenumber", ascending=False)

plot_group

	source_id	size [TB]	filenumber	datasets	Key	Legend	institution_id	publicationType
Group
By institution_id	NaN	1178.140296	1350265.0	179205.0	institution_id	MPI-M	MPI-M	NaN
By institution_id	NaN	148.295859	497486.0	71831.0	institution_id	AWI	AWI	NaN
By institution_id	NaN	275.529866	226404.0	26804.0	institution_id	DKRZ	DKRZ	NaN
By institution_id	NaN	20.835685	25420.0	1069.0	institution_id	DWD	DWD	NaN

The German contribution to CMIP6 by the five sources of MPI-M and AWI comprises

1.6PB of data primary published at dkrz
more than 33% of the CMIP6 data pool
2Mio files or 250 000 datasets

Statistics for different source_id#

The file mistral-cmip6-allocation-by-source.csv.gz contains the results per source with an additional classification by experiment.

MPI-ESM1-2-HR: The high resolution version of the MPI-ESM1-2. CV-entry*, Citation example
MPI-ESM1-2-LR: The lower resolution version of the MPI-ESM1-2. CV-entry*, Citation example
AWI-CM-1-1-MR: CV-entry*, Citation example
AWI-ESM-1-1-LR: CV-entry*, Citation example
ICON-ESM-LR: CV-entry*, Citation example

* CV link to the registration in the official CMIP6 Controlled Vocabulay where all CMIP6 models had to register.

As soon as CMIP6 data from other ESMs like EMAC-2-53 is available, the lists will be expanded correspondingly.

Statistics for different institution_ids#

The file mistral-cmip6-allocation-by-dicad-institutes.csv.gz contains statistics grouped by institutes that have contributed to DICAD.

Statistics for different publication types#

The file mistral-cmip6-allocation-by-publicationType.csv.gz contains statistics grouped by institutes that have contributed to DICAD

published originals: Data which has been published first at the esgf-node at dkrz and is still valid and available.
retracted originals: Data which has been published first at the esgf-node at dkrz but has also been retracted afterwards.
published replicas: Data which has been copied to and published at dkrz and is still valid and available.
retracted replicas: Data which has been copied to and published at dkrz but has also been retracted afterwards.

tmplot

Cloud upload#

We use the swiftclient for the upload.