CMIP6 storage#
with panel, pandas and hvplot
The primary publication of national Earth System Model data at DKRZ accounts for the largest part of the CMIP Data Pool (CDP). Most of the data have been produced within the national CMIP project DICAD and in the compute project RZ988.
DKRZ supports modeling groups in all steps of the data workflow, from preparation to publication. To track and display the effort of this workflow, we run automated scripts (cronjobs) that capture the extent of the final product: the disk space these groups use in the data pool, updated daily. The resulting statistics are uploaded to a public and freely available Swift storage.
In the following, we create responsive bar plots with panel, pandas and hvplot for statistical Key Performance Indicators (KPIs) of the CDP.
kpis = ["size [TB]", "filenumber", "datasets"]
German contribution and publication#
Here we present statistics of DICAD contributions to the CDP. Datasets which were
created as part of DICAD and
have been primarily published at the DKRZ ESGF Node
are considered.
The statistics are computed by grouping the measures by:
source_id: Earth System Models (ESM)s which have contributed to the CDP.
institution_id: Institutions which have conducted and submitted model simulations to the CDP.
publication type: How much data has been published and replicated at DKRZ ESGF node.
import panel as pn
pn.extension("tabulator")
import pandas as pd
sourcesumdf = pd.read_csv("https://swift.dkrz.de/v1/dkrz_a44962e3ba914c309a7421573a6949a6/Pool-Statistics/mistral-cmip6-allocation-by-source.csv.gz").sort_values("size", ascending=False)
allinstdf = pd.read_csv("https://swift.dkrz.de/v1/dkrz_a44962e3ba914c309a7421573a6949a6/Pool-Statistics/mistral-cmip6-allocation-by-dicad-institutes.csv.gz").sort_values("size", ascending=False)
allreplicadf = pd.read_csv("https://swift.dkrz.de/v1/dkrz_a44962e3ba914c309a7421573a6949a6/Pool-Statistics/mistral-cmip6-allocation-by-publicationType.csv.gz").sort_values("size", ascending=False)
import intake
from pathlib import Path
import hvplot.pandas
from bokeh.models import NumeralTickFormatter
import pandas as pd
sourcesumdf["Group"]="By source_id"
sourcesumdf["Key"]="source_id"
sourcesumdf["Legend"]=sourcesumdf["source_id"]
allinstdf["Group"]="By institution_id"
allinstdf["Key"]="institution_id"
allinstdf["Legend"]=allinstdf["institution_id"]
allreplicadf["Group"]="By Publication Status"
allreplicadf["Key"]="publicationType"
allreplicadf["Legend"]=allreplicadf["publicationType"]
sourcesumdf=sourcesumdf.set_index("Group")
allinstdf=allinstdf.set_index("Group")
allreplicadf=allreplicadf.set_index("Group")
# DataFrame.append is deprecated; concatenate the three tables instead
plotdf = pd.concat([sourcesumdf, allinstdf, allreplicadf])
def create_plot(group, kpi):
    plot_group = grouped_df.get_group(group).sort_values(kpi, ascending=False)
    # Wide, grouped bar chart with a legend
    a = plot_group.hvplot.bar(
        y=kpi,
        ylabel=f"Sum of {kpi} in the CMIP6 Data Pool",
        xlabel="Group",
        by="Legend",
        grid=True,
        yformatter=NumeralTickFormatter(format='0,0'),
        title="",
        fontsize={'legend': "10%"},
        width=650,
        height=500,
        muted_alpha=0,
        fontscale=1.2,
    )
    # Narrow, stacked companion chart without a legend
    b = plot_group.hvplot.bar(
        y=kpi,
        ylabel="",
        xlabel="Group",
        by="Legend",
        stacked=True,
        grid=True,
        yformatter=NumeralTickFormatter(format='0,0'),
        title="",
        legend=False,
        fontsize={'legend': "10%"},
        width=150,
        height=500,
        muted_alpha=0,
        fontscale=1.2,
    )
    return a + b
plotdf=plotdf.rename(columns={"size":"size [TB]"})
grouped_df=plotdf.groupby(["Key"])
interact = pn.interact(create_plot, group=list(grouped_df.groups.keys()), kpi=kpis)
pn.Column(pn.Card(interact[0], title="Plots for different <i>groups and kpis</i>", background='WhiteSmoke'),
interact[1]
).embed()
The German contribution to CMIP6, from the five sources of MPI-M and AWI, comprises
1.6 PB of data primarily published at DKRZ
more than 33% of the CMIP6 data pool
2 million files in about 250,000 datasets
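The totals above can be recomputed by summing the KPI columns of the per-source table. A minimal sketch with a hypothetical stand-in for sourcesumdf (the real table is read from the Swift URL above; column names are assumed to match it):

```python
import pandas as pd

# Hypothetical stand-in for sourcesumdf with the KPI columns of the real file.
sourcesumdf = pd.DataFrame({
    "source_id": ["MPI-ESM1-2-HR", "MPI-ESM1-2-LR", "AWI-CM-1-1-MR"],
    "size": [900.0, 500.0, 200.0],            # TB
    "filenumber": [1_200_000, 600_000, 200_000],
    "datasets": [150_000, 80_000, 20_000],
})

# Sum each KPI over all sources to get pool-wide totals.
totals = sourcesumdf[["size", "filenumber", "datasets"]].sum()
print(totals["size"])   # 1600.0 TB = 1.6 PB for these illustrative numbers
```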
Statistics for different source_id#
The file mistral-cmip6-allocation-by-source.csv.gz
contains the results per source with an additional classification by experiment.
MPI-ESM1-2-HR: The high resolution version of the MPI-ESM1-2. CV-entry*, Citation example
MPI-ESM1-2-LR: The lower resolution version of the MPI-ESM1-2. CV-entry*, Citation example
AWI-CM-1-1-MR: CV-entry*, Citation example
AWI-ESM-1-1-LR: CV-entry*, Citation example
ICON-ESM-LR: CV-entry*, Citation example
* CV: link to the registration in the official CMIP6 Controlled Vocabulary, where all CMIP6 models had to register.
As soon as CMIP6 data from other ESMs like EMAC-2-53 becomes available, the lists will be expanded correspondingly.
tabsource=pn.widgets.Tabulator(sourcesumdf, height=200)
filenamesource, buttonsource = tabsource.download_menu(
text_kwargs={'name': 'Enter filename', 'value': 'mistral-cmip6-dicad-sources.csv', 'width':100, 'height':60},
button_kwargs={'name': 'Download table','width':100, 'height':60}
)
pn.Row(pn.Column(filenamesource,buttonsource),tabsource).embed()
Statistics for different institution_ids#
The file mistral-cmip6-allocation-by-dicad-institutes.csv.gz
contains statistics grouped by institutes that have contributed to DICAD.
tabinst=pn.widgets.Tabulator(allinstdf, height=200)
filenameinst, buttoninst = tabinst.download_menu(
text_kwargs={'name': 'Enter filename', 'value': 'mistral-cmip6-dicad-institutes.csv', 'width':100, 'height':60},
button_kwargs={'name': 'Download table','width':100, 'height':60}
)
pn.Row(pn.Column(filenameinst, buttoninst),tabinst).embed()
Statistics for different publication types#
The file mistral-cmip6-allocation-by-publicationType.csv.gz
contains statistics grouped by publication type:
published originals: Data which has been published first at the esgf-node at dkrz and is still valid and available.
retracted originals: Data which has been published first at the esgf-node at dkrz but has also been retracted afterwards.
published replicas: Data which has been copied to and published at dkrz and is still valid and available.
retracted replicas: Data which has been copied to and published at dkrz but has also been retracted afterwards.
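From this table one can, for example, derive the retracted share of the pool. A sketch with hypothetical numbers (the real allreplicadf comes from the Swift URL above; the column names are assumed):

```python
import pandas as pd

# Hypothetical stand-in for allreplicadf.
allreplicadf = pd.DataFrame({
    "publicationType": ["published originals", "retracted originals",
                        "published replicas", "retracted replicas"],
    "size": [1500.0, 100.0, 800.0, 50.0],   # TB, illustrative values
})

# Fraction of the pool volume whose publication type is "retracted ...".
retracted = allreplicadf["publicationType"].str.startswith("retracted")
share = allreplicadf.loc[retracted, "size"].sum() / allreplicadf["size"].sum()
print(round(share, 3))
```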
tabrepl=pn.widgets.Tabulator(allreplicadf, height=200)
filenamerepl, buttonrepl = tabrepl.download_menu(
text_kwargs={'name': 'Enter filename', 'value': 'mistral-cmip6-replica.csv', 'width':100, 'height':60},
button_kwargs={'name': 'Download table','width':100, 'height':60}
)
pn.Row(pn.Column(filenamerepl, buttonrepl),tabrepl).embed()
# The original file on Mistral is only readable internally; the same
# statistics are mirrored on the public Swift storage:
timeseries = pd.read_csv(
    "https://swift.dkrz.de/v1/dkrz_a44962e3ba914c309a7421573a6949a6/Pool-Statistics/mistral-cmip6-allocation-timeseries.csv.gz",
    index_col="Date", parse_dates=True,
)
# Drop repeated header rows and empty entries
timeseries = timeseries.loc[timeseries["Number of Files"] != "Number of Files"]
timeseries = timeseries.dropna()
tmplot= timeseries.hvplot.scatter(y=["Disk Allocation [GB]", "Number of Datasets", "Number of Files"],
x="Date",
shared_axes=False,
grid=True,
yformatter=NumeralTickFormatter(format='0,0'),
width=600,
height=500,
legend="top_left",
).opts(axiswise=True)
hvplot.save(tmplot,"pool-timeseries-hvplot.html")
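The filter on "Number of Files" above is needed because the timeseries file is, by assumption, a concatenation of daily dumps, so each dump contributes its own header line as a spurious data row. A minimal reproduction with synthetic data:

```python
import io
import pandas as pd

# Two daily dumps concatenated: the second header line ends up as a data row.
raw = io.StringIO(
    "Date,Disk Allocation [GB],Number of Datasets,Number of Files\n"
    "2020-01-01,1000,10,100\n"
    "Date,Disk Allocation [GB],Number of Datasets,Number of Files\n"
    "2020-01-02,1100,11,110\n"
)
ts = pd.read_csv(raw)

# Drop rows where the cell just repeats the column name, then restore dtypes.
ts = ts.loc[ts["Number of Files"] != "Number of Files"]
ts = ts.astype({"Number of Files": int})
print(len(ts))   # 2
```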
timeseries
Cloud upload#
We use the swiftclient
for the upload.
#from swiftclient import client
#from swiftenvbk0988 import *
#
#with open("pool-statistics-hvplot.html", 'rb') as f:
# client.put_object(OS_STORAGE_URL, OS_AUTH_TOKEN, "Pool-Statistics", "pool-statistics-hvplot.html", f)
#with open("pool-timeseries-hvplot.html", 'rb') as f:
# client.put_object(OS_STORAGE_URL, OS_AUTH_TOKEN, "Pool-Statistics", "pool-timeseries-hvplot.html", f)
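The commented calls above can be wrapped in a small helper. In this sketch the put_object callable is injected so the code runs without credentials; in real use one would pass swiftclient's client.put_object together with OS_STORAGE_URL and OS_AUTH_TOKEN from the private credentials module, as in the commented cell:

```python
from pathlib import Path

def upload_to_swift(put_object, storage_url, auth_token, container, path):
    """Upload a local file to a Swift container under its own filename.

    `put_object` is expected to follow the swiftclient signature
    put_object(url, token, container, name, contents).
    """
    name = Path(path).name
    with open(path, "rb") as f:
        put_object(storage_url, auth_token, container, name, f)
    return name
```

Injecting the client function is a deliberate choice: it keeps the helper testable and makes the credential dependency explicit at the call site.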