{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# CMIP6 storage\n", "with `panel`, `pandas` and `hvplot`\n", "\n", "The primary publication of national Earth System Model data at DKRZ makes up the largest part of the CMIP Data Pool (CDP). Most of the data have been produced within the national CMIP Project [DICAD](https://www.dkrz.de/c6de) and in the compute project RZ988. \n", "\n", "DKRZ supports modeling groups in all steps of the data workflow from **preparation** to **publication**. In order to track and display the effort for this data workflow, we run automated scripts (*cronjobs*) that capture the extent of the final product, namely the disk space usage of these groups in the data pool, and update it daily. The resulting statistics are uploaded to a public and freely available [swift storage](https://swiftbrowser.dkrz.de/public/dkrz_a44962e3ba914c309a7421573a6949a6/Pool-Statistics/). \n", "\n", "In the following, we create *responsive* bar plots with `panel`, `pandas` and `hvplot` for statistical Key Performance Indicators of the CDP." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## German contribution and publication\n", "\n", "Here we present statistics of DICAD contributions to the CDP. We consider datasets which\n", "\n", "- were created as part of DICAD and\n", "- have been primarily published at the DKRZ ESGF node.\n", "\n", "The statistics are computed by grouping the measures by:\n", "\n", "- **source_id**: Earth System Models (ESMs) which have contributed to the CDP.\n", "- **institution_id**: Institutions which have conducted and submitted model simulations to the CDP.\n", "- **publication type**: How much data has been published and replicated at the DKRZ ESGF node." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "import warnings\n", "warnings.filterwarnings('ignore')\n", "kpis=[\"size [TB]\", \"filenumber\",\"datasets\"]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "hide-output" ] }, "outputs": [], "source": [ "import panel as pn\n", "pn.extension(\"tabulator\")\n", "import pandas as pd\n", "sourcesumdf = pd.read_csv(\"https://swift.dkrz.de/v1/dkrz_a44962e3ba914c309a7421573a6949a6/Pool-Statistics/mistral-cmip6-allocation-by-source.csv.gz\").sort_values(\"size\", ascending=False)\n", "allinstdf = pd.read_csv(\"https://swift.dkrz.de/v1/dkrz_a44962e3ba914c309a7421573a6949a6/Pool-Statistics/mistral-cmip6-allocation-by-dicad-institutes.csv.gz\").sort_values(\"size\", ascending=False)\n", "allreplicadf = pd.read_csv(\"https://swift.dkrz.de/v1/dkrz_a44962e3ba914c309a7421573a6949a6/Pool-Statistics/mistral-cmip6-allocation-by-publicationType.csv.gz\").sort_values(\"size\", ascending=False)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "hide-input", "hide-output" ] }, "outputs": [], "source": [ "import intake\n", "from pathlib import Path\n", "import hvplot.pandas\n", "from bokeh.models import NumeralTickFormatter\n", "import pandas as pd\n", "sourcesumdf[\"Group\"]=\"By source_id\"\n", "sourcesumdf[\"Key\"]=\"source_id\"\n", "sourcesumdf[\"Legend\"]=sourcesumdf[\"source_id\"]\n", "allinstdf[\"Group\"]=\"By institution_id\"\n", "allinstdf[\"Key\"]=\"institution_id\"\n", "allinstdf[\"Legend\"]=allinstdf[\"institution_id\"]\n", "allreplicadf[\"Group\"]=\"By Publication Status\"\n", "allreplicadf[\"Key\"]=\"publicationType\"\n", "allreplicadf[\"Legend\"]=allreplicadf[\"publicationType\"]\n", "\n", "sourcesumdf=sourcesumdf.set_index(\"Group\")\n", "allinstdf=allinstdf.set_index(\"Group\")\n", "allreplicadf=allreplicadf.set_index(\"Group\")\n", "#\n", 
"#plotdf=sourcesumrz.append(allinstdf).append(sourcesum).append(allreplica) #.append(expdf)\n", "plotdf=pd.concat([sourcesumdf,allinstdf,allreplicadf])" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "hide-output", "hide-input" ] }, "outputs": [], "source": [ "plotdf=plotdf.rename(columns={\"size\":\"size [TB]\"})\n", "grouped_df=plotdf.groupby([\"Key\"])" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "plot_group=grouped_df.get_group(\"institution_id\").sort_values(\"filenumber\", ascending=False)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "plot_group" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "hide-input" ] }, "outputs": [], "source": [ "def create_plot(group, kpi):\n", " global grouped_df\n", " plot_group=grouped_df.get_group(group).sort_values(kpi, ascending=False)\n", " a=plot_group.hvplot.bar(y=kpi,\n", " ylabel=f\"Sum of {kpi} in the CMIP6 Data Pool\",\n", " xlabel=\"Group\",\n", " by=\"Legend\",\n", " stacked=False,\n", " #grid=True,\n", " yformatter=NumeralTickFormatter(format='0,0'),\n", " title=\"\",\n", " # legend=\"top_left\",\n", " fontsize={'legend': \"10%\"},\n", " width=650,\n", " height=500,\n", " muted_alpha=0,\n", " fontscale=1.2\n", " )\n", " b=plot_group.hvplot.bar(y=kpi,\n", " ylabel=\"\",\n", " xlabel=\"Group\",\n", " by=\"Legend\",\n", " stacked=True,\n", " #grid=True,\n", " yformatter=NumeralTickFormatter(format='0,0'),\n", " title=\"\",\n", " legend=False,\n", " fontsize={'legend': \"10%\"},\n", " width=150,\n", " height=500,\n", " muted_alpha=0,\n", " fontscale=1.2\n", " )\n", " return a+b" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "hide-input" ] }, "outputs": [], "source": [ "interact = pn.interact(create_plot, group=list(grouped_df.groups.keys()), kpi=kpis)\n", "pn.Column(pn.Card(interact[0], title=\"Plots for 
different groups and KPIs\", styles=dict(background='WhiteSmoke')),\n", " interact[1]\n", " ).embed()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The German contribution to CMIP6 by the five sources of **MPI-M** and **AWI** comprises\n", "\n", "- 1.6 PB of data primarily published at DKRZ\n", "- more than 33% of the CMIP6 data pool\n", "- 2 million files or 250,000 datasets" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Statistics for different *source_id*s" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The file `mistral-cmip6-allocation-by-source.csv.gz` contains the results per source with an additional classification by experiment.\n", "\n", "- **MPI-ESM1-2-HR**: The high-resolution version of the MPI-ESM1-2. [CV-entry](https://github.com/WCRP-CMIP/CMIP6_CVs/blob/20ef5bb121bdf50fc00ba5b2520298fd4766ffa9/CMIP6_source_id.json#L5621)\\*, [Citation example](https://doi.org/10.22033/ESGF/CMIP6.4403)\n", "- **MPI-ESM1-2-LR**: The lower-resolution version of the MPI-ESM1-2. 
[CV-entry](https://github.com/WCRP-CMIP/CMIP6_CVs/blob/20ef5bb121bdf50fc00ba5b2520298fd4766ffa9/CMIP6_source_id.json#L5682)\\*, [Citation example](https://doi.org/10.22033/ESGF/CMIP6.6693)\n", "- **AWI-CM-1-1-MR**: [CV-entry](https://github.com/WCRP-CMIP/CMIP6_CVs/blob/20ef5bb121bdf50fc00ba5b2520298fd4766ffa9/CMIP6_source_id.json#L311)\\*, [Citation example](https://doi.org/10.22033/ESGF/CMIP6.376)\n", "- **AWI-ESM-1-1-LR**: [CV-entry](https://github.com/WCRP-CMIP/CMIP6_CVs/blob/20ef5bb121bdf50fc00ba5b2520298fd4766ffa9/CMIP6_source_id.json#L475)\\*, [Citation example](https://doi.org/10.22033/ESGF/CMIP6.9328)\n", "- **ICON-ESM-LR**: [CV-entry](https://github.com/WCRP-CMIP/CMIP6_CVs/blob/20ef5bb121bdf50fc00ba5b2520298fd4766ffa9/CMIP6_source_id.json#L4569)\\*, [Citation example](https://doi.org/10.22033/ESGF/CMIP6.743)\n", "\n", "\\* The *CV-entry* links point to the registration in the official [CMIP6 Controlled Vocabulary](https://github.com/WCRP-CMIP/CMIP6_CVs/), where all CMIP6 models had to register.\n", " \n", "As soon as CMIP6 data from other ESMs like *EMAC-2-53* is available, the lists will be expanded accordingly." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "hide-input" ] }, "outputs": [], "source": [ "tabsource=pn.widgets.Tabulator(sourcesumdf, height=200)\n", "filenamesource, buttonsource = tabsource.download_menu(\n", " text_kwargs={'name': 'Enter filename', 'value': 'mistral-cmip6-dicad-sources.csv', 'width':100, 'height':60},\n", " button_kwargs={'name': 'Download table','width':100, 'height':60}\n", ")\n", "pn.Row(pn.Column(filenamesource,buttonsource),tabsource).embed()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Statistics for different *institution_id*s\n", "\n", "The file `mistral-cmip6-allocation-by-dicad-institutes.csv.gz` contains statistics grouped by **institutes** that have contributed to DICAD." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "hide-input" ] }, "outputs": [], "source": [ "tabinst=pn.widgets.Tabulator(allinstdf, height=200)\n", "filenameinst, buttoninst = tabinst.download_menu(\n", " text_kwargs={'name': 'Enter filename', 'value': 'mistral-cmip6-dicad-institutes.csv', 'width':100, 'height':60},\n", " button_kwargs={'name': 'Download table','width':100, 'height':60}\n", ")\n", "pn.Row(pn.Column(filenameinst, buttoninst),tabinst).embed()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Statistics for different *publication type*s\n", "\n", "The file `mistral-cmip6-allocation-by-publicationType.csv.gz` contains statistics grouped by **publication type**:\n", "\n", "- *published originals*: Data that was first published at the ESGF node at DKRZ and is still valid and available.\n", "- *retracted originals*: Data that was first published at the ESGF node at DKRZ but has since been retracted.\n", "- *published replicas*: Data that was copied to and published at DKRZ and is still valid and available.\n", "- *retracted replicas*: Data that was copied to and published at DKRZ but has since been retracted." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "hide-input" ] }, "outputs": [], "source": [ "tabrepl=pn.widgets.Tabulator(allreplicadf, height=200)\n", "filenamerepl, buttonrepl = tabrepl.download_menu(\n", " text_kwargs={'name': 'Enter filename', 'value': 'mistral-cmip6-replica.csv', 'width':100, 'height':60},\n", " button_kwargs={'name': 'Download table','width':100, 'height':60}\n", ")\n", "pn.Row(pn.Column(filenamerepl, buttonrepl),tabrepl).embed()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "hide-input", "hide-output" ] }, "outputs": [], "source": [ "timeseries=pd.read_csv(\"https://swift.dkrz.de/v1/dkrz_a44962e3ba914c309a7421573a6949a6/Pool-Statistics/mistral-cmip6-allocation-timeseries.csv.gz\",\n", " parse_dates=True,\n", " index_col=0\n", " )\n", "tmplot= timeseries.hvplot.line(y=[\"Disk Allocation [GB]\", \"Number of Datasets\", \"Number of Files\"],\n", " shared_axes=False,\n", " yformatter=NumeralTickFormatter(format='0,0'),\n", " grid=True,\n", " width=600,\n", " height=500,\n", " legend=\"top_left\",\n", " ).opts(axiswise=True)\n", "hvplot.save(tmplot,\"pool-timeseries-hvplot.html\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tmplot" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Cloud upload\n", "\n", "We use the `swiftclient` for the upload." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "hide-input" ] }, "outputs": [], "source": [ "#from swiftclient import client\n", "#from swiftenvbk0988 import *\n", "#\n", "#with open(\"pool-statistics-hvplot.html\", 'rb') as f:\n", "# client.put_object(OS_STORAGE_URL, OS_AUTH_TOKEN, \"Pool-Statistics\", \"pool-statistics-hvplot.html\", f)\n", "#with open(\"pool-timeseries-hvplot.html\", 'rb') as f:\n", "# client.put_object(OS_STORAGE_URL, OS_AUTH_TOKEN, \"Pool-Statistics\", \"pool-timeseries-hvplot.html\", f)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "python3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.3" } }, "nbformat": 4, "nbformat_minor": 4 }