{
"cells": [
{
"cell_type": "markdown",
"id": "d9937587-ba81-4ca5-9fae-247e80f1fec5",
"metadata": {},
"source": [
"## WDCC CMIP6 data download generator on DKRZ's lustre\n",
"\n",
"This Notebook builds interactive panels `app1` and `app2` that allow you to\n",
"\n",
"1. get an overview over all fully archived CMIP6 experiments in WDCC\n",
"1. select a download wish list on file level with a GUI\n",
"1. generate a download script with `jblob`\n",
"\n",
"We heavily make use of `pandas`, `panel` and the python interface to the WDCC.\n",
"This Notebook works well on Levante with the `/work/bm1344/conda-evns/py_312` environment.\n",
"\n",
"[This link](https://s3.eu-dkrz-1.dkrz.cloud/bm1344/test/wdcc_dataset_tabulator/wdcc_dataset_tabulator.html) will lead you to a deployed version of both Apps. It will start a web worker in the background which runs the code and finally reloads the web page."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6fa6dc67-5f60-4876-bfc0-267aac3400c8",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"import panel as pn\n",
"pn.extension('tabulator',sizing_mode=\"stretch_width\", template=\"fast\", inline=True)\n",
"from panel.widgets import Tqdm\n",
"from bokeh.models.widgets.tables import HTMLTemplateFormatter\n",
"import pandas as pd\n",
"import json\n",
"import requests\n",
"import tempfile\n",
"import hvplot.pandas\n",
"from pywdcc import WDCCClient\n",
"client=WDCCClient()\n",
"\n",
"ALLOWED_NO_OF_DATASETS=700\n",
"LANDING_PAGE_TRUNK_URL=\"https://www.wdc-climate.de/ui/entry?acronym=\"\n",
"drs='Project_id/Activity_id/Institution_id/Source_id/Experiment_id/Member_id/Table_id/Variable_id/Grid_label/Version_id/Filename'\n",
"to_hide=[\"Path\",\"Filename\", \"Record\"]"
]
},
{
"cell_type": "markdown",
"id": "7d29dd55-ab43-4482-9720-12b2a82d7f0f",
"metadata": {},
"source": [
"### Fully archived CMIP6 experiments\n",
"\n",
"`app1` fetches data from the WDCC API to generate an interactive table for all fully archived CMIP6 experiments. The table contains a column for each DRS element. A identifier column contains a link to the WDCC landing page so that one can proceed with the GUI interface at any point.\n",
"\n",
"We also add filters to both allow to subset the table and show the unique values of each column."
]
},
{
"cell_type": "markdown",
"id": "195752d8-0ca0-4ce7-89bf-6b36ee3451dd",
"metadata": {},
"source": [
"#### Create a table"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a62f48a5-d24b-4338-8183-5a20ca9e0d4d",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"def map_drs(row:list,start:int=4,end:int=999)->dict:\n",
" global drs\n",
" drs_dict={}\n",
" for key,value in zip(\n",
" drs.split('/')[start:end],\n",
" row[\"Path\"].split('/')[start:end]\n",
" ):\n",
" drs_dict[key]=value\n",
" return drs_dict"
]
},
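{
"cell_type": "markdown",
"id": "b3f1a2c4-d5e6-4f70-8a91-b2c3d4e5f601",
"metadata": {},
"source": [
"As a quick illustration of how `map_drs` slices both the DRS template and a path, here is a minimal sketch with a hand-made row (the path is a hypothetical example, not a real WDCC record):"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c4a2b3d5-e6f7-4081-9ab2-c3d4e5f60712",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# Hypothetical example row; real rows come from the WDCC API below.\n",
"example_row={\"Path\": \"CMIP6/CMIP/MPI-M/MPI-ESM1-2-HR/historical\"}\n",
"map_drs(example_row,start=1,end=5)\n",
"# expected: {'Activity_id': 'CMIP', 'Institution_id': 'MPI-M',\n",
"#            'Source_id': 'MPI-ESM1-2-HR', 'Experiment_id': 'historical'}"
]
},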
{
"cell_type": "code",
"execution_count": null,
"id": "d354b6e8-3efa-4652-808b-7ec007f2c585",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"link_formatter = HTMLTemplateFormatter(\n",
" template=f'\"><%= value %>'\n",
")"
]
},
{
"cell_type": "markdown",
"id": "50a2cf86-dcf7-4600-8737-0eb591f53e83",
"metadata": {},
"source": [
"Now we fetch data for the experiments. To not overload the API, we do it in a loop. `model_dump` will return a `dict` object."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "baf69d1f-c4cf-4283-a5b1-d04aaf75bc4f",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"total_docs=[]\n",
"estimated_total_number_of_exps=1500\n",
"for s in range(0,estimated_total_number_of_exps,100):\n",
" cmip6_solr_dict=await client.get_all_fully_archived_cmip6_exps(start=s)\n",
" total_docs+=cmip6_solr_dict.model_dump()[\"docs\"]"
]
},
{
"cell_type": "markdown",
"id": "352268df-8530-436a-bec8-f3a141b2a6ad",
"metadata": {},
"source": [
"Next, we parse the information as a pandas DataFrame table:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "995236c9-b84f-432d-9992-d70a8693aa37",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"df=pd.DataFrame.from_records(\n",
" total_docs,\n",
" exclude=[\n",
" 'id',\n",
" #'entry_acronym_s',\n",
" 'textSuggest',\n",
" #'entry_name_s',\n",
" 'title_sort',\n",
" 'creation_date_dt', 'entry_type_s', 'entry_type_l',\n",
" 'publication_type_s', 'publication_type_l', 'progress_acronym_s',\n",
" 'summary_s', 'data_size_l', 'geo', 'date_range_rdt', 'general_key_ss',\n",
" 'institute_info_ss', 'pinstitute_info_ss', 'person_name_ss',\n",
" 'project_name_ss', 'project_acronym_ss', 'hierarchy_ss',\n",
" 'hierarchy_steps_ss', 'access_s', 'authors_s', 'is_downloadable',\n",
" '_version_', 'score']\n",
").rename(columns=dict(\n",
" entry_acronym_s=\"Acronym\",\n",
" entry_name_s=\"Path\"\n",
"))"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c361df71-9a2a-4948-821e-6ae0ac4b3fe0",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"df[\"Path\"]=df[\"Path\"].str.lstrip(\"WCRP\").str.strip().str.replace(' ','/')"
]
},
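{
"cell_type": "markdown",
"id": "d5b3c4e6-f708-4192-ab23-d4e5f6071823",
"metadata": {},
"source": [
"To see what this cleanup does, here is the same chain of string operations applied to a hypothetical raw value of the kind the API returns:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e6c4d5f7-0819-42a3-bc34-e5f607182934",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# Hypothetical raw value for illustration only.\n",
"raw=\"WCRP CMIP6 CMIP MPI-M MPI-ESM1-2-HR historical\"\n",
"raw.removeprefix(\"WCRP\").strip().replace(' ','/')\n",
"# -> 'CMIP6/CMIP/MPI-M/MPI-ESM1-2-HR/historical'"
]
},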
{
"cell_type": "markdown",
"id": "6f25362c-5c9f-4fba-ac0a-fb3b4f1e7afd",
"metadata": {},
"source": [
"We can map the `Path` column to the DRS so that we can create more columns: One for each DRS element:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2cbafd01-b3fa-4bb9-821f-cc208b4d3ac6",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"applied_df=df.apply(lambda row: map_drs(row,start=1,end=5), axis='columns', result_type='expand')\n",
"c6df = pd.concat([df, applied_df], axis='columns') "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c0b74283-b6c1-4a51-927d-036104058dda",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"c6df"
]
},
{
"cell_type": "markdown",
"id": "fe255970-6ee1-46a1-aab3-352ef9519840",
"metadata": {},
"source": [
"#### Make the table interactive\n",
"\n",
"At this stage, we could continue to work with `c6df` programmatically. To make it even more convenient, we parse the table with the `pn.widgets.Tabulator` to make it interactive. For the final `app1`, we add filters on the left to show the unique values for each column."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d468edf8-103f-44a7-a1f0-cf988c75f5ac",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"def get_filter_columns(df) -> dict:\n",
" column_filter_dict={}\n",
" for col in list(df.columns):\n",
" if col in to_hide or \"Size\" in col or \"blob\" in col.lower():\n",
" continue\n",
" select=pn.widgets.MultiChoice(\n",
" options=sorted(list([a for a in df[col].unique() if a])),\n",
" name=f'{col}',\n",
" search_option_limit=100\n",
" )\n",
" column_filter_dict[col]=select\n",
" return column_filter_dict"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "93d87dbf-91c1-462f-a3db-ceab6aad56cf",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"c6tabu = pn.widgets.Tabulator(\n",
" c6df,\n",
" show_index=False,\n",
" formatters={\"Acronym\": link_formatter},\n",
" hidden_columns=to_hide,\n",
" header_filters=True,\n",
" selectable=1,\n",
" page_size=20,\n",
" disabled=True\n",
")\n",
"c6filters=get_filter_columns(c6df)\n",
"for col,cf in c6filters.items():\n",
" c6tabu.add_filter(cf,col)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "577e3c21-c89a-488b-b270-e8b90fdc01a8",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"app1=pn.Column(\n",
" pn.Row(\n",
" pn.Column(\n",
" *list(c6filters.values())\n",
" ),\n",
" pn.Column(\n",
" pn.pane.HTML(f' Fully archived CMIP6 experiments'),\n",
" c6tabu\n",
" ),\n",
" name=\"prerendered\"\n",
" )\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "dda85d11-5c9d-4454-8094-051bdcab3184",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"app1.servable()"
]
},
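{
"cell_type": "markdown",
"id": "f7d5e608-192a-43b4-8d45-f60718293a45",
"metadata": {},
"source": [
"The interactive state of `c6tabu` can be pulled back into pandas at any time, e.g. to continue programmatically with the rows that survive the active filters (`current_view`) or with the clicked selection (`selected_dataframe`, at most one row since we set `selectable=1`):"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "08e6f719-2a3b-44c5-8e56-0718293a4b56",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# Rows remaining after all active header and widget filters:\n",
"filtered=c6tabu.current_view\n",
"# Rows currently selected by clicking:\n",
"selected=c6tabu.selected_dataframe\n",
"filtered.shape, selected.shape"
]
},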
{
"cell_type": "markdown",
"id": "3bebcb23-d04d-4037-ae8a-2f5ab63d1690",
"metadata": {},
"source": [
"### File subsetting and download script generation for a specific dataset\n",
"\n",
"For our `app2`, we use two widgets in a row: One *TextInput* for a WDCC Dataset Landing Page, and one trigger button to create a row including the interactive table for that input. "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9d13434e-2e1e-43a0-a2ac-b8b810e017b8",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"app2=pn.Row()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e43e74e3-63cf-457e-85b6-22b539717479",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"dataset_landing_page = pn.widgets.TextInput(\n",
" name='WDCC Dataset Landing Page',\n",
" placeholder='https://www.wdc-climate.de/ui/entry?acronym=C6_5189604',\n",
" description=\"Specify a URL to a WDCC Dataset Landing Page\"\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "080f82e8-9d4a-4987-96b6-6a370e99ca5a",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"op=pn.widgets.Button(\n",
" name=\"Create a interactive table\",\n",
" description=\"This will fetch the WDCC Dataset and children\"\n",
")"
]
},
{
"cell_type": "markdown",
"id": "0506a515-0ba4-493a-a113-b45e960b5492",
"metadata": {},
"source": [
"Each created table should be removable. For that, we define the `remove_row` function and bind it to a button that comes with the interactive table. "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c13b50e8-aae6-4aa5-93d3-0191c6b52540",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"def remove_row(event,name):\n",
" global app2\n",
" if not event:\n",
" return \"\"\n",
" idxl=[]\n",
" for i,elem in enumerate(app2):\n",
" if name in elem.name :\n",
" idxl.append(i)\n",
" if idxl:\n",
" for idx in idxl:\n",
" app2.pop(idx)\n",
" return \"\""
]
},
{
"cell_type": "markdown",
"id": "cde7798f-5da5-46ec-aad3-3415cb693588",
"metadata": {},
"source": [
"For a selection of the interactive table, you can click another button that will generate a download script that will work on Levante."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ffcf45ee-4930-4c02-8606-adbfc96c96f3",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"def create_jblob_script(event,name):\n",
" global app2\n",
" if not event:\n",
" return \"\"\n",
" ltabu=None\n",
" dow=None\n",
" for i,elem in enumerate(app2):\n",
" if name in elem.name :\n",
" ltabu=elem[0][-1][-1]\n",
" if ltabu: \n",
" recent_df=ltabu.current_view\n",
" tmp_file=tempfile.NamedTemporaryFile(suffix=\".sh\",delete=False)\n",
" # tmp_file.name,\n",
" fname=f\"jblob-script-{name}.sh\"\n",
"\n",
" with open(tmp_file.name, 'w') as f:\n",
" f.write('module load jblob\\n')\n",
" for local_acro in recent_df[\"Acronym\"].unique():\n",
" subtabu=recent_df[recent_df[\"Acronym\"]==local_acro]\n",
" for rec in subtabu[\"Record\"].unique():\n",
" subsubtabu=subtabu[subtabu[\"Record\"]==rec]\n",
" f.write(f\"mkdir -p {'/'.join(subsubtabu['Path'].values[0].split('/')[:-1])}\\n\")\n",
" f.write(f\"jblob --dataset {local_acro} --rmin {rec} --rmax {rec} --file {subsubtabu['Path'].values[0]}\\n\")\n",
"\n",
" dow=pn.widgets.FileDownload(\n",
" file=tmp_file.name,\n",
" filename=fname\n",
" )\n",
" return dow \n"
]
},
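{
"cell_type": "markdown",
"id": "19f7082a-3b4c-45d6-8f67-18293a4b5c67",
"metadata": {},
"source": [
"The generated script contains one `mkdir -p` and one `jblob` call per selected file. With a hypothetical acronym, record number and path it looks roughly like this:\n",
"\n",
"```\n",
"module load jblob\n",
"mkdir -p CMIP6/CMIP/MPI-M/MPI-ESM1-2-HR/historical/r1i1p1f1/Amon/tas/gn/v20190710\n",
"jblob --dataset C6_5189604 --rmin 42 --rmax 42 --file CMIP6/CMIP/MPI-M/MPI-ESM1-2-HR/historical/r1i1p1f1/Amon/tas/gn/v20190710/tas_Amon_MPI-ESM1-2-HR_historical_r1i1p1f1_gn_185001-186912.nc\n",
"```"
]
},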
{
"cell_type": "code",
"execution_count": null,
"id": "6e591fd8-ba8a-4690-bf86-636265b9ef24",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"def refine_df(df):\n",
" df=df.rename(\n",
" columns=dict(\n",
" FILE_NAME=\"Path\",\n",
" FILE_SIZE=\"Size [MB]\",\n",
" CACHED_CONTAINER=\"Cached\",\n",
" BLOB_ID=\"Record\"\n",
" )\n",
" ) \n",
" df[\"Record\"]=df[\"Record\"].astype(int)\n",
" df[\"Filename\"]=df[\"Path\"].str.split('/').str[-1]\n",
" df[\"Size [MB]\"]=(df[\"Size [MB]\"]/1024**2).astype(int)\n",
" df.loc[~df[\"Cached\"].isnull(),\"Cached\"]=\"Yes\" \n",
" df.loc[df[\"Cached\"].isnull(),\"Cached\"]=\"No\"\n",
" df[\"Time slice\"]=df[\"Filename\"].str.split('_').str[-1].str.strip('.nc')\n",
" df.loc[df[\"Time slice\"].str[0]=='g',\"Time slice\"]=\"Fix\"\n",
" df=df.drop(columns=\"Filename\")\n",
" applied_df=df.apply(lambda row: map_drs(row), axis='columns', result_type='expand')\n",
" df = pd.concat([df, applied_df], axis='columns') \n",
" df[\"Grid_label\"]=df[\"Grid_label\"].astype(str)\n",
" return df"
]
},
{
"cell_type": "markdown",
"id": "474989ce-262a-4888-a97e-537c0a5745bc",
"metadata": {},
"source": [
"If you do not want to work with the app but rather only with the DataFrame, you can just run the `create_df` function for a specific epxeriment acronym `main_acronym`."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ef93ca43-f479-4feb-bee1-ebcd5924105e",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"async def create_df(client, main_acronym):\n",
" global ALLOWED_NO_OF_DATASETS, app2\n",
" all_acronyms=await client.get_dataset_acronyms(main_acronym)\n",
" no_all_acronyms=len(all_acronyms)\n",
" if no_all_acronyms > ALLOWED_NO_OF_DATASETS:\n",
" raise ValueError(f\"More than {ALLOWED_NO_OF_DATASETS} acronyms are not allowed. Found {no_all_acronyms} for acronym {main_acronym}\")\n",
" df=pd.DataFrame()\n",
" print(\"Start for loop...\")\n",
" tqdm=Tqdm()\n",
" app2.append(tqdm) \n",
" for acronym in tqdm(all_acronyms):\n",
" download_form=await client.get_download_form(acronym)\n",
" #print(download_form)\n",
" newdf=pd.DataFrame.from_records(\n",
" download_form[\"downloadInfo\"][\"metaTable\"],\n",
" exclude=[\n",
" 'ENTRY_ID',\n",
" #'BLOB_ID',\n",
" 'CHECKSUM',\n",
" 'CHECKSUM_TYPE',\n",
" 'EXTERNAL_IDENTIFIER',\n",
" 'START_DATE',\n",
" 'UPLOAD_DATE',\n",
" #'CACHED_CONTAINER'\n",
" ]\n",
" )\n",
" newdf[\"Acronym\"]=acronym\n",
" df=pd.concat(\n",
" [\n",
" df,\n",
" newdf\n",
" ]\n",
" )\n",
" df=refine_df(df)\n",
" return df"
]
},
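{
"cell_type": "markdown",
"id": "2a083b4c-5d6e-46f7-9078-293a4b5c6d78",
"metadata": {},
"source": [
"For direct use without the GUI, call `create_df` as sketched below (Jupyter supports top-level `await`). The acronym is the example from the text-input placeholder; the call is commented out so that running all cells does not trigger API requests:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3b194c5d-6e7f-4708-a189-3a4b5c6d7e89",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# Note that create_df also appends a progress bar to app2.\n",
"# my_df = await create_df(client, \"C6_5189604\")\n",
"# my_df.head()"
]
},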
{
"cell_type": "code",
"execution_count": null,
"id": "92b28d33-3987-4b23-b18c-4baa63d79185",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"async def create_app2(event):\n",
" global to_hide, app2, link_formatter, ALLOWED_NO_OF_DATASETS, client\n",
" landing_page=dataset_landing_page.value\n",
" headers = {\n",
" \"User-Agent\": \"Mozilla/5.0\"\n",
" } \n",
" resp=requests.get(landing_page,allow_redirects=True,headers=headers)\n",
" if resp.status_code >500:\n",
" raise ValueError(f\"Landing page {landing_page} returned >500\")\n",
" landing_page=resp.url\n",
" main_acronym=await client.get_acronym_from_landing_page(landing_page)\n",
" if not main_acronym:\n",
" raise ValueError(f\"Failed to find acronym for landing page {landing_page}\")\n",
" #all_md=get_all_metadata_for_acronym(main_acronym)\n",
" if await client.check_project_for_acronym(main_acronym,\"CMIP6\"):\n",
" df = await create_df(client, main_acronym)\n",
" tabu = pn.widgets.Tabulator(\n",
" df,\n",
" show_index=False,\n",
" formatters={\"Acronym\": link_formatter},\n",
" hidden_columns=to_hide,\n",
" header_filters=True,\n",
" #selectable=1,\n",
" page_size=20,\n",
" disabled=True\n",
" )\n",
" column_filter_dict=get_filter_columns(df)\n",
" for col,select in column_filter_dict.items():\n",
" tabu.add_filter(select, col) \n",
" rem=pn.widgets.Button(\n",
" name=\"Remove table\"\n",
" ) \n",
" jblob=pn.widgets.Button(\n",
" name=\"Create jblob script\"\n",
" ) \n",
" avw=pn.pane.HTML(f' Volume of files of dataset: {str(df[\"Size [MB]\"].sum())} MB ')\n",
" uvn=pn.pane.HTML(f' Unique variable names of dataset: {str(len(df[\"Variable_id\"].unique()))} ')\n",
" app2.pop(-1)\n",
" app2.append(\n",
" pn.Column(\n",
" pn.Row(\n",
" pn.Column(\n",
" *list(column_filter_dict.values()),\n",
" rem,\n",
" pn.bind(\n",
" remove_row,\n",
" rem,\n",
" main_acronym\n",
" ),\n",
" jblob,\n",
" pn.bind(\n",
" create_jblob_script,\n",
" jblob,\n",
" main_acronym\n",
" )\n",
" ),\n",
" pn.Column(\n",
" pn.pane.HTML(f' {main_acronym} '),\n",
" tabu\n",
" )\n",
" ),\n",
" pn.Row(\n",
" avw,\n",
" uvn,\n",
" ),\n",
" name=main_acronym \n",
" )\n",
" )\n",
" else:\n",
" raise ValueError(f\"Acronym {main_acronym} not a CMIP6 dataset\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "58091336-f910-443a-a078-c841f6ccc2b8",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"def opclick(event):\n",
" open_dataset()\n",
"\n",
"op.on_click(create_app2)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b963aaa3-04de-4d3f-a38c-5071e4f657a2",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"app2=pn.Column(pn.Row(dataset_landing_page,op))"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1c6e6449-88dd-460c-9e41-5a5f6ed8056a",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"app2.servable()"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "python3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.13.3"
}
},
"nbformat": 4,
"nbformat_minor": 5
}