{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Egest\n", "\n", "Because disk space is limited, we have to remove data from the pool storage when the **data is no longer valid**. Whether this information is disseminated quickly and correctly depends heavily on the publishing data center." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We move datasets out of the official data pool tree into the directory\n", "`retracted=/mnt/lustre02/work/ik1017/CMIP6/data/CMIP6_retracted` \n", "if one of the following conditions is fulfilled:\n", "\n", "- The dataset is **retracted** at the ESGF node where it has been published\n", "- The dataset is **outdated**, regardless of whether it is retracted\n", "- The dataset is **removed** at the ESGF data node where it has been published\n", "\n", "As an exception, datasets that we host in the data pool for other services such as the *C3S Climate Data Store* are not moved, even if the data has been deleted in ESGF." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- The datasets remain *available on the filesystem until the next archival cycle*.\n", "- The `retracted` tree is archived once per month." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With the following script, you can find datasets that have been **retracted** or are **outdated**. 
A similar script is run as a *cronjob* on the DKRZ system.\n", "- See https://esgf-pyclient.readthedocs.io/en/latest/ for the pyclient documentation." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from tqdm import tqdm\n", "from pyesgf.search import SearchConnection\n", "\n", "# One of the most stable search indices is the LLNL index\n", "\n", "index_search_url = 'http://esgf-node.llnl.gov/esg-search'\n", "#index_search_url = 'http://esgf-data.dkrz.de/esg-search'\n", "\n", "conn = SearchConnection(index_search_url, distrib=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We read in the list of datasets already known to be retracted or outdated so that we only append new entries to it.\n", "Once in a while, we start from scratch to rule out entries for data that has since been re-published." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Instance IDs that are already on the list, one per line\n", "content=[]\n", "with open(\"retracted_unlatest.txt\", \"r\") as f:\n", "    content = f.readlines()\n", "    content = [x.strip() for x in content]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def addtolist(recentctx):\n", "    # Append every instance ID that is not yet on the list\n", "    with open(\"retracted_unlatest.txt\", \"a\") as f:\n", "        for dataset in tqdm(recentctx):\n", "            instance = dataset.json[\"instance_id\"]\n", "            if instance not in content:\n", "                f.write(instance+\"\\n\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following block searches for **retracted** datasets in the global ESGF network and adds those that are not yet included to our list." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ctx = conn.new_context(retracted=True, replica=False)\n", "recentct=ctx.search()\n", "addtolist(recentct)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In February 2021, this list contained 150 000 datasets that had been officially retracted from ESGF." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following block searches for **outdated** datasets in the global ESGF network and adds those that are not yet included to our list." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ctx = conn.new_context(retracted=False, latest=False, replica=False)\n", "recentct=ctx.search()\n", "addtolist(recentct)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In addition to the retracted datasets, 50 000 datasets were officially **outdated** in February 2021." ] } ], "metadata": { "kernelspec": { "display_name": "python3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.6" }, "nbsphinx": { "execute": "never" } }, "nbformat": 4, "nbformat_minor": 4 }