{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Egest\n",
    "\n",
    "Because disk space is limited, we remove data from the pool storage once the **data is no longer valid**. How quickly and correctly this information is disseminated depends heavily on the publishing data center."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We move datasets out of the official data pool tree into the directory\n",
    "`retracted=/mnt/lustre02/work/ik1017/CMIP6/data/CMIP6_retracted`\n",
    "if at least one of the following conditions is met:\n",
    "\n",
    "- The dataset is **retracted** at the ESGF node where it has been published.\n",
    "- The dataset is **outdated** (no longer the latest version), regardless of whether it has been retracted.\n",
    "- The dataset is **removed** at the ESGF data node where it has been published.\n",
    "\n",
    "However, datasets that we host in the data pool for other services, such as the *C3S Climate Data Store*, are not moved even if the data has been deleted in ESGF."
   ]
  },
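  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The rules above can be summarised as a small decision function. This is only an illustrative sketch, not the cronjob that runs at DKRZ: the function name `should_egest` and its boolean arguments are hypothetical, and we assume that the exemption for datasets hosted for other services takes precedence over all three conditions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def should_egest(retracted, latest, removed, hosted_for_other_service=False):\n",
    "    # Hypothetical helper, not part of the DKRZ cronjob: returns True if a\n",
    "    # dataset would be moved into the retracted tree under the rules above.\n",
    "    # Assumption: datasets kept for other services are never moved.\n",
    "    if hosted_for_other_service:\n",
    "        return False\n",
    "    # Any single condition is sufficient: retracted, outdated or removed.\n",
    "    return retracted or not latest or removed\n",
    "\n",
    "\n",
    "# An outdated dataset is moved unless it is kept for another service.\n",
    "print(should_egest(retracted=False, latest=False, removed=False))\n",
    "print(should_egest(retracted=False, latest=False, removed=False, hosted_for_other_service=True))"
   ]
  },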
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- The datasets remain *available on the filesystem until the next archival cycle*.\n",
    "- The `retracted` tree is archived once per month."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "With the following script, you can find datasets that are either **retracted** or **outdated**. A similar script runs as a *cronjob* on the DKRZ system.\n",
    "- See https://esgf-pyclient.readthedocs.io/en/latest/ for the pyclient documentation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from tqdm import tqdm\n",
    "from pyesgf.search import SearchConnection\n",
    "\n",
    "# One of the most stable search indices is the LLNL index\n",
    "\n",
    "index_search_url = 'http://esgf-node.llnl.gov/esg-search'\n",
    "#index_search_url = 'http://esgf-data.dkrz.de/esg-search'\n",
    "\n",
    "# distrib=True also queries the other index nodes of the federation\n",
    "conn = SearchConnection(index_search_url, distrib=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We read in the list of already retracted and outdated datasets so that we only append new entries to it.\n",
    "Once in a while, we rebuild the list from scratch so that datasets which have been re-published in the meantime do not stay on it."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Read the known instance IDs into a set for fast membership tests\n",
    "with open(\"retracted_unlatest.txt\", \"r\") as f:\n",
    "    content = {line.strip() for line in f}"
   ]
  },
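  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def addtolist(recentctx):\n",
    "    # Append instance IDs from the search results that are not yet in the list file\n",
    "    known = set(content)  # set lookup is much faster than scanning a list\n",
    "    with open(\"retracted_unlatest.txt\", \"a\") as f:\n",
    "        for dataset in tqdm(recentctx):\n",
    "            instance = dataset.json[\"instance_id\"]\n",
    "            if instance not in known:\n",
    "                f.write(instance + \"\\n\")\n",
    "                known.add(instance)  # do not write the same ID twice in one run"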
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The following block searches the global ESGF network for **retracted** datasets and appends any that are not yet on our list."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# replica=False restricts the search to original records\n",
    "ctx = conn.new_context(retracted=True, replica=False)\n",
    "recentct = ctx.search()\n",
    "addtolist(recentct)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In February 2021, this list contained 150 000 datasets that had been officially retracted from ESGF."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The following block searches the global ESGF network for **outdated** datasets and appends any that are not yet on our list."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# latest=False selects datasets that are not retracted but are no longer the latest version\n",
    "ctx = conn.new_context(retracted=False, latest=False, replica=False)\n",
    "recentct = ctx.search()\n",
    "addtolist(recentct)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In addition to the retracted datasets, 50 000 datasets were officially **outdated** in February 2021."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "python3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.6"
  },
  "nbsphinx": {
   "execute": "never"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}