Egest


Because disk space is limited, we have to remove data from the pool storage once it is no longer valid. Whether this information reaches us quickly and correctly depends heavily on the publishing data center.

We move datasets out of the official data pool tree into the directory retracted=/mnt/lustre02/work/ik1017/CMIP6/data/CMIP6_retracted if one of the following conditions is fulfilled:

The dataset is retracted at the ESGF node where it has been published
The dataset is outdated, regardless of whether it has been retracted
The dataset is removed at the ESGF data node where it has been published

As an exception, datasets that we host in the data pool for other services, such as the C3S Climate Data Store, are not moved even if the data has been deleted in ESGF.
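
The rules above can be sketched as a small decision function. This is only an illustration, not the code used at DKRZ: the `DatasetStatus` record and its field names are hypothetical, and the sketch assumes that the exception for other services overrides all three conditions.

```python
from dataclasses import dataclass


@dataclass
class DatasetStatus:
    """Hypothetical record of what is known about one dataset."""
    retracted: bool           # retracted at the publishing ESGF node
    latest: bool              # False means a newer version exists (outdated)
    removed: bool             # no longer present at the publishing data node
    hosted_for_service: bool  # kept for another service, e.g. the C3S Climate Data Store


def should_move_to_retracted(status: DatasetStatus) -> bool:
    """Return True if the dataset should leave the official pool tree."""
    # Assumption: datasets hosted for other services always stay in the pool.
    if status.hosted_for_service:
        return False
    # Any single condition is sufficient.
    return status.retracted or not status.latest or status.removed
```

Checking the exception first keeps the three conditions readable as a single expression.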

The datasets remain available on the filesystem until the next archival cycle.
The retracted tree is archived once per month.

With the following script, you can find datasets that have been retracted or are outdated. A similar script runs as a cron job on the DKRZ system.

See https://esgf-pyclient.readthedocs.io/en/latest/ for the esgf-pyclient documentation.

from tqdm import tqdm
from pyesgf.search import SearchConnection

# One of the most stable search indices is the LLNL index

index_search_url = 'http://esgf-node.llnl.gov/esg-search'
#index_search_url = 'http://esgf-data.dkrz.de/esg-search'

conn = SearchConnection(index_search_url, distrib=True)

We read in the list of already retracted and outdated datasets so that we only append to it. Once in a while, we rebuild the list from scratch so that datasets which have been re-published in the meantime do not remain on it.

# Instance IDs already on the list; a set keeps the membership test fast.
with open("retracted_unlatest.txt", "r") as f:
    content = {line.strip() for line in f}

def addtolist(recentctx):
    """Append the instance_id of every search result that is not yet on the list."""
    with open("retracted_unlatest.txt", "a") as f:
        for dataset in tqdm(recentctx):
            instance = dataset.json["instance_id"]
            if instance not in content:
                f.write(instance+"\n")
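
The append-only logic above can also be written as one self-contained helper. The following is a minimal sketch, not the script used in production: the function name `append_new_ids` is hypothetical, and it assumes that a missing list file should be treated as an empty list (for example when starting from scratch) and that IDs written during the current run should not be written twice.

```python
from pathlib import Path


def append_new_ids(list_file, instance_ids):
    """Append IDs that are not yet in list_file; return how many were added."""
    path = Path(list_file)
    # Assumption: a missing file means an empty list.
    if path.exists():
        known = {line.strip() for line in path.read_text().splitlines() if line.strip()}
    else:
        known = set()
    added = 0
    with path.open("a") as f:
        for instance in instance_ids:
            if instance not in known:
                f.write(instance + "\n")
                known.add(instance)  # avoid duplicates within the same run
                added += 1
    return added
```

With a search result as used in this script, it could be called as append_new_ids("retracted_unlatest.txt", (d.json["instance_id"] for d in recentct)).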

The following block searches for retracted datasets in the global ESGF network and adds any that are not yet on our list.

ctx = conn.new_context(retracted=True, replica=False)
recentct=ctx.search()
addtolist(recentct)

In February 2021, this list contained 150 000 datasets that had been officially retracted from ESGF.

The following block searches for outdated datasets in the global ESGF network and adds any that are not yet on our list.

ctx = conn.new_context(retracted=False, latest=False, replica=False)
recentct=ctx.search()
addtolist(recentct)

In addition to the retracted datasets, 50 000 datasets were officially outdated in February 2021.
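
To act on the list, each instance_id has to be mapped to a directory in the pool tree. The sketch below shows one possible way to do this. It is not the cron job itself, which is not shown here. It assumes that the directory tree mirrors the dot-separated facets of the instance_id and that no facet value contains a dot. The function names and the `pool_root`, `retracted_root`, and `dry_run` parameters are hypothetical.

```python
import shutil
from pathlib import Path


def instance_id_to_relpath(instance_id):
    """Turn a dot-separated instance_id into a relative directory path."""
    # Assumption: one directory level per facet of the instance_id.
    return Path(*instance_id.split("."))


def move_dataset(instance_id, pool_root, retracted_root, dry_run=True):
    """Move one dataset directory from pool_root to retracted_root.

    Returns the destination path, or None if the dataset is not in the pool.
    With dry_run=True nothing is moved; only the destination is reported.
    """
    rel = instance_id_to_relpath(instance_id)
    src = Path(pool_root) / rel
    dst = Path(retracted_root) / rel
    if not src.is_dir():
        return None
    if not dry_run:
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.move(str(src), str(dst))
    return dst
```

The dry-run default is a deliberate choice: it lets the planned moves be reviewed before anything leaves the official data pool tree.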