Core API

fetchez.core

This module is the core of the Fetchez library. It handles fetcher initialization, connection pooling, and threading, and it defines the base FetchModule class.

copyright:
  (c) 2010-2026 Regents of the University of Colorado

license:

MIT, see LICENSE for more details.

fetchez.core.fetches_callback(r)

Default callback for fetches processes. r is a list of the form [url, local-fn, data-type, fetch-status-or-error-code].

Parameters:

r (List[Any])

fetchez.core.urlencode_(opts)

Encode opts for use in a URL.

Parameters:

opts (Dict)

Return type:

str

fetchez.core.urlencode(opts, doseq=True)

Encode opts for use in a URL.

Parameters:
  • opts (Dict) – Dictionary of query parameters.

  • doseq (bool, default: True) – If True, lists in values are encoded as separate parameters (e.g., {'a': [1, 2]} -> 'a=1&a=2').

Return type:

str
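To illustrate what doseq changes, the sketch below uses the standard library's urllib.parse.urlencode, which accepts a flag of the same name. Whether fetchez.core.urlencode delegates to it is an assumption; only the effect of the flag is shown.

```python
from urllib.parse import urlencode

params = {"a": [1, 2], "b": "x y"}

# doseq=True: each list element becomes its own key=value pair.
expanded = urlencode(params, doseq=True)

# doseq=False: the list is converted with str() and encoded as one value.
collapsed = urlencode(params, doseq=False)

print(expanded)   # a=1&a=2&b=x+y
print(collapsed)  # a=%5B1%2C+2%5D&b=x+y
```

Most query APIs expect the expanded form, which is presumably why doseq defaults to True here.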

fetchez.core.xml2py(node)

Parse an XML file into a Python dictionary.

Return type:

Optional[Dict]
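The exact dictionary layout produced by xml2py is not documented here. As a rough sketch of the general technique, the hypothetical helper node_to_dict below converts an ElementTree element into nested dictionaries; its name and output shape are assumptions, not the library's behavior.

```python
import xml.etree.ElementTree as ET
from typing import Any, Dict, Optional


def node_to_dict(node: Optional[ET.Element]) -> Optional[Dict[str, Any]]:
    """Hypothetical helper: map an element's children to nested dicts."""
    if node is None:
        return None
    out: Dict[str, Any] = {}
    for child in node:
        # Leaf elements map to their stripped text; others recurse.
        value: Any = node_to_dict(child) if len(child) else (child.text or "").strip()
        if child.tag in out:
            # Repeated tags are collected into a list.
            if not isinstance(out[child.tag], list):
                out[child.tag] = [out[child.tag]]
            out[child.tag].append(value)
        else:
            out[child.tag] = value
    return out


root = ET.fromstring("<meta><title>DEM</title><bbox><w>-105</w><e>-104</e></bbox></meta>")
parsed = node_to_dict(root)
repeated = node_to_dict(ET.fromstring("<r><u>a</u><u>b</u></r>"))
print(parsed)    # {'title': 'DEM', 'bbox': {'w': '-105', 'e': '-104'}}
print(repeated)  # {'u': ['a', 'b']}
```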

fetchez.core.get_userpass(authenticator_url)

Retrieve username and password from netrc for a given URL.

Parameters:

authenticator_url (str)

Return type:

Tuple[Optional[str], Optional[str]]
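A netrc lookup of this kind can be sketched with the standard library's netrc module. The helper lookup_userpass, its netrc_path argument, and the demo credentials are hypothetical. get_userpass itself takes only the URL and presumably reads the user's default netrc file.

```python
import netrc
import os
import tempfile
from typing import Optional, Tuple
from urllib.parse import urlparse


def lookup_userpass(url: str, netrc_path: str) -> Tuple[Optional[str], Optional[str]]:
    """Hypothetical helper: (login, password) for the URL's host, else (None, None)."""
    host = urlparse(url).hostname
    try:
        auth = netrc.netrc(netrc_path).authenticators(host) if host else None
    except (OSError, netrc.NetrcParseError):
        auth = None
    if auth is None:
        return None, None
    login, _account, password = auth
    return login, password


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "netrc")
    with open(path, "w") as fh:
        # Demo entry only; real credentials live in the user's own netrc file.
        fh.write("machine urs.earthdata.nasa.gov login demo_user password demo_pass\n")
    found = lookup_userpass("https://urs.earthdata.nasa.gov", path)
    missing = lookup_userpass("https://example.com", path)

print(found)    # ('demo_user', 'demo_pass')
print(missing)  # (None, None)
```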

fetchez.core.get_credentials(url, authenticator_url='https://urs.earthdata.nasa.gov')

Get user credentials from .netrc or prompt for input. Used for EarthData, etc.

Parameters:
  • url (str)

  • authenticator_url (str, default: 'https://urs.earthdata.nasa.gov')

Return type:

Optional[str]

class fetchez.core.iso_xml(url=None, xml=None, timeout=20, read_timeout=60)

Bases: object

Helper class for parsing ISO 19115 XML Metadata.

__init__(url=None, xml=None, timeout=20, read_timeout=60)

title()

Extract Title.

abstract()

Extract Abstract.

date()

Extract Date.

linkages()

Extract first valid download URL (specifically looking for Zips/Data).

polygon(geom=True)

Extract Bounding Box and return GeoJSON Polygon.
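As a sketch of the bounding-box-to-polygon step, the hypothetical helper bbox_to_polygon below builds a GeoJSON Polygon from four extents. The argument order and the counterclockwise ring are assumptions made for illustration; how polygon() reads the extents from the metadata is not shown.

```python
from typing import Any, Dict


def bbox_to_polygon(west: float, east: float, south: float, north: float) -> Dict[str, Any]:
    """Hypothetical helper: GeoJSON Polygon covering a lon/lat bounding box."""
    # Exterior ring, counterclockwise, closed by repeating the first vertex.
    ring = [
        [west, south],
        [east, south],
        [east, north],
        [west, north],
        [west, south],
    ]
    return {"type": "Polygon", "coordinates": [ring]}


poly = bbox_to_polygon(-105.5, -104.5, 39.5, 40.5)
print(poly["coordinates"][0][0])  # [-105.5, 39.5]
```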

class fetchez.core.HttpFile(url, session=None, callback=None)

Bases: IOBase

A file-like object backed by an HTTP URL.

Translates read() calls into HTTP Range requests to fetch only needed bytes.
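The read-to-Range translation can be sketched without a network connection. RangeBackedFile and fake_fetch below are hypothetical stand-ins, not the HttpFile implementation: the class builds a Range header value for each read() and passes it to a callable that plays the role of the HTTP session.

```python
import io
from typing import Callable, List, Optional


class RangeBackedFile(io.RawIOBase):
    """Hypothetical stand-in: serves read() through ranged fetches."""

    def __init__(self, size: int, fetch_range: Callable[[str], bytes]):
        super().__init__()
        self._size = size
        self._pos = 0
        self._fetch_range = fetch_range

    def readable(self) -> bool:
        return True

    def seekable(self) -> bool:
        return True

    def tell(self) -> int:
        return self._pos

    def seek(self, offset: int, whence: int = io.SEEK_SET) -> int:
        base = {io.SEEK_SET: 0, io.SEEK_CUR: self._pos, io.SEEK_END: self._size}[whence]
        self._pos = max(0, base + offset)
        return self._pos

    def read(self, size: Optional[int] = -1) -> bytes:
        if size is None or size < 0:
            size = self._size - self._pos
        end = min(self._pos + size, self._size)
        if end <= self._pos:
            return b""
        # HTTP byte ranges are inclusive at both ends.
        data = self._fetch_range(f"bytes={self._pos}-{end - 1}")
        self._pos += len(data)
        return data


PAYLOAD = bytes(range(100))
requested: List[str] = []


def fake_fetch(range_header: str) -> bytes:
    """Stand-in for a server that honors 'bytes=start-end'."""
    requested.append(range_header)
    start, end = (int(x) for x in range_header[len("bytes="):].split("-"))
    return PAYLOAD[start:end + 1]


f = RangeBackedFile(len(PAYLOAD), fake_fetch)
f.seek(10)
chunk = f.read(5)
f.seek(-4, io.SEEK_END)
tail = f.read()
print(requested)  # ['bytes=10-14', 'bytes=96-99']
```

Only the nine bytes actually read are requested, which is the point of backing a file-like object with Range requests.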

__init__(url, session=None, callback=None)

seek(offset, whence=0)

Change the stream position to the given byte offset.

offset

The stream position, relative to ‘whence’.

whence

The relative position to seek from.

The offset is interpreted relative to the position indicated by whence. Values for whence are:

  • os.SEEK_SET or 0 – start of stream (the default); offset should be zero or positive

  • os.SEEK_CUR or 1 – current stream position; offset may be negative

  • os.SEEK_END or 2 – end of stream; offset is usually negative

Return the new absolute position.
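The three whence values behave the same way on any seekable stream. The short example below uses io.BytesIO from the standard library rather than HttpFile, so it runs offline.

```python
import io
import os

stream = io.BytesIO(b"0123456789")

from_start = stream.seek(3, os.SEEK_SET)    # absolute position 3
from_current = stream.seek(2, os.SEEK_CUR)  # 3 + 2 = 5
from_end = stream.seek(-1, os.SEEK_END)     # 10 - 1 = 9
last_byte = stream.read()                   # b"9"

print(from_start, from_current, from_end, last_byte)  # 3 5 9 b'9'
```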

tell()

Return current stream position.

read(size=-1)

class fetchez.core.Fetch(url, callback=fetches_callback, headers={'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:146.0) Gecko/20100101 Firefox/146.0'}, verify=True, allow_redirects=True)

Bases: object

Fetch class for retrieving FTP/HTTP data files.

Parameters:
  • url (str)

  • headers (Dict, default: {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:146.0) Gecko/20100101 Firefox/146.0'})

  • verify (bool, default: True)

  • allow_redirects (bool, default: True)

__init__(url, callback=fetches_callback, headers={'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:146.0) Gecko/20100101 Firefox/146.0'}, verify=True, allow_redirects=True)

fetch_req(method='GET', params=None, data=None, json=None, tries=5, timeout=30, read_timeout=120)

Fetch src_url and return the requests Response object, retrying iteratively on failure.

Return type:

Optional[Response]
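An iterative retry that returns None once its attempts run out, in line with the Optional return type, might look like the sketch below. The retry helper, its delay argument, and the choice to catch OSError are assumptions for illustration; the exceptions and backoff used by fetch_req are not documented here.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def retry(request: Callable[[], T], tries: int = 5, delay: float = 0.0) -> Optional[T]:
    """Hypothetical helper: call `request` up to `tries` times, else return None."""
    for attempt in range(1, tries + 1):
        try:
            return request()
        except OSError:
            # Connection-style failure: pause, then loop again (no recursion).
            if attempt < tries:
                time.sleep(delay)
    return None


calls = {"n": 0}


def flaky() -> str:
    """Fails twice with a connection error, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


def always_fails() -> str:
    raise TimeoutError("no response")


result = retry(flaky, tries=5)
gave_up = retry(always_fails, tries=2)
print(result, calls["n"], gave_up)  # ok 3 None
```

A loop keeps the call stack flat regardless of how many attempts are allowed, which is one reason to prefer it over a recursive retry.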

fetch_html(timeout=2)

Fetch src_url and return it as an HTML object.

fetch_xml(timeout=2, read_timeout=10)

Fetch src_url and return it as an XML object.

fetch_file(dst_fn, method='GET', params=None, datatype=None, overwrite=False, timeout=30, read_timeout=120, tries=5, check_size=True, verbose=True)

Fetch src_url and save to dst_fn with resume support.

Parameters:

dst_fn (str)

Return type:

int
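One common way to implement resume support is to request only the bytes that are not yet on disk and append them to the partial file. The sketch below assumes that approach; resume_headers and fake_server are hypothetical, and fetch_file may handle resumption differently.

```python
import os
import tempfile
from typing import Dict


def resume_headers(dst_fn: str) -> Dict[str, str]:
    """Hypothetical helper: Range header continuing after bytes already saved."""
    have = os.path.getsize(dst_fn) if os.path.exists(dst_fn) else 0
    return {"Range": f"bytes={have}-"} if have else {}


def fake_server(payload: bytes, headers: Dict[str, str]) -> bytes:
    """Stand-in for a server that honors an open-ended 'bytes=N-' range."""
    rng = headers.get("Range")
    start = int(rng[len("bytes="):-1]) if rng else 0
    return payload[start:]


payload = b"0123456789"
with tempfile.TemporaryDirectory() as tmp:
    dst = os.path.join(tmp, "data.bin")
    first_headers = resume_headers(dst)  # nothing on disk yet
    with open(dst, "wb") as fh:
        fh.write(payload[:4])            # simulate an interrupted download
    headers = resume_headers(dst)
    with open(dst, "ab") as fh:          # append mode keeps the partial data
        fh.write(fake_server(payload, headers))
    with open(dst, "rb") as fh:
        final = fh.read()

print(first_headers, headers, final)  # {} {'Range': 'bytes=4-'} b'0123456789'
```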

fetch_ftp_file(dst_fn, params=None, datatype=None, overwrite=False)

Fetch an FTP file via ftplib with a progress bar.

fetchez.core.run_fetchez(modules, threads=3, global_hooks=None)

Run Fetchez in parallel with hooks.

  • mod.hooks: Run ONLY on entries belonging to ‘mod’.

  • global_hooks: Run on ALL entries combined.
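The two hook scopes above can be sketched with a thread pool. DemoModule, run_demo, and the hook signature (a list of entries in, a list of entries out) are all hypothetical. The sketch only mirrors the documented rule that module hooks see their own entries while global hooks see the combined list.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Dict, List, Optional

Entry = Dict[str, Any]
Hook = Callable[[List[Entry]], List[Entry]]


class DemoModule:
    """Hypothetical stand-in for a fetch module with results and hooks."""

    def __init__(self, name: str, urls: List[str], hooks: Optional[List[Hook]] = None):
        self.name = name
        self.results: List[Entry] = [{"url": u, "module": name} for u in urls]
        self.hooks: List[Hook] = hooks or []


def run_demo(modules: List[DemoModule], threads: int = 3,
             global_hooks: Optional[List[Hook]] = None) -> List[Entry]:
    def process(mod: DemoModule) -> List[Entry]:
        entries = list(mod.results)
        for hook in mod.hooks:  # module hooks: this module's entries only
            entries = hook(entries)
        return entries

    with ThreadPoolExecutor(max_workers=threads) as pool:
        per_module = list(pool.map(process, modules))  # map preserves input order
    combined = [e for entries in per_module for e in entries]
    for hook in global_hooks or []:  # global hooks: all entries combined
        combined = hook(combined)
    return combined


def tag(entries: List[Entry]) -> List[Entry]:
    return [{**e, "tagged": "yes"} for e in entries]


def number(entries: List[Entry]) -> List[Entry]:
    return [{**e, "n": i} for i, e in enumerate(entries)]


mods = [DemoModule("a", ["u1", "u2"], hooks=[tag]), DemoModule("b", ["u3"])]
out = run_demo(mods, threads=2, global_hooks=[number])
print([(e["module"], e.get("tagged"), e["n"]) for e in out])
# [('a', 'yes', 0), ('a', 'yes', 1), ('b', None, 2)]
```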

class fetchez.core.FetchModule(src_region=None, callback=fetches_callback, hook=None, outdir=None, name='fetches', min_year=None, max_year=None, weight=1.0, uncertainty=0.0, params={}, **kwargs)

Bases: object

Base class for all fetch modules.

__init__(src_region=None, callback=fetches_callback, hook=None, outdir=None, name='fetches', min_year=None, max_year=None, weight=1.0, uncertainty=0.0, params={}, **kwargs)

property hooks

Combine internal and external hooks in the correct execution order.

add_hook(hook_obj)

Add a hook instance at runtime.

run()

Define run() in a sub-module to populate results with URLs.

fetch_entry(entry, check_size=True, retries=5, verbose=True)

fill_results(entry)

Fill self.results with the fetch module entry.

add_entry_to_results(url, dst_fn, data_type, **kwargs)

Add a fetch entry to results. Any keyword arguments can be stored alongside it, but url, dst_fn, and data_type are required.
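The run() and add_entry_to_results() pattern can be sketched with a stand-in base class. DemoFetchModule below mirrors only the documented method names and required keys; it is not the real FetchModule, and the dictionary layout of each entry, the TileModule subclass, and the example.com URLs are assumptions.

```python
from typing import Any, Dict, List


class DemoFetchModule:
    """Hypothetical stand-in that mirrors the documented method names only."""

    def __init__(self, name: str = "fetches"):
        self.name = name
        self.results: List[Dict[str, Any]] = []

    def add_entry_to_results(self, url: str, dst_fn: str, data_type: str, **kwargs: Any) -> None:
        # The three required fields, plus any extra keyword metadata.
        entry = {"url": url, "dst_fn": dst_fn, "data_type": data_type}
        entry.update(kwargs)
        self.results.append(entry)

    def run(self) -> "DemoFetchModule":
        raise NotImplementedError("define run() in a sub-module")


class TileModule(DemoFetchModule):
    """Example sub-module: run() fills results with one entry per tile."""

    def run(self) -> "TileModule":
        for tile in ("n40w105", "n40w106"):
            self.add_entry_to_results(
                url=f"https://example.com/{tile}.tif",
                dst_fn=f"{tile}.tif",
                data_type="raster",
                tile=tile,  # extra keyword stored with the entry
            )
        return self


mod = TileModule(name="tiles").run()
print(len(mod.results), mod.results[0]["dst_fn"], mod.results[1]["tile"])
# 2 n40w105.tif n40w106
```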

class fetchez.core.HttpDataset(url=None, **kwargs)

Bases: FetchModule

Fetch an http file directly.

__init__(url=None, **kwargs)

run()

Define run() in a sub-module to populate results with URLs.

class fetchez.core.Scratch(url, path, datatype, **kwargs)

Bases: FetchModule

Scratch module that just fills the results with its own arguments.

__init__(url, path, datatype, **kwargs)

run()

Define run() in a sub-module to populate results with URLs.