.\" Man page generated from reStructuredText.
.
.
.nr rst2man-indent-level 0
.
.de1 rstReportMargin
\\$1 \\n[an-margin]
level \\n[rst2man-indent-level]
level margin: \\n[rst2man-indent\\n[rst2man-indent-level]]
-
\\n[rst2man-indent0]
\\n[rst2man-indent1]
\\n[rst2man-indent2]
..
.de1 INDENT
.\" .rstReportMargin pre:
. RS \\$1
. nr rst2man-indent\\n[rst2man-indent-level] \\n[an-margin]
. nr rst2man-indent-level +1
.\" .rstReportMargin post:
..
.de UNINDENT
. RE
.\" indent \\n[an-margin]
.\" old: \\n[rst2man-indent\\n[rst2man-indent-level]]
.nr rst2man-indent-level -1
.\" new: \\n[rst2man-indent\\n[rst2man-indent-level]]
.in \\n[rst2man-indent\\n[rst2man-indent-level]]u
..
.TH "INTAKE" "1" "Sep 22, 2022" "0.6.6" "intake"
.SH NAME
intake \- Intake Documentation
.sp
\fITaking the pain out of data access and distribution\fP
.sp
Intake is a lightweight package for finding, investigating, loading and disseminating data. It will appeal to different
groups for some of the reasons below, but is useful for all and acts as a common platform that everyone can use to
smooth the progression of data from developers and providers to users.
.sp
Intake contains the following main components. You \fIdo not\fP need to use them all! The
library is modular, only use the parts you need:
.INDENT 0.0
.IP \(bu 2
A set of \fBdata loaders\fP (\fI\%Drivers\fP) with a common interface, so that you can
investigate or load anything, from local or remote, with the exact same call, and turning into data structures
that you already know how to manipulate, such as arrays and data\-frames.
.IP \(bu 2
A \fBCataloging system\fP (\fI\%Catalogs\fP) for listing data sources, their metadata and parameters,
and referencing which of the Drivers should load each. The catalogs for a hierarchical,
searchable structure, which can be backed by files, Intake servers or third\-party
data services
.IP \(bu 2
Sets of \fBconvenience functions\fP to apply to various data sources, such as data\-set
persistence, automatic concatenation and metadata inference and the ability to
distribute catalogs and data sources using simple packaging abstractions.
.IP \(bu 2
A \fBGUI layer\fP accessible in the Jupyter notebook or as a standalone webserver, which
allows you to find and navigate catalogs, investigate data sources, and plot either
predefined visualisations or interactively find the right view yourself
.IP \(bu 2
A \fBclient\-server protocol\fP to allow for arbitrary data cataloging services or to
serve the data itself, with a pluggable auth model.
.UNINDENT
.SH DATA USER
.INDENT 0.0
.IP \(bu 2
Intake loads the data for a range of formats and types (see \fI\%Plugin Directory\fP) into containers you already use,
like Pandas dataframes, Python lists, NumPy arrays, and more
.IP \(bu 2
Intake loads, then gets out of your way
.IP \(bu 2
GUI search and introspect data\-sets in \fI\%Catalogs\fP: quickly find what you need to do your work
.IP \(bu 2
Install data\-sets and automatically get requirements
.IP \(bu 2
Leverage cloud resources and distributed computing.
.UNINDENT
.sp
See the executable tutorial:
.sp
\fI\%https://mybinder.org/v2/gh/intake/intake\-examples/master?filepath=tutorial%2Fdata_scientist.ipynb\fP
.SH DATA PROVIDER
.INDENT 0.0
.IP \(bu 2
Simple spec to define data sources
.IP \(bu 2
Single point of truth, no more copy&paste
.IP \(bu 2
Distribute data using packages, shared files or a server
.IP \(bu 2
Update definitions in\-place
.IP \(bu 2
Parametrise user options
.IP \(bu 2
Make use of additional functionality like filename parsing and caching.
.UNINDENT
.sp
See the executable tutorial:
.sp
\fI\%https://mybinder.org/v2/gh/intake/intake\-examples/master?filepath=tutorial%2Fdata_engineer.ipynb\fP
.SH IT
.INDENT 0.0
.IP \(bu 2
Create catalogs out of established departmental practices
.IP \(bu 2
Provide data access credentials via Intake parameters
.IP \(bu 2
Use server\-client architecture as gatekeeper:
.INDENT 2.0
.INDENT 3.5
.INDENT 0.0
.IP \(bu 2
add authentication methods
.IP \(bu 2
add monitoring point; track the data\-sets being accessed.
.UNINDENT
.UNINDENT
.UNINDENT
.IP \(bu 2
Hook Intake into proprietary data access systems.
.UNINDENT
.SH DEVELOPER
.INDENT 0.0
.IP \(bu 2
Turn boilerplate code into a reusable \fI\%Driver\fP
.IP \(bu 2
Pluggable architecture of Intake allows for many points to add and improve
.IP \(bu 2
Open, simple code\-base \-\- come and get involved on \fI\%github\fP!
.UNINDENT
.sp
See the executable tutorial:
.sp
\fI\%https://mybinder.org/v2/gh/intake/intake\-examples/master?filepath=tutorial%2Fdev.ipynb\fP
.sp
The \fI\%Start here\fP document contains the sections that all users new to Intake should
read through. \fI\%Use Cases \- I want to...\fP shows specific problems that Intake solves.
For a brief demonstration, which you can execute locally, go to \fI\%Quickstart\fP\&.
For a general description of all of the components of Intake and how they fit together, go
to \fI\%Overview\fP\&. Finally, for some notebooks using Intake and articles about Intake, go
to \fI\%Examples\fP and \fI\%intake\-examples\fP\&.
These and other documentation pages will make reference to concepts that
are defined in the \fI\%Glossary\fP\&.
.nf

.fi
.sp
.nf

.fi
.sp
.SH START HERE
.sp
These documents will familiarise you with Intake, show you some basic usage and examples,
and describe Intake\(aqs place in the wider python data world.
.SS Quickstart
.sp
This guide will show you how to get started using Intake to read data, and give you a flavour
of how Intake feels to the \fI\%Data User\fP\&.
It assumes you are working in either a conda or a virtualenv/pip environment. For notebooks with
executable code, see the \fI\%Examples\fP\&. This walk\-through can be run from a notebook or interactive
python session.
.SS Installation
.sp
If you are using \fI\%Anaconda\fP or Miniconda, install Intake with the following commands:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
conda install \-c conda\-forge intake
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
If you are using virtualenv/pip, run the following command:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
pip install intake
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
Note that this will install with the minimum of optional requirements. If you want a more complete
install, use \fIintake[complete]\fP instead.
.SS Creating Sample Data
.sp
Let\(aqs begin by creating a sample data set and catalog.  At the command line, run the \fBintake example\fP command.
This will create an example data \fI\%Catalog\fP and two CSV data files.  These files contains some basic facts about the 50
US states, and the catalog includes a specification of how to load them.
.SS Loading a Data Source
.sp
\fI\%Data sources\fP can be created directly with the \fBopen_*()\fP functions in the \fBintake\fP
module.  To read our example data:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
>>> import intake
>>> ds = intake.open_csv(\(aqstates_*.csv\(aq)
>>> print(ds)
<intake.source.csv.CSVSource object at 0x1163882e8>
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
Each open function has different arguments, specific for the data format or service being used.
.SS Reading Data
.sp
Intake reads data into memory using \fI\%containers\fP you are already familiar with:
.INDENT 0.0
.INDENT 3.5
.INDENT 0.0
.IP \(bu 2
Tables: Pandas DataFrames
.IP \(bu 2
Multidimensional arrays: NumPy arrays
.IP \(bu 2
Semistructured data: Python lists of objects (usually dictionaries)
.UNINDENT
.UNINDENT
.UNINDENT
.sp
To find out what kind of container a data source will produce, inspect the \fBcontainer\fP attribute:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
>>> ds.container
\(aqdataframe\(aq
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
The result will be \fBdataframe\fP, \fBndarray\fP, or \fBpython\fP\&.  (New container types will be added in
the future.)
.sp
For data that fits in memory, you can ask Intake to load it directly:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
>>> df = ds.read()
>>> df.head()
        state        slug code                nickname  ...
0     Alabama     alabama   AL      Yellowhammer State
1      Alaska      alaska   AK       The Last Frontier
2     Arizona     arizona   AZ  The Grand Canyon State
3    Arkansas    arkansas   AR       The Natural State
4  California  california   CA            Golden State
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
Many data sources will also have quick\-look plotting available. The attribute \fB\&.plot\fP will list
a number of built\-in plotting methods, such as \fB\&.scatter()\fP, see \fI\%Plotting\fP\&.
.sp
Intake data sources can have \fIpartitions\fP\&.  A partition refers to a contiguous chunk of data that can be loaded
independent of any other partition.  The partitioning scheme is entirely up to the plugin author.  In
the case of the CSV plugin, each \fB\&.csv\fP file is a partition.
.sp
To read data from a data source one chunk at a time, the \fBread_chunked()\fP method returns an iterator:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
>>> for chunk in ds.read_chunked(): print(\(aqChunk: %d\(aq % len(chunk))
\&...
Chunk: 24
Chunk: 26
.ft P
.fi
.UNINDENT
.UNINDENT
.SS Working with Dask
.sp
Working with large datasets is much easier with a parallel, out\-of\-core computing library like
\fI\%Dask\fP\&.  Intake can create Dask containers (like \fBdask.dataframe\fP)
from data sources that will load their data only when required:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
>>> ddf = ds.to_dask()
>>> ddf
Dask DataFrame Structure:
            admission_date admission_number capital_city capital_url    code constitution_url facebook_url landscape_background_url map_image_url nickname population population_rank skyline_background_url    slug   state state_flag_url state_seal_url twitter_url website
npartitions=2
                    object            int64       object      object  object           object       object                   object        object   object      int64           int64                 object  object  object         object         object      object  object
                        ...              ...          ...         ...     ...              ...          ...                      ...           ...      ...        ...             ...                    ...     ...     ...            ...            ...         ...     ...
                        ...              ...          ...         ...     ...              ...          ...                      ...           ...      ...        ...             ...                    ...     ...     ...            ...            ...         ...     ...
Dask Name: from\-delayed, 4 tasks
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
The Dask containers will be partitioned in the same way as the Intake data source, allowing different chunks
to be processed in parallel. Please read the Dask documentation to understand the differences when
working with Dask collections (Bag, Array or Data\-frames).
.SS Opening a Catalog
.sp
A \fI\%Catalog\fP is an inventory of data sources, with the type and arguments prescribed for each, and
arbitrary metadata about each source.
In the simplest case, a catalog can be described by a file in YAML format, a
"\fI\%Catalog file\fP". In real usage, catalogues can be defined in a number of ways, such as remote
files, by
connecting to a third\-party data service (e.g., SQL server) or through an Intake \fI\%Server\fP protocol, which
can implement any number of ways to search and deliver data sources.
.sp
The \fBintake example\fP command, above, created a catalog file
with the following \fI\%YAML\fP\-syntax content:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
sources:
  states:
    description: US state information from [CivilServices](https://civil.services/)
    driver: csv
    args:
      urlpath: \(aq{{ CATALOG_DIR }}/states_*.csv\(aq
    metadata:
      origin_url: \(aqhttps://github.com/CivilServiceUSA/us\-states/blob/v1.0.0/data/states.csv\(aq
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
To load a \fI\%Catalog\fP from a \fI\%Catalog file\fP:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
>>> cat = intake.open_catalog(\(aqus_states.yml\(aq)
>>> list(cat)
[\(aqstates\(aq]
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
This catalog contains one data source, called \fBstates\fP\&.  It can be accessed by attribute:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
>>> cat.states.to_dask()[[\(aqstate\(aq,\(aqslug\(aq]].head()
        state        slug
0     Alabama     alabama
1      Alaska      alaska
2     Arizona     arizona
3    Arkansas    arkansas
4  California  california
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
Placing data source specifications into a catalog like this enables declaring data sets in a single canonical place,
and not having to use boilerplate code in each notebook/script that makes use of the data. The catalogs can also
reference one\-another, be stored remotely, and include extra metadata such as a set of named quick\-look plots that
are appropriate for the particular data source. Note that catalogs are \fBnot\fP restricted
to being stored in YAML files, that just happens to be the simplest way to display them.
.sp
Many catalog entries will also contain "user_parameter" blocks, which are indications of options explicitly
allowed by the catalog author, or for validation or the values passed. The user can customise how a data
source is accessed by providing values for the user_parameters, overriding the arguments specified in
the entry, or passing extra keyword arguments to be passed to the driver. The keywords that should
be passed are limited to the user_parameters defined and the inputs expected by the specific
driver \- such usage is expected only from those already familiar with the specifics of the given
format. In the following example, the user overrides the "csv_kwargs" keyword, which is described
in the documentation for \fI\%CSVSource\fP and gets passed down to the CSV reader:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
# pass extra kwargs understood by the csv driver
>>> intake.cat.states(csv_kwargs={\(aqheader\(aq: None, \(aqskiprows\(aq: 1}).read().head()
           0           1   ...                                17
0     Alabama     alabama  ...    https://twitter.com/alabamagov
1      Alaska      alaska  ...        https://twitter.com/alaska
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
Note that, if you are \fIcreating\fP such catalogs, you may well start by trying the \fBopen_csv\fP command,
above, and then use \fBprint(ds.yaml())\fP\&. If you do this now, you will see that the output is very
similar to the catalog file we have provided.
.SS Installing Data Source Packages
.sp
Intake makes it possible to create \fI\%Data packages\fP (\fBpip\fP or \fBconda\fP)
that install data sources into a
global catalog.  For example, we can
install a data package containing the same data we have been working with:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
conda install \-c intake data\-us\-states
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
\fI\%Conda\fP installs the catalog file in this package to \fB$CONDA_PREFIX/share/intake/us_states.yml\fP\&.
Now, when we import
\fBintake\fP, we will see the data from this package appear as part of a global catalog called \fBintake.cat\fP\&. In this
particular case we use Dask to do the reading (which can handle larger\-than\-memory data and parallel
processing), but \fBread()\fP would work also:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
>>> import intake
>>> intake.cat.states.to_dask()[[\(aqstate\(aq,\(aqslug\(aq]].head()
        state        slug
0     Alabama     alabama
1      Alaska      alaska
2     Arizona     arizona
3    Arkansas    arkansas
4  California  california
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
The global catalog is a union of all catalogs installed in the conda/virtualenv environment and also any catalogs
installed in user\-specific locations.
.SS Adding Data Source Packages using the Intake path
.sp
Intake checks the Intake config file for \fBcatalog_path\fP or the environment variable \fB"INTAKE_PATH"\fP for a colon
separated list of paths (semicolon on windows) to search for catalog files.
When you import \fBintake\fP we will see all entries from all of the catalogues referenced as part of a global catalog
called \fBintake.cat\fP\&.
.SS Using the GUI
.sp
A graphical data browser is available in the Jupyter notebook environment or standalone web\-server.
It will show the
contents of any installed catalogs, plus allows for selecting local and remote catalogs,
to browse and select entries from these. See \fI\%GUI\fP\&.
.SS Use Cases \- I want to...
.sp
Here follows a list of specific things that people may want to get done, and
details of how Intake can help. The details of how to achieve each of these
activities can be found in the rest of the detailed documentation.
.SS Avoid copy&paste of blocks of code for accessing data
.sp
This is a very common pattern, if you want to load some specific data, to
find someone, perhaps a colleague, who has accessed it before, and copy that
code. Such a practice is extremely error prone, and cause a proliferation of
copies of code, which may evolve over time, with various versions simultaneously
in use.
.sp
Intake separates the concerns of data\-source specification from code. The
specs are stored separately, and all users can reference the one and only
authoritative definition, whether in a shared file, a service visible to
everyone or by using the Intake server. This spec can be updated so that
everyone gets the current version instead of relying on outdated code.
.SS Version control data sources
.sp
Version control (e.g., using \fBgit\fP) is an essential practice in modern
software engineering and data science. It ensures that the change history is
recorded, with times, descriptions and authors along with the changes themselves.
.sp
When data is specified using a well\-structured syntax such as YAML, it can
be checked into a version controlled repository in the usual fashion. Thus, you
can bring rigorous practices to your data as well as your code.
.sp
If using conda packages to distribute data specifications, these come with a
natural internal version numbering system, such that users need only do
\fBconda update ...\fP to get the latest version.
.SS "Install" data
.sp
Often, finding and grabbing data is a major hurdle to productivity. People may be
required to download artifacts from various places or search through storage
systems to find the specific thing that they are after. One\-line commands which
can retrieve data\-source specifications or the files themselves can be a massive
time\-saver. Furthermore, each data\-set will typically need its own code to
be able to access it, and probably additional software dependencies.
.sp
Intake allows you to build \fBconda\fP packages, which can include catalog files
referencing online resources, or to include data files directly in that package.
Whether uploaded to \fBanaconda.org\fP or hosted on a private enterprise channel,
getting the data becomes a single \fBconda install ...\fP command, whereafter
it will appear as an entry in \fBintake.cat\fP\&. The conda package brings versioning
and dependency declaration for free, and you can include any code that may be
required for that specific data\-set directly in the package too.
.SS Update data specifications in\-place
.sp
Individual data\-sets often may be static, but commonly, the "best" data to get a
job done changes with time as new facts emerge. Conversely, the very same data
might be better stored in a different format which is, for instance, better\-suited
to parallel access in the cloud. In such situations, you really don\(aqt want to force
all the data scientists who rely on it to have their code temporarily broken and
be forced to change this code.
.sp
By working with a catalog file/service in a fixed shared location, it is possible to
update the data source specs in\-place. When users now run their code, they will get
the latest version. Because all Intake drivers have the same API, the code using the
data will be identical and not need to be changed, even when the format has been
updated to something more optimised.
.SS Access data stored on cloud resources
.sp
Services such as AWS S3, GCS and Azure Datalake (or private enterprise variants of
these) are increasingly popular locations to amass large amounts of data. Not only
are they relatively cheap per GB, but they provide long\-term resilience, metadata
services, complex access control patterns and can have very large data throughput
when accessed in parallel by machines on the same architecture.
.sp
Intake comes with integration to cloud\-based storage out\-of\-the box for most of the
file\-based data formats, to be able to access the data directly in\-place and in
parallel. For the few remaining cases where direct access is not feasible, the
caching system in Intake allows for download of files on first use, so that
all further  access is much faster.
.SS Work with "Big Data"
.sp
The era of Big Data is here! The term means different things to different people,
but certainly implies that an individual data\-set is too large to fit into the
memory of a typical workstation computer (>>10GB). Nevertheless, most data\-loading
examples available use functions in packages such as \fBpandas\fP and expect to
be able to produce in\-memory representations of the whole data. This is clearly a
problem, and a more general answer should be available aside from "get more memory
in your machine".
.sp
Intake integrates with \fBDask\fP and \fBSpark\fP, which both offer out\-of\-core
computation (loading the data in chunks which fit in memory and aggregating result)
or can spread their work over a cluster of machines, effectively making use of the
shared memory resources of the whole cluster. Dask integration is built into the
majority of the the drivers and exposed with the \fB\&.to_dask()\fP method, and Spark
integration is available for a small number of drivers with a similar \fB\&.to_spark()\fP
method, as well as directly with the \fBintake\-spark\fP package.
.sp
Intake also integrates with many data services which themselves can perform big\-data
computations, only extracting the smaller aggregated data\-sets that \fIdo\fP fit into
memory for further analysis. Services such as SQL systems, \fBsolr\fP, \fBelastic\-search\fP,
\fBsplunk\fP, \fBaccumulo\fP and \fBhbase\fP all can distribute the work required to fulfill
a query across many nodes of a cluster.
.SS Find the right data\-set
.sp
Browsing for the data\-set which will solve a particular problem can be hard, even
when the data have been curated and stored in a single, well\-structured system. You
do \fInot\fP want to rely on word\-of\-mouth to specify which data is right for which job.
.sp
Intake catalogs allow for self\-description of data\-sets, with simple text and
arbitrary metadata, with a consistent access pattern. Not only can you list the data
available to you, but you can find out what exactly that data represents, and
the form the data would take if loaded (table versus list of items, for example). This extra
metadata is also searchable: you can descend through a hierarchy of catalogs with
a single search, and find all the entries containing some particular keywords.
.sp
You can use the Intake GUI to graphically browse through your available data\-sets or
point to catalogs available to you, look through the entries listed there and get
information about each, or even show a sample of the data or quick\-look plots. The GUI
is also able to execute searches and browse file\-systems to find data artifacts of
interest. This same functionality is also available via a command\-line interface
or programmatically.
.SS Work remotely
.sp
Interacting with cloud storage resources is very convenient, but you will not want to
download large amounts of data to your laptop or workstation for analysis. Intake
finds itself at home in the remote\-execution world of jupyter and Anaconda Enterprise
and other in\-browser technologies. For instance, you can run the Intake GUI either as a
stand\-alone application for browsing data\-sets or in a notebook for full analytics,
and have all the runtime live on a remote machine, or perhaps a cluster which is
co\-located with the data storage. Together with cloud\-optimised data formats such
as parquet, this is an ideal set\-up for processing data at web scale.
.SS Transform data to efficient formats for sharing
.sp
A massive amount of data exists in human\-readable formats such as JSON, XML and CSV,
which are not very efficient in terms of space usage and need to be parsed on
load to turn into arrays or tables. Much faster processing times can be had with
modern compact, optimised formats, such as parquet.
.sp
Intake has a "persist" mechanism to transform any input data\-source into the format
most appropriate for that type of data, e.g., parquet for tabular data. The persisted
data will be used in preference at analysis time, and the schedule for updating from
the original source is configurable. The location of these persisted data\-sets can
be shared with others, so they can also gain the benefits, or the "export" variant
can be used to produce an independent version in the same format, together with a
spec to reference it by; you would then share this spec with others.
.SS Access data without leaking credentials
.sp
Security is important. Users\(aq identity and authority to view specific data should be
established before handing over any sensitive bytes. It is, unfortunately, all too
common for data scientists to include their username, passwords or other credentials
directly in code, so that it can run automatically, thus presenting a potential
security gap.
.sp
Intake does not manage credentials or user identities directly, but does provide hooks
for fetching details from the environment or other service, and using the values in
templating at the time of reading the data. Thus, the details are not included in
the code, but every access still requires for them to be present.
.sp
In other cases, you may want to require the user to provide their credentials every
time, rather that automatically establish them, and "user parameters" can be specified
in Intake to cover this case.
.SS Establish a data gateway
.sp
The Intake server protocol allows you fine\-grained control over the set of data sources
that are listed, and exactly what to return to a user when they want to read some of
that data. This is an ideal opportunity to include authorisation checks,
audit logging, and any more complicated access patterns, as required.
.sp
By streaming the data through a single channel on the server, rather than allowing
users direct access to the data storage backend, you can log and verify all access
to your data.
.SS Clear distinction between data curator and analyst roles
.sp
It is desirable to separate out two tasks: the definition of data\-source specifications, and
accessing and using data. This is so that those who understand the origins of the data
and the implications of various formats and other storage options (such as chunk\-size)
should make those decisions and encode what they have done into specs. It leaves the
data users, e.g., data scientists, free to find and use the data\-sets appropriate for
their work and simply get on with their job \- without having to learn about various
storage formats and access APIs.
.sp
This separation is at the very core of what Intake was designed to do.
.SS Users to be able to access data without learning every backend API
.sp
Data formats and services are a wide mess of many libraries and APIs. A large amount of
time can be wasted in the life of a data scientist or engineer in finding out the details
of the ones required by their work. Intake wraps these various libraries, REST APIs and
similar, to provide a consistent experience for the data user. \fBsource.read()\fP will
simply get all of the data into memory in the container type for that source \- no
further parameters or knowledge required.
.sp
Even for the curator of data catalogs or data driver authors, the framework established by
Intake provides a lot of convenience and simplification which allows each person to
deal with only the specifics of their job.
.SS Data sources to be self\-describing
.sp
Having a bunch of files in some directory is a very common pattern for data storage in the
wild. There may or may not be a README file co\-located giving some information in a
human\-readable form, but generally not structured \- such files are usually different
in every case.
.sp
When a data source is encoded into a catalog, the spec offers a natural place to describe
what that data is, along with the possibility to provide an arbitrary amount of
structured metadata and to describe any parameters that are to be exposed for user
choice. Furthermore, Intake data sources each have a particular container type, so
that users know whether to expect a dataframe, array, etc., and simple introspection
methods like \fBdescribe\fP and \fBdiscover\fP which return basic information about the
data without having to load all of it into memory first.
.SS A data source hierarchy for natural structuring
.sp
Usually, the set of data sources held by an organisation have relationships to one another,
and would be poorly served to be provided as a simple flat list of everything available.
Intake allows catalogs to refer to other catalogs. This means, that you can group data
sources by various facets (type, department, time...) and establish hierarchical
data\-source trees within which to find the particular data most likely to be of interest.
Since the catalogs live outside and separate from the data files themselves, as many
hierarchy structures as thought useful could be created.
.sp
For even more complicated data source meta\-structures, it is possible to store all the
details and even metadata in some external service (e.g., traditional SQL tables) with which
Intake can interact to perform queries and return particular subsets of the
available data sources.
.SS Expose several data collections under a single system
.sp
There are already several catalog\-like data services in existence in the world, and
some organisation may have several of these in\-house for various different purposes.
For example, an SQL server may hold details of customer lists and transactions, but
historical time\-series and reference data may be held separately in archival data formats like
parquet on a file\-storage system; while real\-time system monitoring is done by a
totally unrelated system such as Splunk or elastic search.
.sp
Of course, Intake can read from various file formats and data services. However, it
can also interpret the internal conception of data catalogs that some data services may
have. For example, all of the tables known to the SQL server, or all of the pre\-defined
queries in Splunk can be automatically included as catalogs in Intake, and take their
place amongst the regular YAML\-specified data sources, with exactly the same usage for
all of them.
.sp
These data sources and their hierarchical structure can then be exposed via the
graphical data browser, for searching, selecting and visualising data\-sets.
.SS Modern visualisations for all data\-sets
.sp
Intake is integrated with the comprehensive \fBholoviz\fP suite, particularly \fBhvplot\fP, to
bring simple yet powerful data visualisations to any Intake data source by using just one
single method for everything. These plots are interactive, and can include server\-side
dynamic aggregation of very large data\-sets to display more data points than the
browser can handle.
.sp
You can specify specific plot types right in the data source definition, to have these
customised visualisations available to the user as simple one\-liners known to
reveal the content of the data, or even view the same visuals right in the graphical
data source browser application. Thus, Intake is already an all\-in\-one data investigation
and dashboarding app.
.SS Update data specifications in real time
.sp
Intake data catalogs are not limited to reading static specification from
files. They can also execute queries on remote data services and return lists of
data sources dynamically at runtime. New data sources may appear, for example,
as directories of data files are pushed to a storage service, or new tables are
created within a SQL server.
.SS Distribute data in a custom format
.sp
Sometimes, the well\-known data formats are just not right for a given data\-set,
and a custom\-built format is required. In such cases, the code to read the data
may not exist in any library. Intake allows for code to be distributed along
with data source specs/catalogs or even files in a single \fBconda\fP package.
That encapsulates everything needed to describe and use that particular data,
and can then be distributed as a single entity, and installed with a one\-liner.
.sp
Furthermore, should the few builtin container types (sequence, array, dataframe)
not be sufficient, you can supply your own, and then build drivers that use it.
This was done, for example, for \fBxarray\fP\-type data, where multiple related
N\-D arrays share a coordinate system and metadata. By creating this container,
a whole world of scientific and engineering data was opened up to Intake. Creating
new containers is not hard, though, and we foresee more coming, such as
machine\-learning models and streaming/real\-time data.
.SS Create Intake data\-sets from scratch
.sp
If you have a set of files or a data service which you wish to make into a data\-set,
so that you can include it in a catalog, you should use the set of functions
\fBintake.open_*\fP, where you need to pick the function appropriate for your
particular data. You can use tab\-completion to list the set of data drivers you have
installed, and find others you may not yet have installed at \fI\%Plugin Directory\fP\&.
Once you have determined the right set of parameters to load the data in the manner
you wish, you can use the source\(aqs \fB\&.yaml()\fP method to find the spec that describes
the source, so you can insert it into a catalog (with appropriate description and
metadata). Alternatively, you can open a YAML file as a catalog with \fBintake.open_catalog\fP
and use its \fB\&.add()\fP method to insert the source into the corresponding file.
.sp
If, instead, you have data in your session in one of the containers supported by Intake
(e.g., array, data\-frame), you can use the \fBintake.upload()\fP function to save it to
files in an appropriate format and a location you specify, and give you back a data\-source
instance, which, again, you can use with \fB\&.yaml()\fP or \fB\&.add()\fP, as above.
.SS Overview
.SS Introduction
.sp
This page describes the technical design of Intake, with brief details of the aims of the project and
components of the library
.SS Why Intake?
.sp
Intake solves a related set of problems:
.INDENT 0.0
.IP \(bu 2
Python API standards for loading data (such as DB\-API 2.0) are optimized for transactional databases and query results
that are processed one row at a time.
.IP \(bu 2
Libraries that do load data in bulk tend to each have their own API for doing so, which adds friction when switching
data formats.
.IP \(bu 2
Loading data into a distributed data structure (like those found in Dask and Spark) often requires writing a separate
loader.
.IP \(bu 2
Abstractions often focus on just one data model (tabular, n\-dimensional array, or semi\-structured), when many projects
need to work with multiple kinds of data.
.UNINDENT
.sp
Intake has the explicit goal of \fBnot\fP defining a computational expression
system.  Intake plugins load the data into containers (e.g., arrays or data\-frames) that
provide their data processing features.  As a result, it is
very easy to make a new Intake plugin with a relatively small amount of Python.
.SS Structure
.sp
Intake is a Python library for accessing data in a simple and uniform way.  It consists of three parts:
.sp
1. A lightweight plugin system for adding data loader \fI\%drivers\fP for new file formats and servers
(like databases, REST endpoints or other cataloging services)
.sp
2. A cataloging system for specifying these sources in simple \fI\%YAML\fP syntax, or with plugins that read source specs
from some external data service
.sp
3. A server\-client architecture that can share data catalog metadata over the network, or even stream the data directly
to clients if needed
.sp
Intake supports loading data into standard Python containers. The list can be easily extended,
but the currently supported list is:
.INDENT 0.0
.IP \(bu 2
Pandas Dataframes \- tabular data
.IP \(bu 2
NumPy Arrays \- tensor data
.IP \(bu 2
Python lists of dictionaries \- semi\-structured data
.UNINDENT
.sp
Additionally, Intake can load data into distributed data structures.  Currently it supports Dask, a flexible parallel
computing library with distributed containers like \fI\%dask.dataframe\fP,
\fI\%dask.array\fP,
and \fI\%dask.bag\fP\&.
In the future, other distributed computing systems could use Intake to create similar data structures.
.SS Concepts
.sp
Intake is built out of four core concepts:
.INDENT 0.0
.IP \(bu 2
Data Source classes: the "driver" plugins that each implement loading of some specific type of data into python, with
plugin\-specific arguments.
.IP \(bu 2
Data Source: An object that represents a reference to a data source.  Data source instances have methods for loading the
data into standard containers, like Pandas DataFrames, but do not load any data until specifically requested.
.IP \(bu 2
Catalog: An inventory of catalog entries, each of which defines a Data Source. Catalog objects can be created from
local YAML definitions, by connecting
to remote servers, or by some driver that knows how to query an external data service.
.IP \(bu 2
Catalog Entry: A named data source held internally by catalog objects, which generate
data source instances when accessed.
The catalog entry includes metadata about the source, as well as the name of the
driver and arguments. Arguments can be parameterized, allowing one entry to return
different subsets of data depending on the user request.
.UNINDENT
.sp
The business of a plugin is to go from some data format (bunch of files or some remote service)
to a "\fI\%Container\fP" of the data (e.g., data\-frame), a thing on which you can perform further analysis.
Drivers can be used directly by the user, or indirectly through data catalogs.  Data sources can be pickled, sent over
the network to other hosts, and reopened (assuming the remote system has access to the required files or servers).
.sp
See also the \fI\%Glossary\fP\&.
.SS Future Directions
.sp
Ongoing work for enhancements, as well as requests for plugins, etc., can be found at the
\fI\%issue tracker\fP\&. See the \fI\%Roadmap\fP
for general mid\- and long\-term goals.
.SS Examples
.sp
Here we list links to notebooks and other code demonstrating the use of Intake in various
scenarios. The first section is of general interest to various users, and the sections that
follow tend to be more specific about particular features and workflows.
.sp
Many of the entries here include a link to Binder, which a service that lest you execute
code live in a notebook environment. This is a great way to experience using Intake.
It can take a while, sometimes, for Binder to come up; please have patience.
.sp
See also the \fI\%examples\fP repository, containing data\-sets which can be built and installed
as conda packages.
.SS General
.INDENT 0.0
.IP \(bu 2
Basic Data scientist workflow: using Intake
[\fI\%Static\fP]
[\fI\%Executable\fP].
.IP \(bu 2
Workflow for creating catalogs: a Data Engineer\(aqs approach to Intake
[\fI\%Static\fP]
[\fI\%Executable\fP]
.UNINDENT
.SS Developer
.sp
Tutorials delving deeper into the Internals of Intake, for those who wish to contribute
.INDENT 0.0
.IP \(bu 2
How you would go about writing a new plugin
[\fI\%Static\fP]
[\fI\%Executable\fP]
.UNINDENT
.SS Features
.sp
More specific examples of Intake functionality
.INDENT 0.0
.IP \(bu 2
Caching:
.INDENT 2.0
.INDENT 3.5
.INDENT 0.0
.IP \(bu 2
New\-style data package creation [\fI\%Static\fP]
.IP \(bu 2
Using automatically cached data\-files
[\fI\%Static\fP]
[\fI\%Executable\fP]
.IP \(bu 2
Earth science demonstration of cached dataset
[\fI\%Static\fP]
[\fI\%Executable\fP]
.UNINDENT
.UNINDENT
.UNINDENT
.IP \(bu 2
File\-name pattern parsing:
.INDENT 2.0
.INDENT 3.5
.INDENT 0.0
.IP \(bu 2
Satellite imagery, science workflow
[\fI\%Static\fP]
[\fI\%Executable\fP]
.IP \(bu 2
How to set up pattern parsing
[\fI\%Static\fP]
[\fI\%Executable\fP]
.UNINDENT
.UNINDENT
.UNINDENT
.IP \(bu 2
Custom catalogs:
.INDENT 2.0
.INDENT 3.5
.INDENT 0.0
.IP \(bu 2
A custom intake plugin that adapts DCAT catalogs
[\fI\%Static\fP]
[\fI\%Executable\fP]
.UNINDENT
.UNINDENT
.UNINDENT
.UNINDENT
.SS Data
.INDENT 0.0
.IP \(bu 2
\fI\%Anaconda package data\fP, originally announced in \fI\%this blog\fP
.IP \(bu 2
\fI\%Planet Four Catalog\fP, originally from \fI\%https://www.planetfour.org/results\fP
.IP \(bu 2
The official Intake \fI\%examples\fP
.UNINDENT
.SS Blogs
.sp
These are Intake\-related articles that may be of interest.
.INDENT 0.0
.IP \(bu 2
\fI\%Discovering and Exploring Data in a Graphical Interface\fP
.IP \(bu 2
\fI\%Taking the Pain out of Data Access\fP
.IP \(bu 2
\fI\%Caching Data on First Read Makes Future Analysis Faster\fP
.IP \(bu 2
\fI\%Parsing Data from Filenames and Paths\fP
.IP \(bu 2
\fI\%Intake for cataloguing Spark\fP
.IP \(bu 2
\fI\%Intake released on Conda\-Forge\fP
.UNINDENT
.SS Talks
.INDENT 0.0
.IP \(bu 2
\fI\%__init__ podcast interview (May 2019)\fP
.IP \(bu 2
\fI\%AnacondaCon (March 2019)\fP
.IP \(bu 2
\fI\%PyData DC (November 2018)\fP
.IP \(bu 2
\fI\%PyData NYC (October 2018)\fP
.IP \(bu 2
\fI\%ESIP tech dive (November 2018)\fP
.UNINDENT
.SS News
.INDENT 0.0
.IP \(bu 2
See out \fI\%Wiki\fP page
.UNINDENT
.SS Deployment Scenarios
.sp
In the following sections, we will describe some of the ways in which Intake is used in real
production systems. These go well beyond the typical YAML files presented in the quickstart
and examples sections, which are necessarily short and simple, and do not demonstrate the
full power of Intake.
.SS Sharing YAML files
.sp
This is the simplest scenario, and amply described in these documents. The primary
advantage is simplicity: it is enough to put a file in an accessible place (even
a gist or repo), in order
for someone else to be able to discover and load that data. Furthermore, such
files can easily refer to one\-another, to build up a full tree of data assets with
minimum pain Since YAML files are
text, this also lends itself to working well with version control systems.
Furthermore, all sources can describe themselves as YAML, and the
\fBexport\fP and \fBupload\fP commands can produce an efficient format (possibly remote) together
with YAML definition in a single step.
.SS Pangeo
.sp
The \fI\%Pangeo\fP collaboration uses Intake to catalog their data holdings, which are generally
in various forms of netCDF\-compliant formats, massive multi\-dimensional arrays with data
relating to earth and climate science and meteorology. On their cloud\-based platform,
containers start up jupyter\-lab sessions which have Intake installed, and therefore can
simply pick and load the data that each researcher needs \- often requiring large Dask
clusters to actually do the processing.
.sp
A \fI\%static\fP rendering of the catalog
contents is available, so that users can browse the holdings
without even starting a python session. This rendering is produced by CI on the
\fI\%repo\fP whenever new definitions are
added, and it also checks (using Intake) that each definition is indeed loadable.
.sp
Pangeo also developed intake\-stac, which can talk to STAC servers to make real\-time
queries and parse the results into Intake data sources. This is a standard for
spaceo\-temporal data assets, and indexes massive amounts of cloud\-stored data.
.SS Anaconda Enterprise
.sp
Intake will be the basis of the data access and cataloging service within
\fI\%Anaconda Enterprise\fP, running as a micro\-service in a container, and offering data
source definitions to users. The access control, who gets to see which data\-set,
and serving of credentials to be able to read from the various data storage services,
will all be handled by the platform and be fully configurable by admins.
.SS National Center for Atmospheric Research
.sp
NCAR has developed \fI\%intake\-esm\fP, a mechanism for creating file\-based Intake catalogs
for climate data from project efforts such as the \fI\%Coupled Model Intercomparison Project (CMIP)\fP
and the \fI\%Community Earth System Model (CESM) Large Ensemble Project\fP\&.
These projects produce a huge of amount climate data persisted on tape, disk storage components
across multiple (of the order ~300,000) netCDF files. Finding, investigating, loading these files into data array containers
such as \fIxarray\fP can be a daunting task due to the large number of files a user may be interested in.
\fBIntake\-esm\fP addresses this issue in three steps:
.INDENT 0.0
.IP \(bu 2

.nf
\(gaDataset Catalog Curation\(ga_
.fi
 in form of YAML files. These YAML files provide information about data locations,
access pattern,  directory structure, etc. \fBintake\-esm\fP uses these YAML files in conjunction with file name templates
to construct a local database. Each row in this database consists of a set of metadata such as \fBexperiment\fP,
\fBmodeling realm\fP, \fBfrequency\fP corresponding to data contained in one netCDF file.
.UNINDENT
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
cat = intake.open_esm_metadatastore(catalog_input_definition="GLADE\-CMIP5")
.ft P
.fi
.UNINDENT
.UNINDENT
.INDENT 0.0
.IP \(bu 2
Search and Discovery: once the database is built, \fBintake\-esm\fP can be used for searching and discovering
of climate datasets by eliminating the need for the user to know specific locations (file path) of
their data set of interest:
.UNINDENT
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
sub_cat = cat.search(variable=[\(aqhfls\(aq], frequency=\(aqmon\(aq, modeling_realm=\(aqatmos\(aq, institute=[\(aqCCCma\(aq, \(aqCNRM\-CERFACS\(aq])
.ft P
.fi
.UNINDENT
.UNINDENT
.INDENT 0.0
.IP \(bu 2
Access: when the user is satisfied with the results of their query, they can ask \fBintake\-esm\fP
to load the actual netCDF files into xarray datasets:
.UNINDENT
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
dsets = cat.to_xarray(decode_times=True, chunks={\(aqtime\(aq: 50})
.ft P
.fi
.UNINDENT
.UNINDENT
.SS Brookhaven Archive
.sp
The \fI\%Bluesky\fP project uses Intake to dynamically query a MongoDB instance, which
holds the details of experimental and simulation data catalogs, to return a
custom Catalog for every query. Data\-sets can then be loaded into python, or the original
raw data can be accessed ...
.SS Zillow
.sp
Zillow is developing Intake to meet the needs of their datalake access layer (DAL),
to encapsulate the highly hierarchical nature of their data. Of particular importance,
is the ability to provide different version (testing/production, and different
storage formats) of the same logical dataset, depending on
whether it is being read on a laptop versus the production infrastructure ...
.SS Intake Server
.sp
The server protocol (see \fI\%Server Protocol\fP) is simple enough that anyone can write their
own implementation with full customisation and behaviour. In particular, auth and
monitoring would be essential for a production\-grade deployment.
.SH USER GUIDE
.sp
More detailed information about specific parts of Intake, such as how to author catalogs,
how to use the graphical interface, plotting, etc.
.SS GUI
.SS Using the GUI
.sp
\fBNote\fP: the GUI requires \fBpanel\fP and \fBbokeh\fP to
be available in the current environment.
.sp
The Intake top\-level singleton \fBintake.gui\fP gives access to a graphical data browser
within the Jupyter notebook. To expose it, simply enter it into a code cell (Jupyter
automatically display the last object in a code cell).
[image]
.sp
New instances of the GUI are also available by instantiating \fBintake.interface.gui.GUI\fP,
where you can specify a list of catalogs to initially include.
.sp
The GUI contains three main areas:
.INDENT 0.0
.IP \(bu 2
a \fBlist of catalogs\fP\&. The "builtin" catalog, displayed by default, includes data\-sets installed
in the system, the same as \fBintake.cat\fP\&.
.IP \(bu 2
a \fBlist of sources\fP within the currently selected catalog.
.IP \(bu 2
a \fBdescription\fP of the currently selected source.
.UNINDENT
.SS Catalogs
.sp
Selecting a catalog from the list will display nested catalogs below the parent and display
source entries from the catalog in the \fBlist of sources\fP\&.
.sp
Below the \fBlists of catalogs\fP is a row of buttons that are used for adding, removing and
searching\-within catalogs:
.INDENT 0.0
.IP \(bu 2
\fBAdd\fP: opens a sub\-panel for adding catalogs to the interface, by either browsing for a local
YAML file or by entering a URL for a catalog, which can be a remote file or Intake server
.IP \(bu 2
\fBRemove\fP: deletes the currently selected catalog from the list
.IP \(bu 2
\fBSearch\fP: opens a sub\-panel for finding entries in the currently selected catalog (and its
sub\-catalogs)
.UNINDENT
.SS Add Catalogs
.sp
The Add button (+) exposes a sub\-panel with two main ways to add catalogs to the interface:
[image]
.sp
This panel has a tab to load files from \fBlocal\fP; from that you can navigate around the filesystem
using the arrow or by editing the path directly. Use the home button to get back to the starting
place. Select the catalog file you need. Use the "Add Catalog" button to add the catalog to the list
above.
[image]
.sp
Another tab loads a catalog from \fBremote\fP\&. Any URL is valid here, including cloud locations,
\fB"gcs://bucket/..."\fP, and intake servers, \fB"intake://server:port"\fP\&. Without a protocol
specifier, this can be a local path. Again, use the "Add Catalog" button to add
the catalog to the list above.
[image]
.sp
Finally, you can add catalogs to the interface in code, using the \fB\&.add()\fP method,
which can take filenames, remote URLs or existing \fBCatalog\fP instances.
.SS Remove Catalogs
.sp
The Remove button (\-) deletes the currently selected catalog from the list. It is important to
note that this action does not have any impact on files, it only affects what shows up in the list.
[image]
.SS Search
.sp
The sub\-panel opened by the Search button (🔍) allows the user to search within the selected catalog
[image]
.sp
From the Search sub\-panel the user enters for free\-form text. Since some catalogs contain nested sub\-catalogs,
the Depth selector allows the search to be limited to the stated number of nesting levels.
This may be necessary, since, in theory, catalogs can contain circular references,
and therefore allow for infinite recursion.
[image]
.sp
Upon execution of the search, the currently selected catalog will be searched. Entries will
be considered to match if any of the entered words is found in the description of the entry (this
is case\-insensitive). If any matches are found, a new entry will be made in the catalog list,
with the suffix "_search".
[image]
.SS Sources
.sp
Selecting a source from the list updates the description text on the left\-side of the gui.
.sp
Below the \fBlist of sources\fP is a row of buttons for inspecting the selected data source:
.INDENT 0.0
.IP \(bu 2
\fBPlot\fP: opens a sub\-panel for viewing the pre\-defined (specified in the yaml) plots
for the selected source.
.UNINDENT
.SS Plot
.sp
The Plot button (📊) opens a sub\-panel with an area for viewing pre\-defined plots.
[image]
.sp
These plots are specified in the catalog yaml and that yaml can be displayed by
checking the box next to "show yaml".
[image]
.sp
The holoviews object can be retrieved from the gui using \fBintake.interface.source.plot.pane.object\fP,
and you can then use it in Python or export it to a file.
.SS Interactive Visualization
.sp
If you have installed the optional extra packages \fI\%dfviz\fP and \fI\%xrviz\fP, you can
interactively plot your dataframe or array data, respectively.
[image]
.sp
The button "customize" will be available for data sources of the appropriate type.
Click this to open the interactive interface. If you have not selected a predefined
plot (or there are none), then the interface will start without any prefilled
values, but if you do first select a plot, then the interface will have its options
pre\-filled from the options
.sp
For specific instructions on how to use the interfaces (which can also be used
independently of the Intake GUI), please navigate to the linked documentation.
.sp
Note that the final parameters that are sent to \fBhvPlot\fP to produce the output
each time a plot if updated, are explicitly available in YAML format, so that
you can save the state as a "predefined plot" in the catalog. The same set of
parameters can also be used in code, with \fBdatasource.plot(...)\fP\&.
[image]
.SS Using the Selection
.sp
Once catalogs are loaded and the desired sources has been identified and selected,
the selected sources will be available at the \fB\&.sources\fP attribute (\fBintake.gui.sources\fP).
Each source entry has informational methods available and can be opened as a data source,
as with any catalog entry:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
In [ ]: source_entry = intake.gui.sources[0]
        source_entry
Out   :
name: sea_ice_origin
container: dataframe
plugin: [\(aqcsv\(aq]
description: Arctic/Antarctic Sea Ice
direct_access: forbid
user_parameters: []
metadata:
args:
  urlpath: https://timeseries.weebly.com/uploads/2/1/0/8/21086414/sea_ice.csv

In [ ]: data_source = source_entry()  # may specify parameters here
        data_source.read()
Out   : < some data >

In [ ]: source_entry.plot()  # or skip data source step
Out   : < graphics>
.ft P
.fi
.UNINDENT
.UNINDENT
.SS Catalogs
.sp
Data catalogs provide an abstraction that allows you to externally define, and optionally share, descriptions of
datasets, called \fIcatalog entries\fP\&.  A catalog entry for a dataset includes information like:
.INDENT 0.0
.IP \(bu 2
The name of the Intake driver that can load the data
.IP \(bu 2
Arguments to the \fB__init__()\fP method of the driver
.IP \(bu 2
Metadata provided by the catalog author (such as field descriptions and types, or data provenance)
.UNINDENT
.sp
In addition, Intake allows the arguments to data sources to be templated, with the variables explicitly
expressed as "user parameters". The given arguments are rendered using \fBjinja2\fP, the
values of named user parameterss, and any overrides.
The parameters are also offer validation of the allowed types and values, for both the template
values and the final arguments passed to the data source. The parameters are named and described, to
indicate to the user what they are for. This kind of structure can be used to, for example,
choose between two parts of a given data source, like "latest" and "stable", see the \fIentry1_part\fP entry in
the example below.
.sp
The user of the catalog can always override any template or argument value at the time
that they access a give source.
.SS The Catalog class
.sp
In Intake, a \fBCatalog\fP instance is an object with one or more named entries.
The entries might be read from a static file (e.g., YAML, see the next section), from
an Intake server or from any other data service that has a driver. Drivers which
create catalogs are ordinary DataSource classes, except that they have the container
type "catalog", and do not return data products via the \fBread()\fP method.
.sp
For example, you might choose to instantiate the base class and fill in some entries
explicitly in your code
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
from intake.catalog import Catalog
from intake.catalog.local import LocalCatalogEntry
mycat = Catalog.from_dict({
    \(aqsource1\(aq: LocalCatalogEntry(name, description, driver, args=...),
    ...
    })
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
Alternatively, subclasses of \fBCatalog\fP can define how entries are created from
whichever file format or service they interact with, examples including \fBRemoteCatalog\fP
and \fI\%SQLCatalog\fP\&. These generate entries based on their respective targets; some
provide advanced search capabilities executed on the server.
.SS YAML Format
.sp
Intake catalogs can most simply be described with YAML files. This is very common
in the tutorials and this documentation, because it simple to understand, but demonstrate
the many features of Intake. Note that YAML files are also the easiest way to share
a catalog, simply by copying to a publicly\-available location such as a cloud storage
bucket.
Here is an example:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
metadata:
  version: 1
  parameters:
    file_name:
      type: str
      description: default file name for child entries
      default: example_file_name
sources:
  example:
    description: test
    driver: random
    args: {}

  entry1_full:
    description: entry1 full
    metadata:
      foo: \(aqbar\(aq
      bar: [1, 2, 3]
    driver: csv
    args: # passed to the open() method
      urlpath: \(aq{{ CATALOG_DIR }}/entry1_*.csv\(aq

  entry1_part:
    description: entry1 part
    parameters: # User parameters
      part:
        description: section of the data
        type: str
        default: "stable"
        allowed: ["latest", "stable"]
    driver: csv
    args:
      urlpath: \(aq{{ CATALOG_DIR }}/entry1_{{ part }}.csv\(aq

  entry2:
    description: entry2
    driver: csv
    args:
      # file_name parameter will be inherited from file\-level parameters, so will
      # default to "example_file_name"
      urlpath: \(aq{{ CATALOG_DIR }}/entry2/{{ file_name }}.csv\(ga
.ft P
.fi
.UNINDENT
.UNINDENT
.SS Metadata
.sp
Arbitrary extra descriptive information can go into the metadata section. Some fields will be
claimed for internal use and some fields may be restricted to local reading; but for now the only
field that is expected is \fBversion\fP, which will be updated when a breaking change is made to the
file format. Any catalog will have \fB\&.metadata\fP and \fB\&.version\fP attributes available.
.sp
Note that each source also has its own metadata.
.sp
The metadata section an also contain \fBparameters\fP which will be inherited by the sources in
the file (note that these sources can augment these parameters, or override them with their own
parameters).
.SS Extra drivers
.sp
The \fBdriver:\fP entry of a data source specification can be a driver name, as has been shown in the examples so far.
It can also be an absolute class path to use for the data source, in which case there will be no ambiguity about how
to load the data. That is the the preferred way to be explicit, when the driver name alone is not enough
(see \fI\%Driver Selection\fP, below).
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
plugins:
  source:
    \- module: intake.catalog.tests.example1_source
sources:
  ...
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
However, you do not, in general, need to do this, since the \fBdriver:\fP field of
each source can also explicitly refer to the plugin class.
.SS Sources
.sp
The majority of a catalog file is composed of data sources, which are named data sets that can be loaded for the user.
Catalog authors describe the contents of data set, how to load it, and optionally offer some customization of the
returned data.  Each data source has several attributes:
.INDENT 0.0
.IP \(bu 2
\fBname\fP: The canonical name of the source.  Best practice is to compose source names from valid Python identifiers.
This allows Intake to support things like tab completion of data source names on catalog objects.
For example, \fBmonthly_downloads\fP is a good source
name.
.IP \(bu 2
\fBdescription\fP: Human readable description of the source.  To help catalog browsing tools, the description should be
Markdown.
.IP \(bu 2
\fBdriver\fP: Name of the Intake \fI\%Driver\fP to use with this source.  Must either already be installed in the current
Python environment (i.e. with conda or pip) or loaded in the \fBplugin\fP section of the file. Can be a simple
driver name like "csv" or the full path to the implementation class like "package.module.Class".
.IP \(bu 2
\fBargs\fP: Keyword arguments to the init method of the driver.  Arguments may use template expansion.
.IP \(bu 2
\fBmetadata\fP: Any metadata keys that should be attached to the data source when opened.  These will be supplemented
by additional metadata provided by the driver.  Catalog authors can use whatever key names they would like, with the
exception that keys starting with a leading underscore are reserved for future internal use by Intake.
.IP \(bu 2
\fBdirect_access\fP: Control whether the data is directly accessed by the client, or proxied through a catalog server.
See \fI\%Server Catalogs\fP for more details.
.IP \(bu 2
\fBparameters\fP: A dictionary of data source parameters.  See below for more details.
.UNINDENT
.SS Caching Source Files Locally
.sp
\fIThis method of defining the cache  with a dedicated block is deprecated, see the Remote Access
section, below\fP
.sp
To enable caching on the first read of remote data source files, add the \fBcache\fP section with the
following attributes:
.INDENT 0.0
.IP \(bu 2
\fBargkey\fP: The args section key which contains the URL(s) of the data to be cached.
.IP \(bu 2
\fBtype\fP: One of the keys in the cache registry [\fIintake.source.cache.registry\fP], referring to an implementation of caching behaviour. The default is "file" for the caching of one or more files.
.UNINDENT
.sp
Example:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
test_cache:
  description: cache a csv file from the local filesystem
  driver: csv
  cache:
    \- argkey: urlpath
      type: file
  args:
    urlpath: \(aq{{ CATALOG_DIR }}/cache_data/states.csv\(aq
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
The \fBcache_dir\fP defaults to \fB~/.intake/cache\fP, and can be specified in the intake configuration
file or \fBINTAKE_CACHE_DIR\fP
environment variable, or at runtime using the \fB"cache_dir"\fP key of the configuration.
The special value \fB"catdir"\fP implies that cached files will appear in the same directory as the
catalog file in which the data source is defined, within a directory named "intake_cache". These will
not appear in the cache usage reported by the CLI.
.sp
Optionally, the cache section can have a \fBregex\fP attribute, that modifies the path of the cache on
the disk. By default, the cache path is made by concatenating \fBcache_dir\fP, dataset name, hash of
the url, and the url itself (without the protocol). \fBregex\fP attribute allows one to remove part of the
url (the matching part).
.sp
Caching can be disabled at runtime for all sources regardless of the catalog specification:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
from intake.config import conf

conf[\(aqcache_disabled\(aq] = True
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
By default, progress bars are shown during downloads if the package \fBtqdm\fP is
available, but this can be disabled (e.g., for
consoles that don\(aqt support complex text) with
.INDENT 0.0
.INDENT 3.5
conf[\(aqcache_download_progress\(aq] = False
.UNINDENT
.UNINDENT
.sp
or, equivalently, the environment parameter \fBINTAKE_CACHE_PROGRESS\fP\&.
.sp
The "types" of caching are that supported are listed in \fBintake.source.cache.registry\fP, see
the docstrings of each for specific parameters that should appear in the cache block.
.sp
It is possible to work with compressed source files by setting \fBtype: compression\fP in the cache specification.
By default the compression type is inferred from the file extension, otherwise it can be set by assigning the \fBdecomp\fP
variable to any of the options listed in \fBintake.source.decompress.decomp\fP\&.
This will extract all the file(s) in the compressed file referenced by urlpath and store them in the cache directory.
.sp
In cases where miscellaneous files are present in the compressed file, a \fBregex_filter\fP parameter can be used. Only
the extracted filenames that match the pattern will be loaded. The cache path is appended to the filename so it is
necessary to include a wildcard to the beginning of the pattern.
.sp
Example:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
test_compressed:
  driver: csv
  args:
    urlpath: \(aqcompressed_file.tar.gz\(aq
  cache:
    \- type: compressed
      decomp: tgz
      argkey: urlpath
      regex_filter: \(aq.*data.csv\(aq
.ft P
.fi
.UNINDENT
.UNINDENT
.SS Templating
.sp
Intake catalog files support Jinja2 templating for driver arguments. Any occurrence of
a substring like \fB{{field}}\fP will be replaced by the value of the user parameters with
that same name, or the value explicitly provided by the user. For how to specify these user parameters,
see the next section.
.sp
Some additional values are available for templating. The following is always available:
\fBCATALOG_DIR\fP, the full path to the directory containing the YAML catalog file.  This is especially useful
for constructing paths relative to the catalog directory to locate data files and custom drivers.
For example, the search for CSV files for the two "entry1" blocks, above, will happen in the same directory as
where the catalog file was found.
.sp
The following functions \fImay\fP be available. Since these execute code, the user of a catalog may decide
whether they trust those functions or not.
.INDENT 0.0
.IP \(bu 2
\fBenv("USER")\fP: look in the set environment variables for the named variable
.IP \(bu 2
\fBclient_env("USER")\fP: exactly the same, except that when using a client\-server topology, the
value will come from the environment of the client.
.IP \(bu 2
\fBshell("get_login thisuser \-t")\fP: execute the command, and use the output as the value. The
output will be trimmed of any trailing whitespace.
.IP \(bu 2
\fBclient_shell("get_login thisuser \-t")\fP: exactly the same, except that when using a client\-server
topology, the value will come from the system of the client.
.UNINDENT
.sp
The reason for the "client" versions of the functions is to prevent leakage of potentially sensitive
information between client and server by controlling where lookups happen. When working without a server,
only the ones without "client" are used.
.sp
An example:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
sources:
  personal_source:
    description: This source needs your username
    args:
      url: "http://server:port/user/{{env(USER)}}"
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
Here, if the user is named "blogs", the \fBurl\fP argument will resolve to
\fB"http://server:port/user/blogs"\fP; if the environment variable is not defined, it will
resolve to \fB"http://server:port/user/"\fP
.SS Parameter Definition
.SS Source parameters
.sp
A source definition can contain a "parameters" block.
Expressed in YAML, a parameter may look as follows:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
parameters:
  name:
    description: name to use  # human\-readable text for what this parameter means
    type: str  # one of bool, str, int, float, list[str | int | float], datetime, mlist
    default: normal  # optional, value to assume if user does not override
    allowed: ["normal", "strange"]  # optional, list of values that are OK, for validation
    min: "n"  # optional, minimum allowed, for validation
    max: "t"  # optional, maximum allowed, for validation
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
A parameter, not to be confused with an \fI\%argument\fP,
can have one of two uses:
.INDENT 0.0
.IP \(bu 2
to provide values for variables to be used in templating the arguments. \fIIf\fP the pattern "{{name}}" exists in
any of the source arguments, it will be replaced by the value of the parameter. If the user provides
a value (e.g., \fBsource = cat.entry(name=\(aqsomething")\fP), that will be used, otherwise the default value. If
there is no user input or default, the empty value appropriate for type is used. The \fBdefault\fP field allows
for the same function expansion as listed for arguments, above.
.IP \(bu 2
\fIIf\fP an argument with the same name as the parameter exists, its value, after any templating, will be
coerced to the given type of the parameter and validated against the allowed/max/min. It is therefore possible
to use the string templating system (e.g., to get a value from the environment), but pass the final value as,
for example, an integer. It makes no sense to provide a default for this case (the argument already has a value),
but providing a default will not raise an exception.
.IP \(bu 2
the "mlist" type is special: it means that the input must be a list, whose values are chosen from the
allowed list. This is the only type where the parameter value is not the same type as the allowed list\(aqs
values, e.g., if a list of str is set for \fBallowed\fP, a list of str must also be the final value.
.UNINDENT
.sp
Note: the \fBdatetime\fP type accepts multiple values:
Python datetime, ISO8601 string,  Unix timestamp int, "now" and  "today".
.SS Catalog parameters
.sp
You can also define user parameters at the catalog level. This applies the parameter to
all entries within that catalog, without having to define it for each and every entry.
Furthermore, catalogs dested within the catalog will also inherit the parameter(s).
.sp
For example, with the following spec
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
metadata:
  version: 1
  parameters:
    bucket:
      type: str
      description: description
      default: test_bucket
sources:
  param_source:
    driver: parquet
    description: description
    args:
      urlpath: s3://{{bucket}}/file.parquet
  subcat:
    driver: yaml_file
    path: "{{CATALOG_DIR}}/other.yaml"
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
If \fBcat\fP is the corresponsing catalog instance,
the URL of source \fBcat.param_source\fP will evaluate to "s3://test_bucket/file.parquet" by default, but
the parameter can be overridden with \fBcat.param_source(bucket="other_bucket")\fP\&. Also, any
entries of \fBsubcat\fP, another catalog referenced from here, would also have the "bucket"\-named
parameter attached to all sources. Of course, those sources do no need to make use of the
parameter.
.sp
To change the default, we can gerenate a new instance
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
cat2 = cat(bucket="production")  # sets default value of "bucket" for cat2
subcat = cat.subcat(bucket="production")  # sets default only for the nested catalog
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
Of course, in these situations you can still override the value of the parameter for any
source, or pass explicit values for the arguments of the source, as normal.
.sp
For cases where the catalog is not defined in a YAML spec, the argument \fBuser_parameters\fP
to the constructor takes the same form as \fBparameters\fP above: a dict of user parameters,
either as \fBUserParameter\fP instances or as a dictionary spec for each one.
.SS Templating parameters
.sp
Template functions can also be used in parameters (see \fI\%Templating\fP, above), but you can use the available functions directly without the extra \fI{{...}}\fP\&.
.sp
For example, this catalog entry uses the \fBenv("HOME")\fP functionality as described to set a default based on the user\(aqs home directory.
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
sources:
  variabledefault:
    description: "This entry leads to an example csv file in the user\(aqs home directory by default, but the user can pass root="somepath" to override that."
    driver: csv
    args:
      path: "{{root}}/example.csv"
    parameters:
      root:
        description: "root path"
        type: str
        default: "env(HOME)"
.ft P
.fi
.UNINDENT
.UNINDENT
.SS Driver Selection
.sp
In some cases, it may be possible that multiple backends are capable of loading from the same data
format or service. Sometimes, this may mean two drivers with unique names, or a single driver
with a parameter to choose between the different backends.
.sp
However, it is possible that multiple drivers for reading a particular type of data
also share the same driver name: for example, both the
intake\-iris and the intake\-xarray packages contain drivers with the name \fB"netcdf"\fP, which
are capable of reading the same files, but with different backends. Here we will describe the
various possibilities of coping with this situation. Intake\(aqs plugin system makes it easy to encode such choices.
.sp
It may be
acceptable to use any driver which claims to handle that data type, or to give the option of
which driver to use to the user, or it may be necessary to specify which precise driver(s) are
appropriate for that particular data. Intake allows all of these possibilities, even if the
backend drivers require extra arguments.
.sp
Specifying a single driver explicitly, rather than using a generic name, would look like this:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
sources:
  example:
    description: test
    driver: package.module.PluginClass
    args: {}
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
It is also possible to describe a list of drivers with the same syntax. The first one
found will be the one used. Note that the class imports will only happen at data source
instantiation, i.e., when the entry is selected from the catalog.
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
sources:
  example:
    description: test
    driver:
      \- package.module.PluginClass
      \- another_package.PluginClass2
    args: {}
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
These alternative plugins can also be given data\-source specific names, allowing the
user to choose at load time with \fIdriver=\fP as a parameter. Additional arguments may also
be required for each option (which, as usual, may include user parameters); however, the
same global arguments will be passed to all of the drivers listed.
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
sources:
  example:
    description: test
    driver:
      first:
        class: package.module.PluginClass
        args:
          specific_thing: 9
      second:
        class: another_package.PluginClass2
    args: {}
.ft P
.fi
.UNINDENT
.UNINDENT
.SS Remote Access
.sp
(see also \fI\%Remote Data\fP for the implementation details)
.sp
Many drivers support reading directly from remote data sources such as HTTP, S3 or GCS. In these cases,
the path to read from is usually given with a protocol prefix such as \fBgcs://\fP\&. Additional dependencies
will typically be required (\fBrequests\fP, \fBs3fs\fP, \fBgcsfs\fP, etc.), any data package
should specify these.  Further parameters
may be necessary for communicating with the storage backend and, by convention, the driver should take
a parameter \fBstorage_options\fP containing arguments to pass to the backend. Some
remote backends may also make use of environment variables or config files to
determine their default behaviour.
.sp
The special template variable "CATALOG_DIR" may be used to construct relative URLs in the arguments to
a source. In such cases, if the filesystem used to load that catalog contained arguments, then
the \fBstorage_options\fP of that file system will be extracted and passed to the source. Therefore, all
sources which can accept general URLs (beyond just local paths) must make sure to accept this
argument.
.sp
As an example of using \fBstorage_options\fP, the following
two sources would allow for reading CSV data from S3 and GCS backends without
authentication (anonymous access), respectively
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
sources:
  s3_csv:
    driver: csv
    description: "Publicly accessible CSV data on S3; requires s3fs"
    args:
      urlpath: s3://bucket/path/*.csv
      storage_options:
        anon: true
  gcs_csv:
    driver: csv
    description: "Publicly accessible CSV data on GCS; requires gcsfs"
    args:
      urlpath: gcs://bucket/path/*.csv
      storage_options:
        token: "anon"
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
\fBUsing S3 Profiles\fP
.sp
An AWS profile may be specified as an argument under \fBstorage_options\fP via the following format:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
args:
  urlpath: s3://bucket/path/*.csv
  storage_options:
    profile: aws\-profile\-name
.ft P
.fi
.UNINDENT
.UNINDENT
.SS Caching
.sp
URLs interpreted by \fBfsspec\fP offer \fI\%automatic caching\fP\&. For example, to enable
file\-based caching for the first source above, you can do:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
sources:
  s3_csv:
    driver: csv
    description: "Publicly accessible CSV data on S3; requires s3fs"
    args:
      urlpath: simplecache::s3://bucket/path/*.csv
      storage_options:
        s3:
          anon: true
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
Here we have added the "simplecache" to the URL (this caching backend does not store any
metadata about the cached file) and specified that the "anon" parameter is
meant as an argument to s3, not to the caching mechanism. As each file in
s3 is accessed, it will first be downloaded and then the local version
used instead.
.sp
You can tailor how the caching works. In particular the location of the local
storage can be set with the \fBcache_storage\fP parameter (under the "simplecache"
group of storage_options, of course) \- otherwise they are stored in a temporary
location only for the duration of the current python session. The cache location
is particularly useful in conjunction with an environment variable, or
relative to "{{CATALOG_DIR}}", wherever the catalog was loaded from.
.sp
Please see the \fBfsspec\fP documentation for the full set of cache types and their
various options.
.SS Local Catalogs
.sp
A Catalog can be loaded from a YAML file on the local filesystem by creating a Catalog object:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
from intake import open_catalog
cat = open_catalog(\(aqcatalog.yaml\(aq)
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
Then sources can be listed:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
list(cat)
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
and data sources are loaded via their name:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
data = cat.entry_part1
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
and you can optionally configure new instances of the source to define user parameters
or override arguments by calling either of:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
data = cat.entry_part1.configure_new(part=\(aq1\(aq)
data = cat.entry_part1(part=\(aq1\(aq)  # this is a convenience shorthand
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
Intake also supports loading a catalog from all of the files ending in \fB\&.yml\fP and \fB\&.yaml\fP in a directory, or by using an
explicit glob\-string. Note that the URL provided may refer to a remote storage systems by passing a protocol
specifier such as \fBs3://\fP, \fBgcs://\fP\&.:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
cat = open_catalog(\(aq/research/my_project/catalog.d/\(aq)
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
Intake Catalog objects will automatically reload changes or new additions to catalog files and directories on disk.
These changes will not affect already\-opened data sources.
.SS Catalog Nesting
.sp
A catalog is just another type of data source for Intake. For example, you can print a YAML
specification corresponding to a catalog as follows:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
cat = intake.open_catalog(\(aqcat.yaml\(aq)
print(cat.yaml())
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
results in:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
sources:
  cat:
    args:
      path: cat.yaml
    description: \(aq\(aq
    driver: intake.catalog.local.YAMLFileCatalog
    metadata: {}
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
The \fIpoint\fP here, is that this can be included in another catalog.
(It would, of course, be better to include a description and the full path of the catalog
file here.)
If the entry above were saved to another file, "root.yaml", and the
original catalog contained an entry, \fBdata\fP, you could access it as:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
root = intake.open_catalog(\(aqroot.yaml\(aq)
root.cat.data
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
It is, therefore, possible to build up a hierarchy of catalogs referencing each other.
These can, of course, include remote URLs and indeed catalog sources other than simple files (all the
tables on a SQL server, for instance). Plus, since the argument and parameter system also
applies to entries such as the example above, it would be possible to give the user a runtime
choice of multiple catalogs to pick between, or have this decision depend on an environment
variable.
.SS Server Catalogs
.sp
Intake also includes a server which can share an Intake catalog over HTTP
(or HTTPS with the help of a TLS\-enabled reverse proxy).  From the user perspective, remote catalogs function
identically to local catalogs:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
cat = open_catalog(\(aqintake://catalog1:5000\(aq)
list(cat)
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
The difference is that operations on the catalog translate to requests sent to the catalog server.  Catalog servers
provide access to data sources in one of two modes:
.INDENT 0.0
.IP \(bu 2
Direct access: In this mode, the catalog server tells the client how to load the data, but the client uses its
local drivers to make the connection.  This requires the client has the required driver already installed \fIand\fP
has direct access to the files or data servers that the driver will connect to.
.IP \(bu 2
Proxied access: In this mode, the catalog server uses its local drivers to open the data source and stream the data
over the network to the client.  The client does not need \fIany\fP special drivers to read the data, and can read data
from files and data servers that it cannot access, as long as the catalog server has the required access.
.UNINDENT
.sp
Whether a particular catalog entry supports direct or proxied access is determined by the \fBdirect_access\fP option:
.INDENT 0.0
.IP \(bu 2
\fBforbid\fP (default): Force all clients to proxy data through the catalog server
.IP \(bu 2
\fBallow\fP: If the client has the required driver, access the source directly, otherwise proxy the data through the
catalog server.
.IP \(bu 2
\fBforce\fP: Force all clients to access the data directly.  If they do not have the required driver, an exception will
be raised.
.UNINDENT
.sp
Note that when the client is loading a data source via direct access, the catalog server will need to send the driver
arguments to the client.  Do not include sensitive credentials in a data source that allows direct access.
.SS Client Authorization Plugins
.sp
Intake servers can check if clients are authorized to access the catalog as a whole, or individual catalog entries.
Typically a matched pair of server\-side plugin (called an "auth plugin") and a client\-side plugin (called a "client
auth plugin) need to be enabled for authorization checks to work.  This feature is still in early development, but see
module \fBintake.auth.secret\fP for a demonstration pair of server and client classes implementation auth via a shared
secret. See \fI\%Authorization Plugins\fP\&.
.SS Command Line Tools
.sp
The package installs two executable commands: for starting the catalog server; and
a client for accessing catalogs and manipulating the configuration.
.SS Configuration
.sp
A file\-based configuration service is available to Intake. This file is by default
sought at the location \fB~/.intake/conf.yaml\fP, but either of the environment variables
\fBINTAKE_CONF_DIR\fP or \fBINTAKE_CONF_FILE\fP can be used to specify another directory
or file. If both are given, the latter takes priority.
.sp
At present, the configuration file might look as follows:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
auth:
  cls: "intake.auth.base.BaseAuth"
port: 5000
catalog_path:
  \- /home/myusername/special_dir
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
These are the defaults, and any parameters not specified will take the values above
.INDENT 0.0
.IP \(bu 2
the Intake Server will listen on port 5000 (this can be overridden on the command line,
see below)
.IP \(bu 2
and the auth system used will be the fully qualified class given (which, for BaseAuth,
always allows access). For further information on securing
the Intake Server, see the \fI\%Authorization Plugins\fP\&.
.UNINDENT
.sp
See \fBintake.config.defaults\fP for a full list of keys and their default values.
.SS Log Level
.sp
The logging level is configurable using Python\(aqs built\-in logging module.
.sp
The config option \fB\(aqlogging\(aq\fP holds the current level for the intake logger, and
can take values such as \fB\(aqINFO\(aq\fP or \fB\(aqDEBUG\(aq\fP\&. This can be set in the \fBconf.yaml\fP
file of the config directory (e.g., \fB~/.intake/\fP), or overridden by the environment
variable \fBINTAKE_LOG_LEVEL\fP\&.
.sp
Furthermore, the level and settings of the logger can be changed programmatically in code:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
import logging
logger = logging.getLogger(\(aqintake\(aq)
logger.setLevel(logging.DEBUG)
logget.addHandler(..)
.ft P
.fi
.UNINDENT
.UNINDENT
.SS Intake Server
.sp
The server takes one or more catalog files as input and makes them available on
port 5000 by default.
.sp
You can see the full description of the server command with:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
>>> intake\-server \-\-help

usage: intake\-server [\-h] [\-p PORT] [\-\-list\-entries] [\-\-sys\-exit\-on\-sigterm]
                     [\-\-flatten] [\-\-no\-flatten] [\-a ADDRESS]
                     FILE [FILE ...]

Intake Catalog Server

positional arguments:
  FILE                  Name of catalog YAML file

optional arguments:
  \-h, \-\-help            show this help message and exit
  \-p PORT, \-\-port PORT  port number for server to listen on
  \-\-list\-entries        list catalog entries at startup
  \-\-sys\-exit\-on\-sigterm
                        internal flag used during unit testing to ensure
                        .coverage file is written
  \-\-flatten
  \-\-no\-flatten
  \-a ADDRESS, \-\-address ADDRESS
                        address to use as a host, defaults to the address in
                        the configuration file, if provided otherwise localhost
  usage: intake\-server [\-h] [\-p PORT] [\-\-list\-entries] [\-\-sys\-exit\-on\-sigterm]
               [\-\-flatten] [\-\-no\-flatten] [\-a ADDRESS]
               FILE [FILE ...]
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
To start the server with a local catalog file, use the following:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
>>> intake\-server intake/catalog/tests/catalog1.yml
Creating catalog from:
  \- intake/catalog/tests/catalog1.yml
catalog_args [\(aqintake/catalog/tests/catalog1.yml\(aq]
Entries: entry1,entry1_part,use_example1
Listening on port 5000
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
You can use the catalog client (defined below) using:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
$ intake list intake://localhost:5000
entry1
entry1_part
use_example1
.ft P
.fi
.UNINDENT
.UNINDENT
.SS Intake Client
.sp
While the Intake data sources will typically be accessed through the Python
API, you can use the client to verify a catalog file.
.sp
Unlike the server command, the client has several subcommands to access a
catalog. You can see the list of available subcommands with:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
>>> intake \-\-help
usage: intake {list,describe,exists,get,discover} ...
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
We go into further detail in the following sections.
.SS List
.sp
This subcommand lists the names of all available catalog entries. This is
useful since other subcommands require these names.
.sp
If you wish to see the details about each catalog entry, use the \fB\-\-full\fP flag.
This is equivalent to running the \fBintake describe\fP subcommand for all catalog
entries.
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
>>> intake list \-\-help
usage: intake list [\-h] [\-\-full] URI

positional arguments:
  URI         Catalog URI

optional arguments:
  \-h, \-\-help  show this help message and exit
  \-\-full
.ft P
.fi
.UNINDENT
.UNINDENT
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
>>> intake list intake/catalog/tests/catalog1.yml
entry1
entry1_part
use_example1
>>> intake list \-\-full intake/catalog/tests/catalog1.yml
[entry1] container=dataframe
[entry1] description=entry1 full
[entry1] direct_access=forbid
[entry1] user_parameters=[]
[entry1_part] container=dataframe
[entry1_part] description=entry1 part
[entry1_part] direct_access=allow
[entry1_part] user_parameters=[{\(aqdefault\(aq: \(aq1\(aq, \(aqallowed\(aq: [\(aq1\(aq, \(aq2\(aq], \(aqtype\(aq: u\(aqstr\(aq, \(aqname\(aq: u\(aqpart\(aq, \(aqdescription\(aq: u\(aqpart of filename\(aq}]
[use_example1] container=dataframe
[use_example1] description=example1 source plugin
[use_example1] direct_access=forbid
[use_example1] user_parameters=[]
.ft P
.fi
.UNINDENT
.UNINDENT
.SS Describe
.sp
Given the name of a catalog entry, this subcommand lists the details of the
respective catalog entry.
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
>>> intake describe \-\-help
usage: intake describe [\-h] URI NAME

positional arguments:
  URI         Catalog URI
  NAME        Catalog name

optional arguments:
  \-h, \-\-help  show this help message and exit
.ft P
.fi
.UNINDENT
.UNINDENT
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
>>> intake describe intake/catalog/tests/catalog1.yml entry1
[entry1] container=dataframe
[entry1] description=entry1 full
[entry1] direct_access=forbid
[entry1] user_parameters=[]
.ft P
.fi
.UNINDENT
.UNINDENT
.SS Discover
.sp
Given the name of a catalog entry, this subcommand returns a key\-value
description of the data source. The exact details are subject to change.
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
>>> intake discover \-\-help
usage: intake discover [\-h] URI NAME

positional arguments:
  URI         Catalog URI
  NAME        Catalog name

optional arguments:
  \-h, \-\-help  show this help message and exit
.ft P
.fi
.UNINDENT
.UNINDENT
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
>>> intake discover intake/catalog/tests/catalog1.yml entry1
{\(aqnpartitions\(aq: 2, \(aqdtype\(aq: dtype([(\(aqname\(aq, \(aqO\(aq), (\(aqscore\(aq, \(aq<f8\(aq), (\(aqrank\(aq, \(aq<i8\(aq)]), \(aqshape\(aq: (None,), \(aqdatashape\(aq:None, \(aqmetadata\(aq: {\(aqfoo\(aq: \(aqbar\(aq, \(aqbar\(aq: [1, 2, 3]}}
.ft P
.fi
.UNINDENT
.UNINDENT
.SS Exists
.sp
Given the name of a catalog entry, this subcommand returns whether or not the
respective catalog entry is valid.
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
>>> intake exists \-\-help
usage: intake exists [\-h] URI NAME

positional arguments:
  URI         Catalog URI
  NAME        Catalog name

optional arguments:
  \-h, \-\-help  show this help message and exit
.ft P
.fi
.UNINDENT
.UNINDENT
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
>>> intake exists intake/catalog/tests/catalog1.yml entry1
True
>>> intake exists intake/catalog/tests/catalog1.yml entry2
False
.ft P
.fi
.UNINDENT
.UNINDENT
.SS Get
.sp
Given the name of a catalog entry, this subcommand outputs the entire data
source to standard output.
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
>>> intake get \-\-help
usage: intake get [\-h] URI NAME

positional arguments:
  URI         Catalog URI
  NAME        Catalog name

optional arguments:
  \-h, \-\-help  show this help message and exit
.ft P
.fi
.UNINDENT
.UNINDENT
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
>>> intake get intake/catalog/tests/catalog1.yml entry1
       name  score  rank
0    Alice1  100.5     1
1      Bob1   50.3     2
2  Charlie1   25.0     3
3      Eve1   25.0     3
4    Alice2  100.5     1
5      Bob2   50.3     2
6  Charlie2   25.0     3
7      Eve2   25.0     3
.ft P
.fi
.UNINDENT
.UNINDENT
.SS Config and Cache
.sp
CLI functions starting with \fBintake cache\fP and \fBintake config\fP are available to
provide information about the system: the locations and value of configuration
parameters, and the state of cached files.
.SS Persisting Data
.sp
(this is an experimental new feature, expect enhancements and changes)
.SS Introduction
.sp
As defined in the glossary, to \fI\%Persist\fP is to convert data into the storage format
most appropriate for the container type, and save a copy of this for rapid lookup in the future.
This is of great potential benefit where the creation or transfer of the original data source
takes some time.
.sp
This is not to be confused with the file \fI\%Cache\fP\&.
.SS Usage
.sp
Any \fI\%Data Source\fP has a method \fB\&.persist()\fP\&. The only option that you will need to
pick is a \fI\%TTL\fP, the number of seconds that the persisted version lasts before
expiry (leave as \fBNone\fP for no expiry). This creates a local copy in the persist
directory, which may be in \fB"~/.intake/persist\fP, but can be configured.
.sp
Each container type (dataframe, array, ...) will have its own implementation of persistence,
and a particular file storage format associated. The call to \fB\&.persist()\fP may take
arguments to tune how the local files are created, and in some cases may require additional
optional packages to be installed.
.sp
Example:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
cat = intake.open_catalog(\(aqmycat.yaml\(aq)  # load a remote cat
source = cat.csvsource()  # source pointing to remote data
source.persist()

source = cat.csvsource()  # future use now gives local intake_parquet.ParquetSource
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
To control whether a catalog will automatically give you the persisted version of a
source in this way using the argument \fBpersist_mode\fP, e.g., to ignore locally
persisted versions, you could have done:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
cat = intake.open_catalog(\(aqmycat.yaml\(aq, persist_mode=\(aqnever\(aq)
or
source = cat.csvsource(persist_mode=\(aqnever\(aq)
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
Note that if you give a TTL (in seconds), then the original source will be accessed
and a new persisted version written transparently when the old persisted version has expired.
.sp
Note that after persisting, the original source will have \fBsource.has_been_persisted == True\fP
and the persisted source (i.e., the one loaded from local files) will have
\fBsource.is_persisted == True\fP\&.
.SS Export
.sp
A similar concept to Persist, Export allows you to make a copy of some data source, in the
format appropriate for its container, and place this data\-set in whichever location suits you,
including remote locations. This functionality (\fBsource.export()\fP) does \fInot\fP touch the persist
store; instead, it returns a YAML text representation of the output, so that you can put it into
a catalog of your own. It would be this catalog that you share with other people.
.sp
Note that "exported" data\-sources like this do contain the information of the original source they
were made from in their metadata, so you can recreate the original source, if you want to, and
read from there.
.SS Persisting to Remote
.sp
If you are typically running your code inside of ephemoral containers, then persisting data\-sets may
be something that you want to do (because the original source is slow, or parsing is CPU/memory intensive),
but the local storage is not useful. In some cases you may have access to some shared network storage
mounted on the instance, but in other cases you will want to persist to a remote store.
.sp
The config value \fB\(aqpersist_path\(aq\fP, which can also be set by the environment variable
\fBINTAKE_PERSIST_PATH\fP can be a remote location such as \fBs3://mybucket/intake\-persist\fP\&. You will
need to install the appropriate package to talk to the external storage (e.g., \fBs3fs\fP, \fBgcsfs\fP,
\fBpyarrow\fP), but otherwise everything should work as before, and you can access the persisted data
from any container.
.SS The Persist Store
.sp
You can interact directly with the class implementing persistence:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
from intake.container.persist import store
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
This singleton instance, which acts like a catalog, allows you to query the contents of the
instance store and to add and remove entries. It also allows you to find the original
source for any given persisted source, and refresh the persisted version on demand.
.sp
For details on the methods of the persist store, see the API documentation:
\fI\%intake.container.persist.PersistStore()\fP\&. Sources in the store carry a lot of
information about the sources they were made from, so that they can be remade
successfully. This all appears in the source metadata.
The sources use the "token" of the original
data source as their key in the store, a value which can be found by \fBdask.base.tokenize(source)\fP
for the original source, or can be taken from the metadata of a persisted source.
.sp
Note that all of the information about persisted sources is held in a single YAML file in
the persist directory (typically \fB/persisted/cat.yaml\fP within the config directory, but
see \fBintake.config.conf[\(aqpersist_path\(aq]\fP). This file can be edited by hand if you wanted to,
for example, set some persisted source not to expire. This is only recommended for experts.
.SS Future Enhancements
.INDENT 0.0
.IP \(bu 2
CLI functionality to investigate and alter the state of the persist store.
.IP \(bu 2
Time check\-pointing of persisted data, such that you can not only get the "most recent" but
any version in the time\-series.
.IP \(bu 2
(eventually) pipeline functionality, whereby a persisted data source depends on another
persisted data source, and the whole train can be refreshed on a schedule or on demand.
.UNINDENT
.SS Plotting
.sp
Intake provides a plotting API based on the \fI\%hvPlot\fP library, which
closely mirrors the pandas plotting API but generates interactive plots using \fI\%HoloViews\fP
and \fI\%Bokeh\fP\&.
.sp
The \fI\%hvPlot website\fP provides comprehensive documentation on using the
plotting API to quickly visualize and explore small and large datasets. The main features offered by the plotting API
include:
.INDENT 0.0
.INDENT 3.5
.INDENT 0.0
.IP \(bu 2
Support for tabular data stored in pandas and dask dataframes
.IP \(bu 2
Support for gridded data stored in xarray backed nD\-arrays
.IP \(bu 2
Support for plotting large datasets with \fI\%datashader\fP
.UNINDENT
.UNINDENT
.UNINDENT
.sp
Using Intake alongside hvPlot allows declaratively persisting plot declarations and default options in the regular
catalog.yaml files.
.SS Setup
.sp
For detailed installation instructions see the
\fI\%getting started section\fP in the hvPlot documentation.
To start with install hvplot using conda:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
conda install \-c conda\-forge hvplot
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
or using pip:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
pip install hvplot
.ft P
.fi
.UNINDENT
.UNINDENT
.SS Usage
.sp
The plotting API is designed to work well in and outside the Jupyter notebook, however when using it in JupyterLab
the PyViz lab extension must be installed first:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
jupyter labextension install @pyviz/jupyterlab_pyviz
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
For detailed instructions on displaying plots in the notebook and from the Python command prompt see the
\fI\%hvPlot user guide\fP\&.
.SS Python Command Prompt & Scripts
.sp
Assuming the US Crime dataset has been installed (in the
\fI\%intake\-examples repo\fP, or from
conda with \fIconda install \-c intake us_crime\fP):
.sp
Once installed the plot API can be used, by using the \fB\&.plot\fP method on an intake \fBDataSource\fP:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
import intake
import hvplot as hp

crime = intake.cat.us_crime
columns = [\(aqBurglary rate\(aq, \(aqLarceny\-theft rate\(aq, \(aqRobbery rate\(aq, \(aqViolent Crime rate\(aq]

violin = crime.plot.violin(y=columns, group_label=\(aqType of crime\(aq,
                           value_label=\(aqRate per 100k\(aq, invert=True)
hp.show(violin)
.ft P
.fi
.UNINDENT
.UNINDENT
[image]
.SS Notebook
.sp
Inside the notebook plots will display themselves, however the notebook extension must be loaded first. The
extension may be loaded by importing \fBhvplot.intake\fP module or explicitly loading the holoviews extension,
or by calling \fBintake.output_notebook()\fP:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
# To load the extension run this import
import hvplot.intake

# Or load the holoviews extension directly
import holoviews as hv
hv.extension(\(aqbokeh\(aq)

# convenience function
import intake
intake.output_notebook()

crime = intake.cat.us_crime
columns = [\(aqViolent Crime rate\(aq, \(aqRobbery rate\(aq, \(aqBurglary rate\(aq]
crime.plot(x=\(aqYear\(aq, y=columns, value_label=\(aqRate (per 100k people)\(aq)
.ft P
.fi
.UNINDENT
.UNINDENT
.SS Predefined Plots
.sp
Some catalogs will define plots appropriate to a specific data source. These will be specified
such that the user gets the right view with the right columns and labels, without having to investigate
the data in detail \-\- this is ideal for quick\-look plotting when browsing sources.
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
import intake
intake.us_crime.plots
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
Returns \fI[\(aqexample\(aq]\fP\&. This works whether accessing the entry object or the source instance. To visualise
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
intake.us_crime.plot.example()
.ft P
.fi
.UNINDENT
.UNINDENT
.SS Persisting metadata
.sp
Intake allows catalog yaml files to declare metadata fields for each data source which are made available alongside
the actual dataset. The plotting API reserves certain fields to define default plot options, to label and annotate
the data fields in a dataset and to declare pre\-defined plots.
.SS Declaring defaults
.sp
The first set of metadata used by the plotting API is the \fIplot\fP field in the metadata section. Any options found in
the metadata field will apply to all plots generated from that data source, allowing the definition of plotting
defaults. For example when plotting a fairly large dataset such as the NYC Taxi data, it might be desirable to enable
datashader by default ensuring that any plot that supports it is datashaded. The syntax to declare default plot options
is as follows:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
sources:
  nyc_taxi:
    description: NYC Taxi dataset
    driver: parquet
    args:
      urlpath: \(aqs3://datashader\-data/nyc_taxi_wide.parq\(aq
    metadata:
      plot:
        datashade: true
.ft P
.fi
.UNINDENT
.UNINDENT
.SS Declaring data fields
.sp
The columns of a CSV or parquet file or the coordinates and data variables in a NetCDF file often have shortened, or
cryptic names with underscores. They also do not provide additional information about the units of the data or the
range of values, therefore the catalog yaml specification also provides the ability to define additional information
about the \fIfields\fP in a dataset.
.sp
Valid attributes that may be defined for the data \fIfields\fP include:
.INDENT 0.0
.IP \(bu 2
\fIlabel\fP: A readable label for the field which will be used to label axes and widgets
.IP \(bu 2
\fIunit\fP: A unit associated with the values inside a data field
.IP \(bu 2
\fIrange\fP: A range associated with a field declaring limits which will override those computed from the data
.UNINDENT
.sp
Just like the default plot options the \fIfields\fP may be declared under the metadata section of a data source:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
sources:
  nyc_taxi:
    description: NYC Taxi dataset
    driver: parquet
    args:
      urlpath: \(aqs3://datashader\-data/nyc_taxi_wide.parq\(aq
    metadata:
      fields:
        dropoff_x:
          label: Longitude
        dropoff_y:
          label: Latitude
        total_fare:
          label: Fare
          unit: $
.ft P
.fi
.UNINDENT
.UNINDENT
.SS Declaring custom plots
.sp
As shown in the \fI\%hvPlot user guide\fP, the plotting API
provides a variety of plot types, which can be declared using the \fIkind\fP argument or via convenience methods on the
plotting API, e.g. \fIcat.source.plot.scatter()\fP\&. In addition to declaring default plot options and field metadata data
sources may also declare custom plot, which will be made available as methods on the plotting API. In this way a
catalogue may declare any number of custom plots alongside a datasource.
.sp
To make this more concrete consider the following custom plot declaration on the \fIplots\fP field in the metadata section:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
sources:
  nyc_taxi:
    description: NYC Taxi dataset
    driver: parquet
    args:
      urlpath: \(aqs3://datashader\-data/nyc_taxi_wide.parq\(aq
    metadata:
      plots:
        dropoff_scatter:
          kind: scatter
          x: dropoff_x
          y: dropoff_y
          datashade: True
          width: 800
          height: 600
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
This declarative specification creates a new custom plot called \fIdropoff_scatter\fP, which will be available on the
catalog under \fIcat.nyc_taxi.plot.dropoff_scatter()\fP\&. Calling this method on the plot API will automatically generate a
datashaded scatter plot of the dropoff locations in the NYC taxi dataset.
.sp
Of course the three metadata fields may also be used together, declaring global defaults under the \fIplot\fP field,
annotations for the data \fIfields\fP under the \fIfields\fP key and custom plots via the \fIplots\fP field.
.SS Plugin Directory
.sp
This is a list of known projects which install driver plugins for Intake, and the named drivers each
contains in parentheses:
.INDENT 0.0
.IP \(bu 2
builtin to Intake (\fBcatalog\fP, \fBcsv\fP, \fBintake_remote\fP, \fBndzarr\fP,
\fBnumpy\fP, \fBtextfiles\fP, \fByaml_file_cat\fP, \fByaml_files_cat\fP, \fBzarr_cat\fP,
\fBjson\fP, \fBjsonl\fP)
.IP \(bu 2
\fI\%intake\-astro\fP Table and array loading of FITS astronomical data (\fBfits_array\fP, \fBfits_table\fP)
.IP \(bu 2
\fI\%intake\-accumulo\fP Apache Accumulo clustered data storage (\fBaccumulo\fP)
.IP \(bu 2
\fI\%intake\-avro\fP: Apache Avro data serialization format (\fBavro_table\fP, \fBavro_sequence\fP)
.IP \(bu 2
\fI\%intake\-bluesky\fP: search and retrieve data in the \fI\%bluesky\fP data model
.IP \(bu 2
\fI\%intake\-dcat\fP Browse and load data from \fI\%DCAT\fP catalogs. (\fBdcat\fP)
.IP \(bu 2
\fI\%intake\-dynamodb\fP link to Amazon DynamoDB (\fBdynamodb\fP)
.IP \(bu 2
\fI\%intake\-elasticsearch\fP: Elasticsearch search and analytics engine (\fBelasticsearch_seq\fP, \fBelasticsearch_table\fP)
.IP \(bu 2
\fI\%intake\-esm\fP:  Plugin for building and loading intake catalogs for earth system data sets holdings, such as \fI\%CMIP\fP (Coupled Model Intercomparison Project) and CESM Large Ensemble datasets.
.IP \(bu 2
\fI\%intake\-geopandas\fP: load from ESRI Shape Files, GeoJSON, and geospatial databases with geopandas (\fBgeojson\fP, \fBpostgis\fP, \fBshapefile\fP, \fBspatialite\fP) and \fBregionmask\fP for opening shapefiles into \fI\%regionmask\fP\&.
.IP \(bu 2
\fI\%intake\-google\-analytics\fP: run Google Analytics queries and load data as a DataFrame (\fBgoogle_analytics_query\fP)
.IP \(bu 2
\fI\%intake\-hbase\fP: Apache HBase database (\fBhbase\fP)
.IP \(bu 2
\fI\%intake\-iris\fP load netCDF and GRIB files with IRIS (\fBgrib\fP, \fBnetcdf\fP)
.IP \(bu 2
\fI\%intake\-metabase\fP: Generate catalogs and load tables as DataFrames from Metabase (\fBmetabase_catalog\fP, \fBmetabase_table\fP)
.IP \(bu 2
\fI\%intake\-mongo\fP: MongoDB noSQL query (\fBmongo\fP)
.IP \(bu 2
\fI\%intake\-nested\-yaml\-catalog\fP: Plugin supporting a single YAML hierarchical catalog to organize datasets and avoid a data swamp. (\fBnested_yaml_cat\fP)
.IP \(bu 2
\fI\%intake\-netflow\fP: Netflow packet format (\fBnetflow\fP)
.IP \(bu 2
\fI\%intake\-notebook\fP: Experimental plugin to access parameterised notebooks through intake and executed via papermill (\fBipynb\fP)
.IP \(bu 2
\fI\%intake\-odbc\fP: ODBC database (\fBodbc\fP)
.IP \(bu 2
\fI\%intake\-parquet\fP: Apache Parquet file format (\fBparquet\fP)
.IP \(bu 2
\fI\%intake\-pattern\-catalog\fP: Plugin for specifying a file\-path pattern which can represent a number of different entries (\fBpattern_cat\fP)
.IP \(bu 2
\fI\%intake\-pcap\fP: PCAP network packet format (\fBpcap\fP)
.IP \(bu 2
\fI\%intake\-postgres\fP: PostgreSQL database (\fBpostgres\fP)
.IP \(bu 2
\fI\%intake\-s3\-manifests\fP (\fBs3_manifest\fP)
.IP \(bu 2
\fI\%intake\-salesforce\fP: Generate catalogs and load tables as DataFrames from Salesforce (\fBsalesforce_catalog\fP, \fBsalesforce_table\fP)
.IP \(bu 2
\fI\%intake\-sklearn\fP: Load scikit\-learn models from Pickle files (\fBsklearn\fP)
.IP \(bu 2
\fI\%intake\-solr\fP: Apache Solr search platform (\fBsolr\fP)
.IP \(bu 2
\fI\%intake\-stac\fP: Intake Driver for \fI\%SpatioTemporal Asset Catalogs (STAC)\fP\&.
.IP \(bu 2
\fI\%intake\-stripe\fP: Generate catalogs and load tables as DataFrames from Stripe (\fBstripe_catalog\fP, \fBstripe_table\fP)
.IP \(bu 2
\fI\%intake\-spark\fP: data processed by Apache Spark (\fBspark_cat\fP, \fBspark_rdd\fP, \fBspark_dataframe\fP)
.IP \(bu 2
\fI\%intake\-sql\fP: Generic SQL queries via SQLAlchemy (\fBsql_cat\fP, \fBsql\fP, \fBsql_auto\fP, \fBsql_manual\fP)
.IP \(bu 2
\fI\%intake\-sqlite\fP: Local caching of remote SQLite DBs and queries via SQLAlchemy (\fBsqlite_cat\fP, \fBsqlite\fP, \fBsqlite_auto\fP, \fBsqlite_manual\fP)
.IP \(bu 2
\fI\%intake\-splunk\fP: Splunk machine data query (\fBsplunk\fP)
.IP \(bu 2
\fI\%intake\-streamz\fP: real\-time event processing using Streamz (\fBstreamz\fP)
.IP \(bu 2
\fI\%intake\-thredds\fP: Intake interface to THREDDS data catalogs (\fBthredds_cat\fP, \fBthredds_merged_source\fP)
.IP \(bu 2
\fI\%intake\-xarray\fP: load netCDF, Zarr and other multi\-dimensional data (\fBxarray_image\fP, \fBnetcdf\fP, \fBgrib\fP, \fBopendap\fP, \fBrasterio\fP, \fBremote\-xarray\fP, \fBzarr\fP)
.UNINDENT
.sp
The status of these projects is available at \fI\%Status Dashboard\fP\&.
.sp
Don\(aqt see your favorite format?  See \fI\%Making Drivers\fP for how to create new plugins.
.sp
Note that if you want your plugin listed here, open an issue in the \fI\%Intake
issue repository\fP and add an entry to the
\fI\%status dashboard repository\fP\&. We also have a
\fI\%plugin wishlist Github issue\fP
that shows the breadth of plugins we hope to see for Intake.
.SS Server Protocol
.sp
This page gives deeper details on how the Intake \fI\%server\fP is implemented. For those
simply wishing to run and configure a server, see the \fI\%Command Line Tools\fP section.
.sp
Communication between the intake client and server happens exclusively over HTTP, with all
parameters passed using msgpack UTF8 encoding. The
server side is implemented by the module \fBintake.cli.server\fP\&. Currently, only the following
two routes are available:
.INDENT 0.0
.INDENT 3.5
.INDENT 0.0
.IP \(bu 2
\fBhttp://server:port/v1/info\fP
.IP \(bu 2
\fBhttp://server:port/v1/source\fP\&.
.UNINDENT
.UNINDENT
.UNINDENT
.sp
The server may be configured to use auth services, which, when passed the header of the incoming
call, can determine whether the given request is allowed. See \fI\%Authorization Plugins\fP\&.
.SS GET /info
.sp
Retrieve information about the data\-sets available on this server. The list of data\-sets may be
paginated, in order to avoid excessively long transactions. Notice that the catalog for which a listing
is being requested can itself be a data\-source (when \fBsource\-id\fP is passed) \- this is how nested
sub\-catalogs are handled on the server.
.SS Parameters
.INDENT 0.0
.IP \(bu 2
\fBpage_size\fP, int or none (optional): to enable pagination, set this value. The number of entries returned
will be this value at most. If None, returns all entries. This is passed as a query parameter.
.IP \(bu 2
\fBpage_offset\fP, int (optional): when paginating, start the list from this numerical offset. The order of entries
is guaranteed if the base catalog has not changed. This is passed as a query parameter.
.IP \(bu 2
\fBsource\-id\fP, uuid string (optional): when the catalog being accessed is not the route catalog, but an open data\-source
on the server, this is its unique identifier. See \fBPOST /source\fP for how these IDs are generated.
If the catalog being accessed is the root Catalog, this parameter should be omitted. This is passed as an HTTP header.
.UNINDENT
.SS Returns
.INDENT 0.0
.IP \(bu 2
\fBversion\fP, string: the server\(aqs Intake version
.IP \(bu 2
\fBsources\fP, list of objects: the main payload, where each object contains a \fBname\fP, and the result of calling
\fB\&.describe()\fP on the corresponding data\-source, i.e., the container type, description, metadata.
.IP \(bu 2
\fBmetadata\fP, object: any metadata associated with the whole catalog
.UNINDENT
.SS GET /source
.sp
Fetch information about a specific source. This is the random\-access variant of the \fBGET /info\fP route, by which
a particular data\-source can be accessed without paginating through all of the sources.
.SS Parameters
.INDENT 0.0
.IP \(bu 2
\fBname\fP, string (required): the data source name being accessed, one of the members of the catalog. This is passed as a query parameter.
.IP \(bu 2
\fBsource\-id\fP, uuid string (optional): when the catalog being accessed is not the root catalog, but an open data\-source
on the server, this is its unique identifier. See \fBPOST /source\fP for how these IDs are generated.
If the catalog being accessed is the root Catalog, this parameter should be omitted. This is passed as an HTTP header.
.UNINDENT
.SS Returns
.sp
Same as one of the entries in \fBsources\fP for \fBGET /info\fP: the result of \fB\&.describe()\fP on the given data\-source in the
server
.SS POST /source, action="search"
.sp
Searching a Catalog returns search results in the form of a new Catalog. This
"results" Catalog is cached on the server the same as any other Catalog.
.SS Parameters
.INDENT 0.0
.IP \(bu 2
\fBsource\-id\fP, uuid string (optional): When the catalog being searched is not
the root catalog, but a subcatalog on the server, this is its unique
identifier. If the catalog being searched is the root Catalog, this parameter
should be omitted. This is passed as an HTTP header.
.IP \(bu 2
\fBquery\fP: tuple of \fB(args, kwargs)\fP: These will be unpacked into
\fBCatalog.search\fP on the server to create the "results" Catalog. This is passed in the body of the message.
.UNINDENT
.SS Returns
.INDENT 0.0
.IP \(bu 2
\fBsource_id\fP, uuid string: the identifier of the results Catalog in the
server\(aqs source cache
.UNINDENT
.SS POST /source, action="open"
.sp
This is a more involved processing of a data\-source, and, if successful, returns one of two possible scenarios:
.INDENT 0.0
.IP \(bu 2
direct\-access, in which all the details required for reading the data directly from the client are passed, and
the client then creates a local copy of the data source and needs no further involvement from the server in order
to fetch the data
.IP \(bu 2
remote\-access, in which the client is unable or unwilling to create a local version of the data\-source, and instead
created a remote data\-source which will fetch the data for each partition from the server.
.UNINDENT
.sp
The set of parameters supplied and the server/client policies will define which method of access is employed. In the
case of remote\-access, the data source is instantiated on the server, and \fB\&.discover()\fP run on it. The resulting
information is passed back, and must be enough to instantiate a subclass of \fBintake.container.base.RemoteSource\fP
appropriate for the container of the data\-set in question (e.g., \fBRemoteArray\fP when \fBcontainer="ndarray"\fP).
In this case, the response also includes a UUID string for the open instance on the server, referencing the
cache of open sources maintained by the server.
.sp
Note that "opening" a data entry which is itself is a catalog implies instantiating that catalog object on the
server and returning its UUID, such that a listing can be made using \fBGET/ info\fP or \fBGET /source\fP\&.
.SS Parameters
.INDENT 0.0
.IP \(bu 2
\fBname\fP, string (required): the data source name being accessed, one of the members of the catalog. This is passed in the body of the request.
.IP \(bu 2
\fBsource\-id\fP, uuid string (optional): when the catalog being accessed is not the root catalog, but an open data\-source
on the server, this is its unique identifier. If the catalog being accessed is the root Catalog, this parameter should be omitted. This
is passed as an HTTP header.
.IP \(bu 2
\fBavailable_plugins\fP, list of string (optional): the set of named data drivers supported by the client. If the driver required
by the data\-source is not supported by the client, then the source must be opened remote\-access. This is passed in the body of the request.
.IP \(bu 2
\fBparameters\fP, object (optional): user parameters to pass to the data\-source when instantiating. Whether or not direct\-access
is possible may, in principle, depend on these parameters, but this is unlikely. Note that some parameter default
value functions are designed to be evaluated on the server, which may have access to, for example, some credentials
service (see \fI\%Parameter Definition\fP). This is passed in the body of the request.
.UNINDENT
.SS Returns
.sp
If direct\-access, the driver plugin name and set of arguments for instantiating the data\-soruce in the client.
.sp
If remote\-access, the data\-source container, schema and source\-ID so that further reads can be made from the
server.
.SS POST /source, action="read"
.sp
This route fetches data from the server once a data\-source has been opened in remote\-access mode.
.SS Parameters
.INDENT 0.0
.IP \(bu 2
\fBsource\-id\fP, uuid string (required): the identifier of the data\-source in the server\(aqs source cache. This is returned
when \fBaction="open"\fP\&. This is passed in the body of the request.
.IP \(bu 2
\fBpartition\fP, int or tuple (optional, but necessary for some sources): section/chunk of the data to fetch.
In cases where the data\-source is partitioned,
the client will fetch the data one partition at a time, so that it will appear partitioned in the same manner on
the client side for iteration of passing to Dask. Some data\-sources do not support partitioning, and then this
parameter is not required/ignored. This is passed in the body of the request.
.IP \(bu 2
\fBaccepted_formats\fP, \fBaccepted_compression\fP, list of strings (required): to specify how serialization of data happens. This
is an expert feature, see docs in the module \fBintake.container.serializer\fP\&. This is passed in the body of the request.
.UNINDENT
.SS Dataset Transforms
.sp
aka. derived datasets.
.sp
\fBWARNING:\fP
.INDENT 0.0
.INDENT 3.5
experimental feature, the API may change. The data sources in
\fBintake.source.derived\fP are not yet declared as top\-level
named drivers in the package entrypoints.
.UNINDENT
.UNINDENT
.sp
Intake allows for the definition of data sources which take as their input
another source in the same directory, so that you have the opportunity to
present \fIprocessing\fP to the user of the catalog.
.sp
The "target" or a derived data source will normally be a string. In the
simple case, it is the name of a data source in the same catalog. However,
we use the syntax "catalog:source" to refer to sources in other catalogs,
where the part before ":" will be passed to \fI\%intake.open_catalog()\fP,
together with any keyword arguments from \fBcat_kwargs\fP\&.
.sp
This can be done by defining classes which inherit from
\fBintake.source.derived.DerivedSource\fP, or using one of the pre\-defined classes
in the same module, which usually need to be passed a reference to a function
in a python module. We will demonstrate both.
.SS Example
.sp
Consider the following \fItarget\fP dataset, which loads some simple facts
about US states from a CSV file. This  example is taken from the Intake
test suite.
.sp
We now show two ways to apply a super\-simple transform to this data,
which selects two of the dataframe\(aqs columns.
.SS Class Example
.sp
The first version uses an approach in which the transform is derived in a
data source class, and the parameters passed are specific to the transform type.
Note that the driver is referred to by it\(aqs fully\-qualified name in the
Intake package.
.sp
The source class for this is included in the Intake codebase, but the important
part is:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
class Columns(DataFrameTransform):
    ...

    def pick_columns(self, df):
        return df[self._params["columns"]]
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
We see that this specific class inherits from \fBDataFrameTransform\fP,
with \fBtransform=self.pick_columns\fP\&. We know
that the inputs and outputs are both dataframes. This allows for some additional validation
and an automated way to infer the output dataframe\(aqs schema that reduces the number of line
of code required.
.sp
The given method does exactly what you might imagine: it takes and input dataframe and
applies a column selection to it.
.sp
Running \fBcat.derive_cols.read()\fP will indeed, as expected, produce a version of the data
with only the selected columns included. It does this by defining the original dataset,
appying the selection, and then getting Dask to generate the output. For some datasets,
this can mean that the selection is pushed down to the reader, and the data for the dropped
columns is never loaded. The user may choose to do \fB\&.to_dask()\fP instead, and manipulate
the lazy dataframe directly, before loading.
.SS Functional Example
.sp
This second version of the same output uses the more generic and flexible
\fBintake.source.derived.DataFrameTransform\fP\&.
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
derive_cols_func:
  driver: intake.source.derived.DataFrameTransform
  args:
    targets:
      \- input_data
    transform: "intake.source.tests.test_derived._pick_columns"
    transform_kwargs:
      columns: ["state", "slug"]
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
In this case, we pass a reference to a \fIfunction\fP defined in the Intake test suite.
Normally this would be declared in user modules, where perhaps those declarations
and catalog(s) are distributed together as a package.
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
def _pick_columns(df, columns):
    return df[columns]
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
This is, of course, very similar to the method shown in the previous section,
and again applies the selection in the given named argument to the input. Note that
Intake does not support including actual code in your catalog, since we would not
want to allow arbitrary execution of code on catalog load, as opposed to execution.
.sp
Loading this data source proceeds exactly the same way as the class\-based approach,
above. Both Dask and in\-memory (Pandas, via \fB\&.read()\fP) methods work as expected.
The declaration in YAML, above, is slightly more verbose, but the amount of
code is smaller. This demonstrates a tradeoff between flexibility and concision. If
there were validation code to add for the arguments or input dataset, it would be
less obvious where to put these things.
.SS Barebone Example
.sp
The previous two examples both did dataframe to dataframe transforms. However, totally
arbitrary computations are possible. Consider the following:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
barebones:
  driver: intake.source.derived.GenericTransform
  args:
    targets:
      \- input_data
    transform: builtins.len
    transform_kwargs: {}
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
This applies \fBlen\fP to the input dataframe. \fBcat.barebones.describe()\fP gives
the output container type as "other", i.e., not specified. The result of \fBread()\fP
on this gives the single number 50, the number of rows in the input data. This class,
and \fBDerivedDataSource\fP and included with the intent as superclasses, and probably
will not be used directly often.
.SS Execution engine
.sp
None of the above examples specified explicitly where the compute implied by the
transformation will take place. However, most Intake drivers support in\-memory containers
and Dask; remembering that the input dataset here is a dataframe. However, the behaviour
is defined in the driver class itself \- so it would be fine to write a driver in which
we make different assumptions. Let\(aqs suppose, for instance, that the original source
is to be loaded from \fBspark\fP (see the \fBintake\-spark\fP package), the driver could
explicitly call \fB\&.to_spark\fP on the original source, and be assured that it has a
Spark object to work with. It should, of course, explain in its documentation what
assumptions are being made and that, presumably, the user is expected to also call
\fB\&.to_spark\fP if they wished to directly manipulate the spark object.
.SS Plugin examples
.INDENT 0.0
.INDENT 3.5
.INDENT 0.0
.IP \(bu 2
call \fI\&.sel\fP on xarray datasets \fI\%xarray\-plugin\-transform\fP
.UNINDENT
.UNINDENT
.UNINDENT
.SS API
.TS
center;
|l|l|.
_
T{
\fI\%intake.source.derived.DerivedSource\fP(*args, ...)
T}	T{
Base source deriving from another source in the same catalog
T}
_
T{
\fI\%intake.source.derived.GenericTransform\fP(...)
T}	T{
T}
_
T{
\fI\%intake.source.derived.DataFrameTransform\fP(...)
T}	T{
Transform where the input and output are both Dask\-compatible dataframes
T}
_
T{
\fI\%intake.source.derived.Columns\fP(*args, **kwargs)
T}	T{
Simple dataframe transform to pick columns
T}
_
.TE
.INDENT 0.0
.TP
.B class  intake.source.derived.DerivedSource(*args, **kwargs)
Base source deriving from another source in the same catalog
.sp
Target picking and parameter validation are performed here, but
you probably want to subclass from one of the more specific
classes like \fBDataFrameTransform\fP\&.
.INDENT 7.0
.TP
.B __init__(targets, target_chooser=<function first>, target_kwargs=None, cat_kwargs=None, container=None, metadata=None, **kwargs)
.INDENT 7.0
.TP
.B Parameters
.INDENT 7.0
.TP
\fBtargets: list of string or DataSources\fP
If string(s), refer to entries of the same catalog as this Source
.TP
\fBtarget_chooser: function to choose between targets\fP
function(targets, cat) \-> source, or a fully\-qualified dotted string pointing
to it
.TP
\fBtarget_kwargs: dict of dict with keys matching items of targets\fP
.TP
\fBcat_kwargs: to pass to intake.open_catalog, if the target is in\fP
another catalog
.TP
\fBcontainer: str (optional)\fP
Assumed output container, if known/different from input
.TP
\fB[Note: the exact form of target_kwargs and cat_kwargs may be\fP
.TP
\fBsubject to change]\fP
.UNINDENT
.UNINDENT
.UNINDENT
.UNINDENT
.INDENT 0.0
.TP
.B class  intake.source.derived.GenericTransform(*args, **kwargs)
.INDENT 7.0
.TP
.B __init__(targets, target_chooser=<function first>, target_kwargs=None, cat_kwargs=None, container=None, metadata=None, **kwargs)
.INDENT 7.0
.TP
.B Parameters
.INDENT 7.0
.TP
\fBtargets: list of string or DataSources\fP
If string(s), refer to entries of the same catalog as this Source
.TP
\fBtarget_chooser: function to choose between targets\fP
function(targets, cat) \-> source, or a fully\-qualified dotted string pointing
to it
.TP
\fBtarget_kwargs: dict of dict with keys matching items of targets\fP
.TP
\fBcat_kwargs: to pass to intake.open_catalog, if the target is in\fP
another catalog
.TP
\fBcontainer: str (optional)\fP
Assumed output container, if known/different from input
.TP
\fB[Note: the exact form of target_kwargs and cat_kwargs may be\fP
.TP
\fBsubject to change]\fP
.UNINDENT
.UNINDENT
.UNINDENT
.UNINDENT
.INDENT 0.0
.TP
.B class  intake.source.derived.DataFrameTransform(*args, **kwargs)
Transform where the input and output are both Dask\-compatible dataframes
.sp
This derives from GenericTransform, and you must supply \fBtransform\fP and
any \fBtransform_kwargs\fP\&.
.INDENT 7.0
.TP
.B __init__(targets, target_chooser=<function first>, target_kwargs=None, cat_kwargs=None, container=None, metadata=None, **kwargs)
.INDENT 7.0
.TP
.B Parameters
.INDENT 7.0
.TP
\fBtargets: list of string or DataSources\fP
If string(s), refer to entries of the same catalog as this Source
.TP
\fBtarget_chooser: function to choose between targets\fP
function(targets, cat) \-> source, or a fully\-qualified dotted string pointing
to it
.TP
\fBtarget_kwargs: dict of dict with keys matching items of targets\fP
.TP
\fBcat_kwargs: to pass to intake.open_catalog, if the target is in\fP
another catalog
.TP
\fBcontainer: str (optional)\fP
Assumed output container, if known/different from input
.TP
\fB[Note: the exact form of target_kwargs and cat_kwargs may be\fP
.TP
\fBsubject to change]\fP
.UNINDENT
.UNINDENT
.UNINDENT
.UNINDENT
.INDENT 0.0
.TP
.B class  intake.source.derived.Columns(*args, **kwargs)
Simple dataframe transform to pick columns
.sp
Given as an example of how to make a specific dataframe transform.
Note that you could use DataFrameTransform directly, by writing a
function to choose the columns instead of a method as here.
.INDENT 7.0
.TP
.B __init__(columns, **kwargs)
.INDENT 7.0
.TP
.B columns: list of labels (usually str) or slice
Columns to choose from the target dataframe
.UNINDENT
.UNINDENT
.UNINDENT
.SH REFERENCE
.SS API
.sp
Auto\-generated reference
.SS End User
.sp
These are reference class and function definitions likely to be useful to everyone.
.TS
center;
|l|l|.
_
T{
\fI\%intake.open_catalog\fP([uri])
T}	T{
Create a Catalog object
T}
_
T{
\fI\%intake.registry\fP
T}	T{
Dict of driver: DataSource class
T}
_
T{
\fBintake.register_driver\fP(name, value[, ...])
T}	T{
Add runtime driver definition
T}
_
T{
\fI\%intake.upload\fP(data, path, **kwargs)
T}	T{
Given a concrete data object, store it at given location return Source
T}
_
T{
\fI\%intake.source.csv.CSVSource\fP(*args, **kwargs)
T}	T{
Read CSV files into dataframes
T}
_
T{
\fI\%intake.source.textfiles.TextFilesSource\fP(...)
T}	T{
Read textfiles as sequence of lines
T}
_
T{
\fI\%intake.source.jsonfiles.JSONFileSource\fP(...)
T}	T{
Read JSON files as a single dictionary or list
T}
_
T{
\fI\%intake.source.jsonfiles.JSONLinesFileSource\fP(...)
T}	T{
Read a JSONL (\fI\%https://jsonlines.org/\fP) file and return a list of objects, each being valid json object (e.g.
T}
_
T{
\fI\%intake.source.npy.NPySource\fP(*args, **kwargs)
T}	T{
Read numpy binary files into an array
T}
_
T{
\fI\%intake.source.zarr.ZarrArraySource\fP(*args, ...)
T}	T{
Read Zarr format files into an array
T}
_
T{
\fI\%intake.catalog.local.YAMLFileCatalog\fP(*args, ...)
T}	T{
Catalog as described by a single YAML file
T}
_
T{
\fI\%intake.catalog.local.YAMLFilesCatalog\fP(*args, ...)
T}	T{
Catalog as described by a multiple YAML files
T}
_
T{
\fI\%intake.catalog.zarr.ZarrGroupCatalog\fP(*args, ...)
T}	T{
A catalog of the members of a Zarr group.
T}
_
.TE
.INDENT 0.0
.TP
.B intake.open_catalog(uri=None, **kwargs)
Create a Catalog object
.sp
Can load YAML catalog files, connect to an intake server, or create any
arbitrary Catalog subclass instance. In the general case, the user should
supply \fBdriver=\fP with a value from the plugins registry which has a
container type of catalog. File locations can generally be remote, if
specifying a URL protocol.
.sp
The default behaviour if not specifying the driver is as follows:
.INDENT 7.0
.IP \(bu 2
if \fBuri\fP is a a single string ending in "yml" or "yaml", open it as a
catalog file
.IP \(bu 2
if \fBuri\fP is a list of strings, a string containing a glob character
("*") or a string not ending in "y(a)ml", open as a set of catalog
files. In the latter case, assume it is a directory.
.IP \(bu 2
if \fBuri\fP beings with protocol \fB"intake:"\fP, connect to a remote
Intake server
.IP \(bu 2
if \fBuri\fP is \fBNone\fP or missing, create a base Catalog object without entries.
.UNINDENT
.INDENT 7.0
.TP
.B Parameters
.INDENT 7.0
.TP
\fBuri: str or pathlib.Path\fP
Designator for the location of the catalog.
.TP
\fBkwargs:\fP
passed to subclass instance, see documentation of the individual
catalog classes. For example, \fByaml_files_cat\fP (when specifying
multiple uris or a glob string) takes the additional
parameter \fBflatten=True|False\fP, specifying whether all data sources
are merged in a single namespace, or each file becomes
a sub\-catalog.
.UNINDENT
.UNINDENT
.sp
\fBSEE ALSO:\fP
.INDENT 7.0
.INDENT 3.5
.INDENT 0.0
.TP
.B \fBintake.open_yaml_files_cat\fP, \fBintake.open_yaml_file_cat\fP
.TP
.B \fBintake.open_intake_remote\fP
.UNINDENT
.UNINDENT
.UNINDENT
.UNINDENT
.INDENT 0.0
.TP
.B intake.registry
Mapping from plugin names to the DataSource classes that implement them. These are the
names that should appear in the \fBdriver:\fP key of each source definition in a
catalog. See \fI\%Plugin Directory\fP for more details.
.UNINDENT
.INDENT 0.0
.TP
.B intake.open_
Set of functions, one for each plugin, for direct opening of a data source. The names are derived from the names of
the plugins in the registry at import time.
.UNINDENT
.INDENT 0.0
.TP
.B intake.upload(data, path, **kwargs)
Given a concrete data object, store it at given location return Source
.sp
Use this function to publicly share data which you have created in your
python session. Intake will try each of the container types, to see if
one of them can handle the input data, and write the data to the path
given, in the format most appropriate for the data type, e.g., parquet for
pandas or dask data\-frames.
.sp
With the DataSource instance you get back, you can add this to a catalog,
or just get the YAML representation for editing (\fB\&.yaml()\fP) and
sharing.
.INDENT 7.0
.TP
.B Parameters
.INDENT 7.0
.TP
\fBdata\fP
instance
The object to upload and store. In many cases, the dask or in\-memory
variant are handled equivalently.
.TP
\fBpath\fP
str
Location of the output files; can be, for instance, a network drive
for sharing over a VPC, or a bucket on a cloud storage service
.TP
\fBkwargs\fP
passed to the writer for fine control.UNINDENT
.TP
.B Returns
.INDENT 7.0
.TP
.B DataSource instance
.UNINDENT
.UNINDENT
.UNINDENT
.SS Source classes
.INDENT 0.0
.TP
.B class  intake.source.csv.CSVSource(*args, **kwargs)
Read CSV files into dataframes
.sp
Prototype of sources reading dataframe data
.INDENT 7.0
.TP
.B __init__(urlpath, csv_kwargs=None, metadata=None, storage_options=None, path_as_pattern=True)
.INDENT 7.0
.TP
.B Parameters
.INDENT 7.0
.TP
\fBurlpath\fP
str or iterable, location of data
May be a local path, or remote path if including a protocol specifier
such as \fB\(aqs3://\(aq\fP\&. May include glob wildcards or format pattern strings.
Some examples:
.INDENT 7.0
.IP \(bu 2
\fB{{ CATALOG_DIR }}data/precipitation.csv\fP
.IP \(bu 2
\fBs3://data/*.csv\fP
.IP \(bu 2
\fBs3://data/precipitation_{state}_{zip}.csv\fP
.IP \(bu 2
\fBs3://data/{year}/{month}/{day}/precipitation.csv\fP
.IP \(bu 2
\fB{{ CATALOG_DIR }}data/precipitation_{date:%Y\-%m\-%d}.csv\fP
.UNINDENT
.TP
\fBcsv_kwargs\fP
dict
Any further arguments to pass to Dask\(aqs read_csv (such as block size)
or to the \fI\%CSV parser\fP
in pandas (such as which columns to use, encoding, data\-types)
.TP
\fBstorage_options\fP
dict
Any parameters that need to be passed to the remote data backend,
such as credentials.
.TP
\fBpath_as_pattern\fP
bool or str, optional
Whether to treat the path as a pattern (ie. \fBdata_{field}.csv\fP)
and create new columns in the output corresponding to pattern
fields. If str, is treated as pattern to match on. Default is True.
.UNINDENT
.UNINDENT
.UNINDENT
.INDENT 7.0
.TP
.B discover()
Open resource and populate the source attributes.
.UNINDENT
.INDENT 7.0
.TP
.B export(path, **kwargs)
Save this data for sharing with other people
.sp
Creates a copy of the data in a format appropriate for its container,
in the location specified (which can be remote, e.g., s3).
.sp
Returns the resultant source object, so that you can, for instance,
add it to a catalog (\fBcatalog.add(source)\fP) or get its YAML
representation (\fB\&.yaml()\fP).
.UNINDENT
.INDENT 7.0
.TP
.B persist(ttl=None, **kwargs)
Save data from this source to local persistent storage
.INDENT 7.0
.TP
.B Parameters
.INDENT 7.0
.TP
\fBttl: numeric, optional\fP
Time to live in seconds. If provided, the original source will
be accessed and a new persisted version written transparently
when more than \fBttl\fP seconds have passed since the old persisted
version was written.
.TP
\fBkargs: passed to the _persist method on the base container.\fP
.UNINDENT
.UNINDENT
.UNINDENT
.INDENT 7.0
.TP
.B read()
Load entire dataset into a container and return it
.UNINDENT
.INDENT 7.0
.TP
.B read_partition(i)
Return a part of the data corresponding to i\-th partition.
.sp
By default, assumes i should be an integer between zero and npartitions;
override for more complex indexing schemes.
.UNINDENT
.INDENT 7.0
.TP
.B to_dask()
Return a dask container for this data source
.UNINDENT
.UNINDENT
.INDENT 0.0
.TP
.B class  intake.source.zarr.ZarrArraySource(*args, **kwargs)
Read Zarr format files into an array
.sp
Zarr is an numerical array storage format which works particularly well
with remote and parallel access.
For specifics of the format, see \fI\%https://zarr.readthedocs.io/en/stable/\fP
.INDENT 7.0
.TP
.B __init__(urlpath, storage_options=None, component=None, metadata=None, **kwargs)
The parameters dtype and shape will be determined from the first
file, if not given.
.INDENT 7.0
.TP
.B Parameters
.INDENT 7.0
.TP
\fBurlpath\fP
str
Location of data file(s), possibly including protocol
information
.TP
\fBstorage_options\fP
dict
Passed on to storage backend for remote files
.TP
\fBcomponent\fP
str or None
If None, assume the URL points to an array. If given, assume
the URL points to a group, and descend the group to find the
array at this location in the hierarchy.
.TP
\fBkwargs\fP
passed on to dask.array.from_zarr.UNINDENT
.UNINDENT
.UNINDENT
.INDENT 7.0
.TP
.B discover()
Open resource and populate the source attributes.
.UNINDENT
.INDENT 7.0
.TP
.B export(path, **kwargs)
Save this data for sharing with other people
.sp
Creates a copy of the data in a format appropriate for its container,
in the location specified (which can be remote, e.g., s3).
.sp
Returns the resultant source object, so that you can, for instance,
add it to a catalog (\fBcatalog.add(source)\fP) or get its YAML
representation (\fB\&.yaml()\fP).
.UNINDENT
.INDENT 7.0
.TP
.B persist(ttl=None, **kwargs)
Save data from this source to local persistent storage
.INDENT 7.0
.TP
.B Parameters
.INDENT 7.0
.TP
\fBttl: numeric, optional\fP
Time to live in seconds. If provided, the original source will
be accessed and a new persisted version written transparently
when more than \fBttl\fP seconds have passed since the old persisted
version was written.
.TP
\fBkargs: passed to the _persist method on the base container.\fP
.UNINDENT
.UNINDENT
.UNINDENT
.INDENT 7.0
.TP
.B read()
Load entire dataset into a container and return it
.UNINDENT
.INDENT 7.0
.TP
.B read_partition(i)
Return a part of the data corresponding to i\-th partition.
.sp
By default, assumes i should be an integer between zero and npartitions;
override for more complex indexing schemes.
.UNINDENT
.INDENT 7.0
.TP
.B to_dask()
Return a dask container for this data source
.UNINDENT
.UNINDENT
.INDENT 0.0
.TP
.B class  intake.source.textfiles.TextFilesSource(*args, **kwargs)
Read textfiles as sequence of lines
.sp
Prototype of sources reading sequential data.
.sp
Takes a set of files, and returns an iterator over the text in each of them.
The files can be local or remote. Extra parameters for encoding, etc.,
go into \fBstorage_options\fP\&.
.INDENT 7.0
.TP
.B __init__(urlpath, text_mode=True, text_encoding=\(aqutf8\(aq, compression=None, decoder=None, read=True, metadata=None, storage_options=None)
.INDENT 7.0
.TP
.B Parameters
.INDENT 7.0
.TP
\fBurlpath\fP
str or list(str)
Target files. Can be a glob\-path (with "*") and include protocol
specified (e.g., "s3://"). Can also be a list of absolute paths.
.TP
\fBtext_mode\fP
bool
Whether to open the file in text mode, recoding binary
characters on the fly
.TP
\fBtext_encoding\fP
str
If text_mode is True, apply this encoding. UTF* is by far the most
common
.TP
\fBcompression\fP
str or None
If given, decompress the file with the given codec on load. Can
be something like "gzip", "bz2", or to try to guess from the filename,
\(aqinfer\(aq
.TP
\fBdecoder\fP
function, str or None
Use this to decode the contents of files. If None, you will get
a list of lines of text/bytes. If a function, it must operate on
an open file\-like object or a bytes/str instance, and return a
list
.TP
\fBread\fP
bool
If decoder is not None, this flag controls whether bytes/str get
passed to the function indicated (True) or the open file\-like
object (False)
.TP
\fBstorage_options: dict\fP
Options to pass to the file reader backend, including text\-specific
encoding arguments, and parameters specific to the remote
file\-system driver, if using.
.UNINDENT
.UNINDENT
.UNINDENT
.INDENT 7.0
.TP
.B discover()
Open resource and populate the source attributes.
.UNINDENT
.INDENT 7.0
.TP
.B export(path, **kwargs)
Save this data for sharing with other people
.sp
Creates a copy of the data in a format appropriate for its container,
in the location specified (which can be remote, e.g., s3).
.sp
Returns the resultant source object, so that you can, for instance,
add it to a catalog (\fBcatalog.add(source)\fP) or get its YAML
representation (\fB\&.yaml()\fP).
.UNINDENT
.INDENT 7.0
.TP
.B persist(ttl=None, **kwargs)
Save data from this source to local persistent storage
.INDENT 7.0
.TP
.B Parameters
.INDENT 7.0
.TP
\fBttl: numeric, optional\fP
Time to live in seconds. If provided, the original source will
be accessed and a new persisted version written transparently
when more than \fBttl\fP seconds have passed since the old persisted
version was written.
.TP
\fBkargs: passed to the _persist method on the base container.\fP
.UNINDENT
.UNINDENT
.UNINDENT
.INDENT 7.0
.TP
.B read()
Load entire dataset into a container and return it
.UNINDENT
.INDENT 7.0
.TP
.B read_partition(i)
Return a part of the data corresponding to i\-th partition.
.sp
By default, assumes i should be an integer between zero and npartitions;
override for more complex indexing schemes.
.UNINDENT
.INDENT 7.0
.TP
.B to_dask()
Return a dask container for this data source
.UNINDENT
.UNINDENT
.INDENT 0.0
.TP
.B class  intake.source.jsonfiles.JSONFileSource(*args, **kwargs)
Read JSON files as a single dictionary or list
.sp
The files can be local or remote. Extra parameters for encoding, etc.,
go into \fBstorage_options\fP\&.
.INDENT 7.0
.TP
.B __init__(urlpath:  str, text_mode:  bool  =  True, text_encoding:  str  =  \(aqutf8\(aq, compression:  Optional[str]  =  None, read:  bool  =  True, metadata:  Optional[dict]  =  None, storage_options:  Optional[dict]  =  None)
.INDENT 7.0
.TP
.B Parameters
.INDENT 7.0
.TP
\fBurlpath\fP
str
Target file. Can include protocol specified (e.g., "s3://").
.TP
\fBtext_mode\fP
bool
Whether to open the file in text mode, recoding binary
characters on the fly
.TP
\fBtext_encoding\fP
str
If text_mode is True, apply this encoding. UTF* is by far the most
common
.TP
\fBcompression\fP
str or None
If given, decompress the file with the given codec on load. Can
be something like "zip", "gzip", "bz2", or to try to guess from the
filename, \(aqinfer\(aq
.TP
\fBstorage_options: dict\fP
Options to pass to the file reader backend, including text\-specific
encoding arguments, and parameters specific to the remote
file\-system driver, if using.
.UNINDENT
.UNINDENT
.UNINDENT
.INDENT 7.0
.TP
.B discover()
Open resource and populate the source attributes.
.UNINDENT
.INDENT 7.0
.TP
.B export(path, **kwargs)
Save this data for sharing with other people
.sp
Creates a copy of the data in a format appropriate for its container,
in the location specified (which can be remote, e.g., s3).
.sp
Returns the resultant source object, so that you can, for instance,
add it to a catalog (\fBcatalog.add(source)\fP) or get its YAML
representation (\fB\&.yaml()\fP).
.UNINDENT
.INDENT 7.0
.TP
.B persist(ttl=None, **kwargs)
Save data from this source to local persistent storage
.INDENT 7.0
.TP
.B Parameters
.INDENT 7.0
.TP
\fBttl: numeric, optional\fP
Time to live in seconds. If provided, the original source will
be accessed and a new persisted version written transparently
when more than \fBttl\fP seconds have passed since the old persisted
version was written.
.TP
\fBkargs: passed to the _persist method on the base container.\fP
.UNINDENT
.UNINDENT
.UNINDENT
.INDENT 7.0
.TP
.B read()
Load entire dataset into a container and return it
.UNINDENT
.UNINDENT
.INDENT 0.0
.TP
.B class  intake.source.jsonfiles.JSONLinesFileSource(*args, **kwargs)
Read a JSONL (\fI\%https://jsonlines.org/\fP) file and return a list of objects,
each being valid json object (e.g. a dictionary or list)
.INDENT 7.0
.TP
.B __init__(urlpath:  str, text_mode:  bool  =  True, text_encoding:  str  =  \(aqutf8\(aq, compression:  Optional[str]  =  None, read:  bool  =  True, metadata:  Optional[dict]  =  None, storage_options:  Optional[dict]  =  None)
.INDENT 7.0
.TP
.B Parameters
.INDENT 7.0
.TP
\fBurlpath\fP
str
Target file. Can include protocol specified (e.g., "s3://").
.TP
\fBtext_mode\fP
bool
Whether to open the file in text mode, recoding binary
characters on the fly
.TP
\fBtext_encoding\fP
str
If text_mode is True, apply this encoding. UTF* is by far the most
common
.TP
\fBcompression\fP
str or None
If given, decompress the file with the given codec on load. Can
be something like "zip", "gzip", "bz2", or to try to guess from the
filename, \(aqinfer\(aq.
.TP
\fBstorage_options: dict\fP
Options to pass to the file reader backend, including text\-specific
encoding arguments, and parameters specific to the remote
file\-system driver, if using.
.UNINDENT
.UNINDENT
.UNINDENT
.INDENT 7.0
.TP
.B discover()
Open resource and populate the source attributes.
.UNINDENT
.INDENT 7.0
.TP
.B export(path, **kwargs)
Save this data for sharing with other people
.sp
Creates a copy of the data in a format appropriate for its container,
in the location specified (which can be remote, e.g., s3).
.sp
Returns the resultant source object, so that you can, for instance,
add it to a catalog (\fBcatalog.add(source)\fP) or get its YAML
representation (\fB\&.yaml()\fP).
.UNINDENT
.INDENT 7.0
.TP
.B head(nrows:  int  =  100)
return the first \fInrows\fP lines from the file
.UNINDENT
.INDENT 7.0
.TP
.B persist(ttl=None, **kwargs)
Save data from this source to local persistent storage
.INDENT 7.0
.TP
.B Parameters
.INDENT 7.0
.TP
\fBttl: numeric, optional\fP
Time to live in seconds. If provided, the original source will
be accessed and a new persisted version written transparently
when more than \fBttl\fP seconds have passed since the old persisted
version was written.
.TP
\fBkargs: passed to the _persist method on the base container.\fP
.UNINDENT
.UNINDENT
.UNINDENT
.INDENT 7.0
.TP
.B read()
Load entire dataset into a container and return it
.UNINDENT
.UNINDENT
.INDENT 0.0
.TP
.B class  intake.source.npy.NPySource(*args, **kwargs)
Read numpy binary files into an array
.sp
Prototype source showing example of working with arrays
.sp
Each file becomes one or more partitions, but partitioning within a file
is only along the largest dimension, to ensure contiguous data.
.INDENT 7.0
.TP
.B __init__(path, dtype=None, shape=None, chunks=None, storage_options=None, metadata=None)
The parameters dtype and shape will be determined from the first
file, if not given.
.INDENT 7.0
.TP
.B Parameters
.INDENT 7.0
.TP
\fBpath: str of list of str\fP
Location of data file(s), possibly including glob and protocol
information
.TP
\fBdtype: str dtype spec\fP
In known, the dtype (e.g., "int64" or "f4").
.TP
\fBshape: tuple of int\fP
If known, the length of each axis
.TP
\fBchunks: int\fP
Size of chunks within a file along biggest dimension \- need not
be an exact factor of the length of that dimension
.TP
\fBstorage_options: dict\fP
Passed to file\-system backend.
.UNINDENT
.UNINDENT
.UNINDENT
.INDENT 7.0
.TP
.B discover()
Open resource and populate the source attributes.
.UNINDENT
.INDENT 7.0
.TP
.B export(path, **kwargs)
Save this data for sharing with other people
.sp
Creates a copy of the data in a format appropriate for its container,
in the location specified (which can be remote, e.g., s3).
.sp
Returns the resultant source object, so that you can, for instance,
add it to a catalog (\fBcatalog.add(source)\fP) or get its YAML
representation (\fB\&.yaml()\fP).
.UNINDENT
.INDENT 7.0
.TP
.B persist(ttl=None, **kwargs)
Save data from this source to local persistent storage
.INDENT 7.0
.TP
.B Parameters
.INDENT 7.0
.TP
\fBttl: numeric, optional\fP
Time to live in seconds. If provided, the original source will
be accessed and a new persisted version written transparently
when more than \fBttl\fP seconds have passed since the old persisted
version was written.
.TP
\fBkargs: passed to the _persist method on the base container.\fP
.UNINDENT
.UNINDENT
.UNINDENT
.INDENT 7.0
.TP
.B read()
Load entire dataset into a container and return it
.UNINDENT
.INDENT 7.0
.TP
.B read_partition(i)
Return a part of the data corresponding to i\-th partition.
.sp
By default, assumes i should be an integer between zero and npartitions;
override for more complex indexing schemes.
.UNINDENT
.INDENT 7.0
.TP
.B to_dask()
Return a dask container for this data source
.UNINDENT
.UNINDENT
.INDENT 0.0
.TP
.B class  intake.catalog.local.YAMLFileCatalog(*args, **kwargs)
Catalog as described by a single YAML file
.INDENT 7.0
.TP
.B __init__(path=None, text=None, autoreload=True, **kwargs)
.INDENT 7.0
.TP
.B Parameters
.INDENT 7.0
.TP
\fBpath: str\fP
Location of the file to parse (can be remote)
.TP
\fBtext: str\fP
YAML contents of catalog, takes precedence over path
.TP
\fBreload\fP
bool
Whether to watch the source file for changes; make False if you want
an editable Catalog
.UNINDENT
.UNINDENT
.UNINDENT
.INDENT 7.0
.TP
.B export(path, **kwargs)
Save this data for sharing with other people
.sp
Creates a copy of the data in a format appropriate for its container,
in the location specified (which can be remote, e.g., s3).
.sp
Returns the resultant source object, so that you can, for instance,
add it to a catalog (\fBcatalog.add(source)\fP) or get its YAML
representation (\fB\&.yaml()\fP).
.UNINDENT
.INDENT 7.0
.TP
.B persist(ttl=None, **kwargs)
Save data from this source to local persistent storage
.INDENT 7.0
.TP
.B Parameters
.INDENT 7.0
.TP
\fBttl: numeric, optional\fP
Time to live in seconds. If provided, the original source will
be accessed and a new persisted version written transparently
when more than \fBttl\fP seconds have passed since the old persisted
version was written.
.TP
\fBkargs: passed to the _persist method on the base container.\fP
.UNINDENT
.UNINDENT
.UNINDENT
.INDENT 7.0
.TP
.B reload()
Reload catalog if sufficient time has passed
.UNINDENT
.INDENT 7.0
.TP
.B walk(sofar=None, prefix=None, depth=2)
Get all entries in this catalog and sub\-catalogs
.INDENT 7.0
.TP
.B Parameters
.INDENT 7.0
.TP
\fBsofar: dict or None\fP
Within recursion, use this dict for output
.TP
\fBprefix: list of str or None\fP
Names of levels already visited
.TP
\fBdepth: int\fP
Number of levels to descend; needed to truncate circular references
and for cleaner output
.UNINDENT
.TP
.B Returns
.INDENT 7.0
.TP
.B Dict where the keys are the entry names in dotted syntax, and the
.TP
.B values are entry instances.
.UNINDENT
.UNINDENT
.UNINDENT
.UNINDENT
.INDENT 0.0
.TP
.B class  intake.catalog.local.YAMLFilesCatalog(*args, **kwargs)
Catalog as described by a multiple YAML files
.INDENT 7.0
.TP
.B __init__(path, flatten=True, **kwargs)
.INDENT 7.0
.TP
.B Parameters
.INDENT 7.0
.TP
\fBpath: str\fP
Location of the files to parse (can be remote), including possible
glob (*) character(s). Can also be list of paths, without glob
characters.
.TP
\fBflatten: bool (True)\fP
Whether to list all entries in the cats at the top level (True)
or create sub\-cats from each file (False).
.UNINDENT
.UNINDENT
.UNINDENT
.INDENT 7.0
.TP
.B export(path, **kwargs)
Save this data for sharing with other people
.sp
Creates a copy of the data in a format appropriate for its container,
in the location specified (which can be remote, e.g., s3).
.sp
Returns the resultant source object, so that you can, for instance,
add it to a catalog (\fBcatalog.add(source)\fP) or get its YAML
representation (\fB\&.yaml()\fP).
.UNINDENT
.INDENT 7.0
.TP
.B persist(ttl=None, **kwargs)
Save data from this source to local persistent storage
.INDENT 7.0
.TP
.B Parameters
.INDENT 7.0
.TP
\fBttl: numeric, optional\fP
Time to live in seconds. If provided, the original source will
be accessed and a new persisted version written transparently
when more than \fBttl\fP seconds have passed since the old persisted
version was written.
.TP
\fBkargs: passed to the _persist method on the base container.\fP
.UNINDENT
.UNINDENT
.UNINDENT
.INDENT 7.0
.TP
.B reload()
Reload catalog if sufficient time has passed
.UNINDENT
.INDENT 7.0
.TP
.B walk(sofar=None, prefix=None, depth=2)
Get all entries in this catalog and sub\-catalogs
.INDENT 7.0
.TP
.B Parameters
.INDENT 7.0
.TP
\fBsofar: dict or None\fP
Within recursion, use this dict for output
.TP
\fBprefix: list of str or None\fP
Names of levels already visited
.TP
\fBdepth: int\fP
Number of levels to descend; needed to truncate circular references
and for cleaner output
.UNINDENT
.TP
.B Returns
.INDENT 7.0
.TP
.B Dict where the keys are the entry names in dotted syntax, and the
.TP
.B values are entry instances.
.UNINDENT
.UNINDENT
.UNINDENT
.UNINDENT
.INDENT 0.0
.TP
.B class  intake.catalog.zarr.ZarrGroupCatalog(*args, **kwargs)
A catalog of the members of a Zarr group.
.INDENT 7.0
.TP
.B __init__(urlpath, storage_options=None, component=None, metadata=None, consolidated=False, name=None)
.INDENT 7.0
.TP
.B Parameters
.INDENT 7.0
.TP
\fBurlpath\fP
str
Location of data file(s), possibly including protocol information
.TP
\fBstorage_options\fP
dict, optional
Passed on to storage backend for remote files
.TP
\fBcomponent\fP
str, optional
If None, build a catalog from the root group. If given, build the
catalog from the group at this location in the hierarchy.
.TP
\fBmetadata\fP
dict, optional
Catalog metadata. If not provided, will be populated from Zarr
group attributes.
.TP
\fBconsolidated\fP
bool, optional
If True, assume Zarr metadata has been consolidated.
.UNINDENT
.UNINDENT
.UNINDENT
.INDENT 7.0
.TP
.B export(path, **kwargs)
Save this data for sharing with other people
.sp
Creates a copy of the data in a format appropriate for its container,
in the location specified (which can be remote, e.g., s3).
.sp
Returns the resultant source object, so that you can, for instance,
add it to a catalog (\fBcatalog.add(source)\fP) or get its YAML
representation (\fB\&.yaml()\fP).
.UNINDENT
.INDENT 7.0
.TP
.B persist(ttl=None, **kwargs)
Save data from this source to local persistent storage
.INDENT 7.0
.TP
.B Parameters
.INDENT 7.0
.TP
\fBttl: numeric, optional\fP
Time to live in seconds. If provided, the original source will
be accessed and a new persisted version written transparently
when more than \fBttl\fP seconds have passed since the old persisted
version was written.
.TP
\fBkargs: passed to the _persist method on the base container.\fP
.UNINDENT
.UNINDENT
.UNINDENT
.INDENT 7.0
.TP
.B reload()
Reload catalog if sufficient time has passed
.UNINDENT
.INDENT 7.0
.TP
.B walk(sofar=None, prefix=None, depth=2)
Get all entries in this catalog and sub\-catalogs
.INDENT 7.0
.TP
.B Parameters
.INDENT 7.0
.TP
\fBsofar: dict or None\fP
Within recursion, use this dict for output
.TP
\fBprefix: list of str or None\fP
Names of levels already visited
.TP
\fBdepth: int\fP
Number of levels to descend; needed to truncate circular references
and for cleaner output
.UNINDENT
.TP
.B Returns
.INDENT 7.0
.TP
.B Dict where the keys are the entry names in dotted syntax, and the
.TP
.B values are entry instances.
.UNINDENT
.UNINDENT
.UNINDENT
.UNINDENT
.SS Base Classes
.sp
This is a reference API class listing, useful mainly for developers.
.TS
center;
|l|l|.
_
T{
\fBintake.source.base.DataSourceBase\fP(*args, ...)
T}	T{
An object which can produce data
T}
_
T{
\fI\%intake.source.base.DataSource\fP(*args, **kwargs)
T}	T{
A Data Source will all optional functionality
T}
_
T{
\fI\%intake.source.base.PatternMixin\fP()
T}	T{
Helper class to provide file\-name parsing abilities to a driver class
T}
_
T{
\fI\%intake.container.base.RemoteSource\fP(*args, ...)
T}	T{
Base class for all DataSources living on an Intake server
T}
_
T{
\fI\%intake.catalog.Catalog\fP(*args, **kwargs)
T}	T{
Manages a hierarchy of data sources as a collective unit.
T}
_
T{
\fI\%intake.catalog.entry.CatalogEntry\fP(*args, ...)
T}	T{
A single item appearing in a catalog
T}
_
T{
\fI\%intake.catalog.local.UserParameter\fP(*args, ...)
T}	T{
A user\-settable item that is passed to a DataSource upon instantiation.
T}
_
T{
\fI\%intake.auth.base.BaseAuth\fP(*args, **kwargs)
T}	T{
Base class for authorization
T}
_
T{
\fI\%intake.source.cache.BaseCache\fP(driver, spec)
T}	T{
Provides utilities for managing cached data files.
T}
_
T{
\fI\%intake.source.base.Schema\fP(**kwargs)
T}	T{
Holds details of data description for any type of data\-source
T}
_
T{
\fI\%intake.container.persist.PersistStore\fP(*args, ...)
T}	T{
Specialised catalog for persisted data\-sources
T}
_
.TE
.INDENT 0.0
.TP
.B class  intake.source.base.DataSource(*args, **kwargs)
A Data Source will all optional functionality
.sp
When subclassed, child classes will have the base data source functionality,
plus caching, plotting and persistence abilities.
.INDENT 7.0
.TP
.B plot
Accessor for HVPlot methods.  See \fI\%Plotting\fP for more details.
.UNINDENT
.UNINDENT
.INDENT 0.0
.TP
.B class  intake.catalog.Catalog(*args, **kwargs)
Manages a hierarchy of data sources as a collective unit.
.sp
A catalog is a set of available data sources for an individual
entity (remote server, local  file, or a local
directory of files). This can be expanded to include a
collection of subcatalogs, which are then managed as a single unit.
.sp
A catalog is created with a single URI or a group of URIs. A URI can
either be a URL or a file path.
.sp
Each catalog in the hierarchy is responsible for caching the most recent
refresh time to prevent overeager queries.
.INDENT 7.0
.TP
.B Attributes
.INDENT 7.0
.TP
\fBmetadata\fP
dict
Arbitrary information to carry along with the data source specs.
.UNINDENT
.UNINDENT
.INDENT 7.0
.TP
.B configure_new(**kwargs)
Create a new instance of this source with altered arguments
.sp
Enables the picking of options and re\-evaluating templates from any
user\-parameters associated with
this source, or overriding any of the init arguments.
.sp
Returns a new data source instance. The instance will be recreated from
the original entry definition in a catalog \fBif\fP this source was originally
created from a catalog.
.UNINDENT
.INDENT 7.0
.TP
.B discover()
Open resource and populate the source attributes.
.UNINDENT
.INDENT 7.0
.TP
.B filter(func)
Create a Catalog of a subset of entries based on a condition
.sp
\fBWARNING:\fP
.INDENT 7.0
.INDENT 3.5
This function operates on CatalogEntry objects not DataSource
objects.
.UNINDENT
.UNINDENT
.sp
\fBNOTE:\fP
.INDENT 7.0
.INDENT 3.5
Note that, whatever specific class this is performed on,
the return instance is a Catalog. The entries are passed
unmodified, so they will still reference the original catalog
instance and include its details such as directory,.
.UNINDENT
.UNINDENT
.INDENT 7.0
.TP
.B Parameters
.INDENT 7.0
.TP
\fBfunc\fP
function
This should take a CatalogEntry and return True or False. Those
items returning True will be included in the new Catalog, with the
same entry names
.UNINDENT
.TP
.B Returns
.INDENT 7.0
.TP
.B Catalog
New catalog with Entries that still refer to their parents
.UNINDENT
.UNINDENT
.UNINDENT
.INDENT 7.0
.TP
.B force_reload()
Imperative reload data now
.UNINDENT
.INDENT 7.0
.TP
.B classmethod  from_dict(entries, **kwargs)
Create Catalog from the given set of entries
.INDENT 7.0
.TP
.B Parameters
.INDENT 7.0
.TP
\fBentries\fP
dict\-like
A mapping of name:entry which supports dict\-like functionality,
e.g., is derived from \fBcollections.abc.Mapping\fP\&.
.TP
\fBkwargs\fP
passed on the constructor
Things like metadata, name; see \fB__init__\fP\&.
.UNINDENT
.TP
.B Returns
.INDENT 7.0
.TP
.B Catalog instance
.UNINDENT
.UNINDENT
.UNINDENT
.INDENT 7.0
.TP
.B get(**kwargs)
Create a new instance of this source with altered arguments
.sp
Enables the picking of options and re\-evaluating templates from any
user\-parameters associated with
this source, or overriding any of the init arguments.
.sp
Returns a new data source instance. The instance will be recreated from
the original entry definition in a catalog \fBif\fP this source was originally
created from a catalog.
.UNINDENT
.INDENT 7.0
.TP
.B property  gui
Source GUI, with parameter selection and plotting
.UNINDENT
.INDENT 7.0
.TP
.B items()
Get an iterator over (key, source) tuples for the catalog entries.
.UNINDENT
.INDENT 7.0
.TP
.B keys()
Entry names in this catalog as an iterator (alias for __iter__)
.UNINDENT
.INDENT 7.0
.TP
.B pop(key)
Remove entry from catalog and return it
.sp
This relies on the \fI_entries\fP attribute being mutable, which it normally
is. Note that if a catalog automatically reloads, any entry removed here
may soon reappear
.INDENT 7.0
.TP
.B Parameters
.INDENT 7.0
.TP
\fBkey\fP
str
Key to give the entry in the cat
.UNINDENT
.UNINDENT
.UNINDENT
.INDENT 7.0
.TP
.B reload()
Reload catalog if sufficient time has passed
.UNINDENT
.INDENT 7.0
.TP
.B save(url, storage_options=None)
Output this catalog to a file as YAML
.INDENT 7.0
.TP
.B Parameters
.INDENT 7.0
.TP
\fBurl\fP
str
Location to save to, perhaps remote
.TP
\fBstorage_options\fP
dict
Extra arguments for the file\-system
.UNINDENT
.UNINDENT
.UNINDENT
.INDENT 7.0
.TP
.B serialize()
Produce YAML version of this catalog.
.sp
Note that this is not the same as \fB\&.yaml()\fP, which produces a YAML
block referring to this catalog.
.UNINDENT
.INDENT 7.0
.TP
.B values()
Get an iterator over the sources for catalog entries.
.UNINDENT
.INDENT 7.0
.TP
.B walk(sofar=None, prefix=None, depth=2)
Get all entries in this catalog and sub\-catalogs
.INDENT 7.0
.TP
.B Parameters
.INDENT 7.0
.TP
\fBsofar: dict or None\fP
Within recursion, use this dict for output
.TP
\fBprefix: list of str or None\fP
Names of levels already visited
.TP
\fBdepth: int\fP
Number of levels to descend; needed to truncate circular references
and for cleaner output
.UNINDENT
.TP
.B Returns
.INDENT 7.0
.TP
.B Dict where the keys are the entry names in dotted syntax, and the
.TP
.B values are entry instances.
.UNINDENT
.UNINDENT
.UNINDENT
.UNINDENT
.INDENT 0.0
.TP
.B class  intake.catalog.entry.CatalogEntry(*args, **kwargs)
A single item appearing in a catalog
.sp
This is the base class, used by local entries (i.e., read from a YAML file)
and by remote entries (read from a server).
.INDENT 7.0
.TP
.B describe()
Get a dictionary of attributes of this entry.
.INDENT 7.0
.TP
.B Returns: dict with keys
.INDENT 7.0
.TP
.B name: str
The name of the catalog entry.
.TP
.B container
str
kind of container used by this data source
.TP
.B description
str
Markdown\-friendly description of data source
.TP
.B direct_access
str
Mode of remote access: forbid, allow, force
.TP
.B user_parameters
list[dict]
List of user parameters defined by this entry
.UNINDENT
.UNINDENT
.UNINDENT
.INDENT 7.0
.TP
.B get(**user_parameters)
Open the data source.
.sp
Equivalent to calling the catalog entry like a function.
.sp
Note: \fBentry()\fP, \fBentry.attr\fP, \fBentry[item]\fP check for persisted
sources, but directly calling \fB\&.get()\fP will always ignore the
persisted store (equivalent to \fBself._pmode==\(aqnever\(aq\fP).
.INDENT 7.0
.TP
.B Parameters
.INDENT 7.0
.TP
\fBuser_parameters\fP
dict
Values for user\-configurable parameters for this data source
.UNINDENT
.TP
.B Returns
.INDENT 7.0
.TP
.B DataSource
.UNINDENT
.UNINDENT
.UNINDENT
.INDENT 7.0
.TP
.B property  has_been_persisted
For the source created with the given args, has it been persisted?
.UNINDENT
.INDENT 7.0
.TP
.B property  plots
List custom associated quick\-plots
.UNINDENT
.UNINDENT
.INDENT 0.0
.TP
.B class  intake.container.base.RemoteSource(*args, **kwargs)
Base class for all DataSources living on an Intake server
.INDENT 7.0
.TP
.B to_dask()
Return a dask container for this data source
.UNINDENT
.UNINDENT
.INDENT 0.0
.TP
.B class  intake.catalog.local.UserParameter(*args, **kwargs)
A user\-settable item that is passed to a DataSource upon instantiation.
.sp
For string parameters, default may include special functions \fBfunc(args)\fP,
which \fImay\fP be expanded from environment variables or by executing a shell
command.
.INDENT 7.0
.TP
.B Parameters
.INDENT 7.0
.TP
\fBname: str\fP
the key that appears in the DataSource argument strings
.TP
\fBdescription: str\fP
narrative text
.TP
\fBtype: str\fP
one of list \fB(COERSION_RULES)\fP
.TP
\fBdefault: type value\fP
same type as \fBtype\fP\&. It a str, may include special functions
env, shell, client_env, client_shell.
.TP
\fBmin, max: type value\fP
for validation of user input
.TP
\fBallowed: list of type\fP
for validation of user input
.UNINDENT
.UNINDENT
.INDENT 7.0
.TP
.B describe()
Information about this parameter
.UNINDENT
.INDENT 7.0
.TP
.B expand_defaults(client=False, getenv=True, getshell=True)
Compile env, client_env, shell and client_shell commands
.UNINDENT
.INDENT 7.0
.TP
.B validate(value)
Does value meet parameter requirements?
.UNINDENT
.UNINDENT
.INDENT 0.0
.TP
.B class  intake.auth.base.BaseAuth(*args, **kwargs)
Base class for authorization
.sp
Subclass this and override the methods to implement a new type of auth.
.sp
This basic class allows all access.
.INDENT 7.0
.TP
.B allow_access(header, source, catalog)
Is the given HTTP header allowed to access given data source
.INDENT 7.0
.TP
.B Parameters
.INDENT 7.0
.TP
\fBheader: dict\fP
The HTTP header from the incoming request
.TP
\fBsource: CatalogEntry\fP
The data source the user wants to access.
.TP
\fBcatalog: Catalog\fP
The catalog object containing this data source.
.UNINDENT
.UNINDENT
.UNINDENT
.INDENT 7.0
.TP
.B allow_connect(header)
Is the requests header given allowed to talk to the server
.INDENT 7.0
.TP
.B Parameters
.INDENT 7.0
.TP
\fBheader: dict\fP
The HTTP header from the incoming request
.UNINDENT
.UNINDENT
.UNINDENT
.INDENT 7.0
.TP
.B get_case_insensitive(dictionary, key, default=None)
Case\-insensitive search of a dictionary for key.
.sp
Returns the value if key match is found, otherwise default.
.UNINDENT
.UNINDENT
.INDENT 0.0
.TP
.B class  intake.source.cache.BaseCache(driver, spec, catdir=None, cache_dir=None, storage_options={})
Provides utilities for managing cached data files.
.sp
Providers of caching functionality should derive from this, and appear
as entries in \fBregistry\fP\&. The principle methods to override are
\fB_make_files()\fP and \fB_load()\fP and \fB_from_metadata()\fP\&.
.INDENT 7.0
.TP
.B clear_all()
Clears all cache and metadata.
.UNINDENT
.INDENT 7.0
.TP
.B clear_cache(urlpath)
Clears cache and metadata for a given urlpath.
.INDENT 7.0
.TP
.B Parameters
.INDENT 7.0
.TP
\fBurlpath: str, location of data\fP
May be a local path, or remote path if including a protocol specifier
such as \fB\(aqs3://\(aq\fP\&. May include glob wildcards.
.UNINDENT
.UNINDENT
.UNINDENT
.INDENT 7.0
.TP
.B get_metadata(urlpath)
.INDENT 7.0
.TP
.B Parameters
.INDENT 7.0
.TP
\fBurlpath: str, location of data\fP
May be a local path, or remote path if including a protocol specifier
such as \fB\(aqs3://\(aq\fP\&. May include glob wildcards.
.UNINDENT
.TP
.B Returns
.INDENT 7.0
.TP
.B Metadata (dict) about a given urlpath.
.UNINDENT
.UNINDENT
.UNINDENT
.INDENT 7.0
.TP
.B load(urlpath, output=None, **kwargs)
Downloads data from a given url, generates a hashed filename,
logs metadata, and caches it locally.
.INDENT 7.0
.TP
.B Parameters
.INDENT 7.0
.TP
\fBurlpath: str, location of data\fP
May be a local path, or remote path if including a protocol specifier
such as \fB\(aqs3://\(aq\fP\&. May include glob wildcards.
.TP
\fBoutput: bool\fP
Whether to show progress bars; turn off for testing
.UNINDENT
.TP
.B Returns
.INDENT 7.0
.TP
.B List of local cache_paths to be opened instead of the remote file(s). If
.TP
.B caching is disable, the urlpath is returned.
.UNINDENT
.UNINDENT
.UNINDENT
.UNINDENT
.INDENT 0.0
.TP
.B class  intake.source.base.PatternMixin
Helper class to provide file\-name parsing abilities to a driver class
.UNINDENT
.INDENT 0.0
.TP
.B class  intake.source.base.Schema(**kwargs)
Holds details of data description for any type of data\-source
.sp
This should always be pickleable, so that it can be sent from a server
to a client, and contain all information needed to recreate a RemoteSource
on the client.
.UNINDENT
.INDENT 0.0
.TP
.B class  intake.container.persist.PersistStore(*args, **kwargs)
Specialised catalog for persisted data\-sources
.INDENT 7.0
.TP
.B add(key, source)
Add the persisted source to the store under the given key
.INDENT 7.0
.TP
.B key
str
The unique token of the un\-persisted, original source
.TP
.B source
DataSource instance
The thing to add to the persisted catalogue, referring to persisted
data
.UNINDENT
.UNINDENT
.INDENT 7.0
.TP
.B backtrack(source)
Given a unique key in the store, recreate original source
.UNINDENT
.INDENT 7.0
.TP
.B get_tok(source)
Get string token from object
.sp
Strings are assumed to already be a token; if source or entry, see
if it is a persisted thing ("original_tok" is in its metadata), else
generate its own token.
.UNINDENT
.INDENT 7.0
.TP
.B needs_refresh(source)
Has the (persisted) source expired in the store
.sp
Will return True if the source is not in the store at all, if it\(aqs
TTL is set to None, or if more seconds have passed than the TTL.
.UNINDENT
.INDENT 7.0
.TP
.B refresh(key)
Recreate and re\-persist the source for the given unique ID
.UNINDENT
.INDENT 7.0
.TP
.B remove(source, delfiles=True)
Remove a dataset from the persist store
.INDENT 7.0
.TP
.B source
str or DataSource or Lo
If a str, this is the unique ID of the original source, which is
the key of the persisted dataset within the store. If a source,
can be either the original or the persisted source.
.TP
.B delfiles
bool
Whether to remove the on\-disc artifact
.UNINDENT
.UNINDENT
.UNINDENT
.SS Other Classes
.SS Cache Types
.TS
center;
|l|l|.
_
T{
\fI\%intake.source.cache.FileCache\fP(driver, spec)
T}	T{
Cache specific set of files
T}
_
T{
\fI\%intake.source.cache.DirCache\fP(driver, spec[, ...])
T}	T{
Cache a complete directory tree
T}
_
T{
\fI\%intake.source.cache.CompressedCache\fP(driver, spec)
T}	T{
Cache files extracted from downloaded compressed source
T}
_
T{
\fI\%intake.source.cache.DATCache\fP(driver, spec[, ...])
T}	T{
Use the DAT protocol to replicate data
T}
_
T{
\fI\%intake.source.cache.CacheMetadata\fP(*args, ...)
T}	T{
Utility class for managing persistent metadata stored in the Intake config directory.
T}
_
.TE
.INDENT 0.0
.TP
.B class  intake.source.cache.FileCache(driver, spec, catdir=None, cache_dir=None, storage_options={})
Cache specific set of files
.sp
Input is a single file URL, URL with glob characters or list of URLs. Output
is a specific set of local files.
.UNINDENT
.INDENT 0.0
.TP
.B class  intake.source.cache.DirCache(driver, spec, catdir=None, cache_dir=None, storage_options={})
Cache a complete directory tree
.sp
Input is a directory root URL, plus a \fBdepth\fP parameter for how many
levels of subdirectories to search. All regular files will be copied. Output
is the resultant local directory tree.
.UNINDENT
.INDENT 0.0
.TP
.B class  intake.source.cache.CompressedCache(driver, spec, catdir=None, cache_dir=None, storage_options={})
Cache files extracted from downloaded compressed source
.sp
For one or more remote compressed files, downloads to local temporary dir and
extracts all contained files to local cache. Input is URL(s) (including
globs) pointing to remote compressed files, plus optional \fBdecomp\fP,
which is "infer" by default (guess from file extension) or one of the
key strings in \fBintake.source.decompress.decomp\fP\&. Optional \fBregex_filter\fP
parameter is used to load only the extracted files that match the pattern.
Output is the list of extracted files.
.UNINDENT
.INDENT 0.0
.TP
.B class  intake.source.cache.DATCache(driver, spec, catdir=None, cache_dir=None, storage_options={})
Use the DAT protocol to replicate data
.sp
For details of the protocol, see \fI\%https://docs.datproject.org/\fP
The executable \fBdat\fP must be available.
.sp
Since in this case, it is not possible to access the remote files
directly, this cache mechanism takes no parameters. The expectation
is that the url passed by the driver is of the form:
.INDENT 7.0
.INDENT 3.5
.sp
.nf
.ft C
dat://<dat hash>/file_pattern
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
where the file pattern will typically be a glob string like "*.json".
.UNINDENT
.INDENT 0.0
.TP
.B class  intake.source.cache.CacheMetadata(*args, **kwargs)
Utility class for managing persistent metadata stored in the Intake config directory.
.INDENT 7.0
.TP
.B keys() -> a set\-like object providing a view on D\(aqs keys
.UNINDENT
.INDENT 7.0
.TP
.B pop(k[, d]) -> v, remove specified key and return the corresponding value.
If key is not found, d is returned if given, otherwise KeyError is raised.
.UNINDENT
.INDENT 7.0
.TP
.B update([E], **F) -> None.  Update D from mapping/iterable E and F.
If E present and has a .keys() method, does:     for k in E: D[k] = E[k]
If E present and lacks .keys() method, does:     for (k, v) in E: D[k] = v
In either case, this is followed by: for k, v in F.items(): D[k] = v
.UNINDENT
.UNINDENT
.SS Auth
.TS
center;
|l|l|.
_
T{
\fI\%intake.auth.secret.SecretAuth\fP(*args, **kwargs)
T}	T{
A very simple auth mechanism using a shared secret
T}
_
T{
\fI\%intake.auth.secret.SecretClientAuth\fP(secret)
T}	T{
Matching client auth plugin to SecretAuth
T}
_
.TE
.INDENT 0.0
.TP
.B class  intake.auth.secret.SecretAuth(*args, **kwargs)
A very simple auth mechanism using a shared secret
.INDENT 7.0
.TP
.B Parameters
.INDENT 7.0
.TP
\fBsecret: str\fP
The string that must be matched in the requests. If None, a random UUID
is generated and logged.
.TP
\fBkey: str\fP
Header entry in which to seek the secret
.UNINDENT
.UNINDENT
.INDENT 7.0
.TP
.B allow_access(header, source, catalog)
Is the given HTTP header allowed to access given data source
.INDENT 7.0
.TP
.B Parameters
.INDENT 7.0
.TP
\fBheader: dict\fP
The HTTP header from the incoming request
.TP
\fBsource: CatalogEntry\fP
The data source the user wants to access.
.TP
\fBcatalog: Catalog\fP
The catalog object containing this data source.
.UNINDENT
.UNINDENT
.UNINDENT
.INDENT 7.0
.TP
.B allow_connect(header)
Is the requests header given allowed to talk to the server
.INDENT 7.0
.TP
.B Parameters
.INDENT 7.0
.TP
\fBheader: dict\fP
The HTTP header from the incoming request
.UNINDENT
.UNINDENT
.UNINDENT
.UNINDENT
.INDENT 0.0
.TP
.B class  intake.auth.secret.SecretClientAuth(secret, key=\(aqintake\-secret\(aq)
Matching client auth plugin to SecretAuth
.INDENT 7.0
.TP
.B Parameters
.INDENT 7.0
.TP
\fBsecret: str\fP
The string that must be included requests.
.TP
\fBkey: str\fP
HTTP Header key for the shared secret
.UNINDENT
.UNINDENT
.INDENT 7.0
.TP
.B get_headers()
Returns a dictionary of HTTP headers for the remote catalog request.
.UNINDENT
.UNINDENT
.SS Containers
.TS
center;
|l|l|.
_
T{
\fI\%intake.container.dataframe.RemoteDataFrame\fP(...)
T}	T{
Dataframe on an Intake server
T}
_
T{
\fI\%intake.container.ndarray.RemoteArray\fP(*args, ...)
T}	T{
nd\-array on an Intake server
T}
_
T{
\fI\%intake.container.semistructured.RemoteSequenceSource\fP(...)
T}	T{
Sequence\-of\-things source on an Intake server
T}
_
.TE
.INDENT 0.0
.TP
.B class  intake.container.dataframe.RemoteDataFrame(*args, **kwargs)
Dataframe on an Intake server
.INDENT 7.0
.TP
.B read()
Load entire dataset into a container and return it
.UNINDENT
.INDENT 7.0
.TP
.B to_dask()
Return a dask container for this data source
.UNINDENT
.UNINDENT
.INDENT 0.0
.TP
.B class  intake.container.ndarray.RemoteArray(*args, **kwargs)
nd\-array on an Intake server
.INDENT 7.0
.TP
.B read()
Load entire dataset into a container and return it
.UNINDENT
.INDENT 7.0
.TP
.B read_partition(i)
Return a part of the data corresponding to i\-th partition.
.sp
By default, assumes i should be an integer between zero and npartitions;
override for more complex indexing schemes.
.UNINDENT
.INDENT 7.0
.TP
.B to_dask()
Return a dask container for this data source
.UNINDENT
.UNINDENT
.INDENT 0.0
.TP
.B class  intake.container.semistructured.RemoteSequenceSource(*args, **kwargs)
Sequence\-of\-things source on an Intake server
.INDENT 7.0
.TP
.B read()
Load entire dataset into a container and return it
.UNINDENT
.INDENT 7.0
.TP
.B to_dask()
Return a dask container for this data source
.UNINDENT
.UNINDENT
.SS Server
.TS
center;
|l|l|.
_
T{
\fI\%intake.cli.server.server.IntakeServer\fP(catalog)
T}	T{
Main intake\-server tornado application
T}
_
T{
\fI\%intake.cli.server.server.ServerInfoHandler\fP(...)
T}	T{
Basic info about the server
T}
_
T{
\fI\%intake.cli.server.server.SourceCache\fP()
T}	T{
Stores DataSources requested by some user
T}
_
T{
\fI\%intake.cli.server.server.ServerSourceHandler\fP(...)
T}	T{
Open or stream data source
T}
_
.TE
.INDENT 0.0
.TP
.B class  intake.cli.server.server.IntakeServer(catalog)
Main intake\-server tornado application
.UNINDENT
.INDENT 0.0
.TP
.B class  intake.cli.server.server.ServerInfoHandler(application:  tornado.web.Application, request:  tornado.httputil.HTTPServerRequest, **kwargs:  Any)
Basic info about the server
.UNINDENT
.INDENT 0.0
.TP
.B class  intake.cli.server.server.SourceCache
Stores DataSources requested by some user
.INDENT 7.0
.TP
.B peek(uuid)
Get the source but do not change the last access time
.UNINDENT
.UNINDENT
.INDENT 0.0
.TP
.B class  intake.cli.server.server.ServerSourceHandler(application:  tornado.web.Application, request:  tornado.httputil.HTTPServerRequest, **kwargs:  Any)
Open or stream data source
.sp
The requests "action" field (open|read) specified what the request wants
to do. Open caches the source and created an ID for it, read uses that
ID to reference the source and read a partition.
.INDENT 7.0
.TP
.B get()
Access one source\(aqs info.
.sp
This is for direct access to an entry by name for random access, which
is useful to the client when the whole catalog has not first been
listed and pulled locally (e.g., in the case of pagination).
.UNINDENT
.UNINDENT
.SS GUI
.TS
center;
||.
_
.TE
.SS Changelog
.SS 0.6.6
.sp
Released on August 26, 2022.
.INDENT 0.0
.IP \(bu 2
Fixed bug in json and jsonl driver.
.IP \(bu 2
Ensure description is retained in the catalog.
.IP \(bu 2
Fix cache issue when running inside a notebook.
.IP \(bu 2
Add templating parameters.
.IP \(bu 2
Plotting api keeps hold of hvplot calls to allow other plots to be made.
.IP \(bu 2
docs updates
.IP \(bu 2
fix urljoin for server via proxy
.UNINDENT
.SS 0.6.5
.sp
Released on January 9, 2022.
.INDENT 0.0
.IP \(bu 2
Added link to intake\-google\-analytics.
.IP \(bu 2
Add tiled driver.
.IP \(bu 2
Add json and jsonl drivers.
.IP \(bu 2
Allow parameters to be passed through catalog.
.IP \(bu 2
Add mlist type which allows inputs from a known list of values.
.UNINDENT
.SS Making Drivers
.sp
The goal of the Intake plugin system is to make it very simple to implement a \fI\%Driver\fP for a new data source, without
any special knowledge of Dask or the Intake catalog system.
.SS Assumptions
.sp
Although Intake is very flexible about data, there are some basic assumptions that a driver must satisfy.
.SS Data Model
.sp
Intake currently supports 3 kinds of containers, represented the most common data models used in Python:
.INDENT 0.0
.IP \(bu 2
dataframe
.IP \(bu 2
ndarray
.IP \(bu 2
python (list of Python objects, usually dictionaries)
.UNINDENT
.sp
Although a driver can load \fIany\fP type of data into any container, and new container types can be added to the list
above, it is reasonable to expect that the number of container types remains small. Declaring a container type is
only informational for the user when read locally, but streaming of data from a server requires that the container type
be known to both server and client.
.sp
A given driver must only return one kind of container.  If a file format (such as HDF5) could reasonably be
interpreted as two different data models depending on usage (such as a dataframe or an ndarray), then two different
drivers need to be created with different names.  If a driver returns the \fBpython\fP container, it should document
what Python objects will appear in the list.
.sp
The source of data should be essentially permanent and immutable.  That is, loading the data should not destroy or
modify the data, nor should closing the data source destroy the data either.  When a data source is serialized and
sent to another host, it will need to be reopened at the destination, which may cause queries to be re\-executed and
files to be reopened.  Data sources that treat readers as "consumers" and remove data once read will cause erratic
behavior, so Intake is not suitable for accessing things like FIFO message queues.
.SS Schema
.sp
The schema of a data source is a detailed description of the data, which can be known by loading only metadata or by
loading only some small representative portion of the data. It is information to present to the user about the data
that they are considering loading, and may be important in the case of server\-client communication. In the latter
context, the contents of the schema must be serializable by \fBmsgpack\fP (i.e., numbers, strings, lists and
dictionaries only).
.sp
There may be unknown parts of
the schema before the whole data is read.  drivers may require this unknown information in the
\fI__init__()\fP method (or the catalog spec), or do some kind of partial data inspection to determine the schema; or
more simply, may be given as unknown \fBNone\fP values.
Regardless of method used, the
time spent figuring out the schema ahead of time should be short and not scale with the size of the data.
.sp
Typical fields in a schema dictionary are \fBnpartitions\fP, \fBdtype\fP, \fBshape\fP, etc., which will be more appropriate
for some drivers/data\-types than others.
.SS Partitioning
.sp
Data sources are assumed to be \fIpartitionable\fP\&.  A data partition is a randomly accessible fragment of the data.
In the case of sequential and data\-frame sources, partitions are numbered, starting from zero, and correspond to
contiguous chunks of data divided along the first
dimension of the data structure. In general, any partitioning scheme is conceivable, such as a tuple\-of\-ints to
index the chunks of a large numerical array.
.sp
Not all data sources can be partitioned.  For example, file
formats without sufficient indexing often can only be read from beginning to end.  In these cases, the DataSource
object should report that there is only 1 partition.  However, it often makes sense for a data source to be able to
represent a directory of files, in which case each file will correspond to one partition.
.SS Metadata
.sp
Once opened, a DataSource object can have arbitrary metadata associated with it.  The metadata for a data source
should be a dictionary that can be serialized as JSON.  This metadata comes from the following sources:
.INDENT 0.0
.IP 1. 3
A data catalog entry can associate fixed metadata with the data source.  This is helpful for data formats that do
not have any support for metadata within the file format.
.IP 2. 3
The driver handling the data source may have some general metadata associated with the state of the system at the
time of access, available even before loading any data\-specific information.
.UNINDENT
.INDENT 0.0
.IP 2. 3
A driver can add additional metadata when the schema is loaded for the data source.  This allows metadata embedded
in the data source to be exported.
.UNINDENT
.sp
From the user perspective, all of the metadata should be loaded once the data source has loaded the rest of the
schema (after \fBdiscover()\fP, \fBread()\fP, \fBto_dask()\fP, etc have been called).
.SS Subclassing \fBintake.source.base.DataSourceBase\fP
.sp
Every Intake driver class should be a subclass of \fBintake.source.base.DataSource\fP\&.
The class should have the following attributes to identify itself:
.INDENT 0.0
.IP \(bu 2
\fBname\fP: The short name of the driver.  This should be a valid python identifier.
You should not include the
word \fBintake\fP in the driver name.
.IP \(bu 2
\fBversion\fP: A version string for the driver.  This may be reported to the user by tools
based on Intake, but has
no semantic importance.
.IP \(bu 2
\fBcontainer\fP: The container type of data sources created by this object, e.g.,
\fBdataframe\fP, \fBndarray\fP, or
\fBpython\fP, one of the keys of \fBintake.container.container_map\fP\&.
For simplicity, a driver many only return one typed of container.  If a particular
source of data could
be used in multiple ways (such as HDF5 files interpreted as dataframes or as ndarrays),
two drivers must be created.
These two drivers can be part of the same Python package.
.IP \(bu 2
\fBpartition_access\fP: Do the data sources returned by this driver have multiple
partitions?  This may help tools in
the future make more optimal decisions about how to present data.  If in doubt
(or the answer depends on init
arguments), \fBTrue\fP will always result in correct behavior, even if the data
source has only one partition.
.UNINDENT
.sp
The \fB__init()__\fP method should always accept a keyword argument \fBmetadata\fP, a
dictionary of metadata from the
catalog to associate with the source.  This dictionary must be serializable as JSON.
.sp
The \fIDataSourceBase\fP class has a small number of methods which should be overridden.
Here is an example producing a
data\-frame:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
class FooSource(intake.source.base.DataSource):
    container = \(aqdataframe\(aq
    name = \(aqfoo\(aq
    version = \(aq0.0.1\(aq
    partition_access = True

    def __init__(self, a, b, metadata=None):
        # Do init here with a and b
        super(FooSource, self).__init__(
            metadata=metadata
        )

    def _get_schema(self):
        return intake.source.base.Schema(
            datashape=None,
            dtype={\(aqx\(aq: "int64", \(aqy\(aq: "int64"},
            shape=(None, 2),
            npartitions=2,
            extra_metadata=dict(c=3, d=4)
        )

    def _get_partition(self, i):
        # Return the appropriate container of data here
        return pd.DataFrame({\(aqx\(aq: [1, 2, 3], \(aqy\(aq: [10, 20, 30]})

    def read(self):
        self._load_metadata()
        return pd.concat([self.read_partition(i) for i in range(self.npartitions)])

    def _close(self):
        # close any files, sockets, etc
        pass
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
Most of the work typically happens in the following methods:
.INDENT 0.0
.IP \(bu 2
\fB__init__()\fP: Should be very lightweight and fast.  No files or network resources should be opened, and no
significant memory should be allocated yet.  Data sources might be serialized immediately.  The default implementation
of the pickle protocol in the base class will record all the arguments to \fB__init__()\fP and recreate the object with
those arguments when unpickled, assuming the class has no side effects.
.IP \(bu 2
\fB_get_schema()\fP: May open files and network resources and return as much of the schema as possible in small
amount of \fIapproximately\fP constant  time. Typically, imports of packages needed by the source only happen here.
The \fBnpartitions\fP and \fBextra_metadata\fP attributes must be correct
when \fB_get_schema\fP returns.  Further keys such as \fBdtype\fP, \fBshape\fP, etc., should reflect the container type of
the data\-source, and can be \fBNone\fP if not easily knowable, or include \fBNone\fP for some elements. File\-based
sources should use fsspec to open a local or remote URL, and pass \fBstorage_options\fP to it. This ensures
compatibility and extra features such as caching. If the backend can only deal with local files, you may
still want to use \fBfsspec.open_local\fP to allow for caching.
.IP \(bu 2
\fB_get_partition(self, i)\fP: Should return all of the data from partition id \fBi\fP, where \fBi\fP is typically an
integer, but may be something more complex.
The base class will automatically verify that \fBi\fP is in the range \fB[0, npartitions)\fP, so no range checking is
required in the typical case.
.IP \(bu 2
\fB_close(self)\fP: Close any network or file handles and deallocate any significant memory.  Note that these
resources may be need to be reopened/reallocated if a read is called again later.
.UNINDENT
.sp
The full set of user methods of interest are as follows:
.INDENT 0.0
.IP \(bu 2
\fBdiscover(self)\fP: Read the source attributes, like \fBnpartitions\fP, etc.  As with \fB_get_schema()\fP above, this
method is assumed to be fast, and make a best effort to set attributes. The output should be serializable, if the
source is to be used on a server; the details contained will be used for creating a remote\-source on the client.
.IP \(bu 2
\fBread(self)\fP: Return all the data in memory in one in\-memory container.
.IP \(bu 2
\fBread_chunked(self)\fP: Return an iterator that returns contiguous chunks of the data.  The chunking is generally
assumed to be at the partition level, but could be finer grained if desired.
.IP \(bu 2
\fBread_partition(self, i)\fP: Returns the data for a given partition id.  It is assumed that reading a given
partition does not require reading the data that precedes it.  If \fBi\fP is out of range, an \fBIndexError\fP should
be raised.
.IP \(bu 2
\fBto_dask(self)\fP: Return a (lazy) Dask data structure corresponding to this data source.  It should be assumed
that the data can be read from the Dask workers, so the loads can be done in future tasks.  For further information,
see the \fI\%Dask documentation\fP\&.
.IP \(bu 2
\fBclose(self)\fP: Close network or file handles and deallocate memory.  If other methods are called after \fBclose()\fP,
the source is automatically reopened.
.IP \(bu 2
\fBto_*\fP: for some sources, it makes sense to provide alternative outputs aside from the base container
(dataframe, array, ...) and Dask variants.
.UNINDENT
.sp
Note that all of these methods typically call \fB_get_schema\fP, to make sure that the source has been
initialised.
.SS Subclassing \fBintake.source.base.DataSource\fP
.sp
\fBDataSource\fP provides the same functionality as \fBDataSourceBase\fP, but has some additional mixin
classes to provide some extras. A developer may choose to derive from \fBDataSource\fP to get all of
these, or from \fBDataSourceBase\fP and make their own choice of mixins to support.
.INDENT 0.0
.IP \(bu 2
\fBHoloviewsMixin\fP: provides plotting and GUI capabilities via the \fI\%holoviz\fP stack
.IP \(bu 2
\fBPersistMixin\fP: allows for storing a local copy in a default format for the given
container type
.IP \(bu 2
\fBCacheMixin\fP: allows for local storage of data files for a source. Deprecated,
you should use one of the caching mechanisms in \fBfsspec\fP\&.
.UNINDENT
.SS Driver Discovery
.sp
Intake discovers available drivers in three different ways, described below.
After the discovery phase, Intake will automatically create
\fBopen_[driver_name]\fP convenience functions under the \fBintake\fP module
namespace.  Calling a function like \fBopen_csv()\fP is equivalent to
instantiating the corresponding data\-source class.
.SS Entrypoints
.sp
If you are packaging your driver into an installable package to be shared, you
should add the following to the package\(aqs \fBsetup.py\fP:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
setup(
    ...
    entry_points={
        \(aqintake.drivers\(aq: [
            \(aqsome_format_name = some_package.and_maybe_a_submodule:YourDriverClass\(aq,
            ...
        ]
    },
)
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
\fBIMPORTANT:\fP
.INDENT 0.0
.INDENT 3.5
Some critical details of Python\(aqs entrypoints feature:
.INDENT 0.0
.IP \(bu 2
Note the unusual syntax of the entrypoints. Each item is given as one long
string, with the \fB=\fP as part of the string. Modules are separated by
\fB\&.\fP, and the final object name is preceded by \fB:\fP\&.
.IP \(bu 2
The right hand side of the equals sign must point to where the object is
\fIactually defined\fP\&. If \fBYourDriverClass\fP is defined in
\fBfoo/bar.py\fP and imported into \fBfoo/__init__.py\fP you might expect
\fBfoo:YourDriverClass\fP to work, but it does not. You must spell out
\fBfoo.bar:YourDriverClass\fP\&.
.UNINDENT
.UNINDENT
.UNINDENT
.sp
Entry points are a way for Python packages to advertise objects with some
common interface. When Intake is imported, it discovers all packages installed
in the current environment that advertise \fB\(aqintake.drivers\(aq\fP in this way.
.sp
Most packages that define intake drivers have a dependency on \fBintake\fP
itself, for example in order to use intake\(aqs base classes. This can create a
circular dependency: importing the package imports intake, which tries
to discover and import packages that define drivers. To avoid this pitfall,
just ensure that \fBintake\fP is imported first thing in your package\(aqs
\fB__init__.py\fP\&. This ensures that the driver\-discovery code runs first. Note
that you are \fInot\fP required to make your package depend on intake. The rule is
that \fIif\fP you import \fBintake\fP you must import it first thing. If you do not
import intake, there is no circularity.
.SS Configuration
.sp
The intake configuration file can be used to:
.INDENT 0.0
.IP \(bu 2
Specify precedence in the event of name collisions\-\-\-for example, if two different
\fBcsv\fP drivers are installed.
.IP \(bu 2
Disable a troublesome driver.
.IP \(bu 2
Manually make intake aware of a driver, which can be useful for
experimentation and early development until a \fBsetup.py\fP with an
entrypoint is prepared.
.IP \(bu 2
Assign a driver to a name other than the one assigned by the driver\(aqs
author.
.UNINDENT
.sp
The commandline invocation
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
intake drivers enable some_format_name some_package.and_maybe_a_submodule.YourDriverClass
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
is equivalent to adding this to your intake configuration file:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
drivers:
  some_format_name: some_package.and_maybe_a_submodule.YourDriverClass
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
You can also disable a troublesome driver
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
intake drivers disable some_format_name
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
which is equivalent to
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
drivers:
  your_format_name: false
.ft P
.fi
.UNINDENT
.UNINDENT
.SS Deprecated: Package Scan
.sp
When Intake is imported, it will search the Python module path (by default includes \fBsite\-packages\fP and other
directories in your \fB$PYTHONPATH\fP) for packages starting with \fBintake\e_\fP and discover DataSource subclasses inside
those packages to register.  drivers will be registered based on the\(ga\(ganame\(ga\(ga attribute of the object.
By convention, drivers should have names that are lowercase, valid Python identifiers that do not contain the word
\fBintake\fP\&.
.sp
This approach is deprecated because it is limiting (requires the package to
begin with "intake_") and because the package scan can be slow. Using
entrypoints is strongly encouraged. The package scan \fImay\fP be disabled by
default in some future release of intake. During the transition period, if a
package named \fBintake_*\fP provides an entrypoint for a given name, that will
take precedence over any drivers gleaned from the package scan having that
name. If intake discovers any names from the package scan for which there are
no entrypoints, it will issue a \fBFutureWarning\fP\&.
.SS Python API to Driver Discovery
.SS Remote Data
.sp
For drivers loading from files, the author should be aware that it is easy to implement loading
from files stored in remote services. A simplistic case is demonstrated by the included CSV driver,
which simply passes a URL to Dask, which in turn can interpret the URL as a remote data service,
and use the \fBstorage_options\fP as required (see the Dask documentation on \fI\%remote data\fP).
.sp
More advanced usage, where a Dask loader does not already exist, will likely rely on
\fI\%fsspec.open_files\fP . Use this function to produce lazy \fBOpenFile\fP object for local
or remote data, based on a URL, which will have a protocol designation and possibly contain
glob "*" characters. Additional parameters may be passed to \fBopen_files\fP, which should,
by convention, be supplied by a driver argument named \fBstorage_options\fP (a dictionary).
.sp
To use an \fBOpenFile\fP object, make it concrete by using a context:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
# at setup, to discover the number of files/partitions
set_of_open_files = fsspec.open_files(urlpath, mode=\(aqrb\(aq, **storage_options)

# when actually loading data; here we loop over all files, but maybe we just do one partition
for an_open_file in set_of_open_files:
    # \(gawith\(ga causes the object to become concrete until the end of the block
    with an_open_file as f:
        # do things with f, which is a file\-like object
        f.seek(); f.read()
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
The \fBtextfiles\fP builtin drivers implements this mechanism, as an example.
.SS Structured File Paths
.sp
The CSV driver sets up an example of how to gather data which is encoded in file paths
like (\fB\(aqdata_{site}_.csv\(aq\fP) and return that data in the output.
Other drivers could also follow the same structure where data is being loaded from a
set of filenames. Typically this would apply to data\-frame output.
This is possible as long as the driver has access to each of the file paths at some
point in \fB_get_schema\fP\&. Once the file paths are known, the driver developer can use the helper
functions defined in \fBintake.source.utils\fP to get the values for each field in the pattern
for each file in the list. These values should then be added to the data, a process which
normally would happen within the _get_schema method.
.sp
The PatternMixin defines driver properties such as urlpath, path_as_pattern, and pattern.
The implementation might look something like this:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
from intake.source.utils import reverse_formats

class FooSource(intake.source.base.DataSource, intake.source.base.PatternMixin):
    def __init__(self, a, b, path_as_pattern, urlpath, metadata=None):
        # Do init here with a and b
        self.path_as_pattern = path_as_pattern
        self.urlpath = urlpath

        super(FooSource, self).__init__(
            container=\(aqdataframe\(aq,
            metadata=metadata
        )
    def _get_schema(self):
        # read in the data
        values_by_field = reverse_formats(self.pattern, file_paths)
        # add these fields and map values to the data
        return data
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
Since dask already has a specific method for including the file paths in the output dataframe,
in the CSV driver we set \fBinclude_path_column=True\fP, to get a dataframe where one of the
columns contains all the file paths. In this case, \fIadd these fields and values to data\fP
is a mapping between the categorical file paths column and the \fBvalues_by_field\fP\&.
.sp
In other drivers where each file is read in independently the driver developer
can set the new fields on the data from each file before concattenating.
This pattern looks more like:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
from intake.source.utils import reverse_format

class FooSource(intake.source.base.DataSource):
    ...

    def _get_schema(self):
        # get list of file paths
        for path in file_paths:
            # read in the file
            values_by_field = reverse_format(self.pattern, path)
            # add these fields and values to the data
        # concatenate the datasets
        return data
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
To toggle on and off this path as pattern behavior, the CSV and intake\-xarray drivers
uses the bool \fBpath_as_pattern\fP keyword argument.
.SS Authorization Plugins
.sp
Authorization plugins are classes that can be used to customize access permissions to the Intake catalog server.
The Intake server and client communicate over HTTP, so when security is a concern, the \fImost important\fP step to take
is to put a TLS\-enabled reverse proxy (like \fBnginx\fP) in front of the Intake server to encrypt all communication.
.sp
Whether or not the connection is encrypted, the Intake server by default allows all clients to list the full catalog,
and open any of the entries.  For many use cases, this is sufficient, but if the visibility of catalog entries needs
to be limited based on some criteria, a server\- (and/or client\-) side authorization plugin can be used.
.SS Server Side
.sp
An Intake server can have exactly one server side plugin enabled at startup.  The plugin is activated using the Intake
configuration file, which lists the class name and the keyword arguments it takes.  For example, the "shared secret"
plugin would be configured this way:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
auth:
  cls: intake.auth.secret.SecretAuth
  kwargs:
    secret: A_SECRET_HASH
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
This plugin is very simplistic, and exists as a demonstration of how an auth plugin might function for more realistic
scenarios.
.sp
For more information about configuring the Intake server, see \fI\%Configuration\fP\&.
.sp
The server auth plugin has two methods.  The \fBallow_connect()\fP method decides whether to allow a client to make any
request to the server at all, and the \fBallow_access()\fP method decides whether the client is allowed to see a
particular catalog entry in the listing and whether they are allowed to open that data source.  Note that for catalog entries which allow direct access to the data (via network or shared filesystem), the Intake authorization plugins have no impact on the visibility of the underlying data, only the entries in the catalog.
.sp
The actual implementation of a plugin is very short.  Here is a simplified version of the shared secret auth plugin:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
class SecretAuth(BaseAuth):
    def __init__(self, secret, key=\(aqintake\-secret\(aq):
        self.secret = secret
        self.key = key

    def allow_connect(self, header):
        try:
            return self.get_case_insensitive(header, self.key, \(aq\(aq) \e
                        == self.secret
        except:
            return False

    def allow_access(self, header, source, catalog):
        try:
            return self.get_case_insensitive(header, self.key, \(aq\(aq) \e
                        == self.secret
        except:
            return False
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
The \fIheader\fP argument is a dictionary of HTTP headers that were present in the client request.  In this case, the
plugin is looking for a special \fBintake\-secret\fP header which contains the shared secret token.  Because HTTP header
names are not case sensitive, the \fBBaseAuth\fP class provides a helper method \fBget_case_insensitive()\fP, which will
match dictionary keys in a case\-insensitive way.
.sp
The \fBallow_access\fP method also takes two additional arguments.  The \fBsource\fP argument is the instance of
\fBLocalCatalogEntry\fP for the data source being checked.  Most commonly auth plugins will want to inspect the
\fB_metadata\fP dictionary for information used to make the authorization decision.  Note that it is entirely up to the
plugin author to decide what sections they want to require in the metadata section.  The \fBcatalog\fP argument is the
instance of \fBCatalog\fP that contains the catalog entry.  Typically, plugins will want to use information from the
\fBcatalog.metadata\fP dictionary to control global defaults, although this is also up to the plugin.
.SS Client Side
.sp
Although server side auth plugins can function entirely independently, some authorization schemes will require the
client to add special HTTP headers for the server to look for.  To facilitate this, the Catalog constructor accepts
an optional \fBauth\fP parameter with an instance of a client auth plugin class.
.sp
The corresponding client plugin for the shared secret use case describe above looks like:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
class SecretClientAuth(BaseClientAuth):
    def __init__(self, secret, key=\(aqintake\-secret\(aq):
        self.secret = secret
        self.key = key

    def get_headers(self):
        return {self.key: self.secret}
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
It defines a single method, \fBget_headers()\fP, which is called to get a dictionary of additional headers to add to the
HTTP request to the catalog server.  To use this plugin, we would do the following:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
import intake
from intake.auth.secret import SecretClientAuth

auth = SecretClientAuth(\(aqA_SECRET_HASH\(aq)
cat = intake.Catalog(\(aqhttp://example.com:5000\(aq, auth=auth)
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
Now all requests made to the remote catalog will contain the \fBintake\-secret\fP header.
.SS Making Data Packages
.sp
Intake can used to create \fI\%Data packages\fP, so that you can easily distribute
your catalogs \- others can just "install data". Since you may also want to distribute
custom catalogues, perhaps with visualisations, and driver code, packaging these things
together is a great convenience. Indeed, packaging gives you the opportunity to
version\-tag your distribution and to declare the requirements needed to be able to
use the data. This is a common pattern for distributing code for python and other
languages, but not commonly seen for data artifacts.
.sp
The current version of Intake allows making data packages using standard python
tools (to be installed, for example, using \fBpip\fP).
The previous, now deprecated, technique is still described below, under
\fI\%Pure conda solution\fP and is specific to the \fIconda\fP packaging system.
.SS Python packaging solution
.sp
Intake allows you to register data artifacts (catalogs and data sources) in the
metadata of a python package. This means, that when you install that package, intake
will automatically know of the registered items, and they will appear within the
"builtin" catalog \fBintake.cat\fP\&.
.sp
Here we assume that you understand what is meant by a python package (i.e., a
folder containing \fB__init__.py\fP and other code, config and data files).
Furthermore, you should familiarise yourself with what is required for
bundling such a package into a \fIdistributable\fP package (one with a \fBsetup.py\fP)
by reading the \fI\%official packaging documentation\fP
.sp
The \fI\%intake examples\fP contains a full tutorial for packaging and distributing
intake data and/or catalogs for \fBpip\fP and \fBconda\fP, see the directory
"data_package/".
.SS Entry points definition
.sp
Intake uses the concept of \fIentry points\fP to define the entries that are defined
by a given package. Entry points provide a mechanism to register metadata about a
package at install time, so that it can easily be found by other packages such as Intake.
Entry points was originally a \fI\%separate package\fP, but is included in the standard
library as of python 3.8 (you will not need to install it, as Intake requires it).
.sp
All you need to do to register an entry in \fBintake.cat\fP is:
.INDENT 0.0
.IP \(bu 2
define a data source somewhere in your package. This object can
be of any ttype that makes sense to Intake, including Catalogs, and sources
that have drivers defined in the very same package. Obviously, if you can have
catalogs, you can populate these however you wish, including with more catalogs.
You need not be restricted to simply loading in YAML files.
.IP \(bu 2
include a block in your call to \fBsetp\fP in \fBsetup.py\fP with code something like
.UNINDENT
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
  entry_points={
      \(aqintake.catalogs\(aq: [
          \(aqsea_cat = intake_example_package:cat\(aq,
          \(aqsea_data = intake_example_package:data\(aq
      ]
  }

Here only the lines with "sea_cat" and "sea_data" are specific to the example
package, the rest is required boilerplate. Each of those two lines defines a name
for the data entry (before the "=" sign) and the location to load from, in
module:object format.
.ft P
.fi
.UNINDENT
.UNINDENT
.INDENT 0.0
.IP \(bu 2
install the package using \fBpip\fP, \fBpython setup.py\fP, or package it for \fBconda\fP
.UNINDENT
.SS Intake\(aqs process
.sp
When Intake is imported, it investigates all registered entry points with the
\fB"intake.catalogs"\fP group. It will go through and assign each name to the
given location of the final object. In the above example, \fBintake.cat.sea_cat\fP
would be associated with the \fBcat\fP object in the \fBintake_example_package\fP
package, and so on.
.sp
Note that Intake does \fBnot\fP immediately import the given package or module, because imports
can sometimes be expensive, and if you have a lot of data packages, it might cause
a slow\-down every time that Intake is imported. Instead, a placeholder entry is
created, and whenever the entry is accessed, that\(aqs when the particular package
will be imported.
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
In [1]: import intake

In [2]: intake.cat.sea_cat  # does not import yet
Out[2]: <Entry containing Catalog named sea_cat>

In [3]: cat = intake.cat.sea_cat()  # imports now

In [4]: cat   # this data source happens to be a catalog
Out[4]: <Intake catalog: sea>
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
(note here the parentheses \- this explicitly initialises the source, and normally
you don\(aqt have to do this)
.SS Pure conda solution
.sp
This packaging method is deprecated, but still available.
.sp
Combined with the \fI\%Conda Package Manger\fP, Intake
makes it possible to create \fI\%Data packages\fP which can be installed and upgraded just like
software packages.  This offers several advantages:
.INDENT 0.0
.INDENT 3.5
.INDENT 0.0
.IP \(bu 2
Distributing Catalogs and Drivers becomes as easy as \fBconda install\fP
.IP \(bu 2
Data packages can be versioned, improving reproducibility in some cases
.IP \(bu 2
Data packages can depend on the libraries required for reading
.IP \(bu 2
Data packages can be self\-describing using Intake catalog files
.IP \(bu 2
Applications that need certain Catalogs can include data packages in their dependency list
.UNINDENT
.UNINDENT
.UNINDENT
.sp
In this tutorial, we give a walk\-through to enable you to distribute any
Catalogs to others, so that they can access the data using Intake without worrying about where it
resides or how it should be loaded.
.SS Implementation
.sp
The function \fBintake.catalog.default.load_combo_catalog\fP searches for YAML catalog files in a number
of place at import. All entries in these catalogs are flattened and placed in the "builtin"
\fBintake.cat\fP\&.
.sp
The places searched are:
.INDENT 0.0
.INDENT 3.5
.INDENT 0.0
.IP \(bu 2
a platform\-specific user directory as given by the \fI\%appdirs\fP package
.IP \(bu 2
in the environment\(aqs "/share/intake" data directory, where the location of the current environment
is found from virtualenv or conda environment variables
.IP \(bu 2
in directories listed in the "INTAKE_PATH" environment variable or "catalog_path" config parameter
.UNINDENT
.UNINDENT
.UNINDENT
.SS Defining a Package
.sp
The steps involved in creating a data package are:
.INDENT 0.0
.IP 1. 3
Identifying a dataset, which can be accessed via a URL or included directly as one or more files in the package.
.IP 2. 3
Creating a package containing:
.INDENT 3.0
.IP \(bu 2
an intake catalog file
.IP \(bu 2
a \fBmeta.yaml\fP file (description of the data, version, requirements, etc.)
.IP \(bu 2
a script to copy the data
.UNINDENT
.IP 3. 3
Building the package using the command \fBconda build\fP\&.
.IP 4. 3
Uploading the package to a package repository such as \fI\%Anaconda Cloud\fP or your own private
repository.
.UNINDENT
.sp
Data packages are standard conda packages that install an Intake catalog file into the user\(aqs conda environment
(\fB$CONDA_PREFIX/share/intake\fP).  A data package does not necessarily imply there are data files inside the package.
A data package could describe remote data sources (such as files in S3) and take up very little space on disk.
.sp
These packages are considered \fBnoarch\fP packages, so that one package can be installed on any platform, with any
version of Python (or no Python at all).  The easiest way to create such a package is using a
\fI\%conda build\fP recipe.
.sp
Conda\-build recipes are stored in a directory that contains a files like:
.INDENT 0.0
.INDENT 3.5
.INDENT 0.0
.IP \(bu 2
\fBmeta.yaml\fP \- description of package metadata
.IP \(bu 2
\fBbuild.sh\fP \- script for building/installing package contents (on Linux/macOS)
.IP \(bu 2
other files needed by the package (catalog files and data files for data packages)
.UNINDENT
.UNINDENT
.UNINDENT
.sp
An example that packages up data from a Github repository would look like this:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
# meta.yaml
package:
  version: \(aq1.0.0\(aq
  name: \(aqdata\-us\-states\(aq

source:
  git_rev: v1.0.0
  git_url: https://github.com/CivilServiceUSA/us\-states

build:
  number: 0
  noarch: generic

requirements:
  run:
    \- intake
  build: []

about:
  description: Data about US states from CivilServices (https://civil.services/)
  license: MIT
  license_family: MIT
  summary: Data about US states from CivilServices
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
The key parts of a data package recipe (different from typical conda recipes) is the \fBbuild\fP section:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
build:
  number: 0
  noarch: generic
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
This will create a package that can be installed on any platform, regardless of the platform where the package is
built.  If you need to rebuild a package, the build number can be incremented to ensure users get the latest version when they conda update.
.sp
The corresponding \fBbuild.sh\fP file in the recipe looks like this:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
#!/bin/bash

mkdir \-p $CONDA_PREFIX/share/intake/civilservices
cp $SRC_DIR/data/states.csv $PREFIX/share/intake/civilservices
cp $RECIPE_DIR/us_states.yaml $PREFIX/share/intake/
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
The \fB$SRC_DIR\fP variable refers to any source tree checked out (from Github or other service), and the
\fB$RECIPE_DIR\fP refers to the directory where the \fBmeta.yaml\fP is located.
.sp
Finishing out this example, the catalog file for this data source looks like this:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
sources:
  states:
    description: US state information from [CivilServices](https://civil.services/)
    driver: csv
    args:
      urlpath: \(aq{{ CATALOG_DIR }}/civilservices/states.csv\(aq
    metadata:
      origin_url: \(aqhttps://github.com/CivilServiceUSA/us\-states/blob/v1.0.0/data/states.csv\(aq
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
The \fB{{ CATALOG_DIR }}\fP Jinja2 variable is used to construct a path relative to where the catalog file was installed.
.sp
To build the package, you must have conda\-build installed:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
conda install conda\-build
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
Building the package requires no special arguments:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
conda build my_recipe_dir
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
Conda\-build will display the path of the built package, which you will need to upload it.
.sp
If you want your data package to be publicly available on \fI\%Anaconda Cloud\fP, you can install
the anaconda\-client utility:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
conda install anaconda\-client
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
Then you can register your Anaconda Cloud credentials and upload the package:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
anaconda login
anaconda upload /Users/intake_user/anaconda/conda\-bld/noarch/data\-us\-states\-1.0.0\-0.tar.bz2
.ft P
.fi
.UNINDENT
.UNINDENT
.SS Best Practices
.SS Versioning
.INDENT 0.0
.IP \(bu 2
Versions for data packages should be used to indicate changes in the data values or schema.  This allows applications
to easily pin to the specific data version they depend on.
.IP \(bu 2
Putting data files into a package ensures reproducibility by allowing a version number to be associated with files
on disk.  This can consume quite a bit of disk space for the user, however. Large data files are not generally
included in pip or conda packages so, if possible, you should reference the data assets in an external place where they
can be loaded.
.UNINDENT
.SS Packaging
.INDENT 0.0
.IP \(bu 2
Packages that refer to remote data sources (such as databases and REST APIs) need to think about authentication.
Do not include authentication credentials inside a data package.  They should be obtained from the environment.
.IP \(bu 2
Data packages should depend on the Intake plugins required to read the data, or Intake itself.
.IP \(bu 2
You may well want to break any driver code code out into a separate package so that it can be updated
independent of the data. The data package would then depend on the driver package.
.UNINDENT
.SS Nested catalogs
.sp
As noted above, entries will appear in the users\(aq builtin
catalog as \fBintake.cat.*\fP\&. In the case that the catalog has multiple entries, it may be desirable
to put the entries below a namespace as \fBintake.cat.data_package.*\fP\&. This can be achieved by having
one catalog containing the (several) data sources, with only a single top\-level entry pointing to
it. This catalog could be defined in a YAML file, created using any other catalog driver, or constructed
in the code, e.g.:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
from intake.catalog import Catalog
from intake.catalog.local import LocalCatalogEntry as Entry
cat = intake.catalog.Catalog()
cat._entries = {name: Entry(name, descr, driver=\(aqpackage.module.driver\(aq,
                              args={"urlpath": url})
                          for name, url in my_input_list}
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
If your package contains many sources of different types, you may even nest the catalogs, i.e.,
have a top\-level whose contents are also catalogs.
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
e = Entry(\(aqfirst_cat\(aq, \(aqsample\(aq, driver=\(aqcatalog\(aq)
e._default_source = cat
top_level = Catalog()
top_level._entries = {\(aqfist_cat\(aq: e, ...}
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
where your entry point might look something like: \fB"my_cat = my_package:top_level"\fP\&. You could achieve the same
with multiple YAML files.
.SH ROADMAP
.sp
Some high\-level work that we expect to be achieved on the time\-scale of months. This list
is not exhaustive, but rather aims to whet the appetite for what Intake can be in the future.
.sp
Since Intake aims to be a community of data\-oriented pythoneers, nothing written here is laid in
stone, and users and devs are encouraged to make their opinions known!
.SS Broaden the coverage of formats
.sp
Data\-type drivers are easy to write, but still require some effort, and therefore reasonable
impetus to get the work done. Conversations over the coming months can help determine the
drivers that should be created by the Intake team, and those that might be contributed by the
community.
.sp
The next type that we would specifically like to consider is machine learning
model artifacts.  \fBEDIT\fP see \fI\%https://github.com/AlbertDeFusco/intake\-sklearn\fP , and
hopefully more to come.
.SS Streaming Source
.sp
Many data sources are inherently time\-sensitive and event\-wise. These are not covered well by existing
Python tools, but the \fBstreamz\fP library may present a nice way to model them. From the Intake point of
view, the task would be to develop a streaming type, and at least one data driver that uses it.
.sp
The most obvious place to start would be read a file: every time a new line appears in the file, an event
is emitted. This is appropriate, for instance, for watching the log files of a web\-server, and indeed could
be extended to read from an arbitrary socket.
.sp
\fBEDIT\fP see: \fI\%https://github.com/intake/intake\-streamz\fP
.SS Server publish hooks
.sp
To add API endpoints to the server, so that a user (with sufficient privilege) can post data
specifications to a running server, optionally saving the specs to a catalog server\-side. Furthermore,
we will consider the possibility of being able to upload and/or transform data
(rather than refer to it in a third\-party location), so that you would have a one\-line "publish"
ability from the client.
.sp
The server, in general, could do with a lot of work to become more than the current
demonstration/prototype. In particular, it should be able to be performant and scalable,
meaning that the server implementation ought to keep as little local state as possible.
.SS Simplify dependencies and class hierarchy
.sp
We would like the make it easier to write Intake drivers which don\(aqt need any
persist or GUI functionality, and to be able to install Intake core
functionality (driver registry, data loading and catalog traversal) without
needing many other packages at all.
.sp
\fBEDIT\fP this has been partly done, you can derive from \fBDataSourceBase\fP and
not have to use the full set of Intake\(aqs features for simplicity. We have also gone
some distance to separate out dependencies for parts of the package, so that you
can install Intake and only use some of the subpackages/modules \- imports don\(aqt
happen until those parts of the code are used. We have \fInot\fP yet split the
intake conda package into, for example, intake\-base, intake\-server, intake\-gui...
.SS Reader API
.sp
For those that wish to provide Intake\(aqs data source API, and make data sources
available to Intake cataloguing, but don\(aqt wish to take Intake as a direct dependency.
The actual API of \fBDataSources\fP is rather simple:
.INDENT 0.0
.IP \(bu 2
\fB__init__\fP: collect arguments, minimal IO at this point
.IP \(bu 2
\fBdiscover()\fP: get metadata from the source, by querying the files/service itself
.IP \(bu 2
\fBread()\fP: return in\-memory version of the data
.IP \(bu 2
\fBto_*\fP: return reference objects for the given compute engine, typically Dask
.IP \(bu 2
\fBread_partition(...)\fP: read part of the data into memory, where the argument
makes sense for the given type of data
.IP \(bu 2
\fBconfigure_new()\fP: create new instance with different arguments
.IP \(bu 2
\fByaml()\fP: representation appropriate for inclusion in a YAML catalogue
.IP \(bu 2
\fBclose()\fP: release any resources
.UNINDENT
.sp
Of these, only the first three are really necessary for a iminal interface, so
Intake might do well to publish this \fIprotocol specification\fP, so that new drivers
can be written that can be used by Intake but do not need Intake, and so help
adoption.
.SH GLOSSARY
.INDENT 0.0
.TP
.B Argument
One of a set of values passed to a function or class. In the Intake sense, this usually is the
set of key\-value pairs defined in the "args" section of a source definition; unless the user
overrides, these will be used for instantiating the source.
.TP
.B Cache
Local copies of remote files. Intake allows for download\-on\-first\-use for data\-sources,
so that subsequent access is much faster, see caching\&. The
format of the files is unchanged in this case, but may be decompressed.
.TP
.B Catalog
An inventory of entries, each of which corresponds to a specific \fI\%Data\-set\fP\&. Within these docs, a catalog is
most commonly defined in a \fI\%YAML\fP file, for simplicity, but there are other possibilities, such as connecting to an Intake
server or another third\-party data service, like a SQL database. Thus, catalogs form a hierarchy: any
catalog can contain other, nested catalogs.
.TP
.B Catalog file
A \fI\%YAML\fP specification file which contains a list of named entries describing how to load data
sources. \fI\%Catalogs\fP\&.
.TP
.B Conda
A package and environment management package for the python ecosystem, see the \fI\%conda website\fP\&. Conda ensures
dependencies and correct versions are installed for you, provides precompiled, binary\-compatible software,
and extends to many languages beyond python, such as R, javascript and C.
.TP
.B Conda package
A single installable item which the \fI\%Conda\fP application can install. A package may include
a \fI\%Catalog\fP, data\-files and maybe some additional code. It will also include a specification of the
dependencies that it requires (e.g., Intake and any additional \fI\%Driver\fP), so that Conda can install those
automatically. Packages can be created locally, or can be found on \fI\%anaconda.org\fP or other package
repositories.
.TP
.B Container
One of the supported data formats. Each \fI\%Driver\fP outputs its data in one of these. The
containers correspond to familiar data structures for end\-analysis, such as list\-of\-dicts, Numpy nd\-array or
Pandas data\-frame.
.TP
.B Data\-set
A specific assemblage of data. The type of data (tabular, multi\-dimensional or something else) and the format
(file type, data service type) are all attributes of the data\-set. In addition, in the context of Intake,
data\-sets are usually entries within a \fI\%Catalog\fP with additional descriptive text and metadata and
a specification of \fIhow\fP to load the data.
.TP
.B Data Source
An Intake specification for a specific \fI\%Data\-set\fP\&. In most cases, the two terms are
synonymous.
.TP
.B Data User
A person who uses data to produce models and other inferences/conclusions. This
person generally uses standard python analysis packages like Numpy, Pandas, SKLearn and may produce
graphical output. They will want to be able to find the right data for a given job, and for
the data to be available in a standard format as quickly and
easily as possible. In many organisations, the appropriate job title may be Data Scientist, but
research scientists and BI/analysts also fit this description.
.TP
.B Data packages
Data packages are standard conda packages that install an Intake catalog file into the user’s conda
environment ($CONDA_PREFIX/share/intake). A data package does not necessarily imply there are data files
inside the package. A data package could describe remote data sources (such as files in S3) and take up
very little space on disk.
.TP
.B Data Provider
A person whose main objective is to curate data sources, get them into appropriate
formats, describe the contents, and disseminate the data to those that need to use them. Such a person
may care about the specifics of the storage format and backing store, the right number of fields
to keep and removing bad data. They may have a good idea of the best way to visualise any give
data\-set. In an organisation, this job may be known as Data Engineer, but it could as easily be
done by a member of the IT team. These people are the most likely to author \fI\%Catalogs\fP\&.
.TP
.B Developer
A person who writes or fixes code. In the context of Intake, a developer may make new format
\fI\%Drivers\fP, create authentication systems or add functionality to Intake itself. They can
take existing code for loading data in other projects, and use Intake to add extra functionality to it,
for instance, remote data access, parallel processing, or file\-name parsing.
.TP
.B Driver
The thing that does the work of reading the data for a catalog entry is known as a driver, often referred
to using a simple name such as "csv". Intake
has a plugin architecture, and new drivers can be created or installed, and specific catalogs/data\-sets may
require particular drivers for their contained data\-sets. If installed as \fI\%Conda\fP packages, then
these requirements will be automatically installed for you. The driver\(aqs output will be a \fI\%Container\fP,
and often the code is a simpler layer over existing functionality in a third\-party package.
.TP
.B GUI
A Graphical User Interface. Intake comes with a GUI for finding and selecting data\-sets, see \fI\%GUI\fP\&.
.TP
.B IT
The Information Technology team for an organisation. Such a team may have
control of the computing infrastructure and security (sys\-ops), and may well act as gate\-keepers when
exposing data for use by other colleagues. Commonly, IT has stronger policy enforcement requirements
that other groups, for instance requiring all data\-set copy actions to be logged centrally.
.TP
.B Persist
A process of making a local version of a data\-source. One canonical format is used for each
of the container types, optimised for quick and parallel access. This is particularly useful
if the data takes a long time to acquire, perhaps because it is the result of a complex
query on a remote service. The resultant output can be set to expire and be automatically
refreshed, see \fI\%Persisting Data\fP\&. Not to be confused with the \fI\%cache\fP\&.
.TP
.B Plugin
Modular extra functionality for Intake, provided by a package that is installed separately. The most common type of
plugin will be for a \fI\%Driver\fP to load some particular data format; but other parts of Intake are
pluggable, such as authentication mechanisms for the server.
.TP
.B Server
A remote source for Intake catalogs. The server will
provide data source specifications (i.e., a remote \fI\%Catalog\fP), and may also provide the raw data, in situations
where the client is not able or not allowed to access it directly. As such, the server can act as a gatekeeper of
the data for security and monitoring purposes. The implementation of the server in Intake is accessible as the
\fBintake\-server\fP command, and acts as a reference: other implementations can easily be created for
specific circumstances.
.TP
.B TTL
Time\-to\-live, how long before the given entity is considered to have expired. Usually in seconds.
.TP
.B User Parameter
A data source definition can contain a "parameters" section, which can act as explicit decision indicators
for the user, or as validation and type coersion for the definition\(aqs \fI\%Argument\fP s. See
\fI\%Parameter Definition\fP\&.
.TP
.B YAML
A text\-based format for expressing data with a dictionary (key\-value) and list structure, with a limited
number of data\-types, such as strings and numbers. YAML uses indentations to nest objects, making it easy
to read and write for humans, compared to JSON. Intake\(aqs catalogs and config are usually expressed in YAML
files.
.UNINDENT
.SH COMMUNITY
.sp
Intake is used and developed by individuals at a variety of institutions.  It
is open source (\fI\%license\fP)
and sits within the broader Python numeric ecosystem commonly referred to as
PyData or SciPy.
.SS Discussion
.sp
Conversation happens in the following places:
.INDENT 0.0
.IP 1. 3
\fBUsage questions\fP are directed to \fI\%Stack Overflow with the #intake tag\fP\&.
Intake developers monitor this tag.
.IP 2. 3
\fBBug reports and feature requests\fP are managed on the \fI\%GitHub issue
tracker\fP\&. Individual intake plugins are managed in separate repositories
each with its own issue tracker. Please consult the \fI\%Plugin Directory\fP
for a list of available plugins.
.IP 3. 3
\fBChat\fP occurs on at \fI\%gitter.im/ContinuumIO/intake\fP\&.  Note that
because gitter chat is not searchable by future users we discourage usage
questions and bug reports on gitter and instead ask people to use Stack
Overflow or GitHub.
.IP 4. 3
\fBMonthly community meeting\fP happens the first Thursday of the month at
9:00 US Central Time. See \fI\%https://github.com/intake/intake/issues/596\fP,
with a reminder sent out on the gitter channel. Strictly informal chatter.
.UNINDENT
.SS Asking for help
.sp
We welcome usage questions and bug reports from all users, even those who are
new to using the project.  There are a few things you can do to improve the
likelihood of quickly getting a good answer.
.INDENT 0.0
.IP 1. 3
\fBAsk questions in the right place\fP:  We strongly prefer the use
of Stack Overflow or GitHub issues over Gitter chat.  GitHub and
Stack Overflow are more easily searchable by future users, and therefore is more
efficient for everyone\(aqs time.  Gitter chat is strictly reserved for
developer and community discussion.
.sp
If you have a general question about how something should work or
want best practices then use Stack Overflow.  If you think you have found a
bug then use GitHub
.IP 2. 3
\fBAsk only in one place\fP: Please restrict yourself to posting your
question in only one place (likely Stack Overflow or GitHub) and don\(aqt post
in both
.IP 3. 3
\fBCreate a minimal example\fP:  It is ideal to create \fI\%minimal, complete,
verifiable examples\fP\&.  This
significantly reduces the time that answerers spend understanding your
situation, resulting in higher quality answers more quickly.
.UNINDENT
.INDENT 0.0
.IP \(bu 2
\fI\%Index\fP
.IP \(bu 2
\fI\%Module Index\fP
.IP \(bu 2
\fI\%Search Page\fP
.UNINDENT
.SH AUTHOR
Anaconda
.SH COPYRIGHT
2022, Anaconda
.\" Generated by docutils manpage writer.
.