datalad search(1) General Commands Manual datalad search(1)

NAME

datalad search - search dataset metadata

SYNOPSIS

datalad search [-h] [-d DATASET] [--reindex] [--max-nresults MAX_NRESULTS] [--mode {egrep,textblob,autofield}] [--full-record] [--show-keys {name,short,full}] [--show-query] [--metadata-source {legacy,gen4,all}] [--version] [QUERY ...]

DESCRIPTION

DataLad can search metadata extracted from a dataset and/or aggregated into a superdataset (see the `aggregate-metadata` command). This makes it possible to discover datasets, or individual files in a dataset even when they are not available locally.

Ultimately DataLad metadata are a graph of linked data structures. However, this command does not (yet) support queries that can exploit all information stored in the metadata. At the moment the following search modes are implemented that represent different trade-offs between the expressiveness of a query and the computational and storage resources required to execute a query.

- egrep (default)

- egrepcs [case-sensitive egrep]

- textblob

- autofield

An alternative default mode can be configured by setting the configuration variable 'datalad.search.default-mode'::

[datalad "search"]
default-mode = egrepcs

Each search mode has its own default configuration for what kind of documents to query. The respective default can be changed via configuration variables::

[datalad "search"]
index-<mode_name>-documenttype = (all|datasets|files)

Mode: egrep/egrepcs

These search modes are largely ignorant of the metadata structure, and simply perform matching of a search pattern against a flat string-representation of metadata. This is advantageous when the query is simple and the metadata structure is irrelevant, or precisely known. Moreover, it does not require a search index, hence results can be reported without an initial latency for building a search index when the underlying metadata has changed (e.g. due to a dataset update). By default, these search modes only consider datasets and do not investigate records for individual files for speed reasons. Search results are reported in the order in which they were discovered.

Queries can make use of Python regular expression syntax (https://docs.python.org/3/library/re.html). In `egrep` mode, matching is case-insensitive when the query does not contain upper case characters, but is case-sensitive when it does. In `egrepcs` mode, matching is always case-sensitive. Expressions will match anywhere in a metadata string, not only at the start.
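The case-handling rule described above can be sketched in a few lines of Python. This is only an illustration of the documented behavior, not DataLad's actual implementation; the helper name `smart_case_search` and its `case_sensitive` parameter are hypothetical.

```python
import re


def smart_case_search(query, text, case_sensitive=False):
    """Search `text` for the regular expression `query`.

    Hypothetical helper that mimics the documented rule:
    - case_sensitive=True  -> always case-sensitive (like `egrepcs`)
    - case_sensitive=False -> case-insensitive only if the query
      contains no upper case characters (like `egrep`)
    """
    has_upper = any(char.isupper() for char in query)
    ignore_case = not case_sensitive and not has_upper
    flags = re.IGNORECASE if ignore_case else 0
    # re.search() matches anywhere in the string, not only at the start
    return re.search(query, text, flags)


# A lowercase query matches regardless of case ...
print(bool(smart_case_search("haxby", "Haxby, J.V.")))
# ... while a query with an upper case character must match exactly.
print(bool(smart_case_search("Haxby", "haxby, j.v.")))
```

Note that under this rule an upper case character inside a regular expression escape (for example `\S`) would also switch the sketch to case-sensitive matching.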

When multiple queries are given, all queries have to match for a search hit (AND behavior).

It is possible to search individual metadata key/value items by prefixing the query with a metadata key name, separated by a colon (':'). The key name can also be a regular expression to match multiple keys. A query match happens when any value of an item with a matching key name matches the query (OR behavior). See examples for more information.
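The interplay of AND behavior (across queries) and OR behavior (across matching keys) may be easier to see in code. The following is a minimal sketch that assumes a metadata record has already been flattened into a plain dictionary; the function `record_matches`, the full-match rule for key names, and the sample record are all assumptions made for illustration and do not reflect DataLad's internals. Case handling is left out for brevity.

```python
import re


def record_matches(record, queries):
    """Return True if all queries match the flat `record` (AND behavior)."""
    for query in queries:
        key_pattern, separator, value_pattern = query.partition(":")
        if not separator:
            # No key selector given: test the query against every value
            key_pattern, value_pattern = ".*", query
        # OR behavior: one matching value under any matching key is enough
        hit = any(
            re.fullmatch(key_pattern, key) and re.search(value_pattern, str(value))
            for key, value in record.items()
        )
        if not hit:
            return False
    return True


# Made-up example record
record = {
    "bids.type": "T1w",
    "nifti1.qform_code": "scanner",
    "nifti1.sform_code": "unknown",
}

print(record_matches(record, ["bids.type:T1w", "nifti1.(q|s)form_code:scanner"]))
print(record_matches(record, ["bids.type:T1w", "nifti1.sform_code:scanner"]))
```

The first call succeeds because the key expression selects both `nifti1.qform_code` and `nifti1.sform_code`, and one of them carries the value `scanner`. The second call fails because the only selected key has a different value.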

Examples:

Query for (what happens to be) an author::


% datalad search haxby

Queries are case-insensitive when they contain no upper case characters, and can be regular expressions. Use `egrepcs` mode to perform a case-sensitive match of a lowercase query::


% datalad search --mode egrepcs halchenko.*haxby

This search mode performs NO analysis of the metadata content. Therefore queries can easily fail to match. For example, the above query implicitly assumes that authors are listed in alphabetical order. If that is the case (which may or may not be true), the following query would yield NO hits::


% datalad search Haxby.*Halchenko

The ``textblob`` search mode represents an alternative that is more robust in such cases.

For more complex queries multiple query expressions can be provided that all have to match to be considered a hit (AND behavior). This query discovers all files (non-default behavior) that match 'bids.type=T1w' AND 'nifti1.qform_code=scanner'::


% datalad -c datalad.search.index-egrep-documenttype=all search bids.type:T1w nifti1.qform_code:scanner

Key name selectors can also be expressions, which can be used to select multiple keys or construct "fuzzy" queries. In such cases a query matches when any item with a matching key matches the query (OR behavior). However, multiple queries are always evaluated using an AND conjunction. The following query extends the example above to match any files that have either 'nifti1.qform_code=scanner' or 'nifti1.sform_code=scanner'::


% datalad -c datalad.search.index-egrep-documenttype=all search bids.type:T1w nifti1.(q|s)form_code:scanner

Mode: textblob

This search mode is very similar to the ``egrep`` mode, but with a few key differences. A search index is built from the string-representation of metadata records. By default, only datasets are included in this index, hence the indexing is usually completed within a few seconds, even for hundreds of datasets. This mode uses its own query language (not regular expressions) that is similar to that of other search engines. It supports logical conjunctions and fuzzy search terms. More information on this is available from the Whoosh project (search engine implementation):

- Description of the Whoosh query language:
http://whoosh.readthedocs.io/en/latest/querylang.html

- Description of a number of query language customizations that are
enabled in DataLad, such as fuzzy term matching:
http://whoosh.readthedocs.io/en/latest/parsing.html#common-customizations

Importantly, search hits are scored and reported in order of descending relevance, hence limiting the number of search results is more meaningful than in the 'egrep' mode and can also reduce the query duration.

Examples:

Search for (what happens to be) two authors, regardless of the order in which those names appear in the metadata::


% datalad search --mode textblob halchenko haxby

Fuzzy search when you only have an approximate idea what you are looking for or how it is spelled::


% datalad search --mode textblob haxbi~

Very fuzzy search, when you are basically only confident about the first two characters and how it sounds approximately (or more precisely: allow for three edits and require matching of the first two characters)::


% datalad search --mode textblob haksbi~3/2
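To make "three edits with a two-character prefix" concrete, here is a small sketch of fuzzy term matching. It uses plain Levenshtein distance, which is an assumption; the search engine may count edits differently (for example by treating a transposition as one edit). The function names and the default of one allowed edit for a bare `~` are likewise assumptions of this sketch.

```python
def edit_distance(a, b):
    """Levenshtein distance: insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


def fuzzy_match(term, word, max_edits=1, prefix_len=0):
    """Match if the first `prefix_len` characters agree and the whole
    term is within `max_edits` edits of `word`."""
    if term[:prefix_len] != word[:prefix_len]:
        return False
    return edit_distance(term, word) <= max_edits


# Roughly corresponds to 'haxbi~'
print(fuzzy_match("haxbi", "haxby"))
# Roughly corresponds to 'haksbi~3/2'
print(fuzzy_match("haksbi", "haxby", max_edits=3, prefix_len=2))
```

In this sketch "haksbi" is three edits away from "haxby" (one deletion and two substitutions), so it only matches once three edits are allowed.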

Combine fuzzy search with logical constructs::


% datalad search --mode textblob 'haxbi~ AND (hanke OR halchenko)'

Mode: autofield

This mode is similar to the 'textblob' mode, but builds a vastly more detailed search index that represents individual metadata variables as individual fields. By default, this search index includes records for datasets and individual files, hence it can grow very quickly into a huge structure that can easily take an hour or more to build and require more than a GB of storage. However, limiting it to documents on datasets (see above) retains the enhanced expressiveness of queries while dramatically reducing the resource demands.

Examples:

List the names of search index fields (auto-discovered from the set of indexed datasets) whose name contains a component starting with "age" or "gender"::


% datalad search --mode autofield --show-keys name '.age' '.gender'

Fuzzy search for datasets with an author that is specified in a particular metadata field::


% datalad search --mode autofield bids.author:haxbi~ type:dataset

Search for individual files that carry a particular description prefix in their 'nifti1' metadata::


% datalad search --mode autofield nifti1.description:FSL* type:file

Reporting

Search hits are returned as standard DataLad results. On the command line the '--output-format' (or '-f') option can be used to tweak results for further processing.

Examples:

Format search hits as a JSON stream (one hit per line)::


% datalad -f json search haxby
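A JSON stream of this kind is straightforward to consume from a script. The sketch below parses one JSON object per line and keeps only dataset hits. The sample lines are made up for illustration; they rely only on the 'path' and 'type' fields, and real records will contain more fields than shown here.

```python
import json


def iter_hits(lines):
    """Yield one parsed search hit per non-empty line of a JSON stream."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


# Made-up sample standing in for the output of the command above
sample_output = """\
{"path": "/data/ds1", "type": "dataset"}
{"path": "/data/ds2/sub-01/anat.nii.gz", "type": "file"}
"""

dataset_paths = [
    hit["path"]
    for hit in iter_hits(sample_output.splitlines())
    if hit["type"] == "dataset"
]
print(dataset_paths)
```

In practice the same generator could be fed `sys.stdin` when the search command is piped into the script.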

Custom formatting: report which terms matched the query for each result. This is useful for investigating fuzzy search results::


% datalad -f '{path}: {query_matched}' search --mode autofield bids.author:haxbi~

OPTIONS

QUERY
    query string; supported syntax and features depend on the selected search mode (see documentation).

-h, --help, --help-np
    show this help message. --help-np forcefully disables the use of a pager for displaying the help message.

-d DATASET, --dataset DATASET
    specify the dataset to perform the query operation on. If no dataset is given, an attempt is made to identify the dataset based on the current working directory and/or the PATH given. Constraints: Value must be a Dataset or a valid identifier of a Dataset (e.g. a path) or value must be NONE

--reindex
    force rebuilding the search index, even if no change in the dataset's state has been detected, for example, when the index documenttype configuration has changed.

--max-nresults MAX_NRESULTS
    maximum number of search results to report. Setting this to 0 will report all search matches. Depending on the mode this can make the search substantially slower. If not specified, a mode-specific default setting will be used. Constraints: value must be convertible to type 'int' or value must be NONE

--mode {egrep,textblob,autofield}
    mode of search index structure and content. See the search mode descriptions under DESCRIPTION for details.

--full-record
    if set, return the full metadata record for each search hit. Depending on the search mode this might require additional queries. By default, only data that is available to the respective search modes is returned. This always includes essential information, such as the path and the type.

--show-keys {name,short,full}
    if given, a list of known search keys is shown. If 'name', only the name is printed, one per line. If 'short' or 'full', statistics (in how many datasets, and how many unique values) are printed. 'short' truncates the listing of unique values. QUERY, if provided, is a list of regular expressions; only keys that match at least one of them are shown. No other action is performed (except for reindexing), even if other arguments are given. Each key is accompanied by a term definition in parentheses (TODO). In most cases a definition is given in the form of a URL. If an ontology definition for a term is known, this URL can resolve to a webpage that provides a comprehensive definition of the term. However, for speed reasons term resolution is solely done on information contained in a local dataset's metadata, and definition URLs might be outdated or point to no longer existing resources.

--show-query
    if given, the formal query that was generated from the given query string is shown, but not actually executed. This is mostly useful for debugging purposes.

--metadata-source {legacy,gen4,all}
    if given, defines which metadata source will be used to search. 'legacy' will limit search to metadata in the old format, i.e. stored in '$DATASET/.datalad/metadata'. 'gen4' will limit search to metadata stored by the git-backend of 'datalad-metadata-model'. If 'all' is given, metadata from all supported sources will be included in the search. [Default: 'legacy']

--version
    show the module and its version which provides the command

AUTHORS


datalad is developed by The DataLad Team and Contributors <team@datalad.org>.

2022-12-14 datalad search 0.17.10