.\" -*- mode: troff; coding: utf-8 -*-
.\" Automatically generated by Pod::Man 5.01 (Pod::Simple 3.43)
.\"
.\" Standard preamble:
.\" ========================================================================
.de Sp \" Vertical space (when we can't use .PP)
.if t .sp .5v
.if n .sp
..
.de Vb \" Begin verbatim text
.ft CW
.nf
.ne \\$1
..
.de Ve \" End verbatim text
.ft R
.fi
..
.\" \*(C` and \*(C' are quotes in nroff, nothing in troff, for use with C<>.
.ie n \{\
.    ds C` ""
.    ds C' ""
'br\}
.el\{\
.    ds C`
.    ds C'
'br\}
.\"
.\" Escape single quotes in literal strings from groff's Unicode transform.
.ie \n(.g .ds Aq \(aq
.el       .ds Aq '
.\"
.\" If the F register is >0, we'll generate index entries on stderr for
.\" titles (.TH), headers (.SH), subsections (.SS), items (.Ip), and index
.\" entries marked with X<> in POD.  Of course, you'll have to process the
.\" output yourself in some meaningful fashion.
.\"
.\" Avoid warning from groff about undefined register 'F'.
.de IX
..
.nr rF 0
.if \n(.g .if rF .nr rF 1
.if (\n(rF:(\n(.g==0)) \{\
.    if \nF \{\
.        de IX
.        tm Index:\\$1\t\\n%\t"\\$2"
..
.        if !\nF==2 \{\
.            nr % 0
.            nr F 2
.        \}
.    \}
.\}
.rr rF
.\" ========================================================================
.\"
.IX Title "CDB_File 3pm"
.TH CDB_File 3pm 2024-03-07 "perl v5.38.2" "User Contributed Perl Documentation"
.\" For nroff, turn off justification.  Always turn off hyphenation; it makes
.\" way too many mistakes in technical documents.
.if n .ad l
.nh
.SH NAME
CDB_File \- Perl extension for access to cdb databases
.SH SYNOPSIS
.IX Header "SYNOPSIS"
.Vb 2
\&    use CDB_File;
\&    $c = tie(%h, \*(AqCDB_File\*(Aq, \*(Aqfile.cdb\*(Aq) or die "tie failed: $!\en";
\&
\&    # If accessing a utf8 stored CDB_File
\&    $c = tie(%h, \*(AqCDB_File\*(Aq, \*(Aqfile.cdb\*(Aq, utf8 => 1) or die "tie failed: $!\en";
\&
\&    $fh = $c\->handle;
\&    sysseek $fh, $c\->datapos, 0 or die ...;
\&    sysread $fh, $x, $c\->datalen;
\&    undef $c;
\&    untie %h;
\&
\&    $t = CDB_File\->new(\*(Aqt.cdb\*(Aq, "t.$$") or die ...;
\&    $t\->insert(\*(Aqkey\*(Aq, \*(Aqvalue\*(Aq);
\&    $t\->finish;
\&
\&    CDB_File::create %t, $file, "$file.$$";
.Ve
.PP
or
.PP
.Vb 2
\&    use CDB_File \*(Aqcreate\*(Aq;
\&    create %t, $file, "$file.$$";
\&
\&    # If you want to store the data in utf8 mode.
\&    create %t, $file, "$file.$$", utf8 => 1;
.Ve
.SH DESCRIPTION
.IX Header "DESCRIPTION"
\&\fBCDB_File\fR is a module which provides a Perl interface to Dan
Bernstein's \fBcdb\fR package:
.PP
.Vb 2
\&    cdb is a fast, reliable, lightweight package for creating and
\&    reading constant databases.
.Ve
.SS "Reading from a cdb"
.IX Subsection "Reading from a cdb"
After the \f(CW\*(C`tie\*(C'\fR shown above, accesses to \f(CW%h\fR will refer
to the \fBcdb\fR file \f(CW\*(C`file.cdb\*(C'\fR, as described in "tie" in perlfunc.
.PP
Low-level access to the database is provided by the three methods
\&\f(CW\*(C`handle\*(C'\fR, \f(CW\*(C`datapos\*(C'\fR, and \f(CW\*(C`datalen\*(C'\fR.  To use them, you must remember
the \f(CW\*(C`CDB_File\*(C'\fR object returned by the \f(CW\*(C`tie\*(C'\fR call: \f(CW$c\fR in the
example above.  The \f(CW\*(C`datapos\*(C'\fR and \f(CW\*(C`datalen\*(C'\fR methods return the
file offset position and length respectively of the most recently
visited key (for example, via \f(CW\*(C`exists\*(C'\fR).
.PP
Beware that if you create an extra reference to the \f(CW\*(C`CDB_File\*(C'\fR object
(like \f(CW$c\fR in the example above) you must destroy it (with \f(CW\*(C`undef\*(C'\fR)
before calling \f(CW\*(C`untie\*(C'\fR on the hash.  This ensures that the object's
\&\f(CW\*(C`DESTROY\*(C'\fR method is called.  Note that \f(CW\*(C`perl \-w\*(C'\fR will check this for
you; see perltie for further details.
.SS "Creating a cdb"
.IX Subsection "Creating a cdb"
A \fBcdb\fR file is created in three steps.  First, call
\&\f(CW\*(C`CDB_File\->new($final, $tmp)\*(C'\fR, where \f(CW$final\fR is the name of the
database to be created, and \f(CW$tmp\fR is the name of a temporary file which
can be atomically renamed to \f(CW$final\fR.  Second, call the \f(CW\*(C`insert\*(C'\fR method
once for each (\fIkey\fR, \fIvalue\fR) pair.  Finally, call the \f(CW\*(C`finish\*(C'\fR
method to complete the creation and renaming of the \fBcdb\fR file.
.PP
Alternatively, call the \f(CWinsert()\fR method with multiple key/value
pairs.  This can be significantly faster because fewer calls cross the
boundary between Perl and C code.  One simple way to do this is to pass
in an entire hash, as in: \f(CW\*(C`$cdbmaker\->insert(%hash);\*(C'\fR.
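.PP
As a rough sketch of the batched form, the whole database can be written
with a single \f(CW\*(C`insert\*(C'\fR call.  The file names and the contents of
\&\f(CW%hash\fR are placeholders chosen for illustration.
.PP
.Vb 1
\&    use CDB_File;
\&
\&    # Hypothetical data; any list of key/value pairs will do.
\&    my %hash = (apple => \*(Aqred\*(Aq, banana => \*(Aqyellow\*(Aq, plum => \*(Aqpurple\*(Aq);
\&
\&    my $cdbmaker = CDB_File\->new(\*(Aqfruit.cdb\*(Aq, "fruit.cdb.$$")
\&            or die "$0: new CDB_File failed: $!\en";
\&    $cdbmaker\->insert(%hash);      # one call into C for all pairs
\&    $cdbmaker\->finish or die "$0: CDB_File finish failed: $!\en";
.Ve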
.PP
A simpler interface to \fBcdb\fR file creation is provided by
\&\f(CW\*(C`CDB_File::create %t, $final, $tmp\*(C'\fR.  This creates a \fBcdb\fR file named
\&\f(CW$final\fR containing the contents of \f(CW%t\fR.  As before, \f(CW$tmp\fR must
name a temporary file which can be atomically renamed to \f(CW$final\fR.
\&\f(CW\*(C`CDB_File::create\*(C'\fR may be imported.
.SS "UTF8 support"
.IX Subsection "UTF8 support"
When CDB_File was created in 1997 (prior even to Perl 5.6), Perl SVs
didn't really deal with UTF8.  To store a mix of byte strings and utf8
data properly, the file would normally need one extra bit per string
recording the encoding of each key and value.  Such a flag would be
useful because Perl downgrades hash keys to bytes whenever possible, so
that hash key access behaves the same regardless of encoding.
.PP
The CDB_File format is used outside of Perl and so must maintain file
format compatibility with those systems.  As a result, this module provides
a utf8 mode, which must be enabled both when the database is generated and
later when it is read.  In this mode keys are always stored as UTF8
strings, which is the opposite of how Perl stores them.  This approach was
taken to ensure that no data is corrupted by SVs that are accidentally
downgraded before they are stored or on retrieval.
.PP
You can enable utf8 mode by passing \f(CW\*(C`utf8 => 1\*(C'\fR to \fBnew\fR, \fBtie\fR,
or \fBcreate\fR.  All SVs returned in this mode are encoded in utf8.
This feature is not available on Perl versions below 5.14 due to lack of
Perl macro support.
.PP
\&\fBNOTE:\fR read/write of databases not stored in utf8 mode will often be
incompatible with any non\-ASCII data.
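.PP
A minimal round trip in utf8 mode might look like the sketch below.  The
file name and the sample data are made up for illustration, and Perl 5.14
or later is assumed.
.PP
.Vb 1
\&    use CDB_File \*(Aqcreate\*(Aq;
\&
\&    # Hypothetical data containing non\-ASCII characters.
\&    my %t = ("caf\ex{e9}" => "na\ex{ef}ve");
\&    my $file = \*(Aqwords.cdb\*(Aq;
\&
\&    # Write the database in utf8 mode ...
\&    create %t, $file, "$file.$$", utf8 => 1
\&            or die "$0: can\*(Aqt create cdb: $!\en";
\&
\&    # ... and read it back in the same mode.
\&    tie my %h, \*(AqCDB_File\*(Aq, $file, utf8 => 1
\&            or die "$0: can\*(Aqt tie to $file: $!\en";
\&    binmode STDOUT, \*(Aq:encoding(UTF\-8)\*(Aq;
\&    print "$_ => $h{$_}\en" for keys %h;
.Ve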
.SH EXAMPLES
.IX Header "EXAMPLES"
These are all complete programs.
.PP
1. Convert a Berkeley DB (B\-tree) database to \fBcdb\fR format.
.PP
.Vb 2
\&    use CDB_File;
\&    use DB_File;
\&
\&    tie %h, \*(AqDB_File\*(Aq, $ARGV[0], O_RDONLY, undef, $DB_BTREE or
\&            die "$0: can\*(Aqt tie to $ARGV[0]: $!\en";
\&
\&    CDB_File::create %h, $ARGV[1], "$ARGV[1].$$" or
\&            die "$0: can\*(Aqt create cdb: $!\en";
.Ve
.PP
2. Convert a flat file to \fBcdb\fR format.  In this example, the flat
file consists of one key per line, separated by a colon from the value.
Blank lines and lines beginning with \fB#\fR are skipped.
.PP
.Vb 1
\&    use CDB_File;
\&
\&    $cdb = CDB_File\->new("data.cdb", "data.$$") or
\&            die "$0: new CDB_File failed: $!\en";
\&    while (<>) {
\&            next if /^$/ or /^#/;
\&            chomp;  # remove only a trailing newline
\&            ($k, $v) = split /:/, $_, 2;
\&            if (defined $v) {
\&                    $cdb\->insert($k, $v);
\&            } else {
\&                    warn "bogus line: $_\en";
\&            }
\&    }
\&    $cdb\->finish or die "$0: CDB_File finish failed: $!\en";
.Ve
.PP
3. Perl version of \fBcdbdump\fR.
.PP
.Vb 1
\&    use CDB_File;
\&
\&    tie %data, \*(AqCDB_File\*(Aq, $ARGV[0] or
\&            die "$0: can\*(Aqt tie to $ARGV[0]: $!\en";
\&    while (($k, $v) = each %data) {
\&            print \*(Aq+\*(Aq, length $k, \*(Aq,\*(Aq, length $v, ":$k\->$v\en";
\&    }
\&    print "\en";
.Ve
.PP
4. For really enormous data values, you can use \f(CW\*(C`handle\*(C'\fR, \f(CW\*(C`datapos\*(C'\fR,
and \f(CW\*(C`datalen\*(C'\fR, in combination with \f(CW\*(C`sysseek\*(C'\fR and \f(CW\*(C`sysread\*(C'\fR, to
avoid reading the values into memory.  Here is the script \fIbun\-x.pl\fR,
which can extract uncompressed files and directories from a \fBbun\fR
file.
.PP
.Vb 1
\&    use CDB_File;
\&
\&    sub unnetstrings {
\&        my($netstrings) = @_;
\&        my @result;
\&        while ($netstrings =~ s/^([0\-9]+)://) {
\&                push @result, substr($netstrings, 0, $1, \*(Aq\*(Aq);
\&                $netstrings =~ s/^,//;
\&        }
\&        return @result;
\&    }
\&
\&    my $chunk = 8192;
\&
\&    sub extract {
\&        my($file, $t, $b) = @_;
\&        my $head = $$b{"H$file"};
\&        my ($code, $type) = $head =~ m/^([0\-9]+)(.)/;
\&        if ($type eq "/") {
\&                mkdir $file, 0777;
\&        } elsif ($type eq "_") {
\&                my ($total, $now, $got, $x);
\&                open OUT, ">$file" or die "open for output: $!\en";
\&                exists $$b{"D$code"} or die "corrupt bun file\en";
\&                my $fh = $t\->handle;
\&                sysseek $fh, $t\->datapos, 0 or die "seek error: $!\en";
\&                $total = $t\->datalen;
\&                while ($total) {
\&                        $now = ($total > $chunk) ? $chunk : $total;
\&                        $got = sysread $fh, $x, $now;
\&                        if (not $got) { die "read error\en"; }
\&                        $total \-= $got;
\&                        print OUT $x;
\&                }
\&                close OUT;
\&        } else {
\&                print STDERR "warning: skipping unknown file type\en";
\&        }
\&    }
\&
\&    die "usage\en" if @ARGV != 1;
\&
\&    my (%b, $t);
\&    $t = tie %b, \*(AqCDB_File\*(Aq, $ARGV[0] or die "tie: $!\en";
\&    extract($_, $t, \e%b) for unnetstrings $b{""};
.Ve
.PP
5. Although a \fBcdb\fR file is constant, you can simulate updating it
in Perl.  This is an expensive operation, as you have to create a
new database, and copy into it everything that's unchanged from the
old database.  (As compensation, the update does not affect database
readers.  The old database remains available to them until the moment the
new one is \f(CW\*(C`finish\*(C'\fRed.)
.PP
.Vb 1
\&    use CDB_File;
\&
\&    $file = \*(Aqdata.cdb\*(Aq;
\&    $new = CDB_File\->new($file, "$file.$$") or
\&            die "$0: new CDB_File failed: $!\en";
\&
\&    # Add the new values; remember which keys we\*(Aqve seen.
\&    while (<>) {
\&            chomp;  # remove only a trailing newline
\&            ($k, $v) = split;
\&            $new\->insert($k, $v);
\&            $seen{$k} = 1;
\&    }
\&
\&    # Add any old values that haven\*(Aqt been replaced.
\&    tie %old, \*(AqCDB_File\*(Aq, $file or die "$0: can\*(Aqt tie to $file: $!\en";
\&    while (($k, $v) = each %old) {
\&            $new\->insert($k, $v) unless $seen{$k};
\&    }
\&
\&    $new\->finish or die "$0: CDB_File finish failed: $!\en";
.Ve
.SH "REPEATED KEYS"
.IX Header "REPEATED KEYS"
Most users can ignore this section.
.PP
A \fBcdb\fR file can contain repeated keys.  If the \f(CW\*(C`insert\*(C'\fR method is
called more than once with the same key during the creation of a \fBcdb\fR
file, that key will be repeated.
.PP
Here's an example.
.PP
.Vb 4
\&    $cdb = CDB_File\->new("$file.cdb", "$file.$$") or die ...;
\&    $cdb\->insert(\*(Aqcat\*(Aq, \*(Aqgato\*(Aq);
\&    $cdb\->insert(\*(Aqcat\*(Aq, \*(Aqchat\*(Aq);
\&    $cdb\->finish;
.Ve
.PP
Normally, any attempt to access a key retrieves the first value
stored under that key.  This code snippet always prints \fBgato\fR.
.PP
.Vb 2
\&    $catref = tie %catalogue, \*(AqCDB_File\*(Aq, "$file.cdb" or die ...;
\&    print "$catalogue{cat}";
.Ve
.PP
However, all the usual ways of iterating over a hash (\f(CW\*(C`keys\*(C'\fR,
\&\f(CW\*(C`values\*(C'\fR, and \f(CW\*(C`each\*(C'\fR) do the Right Thing, even in the presence of
repeated keys.  This code snippet prints \fBcat cat gato chat\fR.
.PP
.Vb 1
\&    print join(\*(Aq \*(Aq, keys %catalogue, values %catalogue);
.Ve
.PP
And these two both print \fBcat:gato cat:chat\fR, although the second is
more efficient.
.PP
.Vb 3
\&    foreach $key (keys %catalogue) {
\&            print "$key:$catalogue{$key} ";
\&    }
\&
\&    while (($key, $val) = each %catalogue) {
\&            print "$key:$val ";
\&    }
.Ve
.PP
The \f(CW\*(C`multi_get\*(C'\fR method retrieves all the values associated with a key.
It returns a reference to an array containing all the values.  This code
prints \fBgato chat\fR.
.PP
.Vb 1
\&    print "@{$catref\->multi_get(\*(Aqcat\*(Aq)}";
.Ve
.PP
\&\f(CW\*(C`multi_get\*(C'\fR always returns an array reference.  If the key was not
found in the database, it will be a reference to an empty array.  To
test whether the key was found, you must test the array, and not the
reference.
.PP
.Vb 3
\&    $x = $catref\->multi_get($key);
\&    warn "$key not found\en" unless $x; # WRONG; message never printed
\&    warn "$key not found\en" unless @$x; # Correct
.Ve
.PP
The \f(CW\*(C`fetch_all\*(C'\fR method returns a hashref of all keys with the first
value in the cdb.  This is useful for quickly loading a cdb file where
there is a 1:1 key mapping.  In practice it proved to be about 400%
faster than iterating a tied hash.
.PP
.Vb 2
\&    # Slow
\&    my %copy = %tied_cdb;
\&
\&    # Much Faster
\&    my $copy_hashref = $catref\->fetch_all();
.Ve
.SH "RETURN VALUES"
.IX Header "RETURN VALUES"
The routines \f(CW\*(C`tie\*(C'\fR, \f(CW\*(C`new\*(C'\fR, and \f(CW\*(C`finish\*(C'\fR return \fBundef\fR if the
attempted operation failed; \f(CW$!\fR contains the reason for failure.
.SH DIAGNOSTICS
.IX Header "DIAGNOSTICS"
The following fatal errors may occur.  (See "eval" in perlfunc if
you want to trap them.)
.IP "Modification of a CDB_File attempted" 4
.IX Item "Modification of a CDB_File attempted"
You attempted to modify a hash tied to a \fBCDB_File\fR.
.IP "CDB database too large" 4
.IX Item "CDB database too large"
You attempted to create a \fBcdb\fR file larger than 4 gigabytes.
.IP "[ Write to | Read of | Seek in ] CDB_File failed: <error string>" 4
.IX Item "[ Write to | Read of | Seek in ] CDB_File failed: <error string>"
If \fBerror string\fR is \fBProtocol error\fR, you tried to \f(CW\*(C`use CDB_File\*(C'\fR to
access something that isn't a \fBcdb\fR file.  Otherwise a serious OS level
problem occurred, for example, you have run out of disk space.
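.PP
One way to trap these errors is to wrap the operation in a block
\&\f(CW\*(C`eval\*(C'\fR and inspect \f(CW$@\fR afterwards.  The sketch below uses a
hypothetical file name and first builds a tiny database so that it can
run on its own.
.PP
.Vb 1
\&    use CDB_File;
\&
\&    # Build a small database to experiment with (hypothetical name).
\&    my %t = (key => \*(Aqvalue\*(Aq);
\&    my $file = \*(Aqdiag.cdb\*(Aq;
\&    CDB_File::create %t, $file, "$file.$$"
\&            or die "$0: can\*(Aqt create cdb: $!\en";
\&
\&    tie my %h, \*(AqCDB_File\*(Aq, $file
\&            or die "$0: can\*(Aqt tie to $file: $!\en";
\&
\&    # A tied cdb is read\-only, so this assignment raises a fatal error.
\&    my $ok = eval { $h{key} = \*(Aqchanged\*(Aq; 1 };
\&    warn "caught: $@" unless $ok;
.Ve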
.SH PERFORMANCE
.IX Header "PERFORMANCE"
Sometimes you need to get the most performance possible out of a
library.  Rumour has it that Perl's \fBtie()\fR interface is slow.  In order
to get around that you can use CDB_File in an object\-oriented
fashion, rather than via \fBtie()\fR.
.PP
.Vb 1
\&  my $cdb = CDB_File\->TIEHASH(\*(Aq/path/to/cdbfile.cdb\*(Aq);
\&
\&  if ($cdb\->EXISTS(\*(Aqkey\*(Aq)) {
\&      print "Key is: ", $cdb\->FETCH(\*(Aqkey\*(Aq), "\en";
\&  }
.Ve
.PP
For more information on the methods available on tied hashes see
perltie.
.SH "THE ALGORITHM"
.IX Header "THE ALGORITHM"
This algorithm is described at <http://cr.yp.to/cdb/cdb.txt>.  It is
small enough that it is included inline in the event that the
internet loses the page:
.SS "A structure for constant databases"
.IX Subsection "A structure for constant databases"
Copyright (c) 1996 D. J. Bernstein, djb@pobox.com
.PP
A cdb is an associative array: it maps strings ("keys") to strings
("data").
.PP
A cdb contains 256 pointers to linearly probed open hash tables. The
hash tables contain pointers to (key,data) pairs. A cdb is stored in
a single file on disk:
.PP
.Vb 3
\&    +\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-+\-\-\-\-\-\-\-\-\-+\-\-\-\-\-\-\-+\-\-\-\-\-\-\-+\-\-\-\-\-+\-\-\-\-\-\-\-\-\-+
\&    | p0 p1 ... p255 | records | hash0 | hash1 | ... | hash255 |
\&    +\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-+\-\-\-\-\-\-\-\-\-+\-\-\-\-\-\-\-+\-\-\-\-\-\-\-+\-\-\-\-\-+\-\-\-\-\-\-\-\-\-+
.Ve
.PP
Each of the 256 initial pointers states a position and a length. The
position is the starting byte position of the hash table. The length
is the number of slots in the hash table.
.PP
Records are stored sequentially, without special alignment. A record
states a key length, a data length, the key, and the data.
.PP
Each hash table slot states a hash value and a byte position. If the
byte position is 0, the slot is empty. Otherwise, the slot points to
a record whose key has that hash value.
.PP
Positions, lengths, and hash values are 32\-bit quantities, stored in
little-endian form in 4 bytes. Thus a cdb must fit into 4 gigabytes.
.PP
A record is located as follows. Compute the hash value of the key in
the record. The hash value modulo 256 is the number of a hash table.
The hash value divided by 256, modulo the length of that table, is a
slot number. Probe that slot, the next higher slot, and so on, until
you find the record or run into an empty slot.
.PP
The cdb hash function is \f(CW\*(C`h = ((h << 5) + h) ^ c\*(C'\fR, with a starting
hash of 5381.
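.PP
The lookup procedure above can be sketched in pure Perl.  This is an
illustration only, not the module's own code: the names \f(CW\*(C`cdb_hash\*(C'\fR,
\&\f(CW\*(C`read_at\*(C'\fR, and \f(CW\*(C`cdb_find\*(C'\fR are hypothetical, a Perl built with 64\-bit
integers is assumed, and the probe is assumed to wrap around to slot 0
when it reaches the end of a table.
.PP
.Vb 1
\&    use strict;
\&    use warnings;
\&
\&    # h = ((h << 5) + h) ^ c over the bytes of the key, kept to 32 bits.
\&    # The key is expected to be a byte string.
\&    sub cdb_hash {
\&        my ($key) = @_;
\&        my $h = 5381;
\&        for my $c (unpack \*(AqC*\*(Aq, $key) {
\&            $h = ((($h << 5) + $h) ^ $c) & 0xFFFFFFFF;
\&        }
\&        return $h;
\&    }
\&
\&    # Read exactly $len bytes starting at byte offset $pos.
\&    sub read_at {
\&        my ($fh, $pos, $len) = @_;
\&        seek $fh, $pos, 0 or die "seek: $!\en";
\&        my $got = read $fh, my $buf, $len;
\&        die "short read\en" unless defined $got && $got == $len;
\&        return $buf;
\&    }
\&
\&    # Return the first value stored under $key, or undef if it is absent.
\&    sub cdb_find {
\&        my ($path, $key) = @_;
\&        open my $fh, \*(Aq<:raw\*(Aq, $path or die "open $path: $!\en";
\&        my $h = cdb_hash($key);
\&
\&        # One of 256 initial pointers: (table position, number of slots).
\&        my ($tpos, $tlen) = unpack \*(AqV2\*(Aq, read_at($fh, ($h % 256) * 8, 8);
\&        return unless $tlen;
\&
\&        my $slot = ($h >> 8) % $tlen;
\&        for (1 .. $tlen) {
\&            # Each slot holds (hash value, record position).
\&            my ($sh, $rpos) = unpack \*(AqV2\*(Aq, read_at($fh, $tpos + $slot * 8, 8);
\&            return unless $rpos;            # empty slot: key not present
\&            if ($sh == $h) {
\&                # Record layout: key length, data length, key, data.
\&                my ($klen, $dlen) = unpack \*(AqV2\*(Aq, read_at($fh, $rpos, 8);
\&                if ($klen == length($key)
\&                        && read_at($fh, $rpos + 8, $klen) eq $key) {
\&                    return read_at($fh, $rpos + 8 + $klen, $dlen);
\&                }
\&            }
\&            $slot = ($slot + 1) % $tlen;    # next slot, wrapping (assumed)
\&        }
\&        return;
\&    }
.Ve
.PP
With these definitions, \f(CW\*(C`cdb_find($path, $key)\*(C'\fR reads a single value
without loading the module, which may help when checking a file by hand.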
.SH BUGS
.IX Header "BUGS"
The \f(CWcreate()\fR interface could be done with \f(CW\*(C`TIEHASH\*(C'\fR.
.SH "SEE ALSO"
.IX Header "SEE ALSO"
\&\fBcdb\fR\|(3)
.SH AUTHOR
.IX Header "AUTHOR"
Tim Goodwin, <tjg@star.le.ac.uk>.  \fBCDB_File\fR began on 1997\-01\-08.
.PP
Work provided through 2008 by Matt Sergeant, <matt@sergeant.org>.
.PP
Now maintained by Todd Rinaldo, <toddr@cpan.org>.