.\" $MawkId: mawk-arrays.7,v 1.14 2024/01/24 00:12:25 tom Exp $
.\" ###########################################################################
.\" # copyright 2008-2020,2024, Thomas E. Dickey
.\" # copyright 1996, Michael D. Brennan
.\" #
.\" # This is a source file for mawk, an implementation of
.\" # the AWK programming language.
.\" #
.\" # Mawk is distributed without warranty under the terms of
.\" # the GNU General Public License, version 2, 1991.
.\" ###########################################################################
.ds N Mawk
.ds n mawk
.TH MAWK-ARRAYS 7 2024-01-23 "Version 1.3.4" Miscellaneous
.\" Bulleted paragraph
.de bP
.ie n  .IP \(bu 4
.el    .IP \(bu 2
..
.\" Escape single quotes in literal strings from groff's Unicode transform.
.ie \n(.g \{\
.ds `` \(lq
.ds '' \(rq
.ds '  \(aq
.\}
.el \{\
.ie t .ds `` ``
.el   .ds `` ""
.ie t .ds '' ''
.el   .ds '' ""
.ie t .ds '  \(aq
.el   .ds '  '
.\}
.\" superscript
.de SU
.ie n \\$1**(\\$2)\\$3
.el   \\$1\u\s-1\\$2\s+1\d\\$3
..
.\" **************************************************************************
.SH NAME
mawk-arrays \-
design notes for mawk's array implementation
.\" **************************************************************************
.SH SYNOPSIS
This is the documentation for the \fB\*n\fP implementation
of awk arrays.
Arrays in awk are associations of strings to awk scalar
values.
The \fB\*n\fP implementation stores the associations in
hash tables.
The hash table scheme was influenced by
and is similar to
the design presented in Griswold and Townsend,
.IR "The Design and Implementation of Dynamic Hashing Sets and Tables in Icon" ,
.BR "Software Practice and Experience" ,
23, 351-367, 1993.
.SH "DATA STRUCTURES"
.SS "Array Structure"
The type \fBARRAY\fP is a pointer to a \fBstruct array\fP.
The \fIsize\fP field is the number of elements in the table.
The meaning of the other fields depends on the \fItype\fP field.
.PP
There are three types of arrays and these are distinguished by the
\fItype\fP field in the structure.
The types are:
.TP 5
\fBAY_NULL\fP
The array is empty and the \fIsize\fP field is always zero.
The other fields have no meaning.
.TP 5
\fBAY_SPLIT\fP
The array was created by the \fIAWK\fP built-in
\fIsplit\fP.
The return value from \fIsplit\fP is stored in the \fIsize\fP
field.
The \fIptr\fP field points at a vector of \fBCELL\fPs.
The number
of \fBCELL\fPs is the \fIlimit\fP field.
It is always true that
\fIsize\fP\ \[<=]\ \fIlimit\fP.
The address of \fIA[i\fP] is \fI(CELL*)A->ptr+i-1\fP
for 1\[<=]\ \fIi\fP\ \[<=] \fIsize\fP.
The \fIhmask\fP field has no meaning.
.TP 5
\fIHash Table\fP
The array is a hash table.
If the \fBAY_STR\fP bit in the \fItype\fP field is set,
then the table is keyed on strings.
If the \fBAY_INT\fP bit in the \fItype\fP field is set, then the table is
keyed on integers.
Both bits can be set, and then the two keys are
consistent, i.e., look up of \fIA[-14\fP] and \fIA["-14"\fP] will
return identical \fICELL\fP pointers although the look up methods will
be different.
In this case, the \fIsize\fP field is the number of hash nodes
in the table.
When insertion of a new element would cause \fIsize\fP to
exceed \fIlimit\fP, the table grows by doubling the number of hash chains.
The invariant,
(\fIhmask\fP+1)\fImax\_ave\_list\_length\fP=\fIlimit\fP is always true.
\fIMax\_ave\_list\_length\fP is a tunable constant.
.SS "Hash Tables"
The hash tables are linked lists of nodes, called \fBANODE\fPs.
The number of lists is \fIhmask+1\fP which is always a power of two.
The \fIptr\fP field points at a vector of list heads.
Since there are
potentially two types of lists, integer lists and strings lists,
each list head is a structure, \fIDUAL_LINK\fP.
.PP
The string lists are chains connected by \fIslinks\fP and the integer
lists are chains connected by \fIilinks\fP.
We sometimes refer to these
lists as slists and ilists, respectively.
The elements on the lists are \fBANODE\fPs.
The fields of an \fBANODE\fP are:
.PP
\fIslink\fP The link field for slists.
\fIilink\fP The link field for ilists.
\fIsval\fP If non-null, then \fIsval\fP is a pointer to a string
key.
For a given table, if the \fBAY_STR\fP bit is set then every
\fBANODE\fP has a non-null \fIsval\fP field and conversely, if \fBAY_STR\fP
is not set, then every \fIsval\fP field is null.
.PP
\fIhval\fP The hash value of \fIsval\fP.
This field has no
meaning if \fIsval\fP is null.
.PP
\fIival\fP The integer key.
The field has no meaning if set
to the constant, \fINOT_AN_IVALUE\fP.
If the \fBAY_STR\fP bit is off,
then every \fBANODE\fP will have a valid \fIival\fP field.
If the
\fBAY_STR\fP bit is on, then the \fIival\fP field may or may not be
valid.
.PP
\fIcell\fP The data field in the hash table.
\endhitems
.PP
So the value of \fIA\fP[\fIexpr\fP is stored in the \fIcell\fP field, and if
\fIexpr\fP is an integer, then \fIexpr\fP is stored in \fIival\fP, else it
is stored in \fIsval\fP.
.SH "ARRAY OPERATIONS"
The functions that operate on arrays are,
.TP 5
\fICELL* array_find(ARRAY A, CELL *cp, int create_flag)\fP
returns a pointer to \fIA\fP[\fIexpr\fP]
where \fIcp\fP is a pointer to the \fICELL\fP holding \fIexpr\fP.
If the \fIcreate_flag\fP is on and \fIexpr\fP is not
an element of \fIA\fP, then the element is created with value \fInull\fP.
.TP 5
\fIvoid array_delete(ARRAY A, CELL *cp)\fP
removes an element \fIA\fP[\fIexpr\fP from the array \fIA\fP.
\fIcp\fP points at the \fICELL\fP holding
\fIexpr\fP.
.TP 5
\fIvoid array_load(ARRAY A, size_t cnt)\fP
builds a split array.
The values \fIA[1..cnt\fP] are moved into \fIA\fP from an anonymous
buffer with \fItransfer_to_array()\fP which is declared in
\fIsplit.h\fP.
.TP 5
\fIvoid array_clear(ARRAY A)\fP removes all elements of \fIA\fP.
The type of \fIA\fP is then \fBAY_NULL\fP.
.TP 5
\fISTRING** array_loop_vector(ARRAY A, size_t *sizep)\fP
returns a pointer to a linear vector
that holds all the strings that are indices of \fIA\fP.
The size of the the vector is returned indirectly in \fI*sizep\fP.
If \fIA->size\fP\[==]\fB0\fP, a \fInull\fP pointer is returned.
.TP 5
\fICELL* array_cat(CELL *sp, int cnt)\fP
concatenates the elements
of \fIsp[1-cnt..0]\fP, with each element separated by \fISUBSEP\fP, to
compute an array index.
For example, on a reference to \fIA\fP[i,j],
\fIarray_cat\fP computes \fIi\fP \[ci] \fISUBSEP \[ci] \fIj\fP where
\[ci] denotes concatenation.
.SS "Array Find"
Any reference to A[\fIexpr\fP] creates a call to
\fIarray_find(A,cp,CREATE)\fP where \fIcp\fP points at the cell holding
\fIexpr\fP.
The test, \fIexpr\fP \fBin\fP \fIA\fP, creates a call to
\fIarray_find(A,cp,NO_CREATE)\fP.

\fIArray_find\fP is a hash-table lookup function that handles two cases:
.TP 5
1.
If \fI*cp\fP is numeric and integer valued, then lookup by
integer value using \fIfind_by_ival\fP.
If \fI*cp\fP is numeric, but not
integer valued, then convert to string with \fIsprintf(CONVFMT,...)\fP and
go to case~2.
.TP 5
2.
If \fI*cp\fP is string valued, then lookup by string value
using \fIfind_by_sval\fP.
\endlist
.PP
To test whether \fIcp->dval\fP is integer, we convert to the nearest
integer by rounding towards zero (done by \fIdo_to_I\fP) and then cast
back to double.
If we get the same number we started with, then
\fIcp->dval\fP is integer valued.
.PP
When we get to the function \fIfind_by_ival\fP, the search has been reduced
to lookup in a hash table by integer value.
.PP
When a search by integer value fails, we have to check by string
value to correctly
handle the case insertion by \fIA["123"\fP] and later search as
\fIA[123\fP].
This string search is necessary if and only if the
\fBAY_STR\fP bit is set.
An important point is that all \fBANODE\fPs get
created with a valid \fIsval\fP if \fBAY_STR\fP is set, because then creation
of new nodes always occurs in a call to \fIfind_by_sval\fP.
.PP
Searching by string value is easier because \fIAWK\fP arrays are really
string associations.
If the array does not have the \fBAY_STR\fP bit set,
then we have to convert the array to a dual hash table with strings
which is done by the function \fIadd_string_associations\fP.
.PP
One \fIInt\fP value is reserved to show that the \fIival\fP field is invalid.
This works because \fId_to_I\fP returns a value in \fI[-Max_Int, Max_Int\fP].
.PP
On entry to \fIadd_string_associations\fP, we know that the \fBAY_STR\fP bit
is not set.
We convert to a dual hash table, then walk all the integer
lists and put each \fBANODE\fP on a string list.
.SS "Array Delete"
The execution of the statement, \fBdelete\fP \fIA\fP[\fIexpr\fP], creates a
call to \fIarray_delete(ARRAY A, CELL *cp)\fP.
Depending on the
type of \fI*cp\fP, the call is routed
to \fIfind_by_sval\fP or \fIfind_by_ival\fP.
Each of these functions leaves its return value on the front of an
\fIslist\fP or \fIilist\fP, respectively,
and then it is deleted from the front of
the list.
The case where \fIA\fP[\fIexpr\fP is on two lists, e.g.,
\fIA[12\fP] and \fIA["12"\fP] is checked by examining the \fIsval\fP and
\fIival\fP fields of the returned \fBANODE*\fP.
.PP
Even though we found a node by searching an \fIilist\fP it might also
be on an \fIslist\fP and vice-versa.
.PP
When the size of a hash table drops below a certain value, it might
be profitable to shrink the hash table.
Currently we don't do this,
because our guess is that it would be a waste of time for most
\fIAWK\fP applications.
However, we do convert an array to \fBAY_NULL\fP
when the size goes to zero which would resize a large hash table
that had been completely cleared by successive deletions.
.SS "Building an Array with Split"
A simple operation is to create an array with the \fIAWK\fP
primitive \fIsplit\fP.
The code that performs \fIsplit\fP puts the
pieces in an anonymous buffer.
\fIarray_load(A, cnt)\fP moves the \fIcnt\fP elements from the anonymous
buffer into \fIA\fP.
This is the only way an array of type \fBAY_SPLIT\fP is
created.
.PP
If the array \fIA\fP is a split array and big enough then we reuse it,
otherwise we need to allocate a new split array.
When we allocate a block of \fBCELL\fPs for a split array, we round up
to a multiple of 4.
.SS "Array Clear"
The function \fIarray_clear(ARRAY A)\fP converts \fIA\fP to type \fBAY_NULL\fP
and frees all storage used by \fIA\fP except for the \fIstruct array\fP
itself.
This function gets called in three contexts:
.TP 5
(1)
when an array local to a user function goes out of scope,
.TP 5
(2)
execution of the \fIAWK\fP statement, \fIdelete A\fP and
.TP 5
(3)
when an existing changes type or size from \fIsplit()\fP.
.SS "Constructor and Conversions"
Arrays are always created as empty arrays of type \fBAY_NULL\fP.
Global arrays are never destroyed although they can go empty or have
their type change by conversion.
The only constructor function is
a macro.
.PP
Hash tables only get constructed by conversion.
This happens in two
ways.
The function \fImake_empty_table\fP converts an empty array of type
\fBAY_NULL\fP to an empty hash table.
The number of lists in the table
is a power of 2 determined by the constant \fISTARTING_HMASK\fP.
The limit size of the table is determined by the constant
\fIMAX_AVE_LIST_LENGTH\fP which is the largest average size of the hash
lists that we are willing to tolerate before enlarging the table.
When \fIA->size\fP exceeds \fIA->limit\fP,
the hash table grows in size by doubling the number of lists.
\fIA->limit\fP is then reset to \fIMAX_AVE_LIST_LENGTH\fP times
\fIA->hmask+1\fP.
.PP
The other way a hash table gets constructed is when a split array is
converted to a hash table of type \fBAY_INT\fP.
.PP
To determine the size of the table, we set the initial size to
\fISTARTING_HMASK+1\fP and then double the size until
\fIA->size\fP\ \[<=]\ \fIA->limit\fP.
.SS "Doubling the Size of a Hash Table"
The whole point of making the table size a power of two is to
facilitate resizing the table.
If the table size is
.SU 2 n
and
\fIh\fP is the hash key, then \fIh\fP\ \fBmod\fP
.SU 2 n
is the hash chain index
which can be calculated with bit-wise and,
\fIh\fP &
.SU (2 n-1 ).
When the table size doubles, the new bit-mask has one more bit
turned on.
Elements of an old hash chain whose hash value have this bit
turned on get moved to a new chain.
Elements with this bit turned off
stay on the same chain.
On average only half the old chain moves to the
new chain.
If the old chain is at \fItable\fP[i],\ 0\ \[<=]\ \fIi\fP <
.SU 2 n ,
then the elements that move, all move to the new chain at
\fItable\fP[i +
.SU 2 n ].
.PP
As we walk an old string list with pointer \fIp\fP, the expression
\fIp->hval & new_hmask\fP takes one of two values.
If it is equal
to \fIp->hval & old_hmask\fP (which equals \fIi\fP),
then the node stays otherwise it gets moved
to a new string list at \fIj\fP.
The new string list preserves order so that
the positions of the move-to-the-front heuristic are preserved.
Nodes moving to the new list are appended at pointer \fItail\fP.
The \fBANODE\fPs, \fIdummy0\fP~and \fIdummy1\fP, are sentinels that remove
special handling of boundary conditions.
.PP
The doubling of the integer lists is exactly the same except that
\fIslink\fP is replaced by \fIilink\fP and \fIhval\fP is replaced by \fIival\fP.
.SS "Array Loops"
Our mechanism for dealing with execution of the statement,
.RS
.PP
\fBfor \fP(\fIi\fP in \fIA\fP) { \fIstatements\fP }
.RE
.PP
is simple.
We allocate a vector of \fISTRING*\fP of size,
\fIA->size\fP.
Each element of the vector is a string key for~\fIA\fP.
Note that if the \fBAY_STR\fP bit of \fIA\fP is not set, then \fIA\fP
has to be converted to a string hash table, because the index
\fIi\fP walks string indices.
.PP
To execute the loop, the only state that needs to be saved is the
address of \fIi\fP and an index into the vector of string keys.
Since
nothing about \fIA\fP is saved as state, the user
program can do anything to \fIA\fP inside the body of
the loop, even \fIdelete A\fP, and the loop
still works.
Essentially, we have traded data space (the string vector)
in exchange for implementation simplicity.
On a 32-bit system, each
\fBANODE\fP is 36 bytes, so the extra memory needed for the array loop is
11\% more than the memory consumed by the \fBANODE\fPs of the array.
Note that the large size of the \fBANODE\fPs is indicative of our whole
design which pays data space for integer lookup speed and algorithm
simplicity.

The only aspect of array loops that occurs in \fIarray.c\fP is construction
of the string vector.
The rest of the implementation
is in the file \fIexecute.c\fP.
.PP
As we walk over the hash table \fBANODE\fPs, putting each \fIsval\fP in
\fIret\fP, we need to increment each reference count.
The user of the
return value is responsible for these new reference counts.
.SS "Concatenating Array Indices"
In \fIAWK\fP, an array expression \fIA[i,j\fP] is equivalent to the
expression \fIA[i SUBSEP j\fP], i.e., the index is the
concatenation of the three
elements \fIi\fP, \fISUBSEP\fP and \fIj\fP.
This is performed by the
function \fIarray_cat\fP.
On entry, \fIsp\fP points at the top of a
stack of \fBCELL\fPs.
\fICnt\fP cells are popped off the stack and concatenated together
separated by \fISUBSEP\fP and the result is pushed back on the stack.
On entry, the first multi-index is in \fIsp[1-cnt\fP] and the last is
in \fIsp[0\fP].
The return value is the new stack top.
(The stack is the run-time evaluation stack.
This operation really has nothing to do with array structure, so
logically this code belongs in \fIexecute.c\fP, but remains here for
historical reasons.)
.PP
We make a copy of \fISUBSEP\fP which we can cast to string in the
unlikely event the user has assigned a number to \fISUBSEP\fP.
.PP
Set \fIsp\fP and \fItop\fP so the cells to concatenate are inclusively
between \fIsp\fP and \fItop\fP.
.PP
The \fItotal_len\fP is the sum of the lengths of the \fIcnt\fP
strings and the \fIcnt-1\fP copies of \fIsubsep\fP.
.PP
The return value is \fIsp\fP and it is already set correctly.
We just need to free the strings and set the contents of \fIsp\fP.