NAME¶
ent - pseudorandom number sequence test
SYNOPSIS¶
ent [options] [file]
DESCRIPTION¶
ENT Logo
ent performs a variety of tests on the stream of bytes in
file (or
standard input if no
file is specified) and produces output on standard
output; for example:
Entropy = 7.980627 bits per character.
Optimum compression would reduce the size
of this 51768 character file by 0 percent.
Chi square distribution for 51768 samples is 1542.26, and randomly
would exceed this value 0.01 percent of the times.
Arithmetic mean value of data bytes is 125.93 (127.5 = random).
Monte Carlo value for Pi is 3.169834647 (error 0.90 percent).
Serial correlation coefficient is 0.004249 (totally uncorrelated = 0.0).
The values calculated are as follows:
ENTROPY¶
The information density of the contents of the file, expressed as a number of
bits per character. The results above, which resulted from processing an image
file compressed with JPEG, indicate that the file is extremely dense in
information—essentially random. Hence, compression of the file is
unlikely to reduce its size. By contrast, the C source code of the program has
entropy of about 4.9 bits per character, indicating that optimal compression
of the file would reduce its size by 38%. [Hamming, pp. 104-108]
CHI-SQUARE TEST¶
The chi-square test is the most commonly used test for the randomness of data,
and is extremely sensitive to errors in pseudorandom sequence generators. The
chi-square distribution is calculated for the stream of bytes in the file and
expressed as an absolute number and a percentage which indicates how
frequently a truly random sequence would exceed the value calculated. We
interpret the percentage as the degree to which the sequence tested is
suspected of being non-random. If the percentage is greater than 99% or less
than 1%, the sequence is almost certainly not random. If the percentage is
between 99% and 95% or between 1% and 5%, the sequence is suspect. Percentages
between 90% and 95% and 5% and 10% indicate the sequence is "almost
suspect". Note that our JPEG file, while very dense in information, is
far from random as revealed by the chi-square test.
Applying this test to the output of various pseudorandom sequence generators is
interesting. The low-order 8 bits returned by the standard Unix
rand(1)
function, for example, yields:
Chi square distribution for 500000 samples is 0.01, and randomly
would exceed this value 99.99 percent of the times.
While an improved generator [Park & Miller] reports:
Chi square distribution for 500000 samples is 212.53, and randomly
would exceed this value 95.00 percent of the times.
Thus, the standard Unix generator (or at least the low-order bytes it returns)
is unacceptably non-random, while the improved generator is much better but
still sufficiently non-random to cause concern for demanding applications.
Contrast both of these software generators with the chi-square result of a
genuine random sequence created by timing radioactive decay events[1]:
Chi square distribution for 32768 samples is 237.05, and randomly
would exceed this value 75.00 percent of the times.
See [Knuth, pp. 35-40] for more information on the chi-square test. An
interactive chi-square calculator[2] is available at this site.
ARITHMETIC MEAN¶
This is simply the result of summing all the bytes (bits if the
-b option
is specified) in the file and dividing by the file length. If the data are
close to random, this should be about 127.5 (0.5 for
-b option output).
If the mean departs from this value, the values are consistently high or low.
MONTE CARLO VALUE FOR PI¶
Each successive sequence of six bytes is used as 24 bit X and Y coordinates
within a square. If the distance of the randomly-generated point is less than
the radius of a circle inscribed within the square, the six-byte sequence is
considered a "hit". The percentage of hits can be used to calculate
the value of Pi. For very large streams (this approximation converges very
slowly), the value will approach the correct value of Pi if the sequence is
close to random. A 32768 byte file created by radioactive decay yielded:
Monte Carlo value for Pi is 3.139648438 (error 0.06 percent).
SERIAL CORRELATION COEFFICIENT¶
This quantity measures the extent to which each byte in the file depends upon
the previous byte. For random sequences, this value (which can be positive or
negative) will, of course, be close to zero. A non-random byte stream such as
a C program will yield a serial correlation coefficient on the order of 0.5.
Wildly predictable data such as uncompressed bitmaps will exhibit serial
correlation coefficients approaching 1. See [Knuth, pp. 64-65] for more
details.
OPTIONS¶
- -b
- The input is treated as a stream of bits rather than of 8-bit bytes.
Statistics reported reflect the properties of the bitstream.
- -c
- Print a table of the number of occurrences of each possible byte (or bit,
if the -b option is also specified) value, and the fraction of the
overall file made up by that value. Printable characters in the ISO-8859-1
(Latin-1) character set are shown along with their decimal byte values. In
non-terse output mode, values with zero occurrences are not printed.
- -f
- Fold upper case letters to lower case before computing statistics. Folding
is done based on the ISO-8859-1 (Latin-1) character set, with accented
letters correctly processed.
- -t
- Terse mode: output is written in Comma Separated Value (CSV) format,
suitable for loading into a spreadsheet and easily read by any programming
language. See Terse Mode Output Format below for additional details.
- -u
- Print how-to-call information.
FILES¶
If no
file is specified,
ent obtains its input from standard
input. Output is always written to standard output.
TERSE MODE¶
Terse mode is selected by specifying the
-t option on the command line.
Terse mode output is written in Comma Separated Value (CSV) format, which can
be directly loaded into most spreadsheet programs and is easily read by any
programming language. Each record in the CSV file begins with a record type
field, which identifies the content of the following fields. If the
-c
option is not specified, the terse mode output will consist of two records, as
follows:
0,File-bytes,Entropy,Chi-square,Mean,Monte-Carlo-Pi,Serial-Correlation
1,file_length,entropy,chi_square,mean,Pi_value,correlation
where the italicised values in the type 1 record are the numerical values for
the quantities named in the type 0 column title record. If the_ -b_ option is
specified, the second field of the type 0 record will be
"File-bits", and the file_length field in type 1 record will be
given in bits instead of bytes. If the
-c option is specified,
additional records are appended to the terse mode output which contain the
character counts:
2,Value,Occurrences,Fraction
3,v,count,fraction
...
If the
-b option is specified, only two type 3 records will appear for
the two bit values v=0 and v=1. Otherwise, 256 type 3 records are included,
one for each possible byte value. The second field of a type 3 record
indicates how many bytes (or bits) of value v appear in the input, and
fraction gives the decimal fraction of the file which has value v (which is
equal to the count value of this record divided by the file_length field in
the type 1 record).
BUGS¶
Note that the "optimal compression" shown for the file is computed
from the byte- or bit-stream entropy and thus reflects compressibility based
on a reading frame of the chosen width (8-bit bytes or individual bits if the
-b option is specified). Algorithms which use a larger reading frame,
such as the Lempel-Ziv [Lempel & Ziv] algorithm, may achieve greater
compression if the file contains repeated sequences of multiple bytes.
COPYING¶
This software is in the public domain. Permission to use, copy, modify, and
distribute this software and its documentation for any purpose and without fee
is hereby granted, without any conditions or restrictions. This software is
provided "as is" without express or implied warranty.
Original text and program by John Walker[3] October 20th, 1998
Modifications by Wesley J. Landaker < wjl@icecavern.net
⟨mailto:wjl@icecavern.net⟩ >, released under the same terms
as above.
SEE ALSO¶
Introduction to Probability and Statistics[4]
- [Hamming]
- Hamming, Richard W. Coding and Information Theory. Englewood Cliffs NJ:
Prentice-Hall, 1980.
- [Knuth]
- Knuth, Donald E. The Art of Computer Programming, Volume 2 / Seminumerical
Algorithms. Reading MA: Addison-Wesley, 1969. ISBN 0-201-89684-2.
- [Lempel & Ziv]
- Ziv J. and A. Lempel. "A Universal Algorithm for Sequential Data
Compression". IEEE Transactions on Information Theory 23, 3, pp.
337-343.
- [Park & Miller]
- Park, Stephen K. and Keith W. Miller. "Random Number Generators: Good
Ones Are Hard to Find". Communications of the ACM, October 1988, p.
1192.
- [1]
- ⟨http://www.fourmilab.ch/hotbits/⟩
- [2]
- ⟨http://www.fourmilab.ch/rpkp/experiments/analysis/chiCalc.html⟩
- [3]
- ⟨http://www.fourmilab.ch/⟩
- [4]
- ⟨http://www.fourmilab.ch/rpkp/experiments/statistics.html⟩