NAME
PopulationPathScan - apply PathScan test to populations rather than just single
individuals
SYNOPSIS
use PopulationPathScan;
my $obj = PopulationPathScan->new ($ref_to_list_of_gene_lengths);
$obj->assign ($number_of_compartments);
$obj->preprocess ($background_mutation_rate);
my $pval = $obj->population_pval_approx ($ref_to_list_of_hits_per_sample);
$pval = $obj->population_pval_exact ($ref_to_list_of_hits_per_sample);
DESCRIPTION
The "PathScan" package is implemented strictly as a test of a set of genes,
e.g. a pathway, for a single individual. Specifically, knowing
the gene lengths in the pathway, the number of genes that have at least one
mutation, and the estimated background mutation rate, one can test the null
hypothesis that these observed mutations are well-explained simply by the
mechanism of random background mutation. However, it will often be the case
that data for a pathway will be available for many individuals, meaning that
we now have many tests of the given (single) hypothesis. (This should not be
confused with the scenario of multiple hypothesis testing.) The resulting set
of p-values contains much more information than any single value, so
significance should be judged on the basis of the collective result. For
example, while no single p-value by itself may reach the chosen significance
threshold, the set as a whole may still indicate significance. Properly
combining such values is a necessary, but not entirely trivial, task. This
package serves as a high-level interface to first
perform individual tests using the methods of "PathScan", and then
to properly combine the resulting p-values using the methods of
"CombinePvals".
AUTHOR
Michael C. Wendl
mwendl@wustl.edu
Copyright (C) 2009 Washington University
This program is free software; you can redistribute it and/or modify it under
the terms of the GNU General Public License as published by the Free Software
Foundation; either version 2 of the License, or (at your option) any later
version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY
WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR
A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with
this program; if not, write to the Free Software Foundation, Inc., 59 Temple
Place - Suite 330, Boston, MA 02111-1307, USA.
METHODS
The available methods are listed below.
new
The object constructor takes a mandatory reference to a list of gene lengths,
in any order, comprising the biological group (e.g. a pathway) whose mutation
significance is to be analyzed using the PathScan paradigm.
my $obj = PopulationPathScan->new ([474, 1038, 285, ...]);
The method checks to make sure that all elements are legitimate lengths, i.e.
integers exceeding 3.
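The check described above can be pictured with the following minimal sketch.
It is not the module's own code: the helper name is_valid_gene_length is
hypothetical, and the only rule it encodes is the one stated here (an integer
greater than 3).

```perl
use strict;
use warnings;

# Hypothetical helper, NOT part of PopulationPathScan: returns 1 if the
# value is a plain non-negative integer string that is greater than 3.
sub is_valid_gene_length {
    my ($value) = @_;
    return 0 unless defined($value);
    return 0 unless $value =~ /^\d+$/;   # digits only: no sign, no decimals
    return $value > 3 ? 1 : 0;
}

# Screen a list before handing it to the constructor.
my @lengths = (474, 1038, 285);
my @bad     = grep { !is_valid_gene_length($_) } @lengths;
die "invalid gene lengths: @bad\n" if @bad;
```

Screening the list up front like this lets a caller report every offending
entry at once rather than stopping at the first one.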
assign
This method assigns the manner in which genes will be internally organized for
passing to the PathScan calculation component. The main consideration here is
how the list may be compartmentalized for greater computational efficiency,
though at some loss of accuracy, for the PathScan calculation. If the gene
list is long, exact calculation is generally infeasible. The method takes a
single argument giving the number of compartments (or sub-lists) into which
the lengths will be divided: 1 means a single list, i.e. exact computation, 2
means two lists, 3 means three lists, etc.
$obj->assign (3);
The values are then organized internally such that the smallest genes are
grouped together, then the slightly larger ones, and so forth. Generally, 3 or
4 lists give a reasonable balance between accuracy and computation (Wendl et
al., in progress).
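The grouping by size can be sketched as follows. This is an illustration
under stated assumptions, not the module's internals: the helper name
compartmentalize is hypothetical, and the rule of cutting the sorted list into
groups of nearly equal size is an assumption, since this page does not say
where the boundaries are placed.

```perl
use strict;
use warnings;

# Hypothetical helper, NOT part of PopulationPathScan: sort the lengths in
# ascending order and cut the sorted list into $k contiguous groups whose
# sizes differ by at most one (ASSUMED splitting rule).
sub compartmentalize {
    my ($lengths, $k) = @_;
    my @sorted = sort { $a <=> $b } @$lengths;
    my $n      = @sorted;
    die "need 1 <= k <= number of genes\n" if $k < 1 || $k > $n;
    my $base  = int($n / $k);
    my $extra = $n % $k;          # the first $extra groups get one more gene
    my @groups;
    for my $i (0 .. $k - 1) {
        my $size = $base + ($i < $extra ? 1 : 0);
        push @groups, [ splice(@sorted, 0, $size) ];
    }
    return \@groups;
}

my $groups = compartmentalize([474, 1038, 285, 2210, 951, 630, 1500], 3);
print join(' ', @$_), "\n" for @$groups;
```

With the seven lengths shown, the three groups are (285 474 630), (951 1038)
and (1500 2210): the smallest genes together, then the next larger ones.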
preprocess
This method pre-processes the population-level calculation. Specifically, it
sets up and executes the PathScan module to obtain the CDF associated with the
given gene set and background mutation rate. It takes the latter as an
argument.
$obj->preprocess (0.0000027);
The CPU time this method needs depends on the level of accuracy and the number
of genes in the calculation.
If the number of mutated genes in the group for each sample is already known at
this point, the method optionally takes that list as a second argument:
$obj->preprocess (0.0000027, [4, 5, 7, 3, 0, ...]);
It is usually better to use this form because the internals will compute only a
truncated CDF that is just sufficient to process this list, rather than the
full CDF. This improves speed and also helps avoid overflow errors for large
pathways.
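The idea of a truncated distribution can be illustrated with a small
stand-alone sketch. It does not reproduce the PathScan internals. It assumes
that each base mutates independently at the background rate, so that a gene of
length L is hit with probability 1 - (1 - rate)^L, and it builds the
distribution of the number of mutated genes only up to the largest observed
count. All helper names (gene_hit_prob, truncated_pmf, tail_pval) are
hypothetical.

```perl
use strict;
use warnings;
use List::Util qw(max);

# Probability that a gene of $length bases carries at least one mutation,
# ASSUMING each base mutates independently with probability $rate.
sub gene_hit_prob {
    my ($length, $rate) = @_;
    return 1 - (1 - $rate) ** $length;
}

# Probabilities of exactly 0 .. $max_hits mutated genes. Counts above
# $max_hits are never stored, so the array stays short for large pathways.
sub truncated_pmf {
    my ($lengths, $rate, $max_hits) = @_;
    my @pmf = (1);                          # no genes yet: P(0 hits) = 1
    for my $length (@$lengths) {
        my $p    = gene_hit_prob($length, $rate);
        my $top  = $#pmf < $max_hits ? $#pmf + 1 : $max_hits;
        my @next = (0) x ($top + 1);
        for my $k (0 .. $#pmf) {
            $next[$k]     += $pmf[$k] * (1 - $p);               # gene not hit
            $next[$k + 1] += $pmf[$k] * $p if $k + 1 <= $top;   # gene hit
        }
        @pmf = @next;
    }
    return \@pmf;
}

# Upper-tail p-value P(X >= $hits) = 1 - P(X <= $hits - 1).
# Assumes $hits does not exceed the number of genes.
sub tail_pval {
    my ($pmf, $hits) = @_;
    my $below = 0;
    $below += $pmf->[$_] for 0 .. $hits - 1;
    return 1 - $below;
}

my @lengths = (474, 1038, 285, 2210, 951);
my @hits    = (1, 0, 2);
my $pmf     = truncated_pmf(\@lengths, 0.0000027, max(@hits));
printf "hits=%d  p=%.6g\n", $_, tail_pval($pmf, $_) for @hits;
```

The work per gene is bounded by the largest observed count rather than by the
number of genes. Note that subtracting from 1 loses precision for very small
p-values, which a production implementation would need to handle.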
population_pval_exact
This method performs the population-level calculation using exact enumeration.
It takes the list of the number of mutated genes in the group for each sample
(e.g. each patient's whole genome sequence), for example
patient 1: 4 genes in the pathway are mutated
patient 2: 5 genes in the pathway are mutated
patient 3: 7 genes in the pathway are mutated
patient 4: 3 genes in the pathway are mutated
patient 5: 0 genes in the pathway are mutated
: : : : : : : : :
which is invoked as
$pval = $obj->population_pval_exact ([4, 5, 7, 3, 0, ...]);
Most scenarios will not actually be able to make use of this method because
enumeration of all possible cases is rarely computationally feasible. This
method will mostly be useful for examining small test cases.
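To see why enumeration becomes infeasible, consider the following brute-force
sketch. It is one possible reading of "exact enumeration" and is not
necessarily the algorithm used by "CombinePvals". It assumes that every sample
shares the same null distribution of hit counts, that each sample is scored by
its upper-tail p-value, and that the population statistic is the product of
those p-values. The helper names are hypothetical.

```perl
use strict;
use warnings;

# Upper-tail p-value P(X >= k) for every possible hit count k.
sub tail_pvals {
    my ($pmf) = @_;
    my @tail;
    my $sum = 0;
    for my $k (reverse 0 .. $#$pmf) {
        $sum += $pmf->[$k];
        $tail[$k] = $sum;
    }
    return \@tail;
}

# Sum the null probability of every possible tuple of hit counts whose
# product of p-values is no larger than the observed product.
sub enumerate_population_pval {
    my ($pmf, $hits) = @_;
    my $tail     = tail_pvals($pmf);
    my $observed = 1;
    $observed *= $tail->[$_] for @$hits;
    my $limit = $observed * (1 + 1e-12);    # tolerance for rounding error
    my $total = 0;
    my $walk;
    $walk = sub {
        my ($depth, $prob, $product) = @_;
        if ($depth == @$hits) {
            $total += $prob if $product <= $limit;
            return;
        }
        for my $k (0 .. $#$pmf) {
            $walk->($depth + 1, $prob * $pmf->[$k], $product * $tail->[$k]);
        }
    };
    $walk->(0, 1, 1);
    undef $walk;                            # break the closure's self-reference
    return $total;
}

# Toy null distribution over 0, 1 or 2 mutated genes, and three samples.
my $pval = enumerate_population_pval([0.90, 0.09, 0.01], [1, 0, 2]);
printf "population p-value = %.6g\n", $pval;
```

For a pathway of G genes and n samples this visits (G + 1)^n tuples, which is
why such an approach is only practical for small test cases.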
population_pval_approx
This method performs the population-level calculation using Lancaster's
approximate transform correction. It takes, as a mandatory argument, the list
of the number of mutated genes in the group for each sample, e.g. each
patient's whole genome sequence.
$pval = $obj->population_pval_approx ([4, 5, 7, 3, 0, ...]);
You must pass the list of hits, even if you already passed this list earlier to
the pre-processing method. Most cases will use this method because exact
combination of individual probability values is rarely computationally
feasible. Note that Lancaster's method typically gives much better (more
accurate) results than Fisher's "standard" chi-square transform.
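For comparison, Fisher's chi-square transform mentioned above can be sketched
in a few lines. This shows only the Fisher baseline, not Lancaster's
correction and not the code in "CombinePvals"; the function name
fisher_combined_pval is hypothetical. Fisher's statistic assumes p-values that
are uniform under the null hypothesis, which does not hold exactly for
discrete hit counts.

```perl
use strict;
use warnings;

# Fisher's combined p-value: X = -2 * sum(ln p_i) is referred to a
# chi-square distribution with 2n degrees of freedom. For even degrees of
# freedom the upper tail has a closed form:
#   P(X >= x) = exp(-x/2) * sum_{j=0}^{n-1} (x/2)^j / j!
sub fisher_combined_pval {
    my ($pvals) = @_;
    my $half = 0;                           # x/2 = -sum(ln p_i)
    for my $p (@$pvals) {
        die "p-values must lie in (0, 1]\n" unless $p > 0 && $p <= 1;
        $half -= log($p);
    }
    my ($term, $sum) = (1, 1);              # j = 0 term of the series
    for my $j (1 .. $#$pvals) {
        $term *= $half / $j;                # (x/2)^j / j!
        $sum  += $term;
    }
    return exp(-$half) * $sum;
}

# Two individually borderline results combine to a smaller p-value.
printf "combined p = %.6g\n", fisher_combined_pval([0.05, 0.05]);
```

With a single p-value the function returns that value unchanged, which is a
quick sanity check on the closed form.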
• Fisher, R. A. (1958) Statistical Methods for Research Workers, 13th Ed.
  Revised, Hafner Publishing Co., New York.
• Lancaster, H. O. (1949) The Combination of Probabilities Arising from Data
  in Discrete Distributions, Biometrika 36(3/4), 370-382.