To quote Albert Einstein: *"Erst die Theorie entscheidet darüber, was man beobachten kann"*.
The theory decides first what can be observed. The deductive approach starts from assumptions and first principles, upon which a mechanistic model
is built that predicts what could be observed as a consequence of the theory.

Statistics is the science of learning from experience; learning from data. The approach is inductive: data drive the building of knowledge. These data-driven methods are collectively called statistical inference. In statistics, the observations decide which theory is correct. In contemporary biomedical research and in the Big Data era, both approaches are useful and cross-fertilize each other.

We provide consulting to help researchers prepare their experimental design before they collect data.

Never ask a statistician to analyse the data you already have.

To quote R.A. Fisher, consulting the statistician after the experiment is done most often amounts to asking for a *post-mortem* examination: he can perhaps say what the experiment died of.

It is always good practice to meet a **biostatistician** when designing your experiments and
be ready to discuss the following:

What is your primary research question? What is the nature of your expected outcome (continuous, categorical, binary, or count variables; survival/censored data; or ranked values)? What should be your quantitative measure of success? What are the sources of *variability*? What do you expect the outcome to depend on? What is your list of covariates (predictors)? Are your covariates possibly correlated? Pay attention to confounding and possible multicollinearity issues. Do you have a statistical model to fit? How good is the fit of the model to the data? Are you looking for outliers?

To determine a **sample size**, *n*, you will need to know the variability (σ) of your outcome, fix the **effect size** (δ) you want to detect, set a value for the *risk of false positive results (type I error, α)*, and require a minimal *power* of your setup (i.e. *the probability of detecting an effect if there truly is one, 1-β*). It may be advisable to carry out a pilot study before you proceed further with the full study.

Can you trust a panel of raters to monitor the quality of a manufactured food or beverage product? How do you assess the consistency of a panel of raters, or the objectivity among a jury's members? This is where ranked outcomes and *non-parametric* statistical methods come into play. Permutation and *bootstrap* methods could be helpful to obtain the empirical distribution of your outcome variable, at least under particular assumptions. You might as well require *simulated datasets* to test the performance of your statistical analytical toolbox.

Quite often in molecular biology experiments, the measures made on the experimental units (observations) do not
comply with the classical assumptions on the statistical distributions. As an example, codon usage frequencies for a given amino acid observed
across transcripts expressed in cells in a given condition (treatment vs. control) may not be normally distributed and may be skewed.
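One common way to make the sample-size reasoning above (σ, δ, α, 1-β) concrete is the normal-approximation formula for a two-sided, two-sample comparison. The sketch below is illustrative only; the values of σ, δ, α, and the target power are invented for the example:

```python
import math
from statistics import NormalDist

def sample_size_two_groups(sigma, delta, alpha=0.05, power=0.80):
    """Per-group n for a two-sided two-sample z-test (normal approximation).

    sigma: outcome standard deviation (assumed common to both groups)
    delta: smallest effect size (mean difference) worth detecting
    alpha: type I error risk; power = 1 - beta.
    """
    z = NormalDist().inv_cdf
    z_alpha = z(1 - alpha / 2)   # critical value for the two-sided test
    z_beta = z(power)            # quantile matching the requested power
    n = 2 * ((z_alpha + z_beta) * sigma / delta) ** 2
    return math.ceil(n)

# Illustrative values: sigma = 10, detect delta = 5 at alpha = 0.05, power = 0.80
print(sample_size_two_groups(sigma=10, delta=5))
```

With these numbers (a standardized effect of 0.5), the familiar answer of about 63 subjects per group appears; a pilot study is often what provides the estimate of σ in the first place.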
For those situations, nonparametric statistical methods are of interest:
- **Wilcoxon rank sum test (Mann-Whitney U test statistic)** when variances are equal,
- *Fligner-Policello median test* when the variances are not equal,
- *Ansari-Bradley rank test for dispersion* when medians are equal, and
- *Kolmogorov-Smirnov distribution-free test* for general differences between two populations.

Will you suffer the **curse of dimensionality** with your big data?
If the number of variables is much larger than the number of experimental subjects, you certainly will.
Should you filter out and prune some possibly irrelevant variables?
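The permutation idea mentioned earlier underlies the distribution-free tests listed above. A minimal pure-Python sketch, with invented data, of a two-sample permutation test on the difference of means:

```python
import random

def permutation_test(x, y, n_perm=10_000, seed=1):
    """Two-sample permutation test on the difference of means.

    Returns an approximate two-sided p-value obtained by shuffling
    group labels, i.e. from the empirical null distribution of the statistic.
    """
    rng = random.Random(seed)
    observed = abs(sum(x) / len(x) - sum(y) / len(y))
    pooled = list(x) + list(y)
    n_x = len(x)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = abs(sum(pooled[:n_x]) / n_x - sum(pooled[n_x:]) / len(y))
        if diff >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)   # add-one smoothing avoids p = 0

# Invented, skewed samples (e.g. codon-usage-like frequencies)
x = [0.10, 0.12, 0.11, 0.35, 0.09, 0.13]
y = [0.20, 0.22, 0.45, 0.21, 0.19, 0.40]
print(permutation_test(x, y))
```

Rank-based tests such as Mann-Whitney follow the same logic, with the raw values replaced by their ranks.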
There are *unsupervised* or *supervised machine learning* techniques which could be useful to help you get better insights into your big data: *classification and regression trees*, *hierarchical clustering*, *nearest neighbours (kNN)*, *principal components analysis (PCA)*, *support vector machines (SVM)*, *random forests* (RF), and the *Lasso* (Least Absolute Shrinkage and Selection Operator), just to mention a few.

We illustrate hereafter, in a few selected examples, some of the above issues and how they are dealt with:

- *Logistic regression with generalized estimating equations (GEE)* to select the best possible chemical additive to increase a product's shelf life (download this report here);
- *Nonparametric statistical analysis and unsupervised learning with principal components analysis* supporting evidence that protein translation efficiency depends on positively charged amino acid location and on transcript codon usage (download this report soon here);
- *Support Vector Machine (SVM)* to build a classifier for a lung cancer metabolomic signature in patients' blood samples (download this report here);
- *Unsupervised and supervised machine learning* and data mining methods in breast cancer diagnostics (download this report here).
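As a flavour of the unsupervised techniques listed above, here is a minimal PCA sketch (assuming NumPy is available; the data are simulated for the example):

```python
import numpy as np

def pca(X, n_components=2):
    """Project the rows of X onto the top principal components.

    X: (n_samples, n_features) data matrix; features are centred first.
    Returns (scores, explained_variance_ratio).
    """
    Xc = X - X.mean(axis=0)
    # Eigen-decomposition of the covariance matrix (fine when n_features is
    # modest; an SVD of Xc is preferred for very wide "p >> n" matrices)
    cov = np.cov(Xc, rowvar=False)
    eigval, eigvec = np.linalg.eigh(cov)
    order = np.argsort(eigval)[::-1]          # eigh returns ascending order
    eigval, eigvec = eigval[order], eigvec[:, order]
    scores = Xc @ eigvec[:, :n_components]
    return scores, eigval[:n_components] / eigval.sum()

# Simulated data: 10 samples, 4 strongly correlated features
rng = np.random.default_rng(0)
base = rng.normal(size=(10, 1))
X = np.hstack([base + 0.1 * rng.normal(size=(10, 1)) for _ in range(4)])
scores, ratio = pca(X)
print(ratio)  # the first component should dominate
```

Because the four features here are noisy copies of one signal, almost all the variance collapses onto the first component; that is exactly the pruning effect PCA offers when variables outnumber subjects.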

Should you consider **Bayesian methods** instead of the frequentist approach? How reliable is the prior expert knowledge?

Two examples are given below providing a flavour of the Bayesian approach to statistical analysis.

The full power of Bayesian methods has emerged in the electronic computing era over the last three decades.

We present hereafter, in a very intuitive way, the main sampling algorithms used in advanced Bayesian analysis to evaluate the posterior probability density (when problems are not amenable to closed-form analytical solutions), known as:

- MCMC: Markov Chain Monte Carlo methods, among which are the following two:
- Metropolis algorithm
- Gibbs sampler
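To convey the intuition, here is a toy random-walk Metropolis sampler (pure Python, with illustrative values). It targets the posterior of a binomial success probability under a uniform prior, and only ever evaluates the unnormalised log-density, which is the whole point of MCMC:

```python
import math
import random

def metropolis_beta_posterior(k, n, n_iter=20_000, step=0.1, seed=42):
    """Random-walk Metropolis sampler for the posterior of a binomial
    success probability theta with a uniform prior.

    The target is proportional to theta^k * (1 - theta)^(n - k), i.e. a
    Beta(k + 1, n - k + 1) density, but the sampler never needs the
    normalising constant.
    """
    rng = random.Random(seed)

    def log_target(theta):
        if not 0.0 < theta < 1.0:
            return float("-inf")   # outside the support: always rejected
        return k * math.log(theta) + (n - k) * math.log(1 - theta)

    theta = 0.5                    # arbitrary starting point
    samples = []
    for _ in range(n_iter):
        proposal = theta + rng.gauss(0, step)            # symmetric proposal
        if math.log(rng.random()) < log_target(proposal) - log_target(theta):
            theta = proposal                             # accept the move
        samples.append(theta)
    return samples[n_iter // 2:]   # discard the first half as burn-in

# Illustrative data: 7 successes in 10 trials;
# the analytic posterior mean is (7 + 1) / (10 + 2) = 2/3
draws = metropolis_beta_posterior(k=7, n=10)
print(sum(draws) / len(draws))
```

The Gibbs sampler follows the same Monte Carlo logic, but instead of a random-walk proposal it updates each parameter in turn by drawing from its full conditional distribution.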