Short Introduction
JProGO is a new microarray interpretation tool for identifying significantly changed biological functions or processes within prokaryotic microarray expression data.
The main principle of the program is to 'translate genes into functions':
for this purpose a solid functional classification of gene products, the Gene Ontology, is used.
The Gene Ontology is hierarchically organized as a directed acyclic graph (similar to a tree, but a node may have more than one parent), starting with the three broadest functional categories
'Molecular Function', 'Biological Process' and 'Cellular Component' on top of the ontology.
As in a tree, each node (branching point) represents a specific biological function, process or component. In total there are about 18000 such nodes, of which about 2000 play a role in bacteria, meaning that genes are assigned to these nodes (= functions).
Examples for nodes are
'transcriptional activity', 'protein binding' (both 'Molecular Function')
'citrate cycle', 'protein biosynthesis' (both 'Biological Process')
'cytoplasm', 'periplasm' (both 'Cellular Component')
To each of the relevant nodes a p-value is assigned: the probability that the expression values of its genes could be observed just by chance.
This means that nodes with low p-values could play an important biological role under the studied conditions, which are represented by the analyzed microarray data.
Alphabetical Help Index
Alternative Hypothesis
In most statistical tests the concept of choosing an 'alternative hypothesis' is common. It has to be chosen for the distribution-based tests 'KS-Test' and 'U-Test'.
There are three options: 'Less within node' means that the distribution of the expression values of the genes BELONGING to the node under consideration lies below the distribution of the expression values of the genes NOT BELONGING to this node. 'Greater within node' is the opposite. 'Two.sided' means that the direction is ignored: the test only checks whether the two distributions differ, no matter which of them lies below the other. For pde's (probabilities of differential expression, see also Microarray data type) only 'two.sided' makes sense as alternative hypothesis.
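The effect of the three alternatives can be illustrated with a small rank-sum (U-test) sketch. It is not JProGO's implementation: the function name `u_test`, the option strings and the example values are assumptions made for illustration, and the p-value uses the normal approximation without continuity or tie correction.

```python
import math

def u_test(in_node, outside, alternative="two.sided"):
    """Mann-Whitney U-test p-value (normal approximation, illustrative only)."""
    n1, n2 = len(in_node), len(outside)
    # Pool both groups; group 0 = genes within the node, 1 = all other genes.
    pooled = sorted([(v, 0) for v in in_node] + [(v, 1) for v in outside])
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        for k in range(i, j + 1):
            ranks[k] = (i + j) / 2 + 1  # 1-based mid-rank shared by tied values
        i = j + 1
    rank_sum = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    u = rank_sum - n1 * (n1 + 1) / 2            # U statistic of the node genes
    mean = n1 * n2 / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean) / sd
    p_greater = 0.5 * math.erfc(z / math.sqrt(2))   # node values tend to be larger
    p_less = 0.5 * math.erfc(-z / math.sqrt(2))     # node values tend to be smaller
    if alternative == "greater":
        return p_greater
    if alternative == "less":
        return p_less
    return min(1.0, 2 * min(p_greater, p_less))     # direction ignored

# Hypothetical log ratios: genes within the node vs. all remaining genes.
node_values = [2.1, 2.5, 3.0, 3.2]
other_values = [0.4, 0.5, 0.8, 1.0, 1.1, 1.3]
p_greater = u_test(node_values, other_values, "greater")
p_less = u_test(node_values, other_values, "less")
p_two_sided = u_test(node_values, other_values, "two.sided")
```

In this example all node genes have higher values than the remaining genes, so 'greater' yields a small p-value, 'less' a p-value close to 1, and 'two.sided' twice the smaller of the two.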
Analysis Method
You can choose between 5 different analysis methods, which can be subdivided into:
a) Methods for which a threshold value for the gene expression values has to be set.
This means not all gene expression values of the microarray are considered but only those above or below this threshold (e.g. only genes up-regulated by at least two-fold).
The methods are the hypergeometric test and Fisher's exact test.
b) Methods which consider all gene expression values (= distribution-based): Here no information is lost, since all expression values are considered. For these methods, pre-filtering for consistently measured genes (e.g. genes expressed above background level in all measurements) is advisable.
The methods are the Kolmogorov-Smirnov test, the unpaired Wilcoxon test (also known as the 'Mann-Whitney U-test') and Student's t-test. For these methods an alternative hypothesis can be chosen; the default is 'two.sided'.
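For the threshold-based methods in a), the p-value of a node can be sketched as a hypergeometric tail probability: how likely is it to find at least the observed number of above-threshold genes in the node if the above-threshold genes were drawn at random from all genes? The function name, parameter names and counts below are hypothetical and only illustrate the principle; they are not taken from JProGO.

```python
from math import comb

def hypergeom_overrep_pvalue(total_genes, selected_genes, node_genes, node_selected):
    """P(X >= node_selected) for X ~ Hypergeometric(total_genes, node_genes, selected_genes).

    total_genes    : all genes on the array that carry an annotation
    selected_genes : genes passing the threshold
    node_genes     : genes assigned to the node under consideration
    node_selected  : genes of that node passing the threshold
    """
    upper = min(node_genes, selected_genes)
    favourable = sum(
        comb(node_genes, k) * comb(total_genes - node_genes, selected_genes - k)
        for k in range(node_selected, upper + 1)
    )
    return favourable / comb(total_genes, selected_genes)

# Hypothetical numbers: 4000 genes, 200 above threshold,
# a node with 40 genes of which 8 are above threshold (2 expected by chance).
p_node = hypergeom_overrep_pvalue(4000, 200, 40, 8)
```

Only the counts enter the calculation, which is why the expression values themselves are no longer used once the threshold has been applied.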
Correction for Multiple Testing
You can choose between 2 different methods that perform correction for multiple hypothesis testing: control of the Family Wise Error Rate (FWER) and control of the False Discovery Rate (FDR). The default correction is the FDR method (see below).
a) Control of the FWER (Family Wise Error Rate) according to Bonferroni:
This method is sometimes also called the Bonferroni test. It tightly controls the propensity of making false discoveries and keeps the overall Type I error low, that is, it reduces the risk of falsely rejecting a null hypothesis that is true. It adjusts the significance level of each individual test by dividing it by the number of all performed tests (equivalently, each p-value is multiplied by that number). One drawback of the Bonferroni test is that it becomes more conservative the more tests are performed (it tries to control the chance of making even a single false discovery among all performed tests). Therefore, this method misses many real detections, i.e. it fails to reject null hypotheses that are not true (low power).
b) Control of the FDR (False Discovery Rate):
The FDR method controls the proportion of wrongly rejected null hypotheses among all rejected hypotheses. For example, controlling the FDR at a level of 5% means that, on average, no more than 5% of all detected GO nodes (significant hits) are wrongly detected. Thus, the FDR method has on the one hand a higher power than the Bonferroni correction, and on the other hand it controls errors (false positive hits) better than testing without any adjustment. It thereby controls the errors that are most relevant in practice.
The table below illustrates the interrelations between the number of Type I errors (V), the number of Type II errors (T) and the FDR.
Number of errors committed when testing m null hypotheses
from Benjamini, Y. and Hochberg, Y. (1995): "Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing", J. Royal Statistical Society, Vol. 57, 289-300
| | Declared non-significant | Declared significant | Total |
| True null hypotheses | U | V | m0 |
| Non-true null hypotheses | T | S | m - m0 |
| Total | m - R | R | m |
R is an observable random variable; U, V, S, T are unobservable random variables.
Q is an unobservable random variable with Q = V / (V + S) = V / R (and Q = 0 if R = 0).
The FDR is the expectation of Q: FDR = E(Q) = E(V / R).
It is the expected proportion of the rejected null hypotheses that are erroneously rejected.
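The difference between the two corrections can be sketched on a handful of p-values. The code assumes the step-up procedure of the Benjamini and Hochberg paper cited above for the FDR and works with adjusted p-values; whether JProGO computes it exactly this way is not stated here, and the function names and p-values are made up for illustration.

```python
def bonferroni(pvalues):
    """Bonferroni-adjusted p-values: multiply by the number of tests, cap at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]

def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values (step-up procedure)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical raw p-values of eight GO nodes.
raw = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
hits_bonferroni = sum(p <= 0.05 for p in bonferroni(raw))
hits_fdr = sum(p <= 0.05 for p in benjamini_hochberg(raw))
```

With these numbers the Bonferroni correction keeps one node at the 0.05 level while the FDR correction keeps two, which illustrates the higher power of the FDR method described above.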
Microarray Data Type
As type of gene expression data you can submit either fold changes, also called 'ratios', or probabilities of differential expression, abbreviated as 'pde'.
a) fold changes / ratios / log ratios: the ratio of the expression under the condition under investigation to the expression under the microarray control condition
Consider, for example, that you investigated the aerobic-to-anaerobic switch of E. coli wildtype cells. In this example the 'control condition' is represented by the cells grown aerobically and the 'condition under investigation' by the cells grown anaerobically.
Here a direction is indicated: if the ratio is greater than 1 the gene is up-regulated, and if it is less than 1 it is down-regulated.
Log ratios are the base-2 logarithms of the ratios.
b) pde's: If you performed several (biological) replicates of your experiments (which should be the normal case), you can also calculate the probability of differential expression for each gene. For this you can use specialized software, which is mostly based on a (modified) t-test, for example using a Bayesian statistical approach. The pde value takes into account the deviations in the expression of the same gene that are due to biological and technical variability in the microarray experiments.
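The relation between ratios, log ratios and the direction of regulation can be written down in a few lines. The helper names are hypothetical; the only facts used are that log ratios are base-2 logarithms and that a ratio of 1 (log ratio 0) separates up- from down-regulation.

```python
import math

def to_log_ratio(ratio):
    """Base-2 logarithm of an expression ratio (the ratio must be positive)."""
    return math.log2(ratio)

def to_ratio(log_ratio):
    """Inverse conversion: expression ratio from a base-2 log ratio."""
    return 2.0 ** log_ratio

def direction(ratio):
    """Direction of regulation indicated by a ratio."""
    if ratio > 1:
        return "up"
    if ratio < 1:
        return "down"
    return "unchanged"

# A ratio of 0.25 (four-fold down) corresponds to a log ratio of -2.
four_fold_down = to_log_ratio(0.25)
```

Log ratios are symmetric around 0, so a four-fold increase (+2) and a four-fold decrease (-2) have the same magnitude, which is not the case for the plain ratios 4 and 0.25.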
Microarray File
The microarray file should be plain text containing two 'columns' separated by a tabulator, which is the default column delimiter.
The first 'column' should contain the gene short names (e.g. Fnr for E. coli) or, for P. aeruginosa, the PA numbers (e.g. PA0002). Therefore, we also call it the 'Gene Name Column', which can be specified in our web interface (default 1 = first column).
The second 'column', which we also call 'Data Column' (by default 2), should contain the measured expression values:
either
Ia) ratios of differential expression (e.g. 0.01, 0.25, 2.0, 45.0, 120.3)
Ib) log ratios of differential expression (e.g. -3, -2.5, 2, 4.5)
or
II) probabilities of differential expression (e.g. 0.03, 0.45, 0.98), see also Microarray Data Type.
Example for E. coli microarray data (probabilities of differential expression):
aroD	0.55
artI	0.65
b2016	0.89
b3042	0.17
hdeB	0.12
yabP	0.99
yceI	0.04
ydaR	0.98
...	...
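If you want to check a file before uploading it, a minimal reader for the two-column, tab-separated format described above could look like the following sketch. The function name and its parameters are assumptions that merely mirror the 'Gene Name Column', 'Data Column' and delimiter settings; this is not the parser used by JProGO.

```python
import csv
import io

def read_expression_values(handle, gene_column=1, data_column=2, delimiter="\t"):
    """Read gene names and expression values from a delimited text file.

    Column numbers are 1-based, as in the web interface defaults.
    Lines that are too short or whose data field is not numeric are skipped.
    """
    values = {}
    for row in csv.reader(handle, delimiter=delimiter):
        if len(row) < max(gene_column, data_column):
            continue  # blank or incomplete line
        gene = row[gene_column - 1].strip()
        try:
            values[gene] = float(row[data_column - 1])
        except ValueError:
            continue  # e.g. a header line
    return values

# First lines of the E. coli example, held in memory instead of a file.
example = "aroD\t0.55\nartI\t0.65\nb2016\t0.89\n"
parsed = read_expression_values(io.StringIO(example))
```

For a real file, pass an open file handle instead of the `io.StringIO` object, for example `open("my_array.txt", newline="")`, where the file name is a placeholder.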
Organism
You have to choose the organism from which the microarray data you are analyzing are derived.
Currently the following 23 prokaryotic organisms are supported:
- Bacillus cereus (strain ATCC 10987)
- Bacillus subtilis (strain 168)
- Caulobacter crescentus (strain CB15 ATCC 19089)
- Clostridium tetani (strain Massachusetts E88)
- Corynebacterium glutamicum (strain DSM 20300 ATCC 13032 NCIB 10025 Nakagawa)
- Escherichia coli (strain K12)
- Helicobacter pylori (strain ATCC 700392 26695)
- Listeria innocua (serovar 6a strain CLIP 11262)
- Listeria monocytogenes (serovar 1 2a strain EGD-e)
- Methanococcus jannaschii (strain JAL-1 ATCC 43067 DSM 2661)
- Mycobacterium tuberculosis (strain H37Rv)
- Mycobacterium tuberculosis (strain Oshkosh CDC 1551)
- Mycoplasma pneumoniae (strain M129 ATCC 29342)
- Mycoplasma genitalium (strain ATCC 33530 G-37)
- Pseudomonas aeruginosa (strain ATCC 15692 PAO1)
- Pseudomonas putida (strain KT2440)
- Rickettsia conorii (strain Malish 7)
- Rickettsia prowazekii (strain Madrid E)
- Salmonella typhimurium (strain ATCC 700720 SGSC1412 LT2)
- Staphylococcus aureus (strain N315)
- Streptococcus pneumoniae (strain TIGR4 ATCC BAA-334)
- Streptococcus pyogenes M1 GAS
- Yersinia pestis (biovar Mediaevalis strain KIM5)
Search & Filter Functions of the Result Table View
After the analysis of your microarray data has finished, you proceed to the result table view. At the top all analysis parameters are specified. Below you find a Search / Filter field, which offers you the following powerful search functionalities:
- Search for GO Terms with p-values below or above a certain value or within a certain range of p-values:
For example, in order to mark the nodes whose p-values are significant at the 0.05 level, choose "p-value" in the Selected Column box and "smaller than" in the Compare Option box (see below). By default, the Bonferroni-corrected level of significance is listed in the box containing the search text, in this case 5.9594E-5. In the table view, GO nodes with smaller p-values are now highlighted in yellow (see below). By default, nodes are sorted by their p-value.
In order to mark nodes whose p-values lie in a certain range, choose "specify range" in the Compare Option box. Leave "p-value" in the Selected Column box unchanged. In the field containing the search text, enter the upper and lower limit of the p-value interval with a white space in between: for example, type "0.0001 0.001" in order to get all GO nodes with a p-value above 0.0001 and below 0.001. Do not forget to separate the two numbers by a white space (the order of the numbers does not matter).
Example 1: 
- Search for a certain word / phrase in specified column (regular expression supported):
Select the column of interest in the Selected Column box (e.g. "p-value", "GO Name", "GO Accession", "GO Category"). If the selected column contains only numbers, which is solely the case for the p-value column, all four search options from the Compare Option box are supported ("equal", "smaller than", "greater than", "specify range"). If the selected column contains text, only the Compare Option "equal" makes sense (string / regular expression search). To display only those GO nodes that meet the search criteria, switch Show Only Marked over to yes and sort by p-value or GO name.
Example 2: Show all GO nodes containing the phrase "amino acid" (Search Column=GO name, Compare Option=equal and Show Only Marked=yes), sorted by their p-values. Wildcards are automatically appended in front of and behind the phrase.
Example 3: Show all GO nodes whose names contain either or both of the two acidic amino acids "aspartate" or "glutamate" (Search Column=GO name, Compare Option=equal and Show Only Marked=yes), sorted alphabetically by GO name. Here a regular expression was employed.
Example 4: Mark the significant GO nodes from the sub-ontology "Biological Process" (select only the check box next to "Biological Process", Search Column=p-value, Compare Option=smaller than and Search Text="5.9594E-5" or the adapted significance level which has been calculated). In order to sort GO nodes by their category (= sub-ontology), click on the arrow head in the GO Category column.
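The range search and the text search described above can be mimicked with a few lines of code. This is only a sketch of the described behaviour, not the code behind the result table: the function names, the pattern `aspartate|glutamate` and the case-insensitive matching are assumptions.

```python
import re

def in_range(p_value, search_text):
    """'specify range': two limits separated by white space, in any order."""
    low, high = sorted(float(x) for x in search_text.split())
    return low < p_value < high

def name_matches(go_name, pattern):
    """'equal' on a text column: the pattern may match anywhere in the name,
    which corresponds to wildcards in front of and behind the phrase.
    Ignoring upper/lower case is an assumption of this sketch."""
    return re.search(pattern, go_name, flags=re.IGNORECASE) is not None

# Illustrative GO node names only.
names = [
    "glutamate biosynthetic process",
    "aspartate family amino acid metabolic process",
    "protein binding",
]
acidic_hits = [n for n in names if name_matches(n, "aspartate|glutamate")]
amino_acid_hits = [n for n in names if name_matches(n, "amino acid")]
```

The alternation `aspartate|glutamate` is one way to express "either or both" as a regular expression, as in Example 3; the plain phrase `amino acid` behaves like Example 2.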
Significance Level
In statistical testing, a result is considered as significant if it is unlikely to have occurred just by chance, given that a presumed null hypothesis is true.
In more detail, the level of significance - also denoted by the symbol alpha - in conventional statistical hypothesis testing denotes the maximum probability of accidentally rejecting a true null hypothesis. Such an erroneous rejection is called a Type I error. In the context of this analysis tool, such an error occurs if a GO node does not differ in its gene expression profile from its environment but the applied test spuriously states that this is the case.
To each GO node a p-value is assigned by the statistical test; the smaller the p-value, the more significant the result.
To control the rate of false positives (the percentage of true null hypotheses that are rejected), one may choose a significance level of, for example, 5% or 0.05. This means that, on average, a true null hypothesis is accidentally rejected in only 1 out of 20 tests, that is, a GO node is falsely identified as significant. Common significance levels used in hypothesis testing are 5%, 1% and 0.1%.
Threshold Value
Only relevant for overrepresentation-based methods (see Analysis Method)!
The Threshold Value should help to estimate whether an expression value for a certain gene should be regarded as relevant, meaning the gene is likely to be differentially expressed between the two conditions tested. The threshold value tries to take into account the biological and technical variability of microarray experiments.
The selected threshold value depends on the type of analyzed microarray data: For 'expression ratios' a threshold of 2 or 4 is commonly used for up-regulated genes, meaning that only genes which are increased in expression by more than two-fold or four-fold are regarded as up-regulated. On the other hand, threshold values of 0.5 or 0.25 are commonly used for defining down-regulated genes, meaning that only genes which are decreased by more than two-fold or four-fold are regarded as down-regulated.
For 'probabilities of differential expression' threshold values of 0.9, 0.95, and 0.99 are reasonable values corresponding to 1-(alpha error).
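How a threshold turns expression values into a set of 'relevant' genes can be sketched as follows. The function name, the `direction` argument and the ratio values are hypothetical; the strict comparison follows the "more than two-fold" wording above and is an assumption about the boundary case.

```python
def select_genes(values, threshold, direction="up"):
    """Return the genes whose value lies beyond the threshold.

    direction="up":   keep values above the threshold (ratios > 2, pde > 0.95, ...)
    direction="down": keep values below the threshold (ratios < 0.5, ...)
    """
    if direction == "up":
        return {gene for gene, v in values.items() if v > threshold}
    return {gene for gene, v in values.items() if v < threshold}

# Hypothetical expression ratios.
ratios = {"geneA": 4.2, "geneB": 0.2, "geneC": 1.1, "geneD": 2.5}
up_regulated = select_genes(ratios, 2, "up")
down_regulated = select_genes(ratios, 0.5, "down")

# Probabilities of differential expression taken from the E. coli example.
pde = {"aroD": 0.55, "yabP": 0.99, "ydaR": 0.98, "yceI": 0.04}
differential = select_genes(pde, 0.95, "up")
```

Only the selected genes are then counted per GO node by the threshold-based tests; all other genes contribute only to the totals.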