E-Gene III

The first step of the exercise is to identify potential Open Reading Frames (ORFs). Run this search with one of the following specialized tools:

SMS
NCBI
EBI
PROSITE

The ORF retained for further analysis must meet all of the following conditions:

Contains no internal STOP codon
Contains at least 60 codons
May be on the direct or indirect strand
May be in frame 1, 2 or 3
May be complete or incomplete at its 5' or 3' end
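
The conditions above can be sketched as a small search over the six reading frames. This is only an illustrative sketch, not the algorithm used by SMS or by the NCBI tool: the function name find_orfs, the standard STOP codons (TAA, TAG, TGA) and the 0-based coordinates are assumptions. An ORF is taken here as any STOP-free stretch, so it may be incomplete at either end.

```python
STOPS = {"TAA", "TAG", "TGA"}  # assumption: standard genetic code
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the indirect strand, read 5'->3'."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=60):
    """List STOP-free stretches of at least min_codons codons.

    Each result is (strand, frame, start, end, n_codons); start and end are
    0-based, end-exclusive offsets on the strand that was read.
    Results are sorted with the longest ORF first.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("direct", seq), ("indirect", reverse_complement(seq))):
        for frame in range(3):
            run_start = frame
            # Offset just past the last complete codon of this frame.
            last = frame + ((len(s) - frame) // 3) * 3
            for pos in range(frame, last + 1, 3):
                at_end = pos >= last
                if at_end or s[pos:pos + 3] in STOPS:
                    n_codons = (pos - run_start) // 3
                    if n_codons >= min_codons:
                        orfs.append((strand, frame + 1, run_start, pos, n_codons))
                    run_start = pos + 3
    return sorted(orfs, key=lambda o: o[4], reverse=True)
```

Calling find_orfs(sequence)[0], when the list is not empty, gives the longest candidate on either strand.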

Copy and paste the raw results of the ORF search into the 'ORF finding' field of the E-Gene entry form. Do not forget to record the PROTOCOL used for the ORF finding.

N.B.: If you use SMS, remember to run the analysis on both strands (direct and indirect).
If your sequence contains several ORFs, choose the longest one.

Then determine whether the possible ORF found in the sequence fragment seems plausible (true or false positive?). The key elements for judging the credibility of an ORF are:

The length of the ORF.
The existence of homologous proteins (see BLAST).

If there are clear homologs, conclude that your sequence is coding, whatever the length of the ORF.
Otherwise, judge whether the ORF is long enough to conclude that the sequence is coding.
This criterion is rather subjective, but it seems highly unlikely that an ORF of more than 150 amino acids is a false positive!

On the other hand, if the sequence fragment does not appear to contain a gene (ORF not long enough and no homologs), check 'non coding' under the 'Status' heading.
The E-Gene for this sequence fragment is then complete: only the 'ORF finding' and 'BLAST' fields are filled in (in addition to your analytical report in the conclusion field). After saving the E-Gene for this sequence fragment, you can add a new sequence fragment to your basket!

Conversely, if the BLAST search for homologs suggests that an ORF corresponds to a gene (*) (or if the ORF has no known homolog but seems too long to have arisen by chance, e.g. 250 codons), check 'coding' under the 'Status' heading. Indicate the strand (direct or indirect) on which the ORF lies, and its START and END positions. To validate this ORF, submit these data to E-Gene by clicking on "SAVE".

If your ORF meets the conditions listed above, the translation is displayed automatically. Otherwise, an error message tells you, for example, that the ORF contains a STOP codon or that its length is not a multiple of 3.
The ORF may be incomplete (missing STOP codon at the end, or missing initiation codon); in this case, only a warning is displayed.
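
The checks described above can be imitated locally before submitting an ORF. The sketch below is not the E-Gene validation code: the function name check_orf, the use of the standard genetic code, the treatment of ATG as the only initiation codon and the message texts are all assumptions made for illustration.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (assumption:
# the organism uses the standard code).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def check_orf(orf):
    """Return (protein, errors, warnings) for an ORF read 5'->3'."""
    orf = orf.upper()
    errors, warnings = [], []
    if len(orf) % 3 != 0:
        errors.append("length is not a multiple of 3")
    # Only complete codons are translated.
    codons = [orf[i:i + 3] for i in range(0, len(orf) - len(orf) % 3, 3)]
    has_stop = bool(codons) and CODON_TABLE.get(codons[-1]) == "*"
    body = codons[:-1] if has_stop else codons
    # Codons with ambiguous bases are translated as X.
    protein = "".join(CODON_TABLE.get(c, "X") for c in body)
    if "*" in protein:
        errors.append("internal STOP codon")
    if not codons or codons[0] != "ATG":
        warnings.append("missing initiation codon (5' incomplete)")
    if not has_stop:
        warnings.append("missing STOP codon (3' incomplete)")
    return protein, errors, warnings
```

An ORF with an empty error list but one or two warnings corresponds to the "incomplete" case: it can still be saved, but its ends deserve a comment in your report.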

(*) Indeed, the absence of homologs in sequence databases does not demonstrate that an ORF is non-coding; otherwise we would never find new genes! There are other techniques for identifying genes, called ab initio methods (e.g. those exploiting codon usage statistics), but these will be added to the bioinformatics program in a later version.

For more details on ORF finding, see the "Ressources" section of the E-GENE site, including the very sensitive and crucial question of the exact position of the start of the ORF...

Go to the 3DJIGSAW / LOOPP site: http://bmm.cancerresearchuk.org/~3djigsaw/

Complete the form:

Name
Personal e-mail address
Your e-mail address again (confirmation)
Sequence name (Protein ID): sequence code (e.g. >JCVI_READ_1095899390246)
The protein sequence (Translation), NOT in FASTA format.

Do not change the options (keep them on "automatic").

Click on "Submit"

You will receive an e-mail with an attached file (format example : >JCVI_READ_1095899390246.pdb) within one hour at most.

Save the file to your desktop

In the 3D MODELING section of E-GENE, click on "BROWSE", locate the .pdb file on your desktop, and then click on "Send".

The model appears.

You can ZOOM, MOVE and ORIENT your molecule using the mouse.

Right-click to open a toolbar with options for changing the display style of your model.

Fill in the adjacent field with your comments on the molecule.

If the ORF is complete (initiation codon and STOP codon), you can then calculate the theoretical molecular weight of the polypeptide encoded by your selected ORF using:

SMS

MWCalc (Infobiogen)

N.B.: If your ORF lies between two STOP codons separated by a large number of amino acids (about 200), you can also calculate its molecular weight.
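
For a quick cross-check of the value returned by these tools, the theoretical molecular weight can be approximated as the sum of the average residue masses plus one water molecule. The sketch below is an illustration, not the method used by SMS or MWCalc: the function name molecular_weight is hypothetical and the masses are commonly tabulated average values (in daltons), so small differences from the online tools are to be expected.

```python
# Average residue masses in daltons (amino acid minus one water).
# Assumption: commonly tabulated average values, unmodified residues only.
RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER = 18.0153  # one water for the free N- and C-termini


def molecular_weight(protein):
    """Approximate average molecular weight (Da) of a protein sequence."""
    protein = protein.upper().rstrip("*")  # ignore a trailing STOP symbol
    unknown = set(protein) - set(RESIDUE_MASS)
    if unknown:
        raise ValueError(f"unsupported residues: {sorted(unknown)}")
    return sum(RESIDUE_MASS[aa] for aa in protein) + WATER
```

Dividing the result by 1000 gives the weight in kDa, the unit usually quoted for proteins.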

The search for potential conserved protein domains is carried out on the translation of your ORF, using one of the following tools:

INTERPRO
PROSITE
PFam
CDD

Retain only the significant protein domains, that is to say:

Those we do not expect to find purely by chance (i.e. whose signatures/profiles are sufficiently specific).
Those whose function is consistent with the other bioinformatics analyses performed (e.g. a DNA-binding domain for an ORF whose BLAST homologs are transcription factors).
Those that are non-redundant (and non-overlapping) with the other domains you have chosen.
Those with a convincing size, score and E-value (large size, high score and low E-value).

If you are convinced of the validity of certain predicted domains (maximum 4), enter their names and coordinates (positions) in the RAW RESULTS table.
Do not list the same functional domain several times under the different names/accession numbers used by different databases (the same protein domain is frequently found under various accession numbers in PROSITE, INTERPRO and PFam).
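
The selection rules above can be applied mechanically as a first pass, before using your own judgment. The sketch below is only one possible reading of them: the function name select_domains, the dictionary keys (name, start, end, evalue) and the E-value threshold of 1e-5 are hypothetical choices, not values prescribed by E-Gene. It keeps the best-scoring hits, skips any hit overlapping a domain already kept (which also removes the same domain reported by several databases), and stops at four domains.

```python
def select_domains(hits, max_evalue=1e-5, max_domains=4):
    """Keep the most significant, mutually non-overlapping domain hits.

    hits: list of dicts with keys 'name', 'start', 'end', 'evalue'
    (positions on the protein, inclusive).
    """
    kept = []
    for hit in sorted(hits, key=lambda h: h["evalue"]):
        if hit["evalue"] > max_evalue:
            break  # remaining hits are even less significant
        overlaps = any(
            hit["start"] <= k["end"] and k["start"] <= hit["end"] for k in kept
        )
        if not overlaps:
            kept.append(hit)
        if len(kept) == max_domains:
            break
    # Report the retained domains in sequence order.
    return sorted(kept, key=lambda h: h["start"])
```

The consistency of each domain with the BLAST results still has to be judged by hand; no threshold replaces that step.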

For more details on the search for conserved domains, see the "Ressources" section of the E-GENE site.

Reminder: BLAST = Basic Local Alignment Search Tool (Altschul, 1990)

BLAST is the algorithm used by a family of five programs that align a new sequence against a database of already identified sequences. Statistical tests determine whether each alignment obtained is significant, and the results are listed in order of significance.
NOTE: BLAST is optimized for finding alignments, not for searching for patterns!

The search for potential homologs of your ORF in sequence databases is carried out by running a BLAST on one of the following BLAST web servers:

NCBI (prefer this one)
EBI (in case of overload of NCBI)
Infobiogen

Two BLAST approaches are possible for finding homologs of your sequence:

A BLASTp (protein sequence against a protein database) of the translation of your ORF against a protein database.
A BLASTx (nucleotide sequence translated by BLAST in all 6 reading frames, against a protein database) of the complete NUCLEOTIDE sequence of your fragment (not your ORF) against a protein database.
Use BLASTx if you have any doubt about the reading frame of your ORF, or if your ORF finding was apparently unsuccessful (BLASTx is generally insensitive to sequencing errors).

In addition, you should query both of the following databases:

NR: the most comprehensive protein database available (useful for phylogenetic analysis)
SWISSPROT: a small database whose protein annotation records are very complete (useful for functional analysis)

Copy and paste into the 'BLAST' raw results field of the E-GENE site:

The header of the results of BLAST
The complete list of hits (the list of sequences followed by two columns 'Score' and 'E-Value')

Remember to specify the protocol: which program was used, against which database, and with which other parameters.

If your ORF has known homologs, indicate the cutoff score (or E-value) that separates the homologs (true positives) from the non-homologs (false positives).

For more details on BLAST, see the "Ressources" section of the E-GENE site.

In this part you must select two groups of homologous sequences which will be used, after multiple alignment, to attempt to reconstruct the phylogenetic tree.

In/Studying group: (up to about 20-30 sequences) homologs belonging to the same taxonomic group as your ORF.
Out/External group: (approximately 5-6 sequences) the closest homologs outside the study group (in order to root the phylogenetic tree).

For more details on forming the Studying Group and the External Group, see the "Ressources" section of the E-GENE site.

Taxonomy Report (NCBI BLAST only; the link is in the header of the BLAST results): in the "Taxonomy Report" field of E-GENE, copy only the first part, called Lineage Report!

The first aim of this multiple alignment is to verify that the ORF in question fits properly into the family of its presumed homologs: the multiple alignment must show convincing conserved regions.

In addition, the multiple alignment will be used to infer a phylogenetic tree of the presumed homologs (see "phylogenetic tree" below): the alignment should contain enough mutations (informative positions) to allow reconstruction of the evolutionary history that separated these proteins!
Be careful not to include overly partial sequences, which reduce the number of informative positions.
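
To get a rough idea of how many informative positions an alignment contains, its columns can be counted with a short script. The sketch below uses the usual parsimony definition (a column is informative when at least two different residues each occur in at least two sequences), which is an assumption: the text above does not define the term precisely. The function name informative_columns is hypothetical, and gaps and ambiguous symbols are simply ignored.

```python
from collections import Counter


def informative_columns(aligned):
    """Count parsimony-informative columns in a list of aligned sequences."""
    if len({len(s) for s in aligned}) > 1:
        raise ValueError("aligned sequences must all have the same length")
    count = 0
    for column in zip(*aligned):
        # Ignore gaps and ambiguous symbols when counting residues.
        freq = Counter(c for c in column if c not in "-X?")
        # Informative: at least two residues, each seen at least twice.
        if sum(1 for n in freq.values() if n >= 2) >= 2:
            count += 1
    return count
```

Comparing this count before and after removing a very partial sequence shows how much that sequence costs in usable positions.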

It is common to have to redo the alignment several times, adding or removing more or less divergent sequences, before obtaining a satisfactory alignment.

IMPORTANT: Before running the multiple alignment, you must insert readable sequence names directly into the FASTA headers, to create sequence "tags" that remain legible in the multiple alignment and in the phylogenetic trees.

Take each sequence in FASTA format exactly as it comes from NCBI (if left as is, the tag will be a cryptic "gi|0000000") and insert a readable tag; for your own sequence, use the tag "unknown".

After insertion of the readable name, your sequence in FASTA format must be as follows: the sequence name consists of the characters directly after the > sign up to the first space, with a maximum of 10 characters.

Choose a recognizable sequence name, such as "Ecoli" for "Escherichia coli". It is crucial that each sequence name is unique, otherwise the multiple alignment software will return an error message! If you have two sequences of "Ecoli", use for example "Ecoli1" and "Ecoli2".
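
When many sequences have to be renamed, the tagging can be scripted. The sketch below is an illustration under several assumptions: it expects NCBI-style headers ending with the organism in square brackets (e.g. "[Escherichia coli]"), builds the tag from the first letter of the genus plus the species name, and gives the tag "unknown" to any header without an organism, such as your own ORF. The function names short_tag and retag_fasta are hypothetical.

```python
import re
from collections import Counter


def short_tag(header):
    """Build a tag like 'Ecoli' from the trailing '[Genus species]' field."""
    match = re.search(r"\[([A-Za-z]+)\s+([A-Za-z]+)[^\]]*\]\s*$", header)
    if not match:
        return "unknown"
    genus, species = match.groups()
    return genus[0].upper() + species.lower()


def retag_fasta(fasta_text, max_len=10):
    """Prefix every FASTA header with a unique tag of at most max_len chars."""
    lines = fasta_text.splitlines()
    totals = Counter(short_tag(l) for l in lines if l.startswith(">"))
    seen = Counter()
    out, tags = [], []
    for line in lines:
        if not line.startswith(">"):
            out.append(line)
            continue
        base = short_tag(line)
        suffix = ""
        if totals[base] > 1:  # number repeated tags: Ecoli1, Ecoli2, ...
            seen[base] += 1
            suffix = str(seen[base])
        tag = base[:max_len - len(suffix)] + suffix
        tags.append(tag)
        # The tag ends at the first space; the original header is kept after it.
        out.append(f">{tag} {line[1:].strip()}")
    if len(set(tags)) != len(tags):
        raise ValueError("tags are not unique after truncation; rename by hand")
    return "\n".join(out)
```

Always read through the resulting headers: automatically generated tags are unique, but not necessarily the most recognizable ones.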

Build the multiple alignment (sequences of the studying group, sequences of the external group and, do not forget, the translation of your ORF!) using a web version of one of the following programs:

ClustalW (classical)
MUSCLE (fast and a little more efficient)
T-COFFEE (slower but very robust, with a very useful color display of the conserved blocks).

These programs are available on:

EBI
Phylogeny.fr

The only limit on the number of sequences in your multiple alignment is the computation time needed by the multiple alignment software and by the phylogenetic tree calculation. This time is usually reasonable for up to thirty (or fifty) sequences of several hundred residues each.

Copy and paste the resulting multiple alignment (in CLUSTALW format) into the 'multiple alignment' field of E-Gene.

For more details on multiple alignment, see the "Ressources" section of the E-GENE site.

The multiple alignment produced previously is used to build a phylogenetic tree. Two construction approaches are possible:

The so-called "distance" method, or Neighbor-Joining ("NJ", "BioNJ" or "Phylip protdist/neighbor"):

Copy and paste the multiple alignment in "Phylip" format, not in "ClustalW" format! The output format option for the multiple alignment is available in the ClustalW form.
For "Nature of data", choose: Sequences.
For "Nature sequences", choose: Protein.
For TREATMENT, choose: NEIGHBOR.
Under "Options on the construction of the tree", select "Randomization of the order of sequences".
Under "Nature of the tree to build", check "root" and enter the number of your "external group" in the corresponding field.
Click on "SCAN"!

The "maximum likelihood" method ("PhyML"):

Copy and paste your multiple alignment in "Phylip" format, not in "ClustalW" format! The output format option for the multiple alignment is available in the ClustalW form.
For TREATMENT, choose "PROTPARS (proteines)".
Select "Randomization of the order of sequences".
Under "Nature of the tree to build", check "root" and enter the number of your "external group" in the corresponding field.

In all cases, copy and paste the representation of the resulting tree verbatim into the "Tree" field of E-Gene. Also indicate the method and the main parameters used to produce your tree (e.g. Phylip / NJ method / external group: ......).

For more details on the inference of phylogenetic trees, see the "Ressources" section of the E-GENE site.

Taxonomy

After analyzing your phylogenetic tree, specify the closest taxonomic group to which the organism carrying your DNA fragment appears to belong.
To do so, enter the scientific name in the field,
then click on 'Refresh': the correct scientific name of this organism is displayed automatically, along with other information on its classification.

Remember to save your changes on the E-GENE site.

Biological process and molecular function

When your in silico analyses (BLAST, INTERPRO) allow it, select from the menu the most appropriate term, describing your ORF as specifically as possible.

These terms belong to an exhaustive, hierarchical list designed to describe all cellular activities: the "Gene Ontology", often referred to as GO annotations.

Molecular function: biochemical activity of the protein (e.g kinase)
Biological process: role of this activity in cell (e.g signal transduction)

These GO annotations are frequently assigned to known genes in public databases such as SWISSPROT or INTERPRO. Do not hesitate to draw on the GO annotations of your ORF's homologs or of its conserved domains to select the most appropriate GO terms.

Gene

Provided that your ORF shows convincing homology with a family of proteins of known function, and that the gene symbols used for the members of this family appear uniform and stable, propose a gene symbol for your ORF based on the one used for its orthologs. If the homologs of your ORF have no standardized gene symbol, do not invent one! Leave the field blank...

This field will be crucial to your assessment: summarize the interpretations and hypotheses you have built from the observations in the preceding "ANALYSIS OF RESULTS" headings. Imagine that you are addressing a jury that is skeptical a priori and must be convinced! Argue, refer to the results obtained, back your statements with figures, cross-check the evidence, and watch your vocabulary: bioinformatic analysis cannot prove anything, so beware of statements such as "The sequence GOS_00000 comes from an alphaproteobacteria type XYZ". Separate the facts, your observations and your assumptions ("likely", "suggests", "putative")...

Make sure you have covered at least:

The arguments supporting your hypothesis (coding or non coding).
Your predictions about the protein's function, both in terms of potential biochemical mechanism (e.g "enzyme of the ubiquitin conjugation...") and, more broadly, of biological role within the organism (e.g "role in cell cycle control..."). Base these predictions on the functional annotations available for the homologs of your ORF whose function is known, for example in SWISSPROT records or in the descriptive records of Pfam / INTERPRO domains.
Your hypothesis on the taxonomic classification of the organism carrying this DNA fragment.

What not to do :

Explain how the software works or what its theoretical objectives are (assume that the reader knows how it works).
Explain which button you clicked (assume that the reader knows how to launch an online BLAST).
Write in text-message (SMS) style.
Pad, embellish or stretch your text in the hope of being assessed by weight... it is useless.
Repeat in extenso the raw results already presented in the appropriate fields.
Write everything in a single block without any structure.
Compartmentalize each analysis (you can, and should, refer for example to the multiple alignment when discussing the end of your ORF).
Conclude directly without any reference to observations.
Present assumptions without detailed and specific arguments.
Remain vague, for instance by citing BLAST homologs or conserved domains without giving their E-values.

Produce first and foremost a scientific argument that is concise, complete, accurate, quantified, structured and rigorous, and NOT a summary of all the steps.

Game rules