<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta content="text/html;charset=ISO-8859-1" http-equiv="Content-Type">
</head>
<body bgcolor="#ffffff" text="#000000">
<font face="Helvetica, Arial, sans-serif">I have just been re-reading
the 2007 Brugia genome paper in Science. Given the way the paper says
the genes were called, I'm now a bit surprised that the JIGSAW gene set
generated by Darin and the nGASP group scored better in Gary's anomalies
than the TIGR set. From the supplemental Methods section:<br>
<br>
2. Gene Finding<br>
The gene calling programs Augustus (10), FGENESH (11), GlimmerHMM (12)
and SNAP (13) were<br>
used to predict protein coding sequences. The final gene models were
then picked by JIGSAW (14).<br>
FGENESH was trained with the assembled WGS genomic sequences and ESTs
of B. malayi. All other<br>
ab initio gene finding programs and JIGSAW were trained by a set of 497
B. malayi sample genes<br>
which were curated manually. We used cloned B. malayi genes available
in GenBank that had both<br>
genomic and cDNA sequences. We also used a set of manually curated gene
models. The trained<br>
Augustus, FGENESH, GlimmerHMM and SNAP gene finders yielded the ab
initio gene calls, and the<br>
pre-trained JIGSAW picked the final gene models by incorporating:<br>
(1) the output of a BLASTX search against the NCBI nonredundant protein
database,<br>
(2) the B. malayi EST data aligned by PASA (15),<br>
(3) B. malayi tgi (TIGR gene index) data aligned by BLASTN,<br>
(4) cDNA data of other nematode genomes (complete predicted
transcriptomes of C. elegans and C.<br>
briggsae) aligned by TBLASTX<br>
(5) EST data from closely related filarial nematodes (D. immitis, O.
volvulus and Wuchereria bancrofti)<br>
aligned by TBLASTX and<br>
(6) the gene models predicted by the gene finding programs.<br>
Some manual annotation was conducted to fix the gene splits, gene
fusion and other prediction errors<br>
based on the cDNA, EST or homolog evidence.<br>
The output of the gene-finding programs was assessed prior to
performing the final B. malayi genome<br>
gene prediction. The gene sensitivity and exon sensitivity of JIGSAW
were 54.1% and 88.7%,<br>
respectively. The gene sensitivities and exon sensitivities of the four
gene finding programs ranged<br>
from 20% to 33% and 73.6% to 79.5% (detailed data not shown).<br>
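<br>
The excerpt does not say how these sensitivities were computed. As a rough sketch, assuming the usual convention (a reference gene counts as found only if every one of its exons is predicted exactly, while a reference exon counts on its own), the gap between the gene-level and exon-level figures falls out naturally. The coordinates and the sensitivity helper below are made up for illustration and are not from the paper:<br>

```python
# Assumption: exact-match sensitivity = (reference items reproduced exactly)
# divided by (all reference items). This is a common convention, not
# necessarily the one used in the paper.

def sensitivity(reference, predicted):
    """Fraction of reference items that appear exactly in the predictions."""
    reference = set(reference)
    if not reference:
        raise ValueError("reference set is empty")
    return len(reference.intersection(predicted)) / len(reference)

# Hypothetical toy data: each gene is a tuple of exons,
# each exon is (sequence, start, end, strand).
ref_genes = [
    (("c1", 100, 200, "+"), ("c1", 300, 400, "+"), ("c1", 500, 600, "+")),
    (("c1", 1000, 1100, "-"), ("c1", 1200, 1300, "-")),
]
pred_genes = [
    (("c1", 100, 200, "+"), ("c1", 300, 400, "+"), ("c1", 500, 600, "+")),
    # One exon boundary is off by 10 bp, so the whole gene is a miss.
    (("c1", 1000, 1100, "-"), ("c1", 1210, 1300, "-")),
]

ref_exons = [exon for gene in ref_genes for exon in gene]
pred_exons = [exon for gene in pred_genes for exon in gene]

gene_sn = sensitivity(ref_genes, pred_genes)   # 1 of 2 genes exact
exon_sn = sensitivity(ref_exons, pred_exons)   # 4 of 5 exons exact
print(f"gene sensitivity: {gene_sn:.1%}, exon sensitivity: {exon_sn:.1%}")
```

Under that assumption, a single wrong exon boundary costs a whole gene but only one exon, which is one way a program can have a high exon sensitivity and a much lower gene sensitivity.<br>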
<br>
Anyone have thoughts on this?<br>
<br>
thanks,<br>
<br>
John<br>
<br>
<br>
</font><br>
Gary Williams wrote:
<blockquote cite="mid:487B1792.6080406@sanger.ac.uk" type="cite">John,
<br>
<br>
The results of the curation anomalies in Brugia for the TIGR and Jigsaw
gene predictions are as follows:
<br>
<br>
<br>
TIGR JIGSAW
<br>
<br>
UNMATCHED_PROTEIN 2695 648
<br>
<br>
Jigsaw looks to be very significantly better at spotting homologous
regions where there are protein alignments and incorporating them into
gene structures.
<br>
<br>
OVERLAPPING_EXONS 22 0
<br>
<br>
I am surprised that there were any overlapping exons from different
genes on opposite strands in the TIGR prediction. This is poor.
<br>
<br>
WEAK_INTRON_SPLICE_SITE 6340 9623
<br>
<br>
Jigsaw uses a significantly greater number of poor-scoring splice
sites. This appears to be because it tries harder to predict gene
models across pseudogenic regions.
<br>
<br>
SPLIT_GENES_BY_PROTEIN 323 438
<br>
MERGE_GENES_BY_PROTEIN 156 236
<br>
<br>
I would not put much value on this difference; both sets of predictions
appear to have trouble with merging and splitting predictions in
regions such as pseudogenes and duplicated pairs of genes.
<br>
<br>
REPEAT_OVERLAPS_EXON 864 815
<br>
<br>
No great difference.
<br>
<br>
UNCONFIRMED_INTRON 44228 40895
<br>
<br>
A slightly greater number of introns confirmed by ESTs or mRNAs were
missed by TIGR than by Jigsaw. These figures count each individual
EST or mRNA supporting an intron that the prediction missed, so
strongly expressed regions are counted more often than regions with
only a few ESTs/mRNAs.
<br>
<br>
If the number of unique missed introns is counted instead of the
number of transcripts across the introns, then we get:
<br>
TIGR: 20,787 and Jigsaw: 18,993
<br>
so Jigsaw comes out ahead again.
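<br>
<br>
To make the difference between the two ways of counting concrete, here is a minimal sketch. It is not the actual anomalies code; the function name, the tuple layout and the toy coordinates are all assumptions for illustration:
<br>

```python
# Hypothetical sketch: count introns supported by transcript evidence
# but absent from a gene prediction set, in two ways.
# An intron is represented as (sequence, start, end, strand).

def count_missed_introns(transcript_introns, predicted_introns):
    """Return (per_transcript_count, unique_count) of missed introns.

    transcript_introns: iterable of (transcript_id, intron) pairs, one
        pair for every EST/mRNA alignment that supports an intron.
    predicted_introns: set of introns present in the gene predictions.
    """
    missed = [intron for _tid, intron in transcript_introns
              if intron not in predicted_introns]
    # len(missed) counts an intron once per supporting transcript;
    # len(set(missed)) counts each distinct intron once.
    return len(missed), len(set(missed))

# Toy data: three ESTs support the same intron, so it is counted
# three times in the per-transcript figure but once in the unique one.
evidence = [
    ("EST1", ("contig1", 100, 200, "+")),
    ("EST2", ("contig1", 100, 200, "+")),
    ("EST3", ("contig1", 100, 200, "+")),
    ("mRNA1", ("contig1", 500, 650, "+")),
    ("EST4", ("contig2", 40, 90, "-")),
]
predicted = {("contig2", 40, 90, "-")}

per_transcript, unique = count_missed_introns(evidence, predicted)
print(per_transcript, unique)  # 4 2
```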
<br>
<br>
EST_OVERLAPS_INTRON 2 0
<br>
<br>
This is the number of predicted introns with an EST transcript running
across them. This does not look significant.
<br>
<br>
SHORT_INTRON 59 137
<br>
<br>
It looks as though Jigsaw tries harder to generate a gene model over
difficult (pseudogenic?) regions and will create a short intron over a
frameshift.
<br>
<br>
<br>
Result: Jigsaw is significantly better at making correct gene models,
but also tries to make them even in inappropriate pseudogenic regions.
<br>
<br>
Gary
<br>
<br>
<br>
John Spieth wrote:
<br>
<blockquote type="cite">
<blockquote type="cite">
<blockquote type="cite">
<blockquote type="cite">
<blockquote type="cite">Hi Gary,
<br>
<br>
Have you had time yet to generate brugia anomalies using the TIGR and
JIGSAW gene sets?
<br>
<br>
thanks,
<br>
<br>
John
<br>
</blockquote>
</blockquote>
</blockquote>
</blockquote>
</blockquote>
<br>
<br>
</blockquote>
</body>
</html>