[Ngasp-help] Re: brugia anomalies

John Spieth jspieth at watson.wustl.edu
Tue Jul 15 14:03:19 EDT 2008


I have just been re-reading the 2007 Brugia genome paper in Science.
I'm now a bit surprised that the JIGSAW gene set generated by Darin and
the nGASP group scored better in Gary's anomalies than the TIGR set,
given the way the paper says the genes were called.  From the
supplemental Methods section:

2. Gene Finding
The gene calling programs Augustus (10), FGENESH (11), GlimmerHMM (12)
and SNAP (13) were used to predict protein coding sequences. The final
gene models were then picked by JIGSAW (14).
FGENESH was trained with the assembled WGS genomic sequences and ESTs of
B. malayi. All other ab initio gene finding programs and JIGSAW were
trained by a set of 497 B. malayi sample genes which were curated
manually. We used cloned B. malayi genes available in GenBank that had
both genomic and cDNA sequences. We also used a set of manually curated
gene models. The trained Augustus, FGENESH, GlimmerHMM and SNAP gene
finders yielded the ab initio gene calls, and the pre-trained JIGSAW
picked the final gene models by incorporating:
(1) the output of a BLASTX search against the NCBI nonredundant protein
    database,
(2) the B. malayi EST data aligned by PASA (15),
(3) B. malayi tgi (TIGR gene index) data aligned by BLASTN,
(4) cDNA data of other nematode genomes (complete predicted
    transcriptomes of C. elegans and C. briggsae) aligned by TBLASTX
(5) EST data from closely related filarial nematodes (D. immitis, O.
    volvulus and Wuchereria bancrofti) aligned by TBLASTX and
(6) the gene models predicted by the gene finding programs.
Some manual annotation was conducted to fix the gene splits, gene fusion
and other prediction errors based on the cDNA, EST or homolog evidence.
The output of the gene-finding programs was assessed prior to performing
the final B. malayi genome gene prediction. The gene sensitivity and
exon sensitivity of JIGSAW were 54.1% and 88.7%, respectively. The gene
sensitivities and exon sensitivities of the four gene finding programs
ranged from 20% to 33% and 73.6% to 79.5% (detailed data not shown).
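
For reference, a minimal sketch of how gene and exon sensitivity are
commonly computed. The excerpt does not state the paper's matching
rules, so exact-coordinate matching is an assumption here, and the
function names and toy coordinates are hypothetical. It also shows why
exon sensitivity is normally much higher than gene sensitivity: one
wrong exon boundary is enough to make the whole gene count as missed.

```python
from typing import FrozenSet, Iterable, Tuple

# An exon is (contig, start, end, strand); a gene model is the set of its exons.
Exon = Tuple[str, int, int, str]
Gene = FrozenSet[Exon]


def exon_sensitivity(reference: Iterable[Gene], predicted: Iterable[Gene]) -> float:
    """Fraction of reference exons reproduced with identical coordinates."""
    ref_exons = {exon for gene in reference for exon in gene}
    pred_exons = {exon for gene in predicted for exon in gene}
    return len(ref_exons & pred_exons) / len(ref_exons) if ref_exons else 0.0


def gene_sensitivity(reference: Iterable[Gene], predicted: Iterable[Gene]) -> float:
    """Fraction of reference genes whose complete exon set is reproduced."""
    ref_genes = set(reference)
    pred_genes = set(predicted)
    return len(ref_genes & pred_genes) / len(ref_genes) if ref_genes else 0.0


# Toy data (invented for illustration, not taken from the paper).
reference = [
    frozenset({("ctg1", 100, 200, "+"), ("ctg1", 300, 400, "+")}),
    frozenset({("ctg1", 900, 1000, "-"), ("ctg1", 1100, 1200, "-")}),
]
predicted = [
    # Exact match to the first reference gene.
    frozenset({("ctg1", 100, 200, "+"), ("ctg1", 300, 400, "+")}),
    # Second gene has one exon boundary off by 10 bp.
    frozenset({("ctg1", 900, 1000, "-"), ("ctg1", 1100, 1210, "-")}),
]

print(exon_sensitivity(reference, predicted))  # 3 of 4 reference exons -> 0.75
print(gene_sensitivity(reference, predicted))  # 1 of 2 reference genes -> 0.5
```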

Anyone have thoughts on this?

thanks,

John



Gary Williams wrote:
> John,
>
> The results of the curation anomalies in Brugia for the TIGR and 
> Jigsaw gene predictions are as follows:
>
>
>                         TIGR        JIGSAW
>
> UNMATCHED_PROTEIN       2695        648
>
> Jigsaw looks to be very significantly better at spotting homologous 
> regions where there are protein alignments and incorporating them into 
> gene structures.
>
> OVERLAPPING_EXONS       22          0
>
> I am surprised that there were any overlapping exons from different 
> genes on opposite strands in the TIGR prediction. This is poor.
>
> WEAK_INTRON_SPLICE_SITE 6340        9623
>
> Jigsaw uses a significantly greater number of poor-scoring splice 
> sites. This appears to be because it tries harder to predict gene 
> models across pseudogenic regions.
>
> SPLIT_GENES_BY_PROTEIN  323         438
> MERGE_GENES_BY_PROTEIN  156         236
>
> I would not put much value on this difference; both sets of predictions 
> appear to have trouble with merging and splitting predictions in 
> regions such as pseudogenes and duplicated pairs of genes.
>
> REPEAT_OVERLAPS_EXON    864         815
>
> No great difference.
>
> UNCONFIRMED_INTRON      44228       40895
>
> A slightly greater number of introns confirmed by ESTs or mRNAs were 
> missed by TIGR than by Jigsaw. These figures were for each individual 
> EST or mRNA predicting an intron that was missed by the prediction, so 
> strongly expressed regions will be counted more than regions with only 
> a few ESTs/mRNA.
>
> If the number of unique missed introns is counted instead of the 
> number of transcripts across the introns, then we get:
> TIGR: 20,787 and Jigsaw: 18,993
> so Jigsaw comes out better again.
>
> EST_OVERLAPS_INTRON     2           0
>
> This is the number of predicted introns with an EST transcript running 
> across them. This does not look significant.
>
> SHORT_INTRON            59          137
>
> It looks like Jigsaw tries harder to generate a gene model over 
> difficult (pseudogenic?) regions and will create a short intron over a 
> frameshift.
>
>
> Result: Jigsaw is significantly better at making correct gene models, 
> but also tries to make them even in inappropriate pseudogenic regions.
>
> Gary
>
>
> John Spieth wrote:
>>>>>> Hi Gary,
>>>>>>
>>>>>> Have you had time yet to generate brugia anomalies using the TIGR 
>>>>>> and JIGSAW gene sets?
>>>>>>
>>>>>> thanks,
>>>>>>
>>>>>> John
>
>
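
The two ways of counting UNCONFIRMED_INTRON described in Gary's message
(once per supporting EST/mRNA versus once per distinct intron) can be
sketched as follows. This is only an illustration under assumed data
structures, not the actual anomaly-finding code: the function name, the
tuple layout and the toy coordinates are all hypothetical.

```python
from collections import Counter
from typing import Iterable, Set, Tuple

# An intron is (contig, start, end, strand).
Intron = Tuple[str, int, int, str]


def missed_intron_counts(
    transcript_introns: Iterable[Tuple[str, Intron]],
    predicted_introns: Set[Intron],
) -> Tuple[int, int]:
    """Return (per_transcript_count, unique_count) of missed introns.

    transcript_introns holds one (transcript_id, intron) pair for every
    intron implied by an aligned EST or mRNA. An intron is "missed" when
    it is absent from the predicted gene set.
    """
    support = Counter(
        intron
        for _transcript_id, intron in transcript_introns
        if intron not in predicted_introns
    )
    per_transcript = sum(support.values())  # strongly expressed loci count many times
    unique = len(support)                   # each distinct intron counts once
    return per_transcript, unique


# Toy data (invented for illustration).
intron_a = ("ctg1", 201, 299, "+")
intron_b = ("ctg1", 1001, 1099, "-")
intron_c = ("ctg2", 501, 620, "+")

evidence = [
    ("EST1", intron_a),
    ("EST2", intron_a),
    ("EST3", intron_a),
    ("mRNA1", intron_b),
    ("EST4", intron_c),
]
predicted = {intron_c}  # the gene set only contains intron_c

print(missed_intron_counts(evidence, predicted))  # (4, 2)
```

Here intron_a is missed once but supported by three ESTs, so it
contributes three to the per-transcript figure and one to the unique
figure, which is the same effect that separates the 44228/40895 counts
from the 20,787/18,993 counts above.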