From jspieth at watson.wustl.edu Tue Jul 15 14:03:19 2008
From: jspieth at watson.wustl.edu (John Spieth)
Date: Tue, 15 Jul 2008 13:03:19 -0500
Subject: [Ngasp-help] Re: brugia anomalies
In-Reply-To: <487B1792.6080406@sanger.ac.uk>
References: <4873B896.6090607@watson.wustl.edu> <48748337.1050801@sanger.ac.uk> <4874B254.3050001@watson.wustl.edu> <4874B834.6020209@sanger.ac.uk> <4874D084.2080507@watson.wustl.edu> <487B1792.6080406@sanger.ac.uk>
Message-ID: <487CE667.7090507@watson.wustl.edu>

I have just been re-reading the 2007 Brugia genome paper in Science. I'm now a bit surprised that the JIGSAW gene set generated by Darin and the nGASP group scored better in Gary's anomalies than the TIGR set, given the way the paper says the genes were called. From the supplemental Methods section:

2. Gene Finding

The gene calling programs Augustus (10), FGENESH (11), GlimmerHMM (12) and SNAP (13) were used to predict protein coding sequences. The final gene models were then picked by JIGSAW (14). FGENESH was trained with the assembled WGS genomic sequences and ESTs of B. malayi. All other ab initio gene finding programs and JIGSAW were trained by a set of 497 B. malayi sample genes which were curated manually. We used cloned B. malayi genes available in GenBank that had both genomic and cDNA sequences. We also used a set of manually curated gene models. The trained Augustus, FGENESH, GlimmerHMM and SNAP gene finders yielded the ab initio gene calls, and the pre-trained JIGSAW picked the final gene models by incorporating: (1) the output of a BLASTX search against the NCBI nonredundant protein database, (2) the B. malayi EST data aligned by PASA (15), (3) B. malayi tgi (TIGR gene index) data aligned by BLASTN, (4) cDNA data of other nematode genomes (complete predicted transcriptomes of C. elegans and C. briggsae) aligned by TBLASTX, (5) EST data from closely related filarial nematodes (D. immitis, O. volvulus and Wuchereria bancrofti) aligned by TBLASTX and (6) the gene models predicted by the gene finding programs. Some manual annotation was conducted to fix the gene splits, gene fusion and other prediction errors based on the cDNA, EST or homolog evidence. The output of the gene-finding programs was assessed prior to performing the final B. malayi genome gene prediction. The gene sensitivity and exon sensitivity of JIGSAW were 54.1% and 88.7%, respectively. The gene sensitivities and exon sensitivities of the four gene finding programs ranged from 20% to 33% and 73.6% to 79.5% (detailed data not shown).

Anyone have thoughts on this?

thanks,
John

Gary Williams wrote:
> John,
>
> The results of the curation anomalies in Brugia for the TIGR and
> Jigsaw gene predictions are as follows:
>
>                              TIGR   JIGSAW
>
> UNMATCHED_PROTEIN            2695      648
>
> Jigsaw looks to be very significantly better at spotting homologous
> regions where there are protein alignments and incorporating them into
> gene structures.
>
> OVERLAPPING_EXONS              22        0
>
> I am surprised that there were any overlapping exons from different
> genes on opposite strands in the TIGR prediction. This is poor.
>
> WEAK_INTRON_SPLICE_SITE      6340     9623
>
> Jigsaw uses a significantly greater number of poor-scoring splice
> sites. This appears to be because it tries harder to predict gene
> models across pseudogenic regions.
>
> SPLIT_GENES_BY_PROTEIN        323      438
> MERGE_GENES_BY_PROTEIN        156      236
>
> I would not put much value on this difference; both sets of predictions
> appear to have trouble with merging and splitting predictions in
> regions such as pseudogenes and duplicated pairs of genes.
>
> REPEAT_OVERLAPS_EXON          864      815
>
> No great difference.
>
> UNCONFIRMED_INTRON          44228    40895
>
> A slightly greater number of introns confirmed by ESTs or mRNAs were
> missed by TIGR than by Jigsaw. These figures were for each individual
> EST or mRNA predicting an intron that was missed by the prediction, so
> strongly expressed regions will be counted more than regions with only
> a few ESTs/mRNAs.
>
> If the number of unique missed introns is counted instead of the
> number of transcripts across the introns, then we get:
> TIGR: 20,787 and Jigsaw: 18,993
> so Jigsaw comes out as the best again.
>
> EST_OVERLAPS_INTRON             2        0
>
> This is the number of predicted introns with an EST transcript running
> across them. This does not look significant.
>
> SHORT_INTRON                   59      137
>
> This looks like Jigsaw tries harder to generate a gene model over
> difficult (pseudogenic?) regions and will create a short intron over a
> frameshift.
>
>
> Result: Jigsaw is significantly better at making correct gene models,
> but also tries to make them even in inappropriate pseudogenic regions.
>
> Gary
>
>
> John Spieth wrote:
>>>>>> Hi Gary,
>>>>>>
>>>>>> Have you had time yet to generate brugia anomalies using the TIGR
>>>>>> and JIGSAW gene sets?
>>>>>>
>>>>>> thanks,
>>>>>>
>>>>>> John

From fiedler at fit.edu Wed Jul 16 13:30:30 2008
From: fiedler at fit.edu (Dr. Tristan J. Fiedler)
Date: Wed, 16 Jul 2008 13:30:30 -0400
Subject: [Ngasp-help] Fwd: nGASP manuscript
References: <20080716172720.51E93E489E@fileserver.binf.ku.dk>
Message-ID:

Anders will not be able to comment, as he is away until Aug 4.

TJF

Begin forwarded message:

From: Anders Krogh
Date: July 16, 2008 1:27:20 PM EDT
To: fiedler at fit.edu
Subject: Re: nGASP manuscript

I am away until Aug 4th.

- Anders

--
Anders Krogh                 krogh at binf.ku.dk
Professor                    www.binf.ku.dk
The Bioinformatics Centre
Dept of Biology              Ph.  +45 3532 1329
University of Copenhagen     Secr +45 3532 2003
Ole Maaloes Vej 5            Dept +45 3532 3710
2200 Copenhagen, Denmark     Fax  +45 3532 1281

From fiedler at fit.edu Wed Jul 16 13:36:28 2008
From: fiedler at fit.edu (Dr. Tristan J. Fiedler)
Date: Wed, 16 Jul 2008 13:36:28 -0400
Subject: [Ngasp-help] Fwd: nGASP manuscript
References: <24c96eca0807161033w308b65b3hde8fd9af7ee51c4b@mail.gmail.com>
Message-ID: <6D532673-B78A-4FDC-B97A-04EF671089EC@fit.edu>

Begin forwarded message:

From: "Aaron Mackey"
Date: July 16, 2008 1:33:18 PM EDT

Evigan has now been published, so you can update Table 1:

http://bioinformatics.oxfordjournals.org/cgi/content/full/24/5/597

Thanks,
-Aaron

On Wed, Jul 16, 2008 at 1:26 PM, Dr. Tristan J. Fiedler wrote:

Dear nGASP Participants,

We thank you again for your participation in nGASP.

The nGASP analysis team has now written up the results of nGASP as a paper, which we plan to submit to BMC Bioinformatics.

As agreed, we are sending you a copy of the draft manuscript for your perusal before submission.

We would be very grateful if you can let us know if you have any major comments on the draft manuscript by Thursday 24th July.

Comments may be sent to ngasp-help at wormbase.org

Yours sincerely,

The nGASP analysis team.

From fiedler at fit.edu Wed Jul 16 17:04:15 2008
From: fiedler at fit.edu (Dr. Tristan J. Fiedler)
Date: Wed, 16 Jul 2008 17:04:15 -0400
Subject: [Ngasp-help] Fwd: nGASP manuscript
References: <139a4dc30807161216l95247datadd51389dc89fb4b@mail.gmail.com>
Message-ID: <31DACE49-C1D8-4292-85D0-2AF58AB9E921@fit.edu>

Begin forwarded message:

From: "Mario Stanke"
Date: July 16, 2008 3:16:08 PM EDT
To: "Dr. Tristan J. Fiedler"
Subject: Re: nGASP manuscript

Dear Tristan,

thanks again for organizing this. This paper is of great value for the community. I have just one suggestion: Could you please replace citation 11 for AUGUSTUS with this one:

http://bioinformatics.oxfordjournals.org/cgi/content/abstract/24/5/637?ijkey=AqOiFZBiTC5VhDS&keytype=ref

Mario Stanke, Mark Diekhans, Robert Baertsch, and David Haussler
Using native and syntenically mapped cDNA alignments to improve de novo gene finding
DOI 10.1093/bioinformatics/btn013. Bioinformatics 24: 637-644.

This more recent paper reflects much better the methods I used for the nGASP version of AUGUSTUS.

Thanks and best regards,
Mario

From fiedler at fit.edu Wed Jul 16 17:06:25 2008
From: fiedler at fit.edu (Dr. Tristan J. Fiedler)
Date: Wed, 16 Jul 2008 17:06:25 -0400
Subject: [Ngasp-help] Fwd: nGASP manuscript
References: <487E51DC.5030207@umd.edu>
Message-ID:

Begin forwarded message:

From: Steven Salzberg
Date: July 16, 2008 3:54:04 PM EDT
To: "Dr. Tristan J. Fiedler", alc at sanger.ac.uk, lstein at cshl.edu, Steven Salzberg
Subject: Re: nGASP manuscript

hi Tristan, Avril, and Lincoln,

Thanks for sending the manuscript. I've not had time to read it closely yet, but I noticed one problem that I want to point out. This is an error that many of us (including me) in the gene-finding community have made before, but now that I know about it I want to avoid it in the future.

The problem is our use of the term "specificity." The way you (we) have used it in the manuscript follows the usage in the EGASP competition, which also got it wrong. Our definition in EGASP was the percentage of a gene finder's predictions that were correct; i.e.:

(# correct predictions)/(total # predictions)

However, a good friend and colleague of mine (a biostatistician) pointed out that this measure should instead be called "precision." You can find standard definitions of sensitivity and specificity in any text, and also in Wikipedia:

http://en.wikipedia.org/wiki/Sensitivity_and_specificity

The proper definition of "specificity" is the ratio of true negatives to all actual negatives (true negatives plus false positives). This isn't really meaningful in our context, because we don't attempt to predict non-gene regions. (Another way to look at this is that we aren't taking a putative gene and saying yes/no.) In fact, we don't even have a good way to say with certainty that a region isn't a gene, so we just look at positive predictions.

The other term for what we're measuring is "positive predictive value" (PPV):

http://en.wikipedia.org/wiki/Positive_predictive_value

although I like "precision" better. I think you'll agree that this is what the EGASP competition was calling "specificity" - and it's been used this way in previous papers too. But this definition is quite confusing to statisticians, and I think we should revert to the standard usage.

A simple global replace of "specificity" with "precision" will probably fix the manuscript, though it would be best to check carefully.

I hope you'll agree.

Steven

Steven L. Salzberg, Ph.D.
Horvitz Professor of Computer Science
Director, Center for Bioinformatics and Computational Biology
3125 Biomolecular Sciences Building
University of Maryland, College Park, MD 20742
Phone: 301-405-9611
Email: salzberg at umd.edu
Blog: http://genefinding.blogspot.com
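
The terminology in the message above can be made concrete with a minimal sketch. This is not code from nGASP or EGASP; the function name, the exon coordinates, and the exact-coordinate-match criterion for a "correct" prediction are all assumptions made for illustration. It computes sensitivity, TP/(TP+FN), and precision (PPV), TP/(TP+FP), from a set of predicted exons and a set of annotated exons. Specificity, TN/(TN+FP), is deliberately left out because, as the message notes, there are no confirmed non-gene regions from which to count true negatives.

```python
def sensitivity_and_precision(predicted, annotated):
    """Return (sensitivity, precision) for two collections of exons.

    Each exon is a hashable tuple such as (start, end). A prediction is
    counted as correct only if it matches an annotated exon exactly
    (an assumption of this sketch, not a rule taken from the thread).
    """
    predicted = set(predicted)
    annotated = set(annotated)
    true_positives = len(predicted & annotated)
    # Sensitivity: fraction of annotated exons that were predicted.
    sn = true_positives / len(annotated) if annotated else 0.0
    # Precision (PPV): fraction of predictions that were correct.
    # This is the quantity EGASP-style papers have called "specificity".
    pr = true_positives / len(predicted) if predicted else 0.0
    return sn, pr


# Hypothetical data: four annotated exons, three predictions, two correct.
annotated_exons = [(100, 200), (300, 400), (500, 600), (700, 800)]
predicted_exons = [(100, 200), (300, 400), (900, 950)]
sn, pr = sensitivity_and_precision(predicted_exons, annotated_exons)
print(f"sensitivity={sn:.2f} precision={pr:.2f}")
```

With these made-up inputs, two of four annotated exons are recovered and two of three predictions are correct, so the two measures differ even though both are computed from positive predictions only.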

From fiedler at fit.edu Fri Jul 18 10:36:48 2008
From: fiedler at fit.edu (Tristan Fiedler)
Date: Fri, 18 Jul 2008 10:36:48 -0400
Subject: [Ngasp-help] Fwd: nGASP manuscript
References: <4880A7C1.4030700@cshl.edu>
Message-ID: <66441AF6-3BBD-4C53-9E0A-EB7D30FB0AB6@fit.edu>

Does anything need to be done regarding Chengzhi's comment below?

TJF

Begin forwarded message:

From: Chengzhi Liang
Date: July 18, 2008 10:25:05 AM EDT
To: "Dr. Tristan J. Fiedler"
Subject: Re: nGASP manuscript

Hi Tristan,

Thanks for the manuscript. I think I have finally found out what Richard(?) was talking about when he told me once that the genes I provided in nGASP had some inconsistency: some with a stop codon, some without. This is because I used two separate Ensembl modules to build all the genes. One of them appends the stop codon at the end of the CDS, but the other doesn't. I didn't notice this before because I never compared the genes built by these two modules at the CDS level. The protein level is always the same. I don't know how much this affected my results, but maybe you guys want to know this.

Chengzhi

From fiedler at fit.edu Mon Jul 21 22:59:54 2008
From: fiedler at fit.edu (Tristan Fiedler)
Date: Mon, 21 Jul 2008 22:59:54 -0400 (EDT)
Subject: [Ngasp-help] Manuscript comments - by MAKER group
Message-ID: <35998.65.33.111.147.1216695594.squirrel@webaccess.fit.edu>

From Mark Yandell:

Revised manuscript attached (MSWord doc)

As you can see they are very, very minor; basically I just want the manuscript to clarify that this is MAKER using SNAP, rather than MAKER producing its own ab initio predictions. I think making the distinction helps to clarify the extent to which MAKER improves on the SNAP ab initio predictions. A pretty minor point I know, but it's an important one I think.

-------------- next part --------------
A non-text attachment was scrubbed...
Name: ngasp_16jul08b_alc_my.doc
Type: application/msword
Size: 1645056 bytes
Desc: not available

From tv35 at cornell.edu Tue Jul 22 17:35:46 2008
From: tv35 at cornell.edu (Tomas Vinar)
Date: Tue, 22 Jul 2008 17:35:46 -0400
Subject: [Ngasp-help] Re: nGASP manuscript
In-Reply-To:
References:
Message-ID:

Hello,

The paper is a very nice summary of the experiment. I have very few comments and corrections.

Corrections in Table 1:

- Augustus entry: there is only one author (Stanke), so writing "Stanke et al." is perhaps not the best

- ExonHunter entry: there are two authors, so it would be nicer to write Brejova and Vinar (instead of Vinar et al.), or at least please change this to Brejova et al.

Other comments:

- in abstract: "There was a tie for the third place..." -> this sentence implies that cross-species gene finding does not work in C. elegans. That may be a very sensitive point with some referees, and I don't feel that the corresponding section in the "results" gives good enough discussion on this subject. I would remove the sentence from the abstract so as not to immediately make it a discussion point, especially if better analysis cannot be made.

- in results: comparisons between EGASP and nGASP numbers. I am not sure that the direct comparison of numbers between EGASP and nGASP is a good idea. From various experiments in the past, it seems clear that the absolute numbers in Sn and Sp can change dramatically depending on the particular testing set, even within species. Considering that there are some substantial differences in methodologies between EGASP and nGASP, while comparing the absolute numbers can give some information on general trends, I don't think some of the conclusions can be supported by the data, especially ones derived from comparison of gene level sensitivities and specificities.

- a short table giving an overview of numbers of genes/exons/bases and a basic comparison of EGASP and nGASP data sets (e.g., exon lengths, intron lengths, exon numbers, etc.) would be useful

- Table 3: Why is gene Sn given only to 1 decimal digit, while all the other numbers are given to 2 decimal digits? Also, I am not sure how much it is justified to give the data to 2 decimal digit precision, since 0.01 is not likely to be anywhere near a statistically significant difference in any of the measures

--
--------------------------------------------------------------------------
Tomas Vinar, Postdoctoral Researcher
Biological Statistics and Computational Biology
Cornell University
E-mail: tv35 at cornell.edu
Office: 169 Biotechnology Building
Work Phone: +1-607-255-7430

From tristan.fiedler at gmail.com Tue Jul 22 17:52:46 2008
From: tristan.fiedler at gmail.com (Tristan Fiedler)
Date: Tue, 22 Jul 2008 17:52:46 -0400
Subject: [Ngasp-help] Re: nGASP manuscript
In-Reply-To:
References:
Message-ID: <61DE0374-94B4-4CFD-9182-AD81EAA6F291@fit.edu>

Dear Tomas,

Thank you very much for your comments. We are currently reviewing them.

Through the support and participation of scientists such as yourself, the nGASP project has made significant contributions to the field of genome annotation.

Cheers,

Tristan

FL Tech Physics & Space Sciences Ortega Telescope Dedication
http://research.fit.edu/pssevent

FL Tech Biological Sciences Reunion & Symposium
http://research.fit.edu/biologysymposium

--
Tristan J. Fiedler, M.Sc., Ph.D.
Assistant Vice President of Advancement
Research Assistant Professor
Department of Biological Sciences
Florida Institute of Technology
150 W. University Blvd
Melbourne, FL 32901
e fiedler at fit.edu
o 321 674 7723
c 321 432 0721

From borodovsky at gatech.edu Wed Jul 23 22:35:24 2008
From: borodovsky at gatech.edu (Borodovsky, Mark)
Date: Wed, 23 Jul 2008 22:35:24 -0400 (EDT)
Subject: [Ngasp-help] corrections for the nGASP manuscript
Message-ID: <73234392.1071191216866924208.JavaMail.root@mail6.gatech.edu>

Hello,

... at the moment we have

1. The correct reference for the publication describing GeneMark.hmm 3.0 (an engine in a self-training procedure) is as follows:

Lomsadze, A., Ter-Hovhannisyan, V., Chernoff, Y.O. and Borodovsky, M. (2005) Gene identification in novel eukaryotic genomes by self-training algorithm. Nucleic Acids Res, 33, 6494-6506.

(please remove the reference to the 2001 publication)

2. Participants of the nGASP work: Alex Lomsadze, Vardges Ter-Hovhannisyan, Andrey Kislyuk and Mark Borodovsky. Please include all the names.

Thanks,
Mark

From allen99 at llnl.gov Thu Jul 24 19:41:21 2008
From: allen99 at llnl.gov (Jonathan E. Allen)
Date: Thu, 24 Jul 2008 16:41:21 -0700
Subject: [Ngasp-help] Re: nGASP manuscript
In-Reply-To:
References:
Message-ID: <48891321.7040201@llnl.gov>

> As agreed, we are sending you a copy of the draft manuscript
> for your perusal before submission.
>
> We would be very grateful if you can let us know if you have any
> major comments on the draft manuscript by Thursday 24th July.
>
> Comments may be sent to ngasp-help at wormbase.org

A few minor points.

The nGASP analysis team is probably already aware of this, but I just want to mention that the Brugia malayi annotation associated with the Science publication used JIGSAW and included Fgenesh and Augustus as input. I realize that the versions of Fgenesh and Augustus are likely very different, but some comparison of the annotations could be useful in highlighting potential limitations in the initial publication.

The JIGSAW performance results reported in the text don't use the same inputs as are proposed for generating the new annotations. Was a performance comparison done using just the three proposed input programs to confirm comparable performance?

I would be curious to know if the combiners were more robust for the genes with less common hexamers or whether they essentially suffered the same fate. I would guess that they would suffer the same problem as the input, but I don't know if it was mentioned in the text.

I also wish to offer my assistance if needed to address any difficulties in using JIGSAW to combine output from the other programs.

Typos:
"Johnathan" should be "Jonathan"
My address is
Lawrence Livermore National Laboratory, PO Box 808, L-174, Livermore, CA, 94551, USA

Regards,

Jonathan

From fiedler at fit.edu Thu Jul 24 20:36:48 2008
From: fiedler at fit.edu (Tristan Fiedler)
Date: Thu, 24 Jul 2008 20:36:48 -0400 (EDT)
Subject: [Ngasp-help] Re: nGASP manuscript
In-Reply-To: <48891321.7040201@llnl.gov>
References: <48891321.7040201@llnl.gov>
Message-ID: <35220.65.33.111.147.1216946208.squirrel@webaccess.fit.edu>

Dear Jonathan,

Thank you very much for commenting on the manuscript. We are reviewing your points now.

Sincerely,

Tristan

---
Tristan J. Fiedler, M.Sc., Ph.D.
Assistant Vice President for University Advancement
Research Assistant Professor - Department of Biological Sciences
Florida Institute of Technology
150 W. University Blvd
Melbourne, FL 32901
email : fiedler at fit.edu
office : 321 674 7723
cell : 321 432 0721

From fiedler at fit.edu Sat Jul 26 17:18:51 2008
From: fiedler at fit.edu (Tristan Fiedler)
Date: Sat, 26 Jul 2008 17:18:51 -0400 (EDT)
Subject: [Ngasp-help] Re: nGASP manuscript
In-Reply-To:
References:
Message-ID: <35480.65.33.111.147.1217107131.squirrel@webaccess.fit.edu>

Thank you for the comments. We are reviewing them now.

Cheers,

Tristan

> Hi Tristan,
>
> the worm genomics meeting in Cambridge kept me busy most of the day. I
> finally came to read the manuscript.
> I think the paper is already pretty good. I have a few comments (some
> of which came from the mGene team), which you may consider before
> submitting the manuscript:
>
> * I missed a more detailed discussion of the annotation dataset that
> has been used for evaluation. There are a few issues:
>   * For unconfirmed genes, how were they predicted? Did any of the
>   compared methods produce these predictions? This would lead to biases
>   in the evaluation, which should not be hidden.
>   * Also, for the EST-confirmed genes: these gene models were
>   generated using EST alignments, some manual curation and probably also
>   gene finder predictions. Except the manual curation, the input is the
>   same as for cat-3 gene finding. Who knows what the correct gene models
>   are...
>   (or something in this direction, you probably have your own
>   thoughts on this).
>
> * The evaluation for the combiners was done on only the 3' end of the
> regions. We noticed there is a significant difference in the
> performance between the 5' and 3' ends (several percent in transcript
> level). Hence, the performance on the 3' end and the performance on
> the whole region are not directly comparable. I don't necessarily want
> you to redo the evaluation, but at least this should be noted in the
> main text.
> Also, it is important to state in the main text that
> combiners had about 50% more data for training.
>
> * Since the result of the paper is that combiners are the method of
> choice, it would be very interesting to understand how important the
> accuracy and the choice of the base gene finders are. It would be
> really great if you could show some results, indicating which
> (sub-)set of (the three) gene finders leads to which performance.
>
> * I really would love to have the evaluation separately for each
> category. This would make the comparison between the methods
> considerably easier. It would be very nice to have, at least in the
> supplement.
>
> * One feature that came to our mind, which you do not seem to have
> checked for whether it leads to wrong predictions, is whether there is a
> gene on the opposite strand. Many gene finders only predict genes on
> one strand. mGene, for instance, predicts the genes independently on
> both strands.
>
> * It would be a great service to the gene finding community and also
> would facilitate reproducibility of this research, if you'd provide
> the evaluation scripts which lead to exactly the numbers given in the
> table in the supplement of this paper. (In any case I'd like to get
> them to evaluate new predictions that we have in the same way.)
>
> Finally, I'd like to ask you to reconsider the choice of the Journal.
> I think the paper would have a good chance in PLoS Comp Bio or Genome
> Biology, which I find considerably better suited for this work than
> BMC Bioinformatics.
>
> (Please let me know if anything is unclear or you need any additional
> information).
>
> Thanks a lot for your efforts in writing this manuscript and in
> organising the nGASP competition!
>
> All the best,
>
> Gunnar
>
> +-------------------------------------------------------------------+
> Gunnar Rätsch                  http://www.fml.mpg.de/raetsch
> Friedrich Miescher Laboratory  Gunnar.Raetsch at tuebingen.mpg.de
> Max Planck Society             Tel: (+49) 7071 601 820
> Spemannstraße 39, 72076 Tübingen, Germany  Fax: (+49) 7071 601 801

From fiedler at fit.edu Sat Jul 26 17:19:55 2008
From: fiedler at fit.edu (Tristan Fiedler)
Date: Sat, 26 Jul 2008 17:19:55 -0400 (EDT)
Subject: [Ngasp-help] Gunnar Raetsch's comments
Message-ID: <35502.65.33.111.147.1217107195.squirrel@webaccess.fit.edu>

---------------------------- Original Message ----------------------------
Subject: Re: nGASP manuscript
From: "Gunnar Raetsch"
Date: Sat, July 26, 2008 12:30 am
To: "Dr. Tristan J. Fiedler"
Cc: "Alexander Zien" "Gabriele Schweikert"
--------------------------------------------------------------------------

Hi Tristan,

the worm genomics meeting in Cambridge kept me busy most of the day. I finally came to read the manuscript. I think it is already in pretty good shape. I have a few comments, which you may consider before submitting the manuscript:

* I missed a more detailed discussion of the annotation dataset that has been used for evaluation. There are a few issues:
  * For unconfirmed genes, how were they predicted? Did any of the compared methods produce these predictions?
  * Also, for the EST-confirmed genes: these gene models were generated using EST alignments, some manual curation and probably also gene finder predictions.
    Except for the manual curation, the input is the same as for cat-3 gene finding. Who knows what the correct gene models are... (or something in this direction, you probably have your own thoughts on this).

* The evaluation for the combiners was done on only the 3' end of the regions. We noticed there is a significant difference in performance between the 5' and 3' ends (several percent at the transcript level). Hence, the performance on the 3' end and the performance on the whole region are not directly comparable. I don't necessarily want you to redo the evaluation, but at least this should be noted in the main text, along with the fact that combiners had about 50% more data for training.

* Since the result of the paper is that combiners are the method of choice, it would be very interesting to understand how important the accuracy and the choice of the base gene finders are. It would be really great if you could show some results indicating which (sub-)set of (the three) gene finders leads to which performance.

* I really would like the evaluation separately for each category. This would allow us to compare the contributions within the categories more easily. It would be very nice to have, at least in the supplement.

* One feature that came to mind, and that you did not check as a possible cause of wrong predictions, is whether there is a gene on the opposite strand. Many gene finders only predict genes on one strand. mGene, for instance, does not.

* It would be a great service to the gene finding community, and would also facilitate reproducibility of this research, if you'd provide the evaluation scripts which lead to exactly the numbers given in the table in the supplement of this paper. (In any case I'd like to get them to evaluate new predictions that we have in the same way.)

Finally, I'd like to ask you to reconsider the choice of the Journal.
I think it would have a good chance in PLoS Comp Bio or Genome Biology, which I find considerably better suited for this work than BMC Bioinformatics. Why not try it?

Thank you and the others for writing the manuscript and organising this competition!

All the best,

Gunnar

On 16.07.2008, at 19:26, Dr. Tristan J. Fiedler wrote:

> Dear nGASP Participants,
>
> We thank you again for your participation in nGASP.
>
> The nGASP analysis team has now written up the results of nGASP
> as a paper, which we plan to submit to BMC Bioinformatics.
>
> As agreed, we are sending you a copy of the draft manuscript
> for your perusal before submission.
>
> We would be very grateful if you can let us know if you have any
> major comments on the draft manuscript by Thursday 24th July.
>
> Comments may be sent to ngasp-help at wormbase.org
>
> Yours sincerely,
>
> The nGASP analysis team.

+-------------------------------------------------------------------+
Gunnar Rätsch http://www.fml.mpg.de/raetsch
Friedrich Miescher Laboratory Gunnar.Raetsch at tuebingen.mpg.de
Max Planck Society Tel: (+49) 7071 601 820
Spemannstraße 39, 72076 Tübingen, Germany Fax: (+49) 7071 601 801
From flicek at ebi.ac.uk Sun Jul 27 15:40:53 2008
From: flicek at ebi.ac.uk (Paul Flicek)
Date: Sun, 27 Jul 2008 20:40:53 +0100
Subject: [Ngasp-help] Re: [Ngasp-dev] Gunnar Raetsch's comments
In-Reply-To: <35502.65.33.111.147.1217107195.squirrel@webaccess.fit.edu>
References: <35502.65.33.111.147.1217107195.squirrel@webaccess.fit.edu>
Message-ID: <6A97FEFC-0689-4B93-A892-BB22827D2BE8@ebi.ac.uk>

Hi All,

Some comments below.

>
> * I missed a more detailed discussion of the annotation dataset that
> has been used for evaluation. There are a few issues:
> * For unconfirmed genes, how were they predicted? Did any of the
> compared methods produce these predictions?
> * Also, for the EST-confirmed genes: these gene models were
> generated using EST alignments, some manual curation and probably also
> gene finder predictions. Except for the manual curation, the input is
> the same as for cat-3 gene finding. Who knows what the correct gene
> models are...
> (or something in this direction, you probably have your own
> thoughts on this).

This is not a bad idea if it is easy to do.

>
> * The evaluation for the combiners was done on only the 3' end of the
> regions.
> We noticed there is a significant difference in
> performance between the 5' and 3' ends (several percent at the
> transcript level). Hence, the performance on the 3' end and the
> performance on the whole region are not directly comparable. I don't
> necessarily want you to redo the evaluation, but at least this should
> be noted in the main text, along with the fact that combiners had
> about 50% more data for training.
>

This is actually shocking if true. Why would the 3' portions of the regions be so much easier to predict? Are they enriched for the characteristics that Arvil found to be associated with difficult-to-predict genes?

I don't think that it is true that the combiners necessarily had more training data unless they also used all of the original training regions as well. But these did not have all of the gene predictions that they used in the combination, right?

> * Since the result of the paper is that combiners are the method of
> choice, it would be very interesting to understand how important the
> accuracy and the choice of the base gene finders are. It would be
> really great if you could show some results indicating which
> (sub-)set of (the three) gene finders leads to which performance.

This point has been covered in some previous combiner papers. Basically, better inputs lead to better outputs, and so the best individual gene predictions are most likely to be the ones that are most effective for the combiners. See J. Allen's original combiner paper in Genome Research and the GLEAN paper from earlier this year in Bioinformatics for examples.

>
> * I really would like the evaluation separately for each category.
> This would allow us to compare the contributions within the
> categories more easily. It would be very nice to have, at least in the
> supplement.

I thought we had this already.

>
> * One feature that came to mind, and that you did not check as a
> possible cause of wrong predictions, is whether there is a gene on
> the opposite strand.
> Many gene finders only predict genes on one strand. mGene, for
> instance, does not.

I don't see any reason to do this.

>
> * It would be a great service to the gene finding community, and
> would also facilitate reproducibility of this research, if you'd provide
> the evaluation scripts which lead to exactly the numbers given in the
> table in the supplement of this paper. (In any case I'd like to get
> them to evaluate new predictions that we have in the same way.)

This is possible, but difficult given the processing that took place on the submitted predictions before they could be read by the evaluation code. That said, I am supportive of this. But we would have to include much more detailed file munging methods in the supplement section.

>
> Finally, I'd like to ask you to reconsider the choice of the Journal.
> I think it would have a good chance in PLoS Comp Bio or Genome
> Biology, which I find considerably better suited for this work than
> BMC Bioinformatics. Why not try it?

I would support going to PLoS Comp Bio if people think that we have a chance there. Genome Biology is also a possibility, as they published EGASP. Remember, however, that EGASP went to Genome Biology as a supplement through their marketing department after their editorial turned down the idea. We were pitching the summary paper plus the methods papers, so it was a significantly different thing from just this.

Paul

>
> Thank you and the others for writing the manuscript and organising
> this competition!
>
> All the best,
>
> Gunnar
>
>
> On 16.07.2008, at 19:26, Dr. Tristan J. Fiedler wrote:
>
> Dear nGASP Participants,
>
> We thank you again for your participation in nGASP.
>
> The nGASP analysis team has now written up the results of nGASP
> as a paper, which we plan to submit to BMC Bioinformatics.
>
> As agreed, we are sending you a copy of the draft manuscript
> for your perusal before submission.
>
> We would be very grateful if you can let us know if you have any
> major comments on the draft manuscript by Thursday 24th July.
>
> Comments may be sent to ngasp-help at wormbase.org
>
> Yours sincerely,
>
> The nGASP analysis team.
>
> +-------------------------------------------------------------------+
> Gunnar Rätsch http://www.fml.mpg.de/raetsch
> Friedrich Miescher Laboratory Gunnar.Raetsch at tuebingen.mpg.de
> Max Planck Society Tel: (+49) 7071 601 820
> Spemannstraße 39, 72076 Tübingen, Germany Fax: (+49) 7071 601 801
From tristan.fiedler at gmail.com Tue Jul 29 14:58:59 2008
From: tristan.fiedler at gmail.com (Tristan Fiedler)
Date: Tue, 29 Jul 2008 14:58:59 -0400
Subject: [Ngasp-help] Re: A critical assessment of Mus musculus gene functio...[Genome Biol. 2008] - PubMed Result
In-Reply-To: <53FEBB63-8343-44FF-A9A0-755181528C90@tuebingen.mpg.de>
References: <53FEBB63-8343-44FF-A9A0-755181528C90@tuebingen.mpg.de>
Message-ID:

Dear Gunnar,

This is very good information. Thank you.

Cheers,
Tristan

FL Tech Physics & Space Sciences Ortega Telescope Dedication
http://research.fit.edu/pssevent
FL Tech Biological Sciences Reunion & Symposium
http://research.fit.edu/biologysymposium

--
Tristan J. Fiedler, M.Sc., Ph.D.
Assistant Vice President of Advancement Research Assistant Professor
Department of Biological Sciences
Florida Institute of Technology
150 W. University Blvd
Melbourne, FL 32901
e fiedler at fit.edu
o 321 674 7723
c 321 432 0721

On Jul 27, 2008, at 7:48 AM, Gunnar Rätsch wrote:

Hi Tristan,

as a follow-up to the suggestion of choosing another journal. Genome Biology recently published the results of a similar challenge:

http://www.ncbi.nlm.nih.gov/pubmed/18613946?dopt=AbstractPlus

Hth.
Cheers,
Gunnar