[Gmod-schema] [Gmod-help] massively parallel sequencing question

Thu May 29 13:17:55 EDT 2008

On Thu, May 29, 2008 at 11:57 AM, Sean Davis <sdavis2 at mail.nih.gov> wrote:
> On Thu, May 29, 2008 at 11:41 AM, Scott Cain <cain.cshl at gmail.com> wrote:
>> Hi Jennifer,
>>
>> I'm cc'ing the GMOD schema mailing list, because there have been other
>> people wondering the same thing.
>>
>> First I should say that I don't really know, because no one has tried
>> it.  That said, I can tell you that FlyBase's Chado database has several
>> million rows in its feature table and it works for them.  What you
>> would no doubt need is a database server with enough horsepower and
>> memory to do the job, and it would have to be properly tuned for
>> performance.
>>
>> For use with GBrowse, I don't think I would advocate using Chado
>> directly, as the Chado adaptor for GBrowse is significantly slower than
>> the Bio::DB::SeqFeature::Store database, which is designed specifically
>> to give speedy query results for GBrowse.  You could set up a
>> system where you use Chado as your working/annotation database and then
>> set up a periodic dump of your features to GFF3, which would get loaded
>> into a SeqFeature::Store database for use with GBrowse.
>>
>> Also, in the upcoming release of GBrowse there will be support for
>> wiggle tracks like in the UCSC browser, which will be well suited for
>> displaying things like coverage density in a fast-rendering way.
>
> I think that you will find that these data are HUGE and likely to
> overwhelm GBrowse and Chado.  These tools WILL work, but very slowly.
> We use the Affymetrix Integrated Genome Browser (IGB; freely available
> and not Affy-centric, despite the name).  It is VERY fast and is better
> suited to these high-throughput data.
>
> In short, I would advocate using MAQ to do the alignments.  MAQ has
> tools for producing text files from the (binary) alignment files as
> well as calling SNPs and producing wiggle-like tracks, all very
> quickly.  Then, use tools like Affy IGB to look at the data and even
> do some analysis.  Alternatively, it is pretty easy to load these
> tracks into the UCSC genome browser; we have found that browser is
> generally faster than GBrowse.
>
> Once you have regions of the genome of interest (for ChIP-seq-like
> applications) or SNP calls, use Chado to store those things.  I would
> really NOT advocate storing the raw sequence reads or
> alignments--there is just too much data for this to be useful.  (That
> is not to say that a database is not a good place to stick the data,
> but adding tens or hundreds of millions of rows of "raw" data is
> probably not the best use of the feature and featureloc tables.)
>
> That is just my $0.02 worth.  I'd love to hear others' experiences.

I should add here, though, that GBrowse and Chado will likely be very
useful for integrating the results of experiments once the data have
been processed.  I'm just suggesting that the preprocessing and the
handling of the raw data may best be done with other tools.
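
To make the "process first, then load a summary" idea concrete, here is a
rough Python sketch of turning aligned reads into a binned coverage track
in fixedStep wiggle format.  It is only an illustration, not how MAQ does
it: the (chrom, 0-based start, length) read tuples, the function names,
and the 100 bp bin size are all assumptions made up for this example.

```python
from collections import defaultdict


def binned_coverage(reads, bin_size=100):
    """Sum covered bases per bin.

    `reads` is an iterable of (chrom, start0, length) tuples with 0-based
    starts; this layout is an assumption for the sketch, not a MAQ format.
    """
    bins = defaultdict(lambda: defaultdict(int))
    for chrom, start, length in reads:
        pos, end = start, start + length
        while pos < end:
            b = pos // bin_size
            # A read may straddle a bin boundary; split it at the boundary.
            overlap = min(end, (b + 1) * bin_size) - pos
            bins[chrom][b] += overlap
            pos += overlap
    return bins


def to_wiggle(bins, bin_size=100, name="coverage"):
    """Render per-bin mean depth as fixedStep wiggle text (1-based starts)."""
    lines = [f'track type=wiggle_0 name="{name}"']
    for chrom in sorted(bins):
        chrom_bins = bins[chrom]
        lines.append(
            f"fixedStep chrom={chrom} start=1 step={bin_size} span={bin_size}"
        )
        # Emit every bin up to the last covered one, with 0 for empty bins.
        for b in range(max(chrom_bins) + 1):
            lines.append(f"{chrom_bins.get(b, 0) / bin_size:.2f}")
    return "\n".join(lines)


if __name__ == "__main__":
    example_reads = [("chr1", 0, 36), ("chr1", 90, 36)]
    print(to_wiggle(binned_coverage(example_reads)))
```

A track like this has one value per bin rather than one row per read,
which is the kind of summary that stays manageable in a browser.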

Sean

>> On Thu, 2008-05-29 at 08:38 -0600, Jennifer Beane wrote:
>>> Hi,
>>>
>>> I'm a post-doctoral fellow in bioinformatics and my lab is about to
>>> receive data generated from a massively parallel sequencing platform
>>> --  Illumina's Genome Analyzer.  The data will contain several million
>>> short sequence reads from mRNA and microRNA.  There are several
>>> software packages to align the reads to the human genome, but I will
>>> need to create a way to store, filter, and efficiently annotate these
>>> reads.  I'm thinking of loading the data into a Chado database, and
>>> using applications such as GBrowse to view the data.  I'm wondering if
>>> you have any experience with using GMOD software/applications for this
>>> type of data?  I'm wondering if the data will be too extensive to be
>>> queried in a database?  If you have any advice/suggestions I would
>>> really appreciate it.
>>>
>>> Thank you very much,
>>> Jennifer Beane, Ph.D
>>> Post-doctoral Fellow
>>> Boston University School of Medicine
>> --
>> ------------------------------------------------------------------------
>> Scott Cain, Ph. D.                                         cain at cshl.edu
>> GMOD Coordinator (http://www.gmod.org/)                     216-392-3087
>> Cold Spring Harbor Laboratory