[Gmod-help] Re: [Gmod-schema] Pipelines for 'short read processing' + chado back-end?

Thu Aug 27 09:47:02 EDT 2009

Hi,

I agree with Don regarding the management of the primary data outside chado.

FlyBase has been actively looking at the management of RNA-Seq data in 
our chado instance, and our (nascent) policy is consistent with what 
Don suggested.  
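
A minimal sketch of that split, in Python with sqlite3 standing in for the database: the raw reads stay in a plain file, and the database holds only per-experiment metadata plus the file path. The `experiment` table, its columns, and the tiny stand-in reads file are all hypothetical and are not part of chado; a real setup would point at something like a BAM file.

```python
import os
import sqlite3
import tempfile

# Stand-in for sequencer output (hypothetical content); in practice this
# would be a large raw file such as BAM, left on the file system.
raw_dir = tempfile.mkdtemp()
raw_path = os.path.join(raw_dir, "reads.txt")
with open(raw_path, "w") as fh:
    fh.write("ACGT\nTTGA\nGGCC\n")

# Hypothetical metadata table: one row per experiment, with a path to the raw reads.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE experiment (name TEXT PRIMARY KEY, stage TEXT, raw_path TEXT)"
)
conn.execute(
    "INSERT INTO experiment VALUES (?, ?, ?)",
    ("rnaseq_test", "adult male", raw_path),
)


def count_reads(experiment_name):
    """Look up an experiment's raw file and process it as one batch."""
    (path,) = conn.execute(
        "SELECT raw_path FROM experiment WHERE name = ?", (experiment_name,)
    ).fetchone()
    with open(path) as fh:
        return sum(1 for line in fh if line.strip())
```

Only the derived results (here just a count) would then be stored in the database as features or properties.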

As for the actual implementation of RNA-Seq data in chado, FlyBase 
has so far only handled junction calls.  These will follow the same 
basic pattern as all other prediction evidence in chado:

     ----------------------------------------------  genome
             ^                   ^
             |   _______A______  |          alignment feature type = match
        floc |    ^          ^   | floc (rank = 0)
             |    | f_r  f_r |   |
            --B----        ---C---        hsp feature type = match

Specifically, the feature A is the junction, and features B and C are 
the flanking exons.  The advantage of this implementation is that the 
junction data integrates smoothly into existing processes and tools 
FlyBase already has for other prediction data.
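
The pattern in the diagram can be sketched roughly as follows, using Python's sqlite3 with heavily simplified stand-ins for the chado feature, featureloc and feature_relationship tables (real chado uses cvterm ids for types and has more columns). The uniquenames, the coordinates, the 'partof' relationship type, and the idea of reading the junction span off the gap between B and C are illustrative assumptions, not FlyBase's actual loader.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified stand-ins for the chado tables; types are plain text here.
conn.executescript("""
CREATE TABLE feature (
    feature_id INTEGER PRIMARY KEY, uniquename TEXT UNIQUE, type TEXT);
CREATE TABLE featureloc (
    feature_id INTEGER, srcfeature_id INTEGER,
    fmin INTEGER, fmax INTEGER, strand INTEGER, rank INTEGER);
CREATE TABLE feature_relationship (
    subject_id INTEGER, object_id INTEGER, type TEXT);
""")


def add_feature(uniquename, ftype):
    cur = conn.execute(
        "INSERT INTO feature (uniquename, type) VALUES (?, ?)",
        (uniquename, ftype),
    )
    return cur.lastrowid


genome = add_feature("genome_arm", "chromosome_arm")  # hypothetical names
a = add_feature("junction_A", "match")  # alignment feature (the junction)
b = add_feature("hsp_B", "match")       # flanking exon
c = add_feature("hsp_C", "match")       # flanking exon

# B and C are located on the genome (floc, rank = 0) and related to A (f_r).
for hsp, fmin, fmax in ((b, 1000, 1050), (c, 1300, 1350)):
    conn.execute(
        "INSERT INTO featureloc VALUES (?, ?, ?, ?, 1, 0)",
        (hsp, genome, fmin, fmax),
    )
    # 'partof' is an assumed relationship type.
    conn.execute(
        "INSERT INTO feature_relationship VALUES (?, ?, 'partof')", (hsp, a)
    )


def junction_gap(junction_id):
    """Return (end of left exon, start of right exon) for a junction."""
    rows = conn.execute(
        """SELECT fl.fmin, fl.fmax
           FROM feature_relationship fr
           JOIN featureloc fl ON fl.feature_id = fr.subject_id
           WHERE fr.object_id = ? AND fl.rank = 0
           ORDER BY fl.fmin""",
        (junction_id,),
    ).fetchall()
    (_, left_end), (right_start, _) = rows
    return left_end, right_start
```

With the made-up coordinates above, `junction_gap(a)` returns `(1050, 1300)`.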

Regarding metadata, the read count for a particular junction will be 
implemented as a featureprop linked to the junction feature (A).  
Junctions are grouped together in collections using the library module, 
and we plan, at least initially, to use libraryprop and library_cvterm 
to store metadata for particular collections.  In our current test 
implementation, we have a libraryprop containing a text description of 
experimental and analytical protocols associated with the collection, 
a libraryprop for the stage (e.g., "adult male"), and a libraryprop with 
a description of the data (e.g., "Junction calls based on RNA-Seq reads").
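
As a rough illustration of the metadata layout just described, here is a Python/sqlite3 sketch with simplified stand-ins for the chado featureprop, library, libraryprop and library_feature tables. The property type names ('read_count', 'protocol', 'stage', 'description'), the uniquenames, and the read count of 42 are made up for the example; library_cvterm is left out to keep it short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified stand-ins for the chado tables; types are plain text here.
conn.executescript("""
CREATE TABLE feature (
    feature_id INTEGER PRIMARY KEY, uniquename TEXT UNIQUE, type TEXT);
CREATE TABLE featureprop (
    feature_id INTEGER, type TEXT, value TEXT, rank INTEGER DEFAULT 0);
CREATE TABLE library (
    library_id INTEGER PRIMARY KEY, uniquename TEXT UNIQUE);
CREATE TABLE libraryprop (
    library_id INTEGER, type TEXT, value TEXT, rank INTEGER DEFAULT 0);
CREATE TABLE library_feature (library_id INTEGER, feature_id INTEGER);
""")

# Junction feature A with its read count as a featureprop (value is made up).
conn.execute("INSERT INTO feature VALUES (1, 'junction_A', 'match')")
conn.execute(
    "INSERT INTO featureprop (feature_id, type, value) "
    "VALUES (1, 'read_count', '42')"
)

# One collection of junctions, described by three libraryprops.
conn.execute("INSERT INTO library VALUES (1, 'rnaseq_junctions_test')")
conn.executemany(
    "INSERT INTO libraryprop (library_id, type, value) VALUES (1, ?, ?)",
    [
        ("protocol", "Text description of experimental and analytical protocols"),
        ("stage", "adult male"),
        ("description", "Junction calls based on RNA-Seq reads"),
    ],
)
conn.execute("INSERT INTO library_feature VALUES (1, 1)")


def junctions_for_stage(stage):
    """List (junction uniquename, read count) for collections at a given stage."""
    return conn.execute(
        """SELECT f.uniquename, CAST(fp.value AS INTEGER)
           FROM libraryprop lp
           JOIN library_feature lf ON lf.library_id = lp.library_id
           JOIN feature f ON f.feature_id = lf.feature_id
           JOIN featureprop fp
             ON fp.feature_id = f.feature_id AND fp.type = 'read_count'
           WHERE lp.type = 'stage' AND lp.value = ?
           ORDER BY f.uniquename""",
        (stage,),
    ).fetchall()
```

A search such as `junctions_for_stage("adult male")` then only needs to join the library properties to the junction features.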

I'm least happy with our plans for metadata, frankly.  FlyBase can 
store the data we know we need for our searches and so forth using the 
implementation above, but I'm uncertain whether this implementation 
is as generic (that is, usable by other projects) as it might be.  I've 
had a good look at the "BIR-TAB" chado module developed by the 
modENCODE DCC for managing metadata, but I think it might be too 
industrial-strength for FlyBase's needs (though I could be wrong).

I hope this is helpful.  I'd love to see more discussion of the 
issues Dan is raising, and a real consensus on implementation of 
these data, particularly the metadata.

Best,

-Dave

>> From gmod-schema-bounces at lists.sourceforge.net  Wed Aug 26 18:46:46 2009
>> To: Don Gilbert <gilbertd at cricket.bio.indiana.edu>
>> Cc: gmod-schema at lists.sourceforge.net, help at gmod.org
>> Subject: Re: [Gmod-schema] Pipelines for 'short read processing' + chado
>> 	back-end?
>> 
>> Thanks all for help - some interesting suggestions and examples.
>> 
>> To clarify, we expect that an experimentalist will give us, for
>> example, a transcriptomics analysis for some tissue in some state. We
>> then want to report back (based on some custom analysis) which genes
>> look overexpressed...
>> 
>> While this is quite a specific requirement (and will probably require
>> some specific analysis), we thought that there should be a common way
>> to store the data 'under the hood'. We could then build various
>> canned analyses on top of that common storage system and generalise
>> them to different experiments. In other words, a standard storage
>> system would let a 'specific analysis' be generalised, giving us the
>> flexibility to rapidly try out different ideas...
>> 
>> Does that sound reasonable?
>> 
>> Where would we start?
>> 
>> Sorry for being so vague! I just feel that there should be a better
>> way than designing a database from scratch.
>> 
>> 
>> Pengcheng, that looks great, but I'm not sure what I'm looking at!
>> Have you mapped resequencing data onto a reference and integrated
>> various other sources of data (omim, gene annotations, etc.)? How much
>> of this is 'hand crafted' and how much could be done using 'out of the
>> box' software?
>> 
>> 
>> Thanks again,
>> Dan.
>> 
>> 
>> 
>> 2009/8/26 Don Gilbert <gilbertd at cricket.bio.indiana.edu>:
>> >
>> > My offhand thoughts are: one could store the short read data in a
>> > Chado genome database, but that is a bit like storing all the traces from
>> > your genome sequencing also, where most folks only want the finished
>> > assembly for long term management in a genome database.  Handling the
>> > preliminary, unassembled reads this way is probably a big effort.
>> >
>> > The short read data processing tools like SAM/BAM would let you store
>> > and fetch this raw data, then you could save metadata / features derived from
>> > those in a Chado genome db.
>> >
>> > I like using a Unix file system instead of an RDBMS where suited;
>> > raw data in bulk form from sequencers is perhaps best left in its raw files,
>> > with indexing as needed to pull individual data. But usually you want
>> > to process the whole experiment as a batch, with say perl or C compiled
>> > programs, then db-store the results of those as genome features.  Your
>> > database can store the metadata for each experiment, including a file path
>> > to the raw reads.
>> >
>> > - Don
>> >
>> 