[Gmod-help] Re: [Gmod-schema] Pipelines for 'short read processing' + chado back-end?
David Emmert
emmert at morgan.harvard.edu
Thu Aug 27 09:47:02 EDT 2009
Hi,
I agree with Don regarding the management of the primary data outside chado.
FlyBase has been actively looking at the management of RNA-Seq data in
our chado instance, and our (nascent) policy is consistent with what
Don suggested.
In terms of the actual implementation of RNA-Seq data in chado, FlyBase
has only done junction calls at this point. These follow the same
basic pattern as all prediction evidence in chado:
----------------------------------------------  genome
       ^                  ^
       |  _______A______  |      alignment  feature type = match
  floc |  ^            ^  | floc (rank = 0)
       |  | f_r    f_r |  |
     --B----           ---C---   hsp        feature type = match
Specifically, the feature A is the junction, and features B and C are
the flanking exons. The advantage of this implementation is that the
junction data integrates smoothly into existing processes and tools
FlyBase already has for other prediction data.
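
As a rough illustration of the pattern above, here is a minimal Python
sketch using an in-memory SQLite database. The tables are heavily
cut-down stand-ins for chado's feature, featureloc and
feature_relationship tables (real chado uses cvterm-backed type_id
columns and more fields). The uniquenames, the coordinates and the
'partof' relationship type are assumptions made for the example, not
FlyBase's actual data. The sketch also assumes that only the flanking
features B and C carry rank-0 featurelocs on the genome, as the two
floc arrows suggest.

```python
import sqlite3

# Simplified stand-ins for three chado tables. Real chado stores types as
# cvterm ids (type_id) and has more columns; plain text types are used here.
SCHEMA = """
CREATE TABLE feature (
    feature_id INTEGER PRIMARY KEY,
    uniquename TEXT NOT NULL UNIQUE,
    type       TEXT NOT NULL
);
CREATE TABLE featureloc (
    feature_id    INTEGER NOT NULL REFERENCES feature(feature_id),
    srcfeature_id INTEGER NOT NULL REFERENCES feature(feature_id),
    fmin          INTEGER NOT NULL,
    fmax          INTEGER NOT NULL,
    rank          INTEGER NOT NULL DEFAULT 0
);
CREATE TABLE feature_relationship (
    subject_id INTEGER NOT NULL REFERENCES feature(feature_id),
    object_id  INTEGER NOT NULL REFERENCES feature(feature_id),
    type       TEXT NOT NULL
);
"""


def build_junction_db():
    """Create the toy schema and load one junction (A) with flanks B and C."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    # All names and coordinates below are hypothetical example values.
    conn.executemany("INSERT INTO feature VALUES (?, ?, ?)", [
        (1, "genome_arm", "chromosome_arm"),
        (2, "junction_A", "match"),   # alignment-level feature A
        (3, "flank_B", "match"),      # hsp-level feature B
        (4, "flank_C", "match"),      # hsp-level feature C
    ])
    # B and C are located on the genome with rank 0 (interbase coordinates).
    conn.executemany("INSERT INTO featureloc VALUES (?, ?, ?, ?, ?)", [
        (3, 1, 100, 200, 0),
        (4, 1, 500, 600, 0),
    ])
    # f_r: B and C are each linked to A; 'partof' is an assumed type name.
    conn.executemany("INSERT INTO feature_relationship VALUES (?, ?, ?)", [
        (3, 2, "partof"),
        (4, 2, "partof"),
    ])
    return conn


def junction_span(conn, uniquename):
    """Return (start, end) of the gap between the two flanks of a junction."""
    return conn.execute("""
        SELECT MIN(l.fmax), MAX(l.fmin)
        FROM feature a
        JOIN feature_relationship fr ON fr.object_id = a.feature_id
        JOIN featureloc l ON l.feature_id = fr.subject_id AND l.rank = 0
        WHERE a.uniquename = ?
    """, (uniquename,)).fetchone()
```

With the example rows, junction_span(build_junction_db(), "junction_A")
returns (200, 500): the end of B and the start of C. Deriving the span
from the flanks keeps the sketch to the two flocs in the diagram. A
production schema may well give A its own featureloc instead.
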
Regarding metadata, the read count for a particular junction will be
implemented as a featureprop linked to the junction feature (A).
Junctions are grouped together in collections using the library module,
and we plan, at least initially, to use libraryprop and library_cvterm
to store metadata for particular collections. In our current test
implementation, we have a libraryprop containing a text description of
experimental and analytical protocols associated with the collection,
a libraryprop for the stage (e.g., "adult male"), and a libraryprop with
a description of the data (e.g., "Junction calls based on RNA-Seq reads").
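
The metadata layout described above might look roughly like the
following Python/SQLite sketch. Again the tables are simplified
stand-ins for the chado ones. The property type names ('read_count',
'stage', 'description', 'protocol'), the library name, the read count
of 42 and the protocol text are all invented for illustration, and
linking junctions to a collection through library_feature is an
assumption, since only "the library module" is named.

```python
import sqlite3

# Cut-down versions of the chado tables involved; real chado uses
# cvterm-backed type_id columns rather than free-text type names.
SCHEMA = """
CREATE TABLE feature (
    feature_id INTEGER PRIMARY KEY,
    uniquename TEXT NOT NULL UNIQUE
);
CREATE TABLE featureprop (
    feature_id INTEGER NOT NULL REFERENCES feature(feature_id),
    type       TEXT NOT NULL,
    value      TEXT
);
CREATE TABLE library (
    library_id INTEGER PRIMARY KEY,
    uniquename TEXT NOT NULL UNIQUE
);
CREATE TABLE library_feature (
    library_id INTEGER NOT NULL REFERENCES library(library_id),
    feature_id INTEGER NOT NULL REFERENCES feature(feature_id)
);
CREATE TABLE libraryprop (
    library_id INTEGER NOT NULL REFERENCES library(library_id),
    type       TEXT NOT NULL,
    value      TEXT
);
"""


def load_example():
    """Load one junction, its read count, and one collection with props."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO feature VALUES (1, 'junction_A')")
    # Read count stored as a featureprop on the junction feature (A).
    conn.execute("INSERT INTO featureprop VALUES (1, 'read_count', '42')")
    conn.execute("INSERT INTO library VALUES (1, 'example_collection')")
    conn.execute("INSERT INTO library_feature VALUES (1, 1)")
    # Collection-level metadata stored as libraryprops.
    conn.executemany("INSERT INTO libraryprop VALUES (1, ?, ?)", [
        ("protocol", "hypothetical protocol description"),
        ("stage", "adult male"),
        ("description", "Junction calls based on RNA-Seq reads"),
    ])
    return conn


def describe_junction(conn, uniquename):
    """Return the read count and collection metadata for one junction."""
    row = conn.execute("""
        SELECT fp.value, l.library_id, l.uniquename
        FROM feature f
        JOIN featureprop fp
          ON fp.feature_id = f.feature_id AND fp.type = 'read_count'
        JOIN library_feature lf ON lf.feature_id = f.feature_id
        JOIN library l ON l.library_id = lf.library_id
        WHERE f.uniquename = ?
    """, (uniquename,)).fetchone()
    if row is None:
        return None
    count, library_id, library_name = row
    props = dict(conn.execute(
        "SELECT type, value FROM libraryprop WHERE library_id = ?",
        (library_id,)))
    return {"read_count": int(count),
            "library": library_name,
            "libraryprops": props}
```

Keeping the read count on the feature and the stage and protocol text on
the library means per-junction values and per-collection values are
queried separately, then joined through the junction-to-library link.
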
I'm least happy with our plans for metadata, frankly. FlyBase can
store the data we know we need for our searches and so forth using the
implementation above, but I'm uncertain whether this implementation
is as generic (usable by other projects) as it might be. I've
had a good look at the "BIR-TAB" chado module developed by the
modENCODE DCC for managing metadata, but I think it might be too
industrial-strength for FlyBase's needs (though I could be wrong).
I hope this is helpful. I'd love to see more discussion of the
issues Dan is raising, and a real consensus on implementation of
these data, particularly the metadata.
Best,
-Dave
From gmod-schema-bounces at lists.sourceforge.net Wed Aug 26 18:46:46 2009
>> To: Don Gilbert <gilbertd at cricket.bio.indiana.edu>
>> Cc: gmod-schema at lists.sourceforge.net, help at gmod.org
>> Subject: Re: [Gmod-schema] Pipelines for 'short read processing' + chado
>> back-end?
>>
>> Thanks all for help - some interesting suggestions and examples.
>>
>> To clarify, we expect that an experimentalist will give us, for
>> example, a transcriptomics analysis for some tissue in some state, we
>> then want to report back (based on some custom analysis) which genes
>> look over expressed...
>>
>> While this is quite a specific requirement (and will probably require
>> some specific analysis), we thought that there should be a common way
>> to store the data 'under the hood'. In this way we could build various
>> canned analysis on top of a common storage system that could then be
>> generalised to different experiments. i.e. the common storage system
>> could allow 'specific analysis' to be generalised, giving us the
>> flexibility to rapidly try out different ideas on top of a standard
>> storage system...
>>
>> Does that sound reasonable?
>>
>> Where would we start?
>>
>> Sorry for being so vague! I just feel that there should be a better
>> way than designing a database from scratch.
>>
>>
>> Pengcheng, that looks great, but I'm not sure what I'm looking at!
>> Have you mapped resequencing data onto a reference and integrated
>> various other sources of data (omim, gene annotations, etc.)? How much
>> of this is 'hand crafted' and how much could be done using 'out of the
>> box' software?
>>
>>
>> Thanks again,
>> Dan.
>>
>>
>>
>> 2009/8/26 Don Gilbert <gilbertd at cricket.bio.indiana.edu>:
>> >
>> > My off hand thoughts are: one could store the short read data in a
>> > Chado genome database, but that is a bit like storing all the traces from
>> > your genome sequencing also, where most folks only want the finished
>> > assembly for long term management in a genome database. Handling the
>> > preliminary, unassembled reads this way is probably a big effort.
>> >
>> > The short read data processing tools like SAM/BAM would let you store
>> > and fetch this raw data, then you could save metadata / features derived from
>> > those in a Chado genome db.
>> >
>> > I like using a Unix file system instead of an RDBMS where suited;
>> > raw data in bulk form from sequencers is perhaps best left in its raw files,
>> > with indexing as needed to pull individual data. But usually you want
>> > to process the whole experiment as a batch, with say perl or C compiled
>> > programs, then db-store the results of those as genome features. Your
>> > database can store the metadata for each experiment, including a file path
>> > to the raw reads.
>> >
>> > - Don
>> >
>>