[Gmod-help] Re: [Gmod-schema] Pipelines for 'short read processing' + chado back-end?
Dan Bolser
dan.bolser at gmail.com
Wed Aug 26 18:45:52 EDT 2009
Thanks, all, for the help - some interesting suggestions and examples.
To clarify, we expect that an experimentalist will give us, for
example, a transcriptomics analysis for some tissue in some state; we
then want to report back (based on some custom analysis) which genes
look over-expressed...
While this is quite a specific requirement (and will probably require
some specific analysis), we thought there should be a common way to
store the data 'under the hood'. In this way we could build various
canned analyses on top of a common storage system and then generalise
them to different experiments. That is, the common storage system
would let a 'specific analysis' be generalised, giving us the
flexibility to rapidly try out different ideas on top of a standard
storage system...
Does that sound reasonable?
Where would we start?
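To make that concrete, here is the sort of thing I have in mind - a
rough Python sketch where sqlite3 just stands in for whatever the real
back-end would be, and all the table and function names are made up
for illustration:

import sqlite3

# Hypothetical 'common storage' layer: one generic table of expression
# measurements, keyed by experiment and gene, whatever the assay was.
def create_store(path=":memory:"):
    db = sqlite3.connect(path)
    db.execute("""
        CREATE TABLE IF NOT EXISTS expression (
            experiment TEXT,
            gene       TEXT,
            condition  TEXT,
            value      REAL
        )""")
    return db

# A 'canned analysis' written only against the common store: flag genes
# whose treated expression is at least fold_change times the control
# value. Any experiment loaded into the store can reuse it unchanged.
def over_expressed_genes(db, experiment, control, treated, fold_change=2.0):
    rows = db.execute("""
        SELECT a.gene, b.value / a.value AS ratio
        FROM expression a JOIN expression b
          ON a.gene = b.gene AND a.experiment = b.experiment
        WHERE a.experiment = ? AND a.condition = ? AND b.condition = ?
          AND a.value > 0 AND b.value / a.value >= ?
        ORDER BY ratio DESC""",
        (experiment, control, treated, fold_change))
    return rows.fetchall()

if __name__ == "__main__":
    db = create_store()
    db.executemany(
        "INSERT INTO expression VALUES (?, ?, ?, ?)",
        [("exp1", "geneA", "control", 10.0),
         ("exp1", "geneA", "treated", 55.0),
         ("exp1", "geneB", "control", 20.0),
         ("exp1", "geneB", "treated", 21.0)])
    print(over_expressed_genes(db, "exp1", "control", "treated"))
    # -> [('geneA', 5.5)]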
Sorry for being so vague! I just feel that there should be a better
way than designing a database from scratch.
Pengcheng, that looks great, but I'm not sure what I'm looking at!
Have you mapped resequencing data onto a reference and integrated
various other sources of data (OMIM, gene annotations, etc.)? How much
of this is 'hand-crafted' and how much could be done using 'out of the
box' software?
Thanks again,
Dan.
2009/8/26 Don Gilbert <gilbertd at cricket.bio.indiana.edu>:
>
> My off-hand thoughts are: one could store the short-read data in a
> Chado genome database, but that is a bit like also storing all the traces
> from your genome sequencing, where most folks only want the finished
> assembly for long-term management in a genome database. Handling the
> preliminary, unassembled reads this way is probably a big effort.
>
> Short-read processing tools like SAM/BAM would let you store and fetch
> this raw data; you could then save the metadata / features derived from
> it in a Chado genome db.
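
(A minimal sketch of that SAM/BAM route - assuming pysam as the access
library, which is an assumption here rather than something Don named;
the BAM file name and region are just placeholders:)

import pysam

# Open an indexed, sorted BAM of mapped short reads (needs a .bai index).
bam = pysam.AlignmentFile("experiment1.sorted.bam", "rb")

# Pull individual reads for one region straight from the indexed raw file...
region_reads = list(bam.fetch("chr1", 100000, 101000))

# ...and derive a small summary 'feature' (here just the read depth over
# the region) of the kind worth keeping long term in a Chado genome db.
print("chr1:100000-101000", len(region_reads), "reads")

bam.close()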
>
> I like using a Unix file system instead of an RDBMS where suited;
> bulk raw data from the sequencers is perhaps best left in its raw files,
> with indexing as needed to pull out individual records. But usually you
> want to process the whole experiment as a batch, with, say, Perl or
> compiled C programs, then db-store the results as genome features. Your
> database can store the metadata for each experiment, including a file
> path to the raw reads.
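
(A rough sketch of that metadata/raw-file split - sqlite3 stands in for
the real genome database, the table and column names are invented for
illustration rather than taken from Chado, and the file path is a
placeholder:)

import sqlite3

db = sqlite3.connect("experiments.db")

# Per-experiment metadata, including a pointer to the raw reads on disk;
# the bulk data itself never goes into the database.
db.execute("""
    CREATE TABLE IF NOT EXISTS experiment (
        name      TEXT PRIMARY KEY,
        tissue    TEXT,
        platform  TEXT,
        raw_reads TEXT   -- file path to the raw read files
    )""")

# Small derived features produced by batch-processing the raw files.
db.execute("""
    CREATE TABLE IF NOT EXISTS derived_feature (
        experiment TEXT,
        seqid      TEXT,
        start      INTEGER,
        stop       INTEGER,
        type       TEXT,
        score      REAL
    )""")

db.execute("INSERT OR REPLACE INTO experiment VALUES (?, ?, ?, ?)",
           ("exp1", "leaf", "Illumina", "/data/runs/exp1/reads.fastq.gz"))
db.execute("INSERT INTO derived_feature VALUES (?, ?, ?, ?, ?, ?)",
           ("exp1", "chr1", 100000, 101000, "read_depth", 1234.0))
db.commit()
db.close()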
>
> - Don
>