[Gmod-help] new database and GBrowse

Fri Dec 5 14:38:31 EST 2008

Hi Dave

Thank you very much for your detailed and informative reply - much, much
appreciated!

I will go over in detail all of the information you sent me and then go from
there - but I wanted to reply to you beforehand just to say thanks!

Certainly I am willing to help the GMOD community where I can and will share
any information I am able to share (not sure yet about sharing the schema as
it is the property of the hospital for which I work, so I would need
permission first) - but certainly any input I can offer, I will.

If I have further questions, you may hear from me. 

Thanks again for your help - terrific!

Have a wonderful weekend :-)

Margie

From: Dave Clements, GMOD Help Desk [mailto:gmodhelp at googlemail.com] 
Sent: December-04-08 7:30 PM
To: Margie Manker
Cc: GMOD Help Desk
Subject: Re: [Gmod-help] new database and GBrowse

Hi Margie,

I remember speaking with you in Toronto. I hope that you are still enjoying
working in biology!

-          I am creating two new database schemas that will contain mostly
genomic variation data as well as some phenotype data. These data will also
include information on a study, methods, platforms, subjects, samples, etc.

-          I would like to create a schema that suits the needs of our
organization. I have reviewed Chado in some detail and it does not suit the
needs of our organization. Ideally, our own schema should be used and I
would like to continue with this approach.

Can you describe what you found lacking in Chado?  This will help us improve
it in the near future.  Chado is extensible, and NESCent (nescent.org) has
developed a natural diversity module for Chado. This is still in Beta (and
is likely to change before it is released).  It is based on the GDPDM, which
is used at Gramene and MaizeGenetics for this purpose.  One of my
deliverables for 2009 is to get the natural diversity module out of Beta and
into production Chado.

Several things should help this along.  One is a NESCent working group that
needs this to be done, and secondly we are trying to schedule a GMOD natural
diversity hackathon for 2009 that will move this work forward.  

If you are interested, the natural diversity module and GDPDM are described
at:

  http://heliconiusdb.svn.sourceforge.net/viewvc/heliconiusdb/trunk/schema/doc/

  http://www.maizegenetics.net/gdpdm/

I think all this work may come too late for your needs.  However, I
encourage you to look at the current beta release as a possible solution.
When I actually get to work on this (probably starting in February) I may
ask you for any insights you have and for a copy of your schema.  If you are
really lucky (!) I might even ask if you are interested in attending the
hackathon.  :-)

-          We will most likely employ GBrowse as the genome browser for
display of data in the above databases.

-          My highest level questions that I have yet to find appropriate
answers to are these:

o   Can I use my own schema to build the database which underlies GBrowse?
If so, will a separate 'Bio::DB::GFF' database need to be created to act as
a bridge between my database and GBrowse?

o   What components would I most likely need from GMOD to get my database
and GBrowse to work together?

-          From what I can determine based on the documentation, I should be
able to use my own database schema to underlie GBrowse. It looks like my
database would require a GBrowse adaptor (Bio::DB::GFF??) and GBrowse. It
also looks like I might need an annotation pipeline, too.

-          Other questions that arise are:

o   What is "Bio::DB::GFF"? Is it a database? Schema? Adaptor?

o   Where does annotation data come from? What is the annotation pipeline?

GBrowse uses adaptors to read different data sources.  The data source can
be flat files (GFF3 + FASTA if you want the sequence), or databases, or any
other data source you can imagine.  I believe that all adaptors are written
in Perl.  Each adaptor has an expected input format.  The database adaptors
expect a specific schema to talk to.  

So Bio::DB::GFF is a Perl module that is a GBrowse adaptor.  It expects to
read from a database with a specific schema.  (Bio::DB::GFF also assumes
GFF2, a now deprecated format.)

However, writing an adaptor is not a small undertaking.  Probably a much
easier way to tackle this is to write a program to export GFF3 and FASTA
formatted files from your database and then load them into a
Bio::DB::SeqFeature::Store MySQL database.  This will likely be faster than
running directly off of your source database.  GFF3 is a flat file format
for specifying genomic features (genes, exons, SNPs, ...) and relationships
between them.  FASTA is a flat file format for specifying sequence.
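
As a rough illustration of the export approach described above, here is a
minimal Python sketch that reads features from a relational database and
writes them out as GFF3.  It is only a sketch under stated assumptions: the
`variant` table, its column names, and the `local_db` source label are
hypothetical stand-ins for whatever your own schema contains, and an
in-memory SQLite database is used purely to keep the example self-contained.
The `##gff-version 3` header and the nine tab-separated columns follow the
GFF3 format.

```python
import sqlite3


def gff3_escape(value):
    """Percent-encode characters that are reserved in GFF3 column 9."""
    out = str(value)
    # '%' must be handled first so already-escaped text is not escaped twice.
    for ch in "%;=&,\t\n\r":
        out = out.replace(ch, "%{:02X}".format(ord(ch)))
    return out


def gff3_line(seqid, source, type_, start, end,
              score=".", strand=".", phase=".", attributes=None):
    """Build one GFF3 feature line (coordinates are 1-based, inclusive)."""
    attrs = ";".join(
        "{}={}".format(key, gff3_escape(val))
        for key, val in (attributes or {}).items()
    ) or "."
    columns = [seqid, source, type_, str(start), str(end),
               str(score), strand, str(phase), attrs]
    return "\t".join(columns)


def export_gff3(conn):
    """Export every row of the (hypothetical) variant table as a SNP feature."""
    lines = ["##gff-version 3"]
    rows = conn.execute(
        "SELECT variant_id, chrom, start_pos, end_pos "
        "FROM variant ORDER BY chrom, start_pos"
    )
    for variant_id, chrom, start, end in rows:
        lines.append(gff3_line(
            chrom, "local_db", "SNP", start, end,
            attributes={"ID": variant_id, "Name": variant_id},
        ))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical schema and data, for demonstration only.
    demo = sqlite3.connect(":memory:")
    demo.execute("CREATE TABLE variant (variant_id TEXT, chrom TEXT, "
                 "start_pos INTEGER, end_pos INTEGER)")
    demo.execute("INSERT INTO variant VALUES ('rs123', 'chr1', 1000, 1000)")
    print(export_gff3(demo), end="")
```

A real exporter would map each of your feature types to an appropriate
Sequence Ontology term in column 3 rather than hard-coding `SNP`.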

Since you have a custom database, there is not going to be any program that
will create GFF3 or FASTA for you. FASTA should be trivial to create (if you
have the sequence).  GFF3 will require more work.  Some code you could look
at for inspiration is the GMODTools suite (http://gmod.org/wiki/GMODTools).
It does conversion from several formats to GFF3.
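
To give a sense of how little FASTA needs, here is a minimal Python sketch of
a FASTA writer.  The function name `fasta_record` and the 60-character line
width are my own choices for illustration, not anything GBrowse requires.
The one thing to watch is that the sequence ID on the `>` line matches the
sequence ID used in column 1 of your GFF3.

```python
def fasta_record(seq_id, sequence, width=60):
    """Format one sequence as a FASTA record with fixed-width lines."""
    chunks = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return ">" + seq_id + "\n" + "\n".join(chunks) + "\n"


if __name__ == "__main__":
    # Hypothetical sequence, for demonstration only.
    print(fasta_record("chr1", "ACGTACGTAC", width=4), end="")
```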

Where does annotation data come from?  From an annotation pipeline!

Wait.  That answer isn't helpful, darnit.  A pipeline is usually a series
(thus a pipeline) of programs that performs some analysis on sequence.  For
example, you might have an already annotated reference genome, and a slew of
short sequence reads from ESTs* from the latest high-throughput sequencer,
and you want to annotate the reference genome with the new data.  Your
pipeline might be:

1. Assemble the short reads into a series of contigs (put the short reads
together into longer chunks, hopefully each as long as the complete EST).  

2. Align the contigs to the reference genome (figure out where they came
from)

3. Create a GFF3 file and a FASTA file (not sure on the FASTA) describing
where each EST aligns, and load them into GBrowse.
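
Step 3 can be sketched roughly as follows in Python.  This assumes the
aligner's output has already been parsed into simple records; the dictionary
keys, the `est_pipeline` source label, and the `match1`, `match2`, ... IDs
are hypothetical and chosen only for illustration.  `EST_match` is a Sequence
Ontology term, and the `Target` attribute is how GFF3 records which part of
the EST aligned.

```python
def alignments_to_gff3(hits, source="est_pipeline"):
    """Turn parsed alignment records into GFF3 EST_match features."""
    lines = ["##gff-version 3"]
    for number, hit in enumerate(hits, start=1):
        attributes = "ID=match{};Name={};Target={} {} {}".format(
            number, hit["est"], hit["est"], hit["est_start"], hit["est_end"]
        )
        lines.append("\t".join([
            hit["chrom"],        # reference sequence the EST aligned to
            source,
            "EST_match",
            str(hit["start"]),   # 1-based, inclusive genome coordinates
            str(hit["end"]),
            ".",                 # score (not recorded in this sketch)
            hit["strand"],
            ".",                 # phase applies only to CDS features
            attributes,
        ]))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical alignment record, for demonstration only.
    example = [{"est": "EST23", "chrom": "chr1", "start": 5000, "end": 5020,
                "strand": "+", "est_start": 1, "est_end": 21}]
    print(alignments_to_gff3(example), end="")
```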

All of these steps may involve heavy magic.  Fortunately, most of that magic
is already done by the people who have written the programs to do the steps.

* ESTs (expressed sequence tags) = a relatively easy way to find out what
part of the genome is being transcribed (what the active genes are)

As I said, I am relatively new to GMOD and I find the online documentation
is plentiful, but not easily navigated by the newbie. After two weeks of
reading the documentation I find I am now going in circles looking for
answers to my questions - and information on how to design an information
system employing components of GMOD. 

Ideally a diagram that displays a database and how it interacts with the
components of GMOD would be great to see. I haven't yet found anything like
this in the documentation. At the very least, if someone could steer me in
the right direction as far as what components I should focus on and what
specific documentation I can read, it would be appreciated.

It is possible to use GBrowse as a standalone tool, without any other GMOD
tools.  A lot of people actually do this.  It sounds like this might work
fine for you. 

Thanks for the documentation suggestions.  We just did a community survey
and one of the top priorities for the help desk was improving the
documentation.  Look for progress in 2009.

Finally, although you didn't ask for it, I can think of two GBrowse
instances that might show datatypes that are sort of similar to what you are
doing:

  http://hapmap.org

  http://jimwatsonsequence.cshl.edu/cgi-perl/gbrowse/jwsequence/ or

  http://jimwatsonsequence.cshl.edu/cgi-perl/gbrowse/cvsequence/

Please let me know if you have any questions or comments.

Dave C

541 914 6324

AIM or Skype user: tnabtaf 

Any assistance you can provide on these questions would be tremendously
appreciated. And if I can, in turn, provide some input on how to create some
"newbie" documentation, I will do so - to help others in my situation.

Also, I have 15 years' experience working with relational databases, but not
genomic databases, so you can assume a level of technical understanding, but
with the caveat that genomic databases are new territory for me.

Thanks so much for your time.

Kind regards,

Margie Manker

Was this helpful?  Let us know at http://gmod.org/wiki/Help_Desk_Feedback
