[Gmod-help] new database and GBrowse

Wed Dec 10 16:46:00 EST 2008

Hi Dave

Again, thank you for your thorough reply.

I will walk through the GBrowse tutorial in the very near future. I have
been hesitant in doing this because it would require me to download things
to my laptop and start playing around with them. But my current laptop
belongs to my company (Popular Genetics) and not my client (i.e. The Centre
for Applied Genomics) so I've been worried I might start breaking things (a
broken laptop means business comes to a halt). I've got an older laptop that
I am reconfiguring and eventually will install a Unix OS (probably CentOS as
recommended on the GMOD pages) - I will then be in a much better position
for downloading, installing, and monkeying around with GBrowse and such.

Your description of the annotation pipeline sheds some light on an area of
GBrowse that presented a fair bit of confusion to me. I now understand that
the data behind some of the tracks is obtained from an "external source" -
such as NCBI or UCSC - and then converted to an "internal source" that can
be read by GBrowse. The bottom line is that I finally understand where this
data originates and how it fits in the GBrowse world.

I will try to create a diagram of what I understand about GBrowse and how it
interacts with other GMOD components and a database. If this diagram turns
out to be of any value to the GMOD community (especially newbies like me),
you will be welcome to post it on the wiki (provided I get around to
creating it!).

I genuinely appreciate your help with all of this - it is terrific to have
you as a resource!

Cheers,

Margie

From: Dave Clements, GMOD Help Desk [mailto:gmodhelp at googlemail.com] 
Sent: December-08-08 4:11 PM
To: Margie Manker
Cc: GMOD Help Desk
Subject: Re: [Gmod-help] new database and GBrowse

Hi Margie

You will still need an adaptor such as Bio::DB::GFF (GFF2) or
Bio::DB::SeqFeature::Store (GFF3) to provide data to GBrowse.  However, you
won't need other components such as Chado, Apollo, or CMAP.  You can write
your own adaptor if you choose, but this is non-trivial and you are instead
encouraged to write a program to generate GFF3 & FASTA files from your
database and then use something like bp_seqfeature_load.pl (a BioPerl
program) to load the data into a MySQL DB that GBrowse will then read.
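
A minimal sketch of that export step, written in Python rather than Perl for brevity. The `variant` table, its columns, the `custom_db` source label, and the in-memory SQLite database are all hypothetical stand-ins for your own schema; only the nine-column GFF3 layout comes from the format itself.

```python
import sqlite3
from urllib.parse import quote


def gff3_line(seqid, source, ftype, start, end, strand, attrs):
    """Format one 9-column GFF3 line; score and phase are left as '.'."""
    # Percent-encode attribute values so characters such as ';' and '=' stay unambiguous.
    attr_text = ";".join(f"{key}={quote(str(value), safe=' ')}" for key, value in attrs.items())
    return "\t".join([seqid, source, ftype, str(start), str(end), ".", strand, ".", attr_text])


def export_gff3(conn):
    """Return GFF3 text for every row of the hypothetical `variant` table."""
    lines = ["##gff-version 3"]
    query = "SELECT id, chrom, start_pos, end_pos, kind FROM variant ORDER BY chrom, start_pos"
    for vid, chrom, start, end, kind in conn.execute(query):
        lines.append(gff3_line(chrom, "custom_db", kind, start, end, ".", {"ID": f"variant{vid}"}))
    return "\n".join(lines) + "\n"


# Stand-in for the custom database: an in-memory SQLite table with made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE variant (id INTEGER, chrom TEXT, start_pos INTEGER, end_pos INTEGER, kind TEXT)")
conn.executemany(
    "INSERT INTO variant VALUES (?, ?, ?, ?, ?)",
    [(1, "chr1", 1000, 5000, "copy_number_variation"), (2, "chr1", 200, 200, "SNP")],
)
gff3_text = export_gff3(conn)
print(gff3_text, end="")
```

Saved to a file, text like this is the kind of input that the loading step would then take.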

I don't think there is any specific documentation on how to write a GBrowse
adaptor.  (Scott, Lincoln, please chime in if you disagree.)  If you still
want to go down this path, I would start by looking at the Chado adaptor
(Bio::DB::Das::Chado), which is inside the GBrowse distribution.

If you haven't already done so, I encourage you to walk through the GBrowse
tutorial, actually do an installation, and play with the configuration.
At the end of the tutorial it tells you how to hook GBrowse up to a database
instead of running it from flat files.  Installing the prerequisites for
GBrowse will take some effort (usually BioPerl and libgd) but once you have
it up and running, the tutorial is pretty straightforward.

I'm not familiar with the history or details of the Database of Genomic
Variants (DGV), but I'll take a stab at your questions.  The data behind
tracks such as RefSeq and UCSC segmental duplications were generated by NCBI
(certainly) and UCSC (probably) and made available in some common
bioinformatics file format (if you are lucky, GFF3).  NCBI and UCSC ran
pipelines to generate these data, but as a consumer of the data, you don't
need to know all the details of how those pipelines worked, just what the
output formats are.  The data was generated against a specific build of the
reference human genome.

To create the DGV GBrowse instance the site admins needed to get data files
describing the reference genome, the RefSeq genes, the UCSC segmental
duplications, and so on.

For any file that is not already in GFF2/3, they need to translate the data
into GFF2/3.  If they were lucky, those programs already existed.  Once the
data is in GFF they can load it into a GBrowse database and then show it in
GBrowse.
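
As an illustration of what such a translation program might look like, here is a small sketch that assumes the downloaded file is in UCSC's BED format (the thread does not say which format DGV actually started from). The `source` and `feature_type` defaults are placeholders. The one detail that is easy to get wrong is the coordinate system: BED starts are 0-based with an exclusive end, while GFF3 is 1-based and inclusive.

```python
def bed_to_gff3(bed_line, source="UCSC", feature_type="region"):
    """Convert one tab-separated BED line (at least 3 columns) to one GFF3 line."""
    fields = bed_line.rstrip("\n").split("\t")
    chrom, bed_start, bed_end = fields[0], int(fields[1]), int(fields[2])
    name = fields[3] if len(fields) > 3 else None
    strand = fields[5] if len(fields) > 5 and fields[5] in ("+", "-") else "."
    # BED: 0-based start, exclusive end.  GFF3: 1-based start, inclusive end.
    start, end = bed_start + 1, bed_end
    attributes = f"Name={name}" if name else "."
    return "\t".join([chrom, source, feature_type, str(start), str(end), ".", strand, ".", attributes])


converted = bed_to_gff3("chr1\t999\t5000\tdup1\t0\t+")
print(converted)
```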

There may be other data at DGV that was locally created, perhaps the all
CNV's track.  In this case, a lot more work had to be done to get the data
from whatever format it was in, into GFF so it could be loaded into the DB,
and then seen by GBrowse.

All GBrowse instances get their data from somewhere (see
http://gmod.org/wiki/GBrowse_Adaptors).  Is that somewhere external?
"External" and "Internal" may not be meaningful here.  GBrowse is a Perl
program that reads its data from a data source.  A more useful question
might be "Does the data source exist solely to provide data to GBrowse?"
The answer to that question is most often Yes.  Most users probably did not
have Bio::DB::GFF or Bio::DB::SeqFeature::Store databases prior to installing
GBrowse.

Please let me know if any of this is not clear or if you have other
questions.  Please don't hesitate to call/Skype/AIM/iChat me.

Thanks,

Dave C

On Mon, Dec 8, 2008 at 11:04 AM, Margie Manker <manker at populargenetics.ca>
wrote:

Hi Dave

I have had a chance to review your comments in detail and have a couple of
follow-up questions for you:

1)      "It is possible to use GBrowse as a standalone tool, without any
other GMOD tools.  A lot of people actually do this.  It sounds like this
might work fine for you."  -- If I use GBrowse as a stand-alone tool, do I
still need an adaptor (such as Bio::DB::GFF)? If not, I would conclude that
GBrowse can be customized to read directly from my custom database. Is this
correct? If I can use GBrowse without any other GMOD components, where is
the best place to look at documentation specifically on how to do this?

2)      Annotation Pipeline: I am still unclear as to what is part of the
annotation pipeline. For example, the tracks in the Database of Genomic
Variants (i.e. Genome Browser:
http://projects.tcag.ca/cgi-bin/variation/gbrowse/hg18/ ) display data such
as RefSeq genes, UCSC segmental duplications, Clones, OMIM disease genes,
etc. Where do these data come from? Are files downloaded, stored, and then
read by the pipeline? Are these data part of GBrowse somehow? i.e. does
GBrowse fetch data from an external source (external to my database/in-house
information system) and display it to users? 

I think I will be getting ever closer to a better understanding of GMOD and
GBrowse if you are able to provide further insight to the above topics.

Again, thank you very much for your help with this. It is much appreciated
and I look forward to the opportunity to, in turn, help others in our GMOD
community.

Cheers!

Margie 

From: Dave Clements, GMOD Help Desk [mailto:gmodhelp at googlemail.com] 
Sent: December-04-08 7:30 PM
To: Margie Manker
Cc: GMOD Help Desk
Subject: Re: [Gmod-help] new database and GBrowse

Hi Margie,

I remember speaking with you in Toronto. I hope that you are still enjoying
working in biology!

-          I am creating two new database schemas that will contain mostly
genomic variation data as well as some phenotype data. These data will also
include information on a study, methods, platforms, subjects, samples, etc.

-          I would like to create a schema that suits the needs of our
organization. I have reviewed Chado in some detail and it does not suit the
needs of our organization. Ideally, our own schema should be used and I
would like to continue with this approach.

Can you describe what you found lacking in Chado?  This will help us improve
it in the near future.  Chado is extendable and NESCent (nescent.org) has
developed a natural diversity module for Chado. This is still in Beta (and
is likely to change before it is released).  It is based on the GDPDM, which
is used at Gramene and MaizeGenetics for this purpose.  One of my
deliverables for 2009 is to get the natural diversity module out of Beta and
into production Chado.

Several things should help this along.  One is a NESCent working group that
needs this to be done, and secondly we are trying to schedule a GMOD natural
diversity hackathon for 2009 that will move this work forward.  

If you are interested the natural diversity module and GDPDM are described
at:

  http://heliconiusdb.svn.sourceforge.net/viewvc/heliconiusdb/trunk/schema/doc/

  http://www.maizegenetics.net/gdpdm/

I think all this work may come too late for your needs.  However, I
encourage you to look at the current beta release as a possible solution.
When I actually get to work on this (probably starting in February) I may
ask you for any insights you have and for a copy of your schema.  If you are
really lucky (!) I might even ask if you are interested in attending the
hackathon.  :-)

-          We will most likely employ GBrowse as the genome browser for
display of data in the above databases.

-          My highest level questions that I have yet to find appropriate
answers to are these:

o   Can I use my own schema to build the database which underlies GBrowse?
If so, will a separate 'Bio::DB::GFF' database need to be created to act as
a bridge between my database and GBrowse?

o   What components would I most likely need from GMOD to get my database
and GBrowse to work together?

-          From what I can determine based on the documentation, I should be
able to use my own database schema to underlie GBrowse. It looks like my
database would require a GBrowse adaptor (Bio::DB::GFF??) and GBrowse. It
also looks like I might need an annotation pipeline, too.

-          Other questions that arise are:

o   What is "Bio::DB::GFF"? Is it a database? Schema? Adaptor?

o   Where does annotation data come from? What is the annotation pipeline?

GBrowse uses adaptors to read different data sources.  The data source can
be flat files (GFF3 + FASTA if you want the sequence), or databases, or any
other data source you can imagine.  I believe that all adaptors are written
in Perl.  Each adaptor has an expected input format.  The database adaptors
expect a specific schema to talk to.  
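
To make the adaptor idea concrete, here is a conceptual sketch in Python. It is not GBrowse's actual API (the real adaptors are Perl modules). The class names, the `features` method, and the `feature` table layout are all invented for illustration. The point is only that the browser-side code stays the same while each adaptor knows how to read its own kind of data source.

```python
import sqlite3
from abc import ABC, abstractmethod


class FeatureAdaptor(ABC):
    """Hypothetical interface: the browser only asks for features in a region."""

    @abstractmethod
    def features(self, seqid, start, end):
        """Return (seqid, start, end, type) tuples overlapping [start, end]."""


class Gff3TextAdaptor(FeatureAdaptor):
    """Flat-file style adaptor: parses GFF3 text held in memory."""

    def __init__(self, gff3_text):
        self.rows = []
        for line in gff3_text.splitlines():
            if not line or line.startswith("#"):
                continue
            cols = line.split("\t")
            self.rows.append((cols[0], int(cols[3]), int(cols[4]), cols[2]))
        self.rows.sort(key=lambda row: row[1])

    def features(self, seqid, start, end):
        return [r for r in self.rows if r[0] == seqid and r[1] <= end and r[2] >= start]


class SqliteAdaptor(FeatureAdaptor):
    """Database style adaptor: expects a table feature(seqid, start_pos, end_pos, type)."""

    def __init__(self, conn):
        self.conn = conn

    def features(self, seqid, start, end):
        query = ("SELECT seqid, start_pos, end_pos, type FROM feature "
                 "WHERE seqid = ? AND start_pos <= ? AND end_pos >= ? ORDER BY start_pos")
        return [tuple(row) for row in self.conn.execute(query, (seqid, end, start))]


def describe_region(adaptor, seqid, start, end):
    """Browser-side code: identical no matter which adaptor is plugged in."""
    return [f"{ftype} {s}-{e}" for _, s, e, ftype in adaptor.features(seqid, start, end)]


gff3 = ("##gff-version 3\n"
        "chr1\tdemo\tgene\t100\t900\t.\t+\t.\tID=g1\n"
        "chr1\tdemo\tgene\t2000\t3000\t.\t-\t.\tID=g2\n")
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE feature (seqid TEXT, start_pos INTEGER, end_pos INTEGER, type TEXT)")
conn.executemany("INSERT INTO feature VALUES (?, ?, ?, ?)",
                 [("chr1", 100, 900, "gene"), ("chr1", 2000, 3000, "gene")])

from_file = describe_region(Gff3TextAdaptor(gff3), "chr1", 1, 1000)
from_db = describe_region(SqliteAdaptor(conn), "chr1", 1, 1000)
print(from_file, from_db)
```

Note how the database adaptor only works if the table matches the layout it was written for, which is why pointing an existing adaptor at an arbitrary custom schema does not work.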

So Bio::DB::GFF is a Perl module that is a GBrowse adaptor.  It expects to
read from a database with a specific schema.  (Bio::DB::GFF also assumes
GFF2, a now deprecated format.)

However, writing an adaptor is not a small undertaking.  Probably a much
easier way to tackle this is to write a program to export GFF3 and FASTA
formatted files from your database and then load them into a
Bio::DB::SeqFeature::Store MySQL database.  This will likely be faster than
running directly off of your source database.  GFF3 is a flat file format
for specifying genomic features (genes, exons, SNPs, ...) and relationships
between them.  FASTA is a flat file format for specifying sequence.
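
For a feel of the two formats, here is a small sketch that builds a toy gene with two exons and its sequence. All IDs, coordinates, and the `example` source label are made up. The relationships between features are expressed through the `ID` and `Parent` attributes in the ninth GFF3 column, and the sequence name in the FASTA header is meant to match the first GFF3 column.

```python
def gff3_row(seqid, ftype, start, end, strand, **attributes):
    """Format one GFF3 feature line; score and phase are left as '.'."""
    attr_text = ";".join(f"{key}={value}" for key, value in attributes.items())
    return "\t".join([seqid, "example", ftype, str(start), str(end), ".", strand, ".", attr_text])


def fasta_record(seq_id, sequence, width=60):
    """Format one FASTA record: a '>' header line, then fixed-width sequence lines."""
    chunks = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return f">{seq_id}\n" + "\n".join(chunks) + "\n"


rows = [
    "##gff-version 3",
    gff3_row("chr1", "gene", 1, 20, "+", ID="gene1", Name="demoA"),
    gff3_row("chr1", "mRNA", 1, 20, "+", ID="mrna1", Parent="gene1"),   # child of the gene
    gff3_row("chr1", "exon", 1, 8, "+", ID="exon1", Parent="mrna1"),    # children of the mRNA
    gff3_row("chr1", "exon", 13, 20, "+", ID="exon2", Parent="mrna1"),
]
gff3_text = "\n".join(rows) + "\n"
fasta_text = fasta_record("chr1", "ACGTACGTACGTACGTACGT", width=10)

print(gff3_text, end="")
print(fasta_text, end="")
```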

Since you have a custom database, there is not going to be any program that
will create GFF3 or FASTA for you. FASTA should be trivial to create (if you
have the sequence).  GFF3 will require more work.  Some code you could look
at for inspiration is the GMODTools suite (http://gmod.org/wiki/GMODTools).
It does conversion from several formats to GFF3.

Where does annotation data come from?  From an annotation pipeline!

Wait.  That answer isn't helpful, darnit.  A pipeline is usually a series
(thus a pipeline) of programs that performs some analysis on sequence.  For
example, you might have an already annotated reference genome, and a slew of
short sequence reads from ESTs* from the latest high-throughput sequencer
and you want to annotate the reference genome with the new data.  Your
pipeline might be:

1. Assemble the short reads into a series of contigs (put the short reads
together into longer chunks, hopefully each as long as the complete EST).  

2. Align the contigs to the reference genome (figure out where they came
from)

3. Create a GFF3 file and a FASTA file (not sure on the FASTA) describing
where each EST aligns to and load it into GBrowse.

All of these steps may involve heavy magic.  Fortunately, most of that magic
is already done by the people who have written the programs to do the steps.
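
Step 3 is the part you would most likely write yourself, so here is a rough sketch of it. It assumes the aligner's output has already been parsed into simple records. The `Hit` fields, the `est_pipeline` source label, the 95% identity cutoff, and the choice to put percent identity in the score column are all illustrative assumptions, not anything prescribed by GBrowse. `EST_match` is a Sequence Ontology term and `Target` is a standard GFF3 attribute for alignments.

```python
from collections import namedtuple

# One alignment of an assembled EST contig against the reference (hypothetical record layout).
Hit = namedtuple("Hit", "est_id est_start est_end ref_id ref_start ref_end strand percent_identity")


def hit_to_gff3(hit, source="est_pipeline"):
    """Describe where one EST aligns on the reference as a GFF3 EST_match line."""
    attributes = f"ID={hit.est_id}_match;Target={hit.est_id} {hit.est_start} {hit.est_end}"
    return "\t".join([hit.ref_id, source, "EST_match", str(hit.ref_start), str(hit.ref_end),
                      f"{hit.percent_identity:.1f}", hit.strand, ".", attributes])


def hits_to_gff3(hits, min_identity=95.0):
    """Drop weak alignments, sort by position, and return GFF3 text."""
    kept = sorted((h for h in hits if h.percent_identity >= min_identity),
                  key=lambda h: (h.ref_id, h.ref_start))
    return "\n".join(["##gff-version 3"] + [hit_to_gff3(h) for h in kept]) + "\n"


# Made-up alignment results standing in for steps 1 and 2.
hits = [
    Hit("EST2", 1, 400, "chr1", 5000, 5399, "-", 99.2),
    Hit("EST1", 1, 350, "chr1", 1200, 1549, "+", 98.0),
    Hit("EST3", 1, 300, "chr1", 9000, 9299, "+", 80.5),
]
est_gff3 = hits_to_gff3(hits)
print(est_gff3, end="")
```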

* ESTs = a relatively easy way to find out what part of the genome is being
transcribed (what the active genes are)

As I said, I am relatively new to GMOD and I find the online documentation
is plentiful, but not easily navigated by the newbie. After two weeks of
reading the documentation I find I am now going in circles looking for
answers to my questions - and information on how to design an information
system employing components of GMOD. 

Ideally a diagram that displays a database and how it interacts with the
components of GMOD would be great to see. I haven't yet found anything like
this in the documentation. At the very least, if someone could steer me in
the right direction as far as what components I should focus on and what
specific documentation I can read, it would be appreciated.

It is possible to use GBrowse as a standalone tool, without any other GMOD
tools.  A lot of people actually do this.  It sounds like this might work
fine for you. 

Thanks for the documentation suggestions.  We just did a community survey
and one of the top priorities for the help desk was improving the
documentation.  Look for progress in 2009.

Finally, although you didn't ask for it, I can think of two GBrowse
instances that might show datatypes that are sort of similar to what you are
doing:

  http://hapmap.org

  http://jimwatsonsequence.cshl.edu/cgi-perl/gbrowse/jwsequence/ or

  http://jimwatsonsequence.cshl.edu/cgi-perl/gbrowse/cvsequence/

Please let me know if you have any questions or comments.

Dave C

541 914 6324

AIM or Skype user: tnabtaf 

Any assistance you can provide on these questions would be tremendously
appreciated. And if I can, in turn, provide some input on how to create some
"newbie" documentation, I will do so - to help others in my situation.

Also... I have 15 years' experience working with relational databases... but
not genomic databases... so you can assume a level of technical understanding,
but with the caveat that genomic databases are new territory for me.

Thanks so much for your time.

Kind regards,

Margie Manker

Was this helpful?  Let us know at http://gmod.org/wiki/Help_Desk_Feedback
