[Gmod-help] new database and GBrowse

Mon Dec 8 16:11:13 EST 2008

Hi Margie

You will still need an adaptor such as Bio::DB::GFF (GFF2) or
Bio::DB::SeqFeature::Store (GFF3) to provide data to GBrowse.  However, you
won't need other components such as Chado, Apollo, or CMAP.  You can write
your own adaptor if you choose, but this is non-trivial and you are instead
encouraged to write a program to generate GFF3 & FASTA files from your
database and then use something like bp_seqfeature_load.pl (a bioperl
program) to load the data into a MySQL DB that GBrowse will then read.
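
As a rough illustration of that export step, here is a minimal Python sketch.  The row layout (seqid, type, start, end, strand, id, name) and the source name "mydb" are assumptions made up for this example; your schema will differ, and nothing here is GBrowse's or BioPerl's own code.  It writes nine-column, tab-separated GFF3 lines with 1-based inclusive coordinates, and FASTA records wrapped at 60 characters.

```python
from urllib.parse import quote


def gff3_line(seqid, source, ftype, start, end, strand, feature_id, name=None):
    """Format one feature as a 9-column GFF3 line (1-based, inclusive)."""
    if start > end:
        raise ValueError("GFF3 requires start <= end")
    # Percent-encode characters that are reserved in column 9 (such as ; = & ,).
    attrs = ["ID=" + quote(str(feature_id), safe=" :/")]
    if name is not None:
        attrs.append("Name=" + quote(str(name), safe=" :/"))
    # Score (column 6) and phase (column 8) are left undefined as ".".
    cols = [seqid, source, ftype, str(start), str(end), ".", strand, ".",
            ";".join(attrs)]
    return "\t".join(cols)


def fasta_record(seqid, sequence, width=60):
    """Format one sequence as a FASTA record with fixed-width lines."""
    lines = [">" + seqid]
    lines += [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join(lines)


def export_gff3(rows, source="mydb"):
    """Build GFF3 text from rows; `rows` is a hypothetical list of dicts
    standing in for the result of a query against your own database."""
    out = ["##gff-version 3"]
    for r in rows:
        out.append(gff3_line(r["seqid"], source, r["type"], r["start"],
                             r["end"], r["strand"], r["id"], r.get("name")))
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    rows = [{"seqid": "chr1", "type": "copy_number_variation",
             "start": 1000, "end": 5000, "strand": ".", "id": "cnv_0001"}]
    print(export_gff3(rows), end="")
    print(fasta_record("chr1", "ACGT" * 20))
```

Files written this way are the kind of input a loader such as bp_seqfeature_load.pl expects; check that script's own documentation for how to invoke it.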

I don't think there is any specific documentation on how to write a GBrowse
adaptor.  (Scott, Lincoln, please chime in if you disagree.)  If you still
want to go down this path, I would start by looking at the Chado adaptor
(Bio::DB::Das::Chado), which is inside the GBrowse distribution.

If you haven't already done so, I encourage you to walk through the GBrowse
tutorial, actually doing an installation and playing with the configuration.
At the end of the tutorial it tells you how to hook GBrowse up to a
database instead of running it from flat files.  Installing the prerequisites
for GBrowse (usually BioPerl and libgd) will take some effort, but once
you have it up and running, the tutorial is pretty straightforward.

I'm not familiar with the history or details of the Database of Genomic
Variants (DGV), but I'll take a stab at your questions.  The data behind
tracks such as RefSeq and UCSC segmental duplications were generated by NCBI
(certainly) and UCSC (probably) and made available in some common
bioinformatics file format (if you are lucky, GFF3).  NCBI and UCSC ran
pipelines to generate these data, but as a consumer of the data, you don't
need to know all the details of how those pipelines worked, just what the
output formats are.  The data was generated against a specific build of the
reference human genome.

To create the DGV GBrowse instance the site admins needed to get data files
describing the reference genome, the RefSeq genes, the UCSC segmental
duplications, and so on.

For any file that was not already in GFF2/3, they needed to translate the
data into GFF2/3.  If they were lucky, programs to do that conversion already
existed.  Once the data is in GFF, it can be loaded into a GBrowse database
and then shown in GBrowse.
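
To give a feel for what such a translation involves, here is a minimal Python sketch that converts one line of BED to GFF3.  BED is only an assumed example input; I don't know which formats DGV actually had to convert.  The function name, the default source "converted", and the default feature type "region" are likewise placeholders.  The main point is the coordinate change: BED starts are 0-based with an exclusive end, while GFF3 is 1-based and inclusive.

```python
def bed_to_gff3(bed_line, source="converted", ftype="region"):
    """Convert one tab-separated BED line (3 to 6 columns) to a GFF3 line.

    Assumes the BED name column contains no characters reserved in GFF3
    attributes (; = & ,), so no escaping is done here.
    """
    fields = bed_line.rstrip("\n").split("\t")
    if len(fields) < 3:
        raise ValueError("BED needs at least chrom, chromStart, chromEnd")
    chrom, start0, end = fields[0], int(fields[1]), int(fields[2])
    name = fields[3] if len(fields) > 3 else None
    strand = fields[5] if len(fields) > 5 and fields[5] in ("+", "-") else "."
    attrs = "Name=" + name if name else "."
    # 0-based half-open [start0, end) becomes 1-based inclusive [start0 + 1, end].
    return "\t".join([chrom, source, ftype, str(start0 + 1), str(end),
                      ".", strand, ".", attrs])


if __name__ == "__main__":
    print("##gff-version 3")
    print(bed_to_gff3("chr1\t99\t200\tdup1\t0\t-"))
```

Real converters also have to handle header lines, score columns, and parent/child relationships between features, which is where most of the work goes.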

There may be other data at DGV that was locally created, perhaps the "all
CNVs" track.  In this case, a lot more work had to be done to get the data
from whatever format it was in into GFF so it could be loaded into the DB
and then seen by GBrowse.

All GBrowse instances get their data from somewhere (see
http://gmod.org/wiki/GBrowse_Adaptors).  Is that somewhere external?
 "External" and "Internal" may not be meaningful here.  GBrowse is a Perl
program that reads its data from a data source.  A more useful question
might be "Does the data source exist solely to provide data to GBrowse?"
 The answer to that question is most often Yes.  Most users probably did not
have Bio::DB::GFF or Bio::DB::SeqFeature::Store databases prior to installing
GBrowse.

Please let me know if any of this is not clear or if you have other
questions.  Please don't hesitate to call/Skype/Aim/Ichat me.

Thanks,

Dave C

On Mon, Dec 8, 2008 at 11:04 AM, Margie Manker <manker at populargenetics.ca> wrote:

>  Hi Dave
>
>
>
> I have had a chance to review your comments in detail and have a couple of
> follow-up questions for you:
>
>
>
> 1)      "It is possible to use GBrowse as a standalone tool, without any
> other GMOD tools.  A lot of people actually do this.  It sounds like this
> might work fine for you."  -- If I use GBrowse as a stand-alone tool, do I
> still need an adaptor (such as Bio::DB::GFF)? If not, I would conclude that
> GBrowse can be customized to read directly from my custom database. Is this
> correct? If I can use GBrowse without any other GMOD components, where is
> the best place to look at documentation specifically on how to do this?
>
>
>
> 2)      Annotation Pipeline: I am still unclear as to what is part of the
> annotation pipeline. For example, the tracks in the Database of Genomic
> Variants (i.e. Genome Browser:
> http://projects.tcag.ca/cgi-bin/variation/gbrowse/hg18/ ) display data
> such as RefSeq genes, UCSC segmental duplications, Clones, OMIM disease
> genes, etc. Where do these data come from? Are files downloaded, stored, and
> then read by the pipeline? Are these data part of GBrowse somehow? i.e. does
> GBrowse fetch data from an external source (external to my database/in-house
> information system) and display it to users?
>
>
>
> I think I will be getting ever closer to a better understanding of GMOD and
> GBrowse if you are able to provide further insight to the above topics.
>
>
>
> Again, thank you very much for your help with this. It is much appreciated
> and I look forward to the opportunity to, in turn, help others in our GMOD
> community.
>
>
>
> Cheers!
>
>
>
> Margie
>
>
>
>
>
>
>
> *From:* Dave Clements, GMOD Help Desk [mailto:gmodhelp at googlemail.com]
> *Sent:* December-04-08 7:30 PM
> *To:* Margie Manker
> *Cc:* GMOD Help Desk
> *Subject:* Re: [Gmod-help] new database and GBrowse
>
>
>
> Hi Margie,
>
> I remember speaking with you in Toronto. I hope that you are still enjoying
> working in biology!
>
>  -          I am creating two new database schemas that will contain
> mostly genomic variation data as well as some phenotype data. These data
> will also include information on a study, methods, platforms, subjects,
> samples, etc.
>
> -          I would like to create a schema that suits the needs of our
> organization. I have reviewed Chado in some detail and it does not suit the
> needs of our organization. Ideally, our own schema should be used and I
> would like to continue with this approach.
>
>  Can you describe what you found lacking in Chado?  This will help us
> improve it in the near future:  Chado is extendable and NESCent (
> nescent.org) has developed a natural diversity module for Chado. This is
> still in Beta (and is likely to change before it is released).  It is based
> on the GDPDM, which is used at Gramene and MaizeGenetics for this purpose.
>  One of my deliverables for 2009 is to get the natural diversity module out
> of Beta and into production Chado.
>
> Several things should help this along.  One is a NESCent working group that
> needs this to be done, and secondly we are trying to schedule a GMOD natural
> diversity hackathon for 2009 that will move this work forward.
>
> If you are interested, the natural diversity module and GDPDM are described
> at:
>
>
> http://heliconiusdb.svn.sourceforge.net/viewvc/heliconiusdb/trunk/schema/doc/
>
>   http://www.maizegenetics.net/gdpdm/
>
> I think all this work may come too late for your needs.  However, I
> encourage you to look at the current beta release as a possible solution.
>  When I actually get to work on this (probably starting in February) I may
> ask you for any insights you have and for a copy of your schema.  If you are
> really lucky (!) I might even ask if you are interested in attending the
> hackathon.  :-)
>
>  -          We will most likely employ GBrowse as the genome browser for
> display of data in the above databases.
>
> -          My highest level questions that I have yet to find appropriate
> answers to are these:
>
> o   Can I use my own schema to build the database which underlies Gbrowse?
> If so, will a separate 'Bio::DB::GFF' database need to be created to act as
> a bridge between my database and Gbrowse?
>
> o   What components would I most likely need from GMOD to get my database
> and GBrowse to work together?
>
> -          From what I can determine based on the documentation, I should
> be able to use my own database schema to underlie GBrowse. It looks like my
> database would require a GBrowse adaptor (Bio::DB::GFF??) and GBrowse. It
> also looks like I might need an annotation pipeline, too.
>
> -          Other questions that arise are:
>
> o   What is "Bio::DB::GFF"? Is it a database? Schema? Adaptor?
>
> o   Where does annotation data come from? What is the annotation pipeline?
>
>  GBrowse uses adaptors to read different data sources.  The data source
> can be flat files (GFF3 + FASTA if you want the sequence), or databases, or
> any other data source you can imagine.  I believe that all adaptors are
> written in Perl.  Each adaptor has an expected input format.  The database
> adaptors expect a specific schema to talk to.
>
> So Bio::DB::GFF is a Perl module that is a GBrowse adaptor.  It expects to
> read from a database with a specific schema.  (Bio::DB::GFF also assumes
> GFF2, a now deprecated format.)
>
> However, writing an adaptor is not a small undertaking.  Probably a much
> easier way to tackle this is to write a program to export GFF3 and FASTA
> formatted files from your database and then load them into a
> Bio::DB::SeqFeature::Store MySQL database.  This will likely be faster than
> running directly off of your source database.  GFF3 is a flat file format
> for specifying genomic features (genes, exons, SNPs, ...) and relationships
> between them.  FASTA is a flat file format for specifying sequence.
>
> Since you have a custom database, there is not going to be any program that
> will create GFF3 or FASTA for you. FASTA should be trivial to create (if you
> have the sequence).  GFF3 will require more work.  Some code you could look
> at for inspiration is the GMODTools suite (http://gmod.org/wiki/GMODTools).
>  It does conversion from several formats to GFF3.
>
> Where does annotation data come from?  From an annotation pipeline!
>
> Wait.  That answer isn't helpful, darnit.  A pipeline is usually a series
> (thus a pipeline) of programs that performs some analysis on sequence.  For
> example, you might have an already annotated reference genome, and a slew of
> short sequence reads from ESTs* from the latest high-throughput sequencer
> and you want to annotate the reference genome with the new data.  Your
> pipeline might be:
>
> 1. Assemble the short reads into a series of contigs (put the short reads
> together into longer chunks, hopefully each as long as the complete EST).
>
> 2. Align the contigs to the reference genome (figure out where they came
> from)
>
> 3. Create a GFF3 file and a FASTA file (not sure on the FASTA) describing
> where each EST aligns to and load it into GBrowse.
>
> All of these steps may involve heavy magic.  Fortunately, most of that
> magic is already done by the people who have written the programs to do the
> steps.
>
> ESTs = a relatively easy way to find out what part of the genome is being
> transcribed (what the active genes are)
>
>  As I said, I am relatively new to GMOD and I find the online
> documentation is plentiful, but not easily navigated by the newbie. After
> two weeks of reading the documentation I find I am now going in circles
> looking for answers to my questions – and information on how to design an
> information system employing components of GMOD.
>
> Ideally a diagram that displays a database and how it interacts with the
> components of GMOD would be great to see. I haven't yet found anything like
> this in the documentation. At the very least, if someone could steer me in
> the right direction as far as what components I should focus on and what
> specific documentation I can read, it would be appreciated.
>
>  It is possible to use GBrowse as a standalone tool, without any other
> GMOD tools.  A lot of people actually do this.  It sounds like this might
> work fine for you.
>
> Thanks for the documentation suggestions.  We just did a community survey
> and one of the top priorities for the help desk was improving the
> documentation.  Look for progress in 2009.
>
> Finally, although you didn't ask for it, I can think of two GBrowse
> instances that might show datatypes that are sort of similar to what you are
> doing:
>
>   http://hapmap.org
>
>   http://jimwatsonsequence.cshl.edu/cgi-perl/gbrowse/jwsequence/ or
>
>   http://jimwatsonsequence.cshl.edu/cgi-perl/gbrowse/cvsequence/
>
> Please let me know if you have any questions or comments.
>
> Dave C
>
> 541 914 6324
>
> AIM or Skype user: tnabtaf
>
>
>
> Any assistance you can provide on these questions would be tremendously
> appreciated. And if I can, in turn, provide some input on how to create some
> "newbie" documentation, I will do so – to help others in my situation.
>
>
>
> Also…I have 15 years' experience working with relational databases…but not
> genomic databases…so you can assume a level of technical understanding, but
> with the caveat that genomic databases are new territory for me.
>
>
>
> Thanks so much for your time.
>
>
>
> Kind regards,
>
>
>
> Margie Manker
>
>  Was this helpful?  Let us know at http://gmod.org/wiki/Help_Desk_Feedback