[Gmod-help] RE: Bee Base site - databases
Dave Clements
clements at nescent.org
Thu Oct 21 12:43:52 EDT 2010
Hi Tom,
I will answer some of your questions below. I'll try to avoid answering the
same questions that Scott did.
Dave C
On Wed, Oct 20, 2010 at 3:56 PM, Walk, Tom <Tom.Walk at ars.usda.gov> wrote:
> Chris,
>
>
>
> I am a new postdoc working for Scott Geib. Please pardon any redundancy or
> confusion on my part.
>
>
>
> fyi, I am including gmod help in case they can more readily answer my
> questions.
>
>
>
> Your input is very much appreciated here, where we are sequencing the
> oriental fruit fly genome along with transcripts, and are in the initial
> stages of web and database development. I have worked at the Broad
> Institute and TAIR, so I am familiar with using these things, but they were
> already well constructed before my arrival. Here at the USDA we are
> starting from scratch.
>
>
>
> Right now, there are a few things I want to ask about.
>
>
>
> One thing I am unclear about is the distinction between a GBrowse db and a
> drupal db.
>
>
>
> Would the drupal db be part of web user queries? For example, if a web
> user wants a list of all kinases, would the search only use the drupal db?
> If so, it seems redundant to me. We will have all of the information in a
> genomics db, perhaps with a Chado schema. I can see that using this for web
> queries might pose integrity risks. Are you suggesting that we use 2
> somewhat overlapping db, one for internal use and another for public use?
> If so, I see the public db being mostly a subset of the internal one, with
> us choosing which fields to make public, and possibly adding tables for web
> specific info, such as links followed or user info.
>
>
>
> Alternatively, should we use the GBrowse db for internal use as well as
> the backend for genome related web use, and limit the drupal db use to
> nongenome related web info?
>
I am far from a Drupal expert, but I'll take a stab at this. I concur with
Chris that you want a separate database for GBrowse. Something like a
Bio::DB::SeqFeature::Store database is custom built to power an application
like GBrowse. As Scott said, Chado can do it for small to medium datasets but
it won't be quite as fast. For large datasets, performance will become a
serious issue.

It's worth thinking about viewing a Chado instance as your "system of
record." It is where the definitive copy of your data lives. It can also
store a lot more datatypes than a GBrowse-centric db. So, yes, I would use
one db to power your genome browser, and Chado to power your non-genome
website. Therefore, Drupal would not come into consideration for your
genome browser, only your non-genome web data.

Drupal complicates things further. Drupal 6 (as I understand it) does not
integrate well with pre-existing databases that weren't created by Drupal,
e.g., Chado. Tools like Tripal and GMOD-DBSF (see
http://gmod.org/wiki/Tripal and http://gmod.org/wiki/Gmod-dbsf) jump through
some hoops to make Drupal talk to Chado. Tripal uses synchronization to keep
the Drupal side and the Chado side in agreement. I'm not sure what GMOD-DBSF
does.

This will become easier in Drupal 7, which will be able to talk directly to
external databases. See http://drupal7releasedate.com/ for a statistics-based
estimate of when Drupal 7 will be released. However, that won't
immediately help Tripal or GMOD-DBSF, as they will both need to be upgraded,
and that can't happen until all the Drupal modules they use are ported to
Drupal 7, which will likely be a while.
>
> Should we use the same RDBMS for both? It would seem to be simpler, but I
> may be missing some reasons why we need both mySQL and PostgreSQL.
>
>
>
> While I am on the subject, does PostgreSQL have problems with the size of
> sequence objects? Should that factor into our decisions?
>
>
> As for which tools we will use, I think that a lot of decisions remain. We
> may use GBrowse and Apollo, but I am also experimenting with Argo. I was
> leaning toward using Chado, but from your email, and from looking at the
> tables, it seems that we may want to use another schema, probably simpler
> than Chado for our purposes.
>
>
>
> Is GBrowse limited to the schemas outlined under adaptors in the following?
>
> http://gmod.org/wiki/GBrowse#About_Databases
>
There are probably one or two other ones floating around out there, but
these are the ones that are well supported. I would say that everything
from Bio::DB::SeqFeature::Store through Bio::DB::Das::Chado is both well
supported and widely enough used to consider.
>
>
> Do you know which ones are most supported or least buggy? Do you have a
> recommendation?
>
>
>
> In a related inquiry, are the Gmod tools flexible?
>
> I see a lot of tables in Chado, which is great, but for whatever db we use,
> can we add or delete tables? How difficult is it to incorporate new tools
> or info?
>
Flexibility is a key goal at GMOD. We strive to have software that is
useful in a wide variety of environments. This means the software is
usually both very configurable and extensible. However, I would say that
almost every large project pushes the software in new directions. Sometimes
this new functionality is implemented by the tool developers, and sometimes
by users.

You can drop/modify/add tables and columns to Chado. Many organizations do
this. I would be very careful deleting things from the core: General, CV,
Sequence, Pub, Organism. Tools should continue to work with the addition of
new tables/columns, provided this doesn't cause constraint violations for
existing tools.
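
To make the "add a table without breaking existing tools" point concrete,
here is a minimal sketch. It is not real Chado: it uses Python's built-in
sqlite3 as a stand-in for PostgreSQL, a two-column stand-in for Chado's
feature table, and a made-up extension table called strain_snp (the table
name and its columns are assumptions for illustration only). The idea is
that the new table points at the core table through a foreign key, so
queries that only touch the core table behave exactly as before.

```python
import sqlite3

# Simplified stand-in for Chado's core "feature" table. The real table has
# many more columns and normally lives in PostgreSQL.
CORE_SCHEMA = """
CREATE TABLE feature (
    feature_id INTEGER PRIMARY KEY,
    uniquename TEXT NOT NULL UNIQUE
);
"""

# Hypothetical site-specific extension table; NOT part of Chado.
EXTENSION_SCHEMA = """
CREATE TABLE strain_snp (
    strain_snp_id INTEGER PRIMARY KEY,
    feature_id INTEGER NOT NULL REFERENCES feature (feature_id),
    strain TEXT NOT NULL,
    allele TEXT NOT NULL
);
"""


def core_query(conn):
    """A query an existing tool might run; it only touches the core table."""
    cur = conn.execute("SELECT uniquename FROM feature ORDER BY uniquename")
    return cur.fetchall()


def build_demo():
    """Create the core table, then bolt on the extension table."""
    conn = sqlite3.connect(":memory:")
    # SQLite only enforces foreign keys when this is switched on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(CORE_SCHEMA)
    conn.execute(
        "INSERT INTO feature (feature_id, uniquename) VALUES (1, 'gene0001')"
    )
    before = core_query(conn)

    # Add the new table and one row that references an existing feature.
    conn.executescript(EXTENSION_SCHEMA)
    conn.execute(
        "INSERT INTO strain_snp (feature_id, strain, allele) "
        "VALUES (1, 'strainA', 'G')"
    )
    after = core_query(conn)
    return conn, before, after
```

In this sketch the core query returns the same rows before and after the
extension table is added, and an attempt to insert a strain_snp row for a
feature_id that does not exist is rejected by the foreign key. Dropping or
altering core tables is where existing tools would be more likely to break.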
>
>
> For example, we are dealing with multiple strains/species. If SNP
> analysis is not in the db, can we add table for it, or are we constrained to
> existing tables and fields?
>
>
>
> In addition, how well do the GMOD tools and db's handle existing functional
> or structural analysis or adapt to new analysis tools?
>
>
>
> Finally, at USDA, we are interested in biotic interactions. Are there
> tables to link organisms like there are for linking protein interactions?
> If not, are the tools extensible for that?
>
This is a poorly documented area of Chado. There is a phylogeny module for
that type of data. There is also a natural diversity module under
development that will be able to support arbitrary crosses and breeding.
This will all likely become clearer during the upcoming GMOD Evo Hackathon
next month.
>
>
> It looks like anything that can be mapped out as a feature with genome
> coordinates can be handled. So if we use new tools, or find pathogenicity
> genes or markers, then we can use GMOD. Perhaps you can correct that if it
> is wrong or too simplistic.
>
I think this is correct.
>
>
> I think that is enough for today. It seems that I still have a lot to
> figure out. Sorry if this is too long. As I learn more and we progress, I
> will likely seek out more advice. If you would prefer to talk to me on the
> phone, please call the number below.
>
>
>
> Thanks for all of your help to date and for any feedback you can provide to
> this inquiry.
>
>
>
> Tom Walk
>
> tom.walk at ars.usda.gov
>
> 808 932 2176
>
>
>
>
>
>
>
> *From:* Chris Childers [mailto:genetics.guy at gmail.com]
> *Sent:* Wednesday, October 20, 2010 4:22 AM
> *To:* Natasha Sostrom
> *Subject:* Re: Bee Base site - databases
>
>
>
> Hi Natasha,
>
> The short answer is that I would recommend keeping your GBrowse database
> separate from the drupal database, for the simple reason that this allows
> you to have more flexibility in the future. You might want to run both
> databases on MySQL or Postgres, or have one on each. I'll talk a little
> more about this below. I apologize if this is something you already know
> about, but I wanted to try clarifying my earlier response. The long answer
> is below.
>
> There is an important distinction that a lot of people can get mixed up
> over when talking about databases, and this can be confusing to folks that
> are just getting into it. There are actually two distinct things people
> talk about when they mention databases. One is an "RDBMS" or "Relational
> Database Management System", and the other is the databases that live in
> that system.
>
> The RDBMS is something like MySQL, or Postgres, or Oracle, and it includes
> all the software for storing and managing information. Many of the RDBMS
> out there use SQL, and there is a lot of overlap in how you interact with
> the data, regardless of whether it is a postgres or mysql database. There
> are some differences though, and that's why people use different systems for
> different uses. Each RDBMS can hold many databases, and each database can
> have lots of data.
>
> The GMOD tool Chado is the main relational database for housing all the
> information you might have, but it has historically had problems when used
> as a back end for GBrowse. GBrowse has several different database schemas
> (a schema is like a blueprint for how to store the data) that it can use,
> as long as you specify which one you use.
>
> That was why I was asking if you were still planning to only use GBrowse,
> or if you had decided to also start using Chado. If you are going to use
> Chado, I have heard that the new version of GBrowse runs a lot better with
> it, but I haven't tested it myself. If you guys are only planning to use
> GBrowse, you might just want to use one of the basic MySQL databases. Those
> are much smaller and run really fast.
>
> Sorry about the long winded answer. I hope this helps you guys with your
> planning.
>
> Thanks,
> Chris
>
> On Tue, Oct 19, 2010 at 7:44 PM, Natasha Sostrom <sostrom at hawaii.edu>
> wrote:
>
> Chris,
>
>
> I apologize for not being clear about what the situation was. Right now we
> are still in the development stage. Nothing has gone live, and we are trying
> to make some decisions about where we want our site to go and such.
>
>
> MySQL is what we were using for the general functionality of the Drupal
> site. As we speak we have not set up anything on the website to display
> data. Is it best to JUST use postgres?
>
>
> I did see the iFrame module, which seems very useful. Which is why I'm
> wondering whether we should use two separate databases or just one. To choose
> just ONE database for the entire website, which would be best?
>
>
> Thank you
> Natasha Sostrom
>
>
>
> ----- Original Message -----
> From: Chris Childers <genetics.guy at gmail.com>
> Date: Tuesday, October 19, 2010 3:33 am
> Subject: Re: Bee Base site - databases
> To: Natasha Sostrom <sostrom at hawaii.edu>
>
> > Hi Natasha,
> >
> > Are you still planning to run GBrowse only, or are you using Chado? In
> our lab, we have instances of Chado to store our community annotation data
> and mysql databases to house the GBrowse data.
> >
> > When you say a mysql database site, are you referring to a GBrowse
> page? Or are you using some other software to display the data in the
> postgres database?
> >
> > In terms of showing a GBrowse page using an iframe, this is not a
> problem as long as you are not planning to send extra information via the
> address bar. Drupal has an iframe plugin that simplifies the syntax for
> making an iframe, and it can auto set the frame height to the length of the
> page, which is great for dynamically generated pages.
> >
> > I hope this helps,
> > Chris
> >
>
> > On Mon, Oct 18, 2010 at 8:17 PM, Natasha Sostrom <sostrom at hawaii.edu>
> wrote:
>
> > Chris,
> >
> >
> > I emailed you a while back about Gbrowse and Drupal. Now we have come to
> find that we need to use PostgreSQL for GMOD, while the Drupal site is
> currently using MySQL. In the last email you mentioned using iFrames which
> is a good way to display a postgresql database site within a mysql site. Is
> this what you did?
> >
> >
> > A fellow employee mentioned that it may be best to just use one database
> (migrating to PostgreSQL).
> >
> >
> > Do you have any insight about this?
> >
> >
> > Thanks in advance,
> > Natasha Sostrom
>
> >
> >
> >
> > --
> > Chris Childers
> > Postdoctoral Fellow
> > Elsik Computational Genomics Laboratory
> > Georgetown University
> > Department of Biology
> > 406 Reiss Bldg
> > Washington, DC 20057
> > Phone 202-687-5855
> > Fax 202-687-5662
> >
>
>
>
>
> --
> Chris Childers
> Postdoctoral Fellow
> Elsik Computational Genomics Laboratory
> Georgetown University
> Department of Biology
> 406 Reiss Bldg
> Washington, DC 20057
> Phone 202-687-5855
> Fax 202-687-5662
>
--
http://gmod.org/wiki/GMOD_News
http://gmod.org/wiki/Calendar
http://gmod.org/wiki/Help_Desk_Feedback