database schema?
Steven Schmidt
schmidt at cshl.edu
Mon May 13 16:50:56 EDT 2002
Hi Andy,
The schema for the sequence data is in
gramene_ensembl/genbank-fetch/createSchema.sql
There are some supplements to it in gramene_ensembl/load-scripts
The mysql queries are mostly fixed-up on the fly. A few had to be changed.
This is explained in the attached document, which will be included in future
releases.
-Steve
On Thursday 09 May 2002 14:31, Coats, Andy wrote:
> I downloaded the gramene web code recently and have a couple of questions
> about the backend underlying it:
>
> --could you provide a database schema for the rice data? This data is
> stored in an Oracle db, correct?
>
> --I noticed that, in the Ensembl portion of the code
> (gramene_ensembl-data-rel-5-0), the calls to the Mysql db are still intact.
> Are these really being used, so that there are queries on both an Oracle
> and a Mysql database? Thanks.
>
> Andy Coats
--
Steven Schmidt
www.gramene.org
Cold Spring Harbor Laboratory
516-367-6977
-------------- next part --------------
What we did was not so much change Ensembl to work with Oracle, but
as much as possible, make it work without change, so that upgrading to
newer version of Ensembl would be easier, at the cost of some efficiency.
Right now we're still using the 1.0.0 we started with, but should be
up-to-date later this spring.
genbank-fetch/create.sql
========================================================================
Our db schema
We prefixed column names which are oracle reserved words, like 'sequence',
with "ORCL_".
========================================================================
The general part of the adaptation is
Bio::EnsEMBL::DBSQL::DBAdaptor uses
Bio::EnsEMBL::DBSQL::SQL.pm uses
Bio::EnsEMBL::DBSQL::SQL::StatementHandle uses
Bio::EnsEMBL::DBSQL::SQL::Orowhash
Oracle returns all column names as upper case, and the queries
usually have them lower case--a problem for fetchrow_hashref().
I added some NAME_lc's ( $sth->fetchrow_hashref('NAME_lc') )
before realizing this wasn't sufficient.
(There are queries mentioning columns like "xrefId".
So I gave up fixing the Ensembl code and did the Orowhashref thing.)
So these NAME_lc's can be ignored.
ensembl/modules/Bio/EnsEMBL/DBSQL/DBAdaptor.pm
========================================================================
Oracle-style DBI connect
and also save more connect parameters for the map database
(I think this works, but we're using our own map db now).
Adds some tracing -- so you can see better which queries don't work
I'm not sure if the AUTOLOAD I added is really necessary.
You'll want to change other Adaptors you use similarly.
========================================================================
ensembl/modules/Bio/EnsEMBL/DBSQL/SQL.pm
========================================================================
Tries to fix up queries
prepare returns our own StatementHandle type so the results can be munged.
Ofixsql handles reserved words, single vs. double quotes, IF (we're
lucky they can all be made DECODE's), DATES, etc.
The -- stuff is necessary because of queries build with
"Where field1>field2-$x" and $x is negative
========================================================================
ensembl/modules/Bio/EnsEMBL/DBSQL/SQL/StatementHandle.pm
========================================================================
Just makes fetchrow_hasref return our funny kind of hash.
========================================================================
ensembl/modules/Bio/EnsEMBL/DBSQL/SQL/Orowhash.pm
========================================================================
For select_row_hashref.
Makes it case insensitive,
Also handles the column names we changed by adding "ORCL_"
========================================================================
conf/SiteDefs.pm
========================================================================
Relevant variables:
$ENSEMBL_DRIVER = 'Oracle'; # ensembl database driver
$ENSEMBL_HOST = 'cheese'; # ensembl database server
$ENSEMBL_HOST_PORT = '1521'; # ensembl database port (Oracle)
$ENSEMBL_DB = 'gramene'; # ensembl database name
$ENSEMBL_HELP = 'gramene'; # help database name
$ENSEMBL_HELP_HOST = 'cheese'; # help database name
$ENSEMBL_HELP_PORT = '1521'; # help database port
$ENSEMBL_HELP_DBUSER = 'riceseq'; # help database name
$ENSEMBL_HELP_DBPASS = 'passwd'; # help database name
$ENSEMBL_DBUSER = 'riceseq'; # database username
$ENSEMBL_DBPASS = 'passwd'; # database password
(etc. for other databases SNP, FAMILY, etc. if you have them)
$ENSEMBL_ORACLE_HOME = '/oracle1/app/oracle/product/9.0.1.1';
$ENSEMBL_ORACLE_BASE = '/oracle1/app/oracle';
$ENSEMBL_TWO_TASK = 'T:localhost:gramene';
$ENSEMBL_TNS_ADMIN = '/oracle1/app/oracle/product/9.0.1.1/network/admin';
$ENSEMBL_ORACLE_SID = 'gramene';
========================================================================
conf/httpd.conf
========================================================================
These variables are put into the environment of each mod_perl process:
$ENSEMBL_ORACLE_HOME
$ENSEMBL_ORACLE_BASE
$ENSEMBL_TWO_TASK
$ENSEMBL_TNS_ADMIN
========================================================================
perl/EnsWeb.pm
========================================================================
get_locator() mentions everything Oracle wants.
========================================================================
ensembl/modules/Bio/EnsEMBL/DBSQL/RawContig.pm
========================================================================
get_all_SimilarityFeatures_above_score:
Oracle doesn't like "not in" with >1000 entries,
so I build up a "not in" clause for the first 900 features already
processed and remember the rest in a hash.
You might have this problem elsewhere.
========================================================================
ensembl/modules/Bio/EnsEMBL/DBSQL/DBPrimarySeq.pm
========================================================================
Since varchar2 is too short for the sequence dna table, we use CLOB.
So we can't do SUBSTR (etc.) in SQL (Why can't Oracle throw in a hack
for this??)
For convenience we add a column for the length.
You could do better by using the DBMS_LOB PL/SQL package, but we haven't
gotten around to it.
========================================================================
helpview:
========================================================================
The text is often too long to be a varchar2. So each record is split into
several, distinguished by a "piece" number. kw and title are repeated,
so it's not normalized, but makes for quick adaption of the code.
helpview/mkhelpview.sql is the new schema
helpview/prep_table.pl change the helpview table as given by EnsEMBL
into a format that's easier to edit
helpview/gramene_helpview.txt is our help in this format
helpview/load_helpview.pl loads this file into the database, splitting
each entry that needs to be.
( EnsemblLocator is defined in CSHL::Config as
use constant EnsemblLocator => 'Bio::EnsEMBL::DBSQL::DBAdaptor/host=cheese.cshl.org;port=1521;dbname=gramene;user=riceseq;pass=passwd;perlonlyfeatures=0;mapdbname=;driver=Oracle;debug=0';
)
perl/HelpView.pm is modified to fetch all the pieces and put them together.
========================================================================
perl/unisearch
========================================================================
Database connection fixed-up
( added %dbh to avoid Apache::Registry "will not stay shared" errors(this
has nothing to do with Oracle))
Fixed CONCAT(A,B,C) to CONCAT(CONCAT(A,B),C) in queries
Got rid of the offset & limit stuff.
========================================================================
doc/EnsEMBL_Oracle.txt contains some old notes
More information about the Gramene
mailing list