database schema?

Mon May 13 16:50:56 EDT 2002

Hi Andy,

The schema for the sequence data is in  
gramene_ensembl/genbank-fetch/createSchema.sql
There are some supplements to it in gramene_ensembl/load-scripts

The mysql queries are mostly fixed-up on the fly.  A few had to be changed.
This is explained in the attached document, which will be included in future 
releases.

-Steve

On Thursday 09 May 2002 14:31, Coats, Andy wrote:
> I downloaded the gramene web code recently and have a couple of questions
> about the backend underlying it:
>
> --could you provide a database schema for the rice data?  This data is
> stored in an Oracle db, correct?
>
> --I noticed that, in the Ensembl portion of the code
> (gramene_ensembl-data-rel-5-0), the calls to the Mysql db are still intact.
> Are these really being used, so that there are queries on both an Oracle
> and a Mysql database?  Thanks.
>
> Andy Coats

-- 
Steven Schmidt
www.gramene.org
Cold Spring Harbor Laboratory
516-367-6977
-------------- next part --------------

What we did was not so much change Ensembl to work with Oracle, but
as much as possible, make it work without change, so that upgrading to
newer version of Ensembl would be easier, at the cost of some efficiency.

Right now we're still using the 1.0.0 we started with, but should be
up-to-date later this spring.

genbank-fetch/create.sql
========================================================================
Our db schema
We prefixed column names which are oracle reserved words, like 'sequence',
with "ORCL_".
========================================================================

The general part of the adaptation is 
Bio::EnsEMBL::DBSQL::DBAdaptor uses 
Bio::EnsEMBL::DBSQL::SQL.pm uses
Bio::EnsEMBL::DBSQL::SQL::StatementHandle uses
Bio::EnsEMBL::DBSQL::SQL::Orowhash

Oracle returns all column names as upper case, and the queries
usually have them lower case--a problem for fetchrow_hashref().
I added  some NAME_lc's (  $sth->fetchrow_hashref('NAME_lc')  )
before realizing this wasn't sufficient.
(There are queries mentioning columns like "xrefId".
So I gave up fixing the Ensembl code and did the Orowhashref thing.)
So these NAME_lc's can be ignored.

ensembl/modules/Bio/EnsEMBL/DBSQL/DBAdaptor.pm
========================================================================
Oracle-style DBI connect 
and also save more connect parameters for the map database
(I think this works, but we're using our own map db now).

Adds some tracing -- so you can see better which queries don't work

I'm not sure if the AUTOLOAD I added is really necessary.

You'll want to change other Adaptors you use similarly.

========================================================================

ensembl/modules/Bio/EnsEMBL/DBSQL/SQL.pm
========================================================================
Tries to fix up queries
prepare returns our own StatementHandle type so the results can be munged.

Ofixsql  handles reserved words, single vs. double quotes,  IF (we're
lucky they can all be made DECODE's), DATES, etc.
The   --  stuff is necessary because of queries build with
    "Where field1>field2-$x" and $x is negative
========================================================================

ensembl/modules/Bio/EnsEMBL/DBSQL/SQL/StatementHandle.pm
========================================================================
Just makes fetchrow_hasref return our funny kind of hash.
========================================================================

ensembl/modules/Bio/EnsEMBL/DBSQL/SQL/Orowhash.pm
========================================================================
For select_row_hashref.
Makes it case insensitive,
Also handles the column names we changed by adding "ORCL_"
========================================================================

conf/SiteDefs.pm
========================================================================
Relevant variables:
$ENSEMBL_DRIVER		= 'Oracle';	    # ensembl database driver
$ENSEMBL_HOST           = 'cheese';      # ensembl database server
$ENSEMBL_HOST_PORT	= '1521';	    # ensembl database port (Oracle)
$ENSEMBL_DB             = 'gramene';	    # ensembl database name
$ENSEMBL_HELP           = 'gramene';    # help database name
$ENSEMBL_HELP_HOST      = 'cheese';      # help database name
$ENSEMBL_HELP_PORT	= '1521';	    # help database port
$ENSEMBL_HELP_DBUSER    = 'riceseq';	    # help database name
$ENSEMBL_HELP_DBPASS    = 'passwd';	    # help database name
$ENSEMBL_DBUSER         = 'riceseq';         # database username
$ENSEMBL_DBPASS         = 'passwd';	    # database password
(etc. for other databases SNP, FAMILY, etc. if you have them)

$ENSEMBL_ORACLE_HOME    = '/oracle1/app/oracle/product/9.0.1.1';
$ENSEMBL_ORACLE_BASE    = '/oracle1/app/oracle';
$ENSEMBL_TWO_TASK       = 'T:localhost:gramene';
$ENSEMBL_TNS_ADMIN      = '/oracle1/app/oracle/product/9.0.1.1/network/admin';
$ENSEMBL_ORACLE_SID     = 'gramene';
========================================================================

conf/httpd.conf
========================================================================
These variables are put into the environment of each mod_perl process:
$ENSEMBL_ORACLE_HOME
$ENSEMBL_ORACLE_BASE
$ENSEMBL_TWO_TASK
$ENSEMBL_TNS_ADMIN
========================================================================

perl/EnsWeb.pm
========================================================================
get_locator() mentions everything Oracle wants.
========================================================================

ensembl/modules/Bio/EnsEMBL/DBSQL/RawContig.pm
========================================================================
get_all_SimilarityFeatures_above_score:
Oracle doesn't like "not in" with >1000 entries,
so I build up a "not  in" clause for the first 900 features already
processed and remember the rest in a hash.
You might have this problem elsewhere.
========================================================================

ensembl/modules/Bio/EnsEMBL/DBSQL/DBPrimarySeq.pm
========================================================================
Since varchar2 is too short for the sequence dna table, we use CLOB.
So we can't do  SUBSTR (etc.) in SQL  (Why can't Oracle throw in a hack
for this??)
For convenience we add a column for the length.
You could do better by using the DBMS_LOB PL/SQL package, but we haven't
gotten around to it.
========================================================================

helpview:
========================================================================
The text is often too long to be a varchar2. So each record is split into
several, distinguished by a "piece" number.  kw and title are repeated,
so it's not normalized, but makes for quick adaption of the code.

helpview/mkhelpview.sql is the new schema
helpview/prep_table.pl change the helpview table as given by EnsEMBL
				into a format that's easier to edit
helpview/gramene_helpview.txt is our help in  this format
helpview/load_helpview.pl loads this file into the database, splitting
					    each entry that needs to be.
( EnsemblLocator  is defined in CSHL::Config as 
    use constant EnsemblLocator =>    'Bio::EnsEMBL::DBSQL::DBAdaptor/host=cheese.cshl.org;port=1521;dbname=gramene;user=riceseq;pass=passwd;perlonlyfeatures=0;mapdbname=;driver=Oracle;debug=0';
)
perl/HelpView.pm is modified to fetch all the pieces and put them together.
========================================================================

perl/unisearch
========================================================================
Database connection fixed-up
( added %dbh to avoid Apache::Registry "will not stay shared" errors(this
					has nothing to do with Oracle))
Fixed CONCAT(A,B,C) to CONCAT(CONCAT(A,B),C) in queries
Got rid of the offset & limit stuff.
========================================================================

doc/EnsEMBL_Oracle.txt contains some old notes