[Gmod-help] Re: [Gmod-schema] Error using gmod_bulk_load_gff3.pl with a ##sequence-region directive

Jonathan Leto jaleto at gmail.com
Mon Aug 9 14:06:35 EDT 2010


Howdy,

There is actually a flag called -ignore_seqregion in
Bio::DB::SeqFeature::Store::GFF3Loader.

It would be nice if gmod_bulk_load_gff3.pl could take that as a
command-line argument and do the right thing with it.
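
Until the loader accepts such a flag, the workaround mentioned in the
quoted thread (removing the directives before loading) can be scripted.
The following is only a minimal sketch in Python, not part of GMOD or
BioPerl: the function name strip_sequence_region and the sample feature
line are made up for illustration. It drops ##sequence-region lines and
passes everything else, including other ## directives, through unchanged.

```python
import io


def strip_sequence_region(lines):
    """Yield GFF3 lines, skipping ##sequence-region directives.

    Hypothetical helper for illustration; not a GMOD or BioPerl API.
    """
    for line in lines:
        # Only the sequence-region directive is dropped; other "##"
        # directives and all feature lines are kept as-is.
        if line.startswith("##sequence-region"):
            continue
        yield line


# Made-up sample input: one directive to drop, one feature line to keep.
sample = io.StringIO(
    "##gff-version 3\n"
    "##sequence-region SL1.00sc00001 1 1000\n"
    "SL1.00sc00001\tITAG\tgene\t1\t100\t.\t+\t.\tID=gene1\n"
)
filtered = list(strip_sequence_region(sample))
```

The filtered lines could then be written to a new file and passed to
gmod_bulk_load_gff3.pl via --gfffile as usual.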

Duke



On Wed, Jul 28, 2010 at 12:20 PM, Scott Cain <scott at scottcain.net> wrote:
> An additional (though probably somewhat easy to fix) issue is
> Bio::FeatureIO's insistence that ##sequence-region directives get
> turned into features.  These bits of data are not sufficient to create
> a full-fledged feature that Chado requires, which is why the loader
> (should) ignore them.  Only it can't, because it defers to
> Bio::FeatureIO for file parsing.  If the constructor had a flag to
> ignore those directives, that would make life a little better.  Even
> better than that would be if Bio::FeatureIO could return a message
> stating that a ##sequence-region directive was found but was being
> ignored, so that message could be relayed to the user.
>
> On the other hand, I was unaware of Bio::FeatureIO dropping features;
> that's somewhat unpleasant.  I recall an issue with skipping
> sequences, but I thought that was fixed already.
>
> Scott
>
>
> On Wed, Jul 28, 2010 at 12:53 AM, Chris Fields <cjfields at illinois.edu> wrote:
>> I think the part of BioPerl Scott is referring to for significant refactoring is Bio::FeatureIO.  Scott, is that correct?
>>
>> Having some tests would really help.  I can always sync them over to the Bio-FeatureIO repo, which is separate from core ATM.  I did uncover some pretty significant bugs during my first round of FeatureIO work which are now fixed (skipping features and/or sequences was one).  Now just waiting on tuits...
>>
>> chris
>>
>> On Jul 27, 2010, at 6:39 PM, Jonathan Leto wrote:
>>
>>> Howdy,
>>>
>>> Could you explain what exactly Chado and BioPerl are disagreeing on?
>>> If modifying BioPerl does not make any BioPerl tests fail and allows the loading
>>> of sequence-region directives, I think it should be done.
>>>
>>> If the part of BioPerl that needs to be modified has no or few tests, I can add
>>> some and ask the BioPerl people what they think.
>>>
>>> Duke
>>>
>>>
>>> On Fri, Jul 23, 2010 at 10:52 AM, Scott Cain <scott at scottcain.net> wrote:
>>>> This is in fact a current bug; the easiest workaround is to get rid
>>>> of sequence-region directives.  Actually fixing the bug is a little
>>>> trickier since it is due to the fact that Chado and BioPerl have
>>>> different ideas of what should happen.  While I could (probably)
>>>> modify BioPerl to do the right thing (from my perspective), I am
>>>> reluctant to do that at the moment since that section of BioPerl is
>>>> slated to be refactored.
>>>>
>>>> Scott
>>>>
>>>>
>>>> On Tue, Jul 20, 2010 at 6:55 PM, Dave Clements, GMOD Help Desk
>>>> <help at gmod.org> wrote:
>>>>> Hi Jonathan,
>>>>> I've created a bug report on this:
>>>>>   http://sourceforge.net/tracker/?func=detail&aid=3032325&group_id=27707&atid=391291
>>>>> This is interesting because the code says:
>>>>>   This script does not use sequence-region directives for anything.
>>>>>   If it represents a feature that needs to be inserted into the database,
>>>>>   it should be represented with a full GFF line.
>>>>> Dave C.
>>>>> On Fri, Jul 16, 2010 at 1:31 PM, Jonathan Leto <jaleto at gmail.com> wrote:
>>>>>>
>>>>>> Howdy,
>>>>>>
>>>>>> I have been attempting to load the ITAG GFF3 [0] files, which contain
>>>>>> ##sequence-region directives, but I run into errors like this:
>>>>>>
>>>>>> $ ./gmod_bulk_load_gff3.pl --gfffile
>>>>>> ~/git/ITAG1_release/ITAG1_gene_models_sample.gff3 --organism tomato
>>>>>> --noexon --recreate_cache --analysis --remove_lock --save_tmpfiles
>>>>>> (Re)creating the uniquename cache in the database...
>>>>>> Creating table...
>>>>>> Populating table...
>>>>>> Creating indexes...
>>>>>> Adjusting the primary key sequences (if necessary)...Done.
>>>>>>
>>>>>> --------------------- WARNING ---------------------
>>>>>> MSG: '##feature-ontology' directive handling not yet implemented
>>>>>> ---------------------------------------------------
>>>>>> Preparing data for inserting into the cxgn database
>>>>>> (This may take a while ...)
>>>>>> Loading data into feature table ...
>>>>>>        COPY feature
>>>>>> (feature_id,organism_id,name,uniquename,type_id,is_analysis,seqlen,dbxref_id)
>>>>>> FROM STDIN; at /home/leto/local-lib/lib/perl5/Bio/GMOD/DB/Adapter.pm
>>>>>> line 3210.
>>>>>> Loading data into featureloc table ...
>>>>>>        COPY featureloc
>>>>>>
>>>>>> (featureloc_id,feature_id,srcfeature_id,fmin,fmax,strand,phase,rank,locgroup)
>>>>>> FROM STDIN; at /home/leto/local-lib/lib/perl5/Bio/GMOD/DB/Adapter.pm
>>>>>> line 3210.
>>>>>> DBD::Pg::db pg_endcopy failed: ERROR:  invalid input syntax for integer:
>>>>>> ""
>>>>>> CONTEXT:  COPY featureloc, line 1, column strand: "" at
>>>>>> /home/leto/local-lib/lib/perl5/Bio/GMOD/DB/Adapter.pm line 3222, <$fh>
>>>>>> line 3.
>>>>>>
>>>>>> ------------- EXCEPTION: Bio::Root::Exception -------------
>>>>>> MSG: calling endcopy for featureloc failed:
>>>>>> STACK: Error::throw
>>>>>> STACK: Bio::Root::Root::throw
>>>>>> /home/leto/local-lib/lib/perl5/Bio/Root/Root.pm:368
>>>>>> STACK: Bio::GMOD::DB::Adapter::copy_from_stdin
>>>>>> /home/leto/local-lib/lib/perl5/Bio/GMOD/DB/Adapter.pm:3222
>>>>>> STACK: Bio::GMOD::DB::Adapter::load_data
>>>>>> /home/leto/local-lib/lib/perl5/Bio/GMOD/DB/Adapter.pm:3144
>>>>>> STACK: ./gmod_bulk_load_gff3.pl:1060
>>>>>> -----------------------------------------------------------
>>>>>>
>>>>>> The salient information is that something is attempting to insert a
>>>>>> strand of "" into the database, which fails. Note that I have also
>>>>>> uncommented a warning statement that shows the SQL query being
>>>>>> executed.
>>>>>>
>>>>>> I have traced this issue to the sequence-region directive. When I
>>>>>> remove the line, the file loads fine. As another test, I created a
>>>>>> file with nothing but a sequence-region directive, and the same
>>>>>> error occurs. I have attached that file and the temp data file that
>>>>>> gmod_bulk_load_gff3.pl creates as well. The 6th column of that file
>>>>>> is the strand, and it has a value of "\N", which is the text
>>>>>> representation of NULL.
>>>>>>
>>>>>> It seems to me that something is stringifying the NULL into "" and
>>>>>> then attempting to insert the empty string into strand, which has a
>>>>>> type of smallint. This is what causes the failure.
>>>>>>
>>>>>> I would greatly appreciate any thoughts or comments on how to make the
>>>>>> bulk loading script support the sequence-region directive.
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>> [0] ftp://ftp.solgenomics.net/tomato_genome/annotation/ITAG1_release/
>>>>>>
>>>>>> --
>>>>>> Jonathan "Duke" Leto
>>>>>> jonathan at leto.net
>>>>>> http://leto.net
>>>>>>
>>>>>>
>>>>>> _______________________________________________
>>>>>> Gmod-schema mailing list
>>>>>> Gmod-schema at lists.sourceforge.net
>>>>>> https://lists.sourceforge.net/lists/listinfo/gmod-schema
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> ===> PLEASE KEEP RESPONSES ON THE LIST <===
>>>>> http://gmod.org/wiki/GMOD_News
>>>>> http://gmod.org/wiki/Calendar
>>>>> http://gmod.org/wiki/Help_Desk_Feedback
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> ------------------------------------------------------------------------
>>>> Scott Cain, Ph. D.                                   scott at scottcain dot net
>>>> GMOD Coordinator (http://gmod.org/)                     216-392-3087
>>>> Ontario Institute for Cancer Research
>>>>
>>>
>>>
>>>
>>> --
>>> Jonathan "Duke" Leto
>>> jonathan at leto.net
>>> http://leto.net
>>>
>>
>>
>
>
>
> --
> ------------------------------------------------------------------------
> Scott Cain, Ph. D.                                   scott at scottcain dot net
> GMOD Coordinator (http://gmod.org/)                     216-392-3087
> Ontario Institute for Cancer Research
>
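
For reference, the original report quoted above turns on how PostgreSQL's
COPY text format represents NULL: the default marker is \N, whereas an
empty field is read as an empty string, which an integer column such as
strand rejects. The sketch below shows that convention in Python. It is an
illustration only, not the code in Bio::GMOD::DB::Adapter; the function
name copy_text_row and the row values are made up.

```python
def copy_text_row(values):
    """Format one row for PostgreSQL COPY ... FROM STDIN (text format).

    Hypothetical helper for illustration. None becomes the default NULL
    marker \\N; everything else is stringified and escaped.
    """
    fields = []
    for value in values:
        if value is None:
            # Default NULL marker in COPY text format.
            fields.append(r"\N")
        else:
            text = str(value)
            # Escape the backslash first, then the delimiter and line breaks.
            text = (text.replace("\\", "\\\\")
                        .replace("\t", "\\t")
                        .replace("\n", "\\n")
                        .replace("\r", "\\r"))
            fields.append(text)
    return "\t".join(fields) + "\n"


# Made-up featureloc-style row in the column order shown in the log:
# featureloc_id, feature_id, srcfeature_id, fmin, fmax, strand, phase,
# rank, locgroup. Strand and phase are unknown, so they are passed as None.
row = copy_text_row([1, 10, 2, 0, 1000, None, None, 0, 0])
```

With this convention the 6th field (strand) is written as \N and never as
an empty string, which is the distinction the error message points at.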



-- 
Jonathan "Duke" Leto
jonathan at leto.net
http://leto.net



