[Gmod-help] Re: [Gmod-gbrowse] bp_genbank2gff3.pl bugs?
Don Gilbert
gilbertd at cricket.bio.indiana.edu
Fri Dec 11 10:25:07 EST 2009
Alessandra,
I checked with the version of bp_genbank2gff3.pl that I have, against
NCBI GenBank's current human chr1, and don't see the problems
you report. It could be various things; for one, my copy of
bp_genbank2gff3.pl might be older than yours. There is one error
with a CDS (see the log below) that may be mis-formatting in the GenBank data.
As for a GFF Name= tag, GenBank records don't seem to have /name=
qualifiers, so this converter doesn't make them. But for genes, you could
change all the gene= fields to Name= with a quick perl conversion:
perl -pi -e's/;gene=/;Name=/ unless(/Name=/);' hs_ref_GRCh37_chr1.gbk.gff
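If you would rather keep the gene= attribute and add Name= alongside it, here is a rough Python sketch of that variation. It is not what the one-liner above does (that one renames gene= in place); the function name add_name_from_gene is made up for illustration, and it assumes the GFF rows have nine tab-separated columns with ;-separated attributes in the last one:

```python
def add_name_from_gene(line):
    """Return a GFF3 line with Name=<gene> added when it has gene= but no Name=.

    Hypothetical helper: assumes nine tab-separated columns and
    ';'-separated key=value attributes in column 9.
    """
    # Leave comments, directives, and blank lines untouched.
    if line.startswith("#") or not line.strip():
        return line
    body = line.rstrip("\n")
    cols = body.split("\t")
    if len(cols) != 9:
        return line
    attrs = cols[8].split(";")
    parsed = [a.partition("=") for a in attrs]
    keys = [key for key, _, _ in parsed]
    # Skip rows that already have a Name or have no gene to copy from.
    if "Name" in keys or "gene" not in keys:
        return line
    i = keys.index("gene")
    # Insert Name right after gene, keeping the original gene attribute.
    attrs.insert(i + 1, "Name=" + parsed[i][2])
    cols[8] = ";".join(attrs)
    # Re-attach whatever line ending the input had.
    return "\t".join(cols) + line[len(body):]
```

To use it, read the .gff file line by line, pass each line through add_name_from_gene, and write the results to a new file. Running it a second time changes nothing, since rows that already carry Name= are skipped.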
- Don Gilbert
curl -RO ftp://ftp.ncbi.nih.gov/genomes/H_sapiens/CHR_01/hs_ref_GRCh37_chr1.gbk.gz
gzcat hs_ref_GRCh37_chr1.gbk.gz | perl bin/bp_genbank2gff3.pl -in stdin >& log.gbc1
top of gff:
##gff-version 3
# sequence-region NT_077402 1 257719
# conversion-by bp_genbank2gff3.pl
# organism Homo sapiens
# date 10-JUN-2009
# Note Homo sapiens chromosome 1 genomic contig, GRCh37 reference primary assembly.
NT_077402 GenBank chromosome 1 257719 . + . ID=NT_077402;mol_type=genomic
DNA;date=10-JUN-2009;comment1=REFSEQ INFORMATION: Features on this sequence have been produced for build 37
version 1 of the NCBI's genome annotation [see documentation]. The reference sequence is identical to GL00000
1.1. On or before Jun 10%2C 2009 this sequence version replaced gi:29794400%2C gi:29794392. The DNA sequence
is composed of genomic sequence%2C primarily finished clones that were sequenced as part of the Human Genome
Project. PCR products and WGS shotgun sequence have been added where necessary to fill gaps or correct errors
. All such additions are manually curated by GRC staff. For more information see: http://genomereference.org.
;Note=Homo sapiens chromosome 1 genomic contig%2C GRCh37 reference primary assembly.;Alias=1;chromosome=1;Db
xref=taxon:9606;organism=Homo sapiens
NT_077402 GenBank gene 1874 4409 . + . ID=LOC100287102;gene=LOC100287102;Not
e=Derived by automated computational analysis using gene prediction method: GNOMON. Supporting evidence inclu
des similarity to: 3 mRNAs%2C 1 EST%2C 2 Proteins;Dbxref=GeneID:100287102
NT_077402 GenBank mRNA 1874 4409 . + . ID=LOC100287102.t01;Parent=LOC1002871
02;gene=LOC100287102;product=similar to DEAD/H box polypeptide 11 like 9;Note=Derived by automated computatio
nal analysis using gene prediction method: GNOMON. Supporting evidence includes similarity to: 3 mRNAs%2C 1 E
ST%2C 2 Proteins;Dbxref=GI:239740966,GeneID:100287102;transcript_id=XM_002342010.1
NT_077402 GenBank CDS 2190 2227 . + . ID=LOC100287102.p01;Parent=LOC1002871
02.t01;codon_start=1;protein_id=XP_002342051.1;gene=LOC100287102;product=hypothetical protein XP_002342051;No
te=Derived by automated computational analysis using gene prediction method: GNOMON.;Dbxref=GI:239740967,Gene
ID:100287102
NT_077402 GenBank CDS 2595 2721 . + . ID=LOC100287102.p01;Parent=LOC1002871
02.t01;codon_start=1;protein_id=XP_002342051.1;gene=LOC100287102;product=hypothetical protein XP_002342051;No
te=Derived by automated computational analysis using gene prediction method: GNOMON.;Dbxref=GI:239740967,Gene
ID:100287102
NT_077402 GenBank CDS 3403 3639 . + . ID=LOC100287102.p01;Parent=LOC1002871
02.t01;codon_start=1;protein_id=XP_002342051.1;gene=LOC100287102;product=hypothetical protein XP_002342051;No
te=Derived by automated computational analysis using gene prediction method: GNOMON.;Dbxref=GI:239740967,Gene
ID:100287102
NT_077402 GenBank exon 1874 2227 . + . Parent=LOC100287102.t01;gene=LOC10028
7102
NT_077402 GenBank exon 2595 2721 . + . Parent=LOC100287102.t01;gene=LOC10028
7102
NT_077402 GenBank exon 3403 4409 . + . Parent=LOC100287102.t01;gene=LOC10028
7102
--------------
log:
# Input: stdin
# working on chromosome:NT_077402, Homo sapiens, 10-JUN-2009, Homo sapiens chromosome 1 genomic contig, GRCh3
7 reference primary assembly.
..
# working on chromosome:NT_113799, Homo sapiens, 10-JUN-2009, Homo sapiens chromosome 1 genomic contig, GRCh3
7 reference primary assembly.
# working on chromosome:NT_004487, Homo sapiens, 10-JUN-2009, Homo sapiens chromosome 1 genomic contig, GRCh3
7 reference primary assembly.
NT_004487 Unflattening error:
Details:
------------- EXCEPTION: Bio::Root::Exception -------------
MSG: PROBLEM, SEVERITY==2
no containers possible for SeqFeature of type: CDS; this SF is being placed at root level
SF [Bio::SeqFeature::Generic=HASH(0x8a09360)]: CDS; OAZ3; ornithine decarboxylase antizyme 3 isoform 2
# Possible gene unflattening error withNT_004487: consult STDERR
..
# GFF3 saved to ./stdin.gff
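With many contigs in one log, it can help to pull out just the records that hit an unflattening error. A small Python sketch, assuming the log lines look like the ones shown above (accession, whitespace, then "Unflattening error:"); other versions of the script may word this differently, and the function name is only for illustration:

```python
import re

# Pattern taken from the log excerpt above; an assumption about the format.
UNFLATTEN_RE = re.compile(r"^(\S+)\s+Unflattening error:")

def unflattening_errors(log_lines):
    """Return the accessions of records whose log line reports an unflattening error."""
    hits = []
    for line in log_lines:
        match = UNFLATTEN_RE.match(line)
        if match:
            hits.append(match.group(1))
    return hits
```

Called on the lines of log.gbc1, this returns a list of the accessions to look at more closely in the GenBank input.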