[Gmod-help] [Gmod-gbrowse] gff3 file conversion from jgi annotation
Jason Stajich
jason.stajich at gmail.com
Mon Jun 4 18:44:45 EDT 2012
Alex -
I do this type of training a lot - here are some pointers.
I often train by running CEGMA on the genome and using the 400 or so good models it produces as my training set. When I have EST or RNA-Seq data, I use PASA to generate the best set of annotations.
For CEGMA, I then run this script, which comes with MAKER:
cegma2zff output.cegma.gff genome.fa
Then I follow the SNAP directions:
fathom genome.ann genome.dna -categorize 1000
fathom uni.ann uni.dna -export 1000 -plus
mkdir MYGENOME
cd MYGENOME
forge ../export.ann ../export.dna --OPTIONS
cd ..
hmm-assembler.pl MYGENOME MYGENOME > MYGENOME.snap.hmm
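Before running fathom, it can help to confirm how many models actually ended up in the annotation file (for example, that the 400 or so CEGMA models are all there). The following is only a sketch in Python: it assumes the four-column short ZFF form (a ">" header per sequence, then label, begin, end, group), and summarize_zff is a hypothetical helper name, not something shipped with SNAP or MAKER.

```python
from collections import Counter

# Exon labels of the short ZFF form (an assumption of this sketch).
ZFF_LABELS = {"Einit", "Exon", "Eterm", "Esngl"}


def summarize_zff(lines):
    """Return (sequence count, gene count, Counter of exon labels)."""
    n_seqs = 0
    genes = set()
    labels = Counter()
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line: the sequence name is the first token after ">".
            header = line[1:].split()
            current = header[0] if header else ""
            n_seqs += 1
            continue
        fields = line.split()
        if len(fields) < 4 or fields[0] not in ZFF_LABELS:
            raise ValueError(f"unexpected ZFF line: {line!r}")
        label, begin, end, group = fields[:4]
        int(begin), int(end)  # raises ValueError if coordinates are not integers
        genes.add((current, group))
        labels[label] += 1
    return n_seqs, len(genes), labels


# Made-up records, for illustration only.
example = [
    ">scaffold_1",
    "Einit\t201\t325\tgeneA",
    "Eterm\t500\t640\tgeneA",
    "Esngl\t1500\t1000\tgeneB",
]
n_seqs, n_genes, labels = summarize_zff(example)
print(n_seqs, n_genes, dict(labels))
```

For a real file, an open file handle on genome.ann can be passed in place of the example list.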
I then also make the Augustus training data like this:
perl gene_prediction/zff2augustus_gbk.pl > train.gb
using this script:
https://github.com/hyphaltip/genome-scripts/blob/master/gene_prediction/zff2augustus_gbk.pl
I also make ZFF from GFF with this script when I have the RNA-Seq aligned and the best models from PASA:
https://github.com/hyphaltip/genome-scripts/blob/master/gene_prediction/pasatraining2zff.pl
I think other GFF3 -> ZFF scripts exist as well, but I don't have one in front of me.
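To illustrate what a GFF -> ZFF conversion involves, here is a minimal Python sketch that works on JGI-style lines like the ones quoted further down in this thread. It is not the logic of any of the scripts linked above. It assumes the short ZFF form (label, begin, end, group, with minus-strand features written as begin > end), it groups CDS lines by the name attribute, and it keeps only models that have both a start_codon and a stop_codon line. The function name jgi_gff_to_zff is hypothetical.

```python
import re
from collections import defaultdict

# JGI attribute column looks like: name "fgeneshNG_pg.scaffold_1000001"; proteinId 61067
NAME_RE = re.compile(r'name "([^"]+)"')


def jgi_gff_to_zff(lines):
    """Convert JGI-style GFF lines to short-form ZFF text (complete models only)."""
    cds = defaultdict(list)    # (seqid, name) -> [(start, end, strand), ...]
    codons = defaultdict(set)  # (seqid, name) -> codon feature types seen
    for raw in lines:
        # Works for tab- or space-delimited input; column 9 keeps its spaces.
        fields = raw.strip().split(None, 8)
        if len(fields) < 9 or fields[0].startswith("#"):
            continue
        seqid, _source, ftype, start, end, _score, strand, _phase, attrs = fields
        match = NAME_RE.search(attrs)
        if match is None:
            continue
        key = (seqid, match.group(1))
        if ftype == "CDS":
            cds[key].append((int(start), int(end), strand))
        elif ftype in ("start_codon", "stop_codon"):
            codons[key].add(ftype)

    by_seq = defaultdict(list)
    for (seqid, name), exons in cds.items():
        # Skip partial models (no annotated start or stop codon).
        if codons.get((seqid, name)) != {"start_codon", "stop_codon"}:
            continue
        minus = exons[0][2] == "-"
        # Order exons 5' -> 3' along the transcript.
        exons.sort(key=lambda e: e[0], reverse=minus)
        for i, (start, end, _strand) in enumerate(exons):
            if len(exons) == 1:
                label = "Esngl"
            elif i == 0:
                label = "Einit"
            elif i == len(exons) - 1:
                label = "Eterm"
            else:
                label = "Exon"
            begin, stop = (end, start) if minus else (start, end)
            by_seq[seqid].append(f"{label}\t{begin}\t{stop}\t{name}")

    out = []
    for seqid, records in by_seq.items():
        out.append(f">{seqid}")
        out.extend(records)
    return "\n".join(out)


# Lines copied from the annotation sample quoted in this thread.
sample = [
    'scaffold_1 JGI CDS 11332 11910 . - 0 name "fgeneshNG_pg.scaffold_1000001"; proteinId 61067; exonNumber 3',
    'scaffold_1 JGI stop_codon 11332 11334 . - 0 name "fgeneshNG_pg.scaffold_1000001"',
    'scaffold_1 JGI CDS 12067 16940 . - 0 name "fgeneshNG_pg.scaffold_1000001"; proteinId 61067; exonNumber 2',
    'scaffold_1 JGI CDS 17467 17506 . - 2 name "fgeneshNG_pg.scaffold_1000001"; proteinId 61067; exonNumber 1',
    'scaffold_1 JGI start_codon 17504 17506 . - 0 name "fgeneshNG_pg.scaffold_1000001"',
    'scaffold_1 JGI CDS 20239 20607 . - 0 name "gw1.1.441.1"; proteinId 3765; exonNumber 1',
]
print(jgi_gff_to_zff(sample))
```

The partial model gw1.1.441.1 is dropped here because it has no start_codon or stop_codon line. Whether to filter that strictly is a design choice; the resulting file should still be checked with fathom before training.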
--
Jason E Stajich, PhD
Assistant Professor
Plant Pathology & Microbiology
University of California, Riverside
951.827.2363
http://lab.stajich.org http://fungalgenomes.org http://fungidb.org
twitter @stajichlab @hyphaltip @fungalgenomes @fungidb
http://plantpathology.ucr.edu http://genomics.ucr.edu
On Jun 4, 2012, at 6:58 AM, Scott Cain wrote:
> Hi Alex,
>
> I'm cc'ing your email to the GBrowse mailing list, where we discuss
> file format conversions on a regular basis. Your GFF looks very
> GFF2/GTF to me, and looks very similar to the JGI GFF. Have you tried
> this script:
>
> https://github.com/hyphaltip/genome-scripts/blob/master/data_format/gtf2gff3_3level.pl
>
> There is a good chance that will do that job.
>
> Scott
>
>
> On Mon, Jun 4, 2012 at 2:25 AM, Alex Greninger
> <Alexander.Greninger at ucsf.edu> wrote:
>> Dear GMOD Help Desk,
>>
>> Hi, I've spent the weekend searching online trying to answer this question and this is a last resort; I hope you can help!
>>
>> I'm trying to annotate a relatively small (50Mb) genome for which I have a good 300X coverage genome assembly and EST/RNA-Seq data. I've been using the MAKER pipeline. There are no good gene models from Augustus or preloaded into SNAP that I can use, so I'm trying to build a gene model HMM with SNAP using a close-cousin reference genome annotation that was done by JGI. This is the only quasi-related organism when it comes to building the gene models for my assembly and my RNA-Seq data is probably not good enough to be able to bootstrap a gene model using iterative rounds of HMM building from the RNA-Seq data, per the "advanced" recommendations in the MAKER tutorial online. So I really have to figure out how to build this gene model HMM from the JGI file.
>>
>> This requires converting a GFF2/GTF2/JGI-type file to GFF3 or at least a ZFF file. I say GFF2/GTF2/JGI-type file because I can't quite figure out what type it is (it seems like JGI has its own format). Every homebrew script I've found online to solve this deprecated GFF2/GTF2/JGI -> GFF3 parsing problem does not jibe with the SNAP ZFF specification or MAKER's maker2zff script. Before I bite the bullet and try to figure out the correct way to parse from the format I have (GFF2/GTF2/JGI) to GFF3, I figured I should write since this problem has to have been solved before by others who've built models off of JGI's/fgenesh annotations. I include the first 100 lines of the gene model annotation file from JGI so you know what it looks like.
>>
>> This is the great democratization of sequencing and DIY annotation, so of course this is the first time I've come across these file formats and I'm a wet-bio person with only enough coding experience to be dangerous to others. I really thank you for your time and for all the great websites and wikis y'all have put up that have kept me from having to email y'all until I ran across this problem.
>>
>> Many, many thanks for your help!
>>
>> cheers,
>>
>> alex
>>
>> scaffold_1 JGI exon 11332 11910 . - . name "fgeneshNG_pg.scaffold_1000001"; transcriptId 61067
>> scaffold_1 JGI CDS 11332 11910 . - 0 name "fgeneshNG_pg.scaffold_1000001"; proteinId 61067; exonNumber 3
>> scaffold_1 JGI stop_codon 11332 11334 . - 0 name "fgeneshNG_pg.scaffold_1000001"
>> scaffold_1 JGI exon 12067 16940 . - . name "fgeneshNG_pg.scaffold_1000001"; transcriptId 61067
>> scaffold_1 JGI CDS 12067 16940 . - 0 name "fgeneshNG_pg.scaffold_1000001"; proteinId 61067; exonNumber 2
>> scaffold_1 JGI exon 17467 17506 . - . name "fgeneshNG_pg.scaffold_1000001"; transcriptId 61067
>> scaffold_1 JGI CDS 17467 17506 . - 2 name "fgeneshNG_pg.scaffold_1000001"; proteinId 61067; exonNumber 1
>> scaffold_1 JGI start_codon 17504 17506 . - 0 name "fgeneshNG_pg.scaffold_1000001"
>> scaffold_1 JGI exon 17593 18711 . - . name "estExt_fgeneshNG_kg.C_10001"; transcriptId 82392
>> scaffold_1 JGI CDS 17694 18692 . - 0 name "estExt_fgeneshNG_kg.C_10001"; proteinId 82392; exonNumber 1
>> scaffold_1 JGI start_codon 18690 18692 . - 0 name "estExt_fgeneshNG_kg.C_10001"
>> scaffold_1 JGI stop_codon 17694 17696 . - 0 name "estExt_fgeneshNG_kg.C_10001"
>> scaffold_1 JGI exon 20239 20607 . - . name "gw1.1.441.1"; transcriptId 3765
>> scaffold_1 JGI CDS 20239 20607 . - 0 name "gw1.1.441.1"; proteinId 3765; exonNumber 1
>> scaffold_1 JGI exon 20723 22636 . - . name "fgeneshNG_pg.scaffold_1000004"; transcriptId 61070
>> scaffold_1 JGI CDS 20723 22636 . - 0 name "fgeneshNG_pg.scaffold_1000004"; proteinId 61070; exonNumber 1
>> scaffold_1 JGI start_codon 22634 22636 . - 0 name "fgeneshNG_pg.scaffold_1000004"
>> scaffold_1 JGI stop_codon 20723 20725 . - 0 name "fgeneshNG_pg.scaffold_1000004"
>> scaffold_1 JGI exon 26638 26779 . + . name "gw1.1.469.1"; transcriptId 3975
>> scaffold_1 JGI CDS 26638 26779 . + 0 name "gw1.1.469.1"; proteinId 3975; exonNumber 1
>> scaffold_1 JGI stop_codon 26777 26779 . + 0 name "gw1.1.469.1"
>> scaffold_1 JGI exon 26820 27127 . + . name "gw1.1.469.1"; transcriptId 3975
>> scaffold_1 JGI CDS 26820 27127 . + 1 name "gw1.1.469.1"; proteinId 3975; exonNumber 2
>> scaffold_1 JGI exon 28343 28396 . - . name "fgeneshHS_pm.scaffold_1000002"; transcriptId 44165
>> scaffold_1 JGI CDS 28343 28396 . - 0 name "fgeneshHS_pm.scaffold_1000002"; proteinId 44165; exonNumber 4
>> scaffold_1 JGI stop_codon 28343 28345 . - 0 name "fgeneshHS_pm.scaffold_1000002"
>> scaffold_1 JGI exon 30378 30724 . - . name "fgeneshHS_pm.scaffold_1000002"; transcriptId 44165
>> scaffold_1 JGI CDS 30378 30724 . - 0 name "fgeneshHS_pm.scaffold_1000002"; proteinId 44165; exonNumber 3
>> scaffold_1 JGI exon 30855 30858 . - . name "fgeneshHS_pm.scaffold_1000002"; transcriptId 44165
>> scaffold_1 JGI CDS 30855 30858 . - 2 name "fgeneshHS_pm.scaffold_1000002"; proteinId 44165; exonNumber 2
>> scaffold_1 JGI exon 31043 31231 . - . name "fgeneshHS_pm.scaffold_1000002"; transcriptId 44165
>> scaffold_1 JGI CDS 31043 31231 . - 0 name "fgeneshHS_pm.scaffold_1000002"; proteinId 44165; exonNumber 1
>> scaffold_1 JGI start_codon 31229 31231 . - 0 name "fgeneshHS_pm.scaffold_1000002"
>> scaffold_1 JGI exon 31641 32449 . - . name "fgeneshNG_pg.scaffold_1000008"; transcriptId 61074
>> scaffold_1 JGI CDS 31641 32449 . - 0 name "fgeneshNG_pg.scaffold_1000008"; proteinId 61074; exonNumber 2
>> scaffold_1 JGI stop_codon 31641 31643 . - 0 name "fgeneshNG_pg.scaffold_1000008"
>> scaffold_1 JGI exon 32497 32782 . - . name "fgeneshNG_pg.scaffold_1000008"; transcriptId 61074
>> scaffold_1 JGI CDS 32497 32782 . - 2 name "fgeneshNG_pg.scaffold_1000008"; proteinId 61074; exonNumber 1
>> scaffold_1 JGI start_codon 32780 32782 . - 0 name "fgeneshNG_pg.scaffold_1000008"
>> scaffold_1 JGI exon 33022 33607 . + . name "fgeneshNG_pg.scaffold_1000009"; transcriptId 61075
>> scaffold_1 JGI CDS 33022 33607 . + 0 name "fgeneshNG_pg.scaffold_1000009"; proteinId 61075; exonNumber 1
>> scaffold_1 JGI start_codon 33022 33024 . + 0 name "fgeneshNG_pg.scaffold_1000009"
>> scaffold_1 JGI exon 33654 34195 . + . name "fgeneshNG_pg.scaffold_1000009"; transcriptId 61075
>> scaffold_1 JGI CDS 33654 34195 . + 1 name "fgeneshNG_pg.scaffold_1000009"; proteinId 61075; exonNumber 2
>> scaffold_1 JGI stop_codon 34193 34195 . + 0 name "fgeneshNG_pg.scaffold_1000009"
>> scaffold_1 JGI exon 34588 35499 . - . name "fgeneshNG_pg.scaffold_1000010"; transcriptId 61076
>> scaffold_1 JGI CDS 34588 35499 . - 0 name "fgeneshNG_pg.scaffold_1000010"; proteinId 61076; exonNumber 1
>> scaffold_1 JGI start_codon 35497 35499 . - 0 name "fgeneshNG_pg.scaffold_1000010"
>> scaffold_1 JGI stop_codon 34588 34590 . - 0 name "fgeneshNG_pg.scaffold_1000010"
>> scaffold_1 JGI exon 35721 36734 . - . name "fgeneshNG_pg.scaffold_1000011"; transcriptId 61077
>> scaffold_1 JGI CDS 35721 36734 . - 0 name "fgeneshNG_pg.scaffold_1000011"; proteinId 61077; exonNumber 1
>> scaffold_1 JGI start_codon 36732 36734 . - 0 name "fgeneshNG_pg.scaffold_1000011"
>> scaffold_1 JGI stop_codon 35721 35723 . - 0 name "fgeneshNG_pg.scaffold_1000011"
>> scaffold_1 JGI exon 38130 38556 . + . name "e_gw1.1.292.1"; transcriptId 29098
>> scaffold_1 JGI CDS 38130 38556 . + 0 name "e_gw1.1.292.1"; proteinId 29098; exonNumber 1
>> scaffold_1 JGI start_codon 38130 38132 . + 0 name "e_gw1.1.292.1"
>> scaffold_1 JGI exon 38713 39482 . + . name "e_gw1.1.292.1"; transcriptId 29098
>> scaffold_1 JGI CDS 38713 39482 . + 1 name "e_gw1.1.292.1"; proteinId 29098; exonNumber 2
>> scaffold_1 JGI stop_codon 39480 39482 . + 0 name "e_gw1.1.292.1"
>> scaffold_1 JGI exon 39590 39640 . + . name "fgeneshNG_pg.scaffold_1000013"; transcriptId 61079
>> scaffold_1 JGI CDS 39590 39640 . + 0 name "fgeneshNG_pg.scaffold_1000013"; proteinId 61079; exonNumber 1
>> scaffold_1 JGI start_codon 39590 39592 . + 0 name "fgeneshNG_pg.scaffold_1000013"
>> scaffold_1 JGI exon 39683 40042 . + . name "fgeneshNG_pg.scaffold_1000013"; transcriptId 61079
>> scaffold_1 JGI CDS 39683 40042 . + 0 name "fgeneshNG_pg.scaffold_1000013"; proteinId 61079; exonNumber 2
>> scaffold_1 JGI exon 40083 42497 . + . name "fgeneshNG_pg.scaffold_1000013"; transcriptId 61079
>> scaffold_1 JGI CDS 40083 42497 . + 0 name "fgeneshNG_pg.scaffold_1000013"; proteinId 61079; exonNumber 3
>> scaffold_1 JGI stop_codon 42495 42497 . + 0 name "fgeneshNG_pg.scaffold_1000013"
>> scaffold_1 JGI exon 46095 46893 . - . name "fgeneshNG_pg.scaffold_1000014"; transcriptId 61080
>> scaffold_1 JGI CDS 46095 46893 . - 0 name "fgeneshNG_pg.scaffold_1000014"; proteinId 61080; exonNumber 2
>> scaffold_1 JGI stop_codon 46095 46097 . - 0 name "fgeneshNG_pg.scaffold_1000014"
>> scaffold_1 JGI exon 46934 49452 . - . name "fgeneshNG_pg.scaffold_1000014"; transcriptId 61080
>> scaffold_1 JGI CDS 46934 49452 . - 1 name "fgeneshNG_pg.scaffold_1000014"; proteinId 61080; exonNumber 1
>> scaffold_1 JGI start_codon 49450 49452 . - 0 name "fgeneshNG_pg.scaffold_1000014"
>> scaffold_1 JGI exon 49763 49873 . + . name "estExt_fgeneshNG_pg.C_10015"; transcriptId 77664
>> scaffold_1 JGI CDS 49763 49873 . + 0 name "estExt_fgeneshNG_pg.C_10015"; proteinId 77664; exonNumber 1
>> scaffold_1 JGI start_codon 49763 49765 . + 0 name "estExt_fgeneshNG_pg.C_10015"
>> scaffold_1 JGI exon 49913 50214 . + . name "estExt_fgeneshNG_pg.C_10015"; transcriptId 77664
>> scaffold_1 JGI CDS 49913 50214 . + 0 name "estExt_fgeneshNG_pg.C_10015"; proteinId 77664; exonNumber 2
>> scaffold_1 JGI exon 50258 50651 . + . name "estExt_fgeneshNG_pg.C_10015"; transcriptId 77664
>> scaffold_1 JGI CDS 50258 50651 . + 2 name "estExt_fgeneshNG_pg.C_10015"; proteinId 77664; exonNumber 3
>> scaffold_1 JGI exon 50698 50948 . + . name "estExt_fgeneshNG_pg.C_10015"; transcriptId 77664
>> scaffold_1 JGI CDS 50698 50948 . + 0 name "estExt_fgeneshNG_pg.C_10015"; proteinId 77664; exonNumber 4
>> scaffold_1 JGI exon 50996 53506 . + . name "estExt_fgeneshNG_pg.C_10015"; transcriptId 77664
>> scaffold_1 JGI CDS 50996 53471 . + 2 name "estExt_fgeneshNG_pg.C_10015"; proteinId 77664; exonNumber 5
>> scaffold_1 JGI stop_codon 53469 53471 . + 0 name "estExt_fgeneshNG_pg.C_10015"
>> scaffold_1 JGI exon 53849 55612 . + . name "fgeneshNG_pg.scaffold_1000016"; transcriptId 61082
>> scaffold_1 JGI CDS 53849 55612 . + 0 name "fgeneshNG_pg.scaffold_1000016"; proteinId 61082; exonNumber 1
>> scaffold_1 JGI start_codon 53849 53851 . + 0 name "fgeneshNG_pg.scaffold_1000016"
>> scaffold_1 JGI stop_codon 55610 55612 . + 0 name "fgeneshNG_pg.scaffold_1000016"
>> scaffold_1 JGI exon 57035 58150 . - . name "gw1.1.142.1"; transcriptId 1251
>> scaffold_1 JGI CDS 57035 58150 . - 0 name "gw1.1.142.1"; proteinId 1251; exonNumber 1
>> scaffold_1 JGI exon 58531 61815 . + . name "fgeneshNG_pg.scaffold_1000018"; transcriptId 61084
>> scaffold_1 JGI CDS 58531 61815 . + 0 name "fgeneshNG_pg.scaffold_1000018"; proteinId 61084; exonNumber 1
>> scaffold_1 JGI start_codon 58531 58533 . + 0 name "fgeneshNG_pg.scaffold_1000018"
>> scaffold_1 JGI stop_codon 61813 61815 . + 0 name "fgeneshNG_pg.scaffold_1000018"
>> scaffold_1 JGI exon 62494 64545 . - . name "fgeneshNG_pg.scaffold_1000019"; transcriptId 61085
>> scaffold_1 JGI CDS 62494 64545 . - 0 name "fgeneshNG_pg.scaffold_1000019"; proteinId 61085; exonNumber 1
>> scaffold_1 JGI start_codon 64543 64545 . - 0 name "fgeneshNG_pg.scaffold_1000019"
>> scaffold_1 JGI stop_codon 62494 62496 . - 0 name "fgeneshNG_pg.scaffold_1000019"
>> scaffold_1 JGI exon 65565 65937 . + . name "e_gw1.1.364.1"; transcriptId 29060
>>
>
>
>
> --
> ------------------------------------------------------------------------
> Scott Cain, Ph. D. scott at scottcain dot net
> GMOD Coordinator (http://gmod.org/) 216-392-3087
> Ontario Institute for Cancer Research
Jason Stajich
jason.stajich at gmail.com
jason at bioperl.org