[Gmod-help] gff3 file conversion from jgi annotation

Alex Greninger Alexander.Greninger at ucsf.edu
Mon Jun 4 02:25:30 EDT 2012


Dear GMOD Help Desk,

Hi, I've spent the weekend searching online for an answer to this question, and writing to you is my last resort; I hope you can help!

I'm trying to annotate a relatively small (50 Mb) genome for which I have a good genome assembly at 300X coverage, plus EST/RNA-Seq data.  I've been using the MAKER pipeline.  Neither Augustus nor the HMMs preloaded into SNAP offer a good gene model for my organism, so I'm trying to build a gene model HMM with SNAP from the annotation of a close-cousin reference genome, which was done by JGI.  It is the only quasi-related organism available for building gene models for my assembly, and my RNA-Seq data is probably not good enough to bootstrap a gene model through iterative rounds of HMM building, per the "advanced" recommendations in the online MAKER tutorial.  So I really have to figure out how to build this gene model HMM from the JGI file.

This requires converting a GFF2/GTF2/JGI-type file to GFF3, or at least to a ZFF file.  I say GFF2/GTF2/JGI-type because I can't quite figure out which format it is (JGI seems to have its own).  Every homebrew script I've found online for converting the deprecated GFF2/GTF2/JGI formats to GFF3 produces output that does not jibe with the SNAP ZFF specification or with MAKER's maker2zff script.  Before I bite the bullet and work out the correct way to parse the format I have (GFF2/GTF2/JGI) into GFF3, I figured I should write, since this problem must already have been solved by others who have built models from JGI/fgenesh annotations.  I include the first 100 lines of the gene model annotation file from JGI so you can see what it looks like.

This is the great democratization of sequencing and DIY annotation, so of course this is the first time I've come across these file formats and I'm a wet-bio person with only enough coding experience to be dangerous to others.  I really thank you for your time and for all the great websites and wikis y'all have put up that have kept me from having to email y'all until I ran across this problem.

Many, many thanks for your help! 

cheers,

alex

scaffold_1	JGI	exon	11332	11910	.	-	.	name "fgeneshNG_pg.scaffold_1000001"; transcriptId 61067
scaffold_1	JGI	CDS	11332	11910	.	-	0	name "fgeneshNG_pg.scaffold_1000001"; proteinId 61067; exonNumber 3
scaffold_1	JGI	stop_codon	11332	11334	.	-	0	name "fgeneshNG_pg.scaffold_1000001"
scaffold_1	JGI	exon	12067	16940	.	-	.	name "fgeneshNG_pg.scaffold_1000001"; transcriptId 61067
scaffold_1	JGI	CDS	12067	16940	.	-	0	name "fgeneshNG_pg.scaffold_1000001"; proteinId 61067; exonNumber 2
scaffold_1	JGI	exon	17467	17506	.	-	.	name "fgeneshNG_pg.scaffold_1000001"; transcriptId 61067
scaffold_1	JGI	CDS	17467	17506	.	-	2	name "fgeneshNG_pg.scaffold_1000001"; proteinId 61067; exonNumber 1
scaffold_1	JGI	start_codon	17504	17506	.	-	0	name "fgeneshNG_pg.scaffold_1000001"
scaffold_1	JGI	exon	17593	18711	.	-	.	name "estExt_fgeneshNG_kg.C_10001"; transcriptId 82392
scaffold_1	JGI	CDS	17694	18692	.	-	0	name "estExt_fgeneshNG_kg.C_10001"; proteinId 82392; exonNumber 1
scaffold_1	JGI	start_codon	18690	18692	.	-	0	name "estExt_fgeneshNG_kg.C_10001"
scaffold_1	JGI	stop_codon	17694	17696	.	-	0	name "estExt_fgeneshNG_kg.C_10001"
scaffold_1	JGI	exon	20239	20607	.	-	.	name "gw1.1.441.1"; transcriptId 3765
scaffold_1	JGI	CDS	20239	20607	.	-	0	name "gw1.1.441.1"; proteinId 3765; exonNumber 1
scaffold_1	JGI	exon	20723	22636	.	-	.	name "fgeneshNG_pg.scaffold_1000004"; transcriptId 61070
scaffold_1	JGI	CDS	20723	22636	.	-	0	name "fgeneshNG_pg.scaffold_1000004"; proteinId 61070; exonNumber 1
scaffold_1	JGI	start_codon	22634	22636	.	-	0	name "fgeneshNG_pg.scaffold_1000004"
scaffold_1	JGI	stop_codon	20723	20725	.	-	0	name "fgeneshNG_pg.scaffold_1000004"
scaffold_1	JGI	exon	26638	26779	.	+	.	name "gw1.1.469.1"; transcriptId 3975
scaffold_1	JGI	CDS	26638	26779	.	+	0	name "gw1.1.469.1"; proteinId 3975; exonNumber 1
scaffold_1	JGI	stop_codon	26777	26779	.	+	0	name "gw1.1.469.1"
scaffold_1	JGI	exon	26820	27127	.	+	.	name "gw1.1.469.1"; transcriptId 3975
scaffold_1	JGI	CDS	26820	27127	.	+	1	name "gw1.1.469.1"; proteinId 3975; exonNumber 2
scaffold_1	JGI	exon	28343	28396	.	-	.	name "fgeneshHS_pm.scaffold_1000002"; transcriptId 44165
scaffold_1	JGI	CDS	28343	28396	.	-	0	name "fgeneshHS_pm.scaffold_1000002"; proteinId 44165; exonNumber 4
scaffold_1	JGI	stop_codon	28343	28345	.	-	0	name "fgeneshHS_pm.scaffold_1000002"
scaffold_1	JGI	exon	30378	30724	.	-	.	name "fgeneshHS_pm.scaffold_1000002"; transcriptId 44165
scaffold_1	JGI	CDS	30378	30724	.	-	0	name "fgeneshHS_pm.scaffold_1000002"; proteinId 44165; exonNumber 3
scaffold_1	JGI	exon	30855	30858	.	-	.	name "fgeneshHS_pm.scaffold_1000002"; transcriptId 44165
scaffold_1	JGI	CDS	30855	30858	.	-	2	name "fgeneshHS_pm.scaffold_1000002"; proteinId 44165; exonNumber 2
scaffold_1	JGI	exon	31043	31231	.	-	.	name "fgeneshHS_pm.scaffold_1000002"; transcriptId 44165
scaffold_1	JGI	CDS	31043	31231	.	-	0	name "fgeneshHS_pm.scaffold_1000002"; proteinId 44165; exonNumber 1
scaffold_1	JGI	start_codon	31229	31231	.	-	0	name "fgeneshHS_pm.scaffold_1000002"
scaffold_1	JGI	exon	31641	32449	.	-	.	name "fgeneshNG_pg.scaffold_1000008"; transcriptId 61074
scaffold_1	JGI	CDS	31641	32449	.	-	0	name "fgeneshNG_pg.scaffold_1000008"; proteinId 61074; exonNumber 2
scaffold_1	JGI	stop_codon	31641	31643	.	-	0	name "fgeneshNG_pg.scaffold_1000008"
scaffold_1	JGI	exon	32497	32782	.	-	.	name "fgeneshNG_pg.scaffold_1000008"; transcriptId 61074
scaffold_1	JGI	CDS	32497	32782	.	-	2	name "fgeneshNG_pg.scaffold_1000008"; proteinId 61074; exonNumber 1
scaffold_1	JGI	start_codon	32780	32782	.	-	0	name "fgeneshNG_pg.scaffold_1000008"
scaffold_1	JGI	exon	33022	33607	.	+	.	name "fgeneshNG_pg.scaffold_1000009"; transcriptId 61075
scaffold_1	JGI	CDS	33022	33607	.	+	0	name "fgeneshNG_pg.scaffold_1000009"; proteinId 61075; exonNumber 1
scaffold_1	JGI	start_codon	33022	33024	.	+	0	name "fgeneshNG_pg.scaffold_1000009"
scaffold_1	JGI	exon	33654	34195	.	+	.	name "fgeneshNG_pg.scaffold_1000009"; transcriptId 61075
scaffold_1	JGI	CDS	33654	34195	.	+	1	name "fgeneshNG_pg.scaffold_1000009"; proteinId 61075; exonNumber 2
scaffold_1	JGI	stop_codon	34193	34195	.	+	0	name "fgeneshNG_pg.scaffold_1000009"
scaffold_1	JGI	exon	34588	35499	.	-	.	name "fgeneshNG_pg.scaffold_1000010"; transcriptId 61076
scaffold_1	JGI	CDS	34588	35499	.	-	0	name "fgeneshNG_pg.scaffold_1000010"; proteinId 61076; exonNumber 1
scaffold_1	JGI	start_codon	35497	35499	.	-	0	name "fgeneshNG_pg.scaffold_1000010"
scaffold_1	JGI	stop_codon	34588	34590	.	-	0	name "fgeneshNG_pg.scaffold_1000010"
scaffold_1	JGI	exon	35721	36734	.	-	.	name "fgeneshNG_pg.scaffold_1000011"; transcriptId 61077
scaffold_1	JGI	CDS	35721	36734	.	-	0	name "fgeneshNG_pg.scaffold_1000011"; proteinId 61077; exonNumber 1
scaffold_1	JGI	start_codon	36732	36734	.	-	0	name "fgeneshNG_pg.scaffold_1000011"
scaffold_1	JGI	stop_codon	35721	35723	.	-	0	name "fgeneshNG_pg.scaffold_1000011"
scaffold_1	JGI	exon	38130	38556	.	+	.	name "e_gw1.1.292.1"; transcriptId 29098
scaffold_1	JGI	CDS	38130	38556	.	+	0	name "e_gw1.1.292.1"; proteinId 29098; exonNumber 1
scaffold_1	JGI	start_codon	38130	38132	.	+	0	name "e_gw1.1.292.1"
scaffold_1	JGI	exon	38713	39482	.	+	.	name "e_gw1.1.292.1"; transcriptId 29098
scaffold_1	JGI	CDS	38713	39482	.	+	1	name "e_gw1.1.292.1"; proteinId 29098; exonNumber 2
scaffold_1	JGI	stop_codon	39480	39482	.	+	0	name "e_gw1.1.292.1"
scaffold_1	JGI	exon	39590	39640	.	+	.	name "fgeneshNG_pg.scaffold_1000013"; transcriptId 61079
scaffold_1	JGI	CDS	39590	39640	.	+	0	name "fgeneshNG_pg.scaffold_1000013"; proteinId 61079; exonNumber 1
scaffold_1	JGI	start_codon	39590	39592	.	+	0	name "fgeneshNG_pg.scaffold_1000013"
scaffold_1	JGI	exon	39683	40042	.	+	.	name "fgeneshNG_pg.scaffold_1000013"; transcriptId 61079
scaffold_1	JGI	CDS	39683	40042	.	+	0	name "fgeneshNG_pg.scaffold_1000013"; proteinId 61079; exonNumber 2
scaffold_1	JGI	exon	40083	42497	.	+	.	name "fgeneshNG_pg.scaffold_1000013"; transcriptId 61079
scaffold_1	JGI	CDS	40083	42497	.	+	0	name "fgeneshNG_pg.scaffold_1000013"; proteinId 61079; exonNumber 3
scaffold_1	JGI	stop_codon	42495	42497	.	+	0	name "fgeneshNG_pg.scaffold_1000013"
scaffold_1	JGI	exon	46095	46893	.	-	.	name "fgeneshNG_pg.scaffold_1000014"; transcriptId 61080
scaffold_1	JGI	CDS	46095	46893	.	-	0	name "fgeneshNG_pg.scaffold_1000014"; proteinId 61080; exonNumber 2
scaffold_1	JGI	stop_codon	46095	46097	.	-	0	name "fgeneshNG_pg.scaffold_1000014"
scaffold_1	JGI	exon	46934	49452	.	-	.	name "fgeneshNG_pg.scaffold_1000014"; transcriptId 61080
scaffold_1	JGI	CDS	46934	49452	.	-	1	name "fgeneshNG_pg.scaffold_1000014"; proteinId 61080; exonNumber 1
scaffold_1	JGI	start_codon	49450	49452	.	-	0	name "fgeneshNG_pg.scaffold_1000014"
scaffold_1	JGI	exon	49763	49873	.	+	.	name "estExt_fgeneshNG_pg.C_10015"; transcriptId 77664
scaffold_1	JGI	CDS	49763	49873	.	+	0	name "estExt_fgeneshNG_pg.C_10015"; proteinId 77664; exonNumber 1
scaffold_1	JGI	start_codon	49763	49765	.	+	0	name "estExt_fgeneshNG_pg.C_10015"
scaffold_1	JGI	exon	49913	50214	.	+	.	name "estExt_fgeneshNG_pg.C_10015"; transcriptId 77664
scaffold_1	JGI	CDS	49913	50214	.	+	0	name "estExt_fgeneshNG_pg.C_10015"; proteinId 77664; exonNumber 2
scaffold_1	JGI	exon	50258	50651	.	+	.	name "estExt_fgeneshNG_pg.C_10015"; transcriptId 77664
scaffold_1	JGI	CDS	50258	50651	.	+	2	name "estExt_fgeneshNG_pg.C_10015"; proteinId 77664; exonNumber 3
scaffold_1	JGI	exon	50698	50948	.	+	.	name "estExt_fgeneshNG_pg.C_10015"; transcriptId 77664
scaffold_1	JGI	CDS	50698	50948	.	+	0	name "estExt_fgeneshNG_pg.C_10015"; proteinId 77664; exonNumber 4
scaffold_1	JGI	exon	50996	53506	.	+	.	name "estExt_fgeneshNG_pg.C_10015"; transcriptId 77664
scaffold_1	JGI	CDS	50996	53471	.	+	2	name "estExt_fgeneshNG_pg.C_10015"; proteinId 77664; exonNumber 5
scaffold_1	JGI	stop_codon	53469	53471	.	+	0	name "estExt_fgeneshNG_pg.C_10015"
scaffold_1	JGI	exon	53849	55612	.	+	.	name "fgeneshNG_pg.scaffold_1000016"; transcriptId 61082
scaffold_1	JGI	CDS	53849	55612	.	+	0	name "fgeneshNG_pg.scaffold_1000016"; proteinId 61082; exonNumber 1
scaffold_1	JGI	start_codon	53849	53851	.	+	0	name "fgeneshNG_pg.scaffold_1000016"
scaffold_1	JGI	stop_codon	55610	55612	.	+	0	name "fgeneshNG_pg.scaffold_1000016"
scaffold_1	JGI	exon	57035	58150	.	-	.	name "gw1.1.142.1"; transcriptId 1251
scaffold_1	JGI	CDS	57035	58150	.	-	0	name "gw1.1.142.1"; proteinId 1251; exonNumber 1
scaffold_1	JGI	exon	58531	61815	.	+	.	name "fgeneshNG_pg.scaffold_1000018"; transcriptId 61084
scaffold_1	JGI	CDS	58531	61815	.	+	0	name "fgeneshNG_pg.scaffold_1000018"; proteinId 61084; exonNumber 1
scaffold_1	JGI	start_codon	58531	58533	.	+	0	name "fgeneshNG_pg.scaffold_1000018"
scaffold_1	JGI	stop_codon	61813	61815	.	+	0	name "fgeneshNG_pg.scaffold_1000018"
scaffold_1	JGI	exon	62494	64545	.	-	.	name "fgeneshNG_pg.scaffold_1000019"; transcriptId 61085
scaffold_1	JGI	CDS	62494	64545	.	-	0	name "fgeneshNG_pg.scaffold_1000019"; proteinId 61085; exonNumber 1
scaffold_1	JGI	start_codon	64543	64545	.	-	0	name "fgeneshNG_pg.scaffold_1000019"
scaffold_1	JGI	stop_codon	62494	62496	.	-	0	name "fgeneshNG_pg.scaffold_1000019"
scaffold_1	JGI	exon	65565	65937	.	+	.	name "e_gw1.1.364.1"; transcriptId 29060
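
For reference, here is a minimal Python sketch of one way rows like those above could be regrouped into GFF3. It is not a tested converter for maker2zff or SNAP, and it rests on several assumptions: rows sharing a scaffold and a `name` value form one single-transcript gene; the start_codon/stop_codon rows can be dropped because, in the sample, the CDS rows already span them; the `name` values contain no characters that GFF3 requires to be escaped; and the first CDS segment of every model begins in phase 0. That last assumption may be wrong for partial models that lack a start_codon row. Column 8 of the JGI file does not appear to follow the GFF3 phase convention (the first CDS of fgeneshNG_pg.scaffold_1000001 carries a 2), so the sketch recomputes phase from the CDS lengths. The function names and the ID scheme are made up for illustration.

```python
from collections import OrderedDict


def parse_attributes(field):
    """Turn 'name "x"; transcriptId 61067' into {'name': 'x', 'transcriptId': '61067'}."""
    attrs = {}
    for part in field.split(";"):
        part = part.strip()
        if part:
            key, _, value = part.partition(" ")
            attrs[key] = value.strip().strip('"')
    return attrs


def gff3_row(seqid, source, ftype, start, end, strand, phase, attrs):
    """Format one nine-column GFF3 line (score is left as '.')."""
    return "\t".join(
        [seqid, source, ftype, str(start), str(end), ".", strand, str(phase), attrs]
    )


def jgi_to_gff3(lines):
    """Group JGI exon/CDS rows by (scaffold, name); emit gene > mRNA > exon/CDS."""
    models = OrderedDict()
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] not in ("exon", "CDS"):
            continue  # skips start_codon/stop_codon rows and malformed lines
        seqid, source, ftype, start, end, _score, strand, _frame, attr = cols
        name = parse_attributes(attr).get("name")
        if name is None:
            continue
        model = models.setdefault(
            (seqid, name),
            {"source": source, "strand": strand, "exon": [], "CDS": []},
        )
        model[ftype].append((int(start), int(end)))

    out = ["##gff-version 3"]
    for (seqid, name), m in models.items():
        source, strand = m["source"], m["strand"]
        spans = m["exon"] + m["CDS"]
        lo = min(s for s, _ in spans)
        hi = max(e for _, e in spans)
        gene_id, mrna_id = "gene:" + name, "mRNA:" + name
        out.append(gff3_row(seqid, source, "gene", lo, hi, strand, ".",
                            f"ID={gene_id};Name={name}"))
        out.append(gff3_row(seqid, source, "mRNA", lo, hi, strand, ".",
                            f"ID={mrna_id};Parent={gene_id};Name={name}"))
        for n, (s, e) in enumerate(sorted(m["exon"]), 1):
            out.append(gff3_row(seqid, source, "exon", s, e, strand, ".",
                                f"ID={mrna_id}:exon:{n};Parent={mrna_id}"))

        # Recompute GFF3 phase in transcription order (5' to 3').
        # ASSUMPTION: the first CDS segment starts a codon (phase 0).
        phases, phase = {}, 0
        for s, e in sorted(m["CDS"], reverse=(strand == "-")):
            phases[(s, e)] = phase
            phase = (3 - (e - s + 1 - phase) % 3) % 3
        # All segments of one coding sequence share a single CDS ID.
        for s, e in sorted(m["CDS"]):
            out.append(gff3_row(seqid, source, "CDS", s, e, strand, phases[(s, e)],
                                f"ID=cds:{name};Parent={mrna_id}"))
    return out
```

Passing an open file handle for the JGI file to `jgi_to_gff3` and joining the returned list with newlines would give the GFF3 text. The output should be checked with a GFF3 validator before it goes to maker2zff, since nothing here confirms that maker2zff accepts this particular layout.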


