[Gmod-help] gff3 file conversion from jgi annotation
Alex Greninger
Alexander.Greninger at ucsf.edu
Mon Jun 4 02:25:30 EDT 2012
Dear GMOD Help Desk,
Hi, I've spent the weekend searching online trying to answer this question and this is a last resort; I hope you can help!
I'm trying to annotate a relatively small (50 Mb) genome for which I have a good 300X-coverage assembly plus EST/RNA-Seq data, and I've been using the MAKER pipeline. Neither the Augustus gene models nor the ones preloaded into SNAP are a good fit, so I'm trying to train a SNAP gene model HMM from the annotation of a closely related reference genome that was done by JGI. That annotation is the only one from a quasi-related organism, and my RNA-Seq data is probably not good enough to bootstrap a gene model through iterative rounds of HMM building, per the "advanced" recommendations in the online MAKER tutorial. So I really have to figure out how to build this gene model HMM from the JGI file.
This requires converting a GFF2/GTF2/JGI-type file to GFF3, or at least to a ZFF file. I say GFF2/GTF2/JGI-type because I can't quite figure out which format it is (it seems like JGI has its own). Every homebrew script I've found online for this GFF2/GTF2/JGI -> GFF3 conversion produces output that does not jibe with the SNAP ZFF specification or with MAKER's maker2zff script. Before I bite the bullet and work out the correct way to parse the format I have into GFF3, I figured I should write, since this problem must have been solved already by others who've built models from JGI/fgenesh annotations. I include the first 100 lines of the gene model annotation file from JGI so you know what it looks like.
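One way to pin down what kind of file this is would be to tabulate which attribute keys each feature type carries in the ninth column. The sketch below is only an illustration: the function names are hypothetical, and it assumes nine whitespace- or tab-separated columns with `key value` pairs separated by `;`, as in the sample lines included with this message. It would mis-split a quoted value that itself contains a semicolon.

```python
from collections import defaultdict


def parse_attributes(text):
    """Parse 'key value; key "value"' pairs from the ninth column into a dict."""
    attrs = {}
    for chunk in text.split(";"):
        key, _, value = chunk.strip().partition(" ")
        if key:
            # Drop surrounding whitespace and the double quotes JGI puts on names.
            attrs[key] = value.strip().strip('"')
    return attrs


def attribute_keys_by_type(lines):
    """Map each feature type (exon, CDS, ...) to the set of attribute keys seen."""
    keys = defaultdict(set)
    for line in lines:
        # Split on any whitespace, keeping the whole attribute column intact.
        fields = line.split(None, 8)
        if len(fields) == 9:
            keys[fields[2]].update(parse_attributes(fields[8]))
    return dict(keys)
```

On the sample, exon rows carry `name` and `transcriptId`, CDS rows carry `name`, `proteinId` and `exonNumber`, and the start_codon/stop_codon rows carry only `name`, so `name` looks like the one key that ties all rows of a gene model together.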
This is the great democratization of sequencing and DIY annotation, so of course this is the first time I've come across these file formats and I'm a wet-bio person with only enough coding experience to be dangerous to others. I really thank you for your time and for all the great websites and wikis y'all have put up that have kept me from having to email y'all until I ran across this problem.
Many, many thanks for your help!
cheers,
alex
scaffold_1 JGI exon 11332 11910 . - . name "fgeneshNG_pg.scaffold_1000001"; transcriptId 61067
scaffold_1 JGI CDS 11332 11910 . - 0 name "fgeneshNG_pg.scaffold_1000001"; proteinId 61067; exonNumber 3
scaffold_1 JGI stop_codon 11332 11334 . - 0 name "fgeneshNG_pg.scaffold_1000001"
scaffold_1 JGI exon 12067 16940 . - . name "fgeneshNG_pg.scaffold_1000001"; transcriptId 61067
scaffold_1 JGI CDS 12067 16940 . - 0 name "fgeneshNG_pg.scaffold_1000001"; proteinId 61067; exonNumber 2
scaffold_1 JGI exon 17467 17506 . - . name "fgeneshNG_pg.scaffold_1000001"; transcriptId 61067
scaffold_1 JGI CDS 17467 17506 . - 2 name "fgeneshNG_pg.scaffold_1000001"; proteinId 61067; exonNumber 1
scaffold_1 JGI start_codon 17504 17506 . - 0 name "fgeneshNG_pg.scaffold_1000001"
scaffold_1 JGI exon 17593 18711 . - . name "estExt_fgeneshNG_kg.C_10001"; transcriptId 82392
scaffold_1 JGI CDS 17694 18692 . - 0 name "estExt_fgeneshNG_kg.C_10001"; proteinId 82392; exonNumber 1
scaffold_1 JGI start_codon 18690 18692 . - 0 name "estExt_fgeneshNG_kg.C_10001"
scaffold_1 JGI stop_codon 17694 17696 . - 0 name "estExt_fgeneshNG_kg.C_10001"
scaffold_1 JGI exon 20239 20607 . - . name "gw1.1.441.1"; transcriptId 3765
scaffold_1 JGI CDS 20239 20607 . - 0 name "gw1.1.441.1"; proteinId 3765; exonNumber 1
scaffold_1 JGI exon 20723 22636 . - . name "fgeneshNG_pg.scaffold_1000004"; transcriptId 61070
scaffold_1 JGI CDS 20723 22636 . - 0 name "fgeneshNG_pg.scaffold_1000004"; proteinId 61070; exonNumber 1
scaffold_1 JGI start_codon 22634 22636 . - 0 name "fgeneshNG_pg.scaffold_1000004"
scaffold_1 JGI stop_codon 20723 20725 . - 0 name "fgeneshNG_pg.scaffold_1000004"
scaffold_1 JGI exon 26638 26779 . + . name "gw1.1.469.1"; transcriptId 3975
scaffold_1 JGI CDS 26638 26779 . + 0 name "gw1.1.469.1"; proteinId 3975; exonNumber 1
scaffold_1 JGI stop_codon 26777 26779 . + 0 name "gw1.1.469.1"
scaffold_1 JGI exon 26820 27127 . + . name "gw1.1.469.1"; transcriptId 3975
scaffold_1 JGI CDS 26820 27127 . + 1 name "gw1.1.469.1"; proteinId 3975; exonNumber 2
scaffold_1 JGI exon 28343 28396 . - . name "fgeneshHS_pm.scaffold_1000002"; transcriptId 44165
scaffold_1 JGI CDS 28343 28396 . - 0 name "fgeneshHS_pm.scaffold_1000002"; proteinId 44165; exonNumber 4
scaffold_1 JGI stop_codon 28343 28345 . - 0 name "fgeneshHS_pm.scaffold_1000002"
scaffold_1 JGI exon 30378 30724 . - . name "fgeneshHS_pm.scaffold_1000002"; transcriptId 44165
scaffold_1 JGI CDS 30378 30724 . - 0 name "fgeneshHS_pm.scaffold_1000002"; proteinId 44165; exonNumber 3
scaffold_1 JGI exon 30855 30858 . - . name "fgeneshHS_pm.scaffold_1000002"; transcriptId 44165
scaffold_1 JGI CDS 30855 30858 . - 2 name "fgeneshHS_pm.scaffold_1000002"; proteinId 44165; exonNumber 2
scaffold_1 JGI exon 31043 31231 . - . name "fgeneshHS_pm.scaffold_1000002"; transcriptId 44165
scaffold_1 JGI CDS 31043 31231 . - 0 name "fgeneshHS_pm.scaffold_1000002"; proteinId 44165; exonNumber 1
scaffold_1 JGI start_codon 31229 31231 . - 0 name "fgeneshHS_pm.scaffold_1000002"
scaffold_1 JGI exon 31641 32449 . - . name "fgeneshNG_pg.scaffold_1000008"; transcriptId 61074
scaffold_1 JGI CDS 31641 32449 . - 0 name "fgeneshNG_pg.scaffold_1000008"; proteinId 61074; exonNumber 2
scaffold_1 JGI stop_codon 31641 31643 . - 0 name "fgeneshNG_pg.scaffold_1000008"
scaffold_1 JGI exon 32497 32782 . - . name "fgeneshNG_pg.scaffold_1000008"; transcriptId 61074
scaffold_1 JGI CDS 32497 32782 . - 2 name "fgeneshNG_pg.scaffold_1000008"; proteinId 61074; exonNumber 1
scaffold_1 JGI start_codon 32780 32782 . - 0 name "fgeneshNG_pg.scaffold_1000008"
scaffold_1 JGI exon 33022 33607 . + . name "fgeneshNG_pg.scaffold_1000009"; transcriptId 61075
scaffold_1 JGI CDS 33022 33607 . + 0 name "fgeneshNG_pg.scaffold_1000009"; proteinId 61075; exonNumber 1
scaffold_1 JGI start_codon 33022 33024 . + 0 name "fgeneshNG_pg.scaffold_1000009"
scaffold_1 JGI exon 33654 34195 . + . name "fgeneshNG_pg.scaffold_1000009"; transcriptId 61075
scaffold_1 JGI CDS 33654 34195 . + 1 name "fgeneshNG_pg.scaffold_1000009"; proteinId 61075; exonNumber 2
scaffold_1 JGI stop_codon 34193 34195 . + 0 name "fgeneshNG_pg.scaffold_1000009"
scaffold_1 JGI exon 34588 35499 . - . name "fgeneshNG_pg.scaffold_1000010"; transcriptId 61076
scaffold_1 JGI CDS 34588 35499 . - 0 name "fgeneshNG_pg.scaffold_1000010"; proteinId 61076; exonNumber 1
scaffold_1 JGI start_codon 35497 35499 . - 0 name "fgeneshNG_pg.scaffold_1000010"
scaffold_1 JGI stop_codon 34588 34590 . - 0 name "fgeneshNG_pg.scaffold_1000010"
scaffold_1 JGI exon 35721 36734 . - . name "fgeneshNG_pg.scaffold_1000011"; transcriptId 61077
scaffold_1 JGI CDS 35721 36734 . - 0 name "fgeneshNG_pg.scaffold_1000011"; proteinId 61077; exonNumber 1
scaffold_1 JGI start_codon 36732 36734 . - 0 name "fgeneshNG_pg.scaffold_1000011"
scaffold_1 JGI stop_codon 35721 35723 . - 0 name "fgeneshNG_pg.scaffold_1000011"
scaffold_1 JGI exon 38130 38556 . + . name "e_gw1.1.292.1"; transcriptId 29098
scaffold_1 JGI CDS 38130 38556 . + 0 name "e_gw1.1.292.1"; proteinId 29098; exonNumber 1
scaffold_1 JGI start_codon 38130 38132 . + 0 name "e_gw1.1.292.1"
scaffold_1 JGI exon 38713 39482 . + . name "e_gw1.1.292.1"; transcriptId 29098
scaffold_1 JGI CDS 38713 39482 . + 1 name "e_gw1.1.292.1"; proteinId 29098; exonNumber 2
scaffold_1 JGI stop_codon 39480 39482 . + 0 name "e_gw1.1.292.1"
scaffold_1 JGI exon 39590 39640 . + . name "fgeneshNG_pg.scaffold_1000013"; transcriptId 61079
scaffold_1 JGI CDS 39590 39640 . + 0 name "fgeneshNG_pg.scaffold_1000013"; proteinId 61079; exonNumber 1
scaffold_1 JGI start_codon 39590 39592 . + 0 name "fgeneshNG_pg.scaffold_1000013"
scaffold_1 JGI exon 39683 40042 . + . name "fgeneshNG_pg.scaffold_1000013"; transcriptId 61079
scaffold_1 JGI CDS 39683 40042 . + 0 name "fgeneshNG_pg.scaffold_1000013"; proteinId 61079; exonNumber 2
scaffold_1 JGI exon 40083 42497 . + . name "fgeneshNG_pg.scaffold_1000013"; transcriptId 61079
scaffold_1 JGI CDS 40083 42497 . + 0 name "fgeneshNG_pg.scaffold_1000013"; proteinId 61079; exonNumber 3
scaffold_1 JGI stop_codon 42495 42497 . + 0 name "fgeneshNG_pg.scaffold_1000013"
scaffold_1 JGI exon 46095 46893 . - . name "fgeneshNG_pg.scaffold_1000014"; transcriptId 61080
scaffold_1 JGI CDS 46095 46893 . - 0 name "fgeneshNG_pg.scaffold_1000014"; proteinId 61080; exonNumber 2
scaffold_1 JGI stop_codon 46095 46097 . - 0 name "fgeneshNG_pg.scaffold_1000014"
scaffold_1 JGI exon 46934 49452 . - . name "fgeneshNG_pg.scaffold_1000014"; transcriptId 61080
scaffold_1 JGI CDS 46934 49452 . - 1 name "fgeneshNG_pg.scaffold_1000014"; proteinId 61080; exonNumber 1
scaffold_1 JGI start_codon 49450 49452 . - 0 name "fgeneshNG_pg.scaffold_1000014"
scaffold_1 JGI exon 49763 49873 . + . name "estExt_fgeneshNG_pg.C_10015"; transcriptId 77664
scaffold_1 JGI CDS 49763 49873 . + 0 name "estExt_fgeneshNG_pg.C_10015"; proteinId 77664; exonNumber 1
scaffold_1 JGI start_codon 49763 49765 . + 0 name "estExt_fgeneshNG_pg.C_10015"
scaffold_1 JGI exon 49913 50214 . + . name "estExt_fgeneshNG_pg.C_10015"; transcriptId 77664
scaffold_1 JGI CDS 49913 50214 . + 0 name "estExt_fgeneshNG_pg.C_10015"; proteinId 77664; exonNumber 2
scaffold_1 JGI exon 50258 50651 . + . name "estExt_fgeneshNG_pg.C_10015"; transcriptId 77664
scaffold_1 JGI CDS 50258 50651 . + 2 name "estExt_fgeneshNG_pg.C_10015"; proteinId 77664; exonNumber 3
scaffold_1 JGI exon 50698 50948 . + . name "estExt_fgeneshNG_pg.C_10015"; transcriptId 77664
scaffold_1 JGI CDS 50698 50948 . + 0 name "estExt_fgeneshNG_pg.C_10015"; proteinId 77664; exonNumber 4
scaffold_1 JGI exon 50996 53506 . + . name "estExt_fgeneshNG_pg.C_10015"; transcriptId 77664
scaffold_1 JGI CDS 50996 53471 . + 2 name "estExt_fgeneshNG_pg.C_10015"; proteinId 77664; exonNumber 5
scaffold_1 JGI stop_codon 53469 53471 . + 0 name "estExt_fgeneshNG_pg.C_10015"
scaffold_1 JGI exon 53849 55612 . + . name "fgeneshNG_pg.scaffold_1000016"; transcriptId 61082
scaffold_1 JGI CDS 53849 55612 . + 0 name "fgeneshNG_pg.scaffold_1000016"; proteinId 61082; exonNumber 1
scaffold_1 JGI start_codon 53849 53851 . + 0 name "fgeneshNG_pg.scaffold_1000016"
scaffold_1 JGI stop_codon 55610 55612 . + 0 name "fgeneshNG_pg.scaffold_1000016"
scaffold_1 JGI exon 57035 58150 . - . name "gw1.1.142.1"; transcriptId 1251
scaffold_1 JGI CDS 57035 58150 . - 0 name "gw1.1.142.1"; proteinId 1251; exonNumber 1
scaffold_1 JGI exon 58531 61815 . + . name "fgeneshNG_pg.scaffold_1000018"; transcriptId 61084
scaffold_1 JGI CDS 58531 61815 . + 0 name "fgeneshNG_pg.scaffold_1000018"; proteinId 61084; exonNumber 1
scaffold_1 JGI start_codon 58531 58533 . + 0 name "fgeneshNG_pg.scaffold_1000018"
scaffold_1 JGI stop_codon 61813 61815 . + 0 name "fgeneshNG_pg.scaffold_1000018"
scaffold_1 JGI exon 62494 64545 . - . name "fgeneshNG_pg.scaffold_1000019"; transcriptId 61085
scaffold_1 JGI CDS 62494 64545 . - 0 name "fgeneshNG_pg.scaffold_1000019"; proteinId 61085; exonNumber 1
scaffold_1 JGI start_codon 64543 64545 . - 0 name "fgeneshNG_pg.scaffold_1000019"
scaffold_1 JGI stop_codon 62494 62496 . - 0 name "fgeneshNG_pg.scaffold_1000019"
scaffold_1 JGI exon 65565 65937 . + . name "e_gw1.1.364.1"; transcriptId 29060
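For illustration, records like the ones above could be regrouped into a GFF3 gene/mRNA/exon/CDS hierarchy roughly as follows. This is a minimal sketch under several assumptions, not a tested converter: the function name `jgi_to_gff3` and the `NAME.mRNA` / `NAME.cds` ID scheme are made up here, rows are grouped by scaffold plus the quoted `name` value, and names are assumed to contain no characters that GFF3 requires to be escaped. The phase column in the sample does not appear to follow the GFF3 convention (the CDS segment that contains the first model's start codon is marked 2), so the sketch recomputes phase on the assumption that each model's first CDS segment starts in frame; that assumption may not hold for partial models that lack a start codon. Whether maker2zff or SNAP accepts the result has not been checked.

```python
import re


def gff3_row(seqid, source, ftype, start, end, strand, phase, attrs):
    """Format one tab-separated GFF3 line (score is left as '.')."""
    return "\t".join([seqid, source, ftype, str(start), str(end),
                      ".", strand, phase, attrs])


def jgi_to_gff3(lines):
    """Group JGI exon/CDS rows by (scaffold, name) and emit GFF3 lines."""
    models = {}  # (seqid, name) -> list of (type, start, end, strand, source)
    for line in lines:
        fields = line.split(None, 8)
        if len(fields) < 9 or fields[2] not in ("exon", "CDS"):
            continue  # skips blanks, headers and start_codon/stop_codon rows
        seqid, source, ftype, start, end, _score, strand, _phase, attrs = fields
        match = re.search(r'name "([^"]+)"', attrs)
        if match is None:
            continue
        models.setdefault((seqid, match.group(1)), []).append(
            (ftype, int(start), int(end), strand, source))

    out = ["##gff-version 3"]
    for (seqid, name), feats in models.items():
        strand, source = feats[0][3], feats[0][4]
        lo = min(f[1] for f in feats)
        hi = max(f[2] for f in feats)
        out.append(gff3_row(seqid, source, "gene", lo, hi, strand, ".",
                            f"ID={name};Name={name}"))
        out.append(gff3_row(seqid, source, "mRNA", lo, hi, strand, ".",
                            f"ID={name}.mRNA;Parent={name}"))
        # Walk features in transcription order: 5' to 3' on the model's strand.
        ordered = sorted(feats, key=lambda f: f[1], reverse=(strand == "-"))
        cds_len = 0
        for ftype, start, end, _strand, _source in ordered:
            if ftype == "exon":
                out.append(gff3_row(seqid, source, "exon", start, end, strand,
                                    ".", f"Parent={name}.mRNA"))
            else:
                # GFF3 phase: bases to skip to reach the next full codon.
                phase = (3 - cds_len % 3) % 3
                cds_len += end - start + 1
                out.append(gff3_row(seqid, source, "CDS", start, end, strand,
                                    str(phase),
                                    f"ID={name}.cds;Parent={name}.mRNA"))
    return out
```

Calling `"\n".join(jgi_to_gff3(open(path)))` on the annotation file, where `path` is a placeholder for wherever the file is stored, would give the GFF3 text. The start_codon and stop_codon rows are dropped because the CDS coordinates in the sample already include them.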