[Gmod-help] gff3 file conversion from jgi annotation
Scott Cain
scott at scottcain.net
Mon Jun 4 09:58:51 EDT 2012
Hi Alex,
I'm cc'ing your email to the GBrowse mailing list, where we discuss
file format conversions on a regular basis. Your GFF looks very much
like GFF2/GTF to me, and looks very similar to the JGI GFF. Have you tried
this script:
https://github.com/hyphaltip/genome-scripts/blob/master/data_format/gtf2gff3_3level.pl
There is a good chance that it will do the job.
Scott
On Mon, Jun 4, 2012 at 2:25 AM, Alex Greninger
<Alexander.Greninger at ucsf.edu> wrote:
> Dear GMOD Help Desk,
>
> Hi, I've spent the weekend searching online trying to answer this question and this is a last resort; I hope you can help!
>
> I'm trying to annotate a relatively small (50Mb) genome for which I have a good 300X coverage genome assembly and EST/RNA-Seq data. I've been using the MAKER pipeline. There are no good gene models from Augustus or preloaded into SNAP that I can use, so I'm trying to build a gene model HMM with SNAP using a close-cousin reference genome annotation that was done by JGI. This is the only quasi-related organism when it comes to building the gene models for my assembly and my RNA-Seq data is probably not good enough to be able to bootstrap a gene model using iterative rounds of HMM building from the RNA-Seq data, per the "advanced" recommendations in the MAKER tutorial online. So I really have to figure out how to build this gene model HMM from the JGI file.
>
> This requires converting a GFF2/GTF2/JGI-type file to GFF3 or at least a ZFF file. I say GFF2/GTF2/JGI-type file because I can't quite figure out what type it is (it seems like JGI has its own format). Every homebrew script I've found online to solve this deprecated GFF2/GTF2/JGI -> GFF3 parsing problem does not jibe with the SNAP ZFF specification or MAKER's maker2zff script. Before I bite the bullet and try to figure out the correct way to parse from the format I have (GFF2/GTF2/JGI) to GFF3, I figured I should write, since this problem has to have been solved before by others who've built models off of JGI's/fgenesh annotations. I include the first 100 lines of the gene model annotation file from JGI so you know what it looks like.
>
> This is the great democratization of sequencing and DIY annotation, so of course this is the first time I've come across these file formats and I'm a wet-bio person with only enough coding experience to be dangerous to others. I really thank you for your time and for all the great websites and wikis y'all have put up that have kept me from having to email y'all until I ran across this problem.
>
> Many, many thanks for your help!
>
> cheers,
>
> alex
>
> scaffold_1 JGI exon 11332 11910 . - . name "fgeneshNG_pg.scaffold_1000001"; transcriptId 61067
> scaffold_1 JGI CDS 11332 11910 . - 0 name "fgeneshNG_pg.scaffold_1000001"; proteinId 61067; exonNumber 3
> scaffold_1 JGI stop_codon 11332 11334 . - 0 name "fgeneshNG_pg.scaffold_1000001"
> scaffold_1 JGI exon 12067 16940 . - . name "fgeneshNG_pg.scaffold_1000001"; transcriptId 61067
> scaffold_1 JGI CDS 12067 16940 . - 0 name "fgeneshNG_pg.scaffold_1000001"; proteinId 61067; exonNumber 2
> scaffold_1 JGI exon 17467 17506 . - . name "fgeneshNG_pg.scaffold_1000001"; transcriptId 61067
> scaffold_1 JGI CDS 17467 17506 . - 2 name "fgeneshNG_pg.scaffold_1000001"; proteinId 61067; exonNumber 1
> scaffold_1 JGI start_codon 17504 17506 . - 0 name "fgeneshNG_pg.scaffold_1000001"
> scaffold_1 JGI exon 17593 18711 . - . name "estExt_fgeneshNG_kg.C_10001"; transcriptId 82392
> scaffold_1 JGI CDS 17694 18692 . - 0 name "estExt_fgeneshNG_kg.C_10001"; proteinId 82392; exonNumber 1
> scaffold_1 JGI start_codon 18690 18692 . - 0 name "estExt_fgeneshNG_kg.C_10001"
> scaffold_1 JGI stop_codon 17694 17696 . - 0 name "estExt_fgeneshNG_kg.C_10001"
> scaffold_1 JGI exon 20239 20607 . - . name "gw1.1.441.1"; transcriptId 3765
> scaffold_1 JGI CDS 20239 20607 . - 0 name "gw1.1.441.1"; proteinId 3765; exonNumber 1
> scaffold_1 JGI exon 20723 22636 . - . name "fgeneshNG_pg.scaffold_1000004"; transcriptId 61070
> scaffold_1 JGI CDS 20723 22636 . - 0 name "fgeneshNG_pg.scaffold_1000004"; proteinId 61070; exonNumber 1
> scaffold_1 JGI start_codon 22634 22636 . - 0 name "fgeneshNG_pg.scaffold_1000004"
> scaffold_1 JGI stop_codon 20723 20725 . - 0 name "fgeneshNG_pg.scaffold_1000004"
> scaffold_1 JGI exon 26638 26779 . + . name "gw1.1.469.1"; transcriptId 3975
> scaffold_1 JGI CDS 26638 26779 . + 0 name "gw1.1.469.1"; proteinId 3975; exonNumber 1
> scaffold_1 JGI stop_codon 26777 26779 . + 0 name "gw1.1.469.1"
> scaffold_1 JGI exon 26820 27127 . + . name "gw1.1.469.1"; transcriptId 3975
> scaffold_1 JGI CDS 26820 27127 . + 1 name "gw1.1.469.1"; proteinId 3975; exonNumber 2
> scaffold_1 JGI exon 28343 28396 . - . name "fgeneshHS_pm.scaffold_1000002"; transcriptId 44165
> scaffold_1 JGI CDS 28343 28396 . - 0 name "fgeneshHS_pm.scaffold_1000002"; proteinId 44165; exonNumber 4
> scaffold_1 JGI stop_codon 28343 28345 . - 0 name "fgeneshHS_pm.scaffold_1000002"
> scaffold_1 JGI exon 30378 30724 . - . name "fgeneshHS_pm.scaffold_1000002"; transcriptId 44165
> scaffold_1 JGI CDS 30378 30724 . - 0 name "fgeneshHS_pm.scaffold_1000002"; proteinId 44165; exonNumber 3
> scaffold_1 JGI exon 30855 30858 . - . name "fgeneshHS_pm.scaffold_1000002"; transcriptId 44165
> scaffold_1 JGI CDS 30855 30858 . - 2 name "fgeneshHS_pm.scaffold_1000002"; proteinId 44165; exonNumber 2
> scaffold_1 JGI exon 31043 31231 . - . name "fgeneshHS_pm.scaffold_1000002"; transcriptId 44165
> scaffold_1 JGI CDS 31043 31231 . - 0 name "fgeneshHS_pm.scaffold_1000002"; proteinId 44165; exonNumber 1
> scaffold_1 JGI start_codon 31229 31231 . - 0 name "fgeneshHS_pm.scaffold_1000002"
> scaffold_1 JGI exon 31641 32449 . - . name "fgeneshNG_pg.scaffold_1000008"; transcriptId 61074
> scaffold_1 JGI CDS 31641 32449 . - 0 name "fgeneshNG_pg.scaffold_1000008"; proteinId 61074; exonNumber 2
> scaffold_1 JGI stop_codon 31641 31643 . - 0 name "fgeneshNG_pg.scaffold_1000008"
> scaffold_1 JGI exon 32497 32782 . - . name "fgeneshNG_pg.scaffold_1000008"; transcriptId 61074
> scaffold_1 JGI CDS 32497 32782 . - 2 name "fgeneshNG_pg.scaffold_1000008"; proteinId 61074; exonNumber 1
> scaffold_1 JGI start_codon 32780 32782 . - 0 name "fgeneshNG_pg.scaffold_1000008"
> scaffold_1 JGI exon 33022 33607 . + . name "fgeneshNG_pg.scaffold_1000009"; transcriptId 61075
> scaffold_1 JGI CDS 33022 33607 . + 0 name "fgeneshNG_pg.scaffold_1000009"; proteinId 61075; exonNumber 1
> scaffold_1 JGI start_codon 33022 33024 . + 0 name "fgeneshNG_pg.scaffold_1000009"
> scaffold_1 JGI exon 33654 34195 . + . name "fgeneshNG_pg.scaffold_1000009"; transcriptId 61075
> scaffold_1 JGI CDS 33654 34195 . + 1 name "fgeneshNG_pg.scaffold_1000009"; proteinId 61075; exonNumber 2
> scaffold_1 JGI stop_codon 34193 34195 . + 0 name "fgeneshNG_pg.scaffold_1000009"
> scaffold_1 JGI exon 34588 35499 . - . name "fgeneshNG_pg.scaffold_1000010"; transcriptId 61076
> scaffold_1 JGI CDS 34588 35499 . - 0 name "fgeneshNG_pg.scaffold_1000010"; proteinId 61076; exonNumber 1
> scaffold_1 JGI start_codon 35497 35499 . - 0 name "fgeneshNG_pg.scaffold_1000010"
> scaffold_1 JGI stop_codon 34588 34590 . - 0 name "fgeneshNG_pg.scaffold_1000010"
> scaffold_1 JGI exon 35721 36734 . - . name "fgeneshNG_pg.scaffold_1000011"; transcriptId 61077
> scaffold_1 JGI CDS 35721 36734 . - 0 name "fgeneshNG_pg.scaffold_1000011"; proteinId 61077; exonNumber 1
> scaffold_1 JGI start_codon 36732 36734 . - 0 name "fgeneshNG_pg.scaffold_1000011"
> scaffold_1 JGI stop_codon 35721 35723 . - 0 name "fgeneshNG_pg.scaffold_1000011"
> scaffold_1 JGI exon 38130 38556 . + . name "e_gw1.1.292.1"; transcriptId 29098
> scaffold_1 JGI CDS 38130 38556 . + 0 name "e_gw1.1.292.1"; proteinId 29098; exonNumber 1
> scaffold_1 JGI start_codon 38130 38132 . + 0 name "e_gw1.1.292.1"
> scaffold_1 JGI exon 38713 39482 . + . name "e_gw1.1.292.1"; transcriptId 29098
> scaffold_1 JGI CDS 38713 39482 . + 1 name "e_gw1.1.292.1"; proteinId 29098; exonNumber 2
> scaffold_1 JGI stop_codon 39480 39482 . + 0 name "e_gw1.1.292.1"
> scaffold_1 JGI exon 39590 39640 . + . name "fgeneshNG_pg.scaffold_1000013"; transcriptId 61079
> scaffold_1 JGI CDS 39590 39640 . + 0 name "fgeneshNG_pg.scaffold_1000013"; proteinId 61079; exonNumber 1
> scaffold_1 JGI start_codon 39590 39592 . + 0 name "fgeneshNG_pg.scaffold_1000013"
> scaffold_1 JGI exon 39683 40042 . + . name "fgeneshNG_pg.scaffold_1000013"; transcriptId 61079
> scaffold_1 JGI CDS 39683 40042 . + 0 name "fgeneshNG_pg.scaffold_1000013"; proteinId 61079; exonNumber 2
> scaffold_1 JGI exon 40083 42497 . + . name "fgeneshNG_pg.scaffold_1000013"; transcriptId 61079
> scaffold_1 JGI CDS 40083 42497 . + 0 name "fgeneshNG_pg.scaffold_1000013"; proteinId 61079; exonNumber 3
> scaffold_1 JGI stop_codon 42495 42497 . + 0 name "fgeneshNG_pg.scaffold_1000013"
> scaffold_1 JGI exon 46095 46893 . - . name "fgeneshNG_pg.scaffold_1000014"; transcriptId 61080
> scaffold_1 JGI CDS 46095 46893 . - 0 name "fgeneshNG_pg.scaffold_1000014"; proteinId 61080; exonNumber 2
> scaffold_1 JGI stop_codon 46095 46097 . - 0 name "fgeneshNG_pg.scaffold_1000014"
> scaffold_1 JGI exon 46934 49452 . - . name "fgeneshNG_pg.scaffold_1000014"; transcriptId 61080
> scaffold_1 JGI CDS 46934 49452 . - 1 name "fgeneshNG_pg.scaffold_1000014"; proteinId 61080; exonNumber 1
> scaffold_1 JGI start_codon 49450 49452 . - 0 name "fgeneshNG_pg.scaffold_1000014"
> scaffold_1 JGI exon 49763 49873 . + . name "estExt_fgeneshNG_pg.C_10015"; transcriptId 77664
> scaffold_1 JGI CDS 49763 49873 . + 0 name "estExt_fgeneshNG_pg.C_10015"; proteinId 77664; exonNumber 1
> scaffold_1 JGI start_codon 49763 49765 . + 0 name "estExt_fgeneshNG_pg.C_10015"
> scaffold_1 JGI exon 49913 50214 . + . name "estExt_fgeneshNG_pg.C_10015"; transcriptId 77664
> scaffold_1 JGI CDS 49913 50214 . + 0 name "estExt_fgeneshNG_pg.C_10015"; proteinId 77664; exonNumber 2
> scaffold_1 JGI exon 50258 50651 . + . name "estExt_fgeneshNG_pg.C_10015"; transcriptId 77664
> scaffold_1 JGI CDS 50258 50651 . + 2 name "estExt_fgeneshNG_pg.C_10015"; proteinId 77664; exonNumber 3
> scaffold_1 JGI exon 50698 50948 . + . name "estExt_fgeneshNG_pg.C_10015"; transcriptId 77664
> scaffold_1 JGI CDS 50698 50948 . + 0 name "estExt_fgeneshNG_pg.C_10015"; proteinId 77664; exonNumber 4
> scaffold_1 JGI exon 50996 53506 . + . name "estExt_fgeneshNG_pg.C_10015"; transcriptId 77664
> scaffold_1 JGI CDS 50996 53471 . + 2 name "estExt_fgeneshNG_pg.C_10015"; proteinId 77664; exonNumber 5
> scaffold_1 JGI stop_codon 53469 53471 . + 0 name "estExt_fgeneshNG_pg.C_10015"
> scaffold_1 JGI exon 53849 55612 . + . name "fgeneshNG_pg.scaffold_1000016"; transcriptId 61082
> scaffold_1 JGI CDS 53849 55612 . + 0 name "fgeneshNG_pg.scaffold_1000016"; proteinId 61082; exonNumber 1
> scaffold_1 JGI start_codon 53849 53851 . + 0 name "fgeneshNG_pg.scaffold_1000016"
> scaffold_1 JGI stop_codon 55610 55612 . + 0 name "fgeneshNG_pg.scaffold_1000016"
> scaffold_1 JGI exon 57035 58150 . - . name "gw1.1.142.1"; transcriptId 1251
> scaffold_1 JGI CDS 57035 58150 . - 0 name "gw1.1.142.1"; proteinId 1251; exonNumber 1
> scaffold_1 JGI exon 58531 61815 . + . name "fgeneshNG_pg.scaffold_1000018"; transcriptId 61084
> scaffold_1 JGI CDS 58531 61815 . + 0 name "fgeneshNG_pg.scaffold_1000018"; proteinId 61084; exonNumber 1
> scaffold_1 JGI start_codon 58531 58533 . + 0 name "fgeneshNG_pg.scaffold_1000018"
> scaffold_1 JGI stop_codon 61813 61815 . + 0 name "fgeneshNG_pg.scaffold_1000018"
> scaffold_1 JGI exon 62494 64545 . - . name "fgeneshNG_pg.scaffold_1000019"; transcriptId 61085
> scaffold_1 JGI CDS 62494 64545 . - 0 name "fgeneshNG_pg.scaffold_1000019"; proteinId 61085; exonNumber 1
> scaffold_1 JGI start_codon 64543 64545 . - 0 name "fgeneshNG_pg.scaffold_1000019"
> scaffold_1 JGI stop_codon 62494 62496 . - 0 name "fgeneshNG_pg.scaffold_1000019"
> scaffold_1 JGI exon 65565 65937 . + . name "e_gw1.1.364.1"; transcriptId 29060
>
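If the script above does not fit, the conversion itself is mostly a regrouping job. What follows is a minimal Python sketch of it, not the method used by the linked script. It assumes that every row carries a `name` attribute, that each `name` is one gene with exactly one transcript, and that the first CDS segment of each model starts on a codon boundary (which may not hold for partial models). The function names and the `.t1` ID suffix are made up for illustration. Phase is recomputed rather than copied: in the sample, the first CDS segment of fgeneshNG_pg.scaffold_1000001 has 2 in the phase column, which is not what GFF3 expects for a segment that begins with the start codon.

```python
from collections import OrderedDict


def parse_attributes(text):
    """Parse JGI-style 'key value; key "value"' pairs into a dict."""
    attrs = {}
    for part in text.split(";"):
        part = part.strip()
        if part:
            key, _, value = part.partition(" ")
            attrs[key] = value.strip().strip('"')
    return attrs


def cds_phases(segments, strand):
    """GFF3 phase per CDS segment, assuming the first one starts in frame."""
    phases, done = {}, 0
    # Walk the segments in transcription order (descending on the - strand).
    for start, end in sorted(segments, reverse=(strand == "-")):
        phases[(start, end)] = (3 - done % 3) % 3
        done += end - start + 1
    return phases


def row(m, ftype, start, end, phase, attrs):
    """Format one tab-delimited GFF3 line for model m."""
    return "\t".join([m["seqid"], m["source"], ftype, str(start), str(end),
                      ".", m["strand"], phase, attrs])


def jgi_to_gff3(lines):
    """Group exon/CDS rows by model name; return a list of GFF3 lines."""
    models = OrderedDict()
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        # Real files are tab-delimited; splitting on any whitespace also
        # copes with listings where the tabs were turned into spaces.
        fields = line.split(None, 8)
        if len(fields) < 9 or fields[2] not in ("exon", "CDS"):
            continue  # start_codon/stop_codon rows lie inside the CDS rows
        name = parse_attributes(fields[8]).get("name")
        if name is None:
            continue
        m = models.setdefault(name, {"seqid": fields[0], "source": fields[1],
                                     "strand": fields[6], "exon": [], "CDS": []})
        m[fields[2]].append((int(fields[3]), int(fields[4])))

    out = ["##gff-version 3"]
    for name, m in models.items():
        spans = m["exon"] + m["CDS"]
        lo, hi = min(s for s, _ in spans), max(e for _, e in spans)
        mrna = name + ".t1"  # hypothetical ID scheme: one transcript per gene
        out.append(row(m, "gene", lo, hi, ".", f"ID={name};Name={name}"))
        out.append(row(m, "mRNA", lo, hi, ".", f"ID={mrna};Parent={name}"))
        for i, (s, e) in enumerate(sorted(m["exon"]), 1):
            out.append(row(m, "exon", s, e, ".", f"ID={mrna}.exon{i};Parent={mrna}"))
        phases = cds_phases(m["CDS"], m["strand"])
        for s, e in sorted(m["CDS"]):
            # All CDS segments of one transcript share a single ID in GFF3.
            out.append(row(m, "CDS", s, e, str(phases[(s, e)]),
                           f"ID={mrna}.cds;Parent={mrna}"))
    return out


if __name__ == "__main__":
    demo = [
        'scaffold_1 JGI exon 20239 20607 . - . name "gw1.1.441.1"; transcriptId 3765',
        'scaffold_1 JGI CDS 20239 20607 . - 0 name "gw1.1.441.1"; proteinId 3765; exonNumber 1',
    ]
    print("\n".join(jgi_to_gff3(demo)))
```

The resulting gene/mRNA/exon/CDS hierarchy is the three-level layout that GFF3 consumers generally expect. Whether maker2zff accepts it as is has not been checked here, so treat the output as a starting point and validate it before training SNAP.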
--
------------------------------------------------------------------------
Scott Cain, Ph. D. scott at scottcain dot net
GMOD Coordinator (http://gmod.org/) 216-392-3087
Ontario Institute for Cancer Research