[Gmod-help] gff3 file conversion from jgi annotation

Scott Cain scott at scottcain.net
Mon Jun 4 09:58:51 EDT 2012


Hi Alex,

I'm cc'ing your email to the GBrowse mailing list, where we discuss
file format conversions on a regular basis.  Your GFF looks very much
like GFF2/GTF to me, and very similar to the JGI GFF.  Have you tried
this script:

  https://github.com/hyphaltip/genome-scripts/blob/master/data_format/gtf2gff3_3level.pl

There is a good chance it will do the job.
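
As a rough illustration of what such a conversion involves, here is a
minimal Python sketch. It is not the logic of the script linked above.
It assumes that every exon/CDS row carries a name "..." attribute (as in
the sample quoted below) and that one name corresponds to one gene with
a single transcript. The function names and the ".mRNA" ID suffix are
made up for this example. The phase column is copied through unchanged
and may need to be recomputed for strict GFF3 consumers.

```python
from collections import OrderedDict


def parse_attrs(text):
    """Parse JGI-style attributes, e.g. 'name "g1"; transcriptId 61067'."""
    attrs = {}
    for part in text.split(";"):
        key, _, value = part.strip().partition(" ")
        if key:
            attrs[key] = value.strip().strip('"')
    return attrs


def jgi_to_gff3(lines):
    """Group exon/CDS rows by 'name' and return a list of GFF3 lines."""
    models = OrderedDict()  # name -> seqid, source, strand and feature parts
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # The first eight columns contain no whitespace, so this works
        # for tab-delimited files and for space-mangled copies alike.
        cols = line.split(None, 8)
        if len(cols) < 9:
            continue
        seqid, source, ftype, start, end, _score, strand, phase, attr_text = cols
        if ftype not in ("exon", "CDS"):
            continue  # start_codon/stop_codon rows are dropped
        name = parse_attrs(attr_text).get("name")
        if name is None:
            continue
        model = models.setdefault(
            name, {"seqid": seqid, "source": source, "strand": strand, "parts": []}
        )
        model["parts"].append((ftype, int(start), int(end), phase))

    out = ["##gff-version 3"]
    for name, m in models.items():
        lo = min(p[1] for p in m["parts"])
        hi = max(p[2] for p in m["parts"])
        head = [m["seqid"], m["source"]]
        span = [str(lo), str(hi), ".", m["strand"], "."]
        mrna_id = name + ".mRNA"  # hypothetical ID scheme
        out.append("\t".join(head + ["gene"] + span + ["ID=" + name]))
        out.append("\t".join(
            head + ["mRNA"] + span + ["ID=%s;Parent=%s" % (mrna_id, name)]
        ))
        # Sort by start coordinate, listing each exon before its CDS.
        for ftype, start, end, phase in sorted(
            m["parts"], key=lambda p: (p[1], p[0] != "exon")
        ):
            out.append("\t".join(
                head + [ftype, str(start), str(end), ".", m["strand"], phase,
                        "Parent=" + mrna_id]
            ))
    return out
```

To use it, pass an open file handle for the raw JGI file (without the
"> " quoting added by email) to jgi_to_gff3() and print each returned
line. In the sample below the stop codon lies inside the CDS
coordinates, so the CDS rows are carried over as they are. Whether the
result is accepted by maker2zff or SNAP has not been checked here.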

Scott


On Mon, Jun 4, 2012 at 2:25 AM, Alex Greninger
<Alexander.Greninger at ucsf.edu> wrote:
> Dear GMOD Help Desk,
>
> Hi, I've spent the weekend searching online trying to answer this question and this is a last resort; I hope you can help!
>
> I'm trying to annotate a relatively small (50Mb) genome for which I have a good 300X coverage genome assembly and EST/RNA-Seq data.  I've been using the MAKER pipeline.  There are no good gene models from Augustus or preloaded into SNAP that I can use, so I'm trying to build a gene model HMM with SNAP using a close-cousin reference genome annotation that was done by JGI.  This is the only quasi-related organism when it comes to building the gene models for my assembly and my RNA-Seq data is probably not good enough to be able to bootstrap a gene model using iterative rounds of HMM building from the RNA-Seq data, per the "advanced" recommendations in the MAKER tutorial online.  So I really have to figure out how to build this gene model HMM from the JGI file.
>
> This requires converting a GFF2/GTF2/JGI-type file to GFF3 or at least a ZFF file.  I say GFF2/GTF2/JGI-type file because I can't quite figure out what type it is (it seems like JGI has its own format).  Every homebrew script I've found online to solve this deprecated GFF2/GTF2/JGI -> GFF3 parsing problem does not jibe with the SNAP ZFF specification or MAKER's maker2zff script.  Before I bite the bullet and try to figure out the correct way to parse from the format I have (GFF2/GTF2/JGI) to GFF3, I figured I should write since this problem has to have been solved before by others who've built models off of JGI's/fgenesh annotations.  I include the first 100 lines of the gene model annotation file from JGI so you know what it looks like.
>
> This is the great democratization of sequencing and DIY annotation, so of course this is the first time I've come across these file formats and I'm a wet-bio person with only enough coding experience to be dangerous to others.  I really thank you for your time and for all the great websites and wikis y'all have put up that have kept me from having to email y'all until I ran across this problem.
>
> Many, many thanks for your help!
>
> cheers,
>
> alex
>
> scaffold_1      JGI     exon    11332   11910   .       -       .       name "fgeneshNG_pg.scaffold_1000001"; transcriptId 61067
> scaffold_1      JGI     CDS     11332   11910   .       -       0       name "fgeneshNG_pg.scaffold_1000001"; proteinId 61067; exonNumber 3
> scaffold_1      JGI     stop_codon      11332   11334   .       -       0       name "fgeneshNG_pg.scaffold_1000001"
> scaffold_1      JGI     exon    12067   16940   .       -       .       name "fgeneshNG_pg.scaffold_1000001"; transcriptId 61067
> scaffold_1      JGI     CDS     12067   16940   .       -       0       name "fgeneshNG_pg.scaffold_1000001"; proteinId 61067; exonNumber 2
> scaffold_1      JGI     exon    17467   17506   .       -       .       name "fgeneshNG_pg.scaffold_1000001"; transcriptId 61067
> scaffold_1      JGI     CDS     17467   17506   .       -       2       name "fgeneshNG_pg.scaffold_1000001"; proteinId 61067; exonNumber 1
> scaffold_1      JGI     start_codon     17504   17506   .       -       0       name "fgeneshNG_pg.scaffold_1000001"
> scaffold_1      JGI     exon    17593   18711   .       -       .       name "estExt_fgeneshNG_kg.C_10001"; transcriptId 82392
> scaffold_1      JGI     CDS     17694   18692   .       -       0       name "estExt_fgeneshNG_kg.C_10001"; proteinId 82392; exonNumber 1
> scaffold_1      JGI     start_codon     18690   18692   .       -       0       name "estExt_fgeneshNG_kg.C_10001"
> scaffold_1      JGI     stop_codon      17694   17696   .       -       0       name "estExt_fgeneshNG_kg.C_10001"
> scaffold_1      JGI     exon    20239   20607   .       -       .       name "gw1.1.441.1"; transcriptId 3765
> scaffold_1      JGI     CDS     20239   20607   .       -       0       name "gw1.1.441.1"; proteinId 3765; exonNumber 1
> scaffold_1      JGI     exon    20723   22636   .       -       .       name "fgeneshNG_pg.scaffold_1000004"; transcriptId 61070
> scaffold_1      JGI     CDS     20723   22636   .       -       0       name "fgeneshNG_pg.scaffold_1000004"; proteinId 61070; exonNumber 1
> scaffold_1      JGI     start_codon     22634   22636   .       -       0       name "fgeneshNG_pg.scaffold_1000004"
> scaffold_1      JGI     stop_codon      20723   20725   .       -       0       name "fgeneshNG_pg.scaffold_1000004"
> scaffold_1      JGI     exon    26638   26779   .       +       .       name "gw1.1.469.1"; transcriptId 3975
> scaffold_1      JGI     CDS     26638   26779   .       +       0       name "gw1.1.469.1"; proteinId 3975; exonNumber 1
> scaffold_1      JGI     stop_codon      26777   26779   .       +       0       name "gw1.1.469.1"
> scaffold_1      JGI     exon    26820   27127   .       +       .       name "gw1.1.469.1"; transcriptId 3975
> scaffold_1      JGI     CDS     26820   27127   .       +       1       name "gw1.1.469.1"; proteinId 3975; exonNumber 2
> scaffold_1      JGI     exon    28343   28396   .       -       .       name "fgeneshHS_pm.scaffold_1000002"; transcriptId 44165
> scaffold_1      JGI     CDS     28343   28396   .       -       0       name "fgeneshHS_pm.scaffold_1000002"; proteinId 44165; exonNumber 4
> scaffold_1      JGI     stop_codon      28343   28345   .       -       0       name "fgeneshHS_pm.scaffold_1000002"
> scaffold_1      JGI     exon    30378   30724   .       -       .       name "fgeneshHS_pm.scaffold_1000002"; transcriptId 44165
> scaffold_1      JGI     CDS     30378   30724   .       -       0       name "fgeneshHS_pm.scaffold_1000002"; proteinId 44165; exonNumber 3
> scaffold_1      JGI     exon    30855   30858   .       -       .       name "fgeneshHS_pm.scaffold_1000002"; transcriptId 44165
> scaffold_1      JGI     CDS     30855   30858   .       -       2       name "fgeneshHS_pm.scaffold_1000002"; proteinId 44165; exonNumber 2
> scaffold_1      JGI     exon    31043   31231   .       -       .       name "fgeneshHS_pm.scaffold_1000002"; transcriptId 44165
> scaffold_1      JGI     CDS     31043   31231   .       -       0       name "fgeneshHS_pm.scaffold_1000002"; proteinId 44165; exonNumber 1
> scaffold_1      JGI     start_codon     31229   31231   .       -       0       name "fgeneshHS_pm.scaffold_1000002"
> scaffold_1      JGI     exon    31641   32449   .       -       .       name "fgeneshNG_pg.scaffold_1000008"; transcriptId 61074
> scaffold_1      JGI     CDS     31641   32449   .       -       0       name "fgeneshNG_pg.scaffold_1000008"; proteinId 61074; exonNumber 2
> scaffold_1      JGI     stop_codon      31641   31643   .       -       0       name "fgeneshNG_pg.scaffold_1000008"
> scaffold_1      JGI     exon    32497   32782   .       -       .       name "fgeneshNG_pg.scaffold_1000008"; transcriptId 61074
> scaffold_1      JGI     CDS     32497   32782   .       -       2       name "fgeneshNG_pg.scaffold_1000008"; proteinId 61074; exonNumber 1
> scaffold_1      JGI     start_codon     32780   32782   .       -       0       name "fgeneshNG_pg.scaffold_1000008"
> scaffold_1      JGI     exon    33022   33607   .       +       .       name "fgeneshNG_pg.scaffold_1000009"; transcriptId 61075
> scaffold_1      JGI     CDS     33022   33607   .       +       0       name "fgeneshNG_pg.scaffold_1000009"; proteinId 61075; exonNumber 1
> scaffold_1      JGI     start_codon     33022   33024   .       +       0       name "fgeneshNG_pg.scaffold_1000009"
> scaffold_1      JGI     exon    33654   34195   .       +       .       name "fgeneshNG_pg.scaffold_1000009"; transcriptId 61075
> scaffold_1      JGI     CDS     33654   34195   .       +       1       name "fgeneshNG_pg.scaffold_1000009"; proteinId 61075; exonNumber 2
> scaffold_1      JGI     stop_codon      34193   34195   .       +       0       name "fgeneshNG_pg.scaffold_1000009"
> scaffold_1      JGI     exon    34588   35499   .       -       .       name "fgeneshNG_pg.scaffold_1000010"; transcriptId 61076
> scaffold_1      JGI     CDS     34588   35499   .       -       0       name "fgeneshNG_pg.scaffold_1000010"; proteinId 61076; exonNumber 1
> scaffold_1      JGI     start_codon     35497   35499   .       -       0       name "fgeneshNG_pg.scaffold_1000010"
> scaffold_1      JGI     stop_codon      34588   34590   .       -       0       name "fgeneshNG_pg.scaffold_1000010"
> scaffold_1      JGI     exon    35721   36734   .       -       .       name "fgeneshNG_pg.scaffold_1000011"; transcriptId 61077
> scaffold_1      JGI     CDS     35721   36734   .       -       0       name "fgeneshNG_pg.scaffold_1000011"; proteinId 61077; exonNumber 1
> scaffold_1      JGI     start_codon     36732   36734   .       -       0       name "fgeneshNG_pg.scaffold_1000011"
> scaffold_1      JGI     stop_codon      35721   35723   .       -       0       name "fgeneshNG_pg.scaffold_1000011"
> scaffold_1      JGI     exon    38130   38556   .       +       .       name "e_gw1.1.292.1"; transcriptId 29098
> scaffold_1      JGI     CDS     38130   38556   .       +       0       name "e_gw1.1.292.1"; proteinId 29098; exonNumber 1
> scaffold_1      JGI     start_codon     38130   38132   .       +       0       name "e_gw1.1.292.1"
> scaffold_1      JGI     exon    38713   39482   .       +       .       name "e_gw1.1.292.1"; transcriptId 29098
> scaffold_1      JGI     CDS     38713   39482   .       +       1       name "e_gw1.1.292.1"; proteinId 29098; exonNumber 2
> scaffold_1      JGI     stop_codon      39480   39482   .       +       0       name "e_gw1.1.292.1"
> scaffold_1      JGI     exon    39590   39640   .       +       .       name "fgeneshNG_pg.scaffold_1000013"; transcriptId 61079
> scaffold_1      JGI     CDS     39590   39640   .       +       0       name "fgeneshNG_pg.scaffold_1000013"; proteinId 61079; exonNumber 1
> scaffold_1      JGI     start_codon     39590   39592   .       +       0       name "fgeneshNG_pg.scaffold_1000013"
> scaffold_1      JGI     exon    39683   40042   .       +       .       name "fgeneshNG_pg.scaffold_1000013"; transcriptId 61079
> scaffold_1      JGI     CDS     39683   40042   .       +       0       name "fgeneshNG_pg.scaffold_1000013"; proteinId 61079; exonNumber 2
> scaffold_1      JGI     exon    40083   42497   .       +       .       name "fgeneshNG_pg.scaffold_1000013"; transcriptId 61079
> scaffold_1      JGI     CDS     40083   42497   .       +       0       name "fgeneshNG_pg.scaffold_1000013"; proteinId 61079; exonNumber 3
> scaffold_1      JGI     stop_codon      42495   42497   .       +       0       name "fgeneshNG_pg.scaffold_1000013"
> scaffold_1      JGI     exon    46095   46893   .       -       .       name "fgeneshNG_pg.scaffold_1000014"; transcriptId 61080
> scaffold_1      JGI     CDS     46095   46893   .       -       0       name "fgeneshNG_pg.scaffold_1000014"; proteinId 61080; exonNumber 2
> scaffold_1      JGI     stop_codon      46095   46097   .       -       0       name "fgeneshNG_pg.scaffold_1000014"
> scaffold_1      JGI     exon    46934   49452   .       -       .       name "fgeneshNG_pg.scaffold_1000014"; transcriptId 61080
> scaffold_1      JGI     CDS     46934   49452   .       -       1       name "fgeneshNG_pg.scaffold_1000014"; proteinId 61080; exonNumber 1
> scaffold_1      JGI     start_codon     49450   49452   .       -       0       name "fgeneshNG_pg.scaffold_1000014"
> scaffold_1      JGI     exon    49763   49873   .       +       .       name "estExt_fgeneshNG_pg.C_10015"; transcriptId 77664
> scaffold_1      JGI     CDS     49763   49873   .       +       0       name "estExt_fgeneshNG_pg.C_10015"; proteinId 77664; exonNumber 1
> scaffold_1      JGI     start_codon     49763   49765   .       +       0       name "estExt_fgeneshNG_pg.C_10015"
> scaffold_1      JGI     exon    49913   50214   .       +       .       name "estExt_fgeneshNG_pg.C_10015"; transcriptId 77664
> scaffold_1      JGI     CDS     49913   50214   .       +       0       name "estExt_fgeneshNG_pg.C_10015"; proteinId 77664; exonNumber 2
> scaffold_1      JGI     exon    50258   50651   .       +       .       name "estExt_fgeneshNG_pg.C_10015"; transcriptId 77664
> scaffold_1      JGI     CDS     50258   50651   .       +       2       name "estExt_fgeneshNG_pg.C_10015"; proteinId 77664; exonNumber 3
> scaffold_1      JGI     exon    50698   50948   .       +       .       name "estExt_fgeneshNG_pg.C_10015"; transcriptId 77664
> scaffold_1      JGI     CDS     50698   50948   .       +       0       name "estExt_fgeneshNG_pg.C_10015"; proteinId 77664; exonNumber 4
> scaffold_1      JGI     exon    50996   53506   .       +       .       name "estExt_fgeneshNG_pg.C_10015"; transcriptId 77664
> scaffold_1      JGI     CDS     50996   53471   .       +       2       name "estExt_fgeneshNG_pg.C_10015"; proteinId 77664; exonNumber 5
> scaffold_1      JGI     stop_codon      53469   53471   .       +       0       name "estExt_fgeneshNG_pg.C_10015"
> scaffold_1      JGI     exon    53849   55612   .       +       .       name "fgeneshNG_pg.scaffold_1000016"; transcriptId 61082
> scaffold_1      JGI     CDS     53849   55612   .       +       0       name "fgeneshNG_pg.scaffold_1000016"; proteinId 61082; exonNumber 1
> scaffold_1      JGI     start_codon     53849   53851   .       +       0       name "fgeneshNG_pg.scaffold_1000016"
> scaffold_1      JGI     stop_codon      55610   55612   .       +       0       name "fgeneshNG_pg.scaffold_1000016"
> scaffold_1      JGI     exon    57035   58150   .       -       .       name "gw1.1.142.1"; transcriptId 1251
> scaffold_1      JGI     CDS     57035   58150   .       -       0       name "gw1.1.142.1"; proteinId 1251; exonNumber 1
> scaffold_1      JGI     exon    58531   61815   .       +       .       name "fgeneshNG_pg.scaffold_1000018"; transcriptId 61084
> scaffold_1      JGI     CDS     58531   61815   .       +       0       name "fgeneshNG_pg.scaffold_1000018"; proteinId 61084; exonNumber 1
> scaffold_1      JGI     start_codon     58531   58533   .       +       0       name "fgeneshNG_pg.scaffold_1000018"
> scaffold_1      JGI     stop_codon      61813   61815   .       +       0       name "fgeneshNG_pg.scaffold_1000018"
> scaffold_1      JGI     exon    62494   64545   .       -       .       name "fgeneshNG_pg.scaffold_1000019"; transcriptId 61085
> scaffold_1      JGI     CDS     62494   64545   .       -       0       name "fgeneshNG_pg.scaffold_1000019"; proteinId 61085; exonNumber 1
> scaffold_1      JGI     start_codon     64543   64545   .       -       0       name "fgeneshNG_pg.scaffold_1000019"
> scaffold_1      JGI     stop_codon      62494   62496   .       -       0       name "fgeneshNG_pg.scaffold_1000019"
> scaffold_1      JGI     exon    65565   65937   .       +       .       name "e_gw1.1.364.1"; transcriptId 29060
>



-- 
------------------------------------------------------------------------
Scott Cain, Ph. D.                                   scott at scottcain dot net
GMOD Coordinator (http://gmod.org/)                     216-392-3087
Ontario Institute for Cancer Research


