GTF file parsing

Hello,

I’m preparing for my bioinformatics exam. I have an exercise to work through, but I’m really stuck.

Background info:
The Gene Transfer Format (GTF) is one of the most frequently used file formats in bioinformatics. It stores genomic coordinates in a tab-separated format. Every line is a single feature with a start and end coordinate. A feature can, for instance, be a gene, transcript or exon.
https://www.ensembl.org/info/website/upload/gff.html

Some lines in GTF format:
chr9 HAVANA gene 32566787 32568619 . + . gene_id "ENSG00000241043.1"; transcript_id "ENSG00000241043.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "GVQW1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "GVQW1"; level 2; havana_gene "OTTHUMG00000019744.1";
chr13 ENSEMBL gene 113755563 113756608 . - . gene_id "ENSG00000268130.1"; transcript_id "ENSG00000268130.1"; gene_type "protein_coding"; gene_status "NOVEL"; gene_name "AL137002.1"; transcript_type "protein_coding"; transcript_status "NOVEL"; transcript_name "AL137002.1"; level 3;
chr10 HAVANA gene 27035522 27150016 . - . gene_id "ENSG00000136754.12"; transcript_id "ENSG00000136754.12"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "ABI1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "ABI1"; level 2; tag "ncRNA_host"; havana_gene "OTTHUMG00000017848.1";

The assignment:
Write a program that counts the number of elements (features, i.e. lines) in the GTF file per chromosome.

Input:
The only input is a file name. The file itself contains coordinates in the GTF format. We don’t have the files, so we can’t look at the data. We have to submit our solution on a platform for students, which then tests it automatically against different inputs, so it’s not a notebook like Jupyter. I don’t know how to work with these files, or how to open them when I don’t have them, since the platform only takes a solution and reports whether it produced an error.

Output:
The number of elements in the GTF file by chromosome. Every line contains a chromosome and the number of elements, separated by a tab. The chromosomes are sorted alphabetically.

So for example:

Input:
gencode.gtf (because I already tried to open the file, I know that it tests against 16 files: gencode_0.gtf through gencode_15.gtf)

Output:
chr1 375
chr10 175
chr11 206
chr12 191
chr13 77
chr14 129
chr15 140
chr16 138
chr17 186
chr18 73
chr19 193
chr2 265
chr20 89
chr21 45
chr22 81
chr3 210
chr4 177
chr5 189
chr6 194
chr7 190
chr8 157
chr9 150
chrM 5
chrX 164
chrY 56

Any help would be greatly appreciated.
Thank you in advance,

KC

I’m all for figuring out how to do things on your own, but in this case, just use Biopython:

https://biopython.org/wiki/GFF_Parsing

Then, reverse engineer how they do it.
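
If the grading platform turns out not to have Biopython installed, the task as described only needs the first column of each line, so a plain-Python sketch may be enough. The version below makes a few assumptions that are not stated in the question: the file name arrives on standard input (it might instead be a command-line argument), lines starting with "#" are header comments to skip, and the function names are only illustrative.

```python
from collections import Counter


def count_features_per_chromosome(path):
    """Count the feature lines per chromosome (first column) in a GTF file."""
    counts = Counter()
    with open(path) as handle:
        for line in handle:
            # Skip blank lines and "#" header/comment lines (assumed convention).
            if not line.strip() or line.startswith("#"):
                continue
            # The chromosome is the first whitespace-delimited field;
            # this works for tab-separated files as well.
            chromosome = line.split(maxsplit=1)[0]
            counts[chromosome] += 1
    return counts


def format_counts(counts):
    """Return 'chromosome<TAB>count' rows, sorted alphabetically by chromosome."""
    return [f"{chrom}\t{n}" for chrom, n in sorted(counts.items())]


def main():
    # Assumption: the platform passes the file name on standard input.
    filename = input().strip()
    for row in format_counts(count_features_per_chromosome(filename)):
        print(row)
```

To use it as a submission, finish the script with a call to main(). A plain string sort puts chr10 before chr2, which is the order shown in the example output, so no natural-sort logic should be needed. You never have to see the test files: the program just opens whatever name it is given.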