[BioC] how to get the exon count by using htseq-count
Hu Fuyan [guest]
guest at bioconductor.org
Tue Mar 19 10:45:25 CET 2013
Dear teacher,
How can I get per-exon read counts with htseq-count?
My command is:
python -m HTSeq.scripts.count -m intersection-strict --stranded=no -t exon -i exon_id accepted_hits_name.sorted.sam Homo_sapiens.GRCh37.70_new.gtf > outfile_htseqcount_exon
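For reference, htseq-count writes a two-column, tab-separated table (feature ID, then count), followed by a few summary rows such as no_feature and ambiguous (prefixed with two underscores in later HTSeq releases). With -i exon_id, the feature IDs would be Ensembl exon IDs. Below is a minimal Python sketch for loading such a file, assuming that layout. The function name read_htseq_counts is made up for illustration and is not part of HTSeq.

```python
# Names of the summary rows that htseq-count appends after the feature counts.
# Later HTSeq releases prefix them with "__"; both spellings are handled here.
SPECIAL_COUNTERS = {
    "no_feature",
    "ambiguous",
    "too_low_aQual",
    "not_aligned",
    "alignment_not_unique",
}


def read_htseq_counts(lines):
    """Split htseq-count output lines into (feature counts, summary counters).

    `lines` is any iterable of 'feature_id<TAB>count' strings, for example
    an open file handle. Hypothetical helper, not an HTSeq function.
    """
    counts = {}
    summary = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # ignore blank lines
        feature, value = line.split("\t")
        name = feature[2:] if feature.startswith("__") else feature
        if name in SPECIAL_COUNTERS:
            summary[name] = int(value)
        else:
            counts[feature] = int(value)
    return counts, summary


# Usage, assuming the output file name from the command above:
#     with open("outfile_htseqcount_exon") as handle:
#         counts, summary = read_htseq_counts(handle)
```

Keeping the summary rows apart from the per-exon counts makes it easy to check how many reads were discarded as no_feature or ambiguous.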
For example, the gene SLC25A13 has 18 exons. How can I get a count for each of these 18 exons?
I would like a result like the following, which comes from easyRNASeq:
"\"ENSG00000004864\"_1" 2
"\"ENSG00000004864\"_2" 4
"\"ENSG00000004864\"_3" 16
"\"ENSG00000004864\"_4" 3
"\"ENSG00000004864\"_5" 7
"\"ENSG00000004864\"_6" 8
"\"ENSG00000004864\"_7" 5
"\"ENSG00000004864\"_8" 4
"\"ENSG00000004864\"_9" 4
"\"ENSG00000004864\"_10" 1
"\"ENSG00000004864\"_11" 6
"\"ENSG00000004864\"_12" 4
"\"ENSG00000004864\"_13" 4
"\"ENSG00000004864\"_14" 6
"\"ENSG00000004864\"_15" 8
"\"ENSG00000004864\"_16" 5
"\"ENSG00000004864\"_17" 3
"\"ENSG00000004864\"_18" 25
Here is part of my genes.gtf (human Ensembl):
7 protein_coding exon 95951254 95951405 . - . gene_id "ENSG00000004864"; transcript_id "ENST00000265631"; exon_number "1"; gene_name "SLC25A13"; gene_biotype "protein_coding"; transcript_name "SLC25A13-001"; exon_id "ENSE00001830180";
7 protein_coding CDS 95951254 95951268 . - 0 gene_id "ENSG00000004864"; transcript_id "ENST00000265631"; exon_number "1"; gene_name "SLC25A13"; gene_biotype "protein_coding"; transcript_name "SLC25A13-001"; protein_id "ENSP00000265631";
7 protein_coding start_codon 95951266 95951268 . - 0 gene_id "ENSG00000004864"; transcript_id "ENST00000265631"; exon_number "1"; gene_name "SLC25A13"; gene_biotype "protein_coding"; transcript_name "SLC25A13-001";
7 protein_coding exon 95926210 95926263 . - . gene_id "ENSG00000004864"; transcript_id "ENST00000265631"; exon_number "2"; gene_name "SLC25A13"; gene_biotype "protein_coding"; transcript_name "SLC25A13-001"; exon_id "ENSE00003192771";
7 protein_coding CDS 95926210 95926263 . - 0 gene_id "ENSG00000004864"; transcript_id "ENST00000265631"; exon_number "2"; gene_name "SLC25A13"; gene_biotype "protein_coding"; transcript_name "SLC25A13-001"; protein_id "ENSP00000265631";
7 protein_coding exon 95906508 95906650 . - . gene_id "ENSG00000004864"; transcript_id "ENST00000265631"; exon_number "3"; gene_name "SLC25A13"; gene_biotype "protein_coding"; transcript_name "SLC25A13-001"; exon_id "ENSE00003069204";
7 protein_coding CDS 95906508 95906650 . - 0 gene_id "ENSG00000004864"; transcript_id "ENST00000265631"; exon_number "3"; gene_name "SLC25A13"; gene_biotype "protein_coding"; transcript_name "SLC25A13-001"; protein_id "ENSP00000265631";
7 protein_coding exon
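One possible way to turn counts keyed by exon_id into gene-plus-index labels of the kind shown above is to build a lookup table from the GTF. The following is a rough Python sketch, not an HTSeq or easyRNASeq feature. It assumes a tab-separated GTF with gene_id and exon_id attributes as in the excerpt, and it numbers the exons of each gene in order of first appearance in the file. That numbering scheme is an assumption and may not match the one easyRNASeq uses. The names parse_attributes, exon_labels and relabel_counts are hypothetical.

```python
import re

# Matches attribute pairs of the form: key "value";
ATTRIBUTE_RE = re.compile(r'(\S+) "([^"]*)";')


def parse_attributes(field):
    """Turn the ninth GTF column into a dict of attribute name -> value."""
    return dict(ATTRIBUTE_RE.findall(field))


def exon_labels(gtf_lines):
    """Map each exon_id to '<gene_id>_<n>'.

    n counts the distinct exons of a gene in order of first appearance
    in the file (an assumed numbering scheme).
    """
    labels = {}
    exons_seen_per_gene = {}
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        columns = line.rstrip("\n").split("\t")
        if len(columns) < 9 or columns[2] != "exon":
            continue  # keep only complete 'exon' records
        attributes = parse_attributes(columns[8])
        gene_id = attributes.get("gene_id")
        exon_id = attributes.get("exon_id")
        if gene_id is None or exon_id is None or exon_id in labels:
            continue  # incomplete record, or exon already labelled
        index = exons_seen_per_gene.get(gene_id, 0) + 1
        exons_seen_per_gene[gene_id] = index
        labels[exon_id] = "%s_%d" % (gene_id, index)
    return labels


def relabel_counts(counts, labels):
    """Return (label, count) pairs in GTF order; exons without a count get 0."""
    return [(label, counts.get(exon_id, 0)) for exon_id, label in labels.items()]
```

Here counts would be a dict mapping each exon_id to its read count, for example parsed from the htseq-count output. Because an exon shared by several transcripts appears more than once in the GTF, the sketch labels each exon_id only once.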
-- output of sessionInfo():
> sessionInfo()
R version 2.15.2 (2012-10-26)
Platform: x86_64-unknown-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=en_US.iso885915 LC_NUMERIC=C
[3] LC_TIME=en_US.iso885915 LC_COLLATE=en_US.iso885915
[5] LC_MONETARY=en_US.iso885915 LC_MESSAGES=en_US.iso885915
[7] LC_PAPER=C LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.iso885915 LC_IDENTIFICATION=C
attached base packages:
[1] parallel stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] easyRNASeq_1.4.2 ShortRead_1.16.4 latticeExtra_0.6-24
[4] RColorBrewer_1.0-5 Rsamtools_1.10.2 DESeq_1.10.1
[7] lattice_0.20-13 locfit_1.5-8 BSgenome_1.26.1
[10] GenomicRanges_1.10.7 Biostrings_2.26.3 IRanges_1.16.6
[13] edgeR_3.0.8 limma_3.14.4 biomaRt_2.14.0
[16] Biobase_2.18.0 genomeIntervals_1.14.0 BiocGenerics_0.4.0
[19] intervals_0.13.3
loaded via a namespace (and not attached):
[1] annotate_1.36.0 AnnotationDbi_1.20.5 bitops_1.0-5
[4] DBI_0.2-5 genefilter_1.40.0 geneplotter_1.36.0
[7] grid_2.15.2 hwriter_1.3 RCurl_1.95-3
[10] RSQLite_0.11.2 splines_2.15.2 stats4_2.15.2
[13] survival_2.37-4 XML_3.95-0.1 xtable_1.7-1
[16] zlibbioc_1.4.0
--
Sent via the guest posting facility at bioconductor.org.