[BioC] DEXSeq Problems
Simon Anders
anders at embl.de
Sun May 26 19:05:10 CEST 2013
Hi Margaret
Thanks for the feed-back. We have started working on an improved
vignette but this might still take a bit to get finished. The main issue
with the current one is that it somehow start in the middle, namely
after the use of the Python scripts, and then explains these
initial steps only at the end, with some parts even been moved to the
pasilla vignette. So, you are certainly right, this needs to be cleaned
up and brought in a chronological order.
To not let you wait for our new vignettes, let's go through your issues:
> Issue #1
>
> module load htseq/0.5.3p9
> python
> import HTSeq
> alignment_file = HTSeq.SAM_Reader(" sam file ")
>
> This step generates a SAM alignment object.
>
> However, I need to exit in order to use PBS code to run the
> dexseq_prepare_annotation.py
> and dexseq_count.py and generate gff and txt files,
> and thus lose the SAM alignment object which I am guessing should
> be used as the gtf file?
I am unusure where you found this code fragment. Have yiu started
reading the "Tour through HTSeq" page? This is nice, but really, there
is no need learn about how o program your own Python scripts that use
HTSeq when you just want to use the HTSeq-based Python scripts supplied
with DEXSeq.
You need to start of with preparing your annotation file. This is the
GTF file with the gene models for your species that you get from
Ensembl. (Make sure that the GTF file matches the genome you aligned
against.) Just run, on the bash shell
python dexseq_prepare_annotation.py Mus_musculus.GRCm38.71.gtf.gz \
mouse_flattened.gff
This takes the GTF file from Ensembl (here the current one for mouse,
taken from ftp://ftp.ensembl.org/pub/release-71/gtf/ ) and "flattens"
it. See Figure 1 of our paper for an explanation what we mean by
"flattening".
Then, you run for each of your BAM files the counting script:
python dexseq_count.py mouse_flattened.gff sample1.sam sample1.counts
Here, sample1.sam is the file produced by tophat (i.e.,
"accepted_hits.sam" for the respective sample). The file
"sample1.counts" contains the counts. Look into it to see whether it
makes sense.
Then, you follow the Section "Creating ExonCountSet objects from files
produced by HTSeq" to read in your count files. And then, you have the
ExonCountDataSet that you need to perform the analysis described in the
initial sections of the vignette.
> Can I actually skip the alignment_file step if I already posses a gtf file?
Not sure which step you mean. You always need to get a GTF file
(preferably from Ensembl, the UCSC ones only work after some tweaking),
"flatten it", and then count.
> Issue #2
>
> If I have a homo sapiens transcript.bam file (not from ensembl) how do
> I process the file in order to make it work with
> dexseq_prepare_annotation.py?
>
> Should I pass the transcript.sam file through the alignment_file code
> first?
A ".bam" file? Are you sure you don't mean a gtf or gff file?
Explain a bit more what this file contains, please.
> Issue #3
>
> Using an ensembl animal gene data, I have to specify strandedness=no within
> the
> python dexseq_count.py code and I am wondering if this will
> lead to errors down the line.. The dexseq_count code won't work without
> -s no, despite that the bam file is probably from the ensembl site.
Somehow you got confused here. The bam files contain the aligned reads
from your samples, not the annotation. And the "stranded" argument is
about whether the wet-lab protocol you used for sample preparation
preserves strandedness (i.e., whether the strand to which a read gets
aligned can tell you about which strand the original mRNA was
transcribed from.)
> Issue #4
>
> The two vignettes use different ways to create an ExonCountSet object.I am
> confused about which to use...
Well, they both work. We prefer to use the Python scripts. However, many
users said they preferred an solution that works completely in R,
without resorting to another language such as Python, and the
GenomicRanges workflow is an attempt to accomodate these wishes.
> Also, if I have a homo sapiens gene transcript data from illumina, put
> through tophat, what type of functions should I use to prep the data for
> the ExonCountSet
> creation? Same question for ensembl animal gene data...
I am still confused by what you mean by "data from ensemble". Do you
have your own RNA-Seq data, or are you using published ones?
> Should I prep the data by using this workflow:
> http://www.bioconductor.org/packages/release/bioc/vignettes/GenomicFeatures/inst/doc/GenomicFeatures.pdf
This is the vignette for the GenomicFeatures package. You can use this
package _instead_ of our Python scripts, but then, there is still no
need to work through this vignette. Rather, just follow the instructions
in the DEXSeq vignette, Section "Creating ExonCountSet objects From
GRanges, BamListFiles and transcriptDb objects".
I hope this helps you to get started.
Simon
More information about the Bioconductor
mailing list