[Bioc-sig-seq] question about N

Steve Lianoglou mailinglist.honeypot at gmail.com
Thu Sep 8 17:50:18 CEST 2011


On Thu, Sep 8, 2011 at 1:15 AM, Martin Morgan <mtmorgan at fhcrc.org> wrote:
> On 09/07/2011 07:24 PM, wang peter wrote:
>> Dear members,
>>      I have a question:
>> I remove the 5' and 3' Ns, and compute statistics on reads with N in the middle.
>> Then I removed reads with N less than the cutoff,
>> but those remaining reads still contain N,
>> so I must use their IDs to retrieve the original reads (see #1).
>> So what can I do?
> maybe match(id(qualifiedReads), id(reads)) but it's really hard for me to
> understand your question. Maybe someone else will help.


Shan Gao: It would help if you explained more clearly what you are
trying to do, since I'm not sure what you really want.

After you remove reads with N, why do you still have some left?
Why do you think you need their IDs if you can already identify these
reads by some threshold (which I guess you have already calculated)?
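
To make the question concrete, here is a minimal sketch in base R, with
plain character vectors standing in for the ShortRead objects. The read
IDs, sequences, and variable names are made up, and it assumes the goal
is to keep the original reads whose trimmed sequence has at most
`cutoff` internal Ns. Because the trimmed sequences stay in the same
order as the reads, a logical index is enough; the last lines show the
ID-based lookup that Martin's match() suggestion amounts to.

```r
# Toy data (made up): plain character vectors stand in for ShortRead objects.
ids  <- c("read1", "read2", "read3", "read4")
seqs <- c("NNACGTACN", "ACNNGNTAC", "NACGNTACG", "ACGTACGTN")

# Strip leading and trailing Ns only; internal Ns are left alone.
trimmed <- gsub("^N+|N+$", "", seqs)

# Count the Ns that remain inside each trimmed read.
nCount <- nchar(gsub("[^N]", "", trimmed))

# Assumption: a read qualifies when it has at most `cutoff` internal Ns.
cutoff <- 2
keep <- nCount <= cutoff

# The logical index lines up with the original reads, so no ID lookup is needed.
qualifiedIds <- ids[keep]
originalSeqs <- seqs[keep]

# Equivalent ID-based lookup, if only the IDs of the qualified reads are at hand.
idx <- match(qualifiedIds, ids)
stopifnot(identical(seqs[idx], originalSeqs))
```

If this is not what you are after, a small example like this showing
the input and the output you expect would make it much easier to help.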

Also, as a side note, you've mentioned several times that you
are interested in building pipelines for RNA-seq. It seems like
you are in the "read cleaning/filtering" stage now. It might be worth
checking whether some tools outside of the Bioconductor universe could
help with that, like fastx_toolkit and cutadapt (for quality and
adapter cleaning), or fastqc for quality stats, among others. The
advantages of tools like these are:

(i) they require no real programming, since
(ii) you can drive them from the command line (which is good for "pipelines");
(iii) they're fast; and
(iv) they have low memory requirements, since they essentially run over your
fastq files one record at a time as they gather statistics.

If you can break up the steps of your pipeline into pieces that can
take advantage of such tools, you might find it very helpful.
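
To illustrate the "one record at a time" idea, here is a rough sketch
in base R. It is not how any of the tools above are implemented; it
only shows why streaming keeps memory use low. It assumes simple
4-line FASTQ records (no wrapped sequence lines), and the toy records,
the cutoff, and the variable names are all made up.

```r
# Toy FASTQ text (made up); a real pipeline would open a file connection instead,
# e.g. con <- file("reads.fastq", "r") for a hypothetical reads.fastq.
fastq <- c("@read1", "ACGTNACG", "+", "IIIIIIII",
           "@read2", "ANNNTACG", "+", "IIIIIIII",
           "@read3", "ACGTACGT", "+", "IIIIIIII")
con <- textConnection(fastq)

cutoff <- 1            # assumption: keep reads with at most this many Ns
kept   <- character(0) # IDs of reads that pass the filter
total  <- 0            # number of records seen

# Read one 4-line record per iteration; only the current record is held in memory.
repeat {
  rec <- readLines(con, n = 4)
  if (length(rec) < 4) break
  total <- total + 1
  nN <- nchar(gsub("[^N]", "", rec[2]))   # Ns in the sequence line
  # Here only the ID is collected; a real filter would write the record out.
  if (nN <= cutoff) kept <- c(kept, sub("^@", "", rec[1]))
}
close(con)
```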


> Martin
>> part of coding
>> seqsWithoutNend<-trimLRPatterns(Rpattern = letter_subject, Lpattern =
>> letter_subject,subject = injectedseqs)
>> nCount<-alphabetFrequency(seqsWithoutNend)[,"N"]
>> nDist<- table(nCount)
>> cutoff=2
>> filter1<- nFilter(threshold=cutoff)
>> qualifiedReads<- reads[filter1(seqsWithoutNend)]
>> reads[id(qualifiedReads)] #1
>> _______________________________________________
>> Bioc-sig-sequencing mailing list
>> Bioc-sig-sequencing at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
> --
> Computational Biology
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109
> Location: M1-B861
> Telephone: 206 667-2793

Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact
