[Bioc-sig-seq] About PhredQuality

Mon Sep 12 05:33:50 CEST 2011

On 09/08/2011 01:14 PM, wang peter wrote:
>
> hi all, especially dear steve and martin:
>      thx for your kindly help, that lead me finish the first stage of my
> pipeline;
> such is coding:

Thanks for sharing your coding! I added 'trimEnds' to ShortRead (use the 
development versions of R and Bioconductor, ShortRead version 1.11.37 or 
greater). It can trim left or right ends, either with exact match (e.g., 
arguments a=c("A", "T"), relation="==") or below some threshold (e.g., 
a="I", relation="<="). So

## 'I' is Phred quality 20; "<=" and both ends by default
trimmed = trimEnds(reads, a="I")

This uses the quality scores for trimming, you can also trim reads, or 
return an IRanges (e.g., to trim a ShortReadQ object based on sequence)

rng = trimEnds(sread(reads), c("G", "C"), relation="==", right=FALSE,
                ranges=TRUE)
narrow(reads, start=start(rng), end=end(rng))

See ?trimEnds.

Martin

> fastqfile="query.fastq"
> library(ShortRead)
> reads <- readFastq(fastqfile);
> ids<- id(reads);
> seqs <- sread(reads);
> # first report the data quality
> qa <- qa(dirPath=".", pattern="query.fastq", type=c("fastq")) # read in
> fastq file
> report(qa, dest="qcReport1", type="html")
>
> #this is not necessary
> nCount<-alphabetFrequency(seqs)[,"N"]
> nDist<- table(nCount)
> #replaced all the nucleotide residue whose quality score below the
> cutoff by "N"
> qualityCutoff <- 20
> qual <- PhredQuality(quality(quality(reads))) # get quality score list
> as PhredQuality
> myqual_mat <- matrix(charToRaw(as.character(unlist(qual))),
> nrow=length(qual), byrow=TRUE) # convert quality score to matrix
> at <- myqual_mat <
> charToRaw(as.character(PhredQuality(as.integer(qualityCutoff))))
> letter_subject <- DNAString(paste(rep.int <http://rep.int>("N",
> width(seqs)[1]), collapse=""))
> letter <- as(Views(letter_subject, start=1, end=rowSums(at)),
> "DNAStringSet")
> injectedseqs <- replaceLetterAt(seqs, at, letter)
> #remove all the "N" at 5' and 3' end, and see the "N" distribution in reads
> seqsWithoutNend <-trimLRPatterns(Rpattern = letter_subject, Lpattern =
> letter_subject,subject = injectedseqs)
> trimCoords<-trimLRPatterns(Rpattern = letter_subject, Lpattern =
> letter_subject,subject = injectedseqs,ranges=T)
> nCount<-alphabetFrequency(seqsWithoutNend)[,"N"]
> nDist<- table(nCount)
> #from the "N" distribution, decide a cutoff to remove some reads, here
> the cutoff is set 2
> nCutoff=2
> middleN <-nCount < nCutoff
> reads<-reads[middleN]
> seqs <- seqs[middleN]
> qual <- qual[middleN]
> trimCoords<-trimCoords[middleN]
> seqs <- DNAStringSet(seqs, start=start(trimCoords), end=end(trimCoords))
> qual1 <- BStringSet(qual, start=start(trimCoords), end=end(trimCoords))
> qual <- SFastqQuality(qual1) # reapply quality score type
> trimmed <- ShortReadQ(sread=seqs, quality=qual, id=id(reads))
> table(width(trimmed))
> at last, we get all the high quality reads without "N" at 5' and 3' end

-- 
Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109

Location: M1-B861
Telephone: 206 667-2793