[Bioc-sig-seq] perTile QA element in the FastqQA class

Tue Apr 13 23:02:30 CEST 2010

On 04/13/2010 11:39 AM, Sirisha Sunkara wrote:
> I am sorry, this line should read as:
> 
>> fqhead[fqhead$V2 == "8" & fqhead$V3 == "120",]
> 
> my dataframe looks like this:
> V2 is the lane number, V3 is the tile number, V4 and V5 being the x and
> y coordinates of the cluster position.
>> head(fqhead)
>            V1 V2 V3 V4   V5
> 1  @ILLUMINA06  8  1  6  849
> 2  @ILLUMINA06  8  1  6 1169
> 3  @ILLUMINA06  8  1  6 1163
> 4  @ILLUMINA06  8  1  6 1512
> 5  @ILLUMINA06  8  1  6 1251
> 6  @ILLUMINA06  8  1  6  372
> 7  @ILLUMINA06  8  1  6 1555
> 8  @ILLUMINA06  8  1  6 1644
> 9  @ILLUMINA06  8  1  6 2011
> 10 @ILLUMINA06  8  1  7 1835
> 
> Sirisha
> 
> 
> Sirisha Sunkara wrote:
>> Hi Martin,
>>
>> I am using the sequence.txt files generated by the Illumina pipeline
>> (OLB1.6/RTA1.6) as is, which seem to have the tile coordinates.
>>
>> Just so I can focus on the ReadIDs part for now (and I am sure this is
>> not exactly what you asked for), I parsed out the readIDs from the
>> fastq, and am working with those.
>>
>> This is what my fastqs look like:
>>
>> @ILLUMINA06:8:1:6:849#0/1
>> GCTCTTTTTGATTCTCAAATCCGGCGTCAACCATA
>> +ILLUMINA06:8:1:6:849#0/1
>> a`abaa_aaa]_a`_a_[]`]a_`aa_`_aa`aaa
>> @ILLUMINA06:8:1:6:1169#0/1
>> TAATGCCACTCCTCTCCCGACTGTTAACACTGCTG
>> +ILLUMINA06:8:1:6:1169#0/1
>> ab`_Z_aXa`bbababbbaabaaaaababaaa`V`
>>
>> My very basic attempt at this:
>>
>> > fqhead <-
>> read.table("./Contam_Screening/Run703/sequence_8_1_hdrs.txt", sep=":")
>>
>> To extract all entries for instance in lane 8, tile 120:
>> > fqhead[fqhead$V2 == "3" & fq$V3 == "120",]

OK; I might do this

f <- function(fl) {
    rfq = readFastq(fl)
    m <- sapply(strsplit(as.character(id(rfq)), ":"), "[", 2:3)
    lane <- unique(m[1,])  # should be just 1 lane per file
    stopifnot(1L == length(lane))
    tile <- m[2,]
    score <- alphabetScore(rfq) / width(rfq)
    data.frame(lane=as.integer(lane),
               tile=as.integer(tapply(tile, tile, unique)),
               count=as.integer(tapply(tile, tile, length)),
               score=tapply(score, tile, median))
}

which reads in the data, extracts the 'lane' and 'tile' indicators from
the name, calculates the average quality score of a read, counts how
many times each tile occurs, and calculates the median score per tile.
Do this for all lanes, rbind'ing the data frames together

  fls = list.files(""./Contam_Screening/Run703/", "sequence*_txt")
  df = do.call(rbind, lapply(fls, f))

and plot

  ShortRead:::.plotTileCounts(df)
  ShortRead:::.plotTileQualityScore(df)

This is not tested, so there might be bugs...

Martin

>>
>> I hope I am somewhat closer to what you asked for...
>>
>> Thanks a lot!
>> Sirisha
>>
>>
>> Martin Morgan wrote:
>>> On 04/12/2010 02:49 PM, Sirisha Sunkara wrote:
>>>  
>>>> Hi Martin,
>>>>
>>>> The qa function that reads in fastq format files, doesn't seem to
>>>> populate the perTile QA element with row information...
>>>> The row counts are zero for both the readCounts and
>>>> medianReadQualityScore list elements of perTile.
>>>>
>>>> Is this feature still work in progress..? Essentially, I am trying to
>>>> get the TileQC plots for lanes where there was no reference genome to
>>>> align (no export.txt files)
>>>>     
>>>
>>> Hi Sirisha -- fastq files can't be guaranteed to have tile info so
>>> ShortRead doesn't try to guess these, even if some software adopts
>>> conventions for embedding the information in the read ids.
>>>
>>> The tile images are generated by
>>>
>>>   ShortRead:::.plotTileCounts
>>>
>>> and
>>>
>>>   ShortRead:::.plotTileQualityScore
>>>
>>> both take a regular data.frame. For .plotTileCounts, the columns are
>>> 'type' (safe to ignore, I think), 'tile' (integer tile index), 'lane'
>>> (integer lane index), and 'count' (number of reads in this particular
>>> lane & tile). As an untested work-around, you could create a data frame
>>> like this by parsing your read IDs using standard R commands; provide an
>>> example of what the read IDs look like and I'll help you. For the
>>> .plotTileQualityScore, the columns are 'type', 'tile', 'lane', and
>>> 'score', where 'score' is the median 'qualityScore'
>>> (alphabetScore(quality(srq)) / width(quality(srq)) for some ShortReadQ
>>> object srq obtained by readFastq) over all reads in the tile.
>>>
>>> Martin
>>>
>>>  
>>>>>  qafq <- qa("./Contam_Screening/Run703/","s_8_1_sequence.txt",
>>>>>       
>>>> type="fastq")
>>>>   
>>>>> qafq
>>>>>       
>>>> class: FastqQA(9)
>>>> QA elements (access with qa[["elt"]]):
>>>>  readCounts: data.frame(1 3)
>>>>  baseCalls: data.frame(1 5)
>>>>  readQualityScore: data.frame(512 4)
>>>>  baseQuality: data.frame(94 3)
>>>>  alignQuality: data.frame(1 3)
>>>>  frequentSequences: data.frame(50 4)
>>>>  sequenceDistribution: data.frame(1663 4)
>>>>  perCycle: list(2)
>>>>    baseCall: data.frame(150 4)
>>>>    quality: data.frame(1081 5)
>>>>  perTile: list(2)
>>>>    readCounts: data.frame(0 4)
>>>>    medianReadQualityScore: data.frame(0 4)
>>>>
>>>> Thank You,
>>>> Sirisha
>>>>
>>>>   
>>>>> sessionInfo()
>>>>>       
>>>> R version 2.11.0 Under development (unstable) (2010-03-07 r51225)
>>>> x86_64-unknown-linux-gnu
>>>>
>>>> locale:
>>>> [1] C
>>>>
>>>> attached base packages:
>>>> [1] stats     graphics  grDevices utils     datasets  methods  
>>>> base   other attached packages:
>>>> [1] ShortRead_1.5.21    lattice_0.18-3      Biostrings_2.15.22
>>>> [4] GenomicRanges_0.1.0 IRanges_1.5.74      Rmpi_0.5-8       loaded
>>>> via a namespace (and not attached):
>>>> [1] Biobase_2.7.5 grid_2.11.0   hwriter_1.2   tools_2.11.0
>>>>
>>>> _______________________________________________
>>>> Bioc-sig-sequencing mailing list
>>>> Bioc-sig-sequencing at r-project.org
>>>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>>>>     
>>>
>>>
>>>   
>>
>>
> 

-- 
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793