[Bioc-sig-seq] Getting file names from list.files in a more useful order

Martin Morgan mtmorgan at fhcrc.org
Thu Oct 8 05:08:56 CEST 2009


Hi Michael --

Michael Muratet wrote:
> Greetings
> 
> I am working on adapting readIntensities from ShortRead to handle the
> new Illumina intensity file format, *.cif. Illumina has dropped the
> leading zeros from the file name so that if you use list.files to get
> file names from the old style you get:
> 
> list.files(pattern="int.txt.p.gz")
>   [1] "s_1_0001_int.txt.p.gz" "s_1_0002_int.txt.p.gz" "s_1_0003_int.txt.p.gz" "s_1_0004_int.txt.p.gz" "s_1_0005_int.txt.p.gz"
>   [6] "s_1_0006_int.txt.p.gz" "s_1_0007_int.txt.p.gz" "s_1_0008_int.txt.p.gz" "s_1_0009_int.txt.p.gz" "s_1_0010_int.txt.p.gz"
>  [11] "s_1_0011_int.txt.p.gz" "s_1_0012_int.txt.p.gz" "s_1_0013_int.txt.p.gz" "s_1_0014_int.txt.p.gz" "s_1_0015_int.txt.p.gz"
>  [16] "s_1_0016_int.txt.p.gz" "s_1_0017_int.txt.p.gz" "s_1_0018_int.txt.p.gz" "s_1_0019_int.txt.p.gz" "s_1_0020_int.txt.p.gz"
> 
> which puts everything in the order that one would like to read. I
> believe this is because the lexical sorting matches the arithmetic order
> of the tiles.
> 
> The new scheme yields:
> 
> list.files(pattern="cif")
>   [1] "s_1_1.cif"   "s_1_10.cif"  "s_1_100.cif" "s_1_101.cif" "s_1_102.cif" "s_1_103.cif" "s_1_104.cif" "s_1_105.cif" "s_1_106.cif"
>  [10] "s_1_107.cif" "s_1_108.cif" "s_1_109.cif" "s_1_11.cif"  "s_1_110.cif" "s_1_111.cif" "s_1_112.cif" "s_1_113.cif" "s_1_114.cif"
>  [19] "s_1_115.cif" "s_1_116.cif" "s_1_117.cif" "s_1_118.cif" "s_1_119.cif" "s_1_12.cif"  "s_1_120.cif" "s_1_13.cif"  "s_1_14.cif"


You could extract the lane and tile numbers from the file names, along the lines of

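  files = c("s_1_1.cif", "s_1_10.cif")
  ## lane: the digits immediately after the leading "s_"
  lanes = as.integer(sub("^s_([[:digit:]]+)_.*", "\\1", files))
  ## tile: the digits just before ".cif" (dot escaped, pattern anchored)
  tiles = as.integer(sub(".*_([[:digit:]]+)\\.cif$", "\\1", files))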

and then order the files with

  files[order(lanes, tiles)]
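
To apply the same idea to whatever list.files() returns, the steps can be wrapped in a small function. This is only a sketch: the name orderByTile and its 'ext' argument are made up for illustration (they are not part of ShortRead), and it assumes every file is named s_<lane>_<tile>.<ext>. It parses basename(paths), so it should also cope with paths from list.files(recursive=TRUE, full.names=TRUE) when the files sit in sub-folders. It does not use the cycle as a sort key; that would need to be added.

```r
## Hypothetical helper: order Illumina-style file names by lane, then tile.
## Assumes names of the form s_<lane>_<tile>.<ext>; directories are ignored
## when parsing but kept in the returned paths.
orderByTile <- function(paths, ext = "cif") {
    base <- basename(paths)
    pattern <- paste0("^s_([[:digit:]]+)_([[:digit:]]+)\\.", ext, "$")
    ok <- grepl(pattern, base)
    if (!all(ok))
        stop("unexpected file name(s): ",
             paste(base[!ok], collapse = ", "))
    lanes <- as.integer(sub(pattern, "\\1", base))
    tiles <- as.integer(sub(pattern, "\\2", base))
    ## numeric ordering, so tile 2 sorts before tile 10
    paths[order(lanes, tiles)]
}

files <- c("s_1_1.cif", "s_1_10.cif", "s_1_100.cif", "s_1_11.cif", "s_1_2.cif")
orderByTile(files)
```

Stopping on names that do not match the pattern is a deliberate choice here; otherwise sub() would return the name unchanged and as.integer() would quietly produce NA.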

In earlier versions of the pipeline software, I think the file name was
actually configurable and recorded in the XML configuration files,
though few people seemed to change it.

> which complicates building the requisite data structures because it's
> not in tile order.
> 
> The new convention is further complicated by the fact that the intensity
> files are now arranged in sub-folders by cycle and lane.
> 
> I could buffer everything until it's all read and then organize it
> appropriately, but it seems like it would be much simpler if I could get
> the vector into tile order instead of lexical order. I don't see a
> command or other simple way to do this, but I'm hoping someone will be
> able to offer a suggestion. Anybody have any ideas?
> 
> Thanks
> 
> Mike
> 
> 
> 


-- 
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793


