[Bioc-sig-seq] Getting file names from list.files in a more useful order

Michael Muratet mmuratet at hudsonalpha.org
Wed Oct 7 23:05:59 CEST 2009


Greetings

I am working on adapting readIntensities from ShortRead to handle the
new Illumina intensity file format, *.cif. Illumina has dropped the
leading zeros from the tile number in the file name. With the old-style
names, list.files returns:

list.files(pattern="int.txt.p.gz")
   [1] "s_1_0001_int.txt.p.gz" "s_1_0002_int.txt.p.gz" "s_1_0003_int.txt.p.gz" "s_1_0004_int.txt.p.gz" "s_1_0005_int.txt.p.gz"
   [6] "s_1_0006_int.txt.p.gz" "s_1_0007_int.txt.p.gz" "s_1_0008_int.txt.p.gz" "s_1_0009_int.txt.p.gz" "s_1_0010_int.txt.p.gz"
  [11] "s_1_0011_int.txt.p.gz" "s_1_0012_int.txt.p.gz" "s_1_0013_int.txt.p.gz" "s_1_0014_int.txt.p.gz" "s_1_0015_int.txt.p.gz"
  [16] "s_1_0016_int.txt.p.gz" "s_1_0017_int.txt.p.gz" "s_1_0018_int.txt.p.gz" "s_1_0019_int.txt.p.gz" "s_1_0020_int.txt.p.gz"

which puts everything in the order in which one would like to read it.
I believe this is because, with zero-padded tile numbers, the lexical
sort order matches the numeric order of the tiles.
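
A small illustration of that point, using made-up tile numbers and
shortened names rather than real files: sorting as character strings
agrees with numeric order only when the numbers are zero-padded to the
same width.

```r
# Hypothetical tile numbers, chosen only for illustration
tiles <- c(1L, 2L, 10L, 11L, 100L)

padded   <- sprintf("s_1_%04d", tiles)  # zero-padded, as in the old names
unpadded <- sprintf("s_1_%d", tiles)    # no padding, as in the new names

# Equal-width padded names sort into numeric tile order
padded_sorted <- sort(padded)

# Unpadded names are compared character by character,
# so "s_1_10" and "s_1_100" land before "s_1_2"
unpadded_sorted <- sort(unpadded)

print(padded_sorted)
print(unpadded_sorted)
```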

The new scheme yields:

list.files(pattern="cif")
   [1] "s_1_1.cif"   "s_1_10.cif"  "s_1_100.cif" "s_1_101.cif" "s_1_102.cif" "s_1_103.cif" "s_1_104.cif" "s_1_105.cif" "s_1_106.cif"
  [10] "s_1_107.cif" "s_1_108.cif" "s_1_109.cif" "s_1_11.cif"  "s_1_110.cif" "s_1_111.cif" "s_1_112.cif" "s_1_113.cif" "s_1_114.cif"
  [19] "s_1_115.cif" "s_1_116.cif" "s_1_117.cif" "s_1_118.cif" "s_1_119.cif" "s_1_12.cif"  "s_1_120.cif" "s_1_13.cif"  "s_1_14.cif"

which complicates building the requisite data structures because it's  
not in tile order.

The new convention is further complicated by the fact that the  
intensity files are now arranged in sub-folders by cycle and lane.

I could buffer everything until it's all read and then organize it
appropriately, but it seems like it would be much simpler if I could
get the vector into tile order instead of lexical order. I don't see a
command or other simple way to do this, but I'm hoping someone will be
able to offer a suggestion. Anybody have any ideas?
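
To make the goal concrete, here is a rough sketch of the kind of
reordering I mean. The helper name tileOrder and the regular expression
are assumptions made up for illustration (names of the form
s_<lane>_<tile>.cif, all from a single lane and cycle); neither comes
from ShortRead, and something less ad hoc would be preferable if it
exists.

```r
# Hypothetical helper: reorder file names by the tile number embedded in them.
# Assumes names of the form s_<lane>_<tile>.cif, as in the listing above.
tileOrder <- function(files) {
    # Capture the final run of digits before ".cif" and convert it to integer;
    # basename() drops any leading directory components first
    tile <- as.integer(sub("^.*_([0-9]+)\\.cif$", "\\1", basename(files)))
    # Reorder the original names by numeric tile value
    files[order(tile)]
}

# A few names taken from the listing above, in the order list.files gave them
files <- c("s_1_1.cif", "s_1_10.cif", "s_1_100.cif",
           "s_1_11.cif", "s_1_12.cif", "s_1_13.cif")
ordered <- tileOrder(files)
print(ordered)
```

With files spread over sub-folders, the same idea would presumably
need a second sort key for cycle or lane, which this sketch does not
attempt.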

Thanks

Mike


