[BioC] writing a fasta file in blocks

Wed Jun 9 09:06:58 CEST 2010

Quoting Fahim Md <fahim.md at gmail.com>:

> Hi,
> Thanks for reply.
> I tried Biostrings as well as seqinr packages but they almost do the same
> thing.
> Hi sean: I am attaching the code and try to be as informative as possible:
>
> 2> dataFile[1:5,]                                #The dataFile[ , ]
> contains the data in the following format. Please note that the probeSetID
> is same for 11 entries and all corresponds to a particular probe.
>                                    sequence                      probeSetID
>
> [1,] "GCTACTTTACTCCAGAATTTTGTTA"        "1367452_at"
> [2,] "TTAGAAAGCCGCAATTTGGTCCCGC"      "1367452_at"
> [3,] "GCCACATCCTGACTACTGCAGTATA"      "1367452_at"
> [4,] "AGTATAGTTTCTCTCCTCTTTCATT"        "1367452_at"
> [5,] "ACATAAGTAACTGGTGTGTGTGCAC"     "1367452_at"

I think that this

   f = function(x)
   {
       m = c(FALSE, x[-1] == x[-length(x)])
       csum = cumsum(m)
       1 + csum - cummax((!m) * csum)
   }
gives the digits for sorted x
(this is due to Bill Dunlap,  
http://tolstoy.newcastle.edu.au/R/e4/devel/08/04/1206.html), so

id = paste("> ", dataFile[["probeSetID"]], ".", f(dataFile[["probeSetID"]],
            sep="")

gives the labels and

   fasta = character(2 * nrow(dataFile))
   fasta[c(TRUE, FALSE)] = id
   fasta[c(FALSE, TRUE)] = dataFile[["sequence"]])
   write(fasta, "/some/file.fasta")

does the rest. This is untested.

Martin

>
> 2> nRow            # The number of probes are very high, and this is the
> reason that I dont want to use writeFASTA again and again
> [1] 342410
>
>
> #below are my code stub
>      i=1;                                       #loop counter
>     while(i <= nRow)
>      {
>      desc = dataFile[i,'probeSetID'];
>      j=1;
>      while(desc == dataFile[i+1,'probeSetID'])          #If the current
> descriptor and the next descriptor is same then I am using j for extension
> to the descriptor
>         {
>         probeSetID = paste(dataFile[i,'probeSetID'], '.', j,
> sep='');                        # I want to give each probe an extension
>         sequence =  dataFile[i,'sequence'];
>         fa.string = list(seq = sequence, desc = probeSetID)
>         writeFASTA(fa.string, file = outFile, desc= probeSetID, append=TRUE,
> width =50)   #This is what I dont want ?? I tried to use lapply function
> assuming that it will help but couldnot do it. but anyway lapply is also
> using loop internally.
>         i=i+1;
>         j=j+1;
>         }
>     probeSetID = paste(dataFile[i,'probeSetID'], '.', j, sep='');
>     sequence =  dataFile[i,'sequence'];
>     fa.string = list(seq = sequence, desc = probeSetID)            #This is
> writing the 11th probe(last  probe) of each probeSet.
>     writeFASTA(fa.string, file = outFile, desc= probeSetID, append=TRUE,
> width =50)
>     i=i+1;
>     }
>
> #This code is giving me the desired output as below:
>> 1367452_at.1
> GCTACTTTACTCCAGAATTTTGTTA
>> 1367452_at.2
> TTAGAAAGCCGCAATTTGGTCCCGC
>> 1367452_at.3
> GCCACATCCTGACTACTGCAGTATA
>> 1367452_at.4
> AGTATAGTTTCTCTCCTCTTTCATT
>
> ......
>> 1367452_at.11
> AAAATCCTCGCATACCTTGTTCGAT
>> 1367453_at.1
> GATGGATCCTACTGATGCCAAGTAT
>> 1367453_at.2
> GCCAAGTATCACATGCAGCGTTGCA
>> 1367453_at.3
>
>
> 1> sessionInfo()
> R version 2.11.0 (2010-04-22)
> x86_64-pc-linux-gnu
>
> locale:
>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>  [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
> [1] Biostrings_2.16.2 IRanges_1.6.4
>
> loaded via a namespace (and not attached):
> [1] Biobase_2.8.0
> 1>
>
> Best Regards.
> Fahim
> ........................
>
>
>
> On Tue, Jun 8, 2010 at 9:50 PM, Martin Morgan <mtmorgan at fhcrc.org> wrote:
>
>> On 06/09/2010 02:31 AM, Kasper Daniel Hansen wrote:
>> > Doing what Fahim suggests internally in writeFASTA has been on my todo
>> > list for a while, and it will significantly speed up the writing of
>> > fasta files with many small records.  Guess I should do it now, and
>> > cross it off my list.
>> >
>> > But Fahim: I am not sure it is possible to do what you want to do with
>> > the current function (at least if you are using Biostrings), but I
>> > could be wrong.  If you want to investigate further, note that the
>> > file can be a connection (?connection).
>> >
>> > Kasper
>> >
>> > On Tue, Jun 8, 2010 at 4:26 PM, Sean Davis <sdavis2 at mail.nih.gov> wrote:
>> >> On Mon, Jun 7, 2010 at 11:12 PM, Fahim Md <fahim.md at gmail.com> wrote:
>> >>
>> >>> I have a data File, the format of which is given below. It has two
>> fields,
>> >>> namely  sequence and probeset name.
>> >>>    sequence                                       Probe Set Name
>> >>> GCTACTTTACTCCAGAATTTTGTTA      1367452_at.1
>> >>> TTAGAAAGCCGCAATTTGGTCCCGC    1367452_at.2
>> >>> GCCACATCCTGACTACTGCAGTATA     1367452_at.3
>> >>> ............
>> >>> AAAAAAAAGGGGGGGTCCCCCCCC     1234567_at.1
>> >>>
>> >>>
>> >>> Now, I want to convert that into FASTA format as follows
>> >>>
>> >>>> 1367452_at.1
>> >>> GCTACTTTACTCCAGAATTTTGTTA
>> >>>> 1367452_at.2
>> >>> TTAGAAAGCCGCAATTTGGTCCCGC
>> >>>> 1367452_at.3
>> >>> GCCACATCCTGACTACTGCAGTATA
>> >>> .......
>> >>> ....
>> >>>> 1234567_at.1
>> >>> AAAAAAAAAAAAACCCCCCCCCCCC
>> >>>
>> >>>
>> >>> I am getting the required output by using writeFASTA(..) function but
>> it is
>> >>> too slow because I am using loop and in every loop it access the file
>> to
>> >>> write into.
>> >>>
>> >>> Is there anyway through which I can write this fasta information into
>> some
>> >>> variable and once I am done I write back that  variable into the
>> required
>> >>> file.
>>
>> For short sequences where line wrapping is not important, you might
>> input the data with
>>
>>  df = read.table(...)
>>
>> and the like, create a template for the output
>>
>>  fasta = character(nrow(df))
>>
>> then fill it in (no loop required)
>>
>>  fasta[c(TRUE, FALSE)] = paste(">", df[["Probe.Set.Name"]])
>>  fasta[c(FALSE, TRUE)] = df[["sequence"]]
>>
>> and save it
>>
>>  write(fasta, "/some/file.fasta")
>>
>> Martin
>>
>> >>>
>> >>>
>> >> Hi,Fahim.
>> >>
>> >> Probably most appropriate for here and not bioc-devel.  Perhaps a
>> >> reproducible code example and some sessionInfo() would be helpful.
>> >>
>> >> Sean
>> >>
>> >>        [[alternative HTML version deleted]]
>> >>
>> >> _______________________________________________
>> >> Bioconductor mailing list
>> >> Bioconductor at stat.math.ethz.ch
>> >> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> >> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>> >>
>> >
>> > _______________________________________________
>> > Bioconductor mailing list
>> > Bioconductor at stat.math.ethz.ch
>> > https://stat.ethz.ch/mailman/listinfo/bioconductor
>> > Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>>
>> --
>> Martin Morgan
>> Computational Biology / Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N.
>> PO Box 19024 Seattle, WA 98109
>>
>> Location: Arnold Building M1 B861
>> Phone: (206) 667-2793
>>
>
>
>
> --
> Mohammad Fahim
> Bioinforformatics Lab
> University of Louisville
> Louisville, KY, USA
> Ph:  +1-502-409-1167
>