[BioC] writing a fasta file in blocks
mtmorgan at fhcrc.org
mtmorgan at fhcrc.org
Wed Jun 9 09:06:58 CEST 2010
Quoting Fahim Md <fahim.md at gmail.com>:
> Hi,
> Thanks for reply.
> I tried Biostrings as well as seqinr packages but they almost do the same
> thing.
> Hi sean: I am attaching the code and try to be as informative as possible:
>
> 2> dataFile[1:5,] #The dataFile[ , ]
> contains the data in the following format. Please note that the probeSetID
> is same for 11 entries and all corresponds to a particular probe.
> sequence probeSetID
>
> [1,] "GCTACTTTACTCCAGAATTTTGTTA" "1367452_at"
> [2,] "TTAGAAAGCCGCAATTTGGTCCCGC" "1367452_at"
> [3,] "GCCACATCCTGACTACTGCAGTATA" "1367452_at"
> [4,] "AGTATAGTTTCTCTCCTCTTTCATT" "1367452_at"
> [5,] "ACATAAGTAACTGGTGTGTGTGCAC" "1367452_at"
I think that this
f = function(x)
{
m = c(FALSE, x[-1] == x[-length(x)])
csum = cumsum(m)
1 + csum - cummax((!m) * csum)
}
gives the digits for sorted x
(this is due to Bill Dunlap,
http://tolstoy.newcastle.edu.au/R/e4/devel/08/04/1206.html), so
id = paste("> ", dataFile[["probeSetID"]], ".", f(dataFile[["probeSetID"]],
sep="")
gives the labels and
fasta = character(2 * nrow(dataFile))
fasta[c(TRUE, FALSE)] = id
fasta[c(FALSE, TRUE)] = dataFile[["sequence"]])
write(fasta, "/some/file.fasta")
does the rest. This is untested.
Martin
>
> 2> nRow # The number of probes are very high, and this is the
> reason that I dont want to use writeFASTA again and again
> [1] 342410
>
>
> #below are my code stub
> i=1; #loop counter
> while(i <= nRow)
> {
> desc = dataFile[i,'probeSetID'];
> j=1;
> while(desc == dataFile[i+1,'probeSetID']) #If the current
> descriptor and the next descriptor is same then I am using j for extension
> to the descriptor
> {
> probeSetID = paste(dataFile[i,'probeSetID'], '.', j,
> sep=''); # I want to give each probe an extension
> sequence = dataFile[i,'sequence'];
> fa.string = list(seq = sequence, desc = probeSetID)
> writeFASTA(fa.string, file = outFile, desc= probeSetID, append=TRUE,
> width =50) #This is what I dont want ?? I tried to use lapply function
> assuming that it will help but couldnot do it. but anyway lapply is also
> using loop internally.
> i=i+1;
> j=j+1;
> }
> probeSetID = paste(dataFile[i,'probeSetID'], '.', j, sep='');
> sequence = dataFile[i,'sequence'];
> fa.string = list(seq = sequence, desc = probeSetID) #This is
> writing the 11th probe(last probe) of each probeSet.
> writeFASTA(fa.string, file = outFile, desc= probeSetID, append=TRUE,
> width =50)
> i=i+1;
> }
>
> #This code is giving me the desired output as below:
>> 1367452_at.1
> GCTACTTTACTCCAGAATTTTGTTA
>> 1367452_at.2
> TTAGAAAGCCGCAATTTGGTCCCGC
>> 1367452_at.3
> GCCACATCCTGACTACTGCAGTATA
>> 1367452_at.4
> AGTATAGTTTCTCTCCTCTTTCATT
>
> ......
>> 1367452_at.11
> AAAATCCTCGCATACCTTGTTCGAT
>> 1367453_at.1
> GATGGATCCTACTGATGCCAAGTAT
>> 1367453_at.2
> GCCAAGTATCACATGCAGCGTTGCA
>> 1367453_at.3
>
>
> 1> sessionInfo()
> R version 2.11.0 (2010-04-22)
> x86_64-pc-linux-gnu
>
> locale:
> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
> [5] LC_MONETARY=C LC_MESSAGES=en_US.UTF-8
> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
> [9] LC_ADDRESS=C LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
>
> other attached packages:
> [1] Biostrings_2.16.2 IRanges_1.6.4
>
> loaded via a namespace (and not attached):
> [1] Biobase_2.8.0
> 1>
>
> Best Regards.
> Fahim
> ........................
>
>
>
> On Tue, Jun 8, 2010 at 9:50 PM, Martin Morgan <mtmorgan at fhcrc.org> wrote:
>
>> On 06/09/2010 02:31 AM, Kasper Daniel Hansen wrote:
>> > Doing what Fahim suggests internally in writeFASTA has been on my todo
>> > list for a while, and it will significantly speed up the writing of
>> > fasta files with many small records. Guess I should do it now, and
>> > cross it off my list.
>> >
>> > But Fahim: I am not sure it is possible to do what you want to do with
>> > the current function (at least if you are using Biostrings), but I
>> > could be wrong. If you want to investigate further, note that the
>> > file can be a connection (?connection).
>> >
>> > Kasper
>> >
>> > On Tue, Jun 8, 2010 at 4:26 PM, Sean Davis <sdavis2 at mail.nih.gov> wrote:
>> >> On Mon, Jun 7, 2010 at 11:12 PM, Fahim Md <fahim.md at gmail.com> wrote:
>> >>
>> >>> I have a data File, the format of which is given below. It has two
>> fields,
>> >>> namely sequence and probeset name.
>> >>> sequence Probe Set Name
>> >>> GCTACTTTACTCCAGAATTTTGTTA 1367452_at.1
>> >>> TTAGAAAGCCGCAATTTGGTCCCGC 1367452_at.2
>> >>> GCCACATCCTGACTACTGCAGTATA 1367452_at.3
>> >>> ............
>> >>> AAAAAAAAGGGGGGGTCCCCCCCC 1234567_at.1
>> >>>
>> >>>
>> >>> Now, I want to convert that into FASTA format as follows
>> >>>
>> >>>> 1367452_at.1
>> >>> GCTACTTTACTCCAGAATTTTGTTA
>> >>>> 1367452_at.2
>> >>> TTAGAAAGCCGCAATTTGGTCCCGC
>> >>>> 1367452_at.3
>> >>> GCCACATCCTGACTACTGCAGTATA
>> >>> .......
>> >>> ....
>> >>>> 1234567_at.1
>> >>> AAAAAAAAAAAAACCCCCCCCCCCC
>> >>>
>> >>>
>> >>> I am getting the required output by using writeFASTA(..) function but
>> it is
>> >>> too slow because I am using loop and in every loop it access the file
>> to
>> >>> write into.
>> >>>
>> >>> Is there anyway through which I can write this fasta information into
>> some
>> >>> variable and once I am done I write back that variable into the
>> required
>> >>> file.
>>
>> For short sequences where line wrapping is not important, you might
>> input the data with
>>
>> df = read.table(...)
>>
>> and the like, create a template for the output
>>
>> fasta = character(nrow(df))
>>
>> then fill it in (no loop required)
>>
>> fasta[c(TRUE, FALSE)] = paste(">", df[["Probe.Set.Name"]])
>> fasta[c(FALSE, TRUE)] = df[["sequence"]]
>>
>> and save it
>>
>> write(fasta, "/some/file.fasta")
>>
>> Martin
>>
>> >>>
>> >>>
>> >> Hi,Fahim.
>> >>
>> >> Probably most appropriate for here and not bioc-devel. Perhaps a
>> >> reproducible code example and some sessionInfo() would be helpful.
>> >>
>> >> Sean
>> >>
>> >> [[alternative HTML version deleted]]
>> >>
>> >> _______________________________________________
>> >> Bioconductor mailing list
>> >> Bioconductor at stat.math.ethz.ch
>> >> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> >> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>> >>
>> >
>> > _______________________________________________
>> > Bioconductor mailing list
>> > Bioconductor at stat.math.ethz.ch
>> > https://stat.ethz.ch/mailman/listinfo/bioconductor
>> > Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>>
>> --
>> Martin Morgan
>> Computational Biology / Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N.
>> PO Box 19024 Seattle, WA 98109
>>
>> Location: Arnold Building M1 B861
>> Phone: (206) 667-2793
>>
>
>
>
> --
> Mohammad Fahim
> Bioinforformatics Lab
> University of Louisville
> Louisville, KY, USA
> Ph: +1-502-409-1167
>
More information about the Bioconductor
mailing list