[R] problem in reading a sequence file

Tue Jul 5 17:21:51 CEST 2011

On Tue, Jul 05, 2011 at 04:53:32PM +0200, albert coster wrote:

I'm taking this back to the list so others can follow up.

> Yes, the file consists of one string (sequence) per line.
> 
> The file format is as follows: 
> 
> Sequence
> NNNNNNNNNNATTAAAGGGC

OK - in that case (and as you want a vector anyway) you can use
scan('seq.txt', what=character())

> 
> > seqfile<-read.table("seq.txt")
> Warning message:
> In read.table("seq.txt") :
>   incomplete final line found by readTableHeader on 'seq.txt'

OK - that means you don't have a newline ('\n') at the end of your
sequence file and read.table is warning you about that.
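
To illustrate the point about the missing newline, here is a minimal
sketch. It assumes a throwaway temp file standing in for your
'seq.txt' (the name 'tmp' is only for illustration). The first
read.table call should give the warning, the second should not, and
both return the same data:

```r
# Hypothetical stand-in for seq.txt: two lines, no newline at the end
tmp <- tempfile(fileext = ".txt")
cat("Sequence\nNNNNNNNNNNATTAAAGGGC", file = tmp)

# The final line is incomplete, so read.table is expected to warn here,
# but the data are still read
dat1 <- read.table(tmp, header = TRUE, as.is = TRUE)

# Append the missing newline and read the file again
cat("\n", file = tmp, append = TRUE)
dat2 <- read.table(tmp, header = TRUE, as.is = TRUE)

# Both reads yield the same content
identical(dat1$Sequence, dat2$Sequence)
```

So the warning is harmless in this case; adding a final newline in
your editor is enough to silence it.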

> > str(seqfile)
> 'data.frame': 2 obs. of  1 variable:
>  $ V1: Factor w/ 2 levels "NNNNNNNNNNATTAAAGGGC",..: 2 1

This indicates that there are at least two lines in the file (so you
got two levels in the factor). So I would guess there is an empty line
before your sequence or you really have the word 'Sequence' on line 1.

For sequence data it probably does not make much sense to let R
convert to factor; a character column would be preferred. This can
be accomplished by using one of the options 'as.is',
'stringsAsFactors' or 'colClasses'. 
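
A rough sketch of the three options side by side, assuming a temp
file with the same two lines as your 'seq.txt' (the file and the
names d1 to d3 are made up for illustration). No header is declared
here, so the column is called V1 as in your str() output:

```r
# Hypothetical stand-in for seq.txt
tmp <- tempfile(fileext = ".txt")
writeLines(c("Sequence", "NNNNNNNNNNATTAAAGGGC"), tmp)

# Three ways to keep the column as character instead of factor
d1 <- read.table(tmp, as.is = TRUE)
d2 <- read.table(tmp, stringsAsFactors = FALSE)
d3 <- read.table(tmp, colClasses = "character")

# Each should report V1 as chr with 2 observations
str(d1)
str(d2)
str(d3)
```

Any of the three should do for a single-column file like this one;
colClasses is the most explicit because it fixes the type of the
column rather than just switching off the factor conversion.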

If you use scan you'll need to get rid of the extra line first. If
you stick with read.table you can specify the first line as your
header line using the header=TRUE option. Now you can address column
'Sequence' as such. Example:

> dat <- read.table('seq.txt', as.is=TRUE, header=TRUE)
> dat$Sequence
[1] "NNNNNNNNNNATTAAAGGGC"
> dat[, 'Sequence']
[1] "NNNNNNNNNNATTAAAGGGC"
> str(dat)
'data.frame':   1 obs. of  1 variable:
 $ Sequence: chr "NNNNNNNNNNATTAAAGGGC"
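
If you would rather go the scan route, one way to get rid of the
extra line is to skip it while reading instead of editing the file.
A minimal sketch, again assuming a temp file in place of your
'seq.txt' (the names 'tmp', 'seqs' and 'seqs2' are only for
illustration):

```r
# Hypothetical stand-in for seq.txt
tmp <- tempfile(fileext = ".txt")
writeLines(c("Sequence", "NNNNNNNNNNATTAAAGGGC"), tmp)

# skip = 1 drops the 'Sequence' line; the result is a character vector
seqs <- scan(tmp, what = character(), skip = 1, quiet = TRUE)

# Alternative: read all lines and drop the first one
seqs2 <- readLines(tmp)[-1]

identical(seqs, seqs2)
```

Both give you a plain character vector with one element per sequence
line, which is probably what you want for further processing.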

cu
	Philipp

-- 
Dr. Philipp Pagel
Lehrstuhl für Genomorientierte Bioinformatik
Technische Universität München
Wissenschaftszentrum Weihenstephan
85350 Freising, Germany
http://webclu.bio.wzw.tum.de/~pagel/