[BioC] Problems Reading SNPs into snpMatrix

Wed May 11 21:04:38 CEST 2011

rex-bash-3.2$ cat demo.txt # tab deilmited
1110043 rs2847443       TT
1110059 rs2847443       TT
1110060 rs2847443       TT
1110063 rs2847443       TA
1110066 rs2847443       TT
1110067 rs2847443       TT
1110070 rs2847443       TT
rex-bash-3.2$ fg
R214

> ids = read.table("demo.txt", colClasses=c("character", "NULL", "NULL"), h=FALSE)[[1]]
> snp.ids = read.table("demo.txt", colClasses=c("NULL", "character", "NULL"), h=FALSE)[[1]]
> x = read.snps.long("demo.txt", sample.id=ids, snp.id=snp.ids[1], codes="nucleotide", fields=c(sample=1, snp=2, genotype=3), sep="\t")
7 genotypes successfully read
> as(x, "character")
     [,1]
[1,] "B/B"
[2,] "B/B"
[3,] "B/B"
[4,] "A/B"
[5,] "B/B"
[6,] "B/B"
[7,] "B/B"
> sessionInfo()
R version 2.14.0 Under development (unstable) (2011-04-22 r55596)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
[1] C

attached base packages:
[1] splines   stats     graphics  grDevices datasets  utils     methods
[8] base

other attached packages:
[1] snpStats_1.1.15    Matrix_0.999375-50 lattice_0.19-23    survival_2.36-5

loaded via a namespace (and not attached):
[1] grid_2.14.0  tools_2.14.0
Warning message:
'DESCRIPTION' file has 'Encoding' field and re-encoding is not possible


On Wed, May 11, 2011 at 2:38 PM, Chris K. Fuller <chris at genome.ucsf.edu> wrote:
> Hello List,
>
> I'm at a loss to explain why the read.snps.long function in snpStats
> refuses to properly read my SNPs.  I've attempted this several
> different ways without success.  I have 195 samples (individuals) and
> 440,000 SNPs, arranged in the "one call per line" format.  I'm running
> this on Linux Ubuntu, and all R packages are up to date.  I've
> reproduced these cases below.
>
> Example 1:  Reading SNPs with Nucleotide Genotype
>
> With a file like:
> 1110043 rs2847443       TT
> 1110059 rs2847443       TT
> 1110060 rs2847443       TT
> 1110063 rs2847443       TA
> 1110066 rs2847443       TT
> 1110067 rs2847443       TT
> 1110070 rs2847443       TT
>
> I use the command:
> foo = read.snps.long('one_call_per_line.txt',
> sample.id=as.character(sample_id[[1]]),
> snp.id=as.character(snp_id_mod[[1]]), fields = c(sample = 1, snp = 2,
> genotype = 3), codes = "nucleotide", sep='\t', verbose=TRUE,
> in.order=TRUE)
>
> And receive the result:
> Error in read.snps.long("one_call_per_line.txt", sample.id =
> as.character(sample_id[[1]]),  :
> Nucleotide coded genotype should be 2 character string: line 1
>
>
> Example 2:  Perform my own SNP genotype coding and try again
>
> With a file like:
> 110043  rs2847443       3
> 1110059 rs2847443       3
> 1110060 rs2847443       3
> 1110063 rs2847443       2
> 1110066 rs2847443       3
> 1110067 rs2847443       3
> 1110070 rs2847443       3
>
> I use this command:
> foo = read.snps.long('one_call_per_line_coded.txt',
> sample.id=as.character(sample_id[[1]]),
> snp.id=as.character(snp_id_mod[[1]]), fields = c(sample = 1, snp = 2,
> genotype = 3), codes = c("1", "2", "3"), sep='\t', verbose=TRUE,
> in.order=TRUE)
>
> And receive this result:
> Reading SnpMatrix with 195 rows and 443136 columns
>                             Cumulative totals
>                    -----------------------------------
>    File      Line     Accepted   Rejected     No call     Skipped    File name
>     1    9142000          0            0         9141999         0 0
> ...er_line_coded.txt
>
>
> In the first example, it apparently doesn't treat the third column of
> my tab-delimited text file as two characters.  Why?  I also generated
> a file with quotes around each nucleotide (e.g. 'AA'), but this did
> not change the behavior.  In the second example, it seems to treat
> everything as no call.  Why?
>
> I've been at this for far longer than I care to admit, so any insight
> would be greatly appreciated.
>
> Thank you,
>
> Chris Fuller
> chris at genome.ucsf.edu
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>