[BioC] Problems Reading SNPs into snpMatrix
Vincent Carey
stvjc at channing.harvard.edu
Wed May 11 21:04:38 CEST 2011
rex-bash-3.2$ cat demo.txt # tab deilmited
1110043 rs2847443 TT
1110059 rs2847443 TT
1110060 rs2847443 TT
1110063 rs2847443 TA
1110066 rs2847443 TT
1110067 rs2847443 TT
1110070 rs2847443 TT
rex-bash-3.2$ fg
R214
> ids = read.table("demo.txt", colClasses=c("character", "NULL", "NULL"), h=FALSE)[[1]]
> snp.ids = read.table("demo.txt", colClasses=c("NULL", "character", "NULL"), h=FALSE)[[1]]
> x = read.snps.long("demo.txt", sample.id=ids, snp.id=snp.ids[1], codes="nucleotide", fields=c(sample=1, snp=2, genotype=3), sep="\t")
7 genotypes successfully read
> as(x, "character")
[,1]
[1,] "B/B"
[2,] "B/B"
[3,] "B/B"
[4,] "A/B"
[5,] "B/B"
[6,] "B/B"
[7,] "B/B"
> sessionInfo()
R version 2.14.0 Under development (unstable) (2011-04-22 r55596)
Platform: x86_64-unknown-linux-gnu (64-bit)
locale:
[1] C
attached base packages:
[1] splines stats graphics grDevices datasets utils methods
[8] base
other attached packages:
[1] snpStats_1.1.15 Matrix_0.999375-50 lattice_0.19-23 survival_2.36-5
loaded via a namespace (and not attached):
[1] grid_2.14.0 tools_2.14.0
Warning message:
'DESCRIPTION' file has 'Encoding' field and re-encoding is not possible
On Wed, May 11, 2011 at 2:38 PM, Chris K. Fuller <chris at genome.ucsf.edu> wrote:
> Hello List,
>
> I'm at a loss to explain why the read.snps.long function in snpStats
> refuses to properly read my SNPs. I've attempted this several
> different ways without success. I have 195 samples (individuals) and
> 440,000 SNPs, arranged in the "one call per line" format. I'm running
> this on Linux Ubuntu, and all R packages are up to date. I've
> reproduced these cases below.
>
> Example 1: Reading SNPs with Nucleotide Genotype
>
> With a file like:
> 1110043 rs2847443 TT
> 1110059 rs2847443 TT
> 1110060 rs2847443 TT
> 1110063 rs2847443 TA
> 1110066 rs2847443 TT
> 1110067 rs2847443 TT
> 1110070 rs2847443 TT
>
> I use the command:
> foo = read.snps.long('one_call_per_line.txt',
> sample.id=as.character(sample_id[[1]]),
> snp.id=as.character(snp_id_mod[[1]]), fields = c(sample = 1, snp = 2,
> genotype = 3), codes = "nucleotide", sep='\t', verbose=TRUE,
> in.order=TRUE)
>
> And receive the result:
> Error in read.snps.long("one_call_per_line.txt", sample.id =
> as.character(sample_id[[1]]), :
> Nucleotide coded genotype should be 2 character string: line 1
>
>
> Example 2: Perform my own SNP genotype coding and try again
>
> With a file like:
> 110043 rs2847443 3
> 1110059 rs2847443 3
> 1110060 rs2847443 3
> 1110063 rs2847443 2
> 1110066 rs2847443 3
> 1110067 rs2847443 3
> 1110070 rs2847443 3
>
> I use this command:
> foo = read.snps.long('one_call_per_line_coded.txt',
> sample.id=as.character(sample_id[[1]]),
> snp.id=as.character(snp_id_mod[[1]]), fields = c(sample = 1, snp = 2,
> genotype = 3), codes = c("1", "2", "3"), sep='\t', verbose=TRUE,
> in.order=TRUE)
>
> And receive this result:
> Reading SnpMatrix with 195 rows and 443136 columns
> Cumulative totals
> -----------------------------------
> File Line Accepted Rejected No call Skipped File name
> 1 9142000 0 0 9141999 0 0
> ...er_line_coded.txt
>
>
> In the first example, it apparently doesn't treat the third column of
> my tab-delimited text file as two characters. Why? I also generated
> a file with quotes around each nucleotide (e.g. 'AA'), but this did
> not change the behavior. In the second example, it seems to treat
> everything as no call. Why?
>
> I've been at this for far longer than I care to admit, so any insight
> would be greatly appreciated.
>
> Thank you,
>
> Chris Fuller
> chris at genome.ucsf.edu
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>
More information about the Bioconductor
mailing list