[R] read.delim skips first column (why?)
Petr PIKAL
petr.pikal at precheza.cz
Tue Jul 14 11:34:22 CEST 2009
Hi
> str(read.table("test.txt", header=T))
'data.frame': 9 obs. of 12 variables:
$ snp : Factor w/ 9 levels "rs1113188","rs1113397",..: 9 5 7 8 3 4 6 1 2
$ gene : Factor w/ 1 level "TRP2": 1 1 1 1 1 1 1 1 1
$ chromosome : int 3 3 3 3 3 3 3 3 3
Reading files into R can sometimes be tricky. If read.delim fails, I would
recommend trying read.table, which makes fewer assumptions, and setting its
parameters (header, sep, dec, ...) explicitly until your file is read correctly.
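
One way to apply this advice is sketched below. It assumes, hypothetically, that each data row ends with a stray tab, so the data rows have one more field than the header. In that situation read.table is documented to use the first column as row names, which would look like the shift described in the quoted message. The miniature file and the names path, n_fields, shifted, hdr, n_extra and fixed are made up for illustration and are not taken from the original file.

```r
# Hypothetical miniature of the problem file: the header has 3 fields,
# but every data row ends with a stray tab and therefore has 4 fields.
path <- tempfile(fileext = ".txt")
writeLines(c("snp_id\tgene\tchromosome",
             "rs2129081\tRAPT2\t3\t",
             "rs1202698\tRAPT2\t3\t"), path)

# Count the tab-separated fields on each line; the header and the data
# rows disagree when stray separators are present.
n_fields <- count.fields(path, sep = "\t")
print(n_fields)

# Default behaviour: the header is one field short, so the first column
# is used as row names and the values no longer line up with the names.
shifted <- read.table(path, header = TRUE, sep = "\t")
print(rownames(shifted))

# Workaround (an assumption, not the thread's confirmed fix): read the
# header separately, pad it with placeholder names for the surplus
# fields, and pass every parameter explicitly.
hdr <- scan(path, what = "", sep = "\t", nlines = 1, quiet = TRUE)
n_extra <- max(n_fields) - length(hdr)
fixed <- read.table(path, header = FALSE, skip = 1, sep = "\t", dec = ".",
                    col.names = c(hdr, sprintf("extra%d", seq_len(n_extra))),
                    stringsAsFactors = FALSE)
fixed <- fixed[seq_along(hdr)]  # drop the empty placeholder columns
str(fixed)
```

If count.fields reports the same number of fields on every line, the cause is probably something else (quoting, for example) and this workaround does not apply.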
Regards
Petr
r-help-bounces at r-project.org wrote on 14.07.2009 11:11:10:
> Hi,
> I have uploaded a copy of the file here:
> - http://pastebin.com/fd0edfab
>
> The file has also been passed through the Unix command-line tool unexpand,
> but that doesn't solve the problem.
>
> Using head=TRUE instead of head=T also has the same effect.
>
> the output of print(names) is:
> > print(names(ngly), quote=TRUE)
> [1] "snp" "gene"
> [3] "chromosome" "distance_from_gene_center"
> [5] "position" "ame"
> [7] "csasia" "easia"
> [9] "eur" "mena"
> [11] "oce" "ssafr"
> [13] "X" "X.1"
> [15] "X.2"
>
> Thank you to everyone who replied to my email address, but I haven't been
> able to solve the problem yet.
>
>
> On Tue, Jul 14, 2009 at 12:36 AM, jim holtman <jholtman at gmail.com> wrote:
>
> > Can you send your file as an attachment since it is impossible to see
> > where the separator characters are.
> >
> > On Mon, Jul 13, 2009 at 1:27 PM, Giovanni Marco
> > Dall'Olio<dalloliogm at gmail.com> wrote:
> > > Hi people,
> > > I have a text file like this one posted:
> > >
> > > > snp_id    gene  chromosome distance_from_gene_center position   pop1      pop2      pop3      pop4      pop5      pop6      pop7
> > > > rs2129081 RAPT2 3          -129993                   "upstream" 0.439009  1.169210  NA        0.233020  0.093042  NA        -0.902596
> > > > rs1202698 RAPT2 3          -128695                   "upstream" NA        1.815000  NA        0.399079  1.814270  1.382950  NA
> > > > rs1163207 RAPT2 3          -128224                   "upstream" NA        NA        NA        NA        NA        NA        NA
> > > > rs1834127 RAPT2 3          -128106                   "upstream" NA        NA        NA        NA        NA        NA        2.180670
> > > > rs2114211 RAPT2 3          -126738                   "upstream" -0.468279 -1.447620 NA        0.010616  -0.414581 NA        0.550447
> > > > rs2113151 RAPT2 3          -124620                   "upstream" -0.897660 -1.971020 NA        -0.920327 -0.764658 NA        0.337127
> > > > rs2524130 RAPT2 3          -123029                   "upstream" -0.109795 -0.004646 -0.412059 1.116740  0.667567  -0.924529 0.962841
> > > > rs1381318 RAPT2 3          -12818                    "upstream" -0.911662 -1.791580 NA        -0.945716 -1.239640 NA        0.004876
> > > > rs2113319 RAPT2 3          -122028                   "upstream" -0.911662 -1.738610 NA        -0.945716 -1.240950 NA        -0.005318
> > >
> > > > When I use read.delim (or any read function) on it, R skips the first
> > > > column, and I don't understand why.
> > >
> > > For example:
> > > $: R
> > >> data = read.delim('snp_file.txt', head=T, sep='\t')
> > >
> > > > Now, I would expect data$snp_id to contain snp ids, and data$gene to
> > > > contain gene names; but it is not like this:
> > >
> > >> data$snp_id
> > > [1] RAPT2 RAPT2 RAPT2 RAPT2 RAPT2 RAPT2 RAPT2 RAPT2 RAPT2
> > > Levels: RAPT2
> > >> data$gene
> > > [1] 3 3 3 3 3 3 3 3 3
> > >
> > >> summary(data)
> > > snp_id gene chromosome distance_from_gene_center
> > > RAPT2:9 Min. :3 Min. :-129993 upstream:9
> > > 1st Qu.:3 1st Qu.:-128224
> > > Median :3 Median :-126738
> > > Mean :3 Mean :-113806
> > > 3rd Qu.:3 3rd Qu.:-123029
> > > Max. :3 Max. : -12818
> > > ....
> > >
> > >> data$pop7
> > > [1] NA NA NA NA NA NA NA NA NA
> > >
> > >
> > > > Notice that it did use snp_id as the header for the first column, but
> > > > it completely skips all the data from that column, and all the fields
> > > > are shifted, so the last column is filled with NA values.
> > >
> > > > What am I doing wrong? Could it be a problem with my data files? I have
> > > > tried to modify them a bit (adding new columns, etc.), but it didn't work.
> > >
> > > I am running R from an Ubuntu system:
> > >> sessionInfo()
> > > R version 2.9.1 (2009-06-26)
> > > i486-pc-linux-gnu
> > >
> > > > locale:
> > > > LC_CTYPE=it_IT.UTF-8;LC_NUMERIC=C;LC_TIME=it_IT.UTF-8;LC_COLLATE=it_IT.UTF-8;LC_MONETARY=C;LC_MESSAGES=it_IT.UTF-8;LC_PAPER=it_IT.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=it_IT.UTF-8;LC_IDENTIFICATION=C
> > >
> > > attached base packages:
> > > [1] stats graphics grDevices utils datasets methods base
> > >
> > >
> > >
> > >
> > > --
> > > Giovanni Dall'Olio, phd student
> > > Department of Biologia Evolutiva at CEXS-UPF (Barcelona, Spain)
> > >
> > > My blog on bioinformatics: http://bioinfoblog.it
> > >
> > > [[alternative HTML version deleted]]
> > >
> > > ______________________________________________
> > > R-help at r-project.org mailing list
> > > https://stat.ethz.ch/mailman/listinfo/r-help
> > > > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> > > and provide commented, minimal, self-contained, reproducible code.
> > >
> >
> >
> >
> > --
> > Jim Holtman
> > Cincinnati, OH
> > +1 513 646 9390
> >
> > What is the problem that you are trying to solve?
> >