[R] help with read.csv() for files with different number of columns

Wed Aug 30 00:55:08 CEST 2017

> On Aug 29, 2017, at 2:59 PM, Jim Lemon <drjimlemon at gmail.com> wrote:
> 
> Hi Ace,
> You can just read the file first to find out:
> 
> max_fields<-function(file,sep=" ") {
> rlines<-readLines(file)
> return(max(lengths(strsplit(rlines,sep,fixed=TRUE))))
> }
> nmax<-max_fields("test.txt","\t")
> 
> Jim

Or just:

 table( count.fields( file_name, sep="\t" ) )

You may need to experiment with the 'comment.char' and 'quote' arguments to investigate the impact of unmatched single quotes or octothorpes (#) in the raw data.
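
A minimal sketch of that kind of investigation, using a made-up four-line tab-separated file (the file contents and the variable names are assumptions for illustration, not Ace's data). Line 2 contains a '#', and lines 3 and 4 each contain one apostrophe. According to the count.fields help page, a quoted string that spans lines gives NA for the line where it starts, so the default call is expected to look odd on those lines, while the call with quoting and comments disabled should report three fields everywhere.

```r
# Hypothetical tab-separated file: line 2 contains '#',
# lines 3 and 4 each contain a single apostrophe.
tmp <- tempfile(fileext = ".txt")
writeLines(c("a\tb\tc",
             "d\te#f\tg",
             "h\tit's\tj",
             "k\tl's\tm"), tmp)

# Defaults: quote = "\"'" and comment.char = "#".
n_default <- count.fields(tmp, sep = "\t")

# Quoting and comment handling both disabled.
n_plain <- count.fields(tmp, sep = "\t", quote = "", comment.char = "")

# Only comment handling left on, to isolate the effect of '#'.
n_comment <- count.fields(tmp, sep = "\t", quote = "")

# Compare the three results side by side.
print(rbind(default = n_default, plain = n_plain, comment = n_comment))
```

If the counts only become uniform once quote="" or comment.char="" is set, the same settings are probably needed in the read.table() call as well.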

Then you can isolate the aberrant lines with `which` applied to the vector that `count.fields` returns.
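
Putting the pieces together, here is a rough sketch on a small made-up file (the data, `tmp`, `typical` and `odd` are hypothetical names, not part of the original question). It assumes tab separators and switches quoting and comments off. The read.table help page says the number of columns is taken from the first five lines of input unless col.names is longer, which is presumably why longer records get wrapped onto extra rows; supplying max(n) column names avoids that.

```r
# Hypothetical stand-in for test.txt: tab-separated, ragged rows.
tmp <- tempfile(fileext = ".txt")
writeLines(c("g1\tA\tB",
             "g2\tC",
             "g3\tD\tE\tF\tG\tH\tI",
             "g4\tJ"), tmp)

# Field count per line; quoting and comments are switched off so that
# apostrophes and '#' are treated as ordinary characters.
n <- count.fields(tmp, sep = "\t", quote = "", comment.char = "")
print(table(n))

# Lines whose field count differs from the most common count.
typical <- as.integer(names(which.max(table(n))))
odd <- which(n != typical)
print(odd)

# Read every record as one row by naming max(n) columns up front.
testdf <- read.table(tmp, header = FALSE, sep = "\t", fill = TRUE,
                     quote = "", comment.char = "",
                     col.names = paste0("V", seq_len(max(n))),
                     stringsAsFactors = FALSE)
print(dim(testdf))
```

This way the record lengths do not have to be known ahead of time; they are measured from the file itself before it is read.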

-- 
David.
> 
> On Wed, Aug 30, 2017 at 2:22 AM, Fix Ace <acefix at rocketmail.com> wrote:
>> Thank you very much! Looks like I have to know the length of each record
>> ahead of time.
>> 
>> Ace
>> 
>> 
>> On Monday, August 28, 2017 12:56 AM, Jim Lemon <drjimlemon at gmail.com> wrote:
>> 
>> 
>> Hi Ace,
>> With tabs as separators:
>> 
>> testdf<-read.table("test.txt",header=FALSE,fill=TRUE,sep="\t",
>> col.names=paste("V",1:19,sep=""),stringsAsFactors=FALSE)
>> 
>> Also note that I got the number of columns wrong the first time.
>> 
>> Jim
>> 
>> 
>> On Mon, Aug 28, 2017 at 12:56 PM, Fix Ace <acefix at rocketmail.com> wrote:
>>> Hi, Jim,
>>> 
>>> Thank you very much for pointing out the format issue. Here is the
>>> original text:
>>> 
>>> ===
>>> I have a text file (test.txt) with different number of columns:
>>> 
>>> 0610007P14Rik%%% Tcf19 Gtf2i
>>> 0610010O12Rik%%% Ivns1abp Etv6
>>> 1100001G20Rik%%% Nmi
>>> 1500015O10Rik%%% Foxi1 Ascl3 Sirt3
>>> 1700003E16Rik%%% Ascl2 Ifnar2
>>> 1700028J19Rik%%% Musk Nfe2l3
>>> 1810011O10Rik%%% Ppp1r13b Bpnt1 Cdkn2c Foxc1 Sox10 Smarca2
>>> 1810019D21Rik%%% Asb8
>>> 1810037I17Rik%%% Zfp612
>>> 1810055G02Rik%%% Nkx2-3 Maged1 Runx1 Ugp2 Elk4 Spdef Tcf19 Isl2 Gtf2i
>>> Ctnnbl1 Tcea3 Ank2 Zfp612 Creb3l1 Nupr1 3632451O06Rik Creb3l4 Lass6
>>> 
>>> I would like to read it into R using
>>> 
>>>> test=read.csv("test.txt",sep="\t",header=FALSE)
>>> 
>>> However, when I check the R object "test", I find that all the rows have
>>> 5 columns:
>>> 
>>>> test
>>>                 V1            V2      V3    V4      V5
>>> 1  0610007P14Rik%%%        Tcf19  Gtf2i
>>> 2  0610010O12Rik%%%      Ivns1abp    Etv6
>>> 3  1100001G20Rik%%%          Nmi
>>> 4  1500015O10Rik%%%        Foxi1  Ascl3  Sirt3
>>> 5  1700003E16Rik%%%        Ascl2  Ifnar2
>>> 6  1700028J19Rik%%%          Musk  Nfe2l3
>>> 7  1810011O10Rik%%%      Ppp1r13b  Bpnt1 Cdkn2c  Foxc1
>>> 8            Sox10      Smarca2
>>> 9  1810019D21Rik%%%          Asb8
>>> 10 1810037I17Rik%%%        Zfp612
>>> 11 1810055G02Rik%%%        Nkx2-3  Maged1  Runx1    Ugp2
>>> 12            Elk4        Spdef  Tcf19  Isl2  Gtf2i
>>> 13          Ctnnbl1        Tcea3    Ank2 Zfp612 Creb3l1
>>> 14            Nupr1 3632451O06Rik Creb3l4  Lass6
>>> 
>>> Basically it breaks some rows into more than one row. For example, row 7
>>> in the original record becomes two rows. It looks like "test" always has 5
>>> columns.
>>> 
>>> How does this happen? How should I fix it so that each record becomes one
>>> row in the R object?
>>> 
>>> ==
>>> 
>>> Please let me know if it is readable now. Thank you very much for your time!
>>> 
>>> Kind regards,
>>> 
>>> Ace
>>> 
>>> 
>>> On Sunday, August 27, 2017 7:25 PM, Jim Lemon <drjimlemon at gmail.com>
>>> wrote:
>>> 
>>> 
>>> Hi Ace,
>>> As your example seems to have spaces as separators,
>>> 
>>> testdf<-read.table("test.txt",header=FALSE,fill=TRUE,
>>> col.names=paste("V",1:14,sep=""),stringsAsFactors=FALSE)
>>> 
>>> By specifying the number of columns with "col.names" and using
>>> "fill=TRUE" you can get a data frame with zero length strings where
>>> values are missing in the input file.
>>> 
>>> Jim
>>> 
>>> On Mon, Aug 28, 2017 at 6:25 AM, Fix Ace via R-help
>>> <r-help at r-project.org> wrote:
>>>> Dear R community,
>>>> I have a text file (test.txt) with different number of columns:
>>>> 0610007P14Rik%%% Tcf19 Gtf2i 0610010O12Rik%%% Ivns1abp Etv6
>>>> 1100001G20Rik%%% Nmi 1500015O10Rik%%% Foxi1 Ascl3 Sirt3 1700003E16Rik%%%
>>>> Ascl2 Ifnar2 1700028J19Rik%%% Musk Nfe2l3 1810011O10Rik%%% Ppp1r13b Bpnt1
>>>> Cdkn2c Foxc1 Sox10 Smarca2 1810019D21Rik%%% Asb8 1810037I17Rik%%% Zfp612
>>>> 1810055G02Rik%%% Nkx2-3 Maged1 Runx1 Ugp2 Elk4 Spdef Tcf19 Isl2 Gtf2i
>>>> Ctnnbl1 Tcea3 Ank2 Zfp612 Creb3l1 Nupr1 3632451O06Rik Creb3l4 Lass6
>>>> I would like to read it into R using
>>>>> test=read.csv("test.txt",sep="\t",header=FALSE)
>>>> However, when I check the R object "test", I find that all the rows have
>>>> 5 columns:
>>>>> test                V1            V2      V3    V4      V51
>>>>> 0610007P14Rik%%%        Tcf19  Gtf2i              2  0610010O12Rik%%%
>>>>> Ivns1abp    Etv6              3  1100001G20Rik%%%          Nmi
>>>>> 4  1500015O10Rik%%%        Foxi1  Ascl3  Sirt3        5
>>>>> 1700003E16Rik%%%
>>>>> Ascl2  Ifnar2              6  1700028J19Rik%%%          Musk  Nfe2l3
>>>>> 7  1810011O10Rik%%%      Ppp1r13b  Bpnt1 Cdkn2c  Foxc18            Sox10
>>>>> Smarca2                      9  1810019D21Rik%%%          Asb8
>>>>> 10 1810037I17Rik%%%        Zfp612                      11
>>>>> 1810055G02Rik%%%
>>>>> Nkx2-3  Maged1  Runx1    Ugp212            Elk4        Spdef  Tcf19
>>>>> Isl2
>>>>> Gtf2i13          Ctnnbl1        Tcea3    Ank2 Zfp612 Creb3l114
>>>>> Nupr1 3632451O06Rik Creb3l4  Lass6
>>>> Basically it breaks some rows into more than one row. For example, row 7
>>>> in the original record becomes two rows. It looks like "test" always has
>>>> 5 columns.
>>>> How does this happen? How should I fix it so that each record becomes one
>>>> row in the R object?
>>>> Thank you very much!
>>>> Ace
>>> 
>>>> 
>>>>       [[alternative HTML version deleted]]
>>>> 
>>>> ______________________________________________
>>>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>> PLEASE do read the posting guide
>>>> http://www.R-project.org/posting-guide.html
>>>> and provide commented, minimal, self-contained, reproducible code.
>>> 
>>> 
>> 
>> 
> 

David Winsemius
Alameda, CA, USA

'Any technology distinguishable from magic is insufficiently advanced.'   -Gehm's Corollary to Clarke's Third Law