[R] How to read a file containing two types of rows - (for the Netflix challenge data format)
Chris Evans
chris using psyctc.org
Fri Jan 31 11:39:48 CET 2020
I am sure Rainer's approach is good, and I know my R programming is truly terrible, but here's a crude script in base R that does what you want:
# rawDat <- readLines(con = "netflix.dat")
fil <- tempfile(fileext = ".dat")
cat("*1:*
value1,value2, value3
value1,value2, value3
value1,value2, value3
value1,value2, value3
*2:*
value1,value2, value3
value1,value2, value3
*3:*
value1,value2, value3
value1,value2, value3
value1,value2, value3
*4:*",
file = fil,
sep = "\n")
rawDat <- readLines(fil, n = -1)
unlink(fil) # tidy up data input
### create a data frame for output
### this first line will be overwritten by the actual data
outDF <- as.data.frame(list(id = 1L,
                            value1 = "",
                            value2 = "",
                            value3 = ""),
                       stringsAsFactors = FALSE) # necessary to avoid mess with character to factor conversion
id <- NA_integer_ # current ID, set whenever an ID line is found
j <- 0 # counter for entries
for (i in seq_along(rawDat)) {
  rawDat[i] <- trimws(rawDat[i])
  if (nchar(rawDat[i]) == 0) next # skip empty lines
  if (grepl(":*", rawDat[i], fixed = TRUE)) {
    ### got an ID line: keep the digits and store them as an integer
    id <- as.integer(sub("\\*([0-9]*):\\*", "\\1", rawDat[i]))
  } else {
    ### not an ID line so one of the one or more following lines of data
    ### I have assumed these are all of the same form
    j <- j + 1
    rawDat[i] <- gsub(" ", "", rawDat[i], fixed = TRUE)
    tmpDat <- unlist(strsplit(rawDat[i], ","))
    outDF[j, 1] <- id
    outDF[j, 2:4] <- tmpDat
  }
}
outDF
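For comparison, here is a rough vectorised sketch in base R that avoids the explicit loop. It rests on assumptions that are not in the original post: the input starts with an ID line, every data line has exactly three comma-separated fields, and ID lines look like "*1:*" (or plain "1:" as in the raw Kaggle files). The names isID, block and outDF2 are just illustrative.

```r
# Small toy input in the same shape as the example above
rawDat <- c("*1:*",
            "value1,value2, value3",
            "value1,value2, value3",
            "*2:*",
            "value1,value2, value3",
            "*3:*")

rawDat <- trimws(rawDat)
rawDat <- rawDat[nzchar(rawDat)]             # drop empty lines

# TRUE for ID lines such as "*1:*" or "1:" (assumed pattern)
isID <- grepl("^\\*?[0-9]+:\\*?$", rawDat)
ids  <- as.integer(gsub("[^0-9]", "", rawDat[isID]))

# cumsum(isID) counts the ID lines seen so far, so for each data line
# it gives the index of the ID block that the line belongs to
# (assumes the first line is an ID line, so the index is never 0)
block <- cumsum(isID)

# parse all data lines in one go; strip.white removes the stray spaces
vals <- read.csv(text = rawDat[!isID], header = FALSE,
                 col.names = c("value1", "value2", "value3"),
                 strip.white = TRUE, stringsAsFactors = FALSE)

outDF2 <- cbind(id = ids[block[!isID]], vals)
outDF2
```

As in the loop version, an ID with no data lines after it (here "*3:*") simply produces no rows.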
I am slowly adapting to the tidyverse, but this is something I still find easier to do with a very crude for loop in base R.
Plea: my formal programming training is one week of "Introduction to FORTRAN" on teletypes in 1975, but I confess it's
both lack of formal training _and_ lack of native ability that means my coding is so bad.
If any gurus have a moment, show us really elegant and tidyverse ways to do this!
Very best all,
Chris
----- Original Message -----
> From: "Rainer M Krug" <Rainer using krugs.de>
> To: "Emmanuel Levy" <emmanuel.levy using gmail.com>
> Cc: "R-help Mailing List" <r-help using r-project.org>
> Sent: Friday, 31 January, 2020 10:55:46
> Subject: Re: [R] How to read a file containing two types of rows - (for the Netflix challenge data format)
> I did something similar yesterday…
>
> Use readLines() to read it in and identify the "*1:*", … lines with a regex. Then you
> have your dividers. In a second step, use read.csv(skip = …, nrows = …) to
> read the enclosed blocks, and last, combine them accordingly.
>
> This is written without an R installation, so the argument names are likely
> wrong.
>
> Rainer
>
>
>> On 31 Jan 2020, at 10:04, Emmanuel Levy <emmanuel.levy using gmail.com> wrote:
>>
>> Hi,
>>
>> I'd like to use the Netflix challenge data and just can't figure out how to
>> efficiently "scan" the files.
>> https://www.kaggle.com/netflix-inc/netflix-prize-data
>>
>> The files have two types of row, either an *ID* e.g., "1:" , "2:", etc. or
>> 3 values associated to each ID:
>>
>> The format is as follows:
>> *1:*
>> value1,value2, value3
>> value1,value2, value3
>> value1,value2, value3
>> value1,value2, value3
>> *2:*
>> value1,value2, value3
>> value1,value2, value3
>> *3:*
>> value1,value2, value3
>> value1,value2, value3
>> value1,value2, value3
>> *4:*
>> etc ...
>>
>> And I want to create a matrix where each line is of the form:
>>
>> ID value1, value2, value3
>>
>> So "ID" needs to be duplicated - I could write a Perl script to convert
>> this format to CSV, but I'm sure there's a simple R trick.
>>
>> Thanks for suggestions!
>>
>> Emmanuel
>>
>> ______________________________________________
>> R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>
> --
> Rainer M. Krug, PhD (Conservation Ecology, SUN), MSc (Conservation Biology,
> UCT), Dipl. Phys. (Germany)
>
> Orcid ID: 0000-0002-7490-0066
>
> Department of Evolutionary Biology and Environmental Studies
> University of Zürich
> Office Y34-J-74
> Winterthurerstrasse 190
> 8075 Zürich
> Switzerland
>
> Office: +41 (0)44 635 47 64
> Cell: +41 (0)78 630 66 57
> email: Rainer.Krug using uzh.ch
> Rainer using krugs.de
> Skype: RMkrug
>
> PGP: 0x0F52F982
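Rainer's two-step idea quoted above (find the divider lines, then read each enclosed block and combine) could be sketched roughly as follows. This is only an illustration under assumptions: there are no blank lines in the file, every data line has three fields, and ID lines match the assumed pattern "*1:*" or "1:". The names divider, nData and outDF3 are made up for the example.

```r
# Write a small toy file in the same shape as the example
fil <- tempfile(fileext = ".dat")
writeLines(c("*1:*",
             "value1,value2, value3",
             "value1,value2, value3",
             "*2:*",
             "value1,value2, value3",
             "*3:*"), fil)

lines <- readLines(fil)

# Step 1: line numbers of the ID (divider) lines and the IDs themselves
divider <- grep("^\\*?[0-9]+:\\*?$", trimws(lines))
ids <- as.integer(gsub("[^0-9]", "", lines[divider]))

# number of data lines between each divider and the next one (or the end)
nData <- diff(c(divider, length(lines) + 1)) - 1

# Step 2: read each non-empty block with read.csv(skip = , nrows = )
blocks <- lapply(which(nData > 0), function(k) {
  blk <- read.csv(fil, header = FALSE,
                  skip = divider[k], nrows = nData[k],
                  col.names = c("value1", "value2", "value3"),
                  strip.white = TRUE, stringsAsFactors = FALSE)
  cbind(id = ids[k], blk)
})

# Step 3: combine the blocks
outDF3 <- do.call(rbind, blocks)
unlink(fil)
outDF3
```

Note that this reopens the file once per block, so on the full Netflix files it would probably be much slower than parsing all data lines in a single call.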
--
Chris Evans <chris using psyctc.org> Visiting Professor, University of Sheffield <chris.evans using sheffield.ac.uk>
I do some consultation work for the University of Roehampton <chris.evans using roehampton.ac.uk> and other places
but <chris using psyctc.org> remains my main Email address. I have a work web site at:
https://www.psyctc.org/psyctc/
and a site I manage for CORE and CORE system trust at:
http://www.coresystemtrust.org.uk/
I have "semigrated" to France, see:
https://www.psyctc.org/pelerinage2016/semigrating-to-france/
That page will also take you to my blog which started with earlier joys in France and Spain!
If you want to book to talk, I am trying to keep that to Thursdays and my diary is at:
https://www.psyctc.org/pelerinage2016/ceworkdiary/
Beware: French time, generally an hour ahead of UK.