[R] removing non-table lines

Mon Sep 19 10:23:11 CEST 2022

Hi Nick,

Here's one way to do it. It is based on the heuristic that you keep
each line with the correct number of fields. The correct number of
fields is automatically determined by the number of fields in the last
line of the file.

Here are the contents of a sample csv file, "tmp3.csv", where the first
few rows contain random garbage text.

Collecting numpy
Downloading numpy-1.23.1.tar.gz (10.7 MB)
━━━━━━━━━━━━━━━━━━━━━━━
10.7/10.7 MB 14.8 MB/s eta 0:00:00
Installing build dependencies: started
Installing build dependencies: finished with status 'error'
2003-07-25,100,100
2003-07-28,102.51597849244192,100
2003-07-29,102.85076595312975,102.51597849244192
2003-07-30,101.62321193060768,102.85076595312975
2003-07-31,102.008724764127,101.62321193060768

Here is my R code to process this file

library(readr)

g <- function(s) { length(unlist(strsplit(s,","))) }

processFile <- function(filename) {
    a <- readr::read_lines(filename)
    N <- g(a[length(a)]) ## get the "correct" number of columns from the last row
    a2 <- sapply(a, \(x) strsplit(x,","))   ## split each row into its comma-separated fields
    names(a2) <- NULL
    iV <- unlist(lapply(a2, function(x) {g(x) == N}))   ## identify the rows with exactly N fields
    a2 <- a2[iV] ## heuristic: keep only the lines with N fields
    a3 <- as.data.frame(do.call(rbind,a2))
    a3
}

myDf <- processFile("tmp3.csv")
print(myDf)
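
For comparison, here is a rough sketch of the same heuristic in base R
only, using count.fields() to count the fields on each line. The name
keepTableLines() is made up for illustration, and I am assuming the
file has no quoted fields that span several lines.

```r
## Sketch only: keep the lines whose field count matches the last line.
## keepTableLines() is a hypothetical helper, not part of any package.
keepTableLines <- function(filename, sep = ",") {
    a <- readLines(filename)
    ## one count per line; quotes and comment characters are treated as
    ## ordinary text, and blank lines are counted so indices stay aligned
    n <- count.fields(filename, sep = sep, quote = "",
                      comment.char = "", blank.lines.skip = FALSE)
    stopifnot(length(n) == length(a))   ## guard against misaligned counts
    N <- n[length(n)]                   ## "correct" number of fields
    keep <- !is.na(n) & n == N          ## lines with exactly N fields
    read.csv(text = a[keep], header = FALSE, sep = sep)
}
```

Calling keepTableLines("tmp3.csv") on the sample above should give the
five data rows, with the numeric columns already converted to numbers
because read.csv() does the type conversion.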

HTH,
Eric

On Mon, Sep 19, 2022 at 2:26 AM <avi.e.gross using gmail.com> wrote:
>
> Adding to what Nick said, extra lines like those described often are in some comment format like beginning with "#" or some consistent characters that can be filtered out using comment.char='#' for example in read.csv() or comment="string" in the tidyverse function read_csv().
>
> And, of course you can skip lines if that makes sense albeit it can be tricky with header lines.
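
A small sketch of both suggestions, using made-up inline data. The "#"
prefix and the number of lines to skip are assumptions for this
example, not something known about the real files.

```r
## Made-up sample: two comment lines, a header line, two data rows
txt <- c("# station: somewhere",
         "# units: mm",
         "date,rain",
         "2003-07-25,1.2",
         "2003-07-28,0")

## Option 1: treat everything from "#" to the end of the line as a comment
d1 <- read.csv(text = txt, comment.char = "#")

## Option 2: skip a known number of leading lines; the header is read
## from the first line after the skipped ones
d2 <- read.csv(text = txt, skip = 2)
```

Both calls should give the same two-row data frame with columns date
and rain.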
>
> -----Original Message-----
> From: R-help <r-help-bounces using r-project.org> On Behalf Of Rui Barradas
> Sent: Sunday, September 18, 2022 6:19 PM
> To: Nick Wray <nickmwray using gmail.com>; r-help using r-project.org
> Subject: Re: [R] removing non-table lines
>
> Hello,
>
> Unfortunately there are many files with a non-tabular data section followed by the data. R's read.table has a skip argument:
>
> skip
> integer: the number of lines of the data file to skip before beginning to read data.
>
> If you do not know how many lines to skip because it's not always the same number, here are some ideas.
>
> Is there a pattern in the initial section? Maybe an end-of-section line, or maybe the text lines come in a specified order and the last line in that order can be detected with a regex.
>
> Is there a pattern in the tables' column headers? Once again a regex might be the solution.
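
If the header line can be recognised, the regex idea might look like
the sketch below. readAfterHeader() is a made-up name, and the pattern
has to be chosen after looking at the real files; nothing here is
known about the actual Met Office layout.

```r
## Sketch only: find the header line with a regex and read from there.
## readAfterHeader() is a hypothetical helper for illustration.
readAfterHeader <- function(filename, headerPattern) {
    a <- readLines(filename)
    hdr <- grep(headerPattern, a)[1]    ## first line matching the pattern
    if (is.na(hdr)) stop("no header line found in ", filename)
    read.csv(text = a[hdr:length(a)])   ## header line gives the column names
}
```

For example, readAfterHeader("somefile.csv", "^date,") would drop every
line before the first one that starts with "date,", however many there
are.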
>
> Is the number of initial lines variable because there are file versions?
> If there are, did the versions evolve over time, a frequent case?
>
> What you describe is not infrequent. It's always a nuisance and error-prone, but it should be solvable once patterns are found. Inspect a small number of files with a text editor and try to find both common points and differences. That's halfway to a solution.
>
> Hope this helps,
>
> Rui Barradas
>
> At 20:39 on 18/09/2022, Nick Wray wrote:
> > Hello - I am having to download lots of rainfall and temperature data
> > in csv form from the UK Met Office.  The data isn't a problem - it's
> > in nice columns and can be read into R easily - the problem is that in
> > each csv there are 60 or so lines of information first which are not
> > part of the columnar data.  If I read the whole csv into R the column
> > data is no longer in columns but in some disorganised form - if I
> > manually delete all the text lines above and download I get a nice
> > neat data table.  As the text lines can't be identified in R by line
> > numbers etc I can't find a way of deleting them in R and atm have to
> > do it by hand which is slow.  It might be possible to write a
> > complicated and dirty algorithm to rearrange the meteorological data
> > back into columns but I suspect that it might be hard to get right and consistent across every csv sheet and any errors
> > might be hard to spot.   I can't find anything on the net about this - has
> > anyone else had to deal with this problem and if so do they have any
> > solutions using R?
> > Thanks Nick Wray
> >
> > ______________________________________________
> > R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide
> > http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.