[R] EOF within quoted string
Mohan.Radhakrishnan at cognizant.com
Fri Aug 11 10:58:28 CEST 2017
Yes, I tried that already. It was not straightforward.
data <- read.csv("20_newsgroups.csv", fill = TRUE, as.is = TRUE, header = FALSE, quote = "", sep = ",", encoding = "UTF-8")
This line does read the file, but haphazardly. The emails in the text column are split across multiple columns, and several columns contain nothing but NA. There are 202 columns in total.
I then removed the all-NA columns and concatenated the text pieces, which finally gave me the result I wanted.
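# Keep only the columns that are not entirely NA
munged <- data[, vapply(data, function(x) !all(is.na(x)), logical(1))]
# Drop the first row (the header line, since header = FALSE was used)
munged <- munged[-1, ]
# Paste columns 3 onwards back together into one text field per row
munged$text <- apply(munged[, 3:ncol(munged)], 1, paste0, collapse = " ")
munged <- munged[, c("V1", "V2", "text")]
print(head(munged$text))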
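An alternative that avoids the column splitting altogether is to read the raw lines and split each one at its first two commas only, so that commas inside the text are kept. The following is a minimal sketch, not what was done above. It assumes that each record sits on a single physical line and that only the third field contains commas; neither is confirmed for 20_newsgroups.csv. The sample lines and the names sample_lines, raw, m and parsed are made up for illustration.

```r
# Hypothetical stand-in for the real file: a header plus two records,
# the first with extra commas and a stray double quote in the text.
sample_lines <- c(
  "id,newsgroup,text",
  '1,rec.sport.baseball,Re: Cubs, Marlins, and a stray " quote',
  "2,sci.space,no commas here"
)
tmp <- tempfile(fileext = ".csv")
writeLines(sample_lines, tmp)

# Read the raw lines and drop the header line
raw <- readLines(tmp, encoding = "UTF-8")[-1]

# Match: field 1, field 2, then everything after the second comma
m <- regmatches(raw, regexec("^([^,]*),([^,]*),(.*)$", raw))

# Element 1 of each match is the whole line; 2 to 4 are the captured fields
parsed <- data.frame(
  V1   = vapply(m, `[`, character(1), 2),
  V2   = vapply(m, `[`, character(1), 3),
  text = vapply(m, `[`, character(1), 4),
  stringsAsFactors = FALSE
)
print(parsed$text)
```

Unlike pasting the split columns back together with a space, this keeps the original commas in the text. Lines that do not match the pattern end up as NA in all three columns, so they are easy to spot.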
Mohan
From: Adams, Jean [mailto:jvadams at usgs.gov]
Sent: Thursday, August 10, 2017 8:03 PM
To: Radhakrishnan, Mohan (Cognizant) <Mohan.Radhakrishnan at cognizant.com>
Cc: R help <r-help at r-project.org>
Subject: Re: [R] EOF within quoted string
You might want to try some of the suggestions mentioned in this post: https://stackoverflow.com/q/17414776/2140956
Jean
On Thu, Aug 10, 2017 at 7:59 AM, <Mohan.Radhakrishnan at cognizant.com> wrote:
Hi,
Reading http://ssc.wisc.edu/~ahanna/20_newsgroups.csv after downloading it using
data <- read.csv("20_newsgroups.csv",header=TRUE)
throws this.
Warning message:
In scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
EOF within quoted string
So, for example, the first line in the file is this. This column contains only such text. Is there a way to read it?
From: cubbie at garnet.berkeley.edu () Subject: Re: Cubs behind Marlins? How? Article-I.D.: agate.1pt592$f9a Organization: University of California, Berkeley Lines: 12 NNTP-Posting-Host: garnet.berkeley.edu gajarsky at pilot.njin.net writes: morgan and guzman will have era's 1 run higher than last year, and the cubs will be idiots and not pitch harkey as much as hibbard. castillo won't be good (i think he's a stud pitcher) This season so far, Morgan and Guzman helped to lead the Cubs at top in ERA, even better than THE rotation at Atlanta. Cubs ERA at 0.056 while Braves at 0.059. We know it is early in the season, we Cubs fans have learned how to enjoy the short triumph while it is still there.
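For context, this warning typically appears when a double quote opens a quoted field and no closing quote follows before the end of the file. The sketch below reproduces that on a small made-up file and shows that quote = "" makes read.csv treat the quote as an ordinary character. It is only an illustration under that assumption: the file contents and the names rows, warns, bad and ok are hypothetical, and whether this is exactly what happens in 20_newsgroups.csv has not been checked.

```r
# Ask for English messages so the warning text matches the one quoted above
Sys.setenv(LANGUAGE = "en")

# Hypothetical file: eight records, the seventh opens a quote and never closes it
rows <- sprintf("%d,group%d,plain text %d", 1:8, 1:8, 1:8)
rows[7] <- '7,group7,"an opening quote that is never closed'
tmp <- tempfile(fileext = ".csv")
writeLines(c("id,group,text", rows), tmp)

# Default quoting: collect every warning instead of printing it
warns <- character(0)
bad <- withCallingHandlers(
  read.csv(tmp, header = TRUE, stringsAsFactors = FALSE),
  warning = function(w) {
    warns <<- c(warns, conditionMessage(w))
    invokeRestart("muffleWarning")
  }
)
print(warns)

# Quoting disabled: the double quote is kept as part of the text
ok <- read.csv(tmp, header = TRUE, quote = "", stringsAsFactors = FALSE)
print(dim(ok))
```

With quoting disabled, every comma becomes a field separator, so text that itself contains commas is split across columns. That is the trade-off behind the 202 columns reported in the reply.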
Thanks,
Mohan
This e-mail and any files transmitted with it are for the sole use of the intended recipient(s) and may contain confidential and privileged information. If you are not the intended recipient(s), please reply to the sender and destroy all copies of the original message. Any unauthorized review, use, disclosure, dissemination, forwarding, printing or copying of this email, and/or any action taken in reliance on the contents of this e-mail is strictly prohibited and may be unlawful. Where permitted by applicable law, this e-mail and other e-mail communications sent to and from Cognizant e-mail addresses may be monitored.
[[alternative HTML version deleted]]
______________________________________________
R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.