[R] convert list to Dataframe

onyourmark william108 at gmail.com
Sun Nov 1 15:38:23 CET 2009


I did this on the source files, which were semicolon-delimited (the
semicolons separate the fields; I am not sure what character marks the start
of a new tweet).

After loading the tm package, I built the corpus:

> txt <- system.file("texts", "txt", package = "tm")
> (twitter <- Corpus(DirSource(txt),
+ readerControl = list(language = "lat")))

Then I removed the English stop words:

> twitter <- tm_map(twitter, removeWords, stopwords("english"))

That last command took about an hour to complete.
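
For getting from semicolon-delimited lines to columns, here is a minimal
sketch in base R. It rests on several assumptions that are not confirmed
anywhere in this thread: that each tweet sits on a single line, that every
record has 14 fields (which is what the sample records appear to have), and
that a semicolon inside the tweet text is escaped as "\;" (as the Beth Orton
record suggests). The column names are hypothetical; the source never names
the fields.

```r
# Two records copied from the post, unwrapped to one tweet per line.
lines <- c(
  paste0("12210;10:47:37;20;10;2009;David_Stringer;",
         "William Hague heading  Washington meets  Gen. Jim Jones, ",
         "Sen. John McCain  others. Will Obama team raise worries  EU ties?;",
         "London, England;United Kingdom;Greater London;Westminster;;",
         "51.5001524;-0.1262362"),
  paste0("547285;06:37:18;21;10;2009;Y_T_;",
         "Playing: Beth Orton &lt\\;Someone's Daughter&gt\\;;",
         "Kanazawa, Japan;Japan;Ishikawa Prefecture;;;",
         "36.5613254;136.6562051")
)

# Hypothetical column names, guessed from the look of the data.
field_names <- c("id", "time", "day", "month", "year", "user", "text",
                 "location", "country", "region", "subregion", "unknown",
                 "lat", "lon")

# Split on every ";" that is NOT preceded by a backslash (assumed escape).
parts <- strsplit(lines, "(?<!\\\\);", perl = TRUE)

# Guard: rbind() would silently recycle short records, so check first.
stopifnot(all(lengths(parts) == length(field_names)))

tweets <- as.data.frame(do.call(rbind, parts), stringsAsFactors = FALSE)
names(tweets) <- field_names

# Undo the assumed escaping and convert the coordinates to numbers.
tweets$text <- gsub("\\;", ";", tweets$text, fixed = TRUE)
tweets$lat  <- as.numeric(tweets$lat)
tweets$lon  <- as.numeric(tweets$lon)

str(tweets)
```

If the tweets really are the elements of the character vector inside the
corpus, the same split could be tried on that vector in place of `lines`.
Records that do not split into 14 fields will trip the stopifnot() check and
would need to be inspected by hand.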



onyourmark wrote:
> 
> Hi. I have a huge list called twitter:
> 
>> dim(twitter)
> NULL
>> str(twitter)
> List of 1
>  $ :Classes 'PlainTextDocument', 'TextDocument', 'character'  atomic
> [1:35575] 11999;10:47:14;20;10;2009;ObamaLouverture;Trails Mixed Lessons
> For Governance From Campaigner-in-chief: President obama jumps  campaign
> 09  tuesday.. http://bit.ly/2eHMaN;Florida;USA;FL;;;27.6648274;-81.5157535
> 12210;10:47:37;20;10;2009;David_Stringer;William Hague heading  Washington 
> meets  Gen. Jim Jones, Sen. John McCain  others. Will Obama team raise
> worries  EU ties?;London, England;United Kingdom;Greater
> London;Westminster;;51.5001524;-0.1262362
> 12355;10:47:53;20;10;2009;Singsabit;RT @Drudge_Report PAPER: Excuses
> wearing thin  Obama, media pals... http://tinyurl.com/yfw6cd9;So.
> California;USA;CA;;;36.778261;-119.4179324
> 12407;10:47:59;20;10;2009;obamavideonews;Obama News Obama   Afghanistan
> troop decision timing (AFP) : AFP - Pres.. http://bit.ly/3KPUr8 #obama
> #video;USA;USA;;;;37.09024;-95.712891 ...
>   .. ..- attr(*, "Author")= chr(0) 
>   .. ..- attr(*, "DateTimeStamp")= POSIXlt[1:9], format: "2009-10-31
> 04:46:56"
>   .. ..- attr(*, "Description")= chr(0) 
>   .. ..- attr(*, "Heading")= chr(0) 
>   .. ..- attr(*, "ID")= chr "1"
>   .. ..- attr(*, "Language")= chr "en"
>   .. ..- attr(*, "LocalMetaData")= list()
>   .. ..- attr(*, "Origin")= chr(0) 
>  - attr(*, "CMetaData")=List of 3
>   ..$ NodeID  : num 0
>   ..$ MetaData:List of 2
>   .. ..$ create_date: POSIXlt[1:9], format: "2009-10-31 04:46:56"
>   .. ..$ creator    : Named chr ""
>   .. .. ..- attr(*, "names")= chr "LOGNAME"
>   ..$ Children: NULL
>   ..- attr(*, "class")= chr "MetaDataNode"
>  - attr(*, "DMetaData")='data.frame':   1 obs. of  1 variable:
>   ..$ MetaID: num 0
>  - attr(*, "class")= chr [1:3] "VCorpus" "Corpus" "list"
> 
> It contains tweets but in many languages. The "columns" are separated by
> semi-colons. I am using the tm package and it is a "corpus".
> 
> It looks like this:
> 
> 547282;06:37:17;21;10;2009;dani_jade18;@Laura_Whyte1   day
> :p;Huddersfield/Lincoln;United
> Kingdom;Kirklees;Kirklees;;53.6468475;-1.7727296
> 547283;06:37:17;21;10;2009;fabiomafra;alguém traz mais lenha pro
> computador da facool? BOM DIA.;Belo Horizonte - MG -
> BR;Brazil;MG;;;-19.8157306;-43.9542226
> 547284;06:37:17;21;10;2009;romanotr;Вау, "Репортеры без границ"
> опубликовали список стран со свободой слова, из 173 Грузия на 81 месте
> опережая Украину. Успехи,успехи...;Portugal
> Aveiro;Portugal;Aveiro;;;40.6411848;-8.6536169
> 547285;06:37:18;21;10;2009;Y_T_;Playing: Beth Orton &lt\;Someone's
> Daughter&gt\;;Kanazawa, Japan;Japan;Ishikawa
> Prefecture;;;36.5613254;136.6562051
> Error: invalid input
> '547286;06:37:18;21;10;2009;Atogey;支持你,国家需要他们,但是国家的未来不能靠他们…RT
> @zuola ￿我觉得 @wenyunc
> 
> I want to convert it to "fields" or columns and so I thought I should
> convert it to a dataframe. I tried
> 
>> twitterDF<-as.data.frame(twitter)
> Error in sort.list(y) : 
>   invalid input
> '547286;06:37:18;21;10;2009;Atogey;支持你,国家需要他们,但是国家的未来不能靠他们…RT
> @zuola ￿我觉得 @wenyunchao
> 一点都不乐观。真正的乐观应该是:你关我又怎么样,反正政治斗争不会丢掉性命,老子出来后更是一条好汉。北风还是舍不得*霸地位、肉、书、女人和网络的,不过牢里不会提供这些。另…;山西,浙江;China;Zhejiang;;;28.695035;119.751054'
> in 'utf8towcs'
>> 
> 
> Can anyone suggest what I can do? 
> 
> P.S. Actually, I would love to remove all the non-English tweets but I
> have no clue about how to do that.
> 
> 
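
On the P.S. about dropping non-English tweets, one very crude heuristic is
to keep only the lines that consist entirely of ASCII bytes. This is a sketch
of that heuristic only, not real language detection: it will also discard
English tweets that contain accented letters or typographic characters, and
it will keep any foreign-language tweet that happens to be written in plain
ASCII. Lines with bytes that are invalid in UTF-8 (the likely trigger of the
'utf8towcs' error above) are dropped as a side effect, since those bytes are
never ASCII.

```r
# Tweet texts taken from the post; \u escapes keep this script ASCII-only.
texts <- c(
  "William Hague heading  Washington meets  Gen. Jim Jones",
  "algu\u00e9m traz mais lenha pro computador da facool? BOM DIA.",
  "\u0412\u0430\u0443, 173"
)

# TRUE where a string holds only bytes in the ASCII range 0x01-0x7F.
# useBytes = TRUE makes the match byte-wise, so invalid multibyte input
# is flagged as non-ASCII rather than raising an error.
is_ascii <- !grepl("[^\\x01-\\x7F]", texts, perl = TRUE, useBytes = TRUE)

# Keep only the ASCII-only lines.
english_like <- texts[is_ascii]
print(english_like)
```

Applied to the raw lines before they go into the corpus, this would also
shrink the input that tm_map() has to process.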




