[R] convert list to Dataframe

Duncan Murdoch murdoch at stats.uwo.ca
Sun Nov 1 15:06:23 CET 2009


On 01/11/2009 7:43 AM, onyourmark wrote:
> Hi. I have a huge list called twitter:

It's a list, but more importantly it's a VCorpus and a Corpus.  You 
should use the functions appropriate to those classes to extract the 
strings making up the data, declare their encoding properly (or convert 
them to your native encoding), then use read.delim() on a textConnection 
to read them in.
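A minimal sketch of that approach (assuming the corpus is named `twitter`, that the raw text is UTF-8, and that the documents can be coerced with as.character() -- all assumptions, check against your tm version):

```r
library(tm)  # the corpus comes from the tm package

## Pull the raw strings out of the corpus, one element per data record
txt <- unlist(lapply(twitter, as.character), use.names = FALSE)

## Declare the encoding so R interprets the bytes correctly...
Encoding(txt) <- "UTF-8"
## ...or instead convert to the native encoding, escaping unconvertible bytes:
## txt <- iconv(txt, from = "UTF-8", to = "", sub = "byte")

## Read the semicolon-separated fields through a textConnection
twitterDF <- read.delim(textConnection(txt), sep = ";", header = FALSE,
                        quote = "", stringsAsFactors = FALSE)
```

Setting quote = "" matters here, since tweets can contain stray quotation marks; if some records have a different number of semicolon-separated fields, adding fill = TRUE to read.delim() will pad the short ones.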

Duncan Murdoch

> 
>> dim(twitter)
> NULL
>> str(twitter)
> List of 1
>  $ :Classes 'PlainTextDocument', 'TextDocument', 'character'  atomic
> [1:35575] 11999;10:47:14;20;10;2009;ObamaLouverture;Trails Mixed Lessons For
> Governance From Campaigner-in-chief: President obama jumps  campaign 09 
> tuesday.. http://bit.ly/2eHMaN;Florida;USA;FL;;;27.6648274;-81.5157535
> 12210;10:47:37;20;10;2009;David_Stringer;William Hague heading  Washington 
> meets  Gen. Jim Jones, Sen. John McCain  others. Will Obama team raise
> worries  EU ties?;London, England;United Kingdom;Greater
> London;Westminster;;51.5001524;-0.1262362
> 12355;10:47:53;20;10;2009;Singsabit;RT @Drudge_Report PAPER: Excuses wearing
> thin  Obama, media pals... http://tinyurl.com/yfw6cd9;So.
> California;USA;CA;;;36.778261;-119.4179324
> 12407;10:47:59;20;10;2009;obamavideonews;Obama News Obama   Afghanistan
> troop decision timing (AFP) : AFP - Pres.. http://bit.ly/3KPUr8 #obama
> #video;USA;USA;;;;37.09024;-95.712891 ...
>   .. ..- attr(*, "Author")= chr(0) 
>   .. ..- attr(*, "DateTimeStamp")= POSIXlt[1:9], format: "2009-10-31
> 04:46:56"
>   .. ..- attr(*, "Description")= chr(0) 
>   .. ..- attr(*, "Heading")= chr(0) 
>   .. ..- attr(*, "ID")= chr "1"
>   .. ..- attr(*, "Language")= chr "en"
>   .. ..- attr(*, "LocalMetaData")= list()
>   .. ..- attr(*, "Origin")= chr(0) 
>  - attr(*, "CMetaData")=List of 3
>   ..$ NodeID  : num 0
>   ..$ MetaData:List of 2
>   .. ..$ create_date: POSIXlt[1:9], format: "2009-10-31 04:46:56"
>   .. ..$ creator    : Named chr ""
>   .. .. ..- attr(*, "names")= chr "LOGNAME"
>   ..$ Children: NULL
>   ..- attr(*, "class")= chr "MetaDataNode"
>  - attr(*, "DMetaData")='data.frame':   1 obs. of  1 variable:
>   ..$ MetaID: num 0
>  - attr(*, "class")= chr [1:3] "VCorpus" "Corpus" "list"
> 
> It contains tweets but in many languages. The "columns" are separated by
> semi-colons. I am using the tm package and it is a "corpus".
> 
> It looks like this:
> 
> 547282;06:37:17;21;10;2009;dani_jade18;@Laura_Whyte1   day
> :p;Huddersfield/Lincoln;United
> Kingdom;Kirklees;Kirklees;;53.6468475;-1.7727296
> 547283;06:37:17;21;10;2009;fabiomafra;alguém traz mais lenha pro computador
> da facool? BOM DIA.;Belo Horizonte - MG -
> BR;Brazil;MG;;;-19.8157306;-43.9542226
> 547284;06:37:17;21;10;2009;romanotr;Вау, "Репортеры без границ" опубликовали
> список стран со свободой слова, из 173 Грузия на 81 месте опережая Украину.
> Успехи,успехи...;Portugal Aveiro;Portugal;Aveiro;;;40.6411848;-8.6536169
> 547285;06:37:18;21;10;2009;Y_T_;Playing: Beth Orton &lt\;Someone's
> Daughter&gt\;;Kanazawa, Japan;Japan;Ishikawa
> Prefecture;;;36.5613254;136.6562051
> Error: invalid input
> '547286;06:37:18;21;10;2009;Atogey;æ”¯æŒä½ ï¼Œå›½å®¶éœ€è¦ä»–ä»¬ï¼Œä½†æ˜¯å›½å®¶çš„æœªæ¥ä¸èƒ½é ä»–ä»¬â€¦RT
> @zuola ￿我觉得 @wenyunc
> 
> I want to split it into fields (columns), so I thought I should
> convert it to a data frame. I tried
> 
>> twitterDF<-as.data.frame(twitter)
> Error in sort.list(y) : 
>   invalid input
> '547286;06:37:18;21;10;2009;Atogey;æ”¯æŒä½ ï¼Œå›½å®¶éœ€è¦ä»–ä»¬ï¼Œä½†æ˜¯å›½å®¶çš„æœªæ¥ä¸èƒ½é ä»–ä»¬â€¦RT
> @zuola ￿我觉得 @wenyunchao
> ä¸€ç‚¹éƒ½ä¸ä¹è§‚ã€‚çœŸæ­£çš„ä¹è§‚åº”è¯¥æ˜¯ï¼šä½ å…³æˆ‘åˆæ€Žä¹ˆæ ·ï¼Œåæ­£æ”¿æ²»æ–—äº‰ä¸ä¼šä¸¢æŽ‰æ€§å‘½ï¼Œè€å­å‡ºæ¥åŽæ›´æ˜¯ä¸€æ¡å¥½æ±‰ã€‚åŒ—é£Žè¿˜æ˜¯èˆä¸å¾—*霸地位、肉、书、女人和网络的,不过牢里不会提供这些。另…;山西,浙江;China;Zhejiang;;;28.695035;119.751054'
> in 'utf8towcs'
> 
> Can anyone suggest what I can do? 
> 
> P.S. Actually, I would love to remove all the non-English tweets but I have
> no clue about how to do that.
>
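On the P.S. about dropping non-English tweets: base R has no language detector, but a crude heuristic (shown purely as a sketch, assuming `txt` holds one tweet record per element) is to keep only strings that are pure ASCII. Note this also discards English tweets that happen to contain accented characters or symbols:

```r
## useBytes = TRUE makes grepl work on the raw bytes, sidestepping the
## very "invalid input ... in 'utf8towcs'" error seen above
ascii_only <- !grepl("[^\x01-\x7f]", txt, useBytes = TRUE)
txt <- txt[ascii_only]
```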




More information about the R-help mailing list