[R] Problem with filling dataframe's column

Wed Jun 14 01:24:45 CEST 2023

Bert,

I stand corrected. What I said may once have been true, but the implementation has apparently changed at some level.

I did not factor that in.

Nevertheless, whether you use an index as a key or as an offset into an attached vector of labels, it seems to work the same, and I think my comment applies well enough: changing a few labels instead of scanning lots of entries can sometimes be a good thing. As far as I can tell, the external interface seems the same for now.
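To make the point about changing a few labels concrete, here is a minimal sketch. The data and the names x and f are made up for illustration and are not from the original question; it only assumes base R.

```r
# Hypothetical data: a character vector with many repeated values
x <- rep(c("low", "mid", "high"), times = 100000)
f <- factor(x)

# Relabel via the levels: only the 3 level labels are compared and changed
levels(f)[levels(f) == "mid"] <- "medium"

# The same change on the character vector compares every one of the 300000 elements
x[x == "mid"] <- "medium"

# Both approaches end up with the same values
identical(as.character(f), x)
```

Whether the factor route is actually faster or smaller for a given data set is something to measure, not assume.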

One issue with R for a long time was how they did not do something more like a Python dictionary and it looks like …

ABOVE

From: Bert Gunter <bgunter.4567 using gmail.com> 
Sent: Tuesday, June 13, 2023 6:15 PM
To: avi.e.gross using gmail.com
Cc: javad bayat <j.bayat194 using gmail.com>; R-help using r-project.org
Subject: Re: [R] Problem with filling dataframe's column

Below.

On Tue, Jun 13, 2023 at 2:18 PM <avi.e.gross using gmail.com> wrote:
>
>  
> Javad,
>
> There may be nothing wrong with the methods people are showing you and if it satisfied you, great.
>
> But I note you have lots of data in over a quarter million rows. If much of the text data is redundant, and you want to simplify some operations such as changing some of the values to others in multiple ways, have you done any learning about an R feature very useful for dealing with categorical data called "factors"?
>
> If you have a vector or a column in a data.frame that contains text, then it can be replaced by a factor that often takes way less space as it stores a sort of dictionary of all the unique values and just records numbers like 1,2,3 to tell which one each item is.

-- This is false. It used to be true a **long time ago**, but R has for quite a while used hashing/global string tables to avoid this problem. See here <https://stackoverflow.com/questions/50310092/why-does-r-use-factors-to-store-characters>  for details/references.
As a result, I think many would argue that working with strings *as strings,* not factors, is often a better default, though of course there are still situations where factors are useful (e.g. in ordering results by factor levels where the desired level order is not alphabetical).
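As a small illustration of the ordering case, here is a sketch in base R. The vector sizes and the chosen level order are invented for the example.

```r
# Hypothetical values whose natural order is not alphabetical
sizes <- c("small", "large", "medium", "small", "large")

# Sorting the strings gives alphabetical order: large, large, medium, small, small
sort(sizes)

# A factor with explicit levels sorts by level order instead
f <- factor(sizes, levels = c("small", "medium", "large"))
by_size <- as.character(sort(f))
by_size  # small, small, medium, large, large
```

The same idea carries over to ordering the rows of a data frame with order() on a factor column.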

**I would appreciate correction/clarification if my claims are wrong or misleading!**

In any case, please do check such claims before making them on this list.

Cheers,
Bert
