[R] md5sum issues

Duncan Murdoch <murdoch.duncan using gmail.com>
Wed Feb 3 18:02:47 CET 2021


On 03/02/2021 11:15 a.m., Jeff Newmiller wrote:
> This CR vs LF vs CRLF newline discrepancy has been around since the 70s and the CP/M operating system. And it remains an issue in over-the-wire internet text protocols today, which actually use the CRLF version like Windows. Sorry, UNIX... world domination of LF encoding failed.
> 
> The problem with pretending there is no issue as Duncan is advocating 

That misrepresents my position.  Obviously there's an issue.  I'm 
suggesting a simple solution.

Duncan Murdoch

> is that text is treated differently than binary, and every time you
> pretend it isn't, it comes back to bite you. Applying binary algorithms
> like MD5 to text is one of these areas where your expectation that this
> will be successful is what creates the problem in the first place. A
> similar issue occurs in file encoding: two files may both contain the
> word "Hello", but if they are encoded in UTF-16 and UTF-8 respectively then
> the MD5 results will be different.
> 
> Git does not (currently) support differences in encoding, but it does support text vs non-text (newline) differences because they are unavoidable. Pushing forward with your expectation that text files should compare the same in binary by assuming text will always be like UNIX text just defers the problem for another day.
> 
> Since I don't know what problem you are actually trying to solve, I cannot offer a concrete solution. But I would begin by not assuming that MD5 works the same on text and binary files... because it doesn't.
> 
> On February 3, 2021 2:48:56 AM PST, Duncan Murdoch <murdoch.duncan using gmail.com> wrote:
>> On 03/02/2021 4:42 a.m., Ivan Calandra wrote:
>>> Thank you Ivan and Duncan for your help.
>>>
>>> I understand your point Duncan, but the thing is that I do have an issue
>>> here.
>>> Is it then due to RStudio or even Windows? If it is, I can forget about
>>> a solution on that end, so I would focus on what I can do, and this Git
>>> setting seems to be the best place to start.
>>
>> In my opinion, you should run
>>
>>   git config --global core.autocrlf false
>>
>> in an RStudio terminal session.  That tells git not to convert line
>> endings, so the md5sum values won't change between systems.
>>
>> You should also go to the RStudio options, and in the Code section,
>> Saving tab, set the line ending conversion under Serialization to
>> Posix (LF) and the default text encoding to UTF-8.
>>
>> Unfortunately, RStudio will still mess up the .Rproj file (see
>> https://github.com/rstudio/rstudio/issues/1929); there's not much you
>> can do about that.  Just try not to commit the Windows version to the
>> repository if any non-Windows users are sharing it.
>>
>> But do note that other people have different opinions.  They argue that
>> files should be converted to Windows native format by git.  That works
>> in some narrow use cases, but as soon as you try to extract a file from
>> git on one system and work on it on another, it breaks.
>>
>> Duncan Murdoch
>>
>>
>>>
>>> Or am I missing something (I am still a newbie on these things...)?
>>>
>>> Ivan C
>>>
>>> --
>>> Dr. Ivan Calandra
>>> TraCEr, laboratory for Traceology and Controlled Experiments
>>> MONREPOS Archaeological Research Centre and
>>> Museum for Human Behavioural Evolution
>>> Schloss Monrepos
>>> 56567 Neuwied, Germany
>>> +49 (0) 2631 9772-243
>>> https://www.researchgate.net/profile/Ivan_Calandra
>>>
>>> On 03/02/2021 10:06, Duncan Murdoch wrote:
>>>> On 03/02/2021 2:14 a.m., Ivan Krylov wrote:
>>>>> On Tue, 2 Feb 2021 17:01:05 +0100
>>>>> Ivan Calandra <calandra using rgzm.de> wrote:
>>>>>
>>>>>> This happens to all text-based files (Rmd, MD, CSV...) but not to
>>>>>> non-editable files (PDF, XLSX...).
>>>>>
>>>>> This is probably caused by Git helpfully converting text files from LF
>>>>> (0x0A) line endings to CR LF (0x0D 0x0A) when checking out the
>>>>> repository clone on Windows (and back when checking in).
>>>>>
>>>>> This configuration option is described in Pro Git:
>>>>>
>>>>> https://git-scm.com/book/en/v2/Customizing-Git-Git-Configuration#_core_autocrlf
>>>>>
>>>>
>>>> I agree with Ivan K, but don't agree with the advice in that book.
>>>>
>>>> It's best to just leave files alone, not to convert between LF and
>>>> CR-LF.  I don't think this confuses many Windows editors these days,
>>>> but if your editor forces files into CR-LF form, you should fix the
>>>> editor, not try to work around it.
>>>>
>>>> In my opinion everyone should run
>>>>
>>>>    git config --global core.autocrlf false
>>>>
>>>> Some more arguments for this (in the context of GitHub Actions) are here:
>>>>
>>>> https://github.community/t/git-config-core-autocrlf-should-default-to-false/16140
>>>>
>>>>
>>>> Duncan Murdoch
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>> ______________________________________________
>>> R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>>
>>
>
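
As a rough illustration of the point discussed in this thread, the following
Python sketch hashes the same text with LF and with CR LF line endings, and
the same word in two encodings. It is not taken from any of the messages
above: the helper name md5_hex and the sample strings are made up for the
example, and it uses Python's hashlib rather than R's md5sum, on the
assumption that any MD5 implementation behaves the same way on identical
bytes.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Return the hexadecimal MD5 digest of a byte string (hypothetical helper)."""
    return hashlib.md5(data).hexdigest()


# The same two lines of text, once with LF and once with CR LF line endings.
text_lf = b"Hello\nworld\n"
text_crlf = text_lf.replace(b"\n", b"\r\n")

# The same word stored in two different encodings.
hello_utf8 = "Hello".encode("utf-8")
hello_utf16 = "Hello".encode("utf-16-le")

# MD5 sees only bytes, so each pair hashes differently.
print(md5_hex(text_lf) == md5_hex(text_crlf))        # False
print(md5_hex(hello_utf8) == md5_hex(hello_utf16))   # False

# Converting CR LF back to LF restores the original digest.
restored = text_crlf.replace(b"\r\n", b"\n")
print(md5_hex(restored) == md5_hex(text_lf))         # True
```

The last comparison is why leaving line endings untouched (or converting
them back consistently) keeps checksums stable across systems.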


