[R] bibtex::read.bib -- extracting bibentry keys

Michael Friendly friendly at yorku.ca
Mon Aug 6 19:39:09 CEST 2012


On 8/6/2012 11:54 AM, Achim Zeileis wrote:
> On Mon, 6 Aug 2012, Michael Friendly wrote:
>
>> I have two versions of a bibtex database which have gotten badly out 
>> of sync. I need to find all the entries in bib2 which are not 
>> contained in bib1, according to their bibtex keys. But I can't figure 
>> out how to extract a list of the bibentry keys in these databases.
>
> read.bib() returns a "bibentry" object so you can simply do this as usual
> for "bibentry" objects with $key:
One thing that was confusing was that read.bib returns a "bibentry"
object, all of whose elements are also "bibentry" objects.
>
> x <- read.bib(...)
> x$key
>
> or maybe
>
> unlist(x$key)
>
> Whatever is more convenient for you. See ?bibentry for more details.
That is what I was missing -- it would have helped to find a link to
utils::bibentry in the [rather scanty] documentation for read.bib.
I'm now a happy camper in this regard. What I wanted is given by:

bib1 <- read.bib("C:/localtexmf/bibtex/bib/timeref.bib")
length(bib1)
keys1 <- unlist(bib1$key)

bib2 <- read.bib("W:/texmf/bibtex/bib/timeref.bib")
length(bib2)
keys2 <- unlist(bib2$key)


 > which(! keys1 %in% keys2)
[1] 133 249 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628
 > keys1[which(! keys1 %in% keys2)]
[1] "Langren:1646" "Fisher:1915a" "Stigler:2012"
[4] "Wainer:2011" "Minard:1860a" "CNAM:1906"
[7] "Wainer:2012" "Wainer-Ramsay:2010" "Stephenson-Galneder:1969"
[10] "Waters:1964" "Agathe:1988" "Gascoigne:2007"
[13] "Krzywinski:2009" "Bolle:1929" "Balbi:1829"
[16] "Bills-Li:2005" "Lewi:2006" "Fletcher:1851"
[19] "Perrot:1976"
 >

As a side note, I searched extensively for bibtex tools that would help
me resolve the differences between two related bibtex files, but none was
as simple as this, once I could get the keys. Thanks to Roman for
providing this infrastructure!

So, ignoring for now differences in the contents of the bibentries, a 
useful tool for my purpose is bibdiff(),

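bibdiff <- function(bib1, bib2) {
    # Extract the BibTeX keys from each "bibentry" object
    keys1 <- unlist(bib1$key)
    keys2 <- unlist(bib2$key)
    # Keys present in one database but not in the other
    only1 <- keys1[!keys1 %in% keys2]
    only2 <- keys2[!keys2 %in% keys1]
    cat("Only in bib1:\n")
    print(only1)
    cat("Only in bib2:\n")
    print(only2)
    # Return both sets invisibly so they can be reused
    invisible(list(only1 = only1, only2 = only2))
}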

 > bibdiff(bib1, bib2)
Only in bib1:
[1] "Langren:1646" "Fisher:1915a" "Stigler:2012"
[4] "Wainer:2011" "Minard:1860a" "CNAM:1906"
[7] "Wainer:2012" "Wainer-Ramsay:2010" "Stephenson-Galneder:1969"
[10] "Waters:1964" "Agathe:1988" "Gascoigne:2007"
[13] "Krzywinski:2009" "Bolle:1929" "Balbi:1829"
[16] "Bills-Li:2005" "Lewi:2006" "Fletcher:1851"
[19] "Perrot:1976"
Only in bib2:
[1] "Langren:1644" "Quetelet:1842"
 >

which gives me the complete answer, as far as it goes.
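
For anyone who wants to try the key comparison without a .bib file at hand, here is a minimal sketch using utils::bibentry directly. The helper make_bib() and the keys and titles are hypothetical placeholders, not entries from my database; bibtype "Misc" is used only because it has no required fields. It uses setdiff() instead of %in%, which gives the same keys but drops duplicates, so it is a variant of the approach above rather than the same function.

```r
library(utils)  # bibentry() lives in utils, which is attached by default

# Hypothetical helper: build a small "bibentry" object from a vector of keys.
# "Misc" entries have no required fields, so placeholder titles suffice.
make_bib <- function(keys) {
  entries <- lapply(keys, function(k)
    bibentry(bibtype = "Misc", key = k, title = paste("Placeholder for", k)))
  do.call(c, entries)  # c() combines bibentry objects into one
}

bibA <- make_bib(c("a:2001", "b:2002", "c:2003"))
bibB <- make_bib(c("b:2002", "c:2003", "d:2004"))

# $key returns a list with one key per entry; unlist() flattens it
keysA <- unlist(bibA$key)
keysB <- unlist(bibB$key)

setdiff(keysA, keysB)  # keys only in bibA
setdiff(keysB, keysA)  # keys only in bibB
```

If duplicated keys matter (they would in a database that has drifted), the %in% form keeps them, and anyDuplicated(keysA) is a quick check.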

>
>> A minor question: Is there some way to prevent read.bib from ignoring 
>> entries that do not contain all required fields?
>
> Also not really an issue with read.bib itself. read.bib() wants to 
> return a "bibentry" object but bibentry() just allows to create 
> objects that are valid BibTeX, i.e., have all required fields.
>
It turns out that read.bib seems to be pickier than bibtex itself -- it
does not accommodate crossref= fields, used for InCollection items; these
resolve correctly using bibtex. For some books in my database, the
publisher is unknown. bibtex generates warnings (I think) and does include
the references. It would be nicer if there were an argument to read.bib,
e.g., strict = TRUE/FALSE, where strict = FALSE would allow entries not
containing all required fields. But perhaps that's buried too deep in the
implementation.

 > bib1 <- read.bib("C:/localtexmf/bibtex/bib/timeref.bib")
ignoring entry 'Donoho-etal:1988' (line 40) because :
A bibentry of bibtype ‘InCollection’ has to correctly specify the field(s): booktitle

ignoring entry 'Martonne:1919:map' (line 90) because :
A bibentry of bibtype ‘InCollection’ has to correctly specify the field(s): booktitle, publisher, year

ignoring entry 'Touraine:2002' (line 5423) because :
A bibentry of bibtype ‘Book’ has to correctly specify the field(s): publisher

ignoring entry 'Cotes:1722' (line 6004) because :
A bibentry of bibtype ‘Book’ has to correctly specify the field(s): publisher

ignoring entry 'Quetelet:1842' (line 6605) because :
A bibentry of bibtype ‘Book’ has to correctly specify the field(s): publisher

ignoring entry 'Wenzlick:1950' (line 6663) because :
A bibentry of bibtype ‘Unpublished’ has to correctly specify the field(s): note

ignoring entry 'Verniquet:1791' (line 6695) because :
A bibentry of bibtype ‘Book’ has to correctly specify the field(s): publisher

 > length(bib1)
[1] 628
 >

>> A suggestion: it would be nice if bibtex provided some extractor 
>> functions for bibentry fields.
>
> So that only a subset of fields is read as opposed to all fields?
>
> If you read all fields, you can easily subset afterwards (again using 
> $-notation).

No, it was only the lack of documentation, and perhaps of an example or
two for read.bib, that caused me to stumble.
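
For the archives, the $-notation for fields can be sketched like this. The two entries are made-up placeholders built with utils::bibentry (again "Misc", which has no required fields); with a real database the object would come from read.bib instead.

```r
library(utils)  # bibentry() lives in utils, which is attached by default

# Two placeholder entries; keys, titles and years are illustrative only
b <- c(
  bibentry(bibtype = "Misc", key = "x:2001",
           title = "First placeholder", year = "2001"),
  bibentry(bibtype = "Misc", key = "y:2002",
           title = "Second placeholder", year = "2002")
)

# For an object with several entries, $field returns a list with one
# element per entry; unlist() flattens it to a character vector
years  <- unlist(b$year)
titles <- unlist(b$title)

# Logical indexing keeps the result a "bibentry" object
recent <- b[years == "2002"]
recent$key
```

Note that unlist() silently drops entries in which the field is missing, so the resulting vector can be shorter than length(b) for optional fields.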
>
> hth,
> Z


-- 
Michael Friendly     Email: friendly AT yorku DOT ca
Professor, Psychology Dept.
York University      Voice: 416 736-2100 x66249 Fax: 416 736-5814
4700 Keele Street    Web:   http://www.datavis.ca
Toronto, ONT  M3J 1P3 CANADA


