[R] "Complex?" import of pdf files (criminal records) into R table
"Biedermann, Jürgen"
juergen.biedermann at charite.de
Thu Oct 15 10:43:09 CEST 2009
Hi there,
I need to decide whether several more or less complex pdf files can be
transformed into an R table format automatically, or whether this has to
be done manually. It would be impudent to expect a complete
solution, but I would be grateful if anyone could give me some advice on
what the structure of such an R program could look like, and whether
this is possible in general.
Here is the problem:
Each pdf file belongs to a person. The pdf files actually represent the
anonymous criminal record of a person. Each entry should lead to one row
with the person number as key. The different lines should form the
columns. The criminal record actually looks like this:
---------------------------------------------------
Header with irrelevant text for us | Date: xx.xx.xxxx (relevant for us)
Anonymous person number: xxxxxxxxxxx
Entries in the register
1. xx.xx.1902 -City-
Be in force since: xx.xx.1902
Date of offense:xx.xx.xxxx
Elements of the offence: For example "Rape"
Section in law: §176, §178 Abs. 1
Sentenced to 5 years imprisonment
"Irrelevant text for us"
Accommodation in an forensic psychiatry
Accommodation sentenced on probation
Rest of sentence sentenced on probation until the xx.xx.xxxx
2. xx.xx.1910
Be in force since: ....
.....
-----------------------------------------------------------------------
The problem is that the entries do not always have the same structure.
The first 6 lines are structurally the same in each entry of the
criminal record (each entry has a line for the judgement date, the "be
in force" date, the date of offence, the elements of the offence, the
Sections in law, and the sentence).
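To illustrate the fixed part, here is a minimal sketch in base R. It assumes the text of a pdf file has already been extracted to plain lines by some external tool (for example pdftotext), which is not shown, and that the first six lines of every entry really come in the same order. The sample record, the column names and the function parse_fixed() are made up for illustration.

```r
# Hypothetical text lines of one record, as they might look after a
# pdf-to-text conversion (the real layout may differ).
record <- c(
  "Anonymous person number: 12345678901",
  "Entries in the register",
  "1. 01.02.1902 -City-",
  "Be in force since: 10.02.1902",
  "Date of offense:05.01.1902",
  "Elements of the offence: Rape",
  "Section in law: §176, §178 Abs. 1",
  "Sentenced to 5 years imprisonment",
  "Accommodation sentenced on probation",
  "2. 03.04.1910 -City-",
  "Be in force since: 12.04.1910",
  "Date of offense:20.03.1910",
  "Elements of the offence: Theft",
  "Section in law: §242",
  "Sentenced to 1 year imprisonment"
)

# Person number: the text after the label.
person <- sub("^Anonymous person number:\\s*", "",
              grep("^Anonymous person number:", record, value = TRUE))

# An entry starts with a running number followed by a date ("1. xx.xx.xxxx").
starts <- grep("^[0-9]+\\.\\s+[0-9x]{2}\\.[0-9x]{2}\\.[0-9x]{4}", record)
ends   <- c(starts[-1] - 1, length(record))
entries <- Map(function(s, e) record[s:e], starts, ends)

# Assumption: lines 1 to 6 of an entry always appear in this order.
parse_fixed <- function(entry) {
  data.frame(
    EntryNumber = as.integer(sub("^([0-9]+)\\..*$", "\\1", entry[1])),
    Judg.Date   = sub("^[0-9]+\\.\\s+([0-9.]+).*$", "\\1", entry[1]),
    InForce     = sub("^Be in force since:\\s*", "", entry[2]),
    DateOffen   = sub("^Date of offen[cs]e:\\s*", "", entry[3]),
    Offence     = sub("^Elements of the offen[cs]e:\\s*", "", entry[4]),
    Section     = sub("^Section in law:\\s*", "", entry[5]),
    Sentence    = entry[6],
    stringsAsFactors = FALSE
  )
}

fixed <- cbind(Per.Numb = person,
               do.call(rbind, lapply(entries, parse_fixed)),
               stringsAsFactors = FALSE)
print(fixed)
```

Any lines of an entry beyond the sixth are left in entries[[i]][-(1:6)] and are not handled by this sketch.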
But then, depending on the sentence, further lines appear which state
whether the person was sentenced on probation, whether the probation
was withdrawn again, when the person was released, etc.
So I think these lines should be allocated to different columns
depending on key words. Defining the key words would not be a problem
in most cases. If a certain column is not relevant
for an entry (i.e. its key word does not appear), NA should be put in its place.
But because the entries sometimes (in rare cases) contain spelling
errors, all lines of an entry which could not be
allocated to a column should finally be put in a separate column so that
they can be checked manually.
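The key-word allocation described above could be sketched roughly as follows, again in base R. The key words, the pattern for irrelevant lines, the column names and the function allocate() are all assumptions for illustration. The real list of key words would have to be built from the actual records.

```r
# Hypothetical key words (regular expressions), one per optional column.
keywords <- c(
  Probation.until = "on probation until",
  Probation       = "sentenced on probation$",
  Psychiatry      = "forensic psychiatry"
)

# Lines known to be irrelevant (assumed pattern) are dropped, not flagged.
irrelevant <- "^Irrelevant text"

allocate <- function(extra_lines, keywords, irrelevant) {
  extra_lines <- extra_lines[!grepl(irrelevant, extra_lines)]
  # Start with NA in every column; only matched columns are overwritten.
  out  <- setNames(rep(NA_character_, length(keywords)), names(keywords))
  used <- rep(FALSE, length(extra_lines))
  for (col in names(keywords)) {
    hit <- grepl(keywords[[col]], extra_lines, ignore.case = TRUE) & !used
    if (any(hit)) {
      out[col] <- paste(extra_lines[hit], collapse = " | ")
      used <- used | hit
    }
  }
  # Everything that matched no key word is kept for manual checking.
  rest <- extra_lines[!used]
  out["Not.allocated"] <-
    if (length(rest) > 0) paste(rest, collapse = " | ") else NA_character_
  out
}

# Made-up variable lines of one entry; the last one contains spelling errors.
extra <- c(
  "Irrelevant text for us",
  "Accommodation in forensic psychiatry",
  "Accommodation sentenced on probation",
  "Rest of sentence sentenced on probation until the 01.02.1905",
  "Rest of sentense remited"
)
res <- allocate(extra, keywords, irrelevant)
print(res)
```

Each line is assigned to the first key word it matches, so more specific key words should come first in the list. For the misspelled lines, base R's agrep() does approximate matching and might catch some of them, but the tolerance would need to be tuned on real data.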
In the end, the table should look more or less like this:
--------------------------------------------------
"Per.Numb";"EntryNumber";"Judg.Date";"DateOffen.";...;"Probation.until";
"Released";"Not allocated"
xxxx1 1 xx.xx.1902 xx.xx.1901 ... xx.xx.1905 NA "blablabla"
xxxx1 2 xx.xx.1910 xx.xx.1909 ... NA 1925 "blablabla"
xxxx2 1 xx.xx.1924 xx.xx.1923 ... NA NA "blablabla"
------------------------------------------------------------------
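Putting the rows together could look roughly like this sketch, assuming each entry has already been turned into a one-row data frame that only contains the columns found in that entry. The sample rows, the column list and the helper fill_cols() are made up for illustration, and the output goes to a temporary file.

```r
# Hypothetical one-row data frames, one per entry, with differing columns.
rows <- list(
  data.frame(Per.Numb = "xxxx1", EntryNumber = 1L, Judg.Date = "01.02.1902",
             Probation.until = "01.02.1905", stringsAsFactors = FALSE),
  data.frame(Per.Numb = "xxxx1", EntryNumber = 2L, Judg.Date = "03.04.1910",
             Released = "1925", stringsAsFactors = FALSE),
  data.frame(Per.Numb = "xxxx2", EntryNumber = 1L, Judg.Date = "05.06.1924",
             Not.allocated = "blablabla", stringsAsFactors = FALSE)
)

# Full set of columns the final table should have (assumed names).
all_cols <- c("Per.Numb", "EntryNumber", "Judg.Date",
              "Probation.until", "Released", "Not.allocated")

# Add every missing column as NA and bring the columns into a fixed order.
fill_cols <- function(df, cols) {
  for (m in setdiff(cols, names(df))) df[[m]] <- NA_character_
  df[, cols, drop = FALSE]
}

tab <- do.call(rbind, lapply(rows, fill_cols, cols = all_cols))
print(tab)

# Write a semicolon-separated file; missing values are written as NA.
out_file <- tempfile(fileext = ".csv")
write.table(tab, file = out_file, sep = ";", row.names = FALSE, na = "NA")
```

For many persons, the same steps would be wrapped in a loop (or lapply) over the converted text files, with the rows of all files collected in one list before the final rbind.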
Could anyone help me?
Thanks
Greetings
Jürgen