[Statlist] Statistics Seminars - Institute of Statistics, University of Neuchâtel

Mon Jun 26 12:17:05 CEST 2006

Statistics Seminars
Institute of Statistics, University of Neuchâtel
Espace de l'Europe 4, Neuchâtel
http://www2.unine.ch/statistics

Tuesday, 4 July 2006, at 11:00
-----------------------------------------------
Prof. Gabriella Schoier, University of Trieste, Italy

Title: 	An algorithm for documents reconstruction

Abstract: In this talk we present a solution to the problem of document reconstruction by considering the MRNM (Modified Recursive Neighbourhood Mean) algorithm [2], an alternative to an algorithm used in social network analysis [1]. The advantage of this approach is a reduced, more flexible structure to which different techniques can be applied. The need to reconstruct documents that have been destroyed by a shredder may arise in various fields, such as forensic and investigative science. In a computer-based reconstruction, the pieces are described by numerical features that represent the visual content of the strips. Usually, the pieces of different pages have been mixed. In our case the data represent a group of paper strips obtained by the virtual destruction of ten text pages (the pages were acquired with a scanner and then virtually shredded with suitable software). The data are organized in a matrix in which each row represents a single strip; from each strip some numerical variables (columns) are extracted [3]. The variables we propose to investigate in this paper describe the distances between two consecutive text lines in the same strip. The starting point for the algorithm is this matrix of distances, built on the basis of the relation (two strips do or do not belong to the same page) among the units (strips) of the network (a set of units and the relation(s) defined over it). After the initialization of the so-called positional variables, which summarise the information given by the matrix of distances, the iterative algorithm is run. The algorithm ends when a stable solution is obtained [2].
The output is a reduction of the network to a set of positional variables, which are then submitted to a cluster analysis in order to obtain clusters of similar strips. An index is proposed to evaluate the quality of the clustering.
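
As a rough illustration of the general idea only, the following Python sketch links strips whose features are close, initialises random positional variables, and repeatedly replaces each strip's values by the mean over itself and its linked strips until the values stop changing. It is not the MRNM algorithm of [2], whose details are not given in this abstract. The function names, the distance threshold, the toy line-spacing features and the simple grouping step are all assumptions made for this example.

```python
import random


def link_strips(features, threshold):
    """Link two strips when their feature vectors are close (assumed rule)."""
    def dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5

    n = len(features)
    return [[i != j and dist(features[i], features[j]) <= threshold
             for j in range(n)] for i in range(n)]


def neighbourhood_mean_positions(adjacency, n_vars=3, tol=1e-9,
                                 max_iter=1000, seed=0):
    """Iterate a neighbourhood mean on random positional variables."""
    rng = random.Random(seed)
    n = len(adjacency)
    # Random initialisation of the positional variables (one row per strip).
    pos = [[rng.random() for _ in range(n_vars)] for _ in range(n)]
    for _ in range(max_iter):
        new_pos = []
        for i in range(n):
            # Average over the strip itself and every strip linked to it.
            group = [i] + [j for j in range(n) if adjacency[i][j]]
            new_pos.append([sum(pos[j][k] for j in group) / len(group)
                            for k in range(n_vars)])
        change = max(abs(a - b)
                     for new_row, old_row in zip(new_pos, pos)
                     for a, b in zip(new_row, old_row))
        pos = new_pos
        if change < tol:  # stable solution reached
            break
    return pos


def group_by_position(pos, eps=1e-6):
    """Give the same label to strips with (almost) equal positional values."""
    labels, representatives = [], []
    for p in pos:
        for label, rep in enumerate(representatives):
            if max(abs(a - b) for a, b in zip(p, rep)) <= eps:
                labels.append(label)
                break
        else:
            representatives.append(p)
            labels.append(len(representatives) - 1)
    return labels


# Toy data: two hypothetical line-spacing features per strip, two "pages".
toy_features = [[10.0, 10.2], [10.1, 10.1], [9.9, 10.0],
                [14.0, 14.1], [14.2, 13.9], [13.9, 14.0]]
toy_links = link_strips(toy_features, threshold=1.0)
toy_positions = neighbourhood_mean_positions(toy_links)
toy_labels = group_by_position(toy_positions)
print(toy_labels)
```

In this toy case the two groups of strips are not linked to each other, so each group settles on its own values. When all strips are connected, running such an averaging to full convergence makes every value identical, so a practical variant would need a different stopping rule followed by a proper cluster analysis of the positional variables.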

References:
[1] Moody J. (2001) Peer Influence Groups: Identifying Dense Clusters in Large Networks. Social Networks.
[2] Schoier G., Melfi G. (2004) A Different Approach for the Analysis of Web Access Logs. In: M. Vichi et al., editors, New Developments in Classification and Data Analysis. Springer.
[3] Ukovich A., Zacchigna A., Ramponi G., Schoier G. (2006) Using clustering for document reconstruction. Proc. Electronic Imaging 2006, IS&T/SPIE's 18th Symposium.

For more info on the Statistical Seminars organised by the 
Institute of Statistics, see http://www2.unine.ch/statistics/page9123.html