A benchmark dataset for NER models and noisy OCR predictions has been created based on our corpus of directories and is freely available:
8,765 reference entries from a selection of directories were manually corrected and annotated with 34,242 entities, an average of 3.9 entities per entry. The reference entity tags were then projected onto the OCR predictions produced by three different OCR systems. This resulted in a variable loss of entries:
The resulting intersection of the sets of valid entries contains 7,725 entries for the three OCR systems (and the reference), or 8,341 entries if we consider PERO OCR and Tesseract only.
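The projection and intersection steps can be illustrated with a small Python sketch. The page does not say how the projection was implemented, so the character alignment via `difflib.SequenceMatcher` used here is only an assumption. The function names, the entity labels (`PER`, `LOC`), the system names and the toy entries are all hypothetical. An entry is treated as valid for an OCR system when every reference entity can be aligned with at least one character of that system's output, and the final set is the intersection of the valid sets.

```python
from difflib import SequenceMatcher


def project_span(reference, ocr, start, end):
    """Map the character span [start, end) of `reference` onto `ocr`.

    Returns a (start, end) pair in `ocr`, or None when no character of the
    span could be aligned (the entity is lost).
    """
    matcher = SequenceMatcher(None, reference, ocr, autojunk=False)
    mapping = {}
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            mapping[block.a + k] = block.b + k
    hits = [mapping[i] for i in range(start, end) if i in mapping]
    if not hits:
        return None
    return min(hits), max(hits) + 1


def valid_entries(reference_entries, ocr_entries):
    """Return the ids of entries whose every entity projects onto the OCR text."""
    valid = set()
    for entry_id, (text, entities) in reference_entries.items():
        ocr_text = ocr_entries.get(entry_id, "")
        if all(project_span(text, ocr_text, s, e) is not None
               for s, e, _label in entities):
            valid.add(entry_id)
    return valid


# Toy data invented for illustration; not taken from the dataset.
reference_entries = {
    "e1": ("Dupont, rue Neuve 12", [(0, 6, "PER"), (8, 20, "LOC")]),
    "e2": ("Martin, quai Voltaire 3", [(0, 6, "PER")]),
}
ocr_outputs = {
    "system_a": {"e1": "Dup0nt, rue Neuve 12", "e2": "Martin, quai Voltaire 3"},
    "system_b": {"e1": "Dupont, rue Neuve 12", "e2": ""},
}

valid_by_system = {
    name: valid_entries(reference_entries, entries)
    for name, entries in ocr_outputs.items()
}
# Entries usable for every system at once: the intersection of the valid sets.
common = set.intersection(*valid_by_system.values())
print(sorted(common))
```

In this toy case the second system returns an empty string for `e2`, so that entry drops out of the intersection. This mirrors why the entry count is lower for three systems than for two.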
This dataset has been used in the work described in the article:
.
The dataset is described in more detail and can be downloaded from the Zenodo repository: .
A benchmark dataset for Nested Named Entity Recognition has been created using the initial dataset for NER models (see Benchmark dataset for NER models used on noisy OCR predictions at the top of this page):
This dataset was used in the following work presented at ICDAR 2023:
.
The dataset is fully described and can be downloaded from Zenodo: .
A benchmark dataset for historical maps segmentation has been created and used to organise a competition at the ICDAR 2021 conference: .
Ref. Annuaires historiques parisiens, 1798-1914. Extraction structurée et géolocalisée à l’adresse des listes nominatives [par ordre alphabétique et par activité] dans les volumes numérisés, SoDUCo Team, V.4 - November 2023, Paris DataSet Nakala:
For more information about this dataset, see the 2nd session of the SoDUCo-BnF seminar: Relocating addresses from commerce directories, Paris, XIXth century. A corpus of urban locations at a large scale, 10 November 2022, Paris, Bibliothèque nationale de France. See the session program and related presentations.
An approach was proposed to create a geohistorical knowledge graph that would enable the evolution of shops and businesses in Paris to be tracked over time, based on named entities extracted from trade directories and addresses extracted from old maps processed as part of the project. A first knowledge graph about the activities related to photography has been constructed and published on the Web of Data. Among other things, it can be used to answer the following questions:
This geohistorical knowledge graph can be queried either through a cartographic interface or through a SPARQL endpoint.
It can also be downloaded from the GitHub repository related to the article describing the geohistorical knowledge graph creation process.
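As a rough illustration of querying the SPARQL endpoint mentioned above, the sketch below builds a SPARQL Protocol GET request in Python. The endpoint address is not given here, so `ENDPOINT_URL` is a placeholder to be replaced with the address published by the project. The query is a generic triple listing that makes no assumption about the vocabulary used in the graph, and `build_request_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode

# Placeholder only: substitute the real endpoint address published by the project.
ENDPOINT_URL = "https://example.org/sparql"

# Generic query listing a few triples; it assumes nothing about the graph schema.
QUERY = """
SELECT ?s ?p ?o
WHERE { ?s ?p ?o }
LIMIT 10
"""


def build_request_url(endpoint, query):
    """Build a SPARQL Protocol GET URL carrying the query in the `query` parameter."""
    return endpoint + "?" + urlencode({"query": query.strip()})


url = build_request_url(ENDPOINT_URL, QUERY)
print(url)
```

The resulting URL could then be fetched with `urllib.request.urlopen`, typically with an `Accept: application/sparql-results+json` header to request JSON results. Once the graph's vocabulary is known from the repository, the generic pattern can be replaced with specific classes and properties.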
This knowledge graph was created automatically, so it necessarily contains errors resulting from the various processes used to produce it (directory entry segmentation, OCR, NER, geocoding, linking, etc.).
We plan to publish more data about other activities and business types by the end of the project.
Two datasets are provided to further contextualize the data contained in the directories. The first is the population census at the level of Paris districts (domiciled and de facto population); the second is spatial data describing those districts.