Open datasets

Benchmark dataset for NER models used on noisy OCR predictions

A benchmark dataset for NER models applied to noisy OCR predictions has been created from our corpus of directories and is freely available. It is:

suitable for OCR evaluation,
suitable for NER fine-tuning (self-supervised and supervised),
suitable for NER evaluation.

8,765 reference entries from a selection of directories were manually corrected and annotated with 34,242 entities. Entries contain 3.9 entities on average. The reference tagged entities were then projected onto the OCR predictions produced by three different OCR systems. This resulted in a variable loss of entries:

for PERO OCR: 8,392 valid entries were generated,
for Tesseract: 8,700 valid entries were generated,
for Kraken: 7,990 valid entries were generated.
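
The projection step can be pictured with a small sketch. It is not necessarily the method used to build the dataset: it assumes that entities are stored as character spans on the reference text, and it aligns reference and OCR text with Python's standard difflib module. The function name project_span and the sample strings are invented for illustration. An entity that cannot be aligned at all returns None, which is one plausible reason for an entry to be dropped.

```python
from difflib import SequenceMatcher


def project_span(reference, ocr, start, end):
    """Map the span reference[start:end] onto the OCR text.

    Returns (new_start, new_end) in OCR coordinates, or None when no
    character of the span could be aligned with the OCR text.
    """
    matcher = SequenceMatcher(None, reference, ocr, autojunk=False)
    new_start = new_end = None
    for a, b, size in matcher.get_matching_blocks():
        # Overlap between the matching block [a, a + size) and the span.
        lo = max(a, start)
        hi = min(a + size, end)
        if lo < hi:
            if new_start is None:
                new_start = b + (lo - a)
            new_end = b + (hi - a)
    if new_start is None:
        return None
    return new_start, new_end


# Invented example: the OCR output misreads two characters.
reference = "Dupont, rue de Rivoli, 12"
ocr = "Dupcnt, rue de Rivo1i, 12"
span = project_span(reference, ocr, 8, 21)  # "rue de Rivoli" in the reference
print(ocr[span[0]:span[1]])  # rue de Rivo1i
```

The projected span keeps the OCR errors inside the entity, which is what a model evaluated on noisy input has to cope with.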

The resulting intersection of the sets of valid entries contains 7,725 entries for the three OCR systems (and the reference), or 8,341 entries if we consider PERO OCR and Tesseract only.
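
As a rough illustration of why the common subset is smaller than any single system's set of valid entries, the sketch below intersects sets of entry identifiers. The identifiers, the dictionary keys and the function name common_valid_entries are assumptions made for this example, not the dataset's actual layout.

```python
def common_valid_entries(valid_ids_by_system):
    """Return the entry IDs that are valid for every OCR system given."""
    id_sets = [set(ids) for ids in valid_ids_by_system.values()]
    if not id_sets:
        return set()
    return set.intersection(*id_sets)


# Toy example with invented entry IDs.
valid = {
    "pero": {1, 2, 3, 4},
    "tesseract": {1, 2, 3, 5},
    "kraken": {2, 3, 5},
}
print(sorted(common_valid_entries(valid)))  # [2, 3]

# Restricting the comparison to two systems keeps more entries.
pero_tesseract = {name: valid[name] for name in ("pero", "tesseract")}
print(sorted(common_valid_entries(pero_tesseract)))  # [1, 2, 3]
```

Each system loses a different subset of entries, so adding a system to the comparison can only keep the common subset the same size or shrink it.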

This dataset has been used in the work described in the article: .

The dataset is described in more detail and can be downloaded from the Zenodo repository: .

Benchmark dataset for Nested NER models used on noisy OCR predictions

A benchmark dataset for Nested Named Entity Recognition has been created using the initial dataset for NER models (see Benchmark dataset for NER models used on noisy OCR predictions at the top of this page):

Suitable for evaluating nested named entity recognition approaches (fine-tuning) with a two-level hierarchy of named entities;
Includes, for each entry:
- the manually corrected OCR text;
- PERO OCR output;
- Tesseract output.

This dataset was used in the following work presented at ICDAR 2023: .

The dataset is fully described and can be downloaded from Zenodo: .

Benchmark dataset for historical maps segmentation

A benchmark dataset for historical maps segmentation has been created and used to organise a competition at the ICDAR 2021 conference: .

Dataset "Annuaires historiques parisiens, 1798-1914"

Réf. Annuaires historiques parisiens, 1798-1914. Extraction structurée et géolocalisée à l’adresse des listes nominatives [par ordre alphabétique et par activité] dans les volumes numérisés, SoDUCo Team, V.4 - novembre 2023, Paris DataSet Nakala:

For more information about this dataset, see the 2nd Session of the SoDUCo-BnF seminar: Relocating addresses from commerce directories, Paris, XIXth century. A corpus of urban locations at a large scale, 10 November 2022, Paris, Bibliothèque nationale de France. See the Session Program and related presentations.

The geohistorical knowledge graph of shops and businesses

An approach was proposed to create a geohistorical knowledge graph that would enable the evolution of shops and businesses in Paris to be tracked over time, based on named entities extracted from trade directories and addresses extracted from old maps processed as part of the project. A first knowledge graph about the activities related to photography has been constructed and published on the Web of Data. Among other things, it can be used to answer the following questions:

What was the address of business X in 1861?
How many shops or businesses of this type were located in the rue de Rivoli in 1856?
Which shops or businesses were located in an area defined by a bounding box or polygon in 1875?
Which shops or businesses moved during their existence?
Which shops or businesses were taken over by another owner carrying on the same activity?
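
To make these questions concrete, here is a minimal in-memory sketch in Python. It does not reflect the published graph's schema or query language: the record layout, the business names and the coordinates are all invented, and the real graph is RDF data queried through SPARQL rather than Python lists.

```python
# Hypothetical records: (business, year, address, longitude, latitude).
RECORDS = [
    ("Atelier A", 1856, "rue de Rivoli 10", 2.3400, 48.8600),
    ("Atelier A", 1861, "rue de Rivoli 10", 2.3400, 48.8600),
    ("Atelier B", 1856, "rue de Rivoli 22", 2.3420, 48.8590),
    ("Atelier B", 1875, "boulevard des Italiens 5", 2.3370, 48.8710),
]


def address_in(records, business, year):
    """Addresses recorded for a business in a given year."""
    return [addr for name, y, addr, _, _ in records
            if name == business and y == year]


def moved(records):
    """Businesses recorded at more than one distinct address."""
    addresses = {}
    for name, _, addr, _, _ in records:
        addresses.setdefault(name, set()).add(addr)
    return {name for name, addrs in addresses.items() if len(addrs) > 1}


def in_bbox(records, year, min_lon, min_lat, max_lon, max_lat):
    """Businesses located inside a bounding box in a given year."""
    return [name for name, y, _, lon, lat in records
            if y == year
            and min_lon <= lon <= max_lon
            and min_lat <= lat <= max_lat]


print(address_in(RECORDS, "Atelier A", 1861))  # ['rue de Rivoli 10']
print(moved(RECORDS))  # {'Atelier B'}
print(in_bbox(RECORDS, 1875, 2.33, 48.87, 2.34, 48.88))  # ['Atelier B']
```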

This geohistorical knowledge graph can be queried either through a cartographic interface or through a SPARQL endpoint. It can also be downloaded from the GitHub repository related to the article describing the geohistorical knowledge graph creation process.

This knowledge graph was created automatically: it necessarily contains errors resulting from the various processes used to produce it (directory entry segmentation, OCR, NER, geocoding, linking, etc.).

We plan to publish more data about other activities and business types by the end of the project.

Datasets "Population des quartiers de Paris (1801-1911)" and "Quartiers de Paris (1860-1919)"

These two datasets are designed to further contextualize the data contained in the directories. The first is the census of the population at the level of Paris districts (domiciled and de facto population), while the second contains the spatial data of the districts.

Social Dynamics in Urban Context Open tools, models and data - Paris and its suburbs, 1789-1950