This project builds on a selection of the published datasets of the early Encyclopedia Britannica editions (1771-1911) produced by the Nineteenth-Century Knowledge Project, led by Peter M. Logan. One hypothesis, employed as a use case to guide KDL's contribution, is that nascent disciplines (domains of knowledge) may not have developed as rich a descriptive terminology as more established fields, and that this linguistic complexity or density can therefore be operationalised by analysing the entries associated with these domains in the Encyclopedia. A particularly important desideratum of KDL's exploratory work was to privilege inductive methods and to minimise reliance on extrinsic information.
The scope of this exploratory project was for KDL to:
- familiarise itself with the dataset (7th and 9th editions of the Encyclopedia Britannica);
- experiment with various computational linguistic methods (e.g. taxonomic classification, zero-shot classification, semantic topic modelling, semantic classification) and tools to assess their usefulness, limitations and potential with respect to research hypotheses about trends and shifts in the historical evolution of knowledge domains, as represented in the Encyclopedia entries, from cultural and linguistic perspectives;
- identify the most promising classifier (a semantic search approach was selected);
- develop prototype interfaces to manually review the quality of the resulting clusters and assist project partners with uncovering the contemporary organisation of knowledge rather than imposing modern constructs and prior knowledge.
This resulted in two prototype interfaces (web pages) and associated documentation:
- keyword search (7th edition, 1830-1842, and 9th edition, 1875-1889) with domain filtering;
- semantic search (7th edition, 1830-1842).
Both interfaces are based on a language model made of vectors (also called “embeddings”), one for each word form used in the datasets and one for each Encyclopedia entry. The model is trained entirely from the corpus using off-the-shelf word2vec libraries (Gensim and top2vec). The algorithm gives similar vectors to words that are used in similar contexts. The vector of an Encyclopedia entry is obtained by averaging the vectors of the words it contains. Because words and entries share the same vector space, the model can be used to find the words or entries most similar to any given word or entry, based on their distance in that joint space.
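The averaging and nearest-neighbour lookup described above can be illustrated with a minimal sketch. It does not use the project's model or the Gensim and top2vec libraries: the three-dimensional vectors, the two entries and the function names are all invented for illustration, and cosine similarity is assumed as the measure of closeness.

```python
import math

# Toy 3-dimensional word vectors, invented for illustration.
# A trained model would supply vectors with many more dimensions.
word_vectors = {
    "planet": [0.90, 0.10, 0.00],
    "orbit": [0.80, 0.20, 0.10],
    "star": [0.85, 0.05, 0.10],
    "verse": [0.10, 0.90, 0.10],
    "poet": [0.00, 0.95, 0.20],
    "rhyme": [0.10, 0.85, 0.00],
}

# Hypothetical entries, reduced to a few tokens each.
entries = {
    "ASTRONOMY": ["planet", "orbit", "star"],
    "POETRY": ["verse", "poet", "rhyme"],
}


def mean_vector(tokens):
    """Average the vectors of the tokens that the model knows."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        raise ValueError("no known tokens to average")
    dims = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dims)]


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# One vector per entry, placed in the same space as the word vectors.
entry_vectors = {heading: mean_vector(toks) for heading, toks in entries.items()}


def most_similar(query_vector, candidates, topn=3):
    """Rank candidate names by cosine similarity to the query vector."""
    scored = [(name, cosine(query_vector, vec)) for name, vec in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


def closest_entries(word, topn=3):
    """Entries closest to a given word."""
    return most_similar(word_vectors[word], entry_vectors, topn)


def closest_words(heading, topn=3):
    """Words closest to a given entry."""
    return most_similar(entry_vectors[heading], word_vectors, topn)


if __name__ == "__main__":
    print(closest_entries("planet"))
    print(closest_words("POETRY"))
```

Because entry vectors are plain averages of word vectors, the same ranking function serves both directions of lookup (word to entries, entry to words).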
The first interface allows users to find entries by their heading in the Encyclopedia, by their subject terms as extracted by the 19thC Knowledge Project, or by their domain (assigned by KDL via a classification process based on the models of contemporary descriptions of 6 core domains and associated seed words).
The second interface allows users to interact directly with a copy of the language model (compressed for the web, which slightly degrades its quality) to retrieve the entries and words semantically closest to any given word or entry heading. Clicking on any result triggers a new query. This interface lets users break free from the pre-defined domains and interrogate the corpus for any frequent term, whether it is a high-level category of knowledge, a specific concept, or an actual person or place.
Team
- Arianna Ciula, KDL Research Software Analyst
- Geoffroy Noël, KDL Research Software Engineer
- Marion Thain, Researcher
- Peter Logan, Principal investigator
Keywords
- Machine Learning and AI
- History
- Linguistics
- Languages and Literature