Automatic Subject Cataloguing at the German National Library
Author | |
---|---|
Publication date | 08 Apr 2025 |
Abstract | The German National Library (DNB) began developing solutions for automatic subject cataloguing 15 years ago. The main reason was the huge and ever-growing number of digital media works that needed to be indexed. Today, the DNB uses open source algorithms and frameworks to assign various types of thematic metadata automatically. This practice paper provides a deeper insight into automatic subject cataloguing at the DNB. We look at the data and vocabularies used as well as the different methods and approaches. The vocabulary for classification is based on the Dewey Decimal Classification (DDC). For verbal subject indexing, we use the German Integrated Authority File (GND). The use case of automatic classification is divided into the assignment of DDC Subject Categories and DDC Short Numbers. Due to the large size of the GND vocabulary, the use case of automatic indexing is an extreme multi-label classification (XMLC) problem. We briefly report on the construction and performance of our models. Based on these use cases, we present some implementation aspects of our “subject cataloguing machine” EMa, the environment for automatic subject cataloguing in productive use. We point out the basic feature set and provide a high-level introduction to the productive EMa system. We describe the modular design of the EMa software architecture, which uses the open source software Annif as its central toolkit. The development of EMa is an ongoing task at the DNB. It requires continuous development and maintenance as well as technological and human resources. Applied research activities in the DNB's AI project are closely related to EMa, ensuring that relevant scientific findings are integrated into its development. |
Journal | LIBER Quarterly |
Volume/Issue | Volume 35, No. 1, 2025 |
Pages | 1-29 |
Keywords | German National Library; automatic classification; automatic indexing; natural language processing; machine learning |
URL |