Polish Discourse Corpus (PDC): Corpus Design, ISO-Compliant Annotation, Data Highlights, and Parser Development

Ogrodniczuk, Maciej; Tomaszewska, Aleksandra; Ziembicki, Daniel; Żurowski, Sebastian; Tuora, Ryszard; Zwierzchowska, Aleksandra

Files

s_zurowski_062.pdf (215.71 KB)

Date

2024-05

Authors

Ogrodniczuk, Maciej

Tomaszewska, Aleksandra

Ziembicki, Daniel

Żurowski, Sebastian

Tuora, Ryszard

Zwierzchowska, Aleksandra

Publisher

ELRA Language Resources Association

Abstract

This paper presents the Polish Discourse Corpus, a pioneering resource of its kind for Polish and the first corpus in Poland to employ the ISO standard for discourse relation annotation. The corpus adopts ISO 24617-8, a part of the Language Resource Management – Semantic Annotation Framework (SemAF), which defines a set of core discourse relations adaptable to diverse languages and genres. The paper gives an overview of the corpus architecture, the annotation procedures, and the challenges the annotators encountered, along with key statistics on discourse relations and connectives in the corpus. It further discusses the initial development phases of a discourse parser tailored to the ISO 24617-8 framework. Evaluations of the efficacy of the corpus annotation and parsing strategies, and of areas for potential refinement, are also presented. The final part of the paper outlines planned research to improve the discourse analysis techniques used in the project and to conduct discourse studies involving multiple languages.

Keywords

Polish Discourse Corpus, discourse analysis, ISO 24617-8, discourse parsing, natural language processing, corpus linguistics, discourse annotation

Citation

Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

URI

http://repozytorium.umk.pl/handle/item/7015

Collections

Books, chapters (WHum)

Creative Commons license

Except where otherwise noted, this item's license is described as Attribution-NonCommercial-NoDerivs 3.0 Poland
