Polish Discourse Corpus (PDC): Corpus Design, ISO-Compliant Annotation, Data Highlights, and Parser Development

Repozytorium Uniwersytetu Mikołaja Kopernika

dc.contributor.author Ogrodniczuk, Maciej
dc.contributor.author Tomaszewska, Aleksandra
dc.contributor.author Ziembicki, Daniel
dc.contributor.author Żurowski, Sebastian
dc.contributor.author Tuora, Ryszard
dc.contributor.author Zwierzchowska, Aleksandra
dc.date.accessioned 2024-05-20T18:32:24Z
dc.date.available 2024-05-20T18:32:24Z
dc.date.issued 2024-05
dc.identifier.citation Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
dc.identifier.isbn 978-2-493814-10-4
dc.identifier.issn 2951-2093
dc.identifier.issn 2522-2686
dc.identifier.uri http://repozytorium.umk.pl/handle/item/7015
dc.description.abstract This paper presents the Polish Discourse Corpus, a pioneering resource of this kind for Polish and the first corpus in Poland to employ the ISO standard for discourse relation annotation. The Polish Discourse Corpus adopts ISO 24617-8, a segment of the Language Resource Management – Semantic Annotation Framework (SemAF), which outlines a set of core discourse relations adaptable for diverse languages and genres. The paper overviews the corpus architecture, annotation procedures, the challenges that the annotators have encountered, as well as key statistical data concerning discourse relations and connectives in the corpus. It further discusses the initial phases of the discourse parser tailored for the ISO 24617-8 framework. Evaluations on the efficacy and potential refinement areas of the corpus annotation and parsing strategies are also presented. The final part of the paper touches upon anticipated research plans to improve discourse analysis techniques in the project and to conduct discourse studies involving multiple languages.
dc.description.sponsorship The work was financed by the European Regional Development Fund as a part of the 2014–2020 Smart Growth Operational Programme, CLARIN — Common Language Resources and Technology Infrastructure, project no. POIR.04.02.00–00C002/19 and the Polish Ministry of Education and Science grant 2022/WK/09.
dc.language.iso eng
dc.publisher ELRA Language Resources Association
dc.rights Attribution-NonCommercial-NoDerivs 3.0 Poland
dc.rights.uri http://creativecommons.org/licenses/by-nc-nd/3.0/pl/
dc.subject Polish Discourse Corpus
dc.subject discourse analysis
dc.subject ISO 24617-8
dc.subject discourse parsing
dc.subject natural language processing
dc.subject corpus linguistics
dc.subject discourse annotation
dc.title Polish Discourse Corpus (PDC): Corpus Design, ISO-Compliant Annotation, Data Highlights, and Parser Development
dc.type info:eu-repo/semantics/article



This item is licensed under Attribution-NonCommercial-NoDerivs 3.0 Poland.