Polish Discourse Corpus (PDC): Corpus Design, ISO-Compliant Annotation, Data Highlights, and Parser Development

dc.contributor.author: Ogrodniczuk, Maciej
dc.contributor.author: Tomaszewska, Aleksandra
dc.contributor.author: Ziembicki, Daniel
dc.contributor.author: Żurowski, Sebastian
dc.contributor.author: Tuora, Ryszard
dc.contributor.author: Zwierzchowska, Aleksandra
dc.date.accessioned: 2024-05-20T18:32:24Z
dc.date.available: 2024-05-20T18:32:24Z
dc.date.issued: 2024-05
dc.description.abstract: This paper presents the Polish Discourse Corpus, a pioneering resource of this kind for Polish and the first corpus in Poland to employ the ISO standard for discourse relation annotation. The Polish Discourse Corpus adopts ISO 24617-8, a segment of the Language Resource Management – Semantic Annotation Framework (SemAF), which outlines a set of core discourse relations adaptable for diverse languages and genres. The paper overviews the corpus architecture, annotation procedures, the challenges that the annotators have encountered, as well as key statistical data concerning discourse relations and connectives in the corpus. It further discusses the initial phases of the discourse parser tailored for the ISO 24617-8 framework. Evaluations on the efficacy and potential refinement areas of the corpus annotation and parsing strategies are also presented. The final part of the paper touches upon anticipated research plans to improve discourse analysis techniques in the project and to conduct discourse studies involving multiple languages.
dc.description.sponsorship: The work was financed by the European Regional Development Fund as a part of the 2014–2020 Smart Growth Operational Programme, CLARIN – Common Language Resources and Technology Infrastructure, project no. POIR.04.02.00–00C002/19 and the Polish Ministry of Education and Science grant 2022/WK/09.
dc.identifier.citation: Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
dc.identifier.isbn: 978-2-493814-10-4
dc.identifier.issn: 2951-2093
dc.identifier.issn: 2522-2686
dc.identifier.uri: http://repozytorium.umk.pl/handle/item/7015
dc.language.iso: eng
dc.publisher: ELRA Language Resource Association
dc.rights: Attribution-NonCommercial-NoDerivs 3.0 Poland
dc.rights.uri: http://creativecommons.org/licenses/by-nc-nd/3.0/pl/
dc.subject: Polish Discourse Corpus
dc.subject: discourse analysis
dc.subject: ISO 24617-8
dc.subject: discourse parsing
dc.subject: natural language processing
dc.subject: corpus linguistics
dc.subject: discourse annotation
dc.title: Polish Discourse Corpus (PDC): Corpus Design, ISO-Compliant Annotation, Data Highlights, and Parser Development
dc.type: info:eu-repo/semantics/article

Files

Original bundle

Name: s_zurowski_062.pdf
Size: 215.71 KB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 1.34 KB
Format: Item-specific license agreed upon to submission