Publications
- A survey of corpora for Germanic low-resource languages and dialects
  Verena Blaschke, Hinrich Schütze & Barbara Plank
  NoDaLiDa 2023
Despite much progress in recent years, the vast majority of work in natural language processing (NLP) is on standard languages with many speakers. In this work, we instead focus on low-resource languages and in particular non-standardized low-resource languages. Even within branches of major language families, often considered well-researched, little is known about the extent and type of available resources and what the major NLP challenges are for these language varieties. The first step to address this situation is a systematic survey of available corpora (most importantly, annotated corpora, which are particularly valuable for NLP research). Focusing on Germanic low-resource language varieties, we provide such a survey in this paper. Except for geolocation (origin of speaker or document), we find that manually annotated linguistic resources are sparse and, if they exist, mostly cover morphosyntax. Despite this lack of resources, we observe that interest in this area is increasing: there is active development and a growing research community. To facilitate research, we make our overview of over 80 corpora publicly available.
Verena Blaschke, Hinrich Schütze, and Barbara Plank (2023). “A survey of corpora for Germanic low-resource languages and dialects.” In Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa), pp. 392–414. Tórshavn, Faroe Islands. University of Tartu Library.
@inproceedings{blaschke2023survey,
    title = {A survey of corpora for {G}ermanic low-resource languages and dialects},
    author = {Blaschke, Verena and Sch{\"u}tze, Hinrich and Plank, Barbara},
    year = {2023},
    month = may,
    booktitle = {Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)},
    address = {T{\'o}rshavn, Faroe Islands},
    publisher = {University of Tartu Library},
    url = {https://aclanthology.org/2023.nodalida-1.41},
    pages = {392--414},
}
- Does manipulating tokenization aid cross-lingual transfer?
  A study on POS tagging for non-standardized languages
  Verena Blaschke, Hinrich Schütze & Barbara Plank
  VarDial @ EACL 2023
  One of the challenges with finetuning pretrained language models (PLMs) is that their tokenizer is optimized for the language(s) it was pretrained on, but brittle when it comes to previously unseen variations in the data. This can, for instance, be observed when finetuning PLMs on one language and evaluating them on data in a closely related language variety with no standardized orthography. Despite the high linguistic similarity, tokenization no longer corresponds to meaningful representations of the target data, leading to low performance in, e.g., part-of-speech tagging.
In this work, we finetune PLMs on seven languages from three different families and analyze their zero-shot performance on closely related, non-standardized varieties. We consider different measures for the divergence in the tokenization of the source and target data, and the way they can be adjusted by manipulating the tokenization during the finetuning step. Overall, we find that the similarity between the percentage of words that get split into subwords in the source and target data (the split word ratio difference) is the strongest predictor for model performance on target data.
Verena Blaschke, Hinrich Schütze, and Barbara Plank (2023). “Does manipulating tokenization aid cross-lingual transfer? A study on POS tagging for non-standardized languages.” In Proceedings of the Tenth Workshop on NLP for Similar Languages, Varieties and Dialects, pp. 40–54. Dubrovnik, Croatia. Association for Computational Linguistics.
@inproceedings{blaschke2023manipulating,
    title = {Does manipulating tokenization aid cross-lingual transfer? {A} study on {POS} tagging for non-standardized languages},
    author = {Blaschke, Verena and Sch{\"u}tze, Hinrich and Plank, Barbara},
    booktitle = {Proceedings of the Tenth Workshop on NLP for Similar Languages, Varieties and Dialects},
    year = {2023},
    month = may,
    address = {Dubrovnik, Croatia},
    publisher = {Association for Computational Linguistics},
    url = {https://aclanthology.org/2023.vardial-1.5},
    pages = {40--54},
}
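The split word ratio difference used as a predictor above is straightforward to state concretely. The following sketch is illustrative, not the paper's code; `tokenize` stands in for any subword tokenizer (e.g. a WordPiece or BPE model):

```python
def split_word_ratio(words, tokenize):
    """Fraction of words that the tokenizer splits into more than one subword."""
    split = sum(1 for word in words if len(tokenize(word)) > 1)
    return split / len(words)

def split_word_ratio_difference(source_words, target_words, tokenize):
    """Absolute difference between the source and target split word ratios."""
    return abs(split_word_ratio(source_words, tokenize)
               - split_word_ratio(target_words, tokenize))
```

A large difference indicates that the tokenizer fragments the target variety much more (or less) heavily than the source language it was finetuned on.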
- [Forthcoming] Navigable atom-rule interactions in PSL models enhanced by rule verbalizations, with an application to etymological inference
  Verena Blaschke, Thora Daneyko, Jekaterina Kaparina, Zhuge Gao & Johannes Dellert
  ILP 2022
Adding to the budding landscape of advanced analysis tools for Probabilistic Soft Logic (PSL), we present a graphical explorer for grounded PSL models. It exposes the structure of the model from the perspective of any single atom, listing the ground rules in which it occurs. The other atoms in these rules serve as links for navigation through the resulting rule-atom graph (RAG). As additional diagnostic criteria, each associated rule is further classified as exerting upward or downward pressure on the atom’s value, and as active or inactive depending on its importance for the MAP estimate.
Our RAG viewer further includes a general infrastructure for making PSL results explainable by stating the reasoning patterns in terms of domain language. For this purpose, we provide a Java interface for “talking” predicates and rules which can generate verbalized explanations of the atom interactions effected by each rule. If the model’s rules are structured similarly to the way the domain is conceptualized by users, they will receive an intuitive explanation of the result in natural language.
As an example application, we present the current state of the loanword detection component of EtInEn, our upcoming software for machine-assisted etymological theory development.
- CyberWallE at SemEval-2020 task 11:
  An analysis of feature engineering for ensemble models for propaganda detection
  Verena Blaschke, Maxim Korniyenko & Sam Tureski
  SemEval @ COLING 2020
  This paper describes our participation in the SemEval-2020 task Detection of Propaganda Techniques in News Articles. We participate in both subtasks: Span Identification (SI) and Technique Classification (TC). We use a bi-LSTM architecture in the SI subtask and train a complex ensemble model for the TC subtask. Our architectures are built using embeddings from BERT in combination with additional lexical features and extensive label post-processing. Our systems achieve a rank of 8 out of 35 teams in the SI subtask (F1-score: 43.86%) and 8 out of 31 teams in the TC subtask (F1-score: 57.37%).
Verena Blaschke, Maxim Korniyenko, and Sam Tureski (2020). “CyberWallE at SemEval-2020 task 11: An analysis of feature engineering for ensemble models for propaganda detection.” In Proceedings of the Fourteenth Workshop on Semantic Evaluation (SemEval 2020), pp. 1469–1480. Barcelona (online). International Committee for Computational Linguistics. DOI: 10.18653/v1/2020.semeval-1.192
@inproceedings{blaschke2020cyberwalle,
    title = {{C}yber{W}all{E} at {S}em{E}val-2020 task 11: An analysis of feature engineering for ensemble models for propaganda detection},
    author = {Blaschke, Verena and Korniyenko, Maxim and Tureski, Sam},
    booktitle = {Proceedings of the Fourteenth Workshop on Semantic Evaluation (SemEval 2020)},
    year = {2020},
    address = {Barcelona (online)},
    publisher = {International Committee for Computational Linguistics},
    url = {https://aclanthology.org/2020.semeval-1.192},
    doi = {10.18653/v1/2020.semeval-1.192},
    pages = {1469--1480},
}
- Tübingen-Oslo Team at the VarDial 2018 evaluation campaign:
  An analysis of n-gram features in language variety identification
  Çağrı Çöltekin, Taraka Rama & Verena Blaschke
  VarDial @ COLING 2018
  This paper describes our systems for the VarDial 2018 evaluation campaign. We participated in all language identification tasks, namely Arabic dialect identification (ADI), German dialect identification (GDI), discriminating between Dutch and Flemish in Subtitles (DFS), and Indo-Aryan Language Identification (ILI). In all of the tasks, we only used textual transcripts (not using audio features for ADI). We submitted system runs based on support vector machine classifiers (SVMs) with bags of character and word n-grams as features, and gated bidirectional recurrent neural networks (RNNs) using units of characters and words. Our SVM models outperformed our RNN models in all tasks, obtaining first place on the DFS task, third place on the ADI task, and second place on the others according to the official rankings. As well as describing the models we used in the shared task participation, we present an analysis of the n-gram features used by the SVM models in each task, and also report additional results (run after the official competition deadline) on the GDI surprise dialect track.
Çağrı Çöltekin, Taraka Rama, and Verena Blaschke (2018). “Tübingen-Oslo Team at the VarDial 2018 evaluation campaign: An analysis of n-gram features in language variety identification.” In Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2018), pp. 55–65. Santa Fe, New Mexico, USA. Association for Computational Linguistics.
@inproceedings{coltekin2018tubingen,
    title = {{T}{\"u}bingen-{O}slo Team at the {V}ar{D}ial 2018 evaluation campaign: An analysis of n-gram features in language variety identification},
    author = {{\c{C}}{\"o}ltekin, {\c{C}}a{\u{g}}r{\i} and Rama, Taraka and Blaschke, Verena},
    booktitle = {Proceedings of the Fifth Workshop on {NLP} for Similar Languages, Varieties and Dialects ({V}ar{D}ial 2018)},
    year = {2018},
    address = {Santa Fe, New Mexico, USA},
    publisher = {Association for Computational Linguistics},
    url = {https://aclanthology.org/W18-3906},
    pages = {55--65},
}
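The bag-of-character-n-gram features underlying the SVM systems above can be sketched as follows (an illustrative reimplementation with hypothetical parameters, not the shared-task code):

```python
from collections import Counter

def char_ngrams(text, n_min=1, n_max=4):
    """Bag of character n-grams (with counts) for a single document,
    padded with spaces so that word boundaries become part of the n-grams."""
    padded = f" {text} "
    return Counter(padded[i:i + n]
                   for n in range(n_min, n_max + 1)
                   for i in range(len(padded) - n + 1))
```

In practice such feature extraction is usually delegated to an off-the-shelf vectorizer such as scikit-learn's `CountVectorizer(analyzer="char_wb", ngram_range=(1, 4))`, whose output would then be fed to an SVM classifier.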
Talks
(Excluding paper presentations, which are linked in the Publications section where applicable.)
- [Accepted] Configurable language-specific tokenization for CLDF databases
  Johannes Dellert & Verena Blaschke
  Exploiting standardized cross-linguistic data in historical linguistics @ ICHL 2023
In any workflow for computational historical linguistics, tokenization of IPA sequences is a crucial preprocessing step, as it shapes the alignments which provide the input of algorithms for cognate detection and proto-form reconstruction. This is also true for EtInEn (Dellert 2019), our forthcoming integrated development environment for etymological theories. An EtInEn project can be created from any CLDF database such as the ones that have been aggregated and unified by the Lexibank initiative (List et al. 2022). Whereas the tools for preparing CLDF databases (Forkel & List 2020) encourage the application of a uniform tokenization across all languages in a dataset, our view is that in many contexts, it is more natural to tokenize phonetic sequences in ways that differ between languages. To provide a simple example, many geminates in Italian need to be aligned to consonant clusters in other Romance languages (e.g. notte vs. Romanian noapte “night”), which is much easier if they are tokenized into two instances of the same consonant, whereas geminates in Swedish are best treated as cognate to their shortened counterparts in other Germanic languages.
To provide comprehensive support for such cases, EtInEn includes configurable language-specific tokenizers as an additional abstraction layer that allows forms to be reshaped after import and also serves as a generic way to bridge phonetic surface forms and the underlying forms that historical linguists are primarily interested in. Each tokenizer is defined by a token alphabet which is used for greedy tokenization, a list of allophone sets which can be used to abstract over irrelevant subphonemic distinctions, and a list of non-IPA symbols that are defined in terms of phonetic features. The initial state of each tokenizer is based on an analysis of the tokens used by the imported CLDF database. Tokenizer definitions are stored in a human-editable plain-text format which we would like to propose as a new standard.
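The greedy tokenization against a token alphabet described above can be sketched as a longest-match scan (an illustrative reimplementation, not EtInEn's actual code):

```python
def greedy_tokenize(form, alphabet):
    """Greedily split an IPA string into the longest tokens found in the alphabet."""
    max_len = max(len(token) for token in alphabet)
    tokens, i = [], 0
    while i < len(form):
        # try the longest candidate substring first
        for length in range(min(max_len, len(form) - i), 0, -1):
            if form[i:i + length] in alphabet:
                tokens.append(form[i:i + length])
                i += length
                break
        else:
            raise ValueError(f"no token in the alphabet matches {form[i:]!r}")
    return tokens
```

Whether the Italian geminate in notte comes out as one token or two is then purely a matter of the language-specific alphabet: with "tt" in the alphabet, the form tokenizes as ["n", "o", "tt", "e"]; without it, as ["n", "o", "t", "t", "e"].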
In EtInEn, tokenizer definitions are manipulated through a graphical editor in which the potential tokens for each language are arranged in the familiar layout of consonant and vowel charts, enhanced by additional panels for diphthongs and tones. Currently defined tokens are highlighted, and allophone sets are summarized under their canonical symbols. Basic edit operations serve to group several sounds into an allophone set, and to join or split a multi-symbol sequence, such as a diphthong or a sound with a coarticulation. More complex operations support workflows for parallel configuration of multiple tokenizers.
Additional non-IPA symbols can be given semantics in terms of a combination of phonetic features, and declared to be part of the token set for any language. On the representational level, this provides the option to use non-IPA symbols for form display, whereas underlyingly, the system will interpret the symbols in terms of their features. On the conceptual level, underspecified definitions provide support for metasymbols. In addition to some predefined metasymbols (such as V for vowels and C for consonants), the user can assign additional symbols to arbitrary classes of sounds. These are then available throughout EtInEn for various purposes, such as concisely representing the conditioning environments for a soundlaw, or summarizing the probabilistic output of an automated reconstruction module.
In addition to configurable tokenizers, EtInEn provides the option to define form-specific tokenization overrides, allowing the user to substitute the result of automated tokenization with any sequence over the current token alphabet for the relevant language. This is currently our strategy for handling otherwise challenging phenomena such as metathesis or root-pattern morphology, which we normalize into alignable and concatenative representations. This forms a bridge to existing standards for representing morphology in the CLDF framework (e.g. Schweikhard & List 2020), which currently only support the annotation of morpheme boundaries in terms of simple splits in phonetic IPA sequences.
References:
Dellert, Johannes (2019): “Interactive Etymological Inference via Statistical Relational Learning.” Workshop on Computer-Assisted Language Comparison at SLE-2019.
Forkel, Robert and Johann-Mattis List (2020): “CLDFBench. Give your Cross-Linguistic data a lift.” Proceedings of LREC 2020, 6997-7004.
List, Johann-Mattis, Robert Forkel, S. J. Greenhill, Christoph Rzymski, Johannes Englisch & Russell Gray (2022): “Lexibank, A public repository of standardized wordlists with computed phonological and lexical features.” Scientific Data 9.316, 1-31.
Schweikhard, Nathanael E. and Johann-Mattis List (2020): “Developing an annotation framework for word formation processes in comparative linguistics.” SKASE Journal of Theoretical Linguistics 17(1), 2-26.
- Correlating borrowing events across concepts to derive a data-driven source of evidence for loanword etymologies
  Verena Blaschke & Johannes Dellert
  Model and Evidence in Quantitative Comparative Linguistics @ DGfS 2021
Computational methods for approximating various aspects of the reasoning of a historical linguist have great potential as components of a future generation of systems for more rapid machine-aided theory development (List 2019). One of the main challenges for such methods is that some of the heuristics and reasoning patterns commonly used in historical linguistics are difficult to formalize completely. Etymological arguments frequently appeal more to the shared experience of experts than to a fully developed theoretical framework. Computationally emulating this process will require experience in the shape of data with annotations that represent the heuristics and preferences employed within human expert communities.
Our first application of this general paradigm focuses on informal evidence used for establishing loanword etymologies. Classical arguments for assigning a loanword etymology to a word rely on deviations from the sound laws which would have applied if the word had been inherited, or borrowed at a different point in time. For instance, it is clear that the German word Person is a borrowing and not strictly cognate with Latin persona, because otherwise the initial p would have had to undergo a sound shift to f. Such a criterion would be rather straightforward to formalize based on a formal description of the expected sound laws. However, this criterion is only helpful if some known sound law would have applied to a part of the phonetic material of the word in question. In many cases, we are not in this comfortable position, and the etymological discussion will be based on more elusive evidence.
In some cases, historical, geographical or archaeological knowledge will help to make the decision, but the most systematically exploitable type of evidence builds on the tendency for loanwords to appear in batches. For instance, if some language has already been established as a donor language for some words, it becomes more likely as a candidate donor for other words as well, even if the evidence from the individual words alone would not warrant such a conclusion. Even more crucially, arguments often rely on the observation that words from the same semantic field tend to get borrowed together. This applies to obvious cases like numbers and month names as well as to less obviously connected sets of concepts such as tools belonging to a certain craft (Tadmor 2009, Carling et al. 2019).
A helpful automated method for inferring possible loanword relations will have to emulate at least some of these types of informal reasoning. As a first step in this direction, we develop data-driven measures of how much evidence establishing one borrowing event provides for assuming others. We also explore the extent to which such a correlation structure of borrowing events can be extracted from the limited amounts of existing cross-linguistic loanword data.
Given a set of parallel wordlists annotated with loanword status and semantic concept information, we extract how often each concept was loaned and by which pairs of donor and target languages. To quantify the non-independence of borrowing events for each pair of concepts, we average the normalized pointwise mutual information across 1,000 bootstrap samples. In order to additionally retrieve some directional signal that can be interpreted as an approximation to implicational universals of borrowing, the same procedure is applied to the conditional probabilities of concept pairs given one of the concepts.
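The statistic described above can be sketched as follows. This is an illustrative reimplementation only, operating on made-up binary borrowing indicators per language rather than the actual wordlist data:

```python
import math
import random

def npmi(events_a, events_b):
    """Normalized pointwise mutual information of two binary event vectors."""
    n = len(events_a)
    p_a = sum(events_a) / n
    p_b = sum(events_b) / n
    p_ab = sum(1 for a, b in zip(events_a, events_b) if a and b) / n
    if p_ab == 0:
        return -1.0  # the two borrowing events never co-occur
    if p_ab == 1:
        return 1.0   # the two borrowing events always co-occur
    return math.log(p_ab / (p_a * p_b)) / -math.log(p_ab)

def bootstrap_npmi(events_a, events_b, samples=1000, seed=0):
    """Average NPMI across bootstrap resamples of the language sample."""
    rng = random.Random(seed)
    n = len(events_a)
    total = 0.0
    for _ in range(samples):
        idx = [rng.randrange(n) for _ in range(n)]
        total += npmi([events_a[i] for i in idx], [events_b[i] for i in idx])
    return total / samples
```

The bootstrap averaging damps the influence of individual languages in a small sample; the directional variant mentioned above would replace the NPMI with the conditional probability of one concept's borrowing given the other's.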
We apply our methods to WOLD (Haspelmath and Tadmor 2009) and find that even from this limited sample of 41 languages, it is possible to extract quite a few of the expected within-domain correlations (such as the ones between numbers or between kinship terms), which validates our approach. In addition, we also find some more surprising cross-domain correlations (such as between NARROW and HOLE and between KNEEL and DEFEAT, but also between BEESWAX and KIDNEY) which require further investigation.
References:
Carling, Gerd, Sandra Cronhamn, Robert Farren, Elnur Aliyev, and Johan Frid. 2019. “The causality of borrowing: Lexical loans in Eurasian languages.” PloS one 14(10): e0223588.
Haspelmath, Martin and Uri Tadmor, eds. 2009. World Loanword Database. Leipzig: Max Planck Institute for Evolutionary Anthropology. Available at https://wold.clld.org/.
List, Johann-Mattis. 2019. “Automated methods for the investigation of language contact, with a focus on lexical borrowing.” Language and Linguistics Compass 13(10): e12355.
Tadmor, Uri. 2009. “Loanwords in the world’s languages: Findings and results.” In Martin Haspelmath and Uri Tadmor, eds. Loanwords in the world’s languages: A comparative handbook. Berlin: De Gruyter Mouton. 55-75.
- Clustering dialect varieties based on historical sound correspondences
  Verena Blaschke
  GSCL Student Award nominee presentations @ KONVENS 2019
While information on historical sound shifts plays an important role for examining the relationships between related language varieties, it has rarely been used for computational dialectology. This thesis explores the performance of two algorithms for clustering language varieties based on sound correspondences between Proto-Germanic and modern continental West Germanic dialects. Our experiments suggest that the results of agglomerative clustering match common dialect groupings more closely than the results of (divisive) bipartite spectral graph co-clustering. We also observe that adding phonetic context information to the sound correspondences yields clusters that are more frequently associated with representative and distinctive sound correspondences.
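The agglomerative side of that comparison is easy to sketch. Below is a minimal average-linkage clusterer over binary sound-correspondence profiles; the profiles and variety names here are invented for illustration and are not the thesis data:

```python
def hamming(u, v):
    """Normalized Hamming distance between two equal-length binary vectors."""
    return sum(a != b for a, b in zip(u, v)) / len(u)

def agglomerate(profiles, n_clusters):
    """Average-linkage agglomerative clustering down to n_clusters.
    `profiles` maps variety names to binary sound-correspondence vectors."""
    clusters = [[name] for name in profiles]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # average pairwise distance between the two clusters
                d = sum(hamming(profiles[a], profiles[b])
                        for a in clusters[i] for b in clusters[j])
                d /= len(clusters[i]) * len(clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters
```

In practice one would use an optimized implementation such as `scipy.cluster.hierarchy.linkage`; the point here is only the merge criterion.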
Theses
- Explainable machine learning in linguistics and applied NLP:
  Two case studies of Norwegian dialectometry and sexism detection in French tweets
  This thesis presents an exploration of explainable machine learning in the context of a traditional linguistic area (dialect classification) and an applied task (sexism detection). In both tasks, the input features deemed especially relevant for the classification form meaningful groups that fit in with previous research on the topic, although not all such features are easy to understand or provide plausible explanations. In the case of dialect classification, some important features show that the model also learned patterns that are not typically presented by dialectologists. For both case studies, I use LIME [1] to rank features by their importance for the classification. For the sexism detection task, I additionally examine attention weights, which produce feature rankings that are in many cases similar to the LIME results but that are overall worse at showcasing tokens that are especially characteristic of sexist tweets.
[1] M. T. Ribeiro, S. Singh, and C. Guestrin (2016). “Why should I trust you?”: Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’16, New York, NY, USA, pp. 1135–1144. Association for Computing Machinery.
- Clustering dialect varieties based on historical sound correspondences
While information on historical sound shifts plays an important role for examining the relationships between related language varieties, it has rarely been used for computational dialectology. This thesis explores the performance of two algorithms for clustering language varieties based on sound correspondences between Proto-Germanic and modern continental West Germanic dialects. Our experiments suggest that the results of agglomerative clustering match common dialect groupings more closely than the results of (divisive) bipartite spectral graph co-clustering. We also observe that adding phonetic context information to the sound correspondences yields clusters that are more frequently associated with representative and distinctive sound correspondences.
Other
- LanguageStructure/TuLeD [Tupían Language Database]:
  Pre-release (version 0.9)
  Fabrício Ferraz Gerardi, Stanislav Reichert, Tim Wientzek, Verena Blaschke, Eric Mattos, Zhuge Gao, Mihai Manolescu & Nianheng Wu
  2020