lagom-nlp.bsky.social
We are the Leuven AI Group of Multilingual NLP (LAGoM NLP), a research lab in the Department of Computer Science at KU Leuven, led by @mdlhx
37 posts
490 followers
148 following
comment in response to
post
✅
We look at the role of English in this evaluation: it can be, and often is, used as an interface to boost task performance, or as a natural language in which to evaluate language understanding. We recommend moving away from task performance as the main goal and focusing instead on language understanding.
milanlp.bsky.social is having the same issue; maybe take a look at this GitHub issue: github.com/bluesky-soci...
Moreover, we advocate for a shift in perspective from seeking a general definition of data quality towards a more language- and task-specific one. Ultimately, we aim for this study to serve as a guide to using Wikipedia for pretraining in a multilingual setting.
We evaluate the downstream impact of quality filtering on Wikipedia by training tiny monolingual models on each Wikipedia, and find that data-quality pruning is an effective means of resource-efficient training without hurting performance, especially for low-resource languages (LRLs).
We subject non-English Wikipedias to common quality-filtering techniques such as script filtering, MinHash deduplication, and heuristic filtering, which reveal widespread issues, including a high percentage of one-line articles and duplicate articles.
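The thread doesn't show the lab's pipeline; as a rough, self-contained sketch of how MinHash-based near-duplicate detection works in general (all function names and parameters here are invented for illustration, not the paper's code):

```python
import hashlib

def shingles(text, n=3):
    """Character n-gram shingles of a document (whitespace-normalized)."""
    text = " ".join(text.split())
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 1))}

def minhash_signature(doc, num_perm=64):
    """MinHash signature: for each of num_perm seeded hash functions,
    keep the minimum hash value over the document's shingles."""
    sig = []
    for seed in range(num_perm):
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode(), digest_size=8,
                                salt=seed.to_bytes(8, "big")).digest(),
                "big")
            for s in shingles(doc)
        ))
    return sig

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

In a deduplication pass, article pairs whose estimated Jaccard similarity exceeds some threshold (e.g. 0.8) would be flagged as near-duplicates; production systems typically add LSH bucketing to avoid comparing all pairs.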
In this paper we critically examine the notion of Wikipedia as a 'high quality' resource, particularly in the pretraining setting.
It's still not working somehow: if I search for your handle in the search bar, your profile doesn't show up. I don't know if this is a bug or a setting on your side that isn't configured correctly?
I just tried to add you to the list and somehow couldn't find you. I suspect this might just be too soon after account creation? I'll try again later, maybe tomorrow
Just did 😁
Never mind, it did work
Trying to, but it doesn't seem to be working from my phone; I'll do this from a laptop later today or tomorrow if it hasn't worked
welcome!
go.bsky.app/LKGekew here we go!
why not! one more!
I wanted to do this, but I'm not finding enough accounts yet. I also have @amsterdamnlp.bsky.social @ukplab.bsky.social @colt-upf.bsky.social but I need two more
@mdlhx.bsky.social will virtually present our work on zero-shot POS tagging at the Multilingual Representation Learning (MRL) workshop poster session on Saturday, 16 Nov
Anthology link: aclanthology.org/2024.mrl-1.9/
Kushal Tatariya will present our work on interpreting PIXEL (Pixology): Session 09, Interpretability and Analysis of Models for NLP
Nov 13 (Wed) 16:00-17:30
Anthology link: aclanthology.org/2024.emnlp-m...
Wessel Poelman and Esther Ploeger will present our work on typological diversity: Session 11, Multilinguality and Language Diversity
Nov 14 (Thu) 10:30-12:00.
Anthology link: aclanthology.org/2024.emnlp-m...
Furthermore, we show that skewed language selection can paint an unfair picture of multilingual model performance. We hope that this work motivates more systematic approaches to language sampling in NLP, potentially inspired by existing methods from linguistic typology.
We approximate this diversity by measuring average language distance and the absolute inclusion of typological feature values, and find great variation across papers.
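The thread doesn't specify the distance metric; as a hypothetical illustration of "average language distance," one could take normalized Hamming distance over typological feature vectors, averaged across all language pairs in a sample. The feature values below are invented toy data, not real WALS entries:

```python
from itertools import combinations

# Toy WALS-style binary feature vectors (invented values, for illustration only).
feats = {
    "English": (1, 0, 1, 0),
    "Dutch":   (1, 0, 1, 1),
    "Hindi":   (0, 1, 0, 1),
}

def hamming(a, b):
    """Normalized Hamming distance: fraction of differing feature values."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

def mean_pairwise_distance(langs):
    """Average distance over all unordered language pairs in the sample."""
    pairs = list(combinations(langs, 2))
    return sum(hamming(feats[a], feats[b]) for a, b in pairs) / len(pairs)
```

A sample of closely related European languages would score low on this measure, while a sample spanning families would score high, which is the intuition behind auditing "typologically diverse" language selections.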
We find that there are no set definitions or criteria for making claims about typological diversity in NLP. In practice, across the papers making such claims, languages spoken in Europe are overrepresented.
Spoiler: we find that PLMs are indeed more influenced by Hindi words when predicting negative emotions, and by English words when predicting positive ones. Moreover, the PLMs may overgeneralise this pattern to examples where it does not apply.
We use LIME and token-level language identification to examine the effect of language on emotion prediction across three PLMs fine-tuned on a Hinglish emotion classification dataset.
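As a hedged sketch of the general idea (not the paper's setup): substitute simple leave-one-out attribution for LIME, use a toy lexicon for token-level language ID, and aggregate token importances per language. Everything here — the lexicons, the scorer, and the helper names — is invented for illustration:

```python
# Toy Hindi/English token lists standing in for a real language-ID component.
HINDI = {"nahi", "bura", "yaar"}
ENGLISH = {"good", "happy", "bad"}

def lang_id(token):
    """Lexicon-based token-level language ID (toy stand-in)."""
    if token in HINDI:
        return "hi"
    if token in ENGLISH:
        return "en"
    return "other"

def toy_negative_score(tokens):
    """Invented classifier: probability-like score for the 'negative' label."""
    neg = {"nahi", "bura", "bad"}
    return sum(t in neg for t in tokens) / max(len(tokens), 1)

def per_language_attribution(tokens, score_fn):
    """Leave-one-out attribution: a token's importance is the score drop
    when it is removed; importances are then summed per identified language."""
    base = score_fn(tokens)
    totals = {}
    for i, tok in enumerate(tokens):
        importance = base - score_fn(tokens[:i] + tokens[i + 1:])
        lang = lang_id(tok)
        totals[lang] = totals.get(lang, 0.0) + importance
    return totals
```

With a real model, `score_fn` would be the PLM's class probability and LIME's locally weighted linear surrogate would replace the leave-one-out step, but the per-language aggregation works the same way.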
TLDR: In this paper, we leverage sociolinguistic theories to see what pre-trained language models learn when predicting emotion for code-mixed data: Hinglish speakers switch to Hindi to express negative emotions and to English for positive ones.