# Lb_GottBERT

Lb_GottBERT is a BERT-like language model for the Luxembourgish language.

We used the weights of the German GottBERT language model as a starting point and continued pre-training it on the masked language modelling (MLM) task, using the same corpus that we used for our LuxemBERT model (https://huggingface.co/lothritz/LuxemBERT).
The resulting model achieves higher performance on several downstream tasks than the original LuxemBERT, DA BERT (https://huggingface.co/iolariu/DA_BERT), and its "sister" model Lb_mBERT (https://huggingface.co/lothritz/Lb_mBERT).
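
A minimal usage sketch with the Hugging Face `transformers` library is below. Note the assumptions: the hub id `lothritz/Lb_GottBERT` is inferred from the naming of the sister models linked here, the mask token `<mask>` follows the RoBERTa-based GottBERT the model was initialised from, and the example sentence is purely illustrative.

```python
# Minimal sketch (assumptions: hub id "lothritz/Lb_GottBERT" follows the
# naming of the sister models above; the mask token is "<mask>", as in the
# RoBERTa-based GottBERT this model was initialised from).
MODEL_ID = "lothritz/Lb_GottBERT"

def top_predictions(sentence: str, k: int = 5) -> list[str]:
    """Return the top-k fill-mask predictions for a sentence with one <mask>."""
    # Imported lazily; requires `pip install transformers` (plus torch).
    from transformers import pipeline
    fill = pipeline("fill-mask", model=MODEL_ID)
    return [p["token_str"] for p in fill(sentence, top_k=k)]

if __name__ == "__main__":
    # Luxembourgish example: "Luxembourg is a <mask> in Europe."
    print(top_predictions("Lëtzebuerg ass e <mask> an Europa."))
```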

If you would like to know more about our work or the pre-training corpus, or if you would like to use our models or datasets, please check out/cite the following papers:

```
@inproceedings{lothritz-etal-2022-luxembert,
    title = "{L}uxem{BERT}: Simple and Practical Data Augmentation in Language Model Pre-Training for {L}uxembourgish",
    author = "Lothritz, Cedric and
      Lebichot, Bertrand and
      Allix, Kevin and
      Veiber, Lisa and
      Bissyande, Tegawende and
      Klein, Jacques and
      Boytsov, Andrey and
      Lefebvre, Cl{\'e}ment and
      Goujon, Anne",
    booktitle = "Proceedings of the Thirteenth Language Resources and Evaluation Conference",
    month = jun,
    year = "2022",
    address = "Marseille, France",
    publisher = "European Language Resources Association",
    url = "https://aclanthology.org/2022.lrec-1.543",
    pages = "5080--5089",
    abstract = "Pre-trained Language Models such as BERT have become ubiquitous in NLP where they have achieved state-of-the-art performance in most NLP tasks. While these models are readily available for English and other widely spoken languages, they remain scarce for low-resource languages such as Luxembourgish. In this paper, we present LuxemBERT, a BERT model for the Luxembourgish language that we create using the following approach: we augment the pre-training dataset by considering text data from a closely related language that we partially translate using a simple and straightforward method. We are then able to produce the LuxemBERT model, which we show to be effective for various NLP tasks: it outperforms a simple baseline built with the available Luxembourgish text data as well as the multilingual mBERT model, which is currently the only option for transformer-based language models in Luxembourgish. Furthermore, we present datasets for various downstream NLP tasks that we created for this study and will make available to researchers on request.",
}
```

```
@inproceedings{lothritz2023comparing,
    title = {Comparing Pre-Training Schemes for Luxembourgish {BERT} Models},
    author = {Lothritz, Cedric and Ezzini, Saad and Purschke, Christoph and Bissyande, Tegawend{\'e} Fran{\c{c}}ois D Assise and Klein, Jacques and Olariu, Isabella and Boytsov, Andrey and Lefebvre, Clement and Goujon, Anne},
    booktitle = {Proceedings of the 19th Conference on Natural Language Processing (KONVENS 2023)},
    year = {2023}
}
```