Update README.md
Updated references to the training dataset.
README.md CHANGED
@@ -1,7 +1,7 @@
  ---
  license: apache-2.0
  datasets:
- - deepvk/
+ - deepvk/GeRaCl_synthethic_dataset
  language:
  - ru
  base_model:
@@ -91,7 +91,7 @@ Compared to the USER2-base model, there are two additional MLP layers. One is fo

  <img src="assets/architecture.png" alt="GeRaCl architecture" width="600"/>

- The training set is built entirely from splits of the [`deepvk/
+ The training set is built entirely from splits of the [`deepvk/GeRaCl_synthethic_dataset`](https://huggingface.co/datasets/deepvk/GeRaCl_synthethic_dataset) dataset. It is a concatenation of three sub-datasets:
  - **Synthetic classes part**. For every training example we randomly chose one of the five class lists (`classes_0`…`classes_4`) and paired it with the sample’s text. The validation and test splits were added unchanged.
  - **RU-MTEB part**. The entire `ru_mteb_classes` dataset was added to the mix.
  - **RU-MTEB extended part**. The entire `ru_mteb_extended_classes` dataset was added to the mix.
@@ -99,10 +99,10 @@ The training set is built entirely from splits of the [`deepvk/CLAZER`](https:/

  | Dataset | # Samples |
  |----------------------------:|:----:|
- | [
- | [
- | [
- | [
+ | [GeRaCl_synthethic_dataset/synthetic_classes_train](https://huggingface.co/datasets/deepvk/GeRaCl_synthethic_dataset/viewer/synthetic_classes_train) | 93K |
+ | [GeRaCl_synthethic_dataset/synthetic_classes](https://huggingface.co/datasets/deepvk/GeRaCl_synthethic_dataset/viewer/synthetic_classes) (val and test) | 6K |
+ | [GeRaCl_synthethic_dataset/ru_mteb_classes](https://huggingface.co/datasets/deepvk/GeRaCl_synthethic_dataset/viewer/ru_mteb_classes/) | 52K |
+ | [GeRaCl_synthethic_dataset/ru_mteb_extended_classes](https://huggingface.co/datasets/deepvk/GeRaCl_synthethic_dataset/viewer/ru_mteb_extended_classes) | 93K |
  | **Total** | 244K |

  ## Citations
@@ -114,4 +114,4 @@ The training set is built entirely from splits of the [`deepvk/CLAZER`](https:/
  publisher={Hugging Face}
  year={2025},
  }
- ```
+ ```
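The "Synthetic classes part" bullet in the README text above describes pairing each training sample's text with one of its five class lists, chosen at random. A minimal sketch of that sampling step follows. It is not the authors' code: the `text` field name, the output key `classes`, the helper `pick_class_list`, and the toy row are all assumptions made for illustration. Only the column names `classes_0`…`classes_4` come from the README.

```python
import random

NUM_CLASS_LISTS = 5  # classes_0 ... classes_4, as named in the README


def pick_class_list(example, rng):
    """Pair the sample's text with one randomly chosen class list.

    Hypothetical helper: `example` is assumed to be a dict holding a
    `text` field plus the columns `classes_0` ... `classes_4`.
    """
    idx = rng.randrange(NUM_CLASS_LISTS)  # uniform choice among the five lists
    return {"text": example["text"], "classes": example[f"classes_{idx}"]}


# Toy row standing in for one training example (made-up values).
row = {"text": "sample text"}
for i in range(NUM_CLASS_LISTS):
    row[f"classes_{i}"] = [f"label_{i}a", f"label_{i}b"]

rng = random.Random(0)  # seeded so repeated runs pick the same list
paired = pick_class_list(row, rng)
print(paired)
```

With a dataset loaded through the `datasets` library, a function of this shape could be applied to each row with `Dataset.map`, assuming the row layout matches the assumptions above.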