MikhailVyrodov committed · Commit 002aee0 · verified · 1 Parent(s): e5259c0

Update README.md

Updated references to the training dataset.

Files changed (1)
  1. README.md +7 -7
@@ -1,7 +1,7 @@
  ---
  license: apache-2.0
  datasets:
- - deepvk/synthetic-classes
+ - deepvk/GeRaCl_synthethic_dataset
  language:
  - ru
  base_model:
@@ -91,7 +91,7 @@ Compared to the USER2-base model, there are two additional MLP layers. One is fo
 
  <img src="assets/architecture.png" alt="GeRaCl architecture" width="600"/>
 
- The training set is built entirely from splits of the [`deepvk/CLAZER`](https://huggingface.co/datasets/deepvk/synthetic-classes) dataset. It is a concatenation of three sub-datasets:
+ The training set is built entirely from splits of the [`deepvk/GeRaCl_synthethic_dataset`](https://huggingface.co/datasets/deepvk/GeRaCl_synthethic_dataset) dataset. It is a concatenation of three sub-datasets:
  - **Synthetic classes part**. For every training example we randomly chose one of the five class lists (`classes_0`…`classes_4`) and paired it with the sample’s text. The validation and test splits were added unchanged.
  - **RU-MTEB part**. The entire `ru_mteb_classes` dataset was added to the mix.
  - **RU-MTEB extended part**. The entire `ru_mteb_extended_classes` dataset was added to the mix.
@@ -99,10 +99,10 @@ The training set is built entirely from splits of the [`deepvk/CLAZER`](https:/
 
  | Dataset | # Samples |
  |----------------------------:|:----:|
- | [CLAZER/synthetic_classes_train](https://huggingface.co/datasets/deepvk/synthetic-classes/viewer/synthetic_classes_train) | 93K |
- | [CLAZER/synthetic_classes](https://huggingface.co/datasets/deepvk/synthetic-classes/viewer/synthetic_classes) (val and test) | 6K |
- | [CLAZER/ru_mteb_classes](https://huggingface.co/datasets/deepvk/synthetic-classes/viewer/ru_mteb_classes/) | 52K |
- | [CLAZER/ru_mteb_extended_classes](https://huggingface.co/datasets/deepvk/synthetic-classes/viewer/ru_mteb_extended_classes) | 93K |
+ | [GeRaCl_synthethic_dataset/synthetic_classes_train](https://huggingface.co/datasets/deepvk/GeRaCl_synthethic_dataset/viewer/synthetic_classes_train) | 93K |
+ | [GeRaCl_synthethic_dataset/synthetic_classes](https://huggingface.co/datasets/deepvk/GeRaCl_synthethic_dataset/viewer/synthetic_classes) (val and test) | 6K |
+ | [GeRaCl_synthethic_dataset/ru_mteb_classes](https://huggingface.co/datasets/deepvk/GeRaCl_synthethic_dataset/viewer/ru_mteb_classes/) | 52K |
+ | [GeRaCl_synthethic_dataset/ru_mteb_extended_classes](https://huggingface.co/datasets/deepvk/GeRaCl_synthethic_dataset/viewer/ru_mteb_extended_classes) | 93K |
  | **Total** | 244K |
 
  ## Citations
@@ -114,4 +114,4 @@ The training set is built entirely from splits of the [`deepvk/CLAZER`](https:/
  publisher={Hugging Face}
  year={2025},
  }
- ```
+ ```
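
The README text in this diff describes the synthetic classes part as picking one of the five class lists (`classes_0`…`classes_4`) at random for each training example and pairing it with the sample's text. A minimal sketch of that step might look like the following. It is an illustration, not the authors' training code: the `text` field name, the `classes` output key, and the helper name `pair_with_random_class_list` are assumptions; only the `classes_0`…`classes_4` column names come from the README.

```python
import random

# Column names taken from the README: classes_0 ... classes_4.
CLASS_LIST_COLUMNS = [f"classes_{i}" for i in range(5)]


def pair_with_random_class_list(example, rng):
    """Pick one of the five class lists at random and pair it with the text.

    Hypothetical helper: the "text" input field and the "classes" output key
    are assumed names, not confirmed by the source.
    """
    column = rng.choice(CLASS_LIST_COLUMNS)
    return {"text": example["text"], "classes": list(example[column])}


if __name__ == "__main__":
    # Toy record with made-up labels, only to show the call shape.
    record = {"text": "sample text"}
    for i, column in enumerate(CLASS_LIST_COLUMNS):
        record[column] = [f"label_{i}_a", f"label_{i}_b"]

    rng = random.Random(0)  # seeded so repeated runs pick the same list
    print(pair_with_random_class_list(record, rng))
```

Passing the random generator in explicitly keeps the choice reproducible across runs, which is usually convenient when a sampled training mix needs to be rebuilt.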