FlofloB committed on
Commit 381cdec · verified · 1 Parent(s): 88953f0

Update README.md

Files changed (1)
  1. README.md +5 -3
README.md CHANGED
@@ -7,6 +7,9 @@ tags:
  model-index:
  - name: smollm2-135M_pretrained_200k_fineweb_uncovai_selected
  results: []
+ datasets:
+ - HuggingFaceFW/fineweb
+ - UncovAI/fineweb_CC-MAIN-2024-18_100k_output_UncovAI_83362
  ---

  <!-- This model card has been generated automatically according to the information the Trainer had access to. You
@@ -14,8 +17,7 @@ should probably proofread and complete it, then remove this comment. -->

  # smollm2-135M_pretrained_200k_fineweb_uncovai_selected

- This model is a fine-tuned version of [HuggingFaceTB/SmolLM2-135M](https://huggingface.co/HuggingFaceTB/SmolLM2-135M) on an unknown dataset.
-
+ This model is a pre-trained version of [HuggingFaceTB/SmolLM2-135M](https://huggingface.co/HuggingFaceTB/SmolLM2-135M) on the first 200k samples from the [Fineweb dataset dump CC-MAIN-2024-18](https://huggingface.co/datasets/HuggingFaceFW/fineweb/viewer/CC-MAIN-2024-18) using the [UncovAI](https://uncovai.com/) model for text to remove synthetic labeled data. We observed that more than 16% of the data were detected as having been generated by AI by our model.
  ## Model description

  More information needed
@@ -53,4 +55,4 @@ The following hyperparameters were used during training:
  - Transformers 4.44.2
  - Pytorch 2.4.1+cu121
  - Datasets 3.2.0
- - Tokenizers 0.19.1
+ - Tokenizers 0.19.1
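
The new README text describes taking the first 200k FineWeb samples and dropping those a detector labels as synthetic. The commit does not show how this was done, so the following is only a minimal sketch of that kind of selection step. The function names `select_non_synthetic` and `flagged_fraction`, the `"text"` field, and the `toy_detector` stand-in are all hypothetical; the UncovAI model's actual interface is not part of this commit.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator


def select_non_synthetic(
    samples: Iterable[dict],
    is_synthetic: Callable[[str], bool],
    limit: int = 200_000,
) -> Iterator[dict]:
    """Yield the samples among the first `limit` that the detector does not flag."""
    for sample in islice(samples, limit):
        if not is_synthetic(sample["text"]):
            yield sample


def flagged_fraction(
    samples: Iterable[dict],
    is_synthetic: Callable[[str], bool],
    limit: int = 200_000,
) -> float:
    """Return the share of the first `limit` samples that the detector flags."""
    total = 0
    flagged = 0
    for sample in islice(samples, limit):
        total += 1
        if is_synthetic(sample["text"]):
            flagged += 1
    return flagged / total if total else 0.0


def toy_detector(text: str) -> bool:
    """Hypothetical stand-in for a real synthetic-text detector (assumption)."""
    return "as an ai" in text.lower()


# Tiny in-memory stand-in for a stream of dataset rows.
docs = [
    {"text": "Plain human text."},
    {"text": "As an AI language model, I cannot..."},
    {"text": "Another page."},
    {"text": "Row beyond the limit."},
]
kept = list(select_non_synthetic(docs, toy_detector, limit=3))
share = flagged_fraction(docs, toy_detector, limit=3)
```

With real data, `samples` could be any iterable of rows that carry a `"text"` field, for example a streamed dataset split, and `toy_detector` would be replaced by the actual classifier.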