aniket0898 committed (verified)
Commit a3632fc · 1 parent: 45c5198

Add new SentenceTransformer model
1_Pooling/config.json ADDED
@@ -0,0 +1,10 @@
+ {
+   "word_embedding_dimension": 768,
+   "pooling_mode_cls_token": false,
+   "pooling_mode_mean_tokens": true,
+   "pooling_mode_max_tokens": false,
+   "pooling_mode_mean_sqrt_len_tokens": false,
+   "pooling_mode_weightedmean_tokens": false,
+   "pooling_mode_lasttoken": false,
+   "include_prompt": true
+ }
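The pooling config above enables mean pooling only: token embeddings are averaged over non-padding positions to give one 768-dimensional vector per input. A minimal PyTorch sketch of that operation (illustrative only, not code shipped in this repository):

```python
import torch

def mean_pool(token_embeddings: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    # token_embeddings: (batch, seq_len, 768); attention_mask: (batch, seq_len) of 0/1
    mask = attention_mask.unsqueeze(-1).float()    # (batch, seq_len, 1)
    summed = (token_embeddings * mask).sum(dim=1)  # padding positions contribute nothing
    counts = mask.sum(dim=1).clamp(min=1e-9)       # number of real tokens per row
    return summed / counts                         # (batch, 768)
```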
README.md ADDED
@@ -0,0 +1,837 @@
+ ---
+ language:
+ - en
+ license: apache-2.0
+ tags:
+ - sentence-transformers
+ - sentence-similarity
+ - feature-extraction
+ - generated_from_trainer
+ - dataset_size:6300
+ - loss:MatryoshkaLoss
+ - loss:MultipleNegativesRankingLoss
+ base_model: nomic-ai/nomic-embed-text-v1
+ widget:
+ - source_sentence: What amount of senior notes was repaid during fiscal 2022?
+   sentences:
+   - 'The following table sets forth the breakdown of revenue by geography, determined
+     based on the location of the Host’s listing (in millions): | Year Ended December
+     31, | 2021 | 2022 | 2023 United States | $ | 2,996 | | $ | 3,890 | $ | 4,290 International(1)
+     | 2,996 | | 4,509 | | 5,627 Total revenue | $ | 5,992 | | $ | 8,399 | $ | 9,917'
+   - During fiscal 2022, $2.25 billion of senior notes was repaid.
+   - Several factors are considered in developing the estimate for the long-term expected
+     rate of return on plan assets. For the defined benefit retirement plans, these
+     factors include historical rates of return of broad equity and bond indices and
+     projected long-term rates of return obtained from pension investment consultants.
+     The expected long-term rates of return for plan assets are 8 - 9% for equities
+     and 3 - 5% for bonds. For other retiree benefit plans, the expected long-term
+     rate of return reflects that the assets are comprised primarily of Company stock.
+     The expected rate of return on Company stock is based on the long-term projected
+     return of 8.5% and reflects the historical pattern of returns.
+ - source_sentence: What does GameStop Corp. offer to its customers?
+   sentences:
+   - State fraud and abuse laws could lead to criminal, civil, or administrative consequences,
+     including licensure loss, exclusion from healthcare programs, and significant
+     negative effects on the violating entity's business operations and financial health
+     if the laws are violated.
+   - GameStop Corp. offers games and entertainment products through its stores and
+     ecommerce platforms.
+   - Stribild is an oral formulation dosed once a day for the treatment of HIV-1 infection
+     in certain patients.
+ - source_sentence: How might a 10% change in the obsolescence reserve percentage impact
+     net earnings?
+   sentences:
+   - A 10% change in our obsolescence reserve percentage at January 28, 2023 would
+     have affected net earnings by approximately $2.5 million in fiscal 2022.
+   - The information required by Item 3 on Legal Proceedings is provided by referencing
+     Note 19 of the Notes to Consolidated Financial Statements in Item 8.
+   - ured notes for an aggregate principal amount of $18.50 billion. These notes were
+     issued in multiple series, which mature from 2027 through 2063.
+ - source_sentence: What are the SEC's regulations for security-based swap dealers
+     like Goldman Sachs' subsidiaries?
+   sentences:
+   - The increase in other income, net was primarily due to an increase in interest
+     income as a result of higher cash balances and higher interest rates.
+   - Through our Stubs loyalty programs, we have developed a consumer database of approximately
+     32 million households, representing approximately 64 million individuals.
+   - SEC rules govern the registration and regulation of security-based swap dealers.
+     Security-based swaps are defined as swaps on single securities, single loans or
+     narrow-based baskets or indices of securities. The SEC has adopted a number of
+     rules for security-based swap dealers, including (i) capital, margin and segregation
+     requirements; (ii) record-keeping, financial reporting and notification requirements;
+     (iii) business conduct standards; (iv) regulatory and public trade reporting;
+     and (v) the application of risk mitigation techniques to uncleared portfolios
+     of security-based swaps.
+ - source_sentence: How is the information about legal proceedings organized in the
+     financial documents according to the provided context?
+   sentences:
+   - The information about legal proceedings is organized under Part II, Item 8 in
+     the section titled 'Financial Statements and Supplementary Data – Note 14'.
+   - We have a match-funding policy that addresses the interest rate risk by aligning
+     the interest rate profile (fixed or floating rate and duration) of our debt portfolio
+     with the interest rate profile of our finance receivable portfolio within a predetermined
+     range on an ongoing basis. In connection with that policy, we use interest rate
+     derivative instruments to modify the debt structure to match assets within the
+     finance receivable portfolio.
+   - Achieved adjusted FIFO operating profit of $5.1 billion, which represents an 18%
+     increase compared to 2021.
+ pipeline_tag: sentence-similarity
+ library_name: sentence-transformers
+ metrics:
+ - cosine_accuracy@1
+ - cosine_accuracy@3
+ - cosine_accuracy@5
+ - cosine_accuracy@10
+ - cosine_precision@1
+ - cosine_precision@3
+ - cosine_precision@5
+ - cosine_precision@10
+ - cosine_recall@1
+ - cosine_recall@3
+ - cosine_recall@5
+ - cosine_recall@10
+ - cosine_ndcg@10
+ - cosine_mrr@10
+ - cosine_map@100
+ model-index:
+ - name: Nomic Financial Matryoshka
+   results:
+   - task:
+       type: information-retrieval
+       name: Information Retrieval
+     dataset:
+       name: dim 768
+       type: dim_768
+     metrics:
+     - type: cosine_accuracy@1
+       value: 0.7457142857142857
+       name: Cosine Accuracy@1
+     - type: cosine_accuracy@3
+       value: 0.8614285714285714
+       name: Cosine Accuracy@3
+     - type: cosine_accuracy@5
+       value: 0.8957142857142857
+       name: Cosine Accuracy@5
+     - type: cosine_accuracy@10
+       value: 0.93
+       name: Cosine Accuracy@10
+     - type: cosine_precision@1
+       value: 0.7457142857142857
+       name: Cosine Precision@1
+     - type: cosine_precision@3
+       value: 0.28714285714285714
+       name: Cosine Precision@3
+     - type: cosine_precision@5
+       value: 0.1791428571428571
+       name: Cosine Precision@5
+     - type: cosine_precision@10
+       value: 0.09299999999999999
+       name: Cosine Precision@10
+     - type: cosine_recall@1
+       value: 0.7457142857142857
+       name: Cosine Recall@1
+     - type: cosine_recall@3
+       value: 0.8614285714285714
+       name: Cosine Recall@3
+     - type: cosine_recall@5
+       value: 0.8957142857142857
+       name: Cosine Recall@5
+     - type: cosine_recall@10
+       value: 0.93
+       name: Cosine Recall@10
+     - type: cosine_ndcg@10
+       value: 0.8398915226132163
+       name: Cosine Ndcg@10
+     - type: cosine_mrr@10
+       value: 0.8107896825396824
+       name: Cosine Mrr@10
+     - type: cosine_map@100
+       value: 0.8136819482601757
+       name: Cosine Map@100
+   - task:
+       type: information-retrieval
+       name: Information Retrieval
+     dataset:
+       name: dim 512
+       type: dim_512
+     metrics:
+     - type: cosine_accuracy@1
+       value: 0.7357142857142858
+       name: Cosine Accuracy@1
+     - type: cosine_accuracy@3
+       value: 0.8514285714285714
+       name: Cosine Accuracy@3
+     - type: cosine_accuracy@5
+       value: 0.8914285714285715
+       name: Cosine Accuracy@5
+     - type: cosine_accuracy@10
+       value: 0.93
+       name: Cosine Accuracy@10
+     - type: cosine_precision@1
+       value: 0.7357142857142858
+       name: Cosine Precision@1
+     - type: cosine_precision@3
+       value: 0.2838095238095238
+       name: Cosine Precision@3
+     - type: cosine_precision@5
+       value: 0.17828571428571427
+       name: Cosine Precision@5
+     - type: cosine_precision@10
+       value: 0.09299999999999999
+       name: Cosine Precision@10
+     - type: cosine_recall@1
+       value: 0.7357142857142858
+       name: Cosine Recall@1
+     - type: cosine_recall@3
+       value: 0.8514285714285714
+       name: Cosine Recall@3
+     - type: cosine_recall@5
+       value: 0.8914285714285715
+       name: Cosine Recall@5
+     - type: cosine_recall@10
+       value: 0.93
+       name: Cosine Recall@10
+     - type: cosine_ndcg@10
+       value: 0.8352581932886503
+       name: Cosine Ndcg@10
+     - type: cosine_mrr@10
+       value: 0.8047103174603173
+       name: Cosine Mrr@10
+     - type: cosine_map@100
+       value: 0.8075415578285141
+       name: Cosine Map@100
+   - task:
+       type: information-retrieval
+       name: Information Retrieval
+     dataset:
+       name: dim 256
+       type: dim_256
+     metrics:
+     - type: cosine_accuracy@1
+       value: 0.7285714285714285
+       name: Cosine Accuracy@1
+     - type: cosine_accuracy@3
+       value: 0.8614285714285714
+       name: Cosine Accuracy@3
+     - type: cosine_accuracy@5
+       value: 0.8857142857142857
+       name: Cosine Accuracy@5
+     - type: cosine_accuracy@10
+       value: 0.9271428571428572
+       name: Cosine Accuracy@10
+     - type: cosine_precision@1
+       value: 0.7285714285714285
+       name: Cosine Precision@1
+     - type: cosine_precision@3
+       value: 0.28714285714285714
+       name: Cosine Precision@3
+     - type: cosine_precision@5
+       value: 0.17714285714285713
+       name: Cosine Precision@5
+     - type: cosine_precision@10
+       value: 0.09271428571428571
+       name: Cosine Precision@10
+     - type: cosine_recall@1
+       value: 0.7285714285714285
+       name: Cosine Recall@1
+     - type: cosine_recall@3
+       value: 0.8614285714285714
+       name: Cosine Recall@3
+     - type: cosine_recall@5
+       value: 0.8857142857142857
+       name: Cosine Recall@5
+     - type: cosine_recall@10
+       value: 0.9271428571428572
+       name: Cosine Recall@10
+     - type: cosine_ndcg@10
+       value: 0.8319809230146766
+       name: Cosine Ndcg@10
+     - type: cosine_mrr@10
+       value: 0.8011235827664398
+       name: Cosine Mrr@10
+     - type: cosine_map@100
+       value: 0.8040552556779361
+       name: Cosine Map@100
+   - task:
+       type: information-retrieval
+       name: Information Retrieval
+     dataset:
+       name: dim 128
+       type: dim_128
+     metrics:
+     - type: cosine_accuracy@1
+       value: 0.7128571428571429
+       name: Cosine Accuracy@1
+     - type: cosine_accuracy@3
+       value: 0.8328571428571429
+       name: Cosine Accuracy@3
+     - type: cosine_accuracy@5
+       value: 0.8671428571428571
+       name: Cosine Accuracy@5
+     - type: cosine_accuracy@10
+       value: 0.9142857142857143
+       name: Cosine Accuracy@10
+     - type: cosine_precision@1
+       value: 0.7128571428571429
+       name: Cosine Precision@1
+     - type: cosine_precision@3
+       value: 0.2776190476190476
+       name: Cosine Precision@3
+     - type: cosine_precision@5
+       value: 0.1734285714285714
+       name: Cosine Precision@5
+     - type: cosine_precision@10
+       value: 0.09142857142857141
+       name: Cosine Precision@10
+     - type: cosine_recall@1
+       value: 0.7128571428571429
+       name: Cosine Recall@1
+     - type: cosine_recall@3
+       value: 0.8328571428571429
+       name: Cosine Recall@3
+     - type: cosine_recall@5
+       value: 0.8671428571428571
+       name: Cosine Recall@5
+     - type: cosine_recall@10
+       value: 0.9142857142857143
+       name: Cosine Recall@10
+     - type: cosine_ndcg@10
+       value: 0.8145627876253931
+       name: Cosine Ndcg@10
+     - type: cosine_mrr@10
+       value: 0.7825572562358278
+       name: Cosine Mrr@10
+     - type: cosine_map@100
+       value: 0.7859620809117356
+       name: Cosine Map@100
+   - task:
+       type: information-retrieval
+       name: Information Retrieval
+     dataset:
+       name: dim 64
+       type: dim_64
+     metrics:
+     - type: cosine_accuracy@1
+       value: 0.6642857142857143
+       name: Cosine Accuracy@1
+     - type: cosine_accuracy@3
+       value: 0.8042857142857143
+       name: Cosine Accuracy@3
+     - type: cosine_accuracy@5
+       value: 0.8457142857142858
+       name: Cosine Accuracy@5
+     - type: cosine_accuracy@10
+       value: 0.9028571428571428
+       name: Cosine Accuracy@10
+     - type: cosine_precision@1
+       value: 0.6642857142857143
+       name: Cosine Precision@1
+     - type: cosine_precision@3
+       value: 0.2680952380952381
+       name: Cosine Precision@3
+     - type: cosine_precision@5
+       value: 0.16914285714285712
+       name: Cosine Precision@5
+     - type: cosine_precision@10
+       value: 0.09028571428571427
+       name: Cosine Precision@10
+     - type: cosine_recall@1
+       value: 0.6642857142857143
+       name: Cosine Recall@1
+     - type: cosine_recall@3
+       value: 0.8042857142857143
+       name: Cosine Recall@3
+     - type: cosine_recall@5
+       value: 0.8457142857142858
+       name: Cosine Recall@5
+     - type: cosine_recall@10
+       value: 0.9028571428571428
+       name: Cosine Recall@10
+     - type: cosine_ndcg@10
+       value: 0.7821373629924483
+       name: Cosine Ndcg@10
+     - type: cosine_mrr@10
+       value: 0.7436649659863942
+       name: Cosine Mrr@10
+     - type: cosine_map@100
+       value: 0.7468498882402747
+       name: Cosine Map@100
+ ---
+ 
+ # Nomic Financial Matryoshka
+ 
+ This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [nomic-ai/nomic-embed-text-v1](https://huggingface.co/nomic-ai/nomic-embed-text-v1) on the json dataset. It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
+ 
+ ## Model Details
+ 
+ ### Model Description
+ - **Model Type:** Sentence Transformer
+ - **Base model:** [nomic-ai/nomic-embed-text-v1](https://huggingface.co/nomic-ai/nomic-embed-text-v1) <!-- at revision eb6b20cd65fcbdf7a2bc4ebac97908b3b21da981 -->
+ - **Maximum Sequence Length:** 8192 tokens
+ - **Output Dimensionality:** 768 dimensions
+ - **Similarity Function:** Cosine Similarity
+ - **Training Dataset:**
+   - json
+ - **Language:** en
+ - **License:** apache-2.0
+ 
+ ### Model Sources
+ 
+ - **Documentation:** [Sentence Transformers Documentation](https://sbert.net)
+ - **Repository:** [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
+ - **Hugging Face:** [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)
+ 
+ ### Full Model Architecture
+ 
+ ```
+ SentenceTransformer(
+   (0): Transformer({'max_seq_length': 8192, 'do_lower_case': False}) with Transformer model: NomicBertModel
+   (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
+   (2): Normalize()
+ )
+ ```
+ 
+ ## Usage
+ 
+ ### Direct Usage (Sentence Transformers)
+ 
+ First install the Sentence Transformers library:
+ 
+ ```bash
+ pip install -U sentence-transformers
+ ```
+ 
+ Then you can load this model and run inference.
+ ```python
+ from sentence_transformers import SentenceTransformer
+ 
+ # Download from the 🤗 Hub
+ # trust_remote_code=True is required because NomicBertModel ships custom modeling code
+ model = SentenceTransformer("aniket0898/bge-base-financial-matryoshka", trust_remote_code=True)
+ # Run inference
+ sentences = [
+     'How is the information about legal proceedings organized in the financial documents according to the provided context?',
+     "The information about legal proceedings is organized under Part II, Item 8 in the section titled 'Financial Statements and Supplementary Data – Note 14'.",
+     'We have a match-funding policy that addresses the interest rate risk by aligning the interest rate profile (fixed or floating rate and duration) of our debt portfolio with the interest rate profile of our finance receivable portfolio within a predetermined range on an ongoing basis. In connection with that policy, we use interest rate derivative instruments to modify the debt structure to match assets within the finance receivable portfolio.',
+ ]
+ embeddings = model.encode(sentences)
+ print(embeddings.shape)
+ # [3, 768]
+ 
+ # Get the similarity scores for the embeddings
+ similarities = model.similarity(embeddings, embeddings)
+ print(similarities.shape)
+ # [3, 3]
+ ```
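Because the model was trained with MatryoshkaLoss over dimensions 768/512/256/128/64 (see Training Details below), embeddings can also be truncated to a leading slice at a modest quality cost. A sketch using the library's `truncate_dim` option (available in recent sentence-transformers releases):

```python
from sentence_transformers import SentenceTransformer

# Keep only the first 256 Matryoshka dimensions of every embedding
model = SentenceTransformer(
    "aniket0898/bge-base-financial-matryoshka",
    trust_remote_code=True,
    truncate_dim=256,
)
embeddings = model.encode(["What amount of senior notes was repaid during fiscal 2022?"])
print(embeddings.shape)
# (1, 256)
```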
+ 
+ <!--
+ ### Direct Usage (Transformers)
+ 
+ <details><summary>Click to see the direct usage in Transformers</summary>
+ 
+ </details>
+ -->
+ 
+ <!--
+ ### Downstream Usage (Sentence Transformers)
+ 
+ You can finetune this model on your own dataset.
+ 
+ <details><summary>Click to expand</summary>
+ 
+ </details>
+ -->
+ 
+ <!--
+ ### Out-of-Scope Use
+ 
+ *List how the model may foreseeably be misused and address what users ought not to do with the model.*
+ -->
+ 
+ ## Evaluation
+ 
+ ### Metrics
+ 
+ #### Information Retrieval
+ * Dataset: `dim_768`
+ * Evaluated with [<code>InformationRetrievalEvaluator</code>](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.InformationRetrievalEvaluator)
+ 
+ | Metric              | Value      |
+ |:--------------------|:-----------|
+ | cosine_accuracy@1   | 0.7457     |
+ | cosine_accuracy@3   | 0.8614     |
+ | cosine_accuracy@5   | 0.8957     |
+ | cosine_accuracy@10  | 0.93       |
+ | cosine_precision@1  | 0.7457     |
+ | cosine_precision@3  | 0.2871     |
+ | cosine_precision@5  | 0.1791     |
+ | cosine_precision@10 | 0.093      |
+ | cosine_recall@1     | 0.7457     |
+ | cosine_recall@3     | 0.8614     |
+ | cosine_recall@5     | 0.8957     |
+ | cosine_recall@10    | 0.93       |
+ | cosine_ndcg@10      | 0.8399     |
+ | cosine_mrr@10       | 0.8108     |
+ | **cosine_map@100**  | **0.8137** |
+ 
+ #### Information Retrieval
+ * Dataset: `dim_512`
+ * Evaluated with [<code>InformationRetrievalEvaluator</code>](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.InformationRetrievalEvaluator)
+ 
+ | Metric              | Value      |
+ |:--------------------|:-----------|
+ | cosine_accuracy@1   | 0.7357     |
+ | cosine_accuracy@3   | 0.8514     |
+ | cosine_accuracy@5   | 0.8914     |
+ | cosine_accuracy@10  | 0.93       |
+ | cosine_precision@1  | 0.7357     |
+ | cosine_precision@3  | 0.2838     |
+ | cosine_precision@5  | 0.1783     |
+ | cosine_precision@10 | 0.093      |
+ | cosine_recall@1     | 0.7357     |
+ | cosine_recall@3     | 0.8514     |
+ | cosine_recall@5     | 0.8914     |
+ | cosine_recall@10    | 0.93       |
+ | cosine_ndcg@10      | 0.8353     |
+ | cosine_mrr@10       | 0.8047     |
+ | **cosine_map@100**  | **0.8075** |
+ 
+ #### Information Retrieval
+ * Dataset: `dim_256`
+ * Evaluated with [<code>InformationRetrievalEvaluator</code>](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.InformationRetrievalEvaluator)
+ 
+ | Metric              | Value      |
+ |:--------------------|:-----------|
+ | cosine_accuracy@1   | 0.7286     |
+ | cosine_accuracy@3   | 0.8614     |
+ | cosine_accuracy@5   | 0.8857     |
+ | cosine_accuracy@10  | 0.9271     |
+ | cosine_precision@1  | 0.7286     |
+ | cosine_precision@3  | 0.2871     |
+ | cosine_precision@5  | 0.1771     |
+ | cosine_precision@10 | 0.0927     |
+ | cosine_recall@1     | 0.7286     |
+ | cosine_recall@3     | 0.8614     |
+ | cosine_recall@5     | 0.8857     |
+ | cosine_recall@10    | 0.9271     |
+ | cosine_ndcg@10      | 0.832      |
+ | cosine_mrr@10       | 0.8011     |
+ | **cosine_map@100**  | **0.8041** |
+ 
+ #### Information Retrieval
+ * Dataset: `dim_128`
+ * Evaluated with [<code>InformationRetrievalEvaluator</code>](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.InformationRetrievalEvaluator)
+ 
+ | Metric              | Value     |
+ |:--------------------|:----------|
+ | cosine_accuracy@1   | 0.7129    |
+ | cosine_accuracy@3   | 0.8329    |
+ | cosine_accuracy@5   | 0.8671    |
+ | cosine_accuracy@10  | 0.9143    |
+ | cosine_precision@1  | 0.7129    |
+ | cosine_precision@3  | 0.2776    |
+ | cosine_precision@5  | 0.1734    |
+ | cosine_precision@10 | 0.0914    |
+ | cosine_recall@1     | 0.7129    |
+ | cosine_recall@3     | 0.8329    |
+ | cosine_recall@5     | 0.8671    |
+ | cosine_recall@10    | 0.9143    |
+ | cosine_ndcg@10      | 0.8146    |
+ | cosine_mrr@10       | 0.7826    |
+ | **cosine_map@100**  | **0.786** |
+ 
+ #### Information Retrieval
+ * Dataset: `dim_64`
+ * Evaluated with [<code>InformationRetrievalEvaluator</code>](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.InformationRetrievalEvaluator)
+ 
+ | Metric              | Value      |
+ |:--------------------|:-----------|
+ | cosine_accuracy@1   | 0.6643     |
+ | cosine_accuracy@3   | 0.8043     |
+ | cosine_accuracy@5   | 0.8457     |
+ | cosine_accuracy@10  | 0.9029     |
+ | cosine_precision@1  | 0.6643     |
+ | cosine_precision@3  | 0.2681     |
+ | cosine_precision@5  | 0.1691     |
+ | cosine_precision@10 | 0.0903     |
+ | cosine_recall@1     | 0.6643     |
+ | cosine_recall@3     | 0.8043     |
+ | cosine_recall@5     | 0.8457     |
+ | cosine_recall@10    | 0.9029     |
+ | cosine_ndcg@10      | 0.7821     |
+ | cosine_mrr@10       | 0.7437     |
+ | **cosine_map@100**  | **0.7468** |
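The tables above are produced by sentence-transformers' `InformationRetrievalEvaluator`, run once per Matryoshka dimension. A minimal sketch of the wiring; the toy query/corpus dicts here are hypothetical placeholders, not the actual held-out evaluation split:

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import InformationRetrievalEvaluator

model = SentenceTransformer(
    "aniket0898/bge-base-financial-matryoshka",
    trust_remote_code=True,
    truncate_dim=256,  # evaluate the dim_256 slice
)

# Hypothetical toy data with one relevant document per query
queries = {"q1": "What amount of senior notes was repaid during fiscal 2022?"}
corpus = {"d1": "During fiscal 2022, $2.25 billion of senior notes was repaid."}
relevant_docs = {"q1": {"d1"}}

evaluator = InformationRetrievalEvaluator(queries, corpus, relevant_docs, name="dim_256")
print(evaluator(model))  # accuracy@k, precision@k, recall@k, ndcg@10, mrr@10, map@100
```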
+ 
+ <!--
+ ## Bias, Risks and Limitations
+ 
+ *What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
+ -->
+ 
+ <!--
+ ### Recommendations
+ 
+ *What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
+ -->
+ 
+ ## Training Details
+ 
+ ### Training Dataset
+ 
+ #### json
+ 
+ * Dataset: json
+ * Size: 6,300 training samples
+ * Columns: <code>anchor</code> and <code>positive</code>
+ * Approximate statistics based on the first 1000 samples:
+   |         | anchor                                                                             | positive                                                                            |
+   |:--------|:-----------------------------------------------------------------------------------|:-------------------------------------------------------------------------------------|
+   | type    | string                                                                             | string                                                                              |
+   | details | <ul><li>min: 2 tokens</li><li>mean: 20.47 tokens</li><li>max: 40 tokens</li></ul> | <ul><li>min: 9 tokens</li><li>mean: 45.09 tokens</li><li>max: 272 tokens</li></ul> |
+ * Samples:
+   | anchor | positive |
+   |:-------|:---------|
+   | <code>What was the stored value of cards and loyalty program balances at the end of fiscal year 2022?</code> | <code>Stored value cards and loyalty program at October 2, 2022 showed a balance of approximately $1.503 billion.</code> |
+   | <code>What transformation is planned for Le Jardin located at The Londoner Macao?</code> | <code>Le Jardin, located on the southern flank of The Londoner Macao, is to undergo a transformation into a distinctive garden-themed attraction spanning approximately 50,000 square meters.</code> |
+   | <code>What are the key terms of the new Labor Agreement ratified by the UAW in 2023?</code> | <code>The key terms and provisions of the Labor Agreement are: General wage increases of 11% upon ratification in 2023, 3% in September each of 2024, 2025 and 2026, and 5% in September 2027; Consolidation of applicable wage classifications for in-progression, temporary and other employees – with employees reaching the top classification rate upon the completion of 156 weeks of active service; The re-establishment of a cost-of-living allowance; Lump sum ratification bonus payments of $5,000 paid to eligible employees in the three months ended December 31, 2023; For members currently employed and enrolled in the Employees’ Pension Plan, an increase of $5.00 to the monthly basic benefit for past and future service provided; A 3.6% increase in company contributions to eligible employees' defined contribution retirement accounts; and Annual contribution of $500 to eligible retirees or surviving spouses.</code> |
+ * Loss: [<code>MatryoshkaLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#matryoshkaloss) with these parameters:
+   ```json
+   {
+       "loss": "MultipleNegativesRankingLoss",
+       "matryoshka_dims": [768, 512, 256, 128, 64],
+       "matryoshka_weights": [1, 1, 1, 1, 1],
+       "n_dims_per_step": -1
+   }
+   ```
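In code, that configuration corresponds roughly to the following (a sketch of the loss setup, not the full training script):

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.losses import MatryoshkaLoss, MultipleNegativesRankingLoss

model = SentenceTransformer("nomic-ai/nomic-embed-text-v1", trust_remote_code=True)

# In-batch-negatives ranking loss, applied at every listed dimension with equal weight
inner_loss = MultipleNegativesRankingLoss(model)
loss = MatryoshkaLoss(model, inner_loss, matryoshka_dims=[768, 512, 256, 128, 64])
```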
+ 
+ ### Training Hyperparameters
+ #### Non-Default Hyperparameters
+ 
+ - `eval_strategy`: epoch
+ - `per_device_train_batch_size`: 32
+ - `per_device_eval_batch_size`: 16
+ - `gradient_accumulation_steps`: 16
+ - `learning_rate`: 2e-05
+ - `num_train_epochs`: 4
+ - `lr_scheduler_type`: cosine
+ - `warmup_ratio`: 0.1
+ - `bf16`: True
+ - `tf32`: True
+ - `load_best_model_at_end`: True
+ - `optim`: adamw_torch_fused
+ - `batch_sampler`: no_duplicates
+ 
+ #### All Hyperparameters
+ <details><summary>Click to expand</summary>
+ 
+ - `overwrite_output_dir`: False
+ - `do_predict`: False
+ - `eval_strategy`: epoch
+ - `prediction_loss_only`: True
+ - `per_device_train_batch_size`: 32
+ - `per_device_eval_batch_size`: 16
+ - `per_gpu_train_batch_size`: None
+ - `per_gpu_eval_batch_size`: None
+ - `gradient_accumulation_steps`: 16
+ - `eval_accumulation_steps`: None
+ - `learning_rate`: 2e-05
+ - `weight_decay`: 0.0
+ - `adam_beta1`: 0.9
+ - `adam_beta2`: 0.999
+ - `adam_epsilon`: 1e-08
+ - `max_grad_norm`: 1.0
+ - `num_train_epochs`: 4
+ - `max_steps`: -1
+ - `lr_scheduler_type`: cosine
+ - `lr_scheduler_kwargs`: {}
+ - `warmup_ratio`: 0.1
+ - `warmup_steps`: 0
+ - `log_level`: passive
+ - `log_level_replica`: warning
+ - `log_on_each_node`: True
+ - `logging_nan_inf_filter`: True
+ - `save_safetensors`: True
+ - `save_on_each_node`: False
+ - `save_only_model`: False
+ - `restore_callback_states_from_checkpoint`: False
+ - `no_cuda`: False
+ - `use_cpu`: False
+ - `use_mps_device`: False
+ - `seed`: 42
+ - `data_seed`: None
+ - `jit_mode_eval`: False
+ - `use_ipex`: False
+ - `bf16`: True
+ - `fp16`: False
+ - `fp16_opt_level`: O1
+ - `half_precision_backend`: auto
+ - `bf16_full_eval`: False
+ - `fp16_full_eval`: False
+ - `tf32`: True
+ - `local_rank`: 0
+ - `ddp_backend`: None
+ - `tpu_num_cores`: None
+ - `tpu_metrics_debug`: False
+ - `debug`: []
+ - `dataloader_drop_last`: False
+ - `dataloader_num_workers`: 0
+ - `dataloader_prefetch_factor`: None
+ - `past_index`: -1
+ - `disable_tqdm`: False
+ - `remove_unused_columns`: True
+ - `label_names`: None
+ - `load_best_model_at_end`: True
+ - `ignore_data_skip`: False
+ - `fsdp`: []
+ - `fsdp_min_num_params`: 0
+ - `fsdp_config`: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
+ - `fsdp_transformer_layer_cls_to_wrap`: None
+ - `accelerator_config`: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
+ - `deepspeed`: None
+ - `label_smoothing_factor`: 0.0
+ - `optim`: adamw_torch_fused
+ - `optim_args`: None
+ - `adafactor`: False
+ - `group_by_length`: False
+ - `length_column_name`: length
+ - `ddp_find_unused_parameters`: None
+ - `ddp_bucket_cap_mb`: None
+ - `ddp_broadcast_buffers`: False
+ - `dataloader_pin_memory`: True
+ - `dataloader_persistent_workers`: False
+ - `skip_memory_metrics`: True
+ - `use_legacy_prediction_loop`: False
+ - `push_to_hub`: False
+ - `resume_from_checkpoint`: None
+ - `hub_model_id`: None
+ - `hub_strategy`: every_save
+ - `hub_private_repo`: False
+ - `hub_always_push`: False
+ - `gradient_checkpointing`: False
+ - `gradient_checkpointing_kwargs`: None
+ - `include_inputs_for_metrics`: False
+ - `eval_do_concat_batches`: True
+ - `fp16_backend`: auto
+ - `push_to_hub_model_id`: None
+ - `push_to_hub_organization`: None
+ - `mp_parameters`: 
+ - `auto_find_batch_size`: False
+ - `full_determinism`: False
+ - `torchdynamo`: None
+ - `ray_scope`: last
+ - `ddp_timeout`: 1800
+ - `torch_compile`: False
+ - `torch_compile_backend`: None
+ - `torch_compile_mode`: None
+ - `dispatch_batches`: None
+ - `split_batches`: None
+ - `include_tokens_per_second`: False
+ - `include_num_input_tokens_seen`: False
+ - `neftune_noise_alpha`: None
+ - `optim_target_modules`: None
+ - `batch_eval_metrics`: False
+ - `batch_sampler`: no_duplicates
+ - `multi_dataset_batch_sampler`: proportional
+ 
+ </details>
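A hedged sketch of how the non-default hyperparameters above map onto a sentence-transformers v3 training run; the output path is a placeholder, and `save_strategy="epoch"` is an assumption (it must match `eval_strategy` for `load_best_model_at_end` to work):

```python
from sentence_transformers import SentenceTransformerTrainingArguments
from sentence_transformers.training_args import BatchSamplers

args = SentenceTransformerTrainingArguments(
    output_dir="models/nomic-financial-matryoshka",  # placeholder
    num_train_epochs=4,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=16,
    gradient_accumulation_steps=16,
    learning_rate=2e-5,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    bf16=True,
    tf32=True,
    eval_strategy="epoch",
    save_strategy="epoch",  # assumed, so load_best_model_at_end can compare checkpoints
    load_best_model_at_end=True,
    optim="adamw_torch_fused",
    batch_sampler=BatchSamplers.NO_DUPLICATES,  # no duplicate texts within a batch
)
```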
+ 
+ ### Training Logs
+ | Epoch      | Step   | Training Loss | dim_768_cosine_map@100 | dim_512_cosine_map@100 | dim_256_cosine_map@100 | dim_128_cosine_map@100 | dim_64_cosine_map@100 |
+ |:----------:|:------:|:-------------:|:----------------------:|:----------------------:|:----------------------:|:----------------------:|:---------------------:|
+ | 0.8122     | 10     | 0.7331        | -                      | -                      | -                      | -                      | -                     |
+ | 0.9746     | 12     | -             | 0.7871                 | 0.7796                 | 0.7747                 | 0.7546                 | 0.7214                |
+ | 1.6244     | 20     | 0.2506        | -                      | -                      | -                      | -                      | -                     |
+ | 1.9492     | 24     | -             | 0.8021                 | 0.7990                 | 0.7869                 | 0.7691                 | 0.7371                |
+ | 2.4365     | 30     | 0.1029        | -                      | -                      | -                      | -                      | -                     |
+ | 2.9239     | 36     | -             | 0.8030                 | 0.8017                 | 0.7926                 | 0.7760                 | 0.7402                |
+ | 3.2487     | 40     | 0.054         | -                      | -                      | -                      | -                      | -                     |
+ | **3.8985** | **48** | **-**         | **0.8055**             | **0.799**              | **0.7924**             | **0.7754**             | **0.7383**            |
+ | 0.8122     | 10     | 0.0397        | -                      | -                      | -                      | -                      | -                     |
+ | 0.9746     | 12     | -             | 0.8109                 | 0.7983                 | 0.7974                 | 0.7795                 | 0.7373                |
+ | 1.6244     | 20     | 0.0301        | -                      | -                      | -                      | -                      | -                     |
+ | 1.9492     | 24     | -             | 0.8115                 | 0.8049                 | 0.8026                 | 0.7839                 | 0.7486                |
+ | 2.4365     | 30     | 0.0236        | -                      | -                      | -                      | -                      | -                     |
+ | 2.9239     | 36     | -             | 0.8138                 | 0.8082                 | 0.8045                 | 0.7858                 | 0.7470                |
+ | 3.2487     | 40     | 0.0131        | -                      | -                      | -                      | -                      | -                     |
+ | **3.8985** | **48** | **-**         | **0.8137**             | **0.8075**             | **0.8041**             | **0.786**              | **0.7468**            |
+ 
+ * The bold rows denote the saved checkpoints.
+ 
+ ### Framework Versions
+ - Python: 3.8.10
+ - Sentence Transformers: 3.2.1
+ - Transformers: 4.41.2
+ - PyTorch: 2.1.2+cu121
+ - Accelerate: 1.0.1
+ - Datasets: 2.19.1
+ - Tokenizers: 0.19.1
+ 
+ ## Citation
+ 
+ ### BibTeX
+ 
+ #### Sentence Transformers
+ ```bibtex
+ @inproceedings{reimers-2019-sentence-bert,
+     title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
+     author = "Reimers, Nils and Gurevych, Iryna",
+     booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
+     month = "11",
+     year = "2019",
+     publisher = "Association for Computational Linguistics",
+     url = "https://arxiv.org/abs/1908.10084",
+ }
+ ```
+ 
+ #### MatryoshkaLoss
+ ```bibtex
+ @misc{kusupati2024matryoshka,
+     title={Matryoshka Representation Learning},
+     author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
+     year={2024},
+     eprint={2205.13147},
+     archivePrefix={arXiv},
+     primaryClass={cs.LG}
+ }
+ ```
+ 
+ #### MultipleNegativesRankingLoss
+ ```bibtex
+ @misc{henderson2017efficient,
+     title={Efficient Natural Language Response Suggestion for Smart Reply},
+     author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
+     year={2017},
+     eprint={1705.00652},
+     archivePrefix={arXiv},
+     primaryClass={cs.CL}
+ }
+ ```
+ 
+ <!--
+ ## Glossary
+ 
+ *Clearly define terms in order to be accessible across audiences.*
+ -->
+ 
+ <!--
+ ## Model Card Authors
+ 
+ *Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.*
+ -->
+ 
+ <!--
+ ## Model Card Contact
+ 
+ *Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
+ -->
config.json ADDED
@@ -0,0 +1,58 @@
+ {
+   "_name_or_path": "nomic-ai/nomic-embed-text-v1",
+   "activation_function": "swiglu",
+   "architectures": [
+     "NomicBertModel"
+   ],
+   "attn_pdrop": 0.0,
+   "auto_map": {
+     "AutoConfig": "nomic-ai/nomic-bert-2048--configuration_hf_nomic_bert.NomicBertConfig",
+     "AutoModel": "nomic-ai/nomic-bert-2048--modeling_hf_nomic_bert.NomicBertModel",
+     "AutoModelForMaskedLM": "nomic-ai/nomic-bert-2048--modeling_hf_nomic_bert.NomicBertForPreTraining"
+   },
+   "bos_token_id": null,
+   "causal": false,
+   "dense_seq_output": true,
+   "embd_pdrop": 0.0,
+   "eos_token_id": null,
+   "fused_bias_fc": true,
+   "fused_dropout_add_ln": true,
+   "initializer_range": 0.02,
+   "layer_norm_epsilon": 1e-12,
+   "max_trained_positions": 2048,
+   "mlp_fc1_bias": false,
+   "mlp_fc2_bias": false,
+   "model_type": "nomic_bert",
+   "n_embd": 768,
+   "n_head": 12,
+   "n_inner": 3072,
+   "n_layer": 12,
+   "n_positions": 8192,
+   "pad_vocab_size_multiple": 64,
+   "parallel_block": false,
+   "parallel_block_tied_norm": false,
+   "prenorm": false,
+   "qkv_proj_bias": false,
+   "reorder_and_upcast_attn": false,
+   "resid_pdrop": 0.0,
+   "rotary_emb_base": 1000,
+   "rotary_emb_fraction": 1.0,
+   "rotary_emb_interleaved": false,
+   "rotary_emb_scale_base": null,
+   "rotary_scaling_factor": 2,
+   "scale_attn_by_inverse_layer_idx": false,
+   "scale_attn_weights": true,
+   "summary_activation": null,
+   "summary_first_dropout": 0.1,
+   "summary_proj_to_labels": true,
+   "summary_type": "cls_index",
+   "summary_use_proj": true,
+   "torch_dtype": "float32",
+   "transformers_version": "4.41.2",
+   "type_vocab_size": 2,
+   "use_cache": true,
+   "use_flash_attn": true,
+   "use_rms_norm": false,
+   "use_xentropy": true,
+   "vocab_size": 30528
+ }
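Note the `auto_map` entries: they point at custom NomicBert modeling code, so loading this checkpoint with plain transformers also requires `trust_remote_code=True` (a sketch):

```python
from transformers import AutoModel, AutoTokenizer

repo = "aniket0898/bge-base-financial-matryoshka"
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModel.from_pretrained(repo, trust_remote_code=True)  # resolves NomicBertModel via auto_map
```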
config_sentence_transformers.json ADDED
@@ -0,0 +1,10 @@
+ {
+   "__version__": {
+     "sentence_transformers": "3.2.1",
+     "transformers": "4.41.2",
+     "pytorch": "2.1.2+cu121"
+   },
+   "prompts": {},
+   "default_prompt_name": null,
+   "similarity_fn_name": null
+ }
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:e7f3db1d6afcc3addba9f02fd2162058c2f822326a0e7975f125b7e2d7e4582d
+ size 546938168
modules.json ADDED
@@ -0,0 +1,20 @@
+ [
+   {
+     "idx": 0,
+     "name": "0",
+     "path": "",
+     "type": "sentence_transformers.models.Transformer"
+   },
+   {
+     "idx": 1,
+     "name": "1",
+     "path": "1_Pooling",
+     "type": "sentence_transformers.models.Pooling"
+   },
+   {
+     "idx": 2,
+     "name": "2",
+     "path": "2_Normalize",
+     "type": "sentence_transformers.models.Normalize"
+   }
+ ]
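modules.json chains the three stages: the Transformer encoder at the repo root, the mean-pooling module in `1_Pooling`, and L2 normalization. Building an equivalent pipeline by hand would look roughly like this (a sketch; the exact `model_args`/`config_args` plumbing for remote code may differ by sentence-transformers version):

```python
from sentence_transformers import SentenceTransformer, models

transformer = models.Transformer(
    "aniket0898/bge-base-financial-matryoshka",
    max_seq_length=8192,
    model_args={"trust_remote_code": True},
    config_args={"trust_remote_code": True},
)
pooling = models.Pooling(transformer.get_word_embedding_dimension(), pooling_mode="mean")
normalize = models.Normalize()

model = SentenceTransformer(modules=[transformer, pooling, normalize])
```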
sentence_bert_config.json ADDED
@@ -0,0 +1,4 @@
+ {
+   "max_seq_length": 8192,
+   "do_lower_case": false
+ }
special_tokens_map.json ADDED
@@ -0,0 +1,37 @@
+ {
+   "cls_token": {
+     "content": "[CLS]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "mask_token": {
+     "content": "[MASK]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "pad_token": {
+     "content": "[PAD]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "sep_token": {
+     "content": "[SEP]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "unk_token": {
+     "content": "[UNK]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   }
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,55 @@
+ {
+   "added_tokens_decoder": {
+     "0": {
+       "content": "[PAD]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "100": {
+       "content": "[UNK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "101": {
+       "content": "[CLS]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "102": {
+       "content": "[SEP]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "103": {
+       "content": "[MASK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "clean_up_tokenization_spaces": true,
+   "cls_token": "[CLS]",
+   "do_lower_case": true,
+   "mask_token": "[MASK]",
+   "model_max_length": 8192,
+   "pad_token": "[PAD]",
+   "sep_token": "[SEP]",
+   "strip_accents": null,
+   "tokenize_chinese_chars": true,
+   "tokenizer_class": "BertTokenizer",
+   "unk_token": "[UNK]"
+ }
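A quick sanity check of the tokenizer settings above (BERT-style vocabulary, lowercasing, 8192-token limit):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("aniket0898/bge-base-financial-matryoshka")
print(tokenizer.model_max_length)                # 8192
print(tokenizer.tokenize("Senior Notes"))        # lowercased word pieces, e.g. ['senior', 'notes']
print(tokenizer.cls_token, tokenizer.sep_token)  # [CLS] [SEP]
```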
vocab.txt ADDED
The diff for this file is too large to render. See raw diff