---
language:
- en
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- dense
- generated_from_trainer
- dataset_size:2699
- loss:CachedMultipleNegativesRankingLoss
base_model: google/embeddinggemma-300m
widget:
- source_sentence: >-
    For a conventional conforming loan, what are the common down payment
    amounts?
  sentences:
  - fannie_mae_selling_guide_chunk_001
  - fannie_mae_selling_guide
  - >-
    Standard down payment options for a conventional conforming loan range
    from 3% to 20% of the purchase price.
- source_sentence: How is the cash required for mortgage reserves confirmed by a lender?
  sentences:
  - freddie_mac_guide
  - >-
    Lenders verify assets by requiring two months of consecutive statements
    for all checking, savings, and investment accounts.
  - freddie_mac_guide_chunk_017
- source_sentence: >-
    What are the different types of mortgage rate locks available to
    borrowers?
  sentences:
  - freddie_mac_guide
  - >-
    Common rate locks include a 30-day lock, a 45-day lock, and a 60-day
    lock, with longer locks sometimes incurring a small fee.
  - freddie_mac_guide_chunk_014
- source_sentence: How do lenders verify a borrower's assets for reserves?
  sentences:
  - va_chapter2_eligibility_chunk_001
  - >-
    Lenders verify assets by obtaining the two most recent monthly
    statements for all checking, savings, and investment accounts, ensuring
    the funds have been sourced and seasoned.
  - va_chapter2_eligibility
- source_sentence: When is a borrower eligible for a streamline refinance?
  sentences:
  - >-
    A borrower is eligible for a streamline refinance if they have made at
    least six consecutive on-time payments on their current mortgage.
  - fha_handbook_4000_1
  - fha_handbook_4000_1_chunk_007
datasets:
- sugiv/mortgage-qa-dataset
pipeline_tag: sentence-similarity
library_name: sentence-transformers
metrics:
- cosine_accuracy@1
- cosine_accuracy@3
- cosine_accuracy@5
- cosine_accuracy@10
- cosine_precision@1
- cosine_precision@3
- cosine_precision@5
- cosine_precision@10
- cosine_recall@1
- cosine_recall@3
- cosine_recall@5
- cosine_recall@10
- cosine_ndcg@10
- cosine_mrr@10
- cosine_map@100
model-index:
- name: SentenceTransformer based on google/embeddinggemma-300m
  results:
  - task:
      type: information-retrieval
      name: Information Retrieval
    dataset:
      name: mortgage eval
      type: mortgage-eval
    metrics:
    - type: cosine_accuracy@1
      value: 0.34421364985163205
      name: Cosine Accuracy@1
    - type: cosine_accuracy@3
      value: 0.6468842729970327
      name: Cosine Accuracy@3
    - type: cosine_accuracy@5
      value: 0.7863501483679525
      name: Cosine Accuracy@5
    - type: cosine_accuracy@10
      value: 0.9317507418397626
      name: Cosine Accuracy@10
    - type: cosine_precision@1
      value: 0.34421364985163205
      name: Cosine Precision@1
    - type: cosine_precision@3
      value: 0.21562809099901084
      name: Cosine Precision@3
    - type: cosine_precision@5
      value: 0.1572700296735905
      name: Cosine Precision@5
    - type: cosine_precision@10
      value: 0.09317507418397623
      name: Cosine Precision@10
    - type: cosine_recall@1
      value: 0.34421364985163205
      name: Cosine Recall@1
    - type: cosine_recall@3
      value: 0.6468842729970327
      name: Cosine Recall@3
    - type: cosine_recall@5
      value: 0.7863501483679525
      name: Cosine Recall@5
    - type: cosine_recall@10
      value: 0.9317507418397626
      name: Cosine Recall@10
    - type: cosine_ndcg@10
      value: 0.620931422633939
      name: Cosine Ndcg@10
    - type: cosine_mrr@10
      value: 0.5233208515849467
      name: Cosine Mrr@10
    - type: cosine_map@100
      value: 0.5285370057494337
      name: Cosine Map@100
  - task:
      type: information-retrieval
      name: Information Retrieval
    dataset:
      name: mortgage test
      type: mortgage-test
    metrics:
    - type: cosine_accuracy@1
      value: 0.2781065088757396
      name: Cosine Accuracy@1
    - type: cosine_accuracy@3
      value: 0.5621301775147929
      name: Cosine Accuracy@3
    - type: cosine_accuracy@5
      value: 0.7100591715976331
      name: Cosine Accuracy@5
    - type: cosine_accuracy@10
      value: 0.8727810650887574
      name: Cosine Accuracy@10
    - type: cosine_precision@1
      value: 0.2781065088757396
      name: Cosine Precision@1
    - type: cosine_precision@3
      value: 0.18737672583826429
      name: Cosine Precision@3
    - type: cosine_precision@5
      value: 0.14201183431952663
      name: Cosine Precision@5
    - type: cosine_precision@10
      value: 0.08727810650887573
      name: Cosine Precision@10
    - type: cosine_recall@1
      value: 0.2781065088757396
      name: Cosine Recall@1
    - type: cosine_recall@3
      value: 0.5621301775147929
      name: Cosine Recall@3
    - type: cosine_recall@5
      value: 0.7100591715976331
      name: Cosine Recall@5
    - type: cosine_recall@10
      value: 0.8727810650887574
      name: Cosine Recall@10
    - type: cosine_ndcg@10
      value: 0.553056381202397
      name: Cosine Ndcg@10
    - type: cosine_mrr@10
      value: 0.4530313703390625
      name: Cosine Mrr@10
    - type: cosine_map@100
      value: 0.46216201514766103
      name: Cosine Map@100
---
SentenceTransformer based on google/embeddinggemma-300m
This is a sentence-transformers model finetuned from google/embeddinggemma-300m on the mortgage-qa-dataset dataset. It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
Model Details
Model Description
- Model Type: Sentence Transformer
- Base model: google/embeddinggemma-300m
- Maximum Sequence Length: 2048 tokens
- Output Dimensionality: 768 dimensions
- Similarity Function: Cosine Similarity
- Training Dataset: mortgage-qa-dataset
- Language: en
Model Sources
- Documentation: Sentence Transformers Documentation
- Repository: Sentence Transformers on GitHub
- Hugging Face: Sentence Transformers on Hugging Face
Full Model Architecture
```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 2048, 'do_lower_case': False, 'architecture': 'Gemma3TextModel'})
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Dense({'in_features': 768, 'out_features': 3072, 'bias': False, 'activation_function': 'torch.nn.modules.linear.Identity'})
  (3): Dense({'in_features': 3072, 'out_features': 768, 'bias': False, 'activation_function': 'torch.nn.modules.linear.Identity'})
  (4): Normalize()
)
```
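The module stack can be inspected programmatically. A minimal sketch (assuming the checkpoint is available on the Hub under the repository name used below):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sugiv/embeddinggemma-300m-mortgage")

# Walk the module stack listed above:
# Transformer -> Pooling -> Dense -> Dense -> Normalize
for idx, module in enumerate(model):
    print(idx, module.__class__.__name__)

print(model.get_sentence_embedding_dimension())  # 768
print(model.max_seq_length)                      # 2048
```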
Usage
Direct Usage (Sentence Transformers)
First install the Sentence Transformers library:
```bash
pip install -U sentence-transformers
```
Then you can load this model and run inference.
```python
from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("sugiv/embeddinggemma-300m-mortgage")

# Run inference
queries = [
    "When is a borrower eligible for a streamline refinance?",
]
documents = [
    'A borrower is eligible for a streamline refinance if they have made at least six consecutive on-time payments on their current mortgage.',
    'fha_handbook_4000_1',
    'fha_handbook_4000_1_chunk_007',
]
query_embeddings = model.encode_query(queries)
document_embeddings = model.encode_document(documents)
print(query_embeddings.shape, document_embeddings.shape)
# [1, 768] [3, 768]

# Get the similarity scores for the embeddings
similarities = model.similarity(query_embeddings, document_embeddings)
print(similarities)
# tensor([[ 0.8276, -0.0791, -0.0792]])
```
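The score matrix from model.similarity can drive a small retrieval loop directly. A follow-on sketch building on the variables defined above (the ranking logic here is illustrative, not part of the released model):

```python
# Sort candidate documents for the query by descending cosine similarity
score_list = similarities[0].tolist()
order = sorted(range(len(documents)), key=lambda i: score_list[i], reverse=True)

for rank, i in enumerate(order, start=1):
    print(f"{rank}. ({score_list[i]:.4f}) {documents[i][:60]}")
```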
Evaluation
Metrics
Information Retrieval
- Datasets: mortgage-eval and mortgage-test
- Evaluated with InformationRetrievalEvaluator
Metric | mortgage-eval | mortgage-test |
---|---|---|
cosine_accuracy@1 | 0.3442 | 0.2781 |
cosine_accuracy@3 | 0.6469 | 0.5621 |
cosine_accuracy@5 | 0.7864 | 0.7101 |
cosine_accuracy@10 | 0.9318 | 0.8728 |
cosine_precision@1 | 0.3442 | 0.2781 |
cosine_precision@3 | 0.2156 | 0.1874 |
cosine_precision@5 | 0.1573 | 0.142 |
cosine_precision@10 | 0.0932 | 0.0873 |
cosine_recall@1 | 0.3442 | 0.2781 |
cosine_recall@3 | 0.6469 | 0.5621 |
cosine_recall@5 | 0.7864 | 0.7101 |
cosine_recall@10 | 0.9318 | 0.8728 |
cosine_ndcg@10 | 0.6209 | 0.5531 |
cosine_mrr@10 | 0.5233 | 0.453 |
cosine_map@100 | 0.5285 | 0.4622 |
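These figures can be reproduced with the evaluator class named above against any query/corpus pair. A minimal sketch with a toy two-document corpus (the IDs and texts below are placeholders; the reported numbers used the full held-out splits):

```python
from sentence_transformers.evaluation import InformationRetrievalEvaluator

queries = {"q1": "When is a borrower eligible for a streamline refinance?"}
corpus = {
    "d1": "A borrower is eligible for a streamline refinance if they have made "
          "at least six consecutive on-time payments on their current mortgage.",
    "d2": "Common rate locks include a 30-day lock, a 45-day lock, and a 60-day lock.",
}
relevant_docs = {"q1": {"d1"}}  # relevant corpus IDs per query

evaluator = InformationRetrievalEvaluator(queries, corpus, relevant_docs, name="mortgage-eval")
results = evaluator(model)
print(results["mortgage-eval_cosine_ndcg@10"])
```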
Training Details
Training Dataset
mortgage-qa-dataset
- Dataset: mortgage-qa-dataset at de29792
- Size: 2,699 training samples
- Columns: question, answer, source_document, and source_chunk
- Approximate statistics based on the first 1000 samples:

 | question | answer | source_document | source_chunk |
---|---|---|---|---|
type | string | string | string | string |
details | min: 9 tokens, mean: 16.15 tokens, max: 27 tokens | min: 14 tokens, mean: 31.67 tokens, max: 62 tokens | min: 8 tokens, mean: 9.99 tokens, max: 14 tokens | min: 14 tokens, mean: 15.99 tokens, max: 20 tokens |
- Samples:

question | answer | source_document | source_chunk |
---|---|---|---|
When is a full appraisal required for a mortgage application? | A full appraisal is required for most transactions, but it can be waived for certain streamlined refinance programs if the Loan-to-Value (LTV) ratio is 90% or less. | fha_handbook_4000_1 | fha_handbook_4000_1_chunk_005 |
When getting a mortgage, who orders the title insurance for the lender? | While often coordinated by the settlement agent, the lender typically requires and is the ultimate recipient of the lender's title insurance policy to protect their financial interest. | va_chapter4_underwriting | va_chapter4_underwriting_chunk_012 |
What components of a loan application does an underwriter assess? | Underwriters analyze the four Cs of credit: Capacity (income and DTI), Capital (assets and reserves), Collateral (property value), and Credit (credit history and score). | va_chapter5_processing | va_chapter5_processing_chunk_005 |
- Loss: CachedMultipleNegativesRankingLoss with these parameters (sketched in code below):

```json
{
    "scale": 20.0,
    "similarity_fct": "cos_sim",
    "mini_batch_size": 32,
    "gather_across_devices": false
}
```
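In code, loading the training split and instantiating the loss with the parameters above might look as follows (the split name is an assumption about the Hub dataset's layout):

```python
from datasets import load_dataset
from sentence_transformers import SentenceTransformer
from sentence_transformers.losses import CachedMultipleNegativesRankingLoss

model = SentenceTransformer("google/embeddinggemma-300m")
train_ds = load_dataset("sugiv/mortgage-qa-dataset", split="train")  # split name assumed

# scale is the inverse temperature applied to cosine similarities;
# mini_batch_size sets the gradient-caching chunk size, trading speed for memory
loss = CachedMultipleNegativesRankingLoss(model, scale=20.0, mini_batch_size=32)
```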
Evaluation Dataset
mortgage-qa-dataset
- Dataset: mortgage-qa-dataset at de29792
- Size: 337 evaluation samples
- Columns: question, answer, source_document, and source_chunk
- Approximate statistics based on the first 337 samples:

 | question | answer | source_document | source_chunk |
---|---|---|---|---|
type | string | string | string | string |
details | min: 9 tokens, mean: 16.44 tokens, max: 28 tokens | min: 17 tokens, mean: 32.28 tokens, max: 62 tokens | min: 8 tokens, mean: 10.14 tokens, max: 14 tokens | min: 14 tokens, mean: 16.14 tokens, max: 20 tokens |
- Samples:

question | answer | source_document | source_chunk |
---|---|---|---|
What financial metrics are crucial for an AUS to issue an approval? | Key AUS factors include credit score, loan-to-value ratio, debt-to-income ratio, and the overall strength and stability of the borrower's financial profile. | va_chapter4_underwriting | va_chapter4_underwriting_chunk_017 |
Can you explain how an LTV ratio is figured out? | The LTV ratio is calculated by dividing the mortgage loan amount by the appraised value or purchase price of the property, whichever is lower. | fannie_mae_servicing_guide | fannie_mae_servicing_guide_chunk_002 |
How do lenders verify a borrower's employment history? | Lenders verify employment by contacting employers directly and typically require a two-year history, which may be confirmed via recent pay stubs and W-2 forms. | freddie_mac_guide | freddie_mac_guide_chunk_002 |
- Loss: CachedMultipleNegativesRankingLoss with these parameters:

```json
{
    "scale": 20.0,
    "similarity_fct": "cos_sim",
    "mini_batch_size": 32,
    "gather_across_devices": false
}
```
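The evaluation split can be loaded the same way (again, the split name is an assumption):

```python
eval_ds = load_dataset("sugiv/mortgage-qa-dataset", split="validation")  # split name assumed
print(len(eval_ds))  # 337 per the statistics above
```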
Training Hyperparameters
Non-Default Hyperparameters
- eval_strategy: steps
- per_device_train_batch_size: 16
- per_device_eval_batch_size: 16
- learning_rate: 3e-06
- num_train_epochs: 4
- warmup_steps: 100
- fp16: True
- load_best_model_at_end: True
- batch_sampler: no_duplicates
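A sketch of how these non-default values map onto SentenceTransformerTrainingArguments (output_dir is an assumption; the rest mirrors the list above):

```python
from sentence_transformers import SentenceTransformerTrainingArguments
from sentence_transformers.training_args import BatchSamplers

args = SentenceTransformerTrainingArguments(
    output_dir="embeddinggemma-300m-mortgage",  # assumed output path
    eval_strategy="steps",
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    learning_rate=3e-6,
    num_train_epochs=4,
    warmup_steps=100,
    fp16=True,
    load_best_model_at_end=True,
    batch_sampler=BatchSamplers.NO_DUPLICATES,  # avoid duplicate texts within a batch
)
```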
All Hyperparameters
Click to expand
- overwrite_output_dir: False
- do_predict: False
- eval_strategy: steps
- prediction_loss_only: True
- per_device_train_batch_size: 16
- per_device_eval_batch_size: 16
- per_gpu_train_batch_size: None
- per_gpu_eval_batch_size: None
- gradient_accumulation_steps: 1
- eval_accumulation_steps: None
- torch_empty_cache_steps: None
- learning_rate: 3e-06
- weight_decay: 0.0
- adam_beta1: 0.9
- adam_beta2: 0.999
- adam_epsilon: 1e-08
- max_grad_norm: 1.0
- num_train_epochs: 4
- max_steps: -1
- lr_scheduler_type: linear
- lr_scheduler_kwargs: {}
- warmup_ratio: 0.0
- warmup_steps: 100
- log_level: passive
- log_level_replica: warning
- log_on_each_node: True
- logging_nan_inf_filter: True
- save_safetensors: True
- save_on_each_node: False
- save_only_model: False
- restore_callback_states_from_checkpoint: False
- no_cuda: False
- use_cpu: False
- use_mps_device: False
- seed: 42
- data_seed: None
- jit_mode_eval: False
- use_ipex: False
- bf16: False
- fp16: True
- fp16_opt_level: O1
- half_precision_backend: auto
- bf16_full_eval: False
- fp16_full_eval: False
- tf32: None
- local_rank: 0
- ddp_backend: None
- tpu_num_cores: None
- tpu_metrics_debug: False
- debug: []
- dataloader_drop_last: False
- dataloader_num_workers: 0
- dataloader_prefetch_factor: None
- past_index: -1
- disable_tqdm: False
- remove_unused_columns: True
- label_names: None
- load_best_model_at_end: True
- ignore_data_skip: False
- fsdp: []
- fsdp_min_num_params: 0
- fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
- fsdp_transformer_layer_cls_to_wrap: None
- accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
- parallelism_config: None
- deepspeed: None
- label_smoothing_factor: 0.0
- optim: adamw_torch_fused
- optim_args: None
- adafactor: False
- group_by_length: False
- length_column_name: length
- ddp_find_unused_parameters: None
- ddp_bucket_cap_mb: None
- ddp_broadcast_buffers: False
- dataloader_pin_memory: True
- dataloader_persistent_workers: False
- skip_memory_metrics: True
- use_legacy_prediction_loop: False
- push_to_hub: False
- resume_from_checkpoint: None
- hub_model_id: None
- hub_strategy: every_save
- hub_private_repo: None
- hub_always_push: False
- hub_revision: None
- gradient_checkpointing: False
- gradient_checkpointing_kwargs: None
- include_inputs_for_metrics: False
- include_for_metrics: []
- eval_do_concat_batches: True
- fp16_backend: auto
- push_to_hub_model_id: None
- push_to_hub_organization: None
- mp_parameters:
- auto_find_batch_size: False
- full_determinism: False
- torchdynamo: None
- ray_scope: last
- ddp_timeout: 1800
- torch_compile: False
- torch_compile_backend: None
- torch_compile_mode: None
- include_tokens_per_second: False
- include_num_input_tokens_seen: False
- neftune_noise_alpha: None
- optim_target_modules: None
- batch_eval_metrics: False
- eval_on_start: False
- use_liger_kernel: False
- liger_kernel_config: None
- eval_use_gather_object: False
- average_tokens_across_devices: False
- prompts: None
- batch_sampler: no_duplicates
- multi_dataset_batch_sampler: proportional
- router_mapping: {}
- learning_rate_mapping: {}
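Tying the pieces together, a minimal training loop under the assumptions of the earlier sketches (model, train_ds, eval_ds, loss, args, and evaluator as defined above; training starts from the base checkpoint):

```python
from sentence_transformers import SentenceTransformerTrainer

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    eval_dataset=eval_ds,
    loss=loss,
    evaluator=evaluator,  # the InformationRetrievalEvaluator from the Evaluation section
)
trainer.train()
```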
Training Logs
Epoch | Step | Training Loss | Validation Loss | mortgage-eval_cosine_ndcg@10 | mortgage-test_cosine_ndcg@10 |
---|---|---|---|---|---|
-1 | -1 | - | - | 0.5803 | - |
0.1479 | 25 | 0.1574 | - | - | - |
0.2959 | 50 | 0.1053 | 0.0722 | 0.5993 | - |
0.4438 | 75 | 0.0969 | - | - | - |
0.5917 | 100 | 0.0765 | 0.0773 | 0.6085 | - |
0.7396 | 125 | 0.079 | - | - | - |
0.8876 | 150 | 0.0802 | 0.0858 | 0.6056 | - |
1.0355 | 175 | 0.021 | - | - | - |
1.1834 | 200 | 0.0728 | 0.0549 | 0.6093 | - |
1.3314 | 225 | 0.0857 | - | - | - |
1.4793 | 250 | 0.071 | 0.0659 | 0.6145 | - |
1.6272 | 275 | 0.0633 | - | - | - |
**1.7751** | **300** | **0.1844** | **0.0687** | **0.6209** | **-** |
1.9231 | 325 | 0.0545 | - | - | - |
2.0710 | 350 | 0.0474 | 0.0646 | 0.6025 | - |
2.2189 | 375 | 0.0702 | - | - | - |
2.3669 | 400 | 0.0831 | 0.0699 | 0.6026 | - |
2.5148 | 425 | 0.0635 | - | - | - |
2.6627 | 450 | 0.103 | 0.0674 | 0.6031 | - |
2.8107 | 475 | 0.097 | - | - | - |
2.9586 | 500 | 0.077 | 0.0686 | 0.6032 | - |
3.1065 | 525 | 0.0713 | - | - | - |
3.2544 | 550 | 0.1617 | 0.0668 | 0.6087 | - |
3.4024 | 575 | 0.1084 | - | - | - |
3.5503 | 600 | 0.0791 | 0.0658 | 0.6038 | - |
3.6982 | 625 | 0.0477 | - | - | - |
3.8462 | 650 | 0.0956 | 0.0659 | 0.6073 | - |
3.9941 | 675 | 0.0587 | - | - | - |
-1 | -1 | - | - | 0.6209 | 0.5531 |
- The bold row denotes the saved checkpoint.
Framework Versions
- Python: 3.11.11
- Sentence Transformers: 5.1.0
- Transformers: 4.57.0.dev0
- PyTorch: 2.8.0.dev20250319+cu128
- Accelerate: 1.10.1
- Datasets: 4.0.0
- Tokenizers: 0.22.0
Citation
BibTeX
Sentence Transformers
```bibtex
@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}
```
CachedMultipleNegativesRankingLoss
```bibtex
@misc{gao2021scaling,
    title={Scaling Deep Contrastive Learning Batch Size under Memory Limited Setup},
    author={Luyu Gao and Yunyi Zhang and Jiawei Han and Jamie Callan},
    year={2021},
    eprint={2101.06983},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}
```