# SentenceTransformer based on google/embeddinggemma-300m

This is a sentence-transformers model finetuned from google/embeddinggemma-300m. It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
## Model Details

### Model Description
- Model Type: Sentence Transformer
- Base model: google/embeddinggemma-300m
- Maximum Sequence Length: 2048 tokens
- Output Dimensionality: 768 dimensions
- Similarity Function: Cosine Similarity

### Model Sources

- Documentation: [Sentence Transformers Documentation](https://sbert.net)
- Repository: [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
- Hugging Face: [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)

### Full Model Architecture

```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 2048, 'do_lower_case': False, 'architecture': 'Gemma3TextModel'})
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Dense({'in_features': 768, 'out_features': 3072, 'bias': False, 'activation_function': 'torch.nn.modules.linear.Identity'})
  (3): Dense({'in_features': 3072, 'out_features': 768, 'bias': False, 'activation_function': 'torch.nn.modules.linear.Identity'})
  (4): Normalize()
)
```
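
The stack above runs mean pooling over a Gemma3 text backbone, projects 768 → 3072 → 768 through two bias-free Dense layers, and L2-normalizes the result. As a quick sanity check, the architecture and output size can be verified after loading; a minimal sketch using only the public SentenceTransformer API:

```python
from sentence_transformers import SentenceTransformer

# Load the model and confirm the properties listed above
model = SentenceTransformer("yasserrmd/geo-gemma-300m-emb")
print(model)                                     # the module stack shown above
print(model.get_sentence_embedding_dimension())  # 768
print(model.max_seq_length)                      # 2048
```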

## Usage

### Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

```bash
pip install -U sentence-transformers
```

Then you can load this model and run inference.
```python
from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("yasserrmd/geo-gemma-300m-emb")

# Run inference
queries = [
    "Based on the Brine Shrimp Lethality Test (BSLT), what are the toxicity levels of liquid smoke from cocoa pod skin at various pyrolysis temperatures and water contents?",
]
documents = [
    'The Brine Shrimp Lethality Test (BSLT) was used to determine the toxicity levels of liquid smoke from cocoa pod skin at various pyrolysis temperatures and water contents. The results showed that the LC50 values (the concentration required to kill 50% of the test organisms) were as follows: at 200°C and 10% water content, 11,858.58 ppm; at 200°C and 15% water content, 13,094.23 ppm; at 200°C and 20% water content, 13,373.94 ppm; at 200°C and 25% water content, 15,703.52 ppm. At 300°C and 10% water content, 11,604.26 ppm; at 300°C and 15% water content, 11,673.05 ppm; at 300°C and 20% water content, 13,373.94 ppm; at 300°C and 25% water content, 13,373.94 ppm. At 400°C and 10% water content, 9,213.73 ppm; at 400°C and 15% water content, 13,094.237 ppm; at 400°C and 20% water content, 13,373.94 ppm; at 400°C and 25% water content, 12,493.63 ppm. All the results indicate that the liquid smoke from cocoa pod skin at different pyrolysis temperatures and water contents is classified as non-toxic.',
    'The estimated annual flood damage for agriculture and built-up areas in the Tajan watershed, northern Iran, is projected to surge from USD 162 million to USD 376 million and USD 91 million to USD 220 million, respectively, by 2040, considering the land use change scenarios from 2021 to 2040.',
    'The distribution of PM2.5 in Santa Ana, CA, tends to be higher in socioeconomically disadvantaged communities compared to other areas, highlighting environmental health inequities that persist in urban areas. This can inform policy decisions related to health equity and community access to resources.',
]
query_embeddings = model.encode_query(queries)
document_embeddings = model.encode_document(documents)
print(query_embeddings.shape, document_embeddings.shape)
# (1, 768) (3, 768)

# Get the similarity scores for the embeddings
similarities = model.similarity(query_embeddings, document_embeddings)
print(similarities)
# tensor([[0.5805, 0.0253, 0.0709]])
```
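
To turn the raw scores into a ranking, the snippet can be extended in place; this continuation reuses the `similarities` tensor and `documents` list defined above:

```python
import torch

# Sort the candidate documents for the first query, best match first
ranking = torch.argsort(similarities[0], descending=True)
for rank, idx in enumerate(ranking.tolist(), start=1):
    print(f"{rank}. score={similarities[0, idx]:.4f}  {documents[idx][:60]}...")
```

Because the embeddings are L2-normalized by the final `Normalize()` module, these cosine similarity scores are equivalent to plain dot products.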

## Training Details

### Training Dataset

#### Unnamed Dataset

- Size: 41,432 training samples
- Columns: `sentence_0` and `sentence_1`
- Approximate statistics based on the first 1000 samples:

  |         | sentence_0 | sentence_1 |
  |:--------|:-----------|:-----------|
  | type    | string     | string     |
  | details | min: 12 tokens<br>mean: 27.1 tokens<br>max: 71 tokens | min: 17 tokens<br>mean: 119.32 tokens<br>max: 413 tokens |
- Samples:

  | sentence_0 | sentence_1 |
  |:-----------|:-----------|
  | How does plastic debris from land-based sources impact the ocean, particularly in the context of First Long Beach, China? | Plastic debris from land-based sources can significantly impact the ocean, as seen in the study conducted at First Long Beach (FLB), China. The study found that plastic debris amounts ranged from 2 to 82 particles per square meter on this marine sand beach. The most common size of plastics was 0.5–2.5 cm (44.4%), and the most common color was white (60.9%). The most abundant shape of plastic debris was fragments (76.2%). The amount of plastic debris varied significantly between different transects along the land-based source input zone due to the impacts of wind, ocean currents, and waves. Land-based wastewater discharge was identified as a major source of plastic debris on FLB, influenced by coastal water tide variations. Reduction strategies should focus on tracing and managing these land-based sources to mitigate the impact of plastic debris on the ocean. |
  | How does the concentration of SO2 in urban areas of Nanjing correlate with the normalized difference vegetation index (NDVI), and what does this imply for public health? | The concentration of SO2 in urban areas of Nanjing exhibits a strong correlation (coefficient of determination, R2 > 0.5) with the normalized difference vegetation index (NDVI) within a radial distance of 2 km from the air pollutant monitoring sites. This indicates that NDVI can be an effective indicator for assessing the distribution and concentrations of air pollutants such as SO2. Negative correlations between NDVI and socio-economic indicators are observed under relatively consistent natural conditions, including climate and terrain. Therefore, the spatiotemporal distribution patterns of NDVI can provide valuable insights not only into socio-economic growth but also into the levels and locations of air pollution concentrations, which is crucial for public health interventions and policies. |
  | How has the rise of user-generated geodata impacted the role of traditional map producers? | The rise of user-generated geodata has transformed ordinary citizens into neogeographers, blurring the boundaries between traditional map producers, such as national mapping agencies and local authorities, and citizens as consumers of this information. Citizens now actively participate in mapping different types of features on the Earth’s surface as volunteers, either by providing observations on the ground or tracing data from other sources, such as aerial photographs or satellite imagery. This has resulted in a significant increase in the availability of rich spatial datasets, which are often openly accessible through platforms like OpenStreetMap (OSM) and Ushahidi. |
- Loss: `MultipleNegativesRankingLoss` with these parameters:

  ```json
  {
      "scale": 20.0,
      "similarity_fct": "cos_sim",
      "gather_across_devices": false
  }
  ```
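
For reference, a loss with these parameters maps onto the Sentence Transformers API as follows; this is a minimal sketch, not the author's training script (`gather_across_devices` is left at its default of `false`, which only matters for multi-GPU runs):

```python
from sentence_transformers import SentenceTransformer, losses, util

model = SentenceTransformer("google/embeddinggemma-300m")
# scale=20.0 and cosine similarity mirror the parameters listed above
loss = losses.MultipleNegativesRankingLoss(model, scale=20.0, similarity_fct=util.cos_sim)
```

With this loss, each `sentence_1` passage serves as the positive for its own `sentence_0` query and as an in-batch negative for every other query, which is why the dataset needs no explicit negative column.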

### Training Hyperparameters

#### Non-Default Hyperparameters

- `num_train_epochs`: 1
- `multi_dataset_batch_sampler`: round_robin

#### All Hyperparameters

<details><summary>Click to expand</summary>

- `overwrite_output_dir`: False
- `do_predict`: False
- `eval_strategy`: no
- `prediction_loss_only`: True
- `per_device_train_batch_size`: 8
- `per_device_eval_batch_size`: 8
- `per_gpu_train_batch_size`: None
- `per_gpu_eval_batch_size`: None
- `gradient_accumulation_steps`: 1
- `eval_accumulation_steps`: None
- `torch_empty_cache_steps`: None
- `learning_rate`: 5e-05
- `weight_decay`: 0.0
- `adam_beta1`: 0.9
- `adam_beta2`: 0.999
- `adam_epsilon`: 1e-08
- `max_grad_norm`: 1
- `num_train_epochs`: 1
- `max_steps`: -1
- `lr_scheduler_type`: linear
- `lr_scheduler_kwargs`: {}
- `warmup_ratio`: 0.0
- `warmup_steps`: 0
- `log_level`: passive
- `log_level_replica`: warning
- `log_on_each_node`: True
- `logging_nan_inf_filter`: True
- `save_safetensors`: True
- `save_on_each_node`: False
- `save_only_model`: False
- `restore_callback_states_from_checkpoint`: False
- `no_cuda`: False
- `use_cpu`: False
- `use_mps_device`: False
- `seed`: 42
- `data_seed`: None
- `jit_mode_eval`: False
- `use_ipex`: False
- `bf16`: False
- `fp16`: False
- `fp16_opt_level`: O1
- `half_precision_backend`: auto
- `bf16_full_eval`: False
- `fp16_full_eval`: False
- `tf32`: None
- `local_rank`: 0
- `ddp_backend`: None
- `tpu_num_cores`: None
- `tpu_metrics_debug`: False
- `debug`: []
- `dataloader_drop_last`: False
- `dataloader_num_workers`: 0
- `dataloader_prefetch_factor`: None
- `past_index`: -1
- `disable_tqdm`: False
- `remove_unused_columns`: True
- `label_names`: None
- `load_best_model_at_end`: False
- `ignore_data_skip`: False
- `fsdp`: []
- `fsdp_min_num_params`: 0
- `fsdp_config`: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
- `fsdp_transformer_layer_cls_to_wrap`: None
- `accelerator_config`: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
- `parallelism_config`: None
- `deepspeed`: None
- `label_smoothing_factor`: 0.0
- `optim`: adamw_torch_fused
- `optim_args`: None
- `adafactor`: False
- `group_by_length`: False
- `length_column_name`: length
- `ddp_find_unused_parameters`: None
- `ddp_bucket_cap_mb`: None
- `ddp_broadcast_buffers`: False
- `dataloader_pin_memory`: True
- `dataloader_persistent_workers`: False
- `skip_memory_metrics`: True
- `use_legacy_prediction_loop`: False
- `push_to_hub`: False
- `resume_from_checkpoint`: None
- `hub_model_id`: None
- `hub_strategy`: every_save
- `hub_private_repo`: None
- `hub_always_push`: False
- `hub_revision`: None
- `gradient_checkpointing`: False
- `gradient_checkpointing_kwargs`: None
- `include_inputs_for_metrics`: False
- `include_for_metrics`: []
- `eval_do_concat_batches`: True
- `fp16_backend`: auto
- `push_to_hub_model_id`: None
- `push_to_hub_organization`: None
- `mp_parameters`:
- `auto_find_batch_size`: False
- `full_determinism`: False
- `torchdynamo`: None
- `ray_scope`: last
- `ddp_timeout`: 1800
- `torch_compile`: False
- `torch_compile_backend`: None
- `torch_compile_mode`: None
- `include_tokens_per_second`: False
- `include_num_input_tokens_seen`: False
- `neftune_noise_alpha`: None
- `optim_target_modules`: None
- `batch_eval_metrics`: False
- `eval_on_start`: False
- `use_liger_kernel`: False
- `liger_kernel_config`: None
- `eval_use_gather_object`: False
- `average_tokens_across_devices`: False
- `prompts`: None
- `batch_sampler`: batch_sampler
- `multi_dataset_batch_sampler`: round_robin
- `router_mapping`: {}
- `learning_rate_mapping`: {}

</details>
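
Putting the pieces together, a comparable run could be wired up as below. This is a hedged sketch: the inline two-pair dataset is a stand-in for the actual 41,432-sample dataset (which is not released), and only the non-default values listed above are set explicitly.

```python
from datasets import Dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
    losses,
)

model = SentenceTransformer("google/embeddinggemma-300m")

# Stand-in for the unreleased training data, using the column
# names described under Training Dataset
train_dataset = Dataset.from_dict({
    "sentence_0": ["example question one", "example question two"],
    "sentence_1": ["passage answering question one", "passage answering question two"],
})

args = SentenceTransformerTrainingArguments(
    output_dir="geo-gemma-300m-emb",
    num_train_epochs=1,                         # non-default value listed above
    per_device_train_batch_size=8,
    multi_dataset_batch_sampler="round_robin",  # non-default value listed above
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    loss=losses.MultipleNegativesRankingLoss(model, scale=20.0),
)
trainer.train()
```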

### Training Logs

| Epoch  | Step | Training Loss |
|:------:|:----:|:-------------:|
| 0.0965 | 500  | 0.012         |
| 0.1931 | 1000 | 0.006         |
| 0.2896 | 1500 | 0.0057        |
| 0.3862 | 2000 | 0.0045        |
| 0.4827 | 2500 | 0.0024        |
| 0.5793 | 3000 | 0.0013        |
| 0.6758 | 3500 | 0.0025        |
| 0.7723 | 4000 | 0.0029        |
| 0.8689 | 4500 | 0.0012        |
| 0.9654 | 5000 | 0.0004        |

## Framework Versions
- Python: 3.12.11
- Sentence Transformers: 5.1.0
- Transformers: 4.56.1
- PyTorch: 2.8.0+cu128
- Accelerate: 1.10.1
- Datasets: 4.0.0
- Tokenizers: 0.22.0

## Citation

### BibTeX

#### Sentence Transformers

```bibtex
@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}
```

#### MultipleNegativesRankingLoss

```bibtex
@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
```