Exploration with a more stable RL pipeline with outcome-only reward and scaled-up LLMs.
Bowen
PeterJinGo
Recent Activity
Updated a model about 5 hours ago: Cell-O1/cello1_qwen7bit_sft_4epoch
Published a model about 5 hours ago: Cell-O1/cello1_qwen7bit_sft_4epoch
Upvoted a paper 4 days ago: RM-R1: Reward Modeling as Reasoning
Collections
2
Preliminary checkpoints with outcome-only RL.
Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning
Paper • 2503.09516 • Published • 30
PeterJinGo/SearchR1-nq_hotpotqa_train-llama3.2-3b-em-ppo
Updated • 222
PeterJinGo/SearchR1-nq_hotpotqa_train-llama3.2-3b-em-grpo
Updated • 2
PeterJinGo/SearchR1-nq_hotpotqa_train-llama3.2-3b-it-em-ppo
Updated • 14
models
34
PeterJinGo/SearchR1-nq_hotpotqa_train-qwen2.5-14b-it-em-grpo-v0.3
Updated • 1
PeterJinGo/SearchR1-nq_hotpotqa_train-qwen2.5-14b-em-grpo-v0.3
Updated • 1
PeterJinGo/SearchR1-nq_hotpotqa_train-qwen2.5-14b-em-ppo-v0.3
Updated • 1
PeterJinGo/SearchR1-nq_hotpotqa_train-qwen2.5-14b-em-ppo-v0.2
Updated • 154
PeterJinGo/SearchR1-nq_hotpotqa_train-qwen2.5-14b-it-em-ppo-v0.2
Updated • 2
PeterJinGo/R1-nq_hotpotqa_train-qwen2.5-3b-em-ppo-v0.2
Updated • 9
PeterJinGo/R1-nq_hotpotqa_train-qwen2.5-3b-it-em-ppo-v0.2
Updated • 3
PeterJinGo/R1-nq_hotpotqa_train-qwen2.5-14b-em-ppo-v0.2
Updated
PeterJinGo/R1-nq_hotpotqa_train-qwen2.5-14b-it-em-ppo-v0.2
Updated • 1
PeterJinGo/R1-nq_hotpotqa_train-qwen2.5-7b-it-em-ppo-v0.2
Updated • 2
datasets
13
PeterJinGo/wiki-18-e5-index-HNSW64
Updated • 180
PeterJinGo/wiki-18-bm25-index
Updated • 131
PeterJinGo/nq_hotpotqa_train
Viewer • Updated • 221k • 423 • 2
PeterJinGo/wiki-18-e5-index
Updated • 2.37k
PeterJinGo/wiki-18-corpus
Updated • 1.76k
PeterJinGo/ultrafeedback_first_5000
Viewer • Updated • 5k • 8
PeterJinGo/gsm8k-chat
Viewer • Updated • 7.47k • 18
PeterJinGo/math-zeroshot-chat
Viewer • Updated • 7.5k • 18
PeterJinGo/math-zeroshot
Viewer • Updated • 7.5k • 22
PeterJinGo/math2
Viewer • Updated • 7.5k • 18