Update README.md
Browse files
README.md
CHANGED
@@ -90,7 +90,7 @@ SPLADE models are a fine balance between retrieval effectiveness (quality) and r
|
|
90 |
|
91 |
<details>
|
92 |
|
93 |
-
While ONLY a empirical analysis on large sample make sense here is a spot checking - a qualitatively example to give you an idea. Our models achieve par competitive effectiveness with **~
|
94 |
(We will show Quantitative results in the next section.)
|
95 |
|
96 |
So, **by design "how to beat SoTA MRR?" was never our goal**, Instead "At what cost can we achieve an acceptable effectiveness i.e. MRR@10". Non-chalantly reducing lambda values (位Q,位D, see above table) will achieve a better MRR.
|
@@ -100,19 +100,19 @@ But Lower lambda values = Higher FLOPS = More tokens = Poorer efficiency. This i
|
|
100 |
|
101 |
**Ours**
|
102 |
```python
|
103 |
-
number of actual dimensions:
|
104 |
SPLADE BOW rep:
|
105 |
-
[('stress', 2.
|
106 |
```
|
107 |
|
108 |
-
**naver/splade-cocondenser-ensembledistil** (SoTA, ~
|
109 |
```python
|
110 |
number of actual dimensions: 126
|
111 |
SPLADE BOW rep:
|
112 |
[('stress', 2.25), ('glass', 2.23), ('thermal', 2.18), ('glasses', 1.65), ('pan', 1.62), ('heat', 1.56), ('stressed', 1.42), ('crack', 1.31), ('break', 1.12), ('cracked', 1.1), ('hot', 0.93), ('created', 0.9), ('factors', 0.81), ('broken', 0.73), ('caused', 0.71), ('too', 0.71), ('damage', 0.69), ('if', 0.68), ('hotter', 0.65), ('governed', 0.61), ('heating', 0.59), ('temperature', 0.59), ('adjacent', 0.59), ('cause', 0.58), ('effect', 0.57), ('fracture', 0.56), ('bradford', 0.55), ('strain', 0.53), ('hammer', 0.51), ('brian', 0.48), ('error', 0.47), ('windows', 0.45), ('will', 0.45), ('reaction', 0.42), ('create', 0.42), ('windshield', 0.41), ('heated', 0.41), ('factor', 0.4), ('cracking', 0.39), ('failure', 0.38), ('mechanical', 0.38), ('when', 0.38), ('formed', 0.38), ('bolt', 0.38), ('mechanism', 0.37), ('warm', 0.37), ('areas', 0.36), ('area', 0.36), ('energy', 0.34), ('disorder', 0.33), ('barry', 0.33), ('shock', 0.32), ('determined', 0.32), ('gage', 0.32), ('sash', 0.31), ('theory', 0.31), ('level', 0.31), ('resistant', 0.31), ('brake', 0.3), ('window', 0.3), ('crash', 0.3), ('hazard', 0.29), ('##ink', 0.27), ('ceramic', 0.27), ('storm', 0.25), ('problem', 0.25), ('issue', 0.24), ('impact', 0.24), ('fridge', 0.24), ('injury', 0.23), ('ross', 0.22), ('causes', 0.22), ('affect', 0.21), ('pressure', 0.21), ('fatigue', 0.21), ('leak', 0.21), ('eye', 0.2), ('frank', 0.2), ('cool', 0.2), ('might', 0.19), ('gravity', 0.18), ('ray', 0.18), ('static', 0.18), ('collapse', 0.18), ('physics', 0.18), ('wave', 0.18), ('reflection', 0.17), ('parker', 0.17), ('strike', 0.17), ('hottest', 0.17), ('burst', 0.16), ('chance', 0.16), ('burn', 0.14), ('rubbing', 0.14), ('interference', 0.14), ('bailey', 0.13), ('vibration', 0.12), ('gilbert', 0.12), ('produced', 0.12), ('rock', 0.12), ('warmer', 0.11), ('get', 0.11), ('drink', 0.11), ('fireplace', 0.11), ('ruin', 0.1), ('brittle', 0.1), ('fragment', 0.1), ('stumble', 0.09), ('formation', 0.09), ('shatter', 0.08), ('great', 0.08), ('friction', 0.08), ('flash', 0.07), ('cracks', 0.07), ('levels', 0.07), ('smash', 0.04), ('fail', 0.04), ('fra', 0.04), ('##glass', 0.03), ('variables', 0.03), ('because', 0.02), ('knock', 0.02), ('sun', 0.02), ('crush', 0.01), ('##e', 0.01), ('anger', 0.01)]
|
113 |
```
|
114 |
|
115 |
-
**naver/splade-v2-distil** (~
|
116 |
```python
|
117 |
number of actual dimensions: 234
|
118 |
SPLADE BOW rep:
|
@@ -135,7 +135,7 @@ But it is unclear how well these hyperparameters are transferable to other domai
|
|
135 |
<details>
|
136 |
|
137 |
**Note: Why Anserini not PISA?** *Anserini is a production ready lucene based library. Common industry search deployments use Solr or elastic which are lucene based, hence the performance can be comparable. PISA latency is irrelevant for industry as it is a a research only system.*
|
138 |
-
The full [anserini evaluation log](
|
139 |
|
140 |
- **BEIR ZST OOD performance**: Will be added to the end of page.
|
141 |
|
|
|
90 |
|
91 |
<details>
|
92 |
|
93 |
+
While ONLY a empirical analysis on large sample make sense here is a spot checking - a qualitatively example to give you an idea. Our models achieve par competitive effectiveness with **~4% and ~48%, lesser tokens comparable SPLADE++ models including SoTA**.
|
94 |
(We will show Quantitative results in the next section.)
|
95 |
|
96 |
So, **by design "how to beat SoTA MRR?" was never our goal**, Instead "At what cost can we achieve an acceptable effectiveness i.e. MRR@10". Non-chalantly reducing lambda values (位Q,位D, see above table) will achieve a better MRR.
|
|
|
100 |
|
101 |
**Ours**
|
102 |
```python
|
103 |
+
number of actual dimensions: 121
|
104 |
SPLADE BOW rep:
|
105 |
+
[('stress', 2.42), ('thermal', 2.31), ('glass', 2.27), ('pan', 1.78), ('heat', 1.66), ('glasses', 1.58), ('crack', 1.42), ('anxiety', 1.36), ('break', 1.31), ('window', 0.91), ('heating', 0.84), ('hot', 0.82), ('adjacent', 0.82), ('hotter', 0.82), ('if', 0.75), ('cause', 0.7), ('caused', 0.7), ('create', 0.7), ('factors', 0.69), ('created', 0.68), ('cracks', 0.67), ('breaks', 0.67), ('area', 0.66), ('##glass', 0.66), ('cracked', 0.63), ('areas', 0.6), ('cracking', 0.59), ('windows', 0.58), ('effect', 0.56), ('causes', 0.56), ('ruin', 0.54), ('severe', 0.54), ('too', 0.53), ('flame', 0.5), ('collapse', 0.49), ('stresses', 0.49), ('or', 0.48), ('physics', 0.47), ('temperature', 0.46), ('get', 0.46), ('heated', 0.45), ('problem', 0.45), ('energy', 0.44), ('hottest', 0.42), ('phenomenon', 0.42), ('sweating', 0.41), ('insulation', 0.39), ('level', 0.39), ('warm', 0.39), ('governed', 0.38), ('formation', 0.37), ('failure', 0.35), ('frank', 0.34), ('cooling', 0.32), ('fracture', 0.31), ('because', 0.31), ('crystal', 0.31), ('determined', 0.31), ('boiler', 0.31), ('mechanical', 0.3), ('shatter', 0.29), ('friction', 0.29), ('levels', 0.29), ('cold', 0.29), ('will', 0.29), ('ceramics', 0.29), ('factor', 0.28), ('crash', 0.28), ('reaction', 0.28), ('fatigue', 0.28), ('hazard', 0.27), ('##e', 0.26), ('anger', 0.26), ('bubble', 0.25), ('process', 0.24), ('cleaning', 0.23), ('surrounding', 0.22), ('theory', 0.22), ('sash', 0.22), ('distraction', 0.21), ('adjoining', 0.19), ('environmental', 0.19), ('ross', 0.18), ('formed', 0.17), ('broken', 0.16), ('affect', 0.16), ('##pan', 0.15), ('graphic', 0.14), ('damage', 0.14), ('bubbles', 0.13), ('windshield', 0.13), ('temporal', 0.13), ('roof', 0.12), ('strain', 0.12), ('clear', 0.09), ('ceramic', 0.08), ('stressed', 0.08), ('##uation', 0.08), ('cool', 0.08), ('expand', 0.07), ('storm', 0.07), ('shock', 0.07), ('psychological', 0.06), ('breaking', 0.06), ('##es', 0.06), ('melting', 0.05), ('burst', 0.05), ('sensing', 0.04), ('heats', 0.04), ('error', 0.03), ('weather', 0.03), ('drink', 0.03), ('fire', 0.03), ('vibration', 0.02), ('induced', 0.02), ('warmer', 0.02), ('leak', 0.02), ('fog', 0.02), ('safety', 0.01), ('surface', 0.01), ('##thermal', 0.0)]
|
106 |
```
|
107 |
|
108 |
+
**naver/splade-cocondenser-ensembledistil** (SoTA, ~4% more tokens + FLOPS = 1.85)
|
109 |
```python
|
110 |
number of actual dimensions: 126
|
111 |
SPLADE BOW rep:
|
112 |
[('stress', 2.25), ('glass', 2.23), ('thermal', 2.18), ('glasses', 1.65), ('pan', 1.62), ('heat', 1.56), ('stressed', 1.42), ('crack', 1.31), ('break', 1.12), ('cracked', 1.1), ('hot', 0.93), ('created', 0.9), ('factors', 0.81), ('broken', 0.73), ('caused', 0.71), ('too', 0.71), ('damage', 0.69), ('if', 0.68), ('hotter', 0.65), ('governed', 0.61), ('heating', 0.59), ('temperature', 0.59), ('adjacent', 0.59), ('cause', 0.58), ('effect', 0.57), ('fracture', 0.56), ('bradford', 0.55), ('strain', 0.53), ('hammer', 0.51), ('brian', 0.48), ('error', 0.47), ('windows', 0.45), ('will', 0.45), ('reaction', 0.42), ('create', 0.42), ('windshield', 0.41), ('heated', 0.41), ('factor', 0.4), ('cracking', 0.39), ('failure', 0.38), ('mechanical', 0.38), ('when', 0.38), ('formed', 0.38), ('bolt', 0.38), ('mechanism', 0.37), ('warm', 0.37), ('areas', 0.36), ('area', 0.36), ('energy', 0.34), ('disorder', 0.33), ('barry', 0.33), ('shock', 0.32), ('determined', 0.32), ('gage', 0.32), ('sash', 0.31), ('theory', 0.31), ('level', 0.31), ('resistant', 0.31), ('brake', 0.3), ('window', 0.3), ('crash', 0.3), ('hazard', 0.29), ('##ink', 0.27), ('ceramic', 0.27), ('storm', 0.25), ('problem', 0.25), ('issue', 0.24), ('impact', 0.24), ('fridge', 0.24), ('injury', 0.23), ('ross', 0.22), ('causes', 0.22), ('affect', 0.21), ('pressure', 0.21), ('fatigue', 0.21), ('leak', 0.21), ('eye', 0.2), ('frank', 0.2), ('cool', 0.2), ('might', 0.19), ('gravity', 0.18), ('ray', 0.18), ('static', 0.18), ('collapse', 0.18), ('physics', 0.18), ('wave', 0.18), ('reflection', 0.17), ('parker', 0.17), ('strike', 0.17), ('hottest', 0.17), ('burst', 0.16), ('chance', 0.16), ('burn', 0.14), ('rubbing', 0.14), ('interference', 0.14), ('bailey', 0.13), ('vibration', 0.12), ('gilbert', 0.12), ('produced', 0.12), ('rock', 0.12), ('warmer', 0.11), ('get', 0.11), ('drink', 0.11), ('fireplace', 0.11), ('ruin', 0.1), ('brittle', 0.1), ('fragment', 0.1), ('stumble', 0.09), ('formation', 0.09), ('shatter', 0.08), ('great', 0.08), ('friction', 0.08), ('flash', 0.07), ('cracks', 0.07), ('levels', 0.07), ('smash', 0.04), ('fail', 0.04), ('fra', 0.04), ('##glass', 0.03), ('variables', 0.03), ('because', 0.02), ('knock', 0.02), ('sun', 0.02), ('crush', 0.01), ('##e', 0.01), ('anger', 0.01)]
|
113 |
```
|
114 |
|
115 |
+
**naver/splade-v2-distil** (~48% more tokens + FLOPS = 3.82)
|
116 |
```python
|
117 |
number of actual dimensions: 234
|
118 |
SPLADE BOW rep:
|
|
|
135 |
<details>
|
136 |
|
137 |
**Note: Why Anserini not PISA?** *Anserini is a production ready lucene based library. Common industry search deployments use Solr or elastic which are lucene based, hence the performance can be comparable. PISA latency is irrelevant for industry as it is a a research only system.*
|
138 |
+
The full [anserini evaluation log will be updated soon]() with encoding, indexing and querying details are here.
|
139 |
|
140 |
- **BEIR ZST OOD performance**: Will be added to the end of page.
|
141 |
|