	---
	license: mit
	datasets:
	- dleemiller/wiki-sim
	- sentence-transformers/stsb
	language:
	- en
	metrics:
	- spearmanr
	- pearsonr
	base_model:
	- answerdotai/ModernBERT-base
	pipeline_tag: text-classification
	library_name: sentence-transformers
	tags:
	- cross-encoder
	- modernbert
	- sts
	- stsb
	- stsbenchmark-sts
	model-index:
	- name: CrossEncoder based on answerdotai/ModernBERT-base
	  results:
	  - task:
	      type: semantic-similarity
	      name: Semantic Similarity
	    dataset:
	      name: sts test
	      type: sts-test
	    metrics:
	    - type: pearson_cosine
	      value: 0.9162245947821821
	      name: Pearson Cosine
	    - type: spearman_cosine
	      value: 0.9121555789491528
	      name: Spearman Cosine
	  - task:
	      type: semantic-similarity
	      name: Semantic Similarity
	    dataset:
	      name: sts dev
	      type: sts-dev
	    metrics:
	    - type: pearson_cosine
	      value: 0.9260833551026787
	      name: Pearson Cosine
	    - type: spearman_cosine
	      value: 0.9236030687487745
	      name: Spearman Cosine
	---
	# ModernBERT Cross-Encoder: Semantic Similarity (STS)

	Cross encoders are high-performing encoder models that compare two texts and output a similarity score between 0 and 1.
	I've found the `cross-encoder/stsb-roberta-large` model to be very useful in creating evaluators for LLM outputs.
	Cross encoders are simple to use, fast and very accurate.

	Like many people, I was excited about the architecture and training improvements in ModernBERT (`answerdotai/ModernBERT-base`).
	So I've applied it to the STS-B cross encoder, which is a very handy model. Additionally, I've added
	pretraining on a much larger semi-synthetic dataset, `dleemiller/wiki-sim`, that targets this kind of objective.
	The inference efficiency, expanded context and simplicity make this a really nice platform for an evaluator model.

	---

	## Features
	- High performing: Achieves a Pearson correlation of 0.9162 and a Spearman correlation of 0.9122 on the STS-Benchmark test set.
	- Efficient architecture: Based on the ModernBERT-base design (149M parameters), offering faster inference than `cross-encoder/stsb-roberta-large` (355M parameters).
	- Extended context length: Processes sequences up to 8192 tokens, great for LLM output evals.
	- Diversified training: Pretrained on `dleemiller/wiki-sim` and fine-tuned on `sentence-transformers/stsb`.

	---

	## Performance

	| Model | STS-B Test Pearson | STS-B Test Spearman | Context Length | Parameters | Speed |
	|-------|--------------------|---------------------|----------------|------------|-------|
	| `dleemiller/ModernCE-large-sts` | 0.9256 | 0.9215 | 8192 | 395M | Medium |
	| `dleemiller/CrossGemma-sts-300m` | 0.9175 | 0.9135 | 2048 | 303M | Medium |
	| `dleemiller/ModernCE-base-sts` | 0.9162 | 0.9122 | 8192 | 149M | Fast |
	| `cross-encoder/stsb-roberta-large` | 0.9147 | - | 512 | 355M | Slow |
	| `dleemiller/EttinX-sts-m` | 0.9143 | 0.9102 | 8192 | 149M | Fast |
	| `dleemiller/NeoCE-sts` | 0.9124 | 0.9087 | 4096 | 250M | Fast |
	| `dleemiller/EttinX-sts-s` | 0.9004 | 0.8926 | 8192 | 68M | Very Fast |
	| `cross-encoder/stsb-distilroberta-base` | 0.8792 | - | 512 | 82M | Fast |
	| `dleemiller/EttinX-sts-xs` | 0.8763 | 0.8689 | 8192 | 32M | Very Fast |
	| `dleemiller/EttinX-sts-xxs` | 0.8414 | 0.8311 | 8192 | 17M | Very Fast |
	| `dleemiller/sts-bert-hash-nano` | 0.7904 | 0.7743 | 8192 | 0.97M | Very Fast |
	| `dleemiller/sts-bert-hash-pico` | 0.7595 | 0.7474 | 8192 | 0.45M | Very Fast |


	---

	## Usage

	To use ModernCE for semantic similarity tasks, you can load the model with the Hugging Face `sentence-transformers` library:

	```python
	from sentence_transformers import CrossEncoder

	# Load ModernCE model
	model = CrossEncoder("dleemiller/ModernCE-base-sts")

	# Predict similarity scores for sentence pairs
	sentence_pairs = [
	    ("It's a wonderful day outside.", "It's so sunny today!"),
	    ("It's a wonderful day outside.", "He drove to work earlier."),
	]
	scores = model.predict(sentence_pairs)

	print(scores) # Outputs: array([0.9184, 0.0123], dtype=float32)
	```

	### Output
	The model returns similarity scores in the range `[0, 1]`, where higher scores indicate stronger semantic similarity.
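
Because each pair yields a single float, one way to build an evaluator for LLM outputs is to score every candidate answer against a reference and apply a cutoff. The sketch below is illustrative only: `grade_against_reference`, the `0.7` threshold, and `toy_score_fn` (a word-overlap stand-in so the example runs without downloading the model) are assumptions, not part of this model or of `sentence-transformers`.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Pair = Tuple[str, str]
ScoreFn = Callable[[Sequence[Pair]], Sequence[float]]


def grade_against_reference(
    score_fn: ScoreFn,
    reference: str,
    candidates: Sequence[str],
    threshold: float = 0.7,  # hypothetical cutoff; tune it on your own data
) -> List[Dict[str, object]]:
    """Score each candidate against the reference and flag those at or above the threshold."""
    pairs = [(reference, candidate) for candidate in candidates]
    scores = score_fn(pairs)
    return [
        {"candidate": c, "score": float(s), "passed": float(s) >= threshold}
        for c, s in zip(candidates, scores)
    ]


def toy_score_fn(pairs: Sequence[Pair]) -> List[float]:
    """Offline stand-in for the model: Jaccard overlap of lowercased word sets."""
    scores = []
    for left, right in pairs:
        a, b = set(left.lower().split()), set(right.lower().split())
        union = a | b
        scores.append(len(a & b) / len(union) if union else 0.0)
    return scores


results = grade_against_reference(
    toy_score_fn,
    reference="the cat sat on the mat",
    candidates=["the cat sat on the mat", "stocks fell sharply today"],
)
for row in results:
    print(row)
```

With the real model, `model.predict` from the Usage example can be passed as `score_fn` in place of `toy_score_fn`.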

	---

	## Training Details

	### Pretraining
	The model was pretrained on the `pair-score-sampled` subset of the [`dleemiller/wiki-sim`](https://huggingface.co/datasets/dleemiller/wiki-sim) dataset. This dataset provides diverse sentence pairs with semantic similarity scores, helping the model build a robust understanding of relationships between sentences.
	- Classifier dropout: a relatively large classifier dropout of 0.3 was used to reduce overreliance on teacher scores.
	- Objective: similarity scores predicted by the teacher model, `cross-encoder/stsb-roberta-large`.
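
Training on teacher scores amounts to distillation: a teacher cross encoder scores sentence pairs and the student learns to reproduce those scores. Below is a minimal sketch of the labeling step. It does not describe how `dleemiller/wiki-sim` was actually built; `label_pairs_with_teacher`, the record keys, and `fake_teacher` are hypothetical, with `fake_teacher` standing in for a real teacher's `predict` call.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Pair = Tuple[str, str]


def label_pairs_with_teacher(
    teacher_predict: Callable[[Sequence[Pair]], Sequence[float]],
    pairs: Sequence[Pair],
) -> List[Dict[str, object]]:
    """Attach a teacher score to every pair, clipped to the [0, 1] target range."""
    scores = teacher_predict(pairs)
    records = []
    for (sentence1, sentence2), score in zip(pairs, scores):
        clipped = min(1.0, max(0.0, float(score)))
        records.append(
            {"sentence1": sentence1, "sentence2": sentence2, "label": clipped}
        )
    return records


def fake_teacher(pairs: Sequence[Pair]) -> List[float]:
    """Hypothetical stand-in for a teacher model so the sketch runs offline."""
    return [1.0 if a == b else 0.25 for a, b in pairs]


training_rows = label_pairs_with_teacher(
    fake_teacher,
    [
        ("A man is cooking.", "A man is cooking."),
        ("A man is cooking.", "It is raining."),
    ],
)
print(training_rows)
```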

	### Fine-Tuning
	Fine-tuning was performed on the [`sentence-transformers/stsb`](https://huggingface.co/datasets/sentence-transformers/stsb) dataset.

	### Evaluation Results
	The model achieved the following STS-B performance after fine-tuning:
	- Test set: Pearson correlation 0.9162, Spearman correlation 0.9122
	- Dev set: Pearson correlation 0.9261, Spearman correlation 0.9236
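
Pearson and Spearman correlation between predicted and gold scores can be computed in a few lines of plain Python. This is a sketch for illustration (in practice `scipy.stats.pearsonr` and `scipy.stats.spearmanr` are the usual choice); the `gold` and `predicted` values are made up and are not STS-B data.

```python
import math
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation: covariance divided by the product of standard deviations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


gold = [0.0, 0.2, 0.5, 0.9, 1.0]          # made-up reference scores
predicted = [0.1, 0.15, 0.6, 0.8, 0.95]   # made-up model scores
print(f"Pearson:  {pearson(predicted, gold):.4f}")
print(f"Spearman: {spearman(predicted, gold):.4f}")
```

Spearman depends only on the ordering of the scores, so it is less sensitive than Pearson to a model that ranks pairs correctly but is miscalibrated in absolute terms.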

	---

	## Model Card

	- Architecture: ModernBERT-base
	- Tokenizer: Inherited from the base model, `answerdotai/ModernBERT-base`.
	- Pretraining Data: `dleemiller/wiki-sim` (`pair-score-sampled` subset)
	- Fine-Tuning Data: `sentence-transformers/stsb`

	---

	## Thank You

	Thanks to the Answer.AI team for providing the ModernBERT models, and to the Sentence Transformers team for their leadership in transformer encoder models.

	---

	## Citation

	If you use this model in your research, please cite:

	```bibtex
	@misc{moderncestsb2025,
	  author = {Miller, D. Lee},
	  title = {ModernCE STS: An STS cross encoder model},
	  year = {2025},
	  publisher = {Hugging Face Hub},
	  url = {https://huggingface.co/dleemiller/ModernCE-base-sts},
	}
	```

	---

	## License

	This model is licensed under the [MIT License](LICENSE).