Spaces: flowers-team/StickToYourRoleLeaderboard

Plans to include additional models?

by SamuraiBarbi - opened Apr 21, 2025

Hello, I'm just inquiring as to whether there are any plans to update this benchmark/leaderboard with additional models. Would there be any way for us to request models to be tested/benchmarked?

grg

Flowers AI & CogSci Lab org Apr 22, 2025

Hello! I'm doing my best to maintain the leaderboard with the time I have between other projects. 🙂
Absolutely — feel free to suggest models! Ideally, they should be runnable with vLLM and have a context length of at least ~8k tokens. You’re welcome to post suggestions here or open a new issue.
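
As a rough illustration of the context-length criterion above (not the maintainer's actual tooling), a model's `config.json` can be screened before it is suggested. This is a minimal sketch under stated assumptions: the 8192 threshold is one reading of "~8k", the list of config keys is a guess at commonly used field names and is not exhaustive, and the helper names are hypothetical. It does not check whether vLLM supports the architecture.

```python
from typing import Optional

# Assumption: "~8k tokens" is interpreted here as 8192.
MIN_CONTEXT_TOKENS = 8192

# Assumption: field names that often hold the context length in a
# model's config.json. Different architectures use different keys,
# so this list is illustrative, not exhaustive.
CONTEXT_KEYS = ("max_position_embeddings", "n_positions", "seq_length", "max_seq_len")


def context_length(config: dict) -> Optional[int]:
    """Return the first integer context-length field found, or None."""
    for key in CONTEXT_KEYS:
        value = config.get(key)
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, int) and not isinstance(value, bool):
            return value
    return None


def meets_context_requirement(config: dict, minimum: int = MIN_CONTEXT_TOKENS) -> bool:
    """True only if a context length is declared and is at least `minimum`."""
    length = context_length(config)
    return length is not None and length >= minimum
```

For example, calling `meets_context_requirement(json.load(open("config.json")))` on a downloaded config file would return `False` both for a model with a shorter window and for a config that declares none of the keys listed, so a negative result is worth checking by hand.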

SamuraiBarbi

May 1, 2025 (edited May 1, 2025)

Would we be able to test the following models?

https://huggingface.co/shuttleai/shuttle-3.5

https://huggingface.co/THUDM/GLM-4-32B-0414

https://huggingface.co/Qwen/Qwen3-235B-A22B

https://huggingface.co/Qwen/Qwen3-30B-A3B

https://huggingface.co/Qwen/Qwen3-32B

https://huggingface.co/Qwen/Qwen3-8B

https://huggingface.co/Qwen/Qwen3-4B

These are recently released models for which I've seen creative writing benchmarks and evaluations, but hardly any on role play.

Edit: Added Qwen3-235B-A22B to the list

grg

Flowers AI & CogSci Lab org Jun 30, 2025

The models that were easily runnable with vLLM were added. On another note, keep in mind that this leaderboard captures only one aspect of role play: population-level stability of value expression across various contexts.

grg changed discussion status to closed Jun 30, 2025
