leaderboard-test (Leader Board Test Org)

posted an update 1 day ago

Post

940

Gemma3 family is out! Reading the tech report, and this section was really interesting to me from a methods/scientific fairness pov.

Instead of doing over-hyped comparisons, they clearly state that **results are reported in a setup which is advantageous to their models**.
(Which everybody does, but people usually don't say)

For a tech report, it makes a lot of sense to report model performance when used optimally!
On leaderboards on the other hand, comparison will be apples to apples, but in a potentially unoptimal way for a given model family (like some user interact sub-optimally with models)

Also contains a cool section (6) on training data memorization rate too! Important to see if your model will output the training data it has seen as such: always an issue for privacy/copyright/... but also very much for evaluation!

Because if your model knows its evals by heart, you're not testing for generalization.

freddyaboulton

posted an update 1 day ago

Post

1413

Privacy matters when talking to AI! 🔇

We've just added a microphone mute button to FastRTC in our latest update (v0.0.14). Now you control exactly what your LLM hears.

Plus lots more features in this release! Check them out:
https://github.com/freddyaboulton/fastrtc/releases/tag/0.0.14

freddyaboulton

posted an update 16 days ago

Post

3150

Getting WebRTC and Websockets right in python is very tricky. If you've tried to wrap an LLM in a real-time audio layer then you know what I'm talking about.

That's where FastRTC comes in! It makes WebRTC and Websocket streams super easy with minimal code and overhead.

Check out our org: hf.co/fastrtc

clefourrier

authored a paper about 1 month ago

SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model

Paper • 2502.02737 • Published Feb 4 • 203

freddyaboulton

posted an update 3 months ago

Post

1770

Just created a Gradio space for playing with the new OAI realtime voice API!

freddyaboulton/openai-realtime-voice

freddyaboulton

posted an update 3 months ago

Post

966

Gemini can talk 🗣️

Check out the new multimodal API from Google on @akhaliq 's anychat or my space. It's very fast and smart 🍓

https://huggingface.co/spaces/freddyaboulton/gemini-voicehttps://huggingface.co/spaces/akhaliq/anychat

1 reply

·

freddyaboulton

posted an update 3 months ago

Post

2294

Version 0.0.21 of gradio-pdf now properly loads chinese characters!

freddyaboulton

posted an update 3 months ago

Post

1619

Hello Llama 3.2! 🗣️🦙

Build a Siri-like coding assistant that responds to "Hello Llama" in 100 lines of python! All with Gradio, webRTC 😎

freddyaboulton/hey-llama-code-editor

freddyaboulton

posted an update 3 months ago

Post

1160

Just created a cookbook of real time audio/video spaces created using Gradio and WebRTC ⚡️

Use this and the [docs](https://freddyaboulton.github.io/gradio-webrtc/) to get started building the next gen of AI apps!

freddyaboulton/gradio-webrtc-cookbook-6758ba7745aeca7b1be7de0f

2 replies

·

clefourrier

authored a paper 3 months ago

Global MMLU: Understanding and Addressing Cultural and Linguistic Biases in Multilingual Evaluation

Paper • 2412.03304 • Published Dec 4, 2024 • 18

clefourrier

authored 2 papers 9 months ago

The Hallucinations Leaderboard -- An Open Effort to Measure Hallucinations in Large Language Models

Paper • 2404.05904 • Published Apr 8, 2024 • 9

GAIA: a benchmark for General AI Assistants

Paper • 2311.12983 • Published Nov 21, 2023 • 192

freddyaboulton

posted an update 9 months ago

Post

1524

@dwancin Can you please reset your toggle component's space? It's stuck for some reason. Happy to help

dwancin/gradio_toggle

clefourrier

posted an update 11 months ago

Post

6118

In a basic chatbots, errors are annoyances. In medical LLMs, errors can have life-threatening consequences 🩸

It's therefore vital to benchmark/follow advances in medical LLMs before even thinking about deployment.

This is why a small research team introduced a medical LLM leaderboard, to get reproducible and comparable results between LLMs, and allow everyone to follow advances in the field.

openlifescienceai/open_medical_llm_leaderboard

Congrats to @aaditya and @pminervini !
Learn more in the blog: https://huggingface.co/blog/leaderboard-medicalllm

clefourrier

posted an update 11 months ago

Post

4763

Contamination free code evaluations with LiveCodeBench! 🖥️

LiveCodeBench is a new leaderboard, which contains:
- complete code evaluations (on code generation, self repair, code execution, tests)
- my favorite feature: problem selection by publication date 📅

This feature means that you can get model scores averaged only on new problems out of the training data. This means... contamination free code evals! 🚀

Check it out!

Blog: https://huggingface.co/blog/leaderboard-livecodebench
Leaderboard: livecodebench/leaderboard

Congrats to @StringChaos @minimario @xu3kev @kingh0730 and @FanjiaYan for the super cool leaderboard!

clefourrier

posted an update 11 months ago

Post

2236

🆕 Evaluate your RL agents - who's best at Atari?🏆

The new RL leaderboard evaluates agents in 87 possible environments (from Atari 🎮 to motion control simulations🚶and more)!

When you submit your model, it's run and evaluated in real time - and the leaderboard displays small videos of the best model's run, which is super fun to watch! ✨

Kudos to @qgallouedec for creating and maintaining the leaderboard!
Let's find out which agent is the best at games! 🚀

open-rl-leaderboard/leaderboard

freddyaboulton

posted an update 11 months ago

Post

3684

We just released gradio version 4.26.0 ! We *highly* recommend you upgrade your apps to this version to bring in these nice changes:

🎥 Introducing the API recorder. Any gradio app running 4.26.0 and above will have an "API Recorder" that will record your interactions with the app and auto-generate the corresponding python or js code needed to recreate those actions programmatically. It's very neat!

📝 Enhanced markdown rendering in gr.Chatbot

🐢 Fix for slow load times on spaces as well as the UI locking up on rapid generations

See the full changelog of goodies here: https://www.gradio.app/changelog#4-26-0

1 reply

·

clefourrier

posted an update 11 months ago

Post

2237

Fun fact about evaluation, part 2!

How much do scores change depending on prompt format choice?

Using different prompts (all present in the literature, from Prompt question? to Question: prompt question?\nChoices: enumeration of all choices\nAnswer: ), we get a score range of...

10 points for a single model!
Keep in mind that we only changed the prompt, not the evaluation subsets, etc.
Again, this confirms that evaluation results reported without their details are basically bullshit.

Prompt format on the x axis, all these evals look at the logprob of either "choice A/choice B..." or "A/B...".

Incidentally, it also changes model rankings - so a "best" model might only be best on one type of prompt...

freddyaboulton

posted an update 11 months ago

Post

2450

Gradio 4.25.0 is out with some nice improvements and bug fixes!

🧹 Automatic deletion of gr.State variables stored in the server. Never run out of RAM again. Also adds an unload event you can run when a user closes their browser tab.

😴 Lazy example caching. You can set cache_examples="lazy" to cache examples when they're first requested as opposed to before the server launches. This can cut down the server's start-up time drastically.

🔊 Fixes a bug with streaming audio outputs

🤖 Improvements to gr.ChatInterface like pasting images directly from the clipboard.

See the rest of the changelog here: https://www.gradio.app/changelog#4-25-0

clefourrier

posted an update 12 months ago

Post

2367

Fun fact about evaluation!

Did you know that, if you evaluate the same model, with the same prompt formatting & the same fixed few-shot examples, only changing
♻️the order in which the few shot examples are added to the prompt ♻️
you get a difference of up to 3 points in evaluation score?

I did a small experiment using some MMLU subsets on the best performing 7B and lower pretrained models from the leaderboard.

I tried 8 different prompting methods (containing more or less information, such as just the question, or Question: question, or Question: question Choices: ..., see the x axis) that are commonly used in evaluation.

I then compared the results for all these methods, in 5-shot, during 2 runs. The *only difference* between the first and second run being that the samples used in few-shot are not introduced in the same order.
For example, run one would have been "A B C D E Current sample", vs, in run 2, "D C E A B Current sample".
All the other experiment parameters stayed exactly the same.

As you can see on the attached picture, you get a difference of up to 3 points between the 2 few-shot samples shuffling.

So, when just changing *the order of the few shot samples* can change your results by several points, what is the impact of all other "minimal" and unreported prompting changes?

-> Any kind of model score, provided without an evaluation script for reproducibility, is basically bullshit (or coms).
-> This is why we need reproducible evaluation in a fair and exactly similar setup, using evaluation suites such as lm_eval from the Harness, lighteval from HF, or the Open LLM Leaderboard.

4 replies

·

Leader Board Test Org

AI & ML interests

Recent Activity

leaderboard-test's activity

SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model

Global MMLU: Understanding and Addressing Cultural and Linguistic Biases in Multilingual Evaluation

The Hallucinations Leaderboard -- An Open Effort to Measure Hallucinations in Large Language Models

GAIA: a benchmark for General AI Assistants

AI & ML interests

Recent Activity

Team members 2

leaderboard-test's activity