Full HumanEval Benchmark

Following our latest core updates, 2501 has achieved a remarkable score of 96.951% on the full HumanEval benchmark (29th June 2024). This score places 2501 in the top 3 of the leaderboard and demonstrates its exceptional performance in generating code from natural language instructions. (Paper in progress, stay tuned for more details!)

2501 constantly goes the extra mile to uncover what AI can truly achieve. Benchmarking progress is a cornerstone of our technical mission, especially after significant rounds of optimization. We primarily use HumanEval to measure the functional correctness of the novel code generated by the 2501 engine.

This technical report breaks down 2501's performance on the full HumanEval benchmark, where we scored 96.951% after the most recent core updates.

2501 scoring ahead of the competition

"Mixture of Models" approach

At the heart of 2501 is not one model but an orchestration of several of the top-performing models on the market today. Our approach, referred to as MoM (Mixture of Models) going forward, is designed to maximize the successful execution of coding projects derived from natural language instructions.

With no one-size-fits-all model available, output quality is highly task-dependent. Our approach is to decompose any complex software development requirement into bite-sized tasks that can then be assigned to the most suitable model for resolution.

That's precisely where the MoM approach shines, and 2501's autonomy does not stop there. On top of that, we continually evaluate the resolution of each task with a second model to improve the output or fix potential hallucinations.
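To make the idea concrete, here is a minimal sketch of what a MoM dispatch loop could look like. The helper names, routing table, and model choices below are illustrative assumptions for this post, not 2501's actual implementation.

```python
# Illustrative Mixture-of-Models (MoM) loop. All helper names and routing
# rules below are hypothetical, not 2501's actual implementation.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Task:
    description: str
    kind: str  # e.g. "codegen", "tests"


def decompose(requirement: str) -> List[Task]:
    """Split a natural-language requirement into bite-sized tasks (stub)."""
    return [Task(description=requirement, kind="codegen")]


# Hypothetical registry mapping each task kind to the model best suited for it.
MODEL_REGISTRY: dict[str, Callable[[str], str]] = {
    "codegen": lambda prompt: "...",  # e.g. a strong code-generation model
    "tests": lambda prompt: "...",    # e.g. a model tuned for test writing
}


def review(task: Task, candidate: str) -> str:
    """Second-model pass: check the candidate and fix potential hallucinations (stub)."""
    return candidate


def solve(requirement: str) -> List[str]:
    results = []
    for task in decompose(requirement):
        model = MODEL_REGISTRY.get(task.kind, MODEL_REGISTRY["codegen"])
        candidate = model(task.description)      # first model resolves the task
        results.append(review(task, candidate))  # second model reviews the result
    return results
```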

Proof of concept

The video above shows a proof of concept of 2501 autonomously generating 10 Python scripts to solve 10 different tasks randomly selected from the HumanEval test set, with a perfect pass@1 score of 1.0 (every task solved on the first attempt).

HumanEval Test June 2024

Since the dataset’s inception in mid-2021, HumanEval has become the go-to benchmark for assessing the progress and capabilities of code generation models. Evaluating functional correctness, comparing model performance, or tracking research and development progress is impossible without consistent HumanEval testing.
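For readers new to the dataset: each HumanEval problem pairs a function signature and docstring with hidden unit tests, and a completion counts as functionally correct only if it passes those tests. The toy record below mirrors the dataset's field names (task_id, prompt, canonical_solution, test, entry_point), but its content is a made-up example, not an actual HumanEval problem.

```python
# Toy record mirroring the HumanEval schema (field names match the dataset;
# the content itself is illustrative, not a real HumanEval problem).
toy_problem = {
    "task_id": "Example/0",
    "prompt": (
        "def add(a: int, b: int) -> int:\n"
        '    """Return the sum of a and b.\n'
        "    >>> add(2, 3)\n"
        "    5\n"
        '    """\n'
    ),
    "canonical_solution": "    return a + b\n",
    "test": (
        "def check(candidate):\n"
        "    assert candidate(2, 3) == 5\n"
        "    assert candidate(-1, 1) == 0\n"
    ),
    "entry_point": "add",
}

# Functional correctness: a completion passes if prompt + completion defines
# a function that satisfies the hidden tests.
namespace: dict = {}
exec(toy_problem["prompt"] + toy_problem["canonical_solution"], namespace)
exec(toy_problem["test"], namespace)
namespace["check"](namespace[toy_problem["entry_point"]])
print("toy problem passed")
```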

The current leaderboard is dominated by LDB-assisted models, which perform very well but also tend to be very costly and slow to run.

While such rigorous tests demonstrate 2501's complex problem-solving capabilities, no single metric can adequately track progress. Equally essential use cases, such as code explanation, docstring generation, code infilling, and writing tests, are still not covered well enough.

Methodology & Preparation

Preparation was business as usual: we intentionally chose not to develop anything specialized or meant to assist in testing. 2501's vision is full autonomy, and we stand firmly behind it.

1. We asked 2501 to split all HumanEval tasks into one file per task.
2. We then asked 2501 to generate a shell script (below) to run all tasks in a row, with itself as the main process.
3. We ran the shell script and let 2501 do its magic.

Note: the shell script uses the 2501 CLI to perform the test with no custom code or other assistance, relying solely on our Rhino engine.

2501 shell running 2501 CLI

2501 running 2501 CLI to generate Python scripts to solve HumanEval tasks.
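For illustration only, here is a Python sketch of the kind of driver loop that shell script implements: iterate over the per-task files and hand each prompt to the 2501 CLI. The paths and the CLI invocation below are placeholders, not the real command line.

```python
# Illustrative driver loop (the actual run used a shell script).
# The CLI invocation below is a placeholder, not the real 2501 command line.
import pathlib
import subprocess

TASK_DIR = pathlib.Path("humaneval_tasks")  # one file per HumanEval task (step 1)
OUTPUT_DIR = pathlib.Path("solutions")
OUTPUT_DIR.mkdir(exist_ok=True)

for task_file in sorted(TASK_DIR.glob("*.py")):
    prompt = task_file.read_text()
    # Hand the task prompt to the 2501 CLI and let it generate, run, and
    # doctest-validate a solution script on its own.
    subprocess.run(
        ["2501", prompt],  # hypothetical invocation; the real flags differ
        check=True,
        cwd=OUTPUT_DIR,
    )
```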

Given the prompt with minor guidance, 2501 ran independently until it had generated a complete resolution Python script, without human assistance.

Finally, we used doctest to validate the generated solution: 2501 could then run the script and check whether the output was correct, in complete autonomy.

Prompt

We prompted 2501 with a simple prompt containing a few pieces of guidance and tasked it with generating a Python script that solves the problem statement provided in the prompt. The prompt was designed to be clear, concise, and unambiguous, allowing 2501 to focus on the task at hand without distractions, while remaining challenging enough for 2501 to demonstrate its problem-solving abilities and adaptability.

One key element was the request to use doctest to validate the generated script, so that 2501 could run it and verify the output on its own.
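As a concrete illustration of that requirement, a generated solution shaped like the sketch below can validate itself with Python's standard doctest module. The function here is a made-up example, not one of 2501's actual outputs.

```python
# Illustrative shape of a self-validating solution (a made-up example,
# not an actual 2501 output).
def running_max(numbers):
    """Return the running maximum of a list of numbers.

    >>> running_max([1, 3, 2, 5, 4])
    [1, 3, 3, 5, 5]
    >>> running_max([])
    []
    """
    result = []
    current = None
    for n in numbers:
        current = n if current is None or n > current else current
        result.append(current)
    return result


if __name__ == "__main__":
    # Running the script checks the examples embedded in the docstring,
    # so the agent can verify its own output without human help.
    import doctest
    doctest.testmod()
```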

Evaluation

We evaluated 2501's performance without any access to hints or internet results. The AI received no guidance beyond what was provided in the prompt, as explained above.

With the ‘evaluate_functional_correctness’ method of the HumanEval evaluation harness, we benchmarked against the following (a minimal usage sketch follows the list):
- multiple sets of 10 tasks selected randomly from the HumanEval test set
- the full 164 tasks from the HumanEval test set
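For reference, here is a minimal sketch of how such a run can be scored with OpenAI's human-eval harness; the file names are placeholders and the exact invocation we used may differ.

```python
# Minimal scoring sketch using OpenAI's human-eval package (installed from
# the official human-eval repository). File names are placeholders.
# Note: the package ships with code execution disabled and must be enabled
# as described in its README before this will run.
from human_eval.data import read_problems, write_jsonl
from human_eval.evaluation import evaluate_functional_correctness

problems = read_problems()  # loads the full HumanEval problem set

# Sanity-check run: score the dataset's own canonical solutions.
# Swap these completions for the agent's generated scripts in a real run.
samples = [
    {"task_id": task_id, "completion": problems[task_id]["canonical_solution"]}
    for task_id in problems
]
write_jsonl("samples.jsonl", samples)

# Executes each completion against the hidden unit tests and reports pass@k.
results = evaluate_functional_correctness("samples.jsonl")
print(results)  # e.g. {'pass@1': 1.0} for the canonical solutions
```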

View the test completion set on GitHub

Results

We're proud to share what we internally refer to as the most significant milestone 2501 has reached so far: an almost perfect score on the full HumanEval benchmark. We're aiming for rank #1 after the next round of improvements, but for the moment, we have achieved our objective:

A result of almost 97% pass@1: 96.951% to be precise, which corresponds to 159 of the 164 tasks solved.

That’s huge by any standards, and we’re thrilled about how 2501 demonstrated exceptional performance. The journey does not stop here, as we’re on a mission to consistently iterate on what’s possible with autonomous AI.

Runtime and Token Usage Analysis

Some variability in token usage and runtime is to be expected, and 2501 is no exception. On average, each test completed in 20 to 60 seconds, from prompt to fully generated and tested Python script.

Token usage also varied significantly, with most tests consuming between 100k and 1M tokens, depending on the number of iterations. This is still lower than the top-3 LDB-based orchestrations on the leaderboard, which split every piece of generated code into blocks and end up being very costly and slow to run.

Every Step Explained

The need for security and transparency grows as 2501’s autonomous AI system takes on more complex tasks. This is a core building block in 2501’s development, and it shaped our extremely communicative interface.

We empower users to understand everything 2501 is currently doing, why it's doing it, and how it relates to the task at hand. With such protocols in place, safety sits at the core of our solution, and human intervention remains possible if anything unexpected happens.

We have yet to see such an occurrence while monitoring 2501's logs and the reports explaining the reasoning behind its actions.

We want to mention our dedication to societal responsibility. As we manage more data, we strive to do so in a way that ensures security and respects privacy.

The Future with 2501

We're so excited about the next generation of coding and beyond! This marks one of the first steps in a journey that will profoundly change much of what we're used to.

In the meantime, you can count on us to continue developing 2501’s humanlike cognitive capabilities and advanced use of available models.

Over time, our autopilot AI system will continue to amaze as we integrate it deeper with your favorite OS, CLI, and the Cloud, bridging the gap between popular tools and integrations.

Finally, we're closing with a small peek under the hood of what's coming to 2501. In our vision, personalization is critical to successful AI applications, and that's precisely what's coming next.

Be prepared to build your own army of autonomous AI agents that work together in perfect harmony to achieve complex tasks in minutes.

Next challenges

We are now looking forward to the next challenges that lie ahead.

- Testing 2501 on the HumanEval-ET benchmark.
- Testing 2501 on the notoriously challenging SWE-bench benchmark.
- and more...

2501 is still in its early days, full of challenges, but we are confident that it will continue to exceed our expectations.

Acknowledgments

The following people and organizations inspired or helped us along the way, and we want to thank them.

- Microsoft for Startups & OpenAI (thanks for the credits!)
- Matt Welsh for his time and ideas
- and more...