Which AI Model Is Best for Laravel?

Picking the right AI model for Laravel development is harder than it looks.

So, for the past few weeks, we have been running an internal experiment called Boost Benchmarks. We had two goals:

  • Find out which AI models handle real Laravel tasks best.
  • Measure whether Laravel Boost, the MCP server that provides AI coding context for Laravel applications, actually improves agent performance (short answer: it does).

Boost Benchmarks is an evaluation framework that runs AI coding agents against real Laravel problems, verifies their output using Pest tests, and records everything that happens during the run. That includes the test results, token usage, tool calls, execution time, and total cost.

We tested six models: haiku 4.5, sonnet 4.6, and opus 4.6 by Anthropic; kimi k2.5 by Moonshot AI; and gpt-5.3 codex and gpt-5.4 by OpenAI.

We plan to open source the framework soon, but for now, this is a short walkthrough of how the system works. If you have questions or want to discuss the experiment, feel free to reach out to me at @pushpak1300.

Why Boost Benchmarks

Before this framework existed, improving Laravel Boost was mostly guesswork. Whenever we added a new tool, updated a guideline, or removed a feature, we could not confidently say whether the change actually helped agents perform better across different models.

That is what led us to build the benchmark framework. Now the process is much clearer. We run the evaluations, make a change to Boost, rerun the evaluations, and then compare the results. This loop allows us to measure the real impact of each change instead of relying on assumptions.

Along the way, the data also answered a broader question: which AI model is actually best for Laravel development?

How We Measured AI Model Performance on Laravel

AI coding agents have become impressively capable and very fast, but can we actually measure how well they work? Each of our evaluations checks two things:

  • Functionality: Does the implementation work? This is verified by real HTTP requests and test assertions against the running app.
  • Architecture: Does the code follow Laravel conventions? This means no debug artifacts, correct class inheritance, and no obvious production mistakes.

How Evaluation Works

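Each evaluation lives under evals/ and contains three parts: an input/ directory with the starting application, the prompt that describes the task, and a suite/ directory with the Pest tests used to verify the result.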

The agent starts from a barebones Laravel app with no prebuilt solution. It has to inspect the project, implement the feature, and leave code that passes every test in the suite.

The 17 Evaluations

The benchmark suite covers a range of real-world Laravel work, from standard routing to specialized framework APIs.

| # | Task | Complexity |
| --- | --- | --- |
| 001 | Add web + API routes with Blade views | Low |
| 002 | Dispatch a queue job | Low |
| 003 | RESTful Post CRUD API with validation | Medium |
| 004 | Interactive Artisan command | Medium |
| 005 | Cache layer implementation | Medium |
| 006 | Eloquent relationships (BelongsTo, HasMany, etc.) | Medium |
| 007 | Events and listeners | Medium |
| 008 | Notification system | Medium |
| 009 | Inertia.js shared data setup | Medium |
| 010 | File uploads with storage | Medium |
| 011 | Livewire counter + contact form components | High |
| 012 | Laravel Folio page routing | Medium |
| 013 | Feature flags with Laravel Pennant | High |
| 014 | Inertia.js form with validation | High |
| 015 | MCP server with custom tools | High |
| 016 | Laravel AI SDK agent loop | High |
| 017 | Socialite GitHub OAuth login | High |

Running an Evaluation

In a typical run, the runner does the following for each model: it copies the input/ directory into a temporary workspace, runs the agent via OpenCode with the prompt and model configuration, then merges the suite/ directory into the workspace and runs Pest. Results are written to results/<eval_name>/ as JSON. Multiple evaluations run concurrently.
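
The steps above can be sketched roughly as follows. This is an illustration in Python, not the framework's actual code: the function name, the prompt.md file name, the per-model result file name, and the run_agent / run_tests callables (stand-ins for the OpenCode and Pest invocations) are all assumptions made for the example.

```python
import json
import shutil
import tempfile
from pathlib import Path
from typing import Callable


def run_evaluation(
    eval_dir: Path,
    results_dir: Path,
    model: str,
    run_agent: Callable[[Path, str, str], None],  # hypothetical stand-in for the OpenCode call
    run_tests: Callable[[Path], dict],            # hypothetical stand-in for the Pest run
) -> Path:
    """Run one evaluation for one model and write the outcome as JSON."""
    with tempfile.TemporaryDirectory() as tmp:
        workspace = Path(tmp) / "workspace"
        # 1. Copy the starting app into a throwaway workspace.
        shutil.copytree(eval_dir / "input", workspace)
        # 2. Hand the prompt to the agent (prompt.md is an assumed file name).
        prompt = (eval_dir / "prompt.md").read_text()
        run_agent(workspace, prompt, model)
        # 3. Merge the test suite in only after the agent has finished.
        shutil.copytree(eval_dir / "suite", workspace, dirs_exist_ok=True)
        # 4. Run the tests and collect the outcome while the workspace still exists.
        outcome = run_tests(workspace)

    # 5. Persist the result under results/<eval_name>/ (file name is an assumption).
    out_dir = results_dir / eval_dir.name
    out_dir.mkdir(parents=True, exist_ok=True)
    out_file = out_dir / f"{model}.json"
    out_file.write_text(json.dumps({"model": model, **outcome}, indent=2))
    return out_file
```

Merging suite/ after the agent run means the agent never sees the tests it will be graded against, which matches the order described above.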

The Results

With Boost

| Evaluation | Task | Tests | haiku-4.5 | sonnet-4.6 | kimi-k2.5 | gpt-5.3-codex | gpt-5.4 | opus-4.6 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 001 | Routes | 10 | ✅ 10/10 | ✅ 10/10 | ✅ 10/10 | ✅ 10/10 | ✅ 10/10 | ✅ 10/10 |
| 002 | Queue Job | 9 | ✅ 9/9 | ✅ 9/9 | ❌ 8/9 | ✅ 9/9 | ✅ 9/9 | ✅ 9/9 |
| 003 | Post CRUD API | 41 | ✅ 41/41 | ✅ 41/41 | ✅ 41/41 | ❌ 39/41 | ❌ 39/41 | ✅ 41/41 |
| 004 | Artisan Command | 10 | ✅ 10/10 | ✅ 10/10 | ✅ 10/10 | ✅ 10/10 | ✅ 10/10 | ✅ 10/10 |
| 005 | Cache | 11 | ✅ 11/11 | ✅ 11/11 | ✅ 11/11 | ✅ 11/11 | ✅ 11/11 | ✅ 11/11 |
| 006 | Eloquent Relationships | 44 | ❌ 21/44 | ✅ 44/44 | ✅ 44/44 | ✅ 44/44 | ✅ 44/44 | ✅ 44/44 |
| 007 | Events and Listeners | 21 | ✅ 21/21 | ✅ 21/21 | ❌ 18/21 | ✅ 21/21 | ✅ 21/21 | ✅ 21/21 |
| 008 | Notifications | 9 | ✅ 9/9 | ✅ 9/9 | ✅ 9/9 | ✅ 9/9 | ✅ 9/9 | ✅ 9/9 |
| 009 | Inertia Shared Data | 15 | ❌ 12/15 | ❌ 5/15 | ❌ 14/15 | ✅ 15/15 | ✅ 15/15 | ❌ 14/15 |
| 010 | File Uploads | 31 | ❌ 29/31 | ❌ 28/31 | ✅ 31/31 | ✅ 31/31 | ✅ 31/31 | ✅ 31/31 |
| 011 | Livewire Components | 19 | ❌ 17/19 | ❌ 17/19 | ❌ 17/19 | ✅ 19/19 | ✅ 19/19 | ❌ 7/19 |
| 012 | Folio Pages | 18 | ❌ 8/18 | ❌ 15/18 | ❌ 9/18 | ✅ 18/18 | ✅ 18/18 | ✅ 18/18 |
| 013 | Pennant Feature Flags | 17 | ❌ 12/17 | ✅ 17/17 | ✅ 17/17 | ✅ 17/17 | ✅ 17/17 | ❌ 16/17 |
| 014 | Inertia Form | 19 | ❌ 17/19 | ✅ 19/19 | ✅ 19/19 | ✅ 19/19 | ✅ 19/19 | ✅ 19/19 |
| 015 | MCP Server | 13 | ❌ 12/13 | ✅ 13/13 | ✅ 13/13 | ✅ 13/13 | ✅ 13/13 | ✅ 13/13 |
| 016 | AI SDK Agent | 13 | ✅ 13/13 | ✅ 13/13 | ❌ 12/13 | ✅ 13/13 | ✅ 13/13 | ✅ 13/13 |
| 017 | Socialite GitHub Login | 15 | ✅ 15/15 | ✅ 15/15 | ✅ 15/15 | ✅ 15/15 | ✅ 15/15 | ✅ 15/15 |
| | Evaluations passed | | 9/17 | 13/17 | 11/17 | 16/17 | 16/17 | 14/17 |
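
To make the scoring rule concrete, here is a small sketch that recomputes the haiku-4.5 column from the per-evaluation counts in the table above. The function name and data layout are illustrative, not part of the framework; the rule it encodes (an evaluation counts as passed only when every test in its suite passes) is how the ✅ and ❌ marks in the table read.

```python
# Per-evaluation (passed, total) test counts for haiku-4.5 with Boost,
# copied from the table above in evaluation order 001..017.
HAIKU_WITH_BOOST = [
    (10, 10), (9, 9), (41, 41), (10, 10), (11, 11), (21, 44),
    (21, 21), (9, 9), (12, 15), (29, 31), (17, 19), (8, 18),
    (12, 17), (17, 19), (12, 13), (13, 13), (15, 15),
]


def summarize(results):
    """Return (evaluations_passed, tests_passed, tests_total)."""
    # An evaluation passes only when every one of its tests passes.
    evals_passed = sum(1 for passed, total in results if passed == total)
    tests_passed = sum(passed for passed, _ in results)
    tests_total = sum(total for _, total in results)
    return evals_passed, tests_passed, tests_total


print(summarize(HAIKU_WITH_BOOST))  # (9, 267, 315)
```

The two numbers can diverge sharply: a model can pass most individual tests and still fail many evaluations, because a single failing test marks the whole evaluation as failed.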

Without Boost

| Evaluation | Task | Tests | haiku-4.5 | sonnet-4.6 | kimi-k2.5 | gpt-5.3-codex | gpt-5.4 | opus-4.6 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 001 | Routes | 10 | ✅ 10/10 | ✅ 10/10 | ✅ 10/10 | ✅ 10/10 | ✅ 10/10 | ✅ 10/10 |
| 002 | Queue Job | 9 | ✅ 9/9 | ✅ 9/9 | ❌ 8/9 | ✅ 9/9 | ❌ 6/9 | ✅ 9/9 |
| 003 | Post CRUD API | 41 | ❌ 34/41 | ❌ 37/41 | ✅ 41/41 | ❌ 37/41 | ❌ 37/41 | ✅ 41/41 |
| 004 | Artisan Command | 10 | ✅ 10/10 | ✅ 10/10 | ✅ 10/10 | ✅ 10/10 | ✅ 10/10 | ❌ 0/10 |
| 005 | Cache | 11 | ✅ 11/11 | ✅ 11/11 | ✅ 11/11 | ✅ 11/11 | ✅ 11/11 | ✅ 11/11 |
| 006 | Eloquent Relationships | 44 | ✅ 44/44 | ✅ 44/44 | ✅ 44/44 | ✅ 44/44 | ✅ 44/44 | ✅ 44/44 |
| 007 | Events and Listeners | 21 | ❌ 7/21 | ✅ 21/21 | ❌ 12/21 | ✅ 21/21 | ✅ 21/21 | ✅ 21/21 |
| 008 | Notifications | 9 | ❌ 8/9 | ✅ 9/9 | ❌ 1/9 | ✅ 9/9 | ❌ 1/9 | ✅ 9/9 |
| 009 | Inertia Shared Data | 15 | ❌ 5/15 | ✅ 15/15 | ❌ 14/15 | ✅ 15/15 | ✅ 15/15 | ❌ 1/15 |
| 010 | File Uploads | 31 | ❌ 27/31 | ❌ 28/31 | ✅ 31/31 | ✅ 31/31 | ❌ 30/31 | ❌ 25/31 |
| 011 | Livewire Components | 19 | ❌ 14/19 | ✅ 19/19 | ❌ 5/19 | ✅ 19/19 | ✅ 19/19 | ✅ 19/19 |
| 012 | Folio Pages | 18 | ❌ 0/18 | ❌ 9/18 | ❌ 9/18 | ✅ 18/18 | ✅ 18/18 | ❌ 8/18 |
| 013 | Pennant Feature Flags | 17 | ❌ 14/17 | ✅ 17/17 | ❌ 16/17 | ✅ 17/17 | ✅ 17/17 | ✅ 17/17 |
| 014 | Inertia Form | 19 | ❌ 17/19 | ✅ 19/19 | ✅ 19/19 | ✅ 19/19 | ✅ 19/19 | ✅ 19/19 |
| 015 | MCP Server | 13 | ❌ 0/13 | ✅ 13/13 | ✅ 13/13 | ❌ 0/13 | ✅ 13/13 | ✅ 13/13 |
| 016 | AI SDK Agent | 13 | ❌ 6/13 | ❌ 10/13 | ❌ 11/13 | ✅ 13/13 | ✅ 13/13 | ✅ 13/13 |
| 017 | Socialite GitHub Login | 15 | ✅ 15/15 | ✅ 15/15 | ✅ 15/15 | ✅ 15/15 | ✅ 15/15 | ✅ 15/15 |
| | Evaluations passed | | 6/17 | 13/17 | 9/17 | 15/17 | 13/17 | 13/17 |

Side-by-Side Summary

| Model | Tests Passed (Boost) | Tests Passed (No Boost) | Delta | Test Accuracy (Boost) | Avg Time (Boost) | Avg Time (No Boost) |
| --- | --- | --- | --- | --- | --- | --- |
| haiku-4.5 | 267/315 | 231/315 | +36 | 84.8% | 145s | 164s |
| sonnet-4.6 | 297/315 | 296/315 | +1 | 94.3% | 179s | 208s |
| kimi-k2.5 | 298/315 | 270/315 | +28 | 94.6% | 108s | 122s |
| gpt-5.3-codex | 313/315 | 298/315 | +15 | 99.4% | 191s | 175s |
| gpt-5.4 | 312/315 | 299/315 | +13 | 99.0% | 210s | 202s |
| opus-4.6 | 301/315 | 275/315 | +26 | 95.6% | 217s | 275s |
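
For readers who want to check the arithmetic, the Delta and Test Accuracy columns follow directly from the test counts. The sketch below reproduces them; the variable and function names are made up for the example, and the counts are copied from the summary table.

```python
TOTAL_TESTS = 315

# model -> (tests passed with Boost, tests passed without Boost)
TESTS_PASSED = {
    "haiku-4.5": (267, 231),
    "sonnet-4.6": (297, 296),
    "kimi-k2.5": (298, 270),
    "gpt-5.3-codex": (313, 298),
    "gpt-5.4": (312, 299),
    "opus-4.6": (301, 275),
}


def summary_row(model):
    """Return (delta in tests, accuracy with Boost as a percentage)."""
    with_boost, without_boost = TESTS_PASSED[model]
    delta = with_boost - without_boost
    accuracy = round(100 * with_boost / TOTAL_TESTS, 1)
    return delta, accuracy


for model in TESTS_PASSED:
    delta, accuracy = summary_row(model)
    print(f"{model}: {delta:+d} tests, {accuracy}% with Boost")
```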

Does Boost Improve AI Coding Agents?

The short answer is yes. Boost helps every model we tested.

When we started running the benchmarks, the goal was not to prove that Laravel Boost worked. The goal was simply to observe what happened when it was turned on and off. The results turned out to be surprisingly clear. Across the entire benchmark suite, every model either improved or stayed roughly the same once Boost was enabled.

  • haiku 4.5
    • One of the biggest improvements in the entire benchmark.
    • Without Laravel Boost, it struggled heavily on complex evaluations.
    • With Boost enabled, the model recovered on several of these harder tasks and improved from 6/17 to 9/17 overall evaluations.
  • gpt 5.3 codex
    • Already a strong performer even without Laravel Boost.
    • Without Boost it passed 15/17 evaluations.
    • With Boost it improved to 16/17, the highest score in the benchmark run.
    • In the tables above, the gains came on the MCP Server evaluation (0/13 to 13/13) and the Post CRUD API (37/41 to 39/41). Inertia shared data, file uploads, Livewire components, and Folio routing passed fully with or without Boost.
  • gpt 5.4
    • The most consistent performer across the entire suite.
    • With Laravel Boost enabled, it also reached 16/17 evaluations passed.
    • Its single failure was not a logic error. It came down to a small configuration detail that prevented a full pass.
  • kimi k2.5
    • The fastest model in the benchmark suite.
    • Averages 108 seconds per evaluation with Boost enabled.
    • Maintains 94.6% test accuracy while being significantly faster than the others.
    • Overall it provides the best balance of speed and correctness among the tested models.
  • sonnet 4.6
    • Already very strong even without Laravel Boost.
    • Without Boost, it passed 296/315 tests.
    • With Boost, it reached 297/315, a small improvement of +1 test.
    • The interesting part is performance. With Boost enabled the average runtime dropped from 208s to 179s, meaning the model became noticeably faster while maintaining the same level of correctness.

Taken together, the results tell a simple story. Laravel Boost consistently improves performance on real Laravel tasks, and in several cases, it pushes already strong models over the final line needed to fully pass the evaluations.

What We Learned

Boost helps most on complex tasks

For simple tasks such as adding a route, dispatching a queue job, or writing a cache layer, agents usually already know the patterns. Boost adds MCP calls and extra tokens but rarely changes the outcome.

The real value shows on harder problems where developers would normally reach for documentation. Examples include building custom MCP servers, working with Laravel AI SDK agent loops, configuring Pennant feature flags, or wiring Inertia shared data. In these situations, Boost provides the missing context that helps agents complete the task correctly.

LLM behavior is non-deterministic

We observed noticeable variance when rerunning some evaluations. For example, haiku 4.5 improved from 7/19 to 17/19 on the Livewire evaluation with Boost just by running the same evaluation again. gpt-5.3 codex also flipped from fail to pass on four evaluations in a single rerun even though the model, prompt, and setup were identical.

Because of this, a single run cannot always be treated as definitive. One approach we are considering is running each evaluation multiple times and taking the majority result. For now the results should be treated as directional signals rather than fixed measurements.
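
A majority vote over repeated runs could look something like the sketch below. This is only an illustration of the idea we are considering, not something the framework does today; the function name and the example outcomes are hypothetical.

```python
def majority_result(runs):
    """Return True if the evaluation passed in a strict majority of runs.

    `runs` is a list of booleans, one per rerun (True = all tests passed).
    """
    if not runs:
        raise ValueError("need at least one run")
    passes = sum(1 for passed in runs if passed)
    # A tie (possible with an even number of runs) counts as a failure.
    return passes > len(runs) / 2


# Hypothetical outcomes for three reruns of a single evaluation.
print(majority_result([False, True, True]))  # True
```

Using an odd number of runs avoids ties altogether; with an even number, this sketch takes the conservative choice and treats a tie as a fail.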

Small configuration mistakes can fail the entire evaluation

One common failure pattern was configuration errors. A misconfigured service provider, incorrect binding, or broken migration can stop the Laravel application from booting. When that happens, every test fails even if the core logic is correct.

For example, haiku 4.5 scored 0/13 on the MCP Server evaluation without Boost, likely due to the server being wired incorrectly. The application never started, so no tests could run. This highlights a real risk in AI-assisted Laravel development. Code may look correct, but a small setup issue can silently break the entire application.

Boost introduces some overhead

We also observed that enabling Boost can add token, context, and sometimes time overhead. Since agents make MCP calls to retrieve documentation and framework context, total token usage naturally increases.

This creates a trade-off between maximum correctness and minimal token usage.

In practice, the cost difference is very small. Across our runs, the average delta was roughly $0.05 to $0.20 per evaluation at API pricing, which is almost negligible on a subscription plan.
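
As a back-of-the-envelope check, that per-evaluation delta can be scaled to one full pass through the 17 evaluations. The calculation below assumes the delta applies evenly to every evaluation, which is a simplification; the names are illustrative.

```python
EVALUATIONS = 17
DELTA_LOW_USD, DELTA_HIGH_USD = 0.05, 0.20  # per-evaluation range quoted above


def suite_overhead(evaluations, per_eval_delta):
    """Extra cost in USD for one pass through the suite, rounded to cents."""
    return round(evaluations * per_eval_delta, 2)


low = suite_overhead(EVALUATIONS, DELTA_LOW_USD)
high = suite_overhead(EVALUATIONS, DELTA_HIGH_USD)
print(f"${low:.2f} to ${high:.2f} per model per full run")  # $0.85 to $3.40 per model per full run
```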

Still, reducing overhead is an important focus for us. Many recent improvements in Boost have focused on reducing unnecessary context, removing redundant guidelines, and tightening tool outputs so agents get the information they need without inflating the prompt.

The goal is to keep the performance benefits of Laravel Boost while minimizing context and token overhead as much as possible.

What’s Next

We are expanding Boost Benchmarks in several directions:

  • More Evaluations

    The suite is growing to cover additional Laravel patterns and edge cases.

  • Adding More Evaluation Parameters

    Currently, we only evaluate correctness based on the Pest tests and the architecture test. Going forward, we need to add expectations around agent behavior as well. For example, did the agent call any specific tools or invoke any skills?

  • More Models

    So far, we have run gpt-5.4, opus-4.6, sonnet-4.6, gpt-5.3 codex, haiku 4.5, and kimi-k2.5 across the full benchmark suite. As new models ship, we run them against the same evaluations to track progress over time.

  • Web UI

    A dashboard for comparing runs, drilling into failures, and exploring results interactively.

In the meantime, if you want to start building Laravel apps faster and more efficiently with the help of AI models, try Laravel Boost. Check our results, and pick the AI agent that best suits your needs.
