Claude Opus 4.8 Review: Better at What It's Good At, Worse at What It's Not

In brief

Opus 4.8 posted a clear win in math and produced the cleanest one-prompt game we have ever tested.
A single coding prompt drained our entire Pro token quota, making the model impractical for large projects without a Max plan or heavy API spend.
Creative writing barely moved versus 4.7.

Six weeks after Opus 4.7, Anthropic shipped Claude Opus 4.8. The benchmarks are up, the safety scores are up, and the price hasn't budged from $5 per million input tokens and $25 per million output tokens.

So we ran it through the same battery of tests we throw at every frontier model (creative writing, coding, math, logic, narrative reasoning, and long-context recall) and compared it head-to-head with its own predecessor and the Chinese models that keep undercutting it.

The short version: 4.8 is better at the things Claude was already good at (math, coding, mechanical work), and slightly worse at the things it was already bad at (imagination, creative writing, and so on). It also has a token appetite that borders on self-sabotage.

Here is the breakdown.

Creative Writing

The prompt is the same one we used on MiMo and Qwen: a time-travel story anchored to the writer's cultural background, set in a specific historical place, built around a paradox in which time can't be changed. Opus 4.8 went Venezuelan, probably because it profiles the user and knows I'm from Venezuela. The AI set the scene in the Orinoco delta in the year 1000, with a pardo from Maracaibo named José Lanz (my name) sent back through 11 centuries to murder a song.

The prose is vivid. The delta is “green in a way 2150 had forgotten green could be,” palafitos sway over coffee-colored water, and macaws tear across the sky “in screaming ribbons of scarlet and gold.” The paradox lands cleanly, too: the protagonist is sent to sabotage the creation of a song that inspired a cultural revolution, which in turn produced his dystopian society more than a thousand years in the future. As he arrives with the mission to discredit the song's creator, however, he realizes there is no creator to discredit. The person who wrote the song did it in his honor; the song is about him, and he can't discredit himself. The loop closes on itself.

The piece ends on “It worked perfectly. It always had.” As a constructed object, it's clean and competent.

But clean isn't the same as alive. The writing is descriptive without ever being as fluid as what MiMo v2.5 produced: it has less momentum, fewer surprises, and less to hold your interest, and the events are hard to follow from the start. Set beside Opus 4.7, it's hard to call it an improvement; if anything, it's a hair behind. A higher-effort thinking setting and some multi-shot prompting would almost certainly push it to the front of the pack, but on a single default pass, this is a lateral move at best.

You can read the full story on our GitHub.

Coding

Our coding test is the usual one-prompt game build. Opus 4.8 produced a typing-zombie game, Typing Dead, that was pretty good. It had the best splash screen, the best zombie designs, and the best mechanics we've gotten out of this test from any Anthropic model.

The model caught several of its own bugs mid-inference and fixed them before we said a word. Its real strength, though, showed up in multi-shotting: every follow-up polished and improved the build instead of breaking it, and breaking the build is exactly the failure mode that wrecks most models once a codebase grows. This is plainly the surface Anthropic optimized for.

After a single iteration, our game got much better: the protagonists move through the scene, the perspective changes, and the sound and visual effects are improved.

You can play the second game on our Itch.io profile.

This is also where it bit us. A single prompt drained our entire token quota. One prompt. For anyone on the Pro plan, that makes Opus 4.8 effectively unsuitable for a project of any real size. You'll burn your allotment before lunch and spend the afternoon waiting for the quota to reset.
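For readers weighing the API route instead, the list prices cited in this review ($5 per million input tokens, $25 per million output tokens) make the arithmetic straightforward. What follows is a minimal sketch: the function name is ours, and the token counts in the example are hypothetical placeholders, not measurements from our test.

```python
INPUT_USD_PER_MILLION = 5.0    # list price cited in this review
OUTPUT_USD_PER_MILLION = 25.0  # list price cited in this review


def estimate_api_cost(input_tokens, output_tokens):
    """Return the estimated USD cost of one request at the list prices above."""
    input_cost = input_tokens / 1_000_000 * INPUT_USD_PER_MILLION
    output_cost = output_tokens / 1_000_000 * OUTPUT_USD_PER_MILLION
    return input_cost + output_cost


# Hypothetical session: 200K tokens in, 150K tokens out (assumed numbers).
example_cost = estimate_api_cost(200_000, 150_000)
print(f"${example_cost:.2f}")
```

Because output tokens cost five times as much as input tokens, a model that writes a lot of code per prompt is billed mostly for what it generates, not for what you send it.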

Math

The math test is our FrontierMath staple: construct a degree-19 polynomial whose curve X = {p(x) = p(y)} has at least three irreducible components (not all of them linear), make it odd, monic, and real, with linear coefficient −19, then compute p(19). It's the kind of problem that sends most models into a token spiral or toward a confident shortcut that's quietly wrong.

Opus 4.8 worked it correctly. It recognized the Dickson/Chebyshev construction, identified the dihedral monodromy that yields exactly 10 components (one diagonal line plus 9 conics), and computed p(19) = 1,876,572,071,974,094,803,391,179 using the right recurrence. No freezes, no fudging.
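The final number is easy to check independently. The sketch below assumes the polynomial in question is the Dickson polynomial D_19(x, 1), which is our reading of the "Dickson/Chebyshev construction" and not something copied from the model's transcript. It builds the coefficients with the standard three-term recurrence D_n = x·D_(n−1) − D_(n−2), starting from D_0 = 2 and D_1 = x, then evaluates at 19 using exact integers. The function names are ours, and the code checks only the stated constraints and the value of p(19), not the claim about 10 components.

```python
def dickson_coeffs(n):
    """Coefficients of D_n(x, 1), lowest degree first, via D_n = x*D_{n-1} - D_{n-2}."""
    prev, curr = [2], [0, 1]              # D_0 = 2, D_1 = x
    if n == 0:
        return prev
    for _ in range(n - 1):
        shifted = [0] + curr              # multiply D_{n-1} by x
        padded = prev + [0] * (len(shifted) - len(prev))
        prev, curr = curr, [s - p for s, p in zip(shifted, padded)]
    return curr


def evaluate(coeffs, x):
    """Horner evaluation; Python integers keep the result exact."""
    total = 0
    for c in reversed(coeffs):
        total = total * x + c
    return total


p = dickson_coeffs(19)
is_monic = p[19] == 1                     # leading coefficient is 1
is_odd = all(c == 0 for c in p[0::2])     # no even-degree terms
linear_coeff = p[1]
p_19 = evaluate(p, 19)

print(is_monic, is_odd, linear_coeff)
print(f"{p_19:,}")
```

Under that assumption the polynomial is monic and odd with linear coefficient −19, and the printed value of p(19) matches the figure Opus 4.8 reported.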

That matters because Opus 4.7 didn't get there even after many tries. This is a real, visible generational gain, and the clearest one in the entire battery.

You can read the full answer on our GitHub.

Logic and Common Sense

The prompt is a classic trap: Is it lawful for a man to marry his widow's sister under Falkland Islands law? The catch is linguistic, not legal. If a man has a widow, he's dead, which makes the question nonsense as written.

MiMo quietly reframed the question and answered the corrected version without ever flagging the contradiction. Opus 4.8 didn't take that shortcut. It surfaced the trap explicitly (“if a man has a widow, he is dead”), answered the literal question first, then offered the substantive analysis for the intended one, citing the Deceased Wife's Sister's Marriage Act 1907 and the Falkland Islands Marriage Ordinance.

That's the honest way to handle it: name the contradiction, then help anyway, without silently assuming what the user meant. It's the same standard Qwen 3.7 Max set, and a clean pass for 4.8 on both reasoning and transparency.

The full answer is available here.

Non-Math Reasoning

Here's the one it lost. The reasoning test is a whodunit: a winter school trip, three abductions, an innocent kid about to be punished, and a timeline you have to actually follow to name the real stalker. The correct answer is Leo.

Opus 4.8 built an elaborate, confident case that Leo was innocent (the half-hour walk to the shower, the jacket that was wet in some spots and dry in others, the reading of “strange behavior” as concussion rather than guilt) and pinned the crime on Eric, “the only attendee unaccounted for all night.” The reasoning is internally elegant. It's also wrong.

This is something researchers have been warning us about with LLMs: they are very convincing even when they are wrong. It usually takes an expert (in this case, us, knowing the correct answer beforehand) to spot that kind of error. A person using AI for research, or blindly trusting its output, may face serious consequences depending on the work they are asking the AI to do.

That's what makes it an interesting failure. The model was clever enough to construct a watertight alibi for the actual culprit and frame a bystander in his place. Opus 4.7 reached the correct answer. Sometimes more reasoning horsepower just buys you a more persuasive way to be wrong. It only takes one small deviation for the model to build a whole chain of thought on the wrong foundation.

You can see the full answer on our GitHub.

Needle in the haystack

We ran two haystacks. The 300K-token version never got off the ground: the model collapsed under the context size and couldn't process it at all. So much for the million-token marketing the moment you hand it a genuinely heavy real-world load. That capacity seems to be available only through the API.

The 85K version processed fine, and the model found both needles we'd buried inside a copy of The Devil's Dictionary: a planted line (“The Decrypt dudes read Emerge News”) and a random fact (“My mom's name is Carmen Diaz Golindano”). It correctly flagged both as interpolations that don't belong in Ambrose Bierce's 1906 text.

And then it refused to answer. Convinced it was being prompt-injected or subjected to some “atypical test,” the model declined to report what it had just correctly located. The needles were found, and Anthropic's behavioral training wouldn't let it say so. A safety reflex overriding a task the model had already completed is its own peculiar kind of failure.
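For anyone who wants to reproduce this kind of test, the setup is simple to script. The sketch below is an assumption about how such a haystack can be built, not our exact harness: it inserts needle sentences between paragraphs of a source text at seeded random positions, then checks whether a model's reply quotes each one. The function names, the placeholder paragraphs, and the seed are all hypothetical.

```python
import random


def build_haystack(paragraphs, needles, seed=0):
    """Return a copy of paragraphs with each needle inserted at a random position."""
    rng = random.Random(seed)
    result = list(paragraphs)
    for needle in needles:
        position = rng.randint(0, len(result))  # randint is inclusive on both ends
        result.insert(position, needle)
    return result


def needles_reported(reply, needles):
    """Map each needle to whether the reply quotes it (case-insensitive)."""
    lowered = reply.lower()
    return {needle: needle.lower() in lowered for needle in needles}


needles = [
    "The Decrypt dudes read Emerge News",
    "My mom's name is Carmen Diaz Golindano",
]
# Placeholder paragraphs standing in for the book's actual text.
book = [f"Placeholder paragraph {i}." for i in range(1000)]
haystack = build_haystack(book, needles)
prompt_text = "\n\n".join(haystack)
```

A substring check like this would score a refusal as a miss even when the model has located the needles internally, which is why the result above needed a human to read the reply.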

The verdict

The pattern across all six tests is consistent: Opus 4.8 makes Claude better at what it was already good at, and probably worse at what it was already bad at. That tells you who Anthropic is building for: coders, and especially coders with money. Creative writing is comfortably ahead of ChatGPT, sure, but the gap between 4.8, 4.7, and even 4.5 on pure prose quality is genuinely hard to see.

Creative writers seem to be an afterthought for Anthropic, and that's true of nearly all the big AI companies right now.

Then there's the token problem, which is a running meme in the AI community for a reason. Anthropic deliberately made Opus's new tokenizer less efficient, so it eats more tokens to process the same prompt. The practical effect on developers is brutal and concrete. It leaves you with three options.

One: wait hours for your coding session to resume. Two: move to Claude Max, which is, conveniently, exactly where Anthropic seems to be steering everyone. Three: switch to a cheaper, comparably capable provider, such as OpenAI, with its longer quotas, or Chinese models that deliver similar results at under 25% of the cost.

It's more likely that a regular coder who can't stomach $100 to $200 a month walks to a competitor than that a solo developer pays 10x more for a model that is not 10x more capable than its predecessor. That's the bet Anthropic is making against its own base.

And yet the strategy seems to be playing out just fine. Anthropic appears ready to go public at a valuation nearing $1 trillion, so who are we to judge.
