I Built a Fantasy Football AI Agent (It Lied 4 Times)

I gave Claude Code a public API and one vague prompt.

An hour later I had a custom agent that ran. Then I started checking its answers, and that's where it fell apart.

This post is about that space between "it runs" and "it's right." That's what decides whether your agent is a demo or a tool, and almost nobody shows you what lives inside it.

Here's what I wanted. An agent that could manage my team, not just pull stats for me to read. Know which leagues I'm in, know my roster, know how each league scores, know what the projections say. Then actually answer when I ask who to start or whether a trade is worth it.

I play in three leagues, each with its own settings.

Sleeper has a public API, so I handed the agent a community-written API reference and one line:

The prompt: "Build me a Sleeper fantasy football agent that helps me manage my leagues."

No plan. No workflow diagram. A username, a public API, and a bet that the docs were good enough to figure out the rest.

Claude Code built a working skill in about an hour.

Those two words are worth being precise about, because the rest of this post turns on them. Claude Code is the agent, the thing that does the work. The skill is what it produced: a portable bundle of instructions plus a script. The agent built the skill. And as you'll see, the skill doesn't have to stay inside Claude Code to run.

Then I ran the skill against my real leagues, and it was wrong four times. Twice it lied to me: plausible numbers, completely wrong, no way to tell by looking. Twice it just went quiet, a blank where my team should have been. Not one of the four threw an error.

That's the whole problem. The dangerous failures don't announce themselves. They hand you a believable number, or a blank, and let you draw the conclusion.

Letting the agent build it

From that open API you can pull any user's leagues, rosters, and projected points for the season. The projections are what make trade analysis possible.

The agent wired up the obvious path on the first try: take a username, find the leagues, pull the roster, look up projections.

One of those pieces leans on an endpoint Sleeper never officially documented. The community found it years ago by watching what the web app quietly requests in the background, and it's been stable ever since. If anything in this build breaks a year from now, that's my bet for what breaks, so I'm flagging it loudly.

There was one more thing the agent had to learn the hard way: player names. The roster data comes back as numeric IDs and nothing else. To turn an ID into "Patrick Mahomes," you have to download Sleeper's full player list, which is big, and keep a local copy. Skip that and every roster reads like a phone book. It's the single most common way to break a Sleeper integration, which is why the agent sorted it out before I ever saw a roster.

So the agent had a thing that ran. Then I pointed it at my actual leagues, and the lying started.

Where it went wrong, and how it got fixed

Here's what I actually like about building this way: the agent gets you to "runs" fast, then real data does the teaching. Every one of these surfaced from pointing the skill at a real league and watching it fail in a way I almost didn't catch.

Two times it lied

The scoring settings were wrong, and the numbers looked fine anyway.

I pulled my standard-scoring league and started reading off projections. They were plausible. Then I stopped and asked myself whether the skill even knew how this league scores.

It didn't. It was using full-PPR numbers for a league that doesn't award a single point for receptions.

This is the dangerous kind of bug. Nothing crashes, the numbers land in a believable range, and you only catch it if you stop and ask the specific question. Once the skill read each league's actual scoring format first, the totals moved 30 to 40 points per player. A trade you'd recommend on the wrong numbers is a trade you'd regret.

It pulled last season's projections.

The NFL season ended in February. It's June now. When the skill first asked for projections, it got last year's, because the endpoint assumes you mean whatever season is currently underway. For a season that hasn't started yet, you have to ask for it by name.

I didn't catch this during the build. I caught it the first time I tried to use the thing for real, which tells you exactly how much testing against a hypothetical is worth.

Both of these are the same trap. The output looked right. The only tell was knowing enough to be suspicious.

Two times it went quiet

Co-owners don't live where you'd think.

In one league I'm a co-owner, not the primary manager, and the skill couldn't find my team at all. Sleeper files co-owners under the team, not under the person. The skill had been looking under the person, found nothing, and showed me an empty roster. No error. Just a blank where my team should have been.

There's a trap inside that trap. When a team has no co-owners, Sleeper doesn't return an empty list. It returns nothing at all. My code handled that by finding nothing and shrugging. Code that assumes the list is even there will crash instead, on exactly the handful of teams that happen to be co-owned. This is the bug I'd bet money breaks in other people's builds.

Team defenses showed up blank.

A defense doesn't have a normal player name attached the way a quarterback does, so it kept rendering as nothing until the skill learned to build the name from the team fields instead.

A blank is easier to notice than a wrong number. But it's still no error, and on a roster full of real names, one quiet gap is easy to skim past.

Four bugs. Zero error messages. Two that lied and two that went silent, and every one stayed hidden until I pointed the skill at a real league instead of an imagined one.

What it looks like when it works

Once all four corrections were in, I floated a real trade in my PPR league. Give up Jayden Daniels, get back Bijan Robinson. One player for one player.

The skill read the league's scoring format, pulled this season's projections, and did the math. Jayden Daniels projects for about 314 points this season. Bijan Robinson projects for about 325. Net, the deal puts about 11 projected points in my pocket.

On points alone, that's a thin yes, barely worth the paperwork. Username in, a real answer out, in the scoring system that league actually uses. Hold onto that number, though, because I'm about to argue with it.

Why "it works" still isn't "it's right"

Here's the part I have to say, because it's the whole point of this post. Even the corrected skill isn't right. It's just less wrong.

Look again at that trade. The 11-point edge makes it look like a coin flip. It isn't, and the number can't tell you why in either direction.

It can't see how badly I need it. In a one-quarterback league, Daniels is my backup. A 314-point player who scores exactly zero for me as long as Lamar Jackson stays healthy. Bijan would start every week on a roster whose best running back projects for 84. The real swing isn't 11 points, it's the gap between a quarterback riding my bench and a genuine RB1 in my lineup. The projection prices players in a vacuum, not against the team I actually have.

And it can't see what I'd be giving up. Trade Daniels and I have no insurance behind Lamar. One injury and my season hangs on a waiver-wire quarterback. The number doesn't charge me for that risk, the same way it doesn't know about bye weeks, or about my superflex league down the hall, where a second quarterback is worth a fortune and this exact trade would be a mistake.

So the number is a starting point, not a verdict. The scoring fix made the projections trustworthy. It did not make the analysis trustworthy. Those are two different problems, and the second one is still on me.

The playbook you can steal

The whole build is one good prompt plus a willingness to test against real leagues. Here's the part that actually transfers, and the part that doesn't.

What transfers: my four bugs. They're baked into the prompt below, so you don't have to bleed for them.

What doesn't transfer: the test that finds yours. Every API has its own version of these, and they only show up when you point the thing at your own real data instead of a tidy example. The prompt saves you my mistakes. It can't save you from yours.

The flow is simple. Give it your username. Let it find your leagues across this season and last, since off-season leagues that haven't drafted yet won't show up otherwise. Pick a league. Get your roster with projections. From there you can float a trade and see how the projected points compare, the way I just did.

The prompt I used: one skill, built once, run anywhere

Now for the part I changed my mind about. My first prompt opened with "build me a Sleeper skill," and that one word quietly decided everything downstream. It baked in the packaging and assumed a human was sitting there to pick a league.

The reusable asset isn't tied to any of that. It's the capability plus the four corrections. So I rewrote the prompt as a single capability block that describes the behavior and says nothing about packaging. Copy this and point it at your own leagues:

Build something that manages Sleeper fantasy football teams. This core
behavior holds regardless of how it's packaged or what tool builds it:

- Take a Sleeper username and resolve it to a user ID.
- List that user's NFL leagues for the current and previous season, so
  leagues that haven't drafted yet still appear.
- For a league, fetch the rosters and find the user's team by checking the
  primary owner and any co-owners, and handle the case where there are no
  co-owners (the field is absent, not an empty list).
- Resolve numeric player IDs to names from a cached copy of Sleeper's full
  player list, refreshed once a day. Team defenses have no normal player
  name, so build it from the first and last name fields.
- For trade analysis, fetch projections for the upcoming season explicitly,
  since the endpoint defaults to the in-progress season, and read the
  league's scoring format first to pick standard, half-PPR, or full-PPR
  values.
- Treat projected points as a starting point, not a verdict: account for
  positional scarcity, bye weeks, and superflex leagues where a raw total
  understates QB value.

These points are the hard-won part, so preserve them exactly. The flow when
a person is driving it: username, then leagues, then pick a league, then
roster with projections, then trade analyzer.

Hand that to Claude Code and you get a skill: a SKILL.md plus a script. Inside Claude Code it's interactive. I give it a username, it lists my leagues, I pick one, it pulls the roster and prices trades on demand.

But a skill is just instructions and a script, so it travels. Drop the same skill into an autonomous agent runtime like Hermes or OpenClaw and it runs on a schedule with no human in the loop. The only thing that changes is the front door. Nothing can ask me anything at runtime, so the username and the leagues to watch come from config instead of a prompt, and on each run it works every league and reports what it found: lineup problems, starters on bye, lopsided trades on the table.

Same skill. One agent built it, another runs it unattended. In Claude Code the skill waits for me to ask. In Hermes or OpenClaw it runs on its own and tells me what it found.

What I'd tell you before your own build

Three things, that's it:

Test against real data on the first iteration, not the third. Every bug here came from a real league. Every one would have stayed hidden behind a plausible output against a made-up team.
Let the agent build the ugly version first. I asked for the thing that runs, then fixed what real data exposed. The four things it got wrong taught me more than a clean first draft would have.
Don't mistake "less wrong" for "right." The skill stopped lying about the projections. It can still mislead me about a trade, because a good trade is more than a points total. Knowing the difference is the actual skill.

Test against real data on the first iteration. That's the whole game.

Copy the prompt above, point it at your own leagues, and build on it. Let me know what cool things you come up with.

- Ben

I built a fantasy football agent in an hour. It lied to me four times.