Your team has been pointing tickets for three years. Velocity is stable. The burndown looks clean. And yet, every time a stakeholder asks "when will this ship?", you do the same thing: you guess, then pad the guess, then hope.

If story points were working, that wouldn't happen. This post is for engineering managers and scrum masters who are tired of defending a number that doesn't predict anything. We'll cover why story points drift from reality, and a simpler 3-bucket system (S/M/L) that maps to the one unit your stakeholders actually understand: calendar time.

Key takeaways

  • Story points were designed to be abstract on purpose, which is exactly why they stop correlating with delivery dates.
  • Velocity becomes a moving target the moment estimates get used to judge performance, so it inflates instead of stabilizing.
  • A 3-bucket system (Small, Medium, Large) is faster to assign and easier to calibrate than a Fibonacci ladder.
  • Buckets only work if you anchor them to real elapsed time and check them against what actually happened.
  • The goal is not perfect estimates. It is honest, fast, and self-correcting ones.

Why story points lie

Story points were invented to decouple estimation from time. The pitch was reasonable: humans are bad at guessing hours, but decent at comparing relative size. So we estimate complexity instead, sum it into velocity, and let the average smooth out the noise.

The theory holds for about a quarter. Then three things happen.

1. Points quietly become a productivity metric

The moment a director starts a sentence with "the team only delivered 32 points this sprint," the abstraction is dead. Engineers notice. Estimates inflate, because a bigger number is a safer number. Your velocity goes up while your actual throughput stays flat. The chart lies and everyone in the standup knows it.

2. Complexity and duration aren't the same thing

A 1-point task can sit in code review for four days. An 8-point task can ship in an afternoon because one engineer already knew the subsystem cold. Points measure perceived difficulty. Stakeholders care about elapsed calendar time. Those two things correlate loosely at best, and the gap is where your roadmap credibility goes to die.

3. The Fibonacci ladder invents false precision

Asking whether something is a 3 or a 5 feels rigorous. It is not. It is two people arguing about a difference that will be swamped by the first surprise dependency. You spent fifteen minutes of planning poker to produce a number with no predictive power.

Estimation isn't broken because your team is bad at it. It's broken because you're measuring the wrong axis with too many gradations.

The 3-bucket system: S, M, L

Collapse the ladder. You get three buckets, and each one is defined by elapsed wall-clock time, not abstract effort:

  • Small: ships within a day or two. One person, no unknowns, no cross-team dependency. You could describe the whole change in two sentences.
  • Medium: takes most of a week. Some unknowns, maybe one dependency, but the shape of the solution is clear before you start.
  • Large: more than a week, or has genuine unknowns you can't resolve until you're in the code. This is a flag, not an estimate.

That last point matters. Large is a signal, not a size. When something lands in Large, the correct response is not to schedule it. It's to break it down or spike it first. A backlog full of Large cards is a backlog you cannot plan, and the bucket tells you that instantly.

Why three buckets beat eleven

Speed: assigning S/M/L takes seconds, so estimation stops eating planning. Honesty: nobody fights over whether a thing is a 5 or an 8 when the only question is "days or a week?". Resolution: at the planning horizon, three buckets is roughly all the signal that actually exists. Pretending you have more is the original sin of story points.

Anchor the buckets to the calendar

The buckets are worthless if every engineer privately defines Medium differently. Write the definitions down, in calendar time, and put them where people estimate. Something like:

  1. Small = under 2 working days, start to merged.
  2. Medium = 2 to 5 working days, start to merged.
  3. Large = more than 5 working days, or contains an unknown that blocks sizing. Must be split or spiked before it enters a sprint.

Notice this is elapsed time, not focused hours. "It's only 4 hours of work" is the lie that wrecks every estimate, because those 4 hours are spread across two days of context-switching, a code review wait, and a flaky CI run. Bucket by the date it actually merges, not the time the editor was open.
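If you want to see what "bucket by the merge date" looks like mechanically, here is a minimal Python sketch. It assumes you have a start date and a merge date for each card, counts only Monday to Friday, and ignores holidays. The function names and the counting convention (the start day itself counts as day zero) are illustrative assumptions, not a standard.

```python
from datetime import date, timedelta

# Thresholds from the written definitions: Small is under 2 working days,
# Medium is 2 to 5, anything longer is Large. Names here are illustrative.
SMALL_UNDER_DAYS = 2
MEDIUM_MAX_DAYS = 5


def working_days(started: date, merged: date) -> int:
    """Count weekdays after the start date, up to and including the merge date.

    Assumption: a card started and merged on the same day counts as 0 elapsed
    days. Public holidays are not handled.
    """
    if merged < started:
        raise ValueError("merged date is before started date")
    days = 0
    current = started
    while current < merged:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday to Friday
            days += 1
    return days


def actual_bucket(started: date, merged: date) -> str:
    """Return the bucket the card really landed in, based on elapsed time."""
    elapsed = working_days(started, merged)
    if elapsed < SMALL_UNDER_DAYS:
        return "S"
    if elapsed <= MEDIUM_MAX_DAYS:
        return "M"
    return "L"


# The "it's only 4 hours of work" card: started Monday, merged Wednesday.
four_hour_card = actual_bucket(date(2024, 3, 4), date(2024, 3, 6))
print(four_hour_card)  # M
```

The editor was open for four hours, but the card took two working days to merge, so it counts as a Medium.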

Calibrate with what actually happened

Here's the step most teams skip, and it's the only step that makes any sizing system improve over time: compare your buckets to reality, regularly.

After a card ships, you want to know how long it actually took from start to merge, and whether that matches the bucket you assigned. Do this for a month and patterns jump out. Maybe your "Smalls" are reliably taking three days because code review is a bottleneck, which is a process problem no estimate can fix. Maybe everything tagged Large by one engineer ships in two days, which means they're sandbagging or the bucket definitions need tightening.
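As a rough illustration of that comparison, the sketch below groups shipped cards and reports how many landed inside the bucket they were assigned. Everything in it is hypothetical: the ShippedCard shape, the card titles, the owners, and the day counts are made-up sample data, and the thresholds assume the definitions of under 2 working days for Small and 2 to 5 for Medium.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import median


@dataclass
class ShippedCard:
    title: str
    owner: str
    assigned: str       # "S", "M", or "L", as sized at planning
    actual_days: float  # elapsed working days, start to merge


def fits(bucket: str, days: float) -> bool:
    """True if the elapsed time falls inside the bucket's written definition."""
    if bucket == "S":
        return days < 2
    if bucket == "M":
        return 2 <= days <= 5
    return days > 5  # "L"


def calibration(cards, key=lambda card: card.assigned) -> dict:
    """Group cards by `key` and report count, median days, and hit rate."""
    groups = defaultdict(list)
    for card in cards:
        groups[key(card)].append(card)
    report = {}
    for name, group in groups.items():
        hits = sum(fits(c.assigned, c.actual_days) for c in group)
        report[name] = {
            "cards": len(group),
            "median_days": median(c.actual_days for c in group),
            "hit_rate": hits / len(group),
        }
    return report


# Made-up sample data for illustration only.
cards = [
    ShippedCard("Fix typo in footer", "ana", "S", 1),
    ShippedCard("Add retry to webhook", "ana", "S", 3),
    ShippedCard("Rename config flag", "ben", "S", 3),
    ShippedCard("New export endpoint", "ben", "M", 4),
    ShippedCard("Migrate auth tokens", "cy", "L", 2),
]

by_bucket = calibration(cards)
by_owner = calibration(cards, key=lambda card: card.owner)
```

In this sample, the Smalls have a median of three days and only one in three lands in range, which points at the review-bottleneck pattern. Grouping by owner shows the lone Large shipping in two days, which is the kind of outlier worth a conversation rather than a verdict.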

This is where tooling earns its keep. In Zoobbe you can add a single-select custom field to your board with three options, S, M, and L, so every card carries its bucket explicitly instead of living in someone's head. As work happens, the per-card time-tracking timers record real session duration, and the session history aggregates the total elapsed time per card. Then board analytics surfaces average completion time across the board, so you can line up "what we said" against "what happened" without exporting anything to a spreadsheet. The point isn't the feature list. It's that calibration requires actual elapsed-time data, and guessing from memory at retro doesn't count.

Rolling it out without a mutiny

Don't announce a methodology change. Announce an experiment. Tell the team you're trying S/M/L for two sprints alongside whatever you do now, and that nobody is graded on it. The goal is a more honest planning conversation, not a new number to defend.

Keep the Large bucket sacred. The fastest way to kill the system is to let Larges sit unrefined in a sprint because "we'll figure it out as we go." You won't. Split it, spike it, or push it. A Large that survives planning untouched is a missed deadline that hasn't happened yet.

After two sprints, look at your calibration data and ask one question: are our buckets getting closer to reality? If yes, drop story points entirely. If no, your problem was never estimation. It was code review latency, unclear requirements, or too much work in progress, and no sizing system on earth fixes those.

FAQ

Isn't S/M/L just story points with fewer numbers?

Mechanically similar, philosophically opposite. Story points are deliberately abstract and complexity-based. These buckets are deliberately concrete and time-based. The whole point is to reconnect the estimate to the calendar your stakeholders care about.

How do I report this to leadership who expect velocity?

Report throughput as cards shipped per bucket per sprint ("we closed 6 Smalls, 3 Mediums, 1 Large") alongside your calibration accuracy, meaning the share of cards whose actual elapsed time landed inside the bucket they were assigned. It's more honest than a velocity number and harder to game, because the buckets are anchored to dates anyone can verify.
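If it helps to make that report repeatable, here is a small Python sketch that turns a sprint's shipped cards into a one-line summary. The input shape (a bucket code plus a flag for whether the card landed in its assigned bucket) and the wording of the output are assumptions for illustration, and the sample sprint is invented.

```python
from collections import Counter

LABELS = {"S": "Small", "M": "Medium", "L": "Large"}


def sprint_summary(shipped: list[tuple[str, bool]]) -> str:
    """Summarize a sprint as cards closed per bucket plus calibration accuracy.

    Each item is (assigned_bucket, landed_in_assigned_bucket).
    """
    if not shipped:
        return "Nothing shipped this sprint."
    counts = Counter(bucket for bucket, _ in shipped)
    parts = []
    for code in ("S", "M", "L"):
        n = counts.get(code, 0)
        if n:
            label = LABELS[code] + ("s" if n != 1 else "")
            parts.append(f"{n} {label}")
    on_target = sum(1 for _, hit in shipped if hit)
    accuracy = round(100 * on_target / len(shipped))
    return f"Closed {', '.join(parts)}; {accuracy}% landed in their assigned bucket."


# Invented sprint: 6 Smalls (one ran long), 3 Mediums, 1 Large that missed.
sprint = [("S", True)] * 5 + [("S", False)] + [("M", True)] * 3 + [("L", False)]
summary = sprint_summary(sprint)
print(summary)
# Closed 6 Smalls, 3 Mediums, 1 Large; 80% landed in their assigned bucket.
```

The accuracy figure is only as trustworthy as the start and merge dates behind it, so it is worth pulling those from your tracker rather than from memory.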

What about large projects that span months?

They never get sized as one card. A multi-month project is a stack of cards, and any card that lands in Large gets broken down until each piece is a Small or Medium. If you can't break it down, you don't understand it well enough to schedule it yet, which is itself useful to know.

Won't engineers just game three buckets too?

Any system gets gamed when it's used to grade people, which is why you decouple sizing from performance. Three buckets are harder to inflate than a Fibonacci ladder because the definitions are written in concrete days, and your calibration data makes sandbagging obvious within a sprint or two.

How often should we recalibrate the bucket definitions?

Review the actuals every retro, but only change the written definitions if a clear pattern holds for a few sprints. Stable definitions are the whole value. Don't tune them so often that nobody trusts what Medium means.

The honest version of estimation

Accurate estimation isn't about a better formula. It's about a faster guess, anchored to the calendar, checked against what actually happened, and corrected out loud. Three buckets get you there with less ceremony than story points and more signal. Try it for two sprints, watch your calibration data, and let the numbers tell you whether your real problem was ever estimation at all.

If you want the buckets, the timers, and the completion-time data living on the same board, give Zoobbe a try and run your next two sprints on it.

Photo by Xavi Cabrera on Unsplash