A $35M D2C supplements brand we work with had a simple default for years: ground shipments under 5 lbs go UPS, over 5 lbs go FedEx, expedited goes whoever's cheaper that week. Nobody scored carriers. Nobody tracked on-time delivery by lane. When customers complained about late deliveries to Texas, the ops lead would shrug and say "must have been a weather thing." Three months of carrier data later, we found UPS was running 11 percentage points below FedEx on the Texas lane during summer. The brand had been paying for $40 expedited reships to make customers whole, all because the routing decision was set-and-forget.
This is the silent margin leak at most mid-market brands. Carrier scoring sits on top of multi-warehouse fulfillment as the layer that almost nobody does well at $20-100M revenue, even though the data is sitting in every brand's WMS waiting to be used.
This is the operator's playbook: what to score, how to score it, where AI fits in, and the 60-day rollout that recovers 3-8 percent of shipping cost without changing carriers.
What carrier scoring actually means at mid-market
Carrier scoring is the practice of continuously tracking carrier performance across multiple dimensions, by route, over time, and using that data to override default routing decisions when one carrier is materially underperforming on a specific lane.
The textbook definition stops at on-time delivery rate. The version that actually saves money at $20-100M includes:
- On-time delivery by carrier × lane × week (not annual averages)
- Damage and loss rate by carrier × SKU category (some carriers are rough on fragile freight)
- Cost per package by carrier × dimensional weight × lane (rates shift weekly)
- Customer experience by carrier × region (NPS or post-delivery survey tagged by carrier)
- Exception rate (address validation failures, reroutes, missed pickups)
- Surcharge exposure (residential, extended area, fuel, peak season)
The brands that score across all six dimensions catch the 3-8 percent margin leak. The brands that only track "did the package arrive on time" miss most of the optimization.
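To make "by carrier × lane × week, not annual averages" concrete, here is a minimal sketch of a rolling 4-week on-time rate. The record fields (`carrier`, `lane`, `promised`, `delivered`) and the lane label are hypothetical; your WMS export will use its own names.

```python
from collections import defaultdict
from datetime import date, timedelta

def rolling_on_time_rate(shipments, as_of, window_days=28):
    """Return {(carrier, lane): on-time fraction} for deliveries in the
    window_days ending at as_of. Field names are illustrative assumptions."""
    start = as_of - timedelta(days=window_days)
    counts = defaultdict(lambda: [0, 0])  # key -> [on_time, total]
    for s in shipments:
        if start < s["delivered"] <= as_of:
            key = (s["carrier"], s["lane"])
            counts[key][1] += 1
            if s["delivered"] <= s["promised"]:
                counts[key][0] += 1
    return {key: on_time / total for key, (on_time, total) in counts.items()}

shipments = [
    {"carrier": "UPS", "lane": "NJ->TX", "promised": date(2024, 7, 10), "delivered": date(2024, 7, 12)},
    {"carrier": "UPS", "lane": "NJ->TX", "promised": date(2024, 7, 15), "delivered": date(2024, 7, 15)},
    {"carrier": "FedEx", "lane": "NJ->TX", "promised": date(2024, 7, 11), "delivered": date(2024, 7, 10)},
    # Old delivery: falls outside the 28-day window and is ignored.
    {"carrier": "UPS", "lane": "NJ->TX", "promised": date(2024, 5, 1), "delivered": date(2024, 5, 1)},
]
rates = rolling_on_time_rate(shipments, as_of=date(2024, 7, 28))
print(rates)  # UPS on NJ->TX: 0.5, FedEx on NJ->TX: 1.0
```

Re-running the same function each week with a new `as_of` date gives the weekly series per lane.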
The data sources you already have
Carrier scoring at mid-market does not require new data collection. Every brand already has the data needed; it just lives in disconnected silos:
- WMS or 3PL portal: shipping records with carrier, service level, cost, tracking number per package
- Carrier APIs: UPS, FedEx, USPS, and most regional carriers expose tracking and delivery status through their APIs, often with webhook or push notifications
- Customer service tickets: damage claims, lost package reports, late delivery complaints, typically in Zendesk, Intercom, or Gorgias
- Returns data: damage at receiving (returns triage attributes damage by carrier)
- Customer surveys: post-delivery NPS, if collected, segmented by carrier
The work is plumbing: pull all of these into one analytical view. We covered the broader data layer architecture in our AI for e-commerce operations playbook.
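The plumbing can be sketched as a join on tracking number. This is an illustration only: the three input lists and their field names (`tracking`, `reason`, `damaged`) are assumptions standing in for whatever your WMS, help desk, and returns system actually export.

```python
def unify_by_tracking(shipments, tickets, returns):
    """Build one record per tracking number from three hypothetical sources."""
    view = {}
    for s in shipments:
        view[s["tracking"]] = {**s, "tickets": [], "damaged_on_return": False}
    for t in tickets:
        rec = view.get(t["tracking"])
        if rec is not None:  # skip tickets that match no shipment record
            rec["tickets"].append(t["reason"])
    for r in returns:
        rec = view.get(r["tracking"])
        if rec is not None and r["damaged"]:
            rec["damaged_on_return"] = True
    return view

shipments = [
    {"tracking": "1Z001", "carrier": "UPS", "cost": 9.40},
    {"tracking": "7801", "carrier": "FedEx", "cost": 11.10},
]
tickets = [
    {"tracking": "1Z001", "reason": "late delivery"},
    {"tracking": "UNKNOWN", "reason": "damage claim"},
]
returns = [{"tracking": "7801", "damaged": True}]

view = unify_by_tracking(shipments, tickets, returns)
print(view["1Z001"]["tickets"])  # ['late delivery']
```

In practice the unmatched tickets deserve their own report, since a ticket with no tracking number usually points at a data-entry gap worth closing.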
The scoring formula
A simple weighted composite that works at mid-market:
| Metric | Weight | How it's measured |
|---|---|---|
| On-time delivery rate | 30% | % delivered by promised date over rolling 4-week window |
| Cost per package (vs lane median) | 25% | Standardized to dim weight tier |
| Damage + loss rate | 20% | Customer-reported + returns-receiving |
| Exception rate | 15% | Address validations, reroutes, missed pickups |
| Customer experience | 10% | Post-delivery NPS or CSAT tagged by carrier |
The output: a 0-100 score per carrier per lane per week. The OMS routing layer uses this score to override defaults when a specific carrier-lane combination drops below threshold (typically 70).
The five weights above are reasonable defaults, not gospel. Adjust them to your business, keeping the total at 100%: raising one weight means lowering others. Fragile freight brands weight damage rate higher (35-40%). Cost-sensitive D2C brands weight cost per package higher (40%+). Run with defaults for the first 90 days, then tune.
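The weighted composite can be sketched in a few lines. The weights come from the table above. The sketch assumes each metric has already been converted to a 0-100 subscore where higher is better; how you normalize cost or damage rate onto that scale is a separate choice not shown here, and the metric keys and example values are hypothetical.

```python
# Default weights from the table above; they must sum to 1.0.
WEIGHTS = {
    "on_time": 0.30,
    "cost": 0.25,
    "damage_loss": 0.20,
    "exceptions": 0.15,
    "customer_experience": 0.10,
}

def composite_score(subscores, weights=WEIGHTS):
    """Combine 0-100 subscores (higher is better) into one 0-100 score."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1.0")
    missing = set(weights) - set(subscores)
    if missing:
        raise ValueError(f"missing subscores: {sorted(missing)}")
    return round(sum(weights[m] * subscores[m] for m in weights), 2)

# Illustrative subscores for one carrier on one lane in one week.
ups_texas = {
    "on_time": 87,
    "cost": 60,
    "damage_loss": 70,
    "exceptions": 65,
    "customer_experience": 72,
}
score = composite_score(ups_texas)
print(score)  # about 72, just above the typical threshold of 70
```

Passing a different `weights` dict (for example, a heavier damage weight for fragile freight) reuses the same function, and the sum check catches a retune that no longer adds up to 100%.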
Where AI fits in (and where it does not)
The scoring math is statistics, not AI. Weighted composite, rolling averages, threshold-based routing. None of this requires a model.
AI earns its place in three specific layers on top of the scoring:
1. Exception cause classification
When a shipment misses its promised delivery date, the AI reads the carrier exception code, the customer service ticket text, and the tracking history, then classifies the cause: weather, residential delivery issue, address problem, carrier hub backlog, customer-not-home. Without this classification, you can't distinguish carrier-fault failures from customer-fault failures. With it, your scoring penalizes only what the carrier actually controls.
2. Lane-level pattern detection
The AI watches for emerging patterns: "UPS performance to Texas has dropped 8 percentage points over the last 2 weeks, concentrated in zips 78xxx." A statistical alert would flag the drop. The AI surfaces the specific lane + suggests a temporary override. Operations sees the recommendation, approves or rejects, and the system learns from the decision.
3. Cost prediction for new lanes
When you expand into a new region or carrier, you have no historical scoring data. The AI predicts likely cost + reliability based on similar lanes, carrier rate cards, and historical patterns. Not perfect, but better than the spreadsheet guess most brands make.
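For lane-level pattern detection, the statistical alert that sits underneath the AI layer can be as simple as comparing recent weeks to a baseline. The sketch below covers only that baseline alert, not the AI recommendation on top of it. The 2-week window, the 5-point minimum drop, and the sample numbers are all assumptions.

```python
def flag_lane_drops(weekly_rates, recent_weeks=2, min_drop_points=5.0):
    """weekly_rates: {(carrier, lane): [on-time % per week, oldest first]}.
    Flags lanes whose recent average fell at least min_drop_points
    below the average of the earlier weeks."""
    alerts = []
    for key, series in weekly_rates.items():
        if len(series) <= recent_weeks:
            continue  # not enough history to form a baseline
        earlier = series[:-recent_weeks]
        baseline = sum(earlier) / len(earlier)
        recent = sum(series[-recent_weeks:]) / recent_weeks
        drop = baseline - recent
        if drop >= min_drop_points:
            alerts.append((key, round(drop, 1)))
    return sorted(alerts, key=lambda a: a[1], reverse=True)

weekly_rates = {
    ("UPS", "TX"): [95, 96, 95, 94, 88, 86],
    ("FedEx", "TX"): [96, 95, 96, 96, 95, 96],
}
alerts = flag_lane_drops(weekly_rates)
print(alerts)  # [(('UPS', 'TX'), 8.0)]
```

Each alert is a candidate for review rather than an automatic override, which matches the approve-or-reject step described above.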
What AI does not do: the actual routing decision in production. The scoring decides. The OMS executes. AI sits in the analytical loop, not the production loop. This is the same architecture pattern we cover in multi-warehouse fulfillment without spreadsheets.
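The deterministic routing step can be sketched as a plain function with no model in the loop. The function name, the score table shape, and the fallback rule (pick the best-scoring alternative at or above threshold) are assumptions for illustration; a real OMS will express this in its own rule format.

```python
def choose_carrier(lane, default_carrier, scores, threshold=70):
    """Keep the default carrier unless its lane score is below threshold
    and another carrier on the same lane scores at or above it.
    scores: {(carrier, lane): composite score 0-100}."""
    default_score = scores.get((default_carrier, lane))
    if default_score is None or default_score >= threshold:
        return default_carrier  # no data yet, or default is healthy
    candidates = {
        carrier: s
        for (carrier, scored_lane), s in scores.items()
        if scored_lane == lane and carrier != default_carrier and s >= threshold
    }
    if not candidates:
        return default_carrier  # nothing better available on this lane
    return max(candidates, key=candidates.get)

scores = {
    ("UPS", "TX"): 62,
    ("FedEx", "TX"): 88,
    ("UPS", "CA"): 91,
    ("FedEx", "CA"): 85,
}
print(choose_carrier("TX", "UPS", scores))  # FedEx
print(choose_carrier("CA", "UPS", scores))  # UPS
```

Keeping the default when there is no score for a lane is a conservative choice: the override only fires on evidence, never on missing data.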
The 60-day rollout pattern
For a brand starting with no carrier scoring, the build itself takes about eight weeks, with two more to embed it in the operating rhythm:
- Weeks 1-2: Data unification. Pull WMS shipping records, carrier API webhooks, CS ticket data, returns data into one analytical store. Standardize on tracking number as the join key.
- Weeks 3-4: Baseline scoring. Build the weighted composite. Run it on the last 90 days of data. Identify the top 5 lane-carrier underperformers.
- Weeks 5-6: Override logic. Wire the score into the OMS as a routing input. Implement the auto-override threshold (typically score <70 triggers fallback carrier).
- Weeks 7-8: AI classification overlay. Add exception cause classification + lane pattern detection.
- Weeks 9-10: Embed. Weekly carrier review meeting. Monthly carrier negotiation backed by the scoring data.
By month three, the average mid-market brand recovers 3-8 percent of shipping cost. For a brand spending $2M/year on shipping (typical at $40M revenue), that's $60K-$160K recovered annually.
What it actually costs to build
Realistic cost for a $20-100M brand implementing carrier scoring properly:
- Data integration (WMS + carrier APIs + CS + returns): $20-40K one-time
- Scoring engine build: $15-30K one-time
- AI classification + pattern detection: $20-40K one-time + $500-1,500/month operating
- OMS integration: $10-20K one-time (toward the low end with a modern OMS, higher with a legacy one)
- Year-one total: $65-130K build + $6-18K/year operating
For a $40M brand recovering $100K/year in shipping cost, the build pays back in 8-15 months. After that, it's pure margin.
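The arithmetic behind these figures is easy to check. The sketch below uses the numbers quoted in this article; the function name and the optional operating-cost parameter are additions for illustration.

```python
def payback_months(build_cost, annual_recovery, annual_operating=0):
    """Months until cumulative net recovery covers the one-time build cost."""
    net_annual = annual_recovery - annual_operating
    if net_annual <= 0:
        raise ValueError("annual recovery must exceed operating cost")
    return 12 * build_cost / net_annual

shipping_spend = 2_000_000
recovery_low = shipping_spend * 3 // 100    # 3 percent of shipping cost
recovery_high = shipping_spend * 8 // 100   # 8 percent of shipping cost
print(recovery_low, recovery_high)          # 60000 160000

# Build cost range against $100K/year recovery, before operating cost.
fast = payback_months(65_000, 100_000)
slow = payback_months(130_000, 100_000)
print(round(fast, 1), round(slow, 1))       # 7.8 15.6
```

These gross figures land close to the 8-15 month range quoted above. Passing `annual_operating` (the $6-18K/year AI operating cost) lengthens the payback somewhat, so treat the range as approximate.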
The leverage carrier scoring creates in negotiation
The hidden benefit beyond routing optimization: when annual carrier rate negotiations come around, you have data the carrier doesn't expect you to have.
Standard negotiation: "We ship 50K packages a year, give us 8 percent off." The carrier rep says no.
Data-backed negotiation: "We ship 50K packages a year. Your on-time delivery to Texas is 87 percent vs FedEx's 96. Your damage rate on glass freight is 2.3 percent vs the industry average of 0.8. We have $400K in volume on the table. Give us 12 percent off or we move the Texas + glass freight volumes to FedEx." Carrier rep escalates and approves.
Brands that walk into annual negotiations with carrier scoring data routinely capture 2-5 percent additional discount that has nothing to do with raw volume. The data itself is the leverage.
Common failure modes
- Tracking only on-time delivery. You miss the damage rate signals, the cost variance, the exception rate. Score across at least 5 dimensions.
- Using annual averages. A carrier can be great YTD and terrible this week. Use rolling 4-week windows.
- Penalizing carriers for customer-fault failures. Customer-not-home or wrong address are not carrier issues. Filter them out via exception classification.
- Building the scoring without OMS integration. A score that nobody uses to route orders is just a report. Wire it into the routing engine.
- Set-and-forget weights. Tune the weights every 90 days based on what the business actually values.
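As an illustration of the third failure mode, the sketch below drops customer-fault misses before computing an on-time rate. It assumes each late delivery already carries a `cause` label from exception classification. The label strings, field names, and the choice of which causes to exclude are hypothetical policy decisions, not a fixed taxonomy.

```python
# Causes treated as outside the carrier's control (an assumed policy).
EXCLUDED_CAUSES = {"customer_not_home", "address_problem"}

def carrier_fault_on_time_rate(deliveries, excluded=EXCLUDED_CAUSES):
    """On-time rate after removing late deliveries with an excluded cause.
    Returns None when nothing is left to score."""
    scored = [d for d in deliveries if d["on_time"] or d["cause"] not in excluded]
    if not scored:
        return None
    return sum(d["on_time"] for d in scored) / len(scored)

deliveries = (
    [{"on_time": True, "cause": None}] * 7
    + [{"on_time": False, "cause": "customer_not_home"}] * 2
    + [{"on_time": False, "cause": "carrier_hub_backlog"}]
)
raw_rate = sum(d["on_time"] for d in deliveries) / len(deliveries)
adjusted_rate = carrier_fault_on_time_rate(deliveries)
print(raw_rate, adjusted_rate)  # 0.7 0.875
```

The gap between the raw and adjusted rates shows how much of a carrier's apparent underperformance is actually outside its control.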
Frequently asked questions
What is carrier scoring?
Carrier scoring is the practice of continuously tracking shipping carrier performance across on-time delivery, cost per package, damage rate, exception rate, and customer experience, by carrier × lane × week. The composite score drives automatic routing overrides when a carrier underperforms on a specific lane.
How much margin does carrier scoring recover?
For mid-market brands ($20-100M D2C) implementing scoring properly, recovery is typically 3-8 percent of total shipping cost. For a brand spending $2M/year on shipping, that's $60K-$160K recovered annually. Build cost is usually $65-130K with payback in 8-15 months.
Do I need AI for carrier scoring?
Not for the core scoring math (it's weighted statistics). AI earns its place in three layers on top: exception cause classification, lane-level pattern detection, and cost prediction for new lanes. The scoring decision and OMS routing remain deterministic.
How long does it take to build carrier scoring?
60-90 days from start to production system. Data unification (weeks 1-2), baseline scoring (3-4), OMS override logic (5-6), AI classification (7-8), embed (9-10).
What software does carrier scoring at mid-market?
Pure-play carrier scoring software is rare; most brands build custom on top of their OMS using shipping data + a BI layer. ShipBob and ShipHero have basic carrier analytics built in but don't do the full multi-dimensional scoring at the depth that moves margin. For $20-100M brands, custom build on top of a real OMS (covered in our order management systems guide) is the pattern that works.
Should I switch carriers based on the scoring?
Usually not. The scoring is for routing overrides on specific underperforming lanes, not wholesale carrier switches. If UPS underperforms on the Texas lane for 8 consecutive weeks, you route Texas to FedEx but keep UPS as primary elsewhere. Wholesale carrier switches involve volume re-negotiation, integration work, and customer-experience risk that usually outweighs the gain.
Bottom line
Carrier scoring at $20-100M D2C is the highest-ROI shipping investment most brands ignore. The data already exists. The scoring math is statistics, not AI. The recovery is 3-8 percent of shipping cost annually, plus negotiating leverage that captures another 2-5 percent at contract renewal. Brands that build it as a permanent operating layer pay for it in under a year and compound margin every quarter after. Brands that rely on carrier defaults leave the money on the table.