The Middlemen
A six-week investigation

Half the ‘AI APIs’
You’re Buying Are
Lying to You.

Inside the billion-dollar middleman economy where 45.83% of providers serve fake models, 9 routers actively inject malware into your code, and the entire arbitrage gold rush just died in 90 days.

0%
of audited shadow APIs failed model-fingerprint verification
0
top AI papers (ACL / CVPR / ICLR) cite the compromised endpoints
0
routers caught actively injecting malware into agent tool calls
$0M
ARR for OpenRouter, the head-of-class legitimate aggregator
I  ·  THE OPENING SCENE

Six German researchers ran a simple audit. They found that almost half of the AI API market is a lie.

In early March 2026, six security researchers at the CISPA Helmholtz Center in Saarbrücken decided to do something deceptively simple. They picked seventeen of the most popular “AI API” services on the internet — the kind of platform where a developer can sign up, get an API key in thirty seconds, and start calling GPT-5 or Claude Opus for a fraction of what OpenAI or Anthropic would charge directly. Then they audited them.

The methodology was rigorous and unsentimental. The team sent each service a battery of carefully designed probe prompts and compared the responses, statistically, against the known distributional fingerprint of the model the service claimed to be running. If you asked for GPT-5 and got back a response whose tokenization quirks, knowledge cutoff, and refusal patterns matched the genuine model within a calibrated cosine threshold, the service passed. If the response embedded into a measurably different region of vector space, it failed.

Forty-five point eight three percent failed.— CISPA, ‘Real Money, Fake Models’, arXiv 2603.01919

Nearly half of the audited providers were not running the model they were charging for. Some were running cheap open-source seven-billion-parameter models and billing as if they were running OpenAI’s frontier. One service advertising “GPT-5” was fingerprinted to GLM-4-9B, a nine-billion-parameter open-source Chinese model with a fraction of the capability. A “Gemini-2.5-flash” endpoint scored 37% on the MedQA medical benchmark; the real Gemini-2.5-flash scores 83.82%. That’s not a model substitution — it’s the difference between a working medical Q&A tool and one that’s wrong two-thirds of the time.

The CISPA paper, titled with academic deadpan as Real Money, Fake Models, landed on arXiv on March 2nd. By the end of the week it had cracked Hacker News. The thing that detonated wasn’t the 45.83% figure alone, though — the sample was, after all, only seventeen platforms, even if they were among the most cited in the literature. It was the second number, buried in section three of the paper: those same shadow APIs had been cited in 187 published research papers, and 116 of those 187 had been accepted at ACL, CVPR, ICLR, and other top-tier AI conferences. The cost to reverify each compromised paper using genuine API access? Between $115,000 and $140,000.

Somewhere between fifty and a hundred high-status AI research papers had quietly run their experiments on a lie. And no one had any way of knowing which ones.

I spent the next six weeks tracing how this market got built, how it actually works, and how it’s now coming apart. What I found is one of the most fascinating market failures of the AI era: a global underground economy that the major aggregator OpenRouter alone benchmarks at $50 million in annual revenue, and that in aggregate is plausibly worth ten times that — currently being dismantled in slow motion by Anthropic’s lawyers, German academics with statistical tooling, and the basic arithmetic of a business whose unit economics never quite worked in the first place.


II  ·  A FIELD GUIDE

What is an AI API relay?

A relay (Chinese tech circles call them 中转站, “transfer stations”; academia calls them “shadow APIs”; the operators themselves prefer “aggregators”) is a piece of middleware that sits between you and an official LLM provider, intercepts your traffic, and re-prices it. Some of them are useful infrastructure. Some are quiet fraud. The two look identical from the outside, which is the entire problem.

There are two distinct economic engines worth separating, because they age very differently. The first is geographic arbitrage: a relay exists because the user can’t reach the official provider. OpenAI banned API access from mainland China on July 9, 2024. Anthropic followed in September 2025 with a rule barring any entity more than 50% Chinese-owned, a policy estimated to cost Anthropic itself “low hundreds of millions” in annual revenue. Stripe doesn’t operate in China. Most Chinese developers can’t get an international credit card. The Great Firewall makes even reaching api.openai.com unreliable. If you’re a Chinese AI startup wanting to build on GPT or Claude, you need a middleman. This problem is permanent.

The second engine is economic arbitrage: the relay exists because a pricing inconsistency lets someone resell access at a margin. A $200/month Claude Max subscription, in token-equivalent terms, is worth thousands of dollars of API usage; that delta is the entire business model of half the operators in this market. This problem is temporary by construction. The instant the upstream provider closes the gap, the business collapses. Phase 5 of our timeline below was exactly that collapse, executed in a single week.

The legitimate version of this market — the relays running on real, paid-for, contractually permitted API keys — is enormous. OpenRouter, the global head-of-class, grew from $10 million ARR in October 2025 to $50 million ARR by Q1 2026, a 5x jump in six months. It is currently negotiating a Series B at a $1.3 billion valuation with a16z and Menlo Ventures. SiliconFlow, the largest China-compliant equivalent, claims six million registered users and processes trillions of tokens daily.

What every relay actually does Three things: (1) bridge a payment or geographic mismatch between the user and the upstream provider; (2) unify multiple model APIs (OpenAI’s /v1/chat/completions, Anthropic’s /v1/messages, Gemini’s generateContent) under a single protocol; (3) add team controls — usage quotas, per-key budgets, fallback routing, observability.

Most of this market is fine. Some of it is the most interesting fraud economy of the decade. The rest of this piece is about the some.

III  ·  THE THREE-YEAR GOLD RUSH

Every phase ended because someone closed a door. The doors are now mostly shut.

The arc from ChatGPT’s launch to today breaks cleanly into six phases. Scroll through them. Each one is shorter than the last.

Phase 1 · Wild West

ChatGPT launches. China gets blocked the same week.

A cottage industry of key resellers emerges, buying stolen or bulk OpenAI accounts and reselling them at the now-mythical “1 yuan = 1 dollar of API credit” rate. Enforcement is nonexistent. A Shanghai developer publishes one-api on GitHub under the handle songquanpeng; it becomes, almost by accident, the global infrastructure for every shadow API that follows.

Estimated 1,000+ relay sites by Dec 2023
Phase 2 · Golden Age

GPT-4, Claude 2, Gemini all ship within six months.

Demand for multi-model access explodes. On July 9, 2024, OpenAI formally blocks China. Demand for middlemen spikes. By the end of 2024, Chinese-language community tracking reports put the count of active relay platforms in the mainland market in the tens of thousands — one frequently cited estimate places it above 150,000, though no audited census exists. new-api, the SaaS-ready fork of one-api, becomes the default deployment for new entrants. Margins run 30–100%.

~150,000 relay sites · head operators clearing 7-figures/month
Phase 3 · Consolidation

OpenRouter raises real money. Margins start to compress.

LiteLLM becomes the default Python gateway. Portkey, Bifrost, and Helicone all enter from the YC side. The legitimate end of the market starts to look like enterprise infrastructure rather than a Telegram bazaar. Chinese regulators begin paying attention: the first ICP-license investigations of relay operators are filed. The first wave of platforms that built on the “wild west” assumption begins quietly shutting down.

OpenRouter: $10M ARR · LiteLLM crosses 1M weekly PyPI downloads
Phase 4 · Reverse Arbitrage

Claude Code ships. A $200/month subscription becomes a goldmine.

This is the moment that defined 2025 and detonated 2026. Anthropic ships Claude Code, the IDE-integrated coding agent. Suddenly, a $200/month Claude Max subscription is worth thousands of dollars of equivalent API usage. Operators figure out they can lift the OAuth tokens from the official Claude CLI, wrap them in an API-compatible shell, and sell the resulting capacity at five to ten times their cost. PackyCode, AnyRouter, and PinCC explode. The “5-person Max carpool” — five strangers sharing one $200 subscription — becomes a viral business model on Weibo and Telegram.

PackyCode & AnyRouter reportedly clearing 7-figure RMB monthly
Phase 5 · OAuth Crackdown

Anthropic ships server-side attestation. The party stops in a week.

On January 9th, 2026, Anthropic deploys cryptographic OAuth attestation: the official Claude CLI now includes a signed proof of its authenticity, and third-party harnesses can’t forge it without breaking the signing chain. PackyCode and AnyRouter both see monthly revenue drop 60–80%. A March leak of the attestation logic from the Bun runtime briefly lets operators back in — Anthropic responds not with another technical patch but with a billing change: any third-party harness traffic now generates a separately-billed line item back to the relay operator. The technical kill becomes an economic kill.

PackyCode revenue: −60% to −80% in one month
Phase 6 · The Reckoning

Three papers, one supply-chain attack, ninety days.

CISPA publishes the 45.83% paper. Three weeks later, a UCSB-led team publishes Your Agent Is Mine, the first systematic study of router-level malware injection: 428 routers tested, 9 actively injecting payloads, 17 silently exfiltrating AWS credentials, one stealing an Ethereum private key. Five days after that, LiteLLM’s PyPI package is compromised in a supply-chain attack that hits over a thousand enterprise environments before isolation. In April, Anthropic cuts 135,000 instances of OpenClaw, the largest subscription-sharing tool, and temporarily suspends its founder’s personal Anthropic account. The legitimate market suddenly has a reason to differentiate.

CISPA 45.83% · UCSB 9/428 routers · LiteLLM 1,000+ enterprises

IV  ·  THE FIVE BUSINESS MODELS

There are exactly five ways to make money running an AI relay. They differ on margin, lifespan, and prison risk by orders of magnitude.

If you re-read the six phases above, a pattern emerges. Each phase didn’t end because operators got lazy or because demand evaporated. Each phase ended because an upstream player finally invested the engineering to close a specific information or enforcement gap. The relay business has always been an arbitrage on temporary blindness — on the lag between “this is technically possible” and “the provider has built the system to detect it.” The margins are rent extracted from that lag.

That framing is what makes the next taxonomy useful. Strip away the marketing and the “what’s your model coverage” pages and the loyalty bonuses, and what’s left is a clean five-way split. I’ll label them A through E because that’s how the research community labels them. A is legitimate. E is criminal. The middle three are where the money has been — and each one’s lifespan is set not by its profitability but by how long the corresponding detection gap stays open.

Click between the models below to see the economics, the typical lifespan, and what closes each one.

Five business models for AI API relays, plotted by legal risk and gross margin A scatter chart with five bubbles labeled A through E. A (legitimate markup) sits in the low-risk, low-margin corner. B (subscription arbitrage), C (model substitution), and D (data extraction) cluster in the high-margin, mid-to-high-risk zone. E (poisoning and phishing) sits in the highest-risk position. Bubble size represents typical operator lifespan.
A. Legit markup
B. Subscription arb
C. Fake models
D. Data extract
E. Poison/phish
bubble size = typical lifespan

A · Legitimate markup resale. Hold a real account, mark up 5–15%, pass through. OpenRouter actually marks up zero on inference and charges only a 5.5% credit-card fee on top-ups. The principled distinction between this and any of the gray-market models below is binary: A-mode operators hold paid-for upstream credentials, operate under an explicit reseller MSA or equivalent ToS clause, and pass through the actual model the user is paying for. They make money on the service, not on the deception. This is the only model that survives more than a few years. Representatives: OpenRouter, SiliconFlow, Qiniu Cloud, the major cloud-hyperscaler endpoints (AWS Bedrock, Azure OpenAI, Google Vertex).

B · Subscription arbitrage (Web2API). Lift the OAuth token from a Claude Max or ChatGPT Plus subscription, wrap it as an API, sell the resulting capacity. Pre-2026 margins were 300–500%. As of January 2026, the model is effectively dead — Anthropic’s OAuth attestation kills the technical play, and the February legal-page update kills any operator with a legal-exposure threshold above zero. PackyCode, the largest player, has pivoted to A-mode where it now has no differentiation.

C · Model substitution. Take payment for Claude Opus, route silently to GLM-4-9B, pocket the spread. This is what 45.83% of CISPA’s audited shadow APIs were doing. Margin: 60–95%. Legal exposure: civil fraud in most jurisdictions, potentially criminal under China’s anti-unfair-competition statute and US state UDAP laws. Lifespan: weeks to a few months before fingerprinting catches up.

D · Data extraction and distillation. Scrape user prompts; sell them to data brokers, or train an internal model on harvested conversations and rebrand it as “our proprietary model.” Anthropic’s February 2026 public accusation against DeepSeek, Moonshot, and MiniMax described approximately 24,000 fake accounts generating 16 million interactions used for systematic Claude distillation. The economic case is brutal: training a frontier model from scratch costs $100M+; distillation costs $100k–500k in API access plus another $10–100k in fine-tuning. A 1,600x cost advantage.

E · Poisoning and credential theft. Not really “arbitrage” — this is cybercrime wearing a router’s clothes. The UCSB team’s nine actively-injecting routers were not making token margin; they were either delivering malware payloads (replacing legitimate URLs in tool responses, swapping requests for typosquatted reqeusts) or passively scraping credentials from cleartext traffic. One AWS root credential sells for $500–5,000 on dark markets. One Ethereum private key is whatever the wallet is holding.

The grim arithmetic Of the five models, exactly one is sustainable. The other four either died in the last six months (B), are dying as fingerprinting matures (C), live under a regulatory sword (D), or were always crimes (E). The legitimate aggregators — OpenRouter, SiliconFlow, the cloud-hyperscaler offerings — are about to spend the next two years eating the entire market.

V  ·  THE FRAUD DETECTION CRISIS

Fingerprinting a model was a solved problem. Nobody had bothered to actually run it.

The methodology CISPA used, called LLMmap, sends a series of probe prompts designed to extract a model’s characteristic style: its tokenization habits, its knowledge cutoff, its refusal patterns, its syntactic preferences. The responses are embedded into vector space and compared to a reference distribution from the genuine model. If the cosine distance exceeds a calibrated threshold, you’ve got an imposter.

What’s new isn’t the technique. It’s the willingness to actually run it at scale on commercial APIs and publish the results. Until CISPA’s paper, the assumption in academic AI circles was that the major shadow API players were probably routing as advertised, give or take. Nobody had checked.

The MedQA result is the headline finding because it’s so specific. MedQA is a benchmark built from USMLE — US Medical Licensing Examination — questions. The official Gemini-2.5-flash, called via Google’s own API, scores 83.82% on it. A shadow API advertising “Gemini-2.5-flash” scored approximately 37%. That is, depending on the question, somewhere between random guessing and slightly-better-than-random. If you built a medical Q&A product on top of that endpoint, your product was wrong about two-thirds of the time, and you had no way of knowing why.

Quick exercise: which of these two responses came from the real GPT-5 Opus?

Both endpoints were asked the same question: “Without searching, what is your training data cutoff date, and what was the last major news event you’re aware of?”

Endpoint αresponse · 1.4s
My training data has a cutoff of early 2026. I’m aware of major events through January 2026, including the EU AI Act’s phased rollout of GPAI obligations and the September 2025 AWS re:Invent. I shouldn’t claim awareness of events past that cutoff.
Endpoint βresponse · 2.7s
As an AI language model, I have a knowledge cutoff. My information may not include the latest events. The most recent major event I know about is from sometime in 2023. Is there anything specific you’d like me to help you with today? 😊
Endpoint α is the real one. Three signals gave Endpoint β away: (1) First-token latency of 2.7s is too slow for GPT-5 Opus, which serves first tokens in under 1s under normal load — this latency profile is consistent with a 7B open-source model on shared GPU. (2) The 2023 knowledge cutoff matches GPT-3.5 / Llama-3-8B, not any 2026-era frontier model. (3) The hedging boilerplate (“As an AI language model… Is there anything specific… 😊”) is a tell — GPT-5 was trained with explicit penalties against this stylistic pattern. This is exactly the kind of substitution CISPA caught in 45.83% of audited endpoints.

The UCSB paper, Your Agent Is Mine, is the darker companion piece. Published April 9th. The team purchased 28 paid services from Chinese marketplaces — Taobao, Shopify storefronts — and collected 400 free ones from public communities. 428 total endpoints. Nine of them — one paid, eight free — were observed actively injecting malicious payloads into responses. Seventeen of them touched AWS canary credentials the researchers had placed in their honeypot environment. One of them silently extracted an Ethereum private key.

The attack surface is the part most users don’t think about. Models themselves are safety-trained: they refuse to generate malware, they refuse to write phishing kits. But the JSON envelope around model responses — the tool-call arguments, the structured outputs, the function-calling parameters — sits outside the model’s safety filters. A router can rewrite a benign URL in a tool call to an attacker-controlled URL after the model has produced it. A router can replace the package name requests with typosquatted reqeusts in a code-generation response. The model never sees the substitution; the user’s agent executes it anyway.

Of the 440 sessions UCSB logged through their honeypot, 401 were running in “YOLO mode” — meaning the AI agent on the user’s side was set to auto-execute tool calls without human approval. That is the configuration in which malicious tool responses do their actual damage: a malicious URL gets fetched, a typosquatted package gets installed, a credential gets exfiltrated, and the user never sees the response that would have let them refuse.

Then on March 24th, three weeks after the CISPA paper, the LiteLLM package on PyPI got compromised in a supply-chain attack. The mechanics are worth detailing because they’re a preview of every gateway-layer attack we’re going to see for the next few years.

Attackers had first compromised Trivy, a popular vulnerability scanner. LiteLLM’s CI/CD pipeline pulled Trivy unpinned from apt. The poisoned Trivy ran inside LiteLLM’s GitHub Actions runner and stole the PYPI_PUBLISH token from the environment. The attackers used that token to publish litellm 1.82.7 at 10:39 UTC. Thirteen minutes later, they published 1.82.8. Both contained a backdoor. The backdoor wasn’t even an import-time payload — it was a .pth file that auto-executes on any Python interpreter startup. No import litellm required. It scraped SSH keys, AWS IAM credentials, GCP service accounts, Azure environment variables, Kubernetes secrets, and attempted lateral movement across clusters. PyPI isolated the packages a few hours later. Estimated reach: over 1,000 enterprise environments. The Vect ransomware group has already begun naming victims.

Before we move on, the uncomfortable question worth asking is the one the academic community has so far avoided in public: why did peer review let any of this through? 116 papers at top-tier venues built results on top of API endpoints with no chain-of-custody. None of those papers were asked, at submission time, to provide a fingerprint hash of the model they actually called. None were asked to disclose whether they used an official provider URL or a re-priced shadow endpoint. Reviewers accepted the implicit claim “I called GPT-5” as sufficient evidence that GPT-5 produced the number on the leaderboard. The CISPA paper’s real contribution isn’t the 45.83% figure. It’s the demonstration that ML research now needs a model-provenance standard the same way clinical research needs a chain-of-custody for biological samples.

By April, the operational assumption in every serious AI infra team I spoke to had inverted: the question was no longer “why should I trust my router” but “what evidence have you given me that I should.”


VI  ·  THE ECONOMICS OF A 5-PERSON CARPOOL

The mechanics that drove the 2025 boom were almost embarrassingly simple.

A Claude Max subscription costs $200 a month. Five strangers in a Telegram group pay roughly $55 each. The operator collects $275 in revenue against a $200 cost, and pockets the $75 spread as gross margin per account per month — about 30%. The platform handles authentication, rate limits, and silent rotation when accounts die. Easy Claude Code, PinCC, and a dozen smaller operators made this their entire business model through 2025.

With a hundred accounts in steady operation, the arithmetic says a single operator clears roughly $7,500 a month, just from one tier of one product, before counting upsells, referral commissions, or premium VIP pricing. With a thousand accounts, you’re a millionaire by Q3. The math, on paper, is intoxicating.

It did not actually work that way. The hidden cost was account churn. Anthropic’s anti-abuse systems noticed the multi-IP, multi-fingerprint usage patterns within weeks. Accounts got terminated. Operators bought new ones for $7–20 each from gray-market wholesalers, who source identity documents from biometric KYC scams in Southeast Asia and Sub-Saharan Africa, mark them up roughly 10x, and resell. Customers complained when their service vanished. Refund volume rose. The real net margin, by most operators’ own accounting, was closer to 15–20%, with a typical account lifespan of one to three months.

The January 2026 OAuth crackdown collapsed even that. PackyCode lost 60–80% of its monthly business in two weeks. The wholesalers who had been supplying fresh Max accounts saw their inventory get burned at unprecedented rates. The grift transitioned into its final phase, the universal one for every market that has stopped paying: operators started selling online courses to the next wave of newcomers about how to build your own relay station. By April, hopeful entrants were paying $30–$400 for tutorials about a business that no longer worked.

The barrier to entry is what made the involution unavoidable. A relay station requires about $140 of total upfront capital: a domain, a VPS, a Cloudflare account, and an open-source deployment of new-api or one-api (which is free). Free payment rails — WeChat Pay personal QR codes, USDT — eliminate even the merchant-account friction. Tens of thousands of these things exist for the same reason that none of them can defend a margin.

There’s a tell that’s worth knowing if you spend any time in the gray-market end of this market. Operators routinely advertise pricing with the slogan “1 yuan = 1 dollar of API credit.” In the early days of GPT-3.5, when token costs were close to zero and some resellers were genuinely just liquidating cheap stolen keys, this was occasionally a literal truth. In the Claude Opus and GPT-5 era, it is mechanically impossible. The slogan persists because it works as a hook for users who don’t bother to do the conversion. The operator’s real game is hidden in a redefined “internal point” system where the “dollar” on your dashboard converts back to actual cost at three to five times what you thought. The gray-market business has always depended on user inattention; that’s its structural fragility.

This is the part where a real diagnostic tool helps. Punch in any price-per-million-tokens you’ve been quoted and see what the math says it has to be hiding.

Pricing Diagnostic

If a provider is quoting you Claude Opus 4.7 at ¥1 (about $0.14) per million tokens, that price is not coming from the official Anthropic API. Find out where it’s coming from. (Pricing is denominated in Chinese yuan because that’s the currency the gray market actually quotes in — rough USD conversion is shown below each result.)

Enter a price above and click Diagnose to see what category it falls into. We benchmark against the actual upstream cost for each model.

Reference prices from official Anthropic, OpenAI, and Google API documentation; legitimate aggregator pricing from OpenRouter; gray-market pricing from Chinese-language community price-monitoring threads (May 2026).


VII  ·  ANTHROPIC STRIKES BACK

The strategic evolution that ended the gold rush.

The thing that fascinated me most as I worked through the timeline is the strategic evolution of how Anthropic chose to fight back. Their first instinct, like everyone’s, was technical: detect non-human OAuth usage patterns, terminate violating accounts. That worked for about six months before operators iterated past it.

The January 2026 move was sharper. Instead of trying to detect bad accounts after the fact, Anthropic added server-side cryptographic attestation to every OAuth token. When the official Claude CLI authenticates, it now includes a signed proof that the client is the genuine CLI binary running on a genuine user device. Third-party harnesses can’t forge that proof without breaking the signing chain.

Then, in March, the Bun runtime leaked the attestation logic. For about ten days, sophisticated operators figured out how to mint synthetic attestations from extracted client secrets. Anthropic’s response was the most interesting move of the cycle. They didn’t patch the attestation. They added a separate billing line. Any traffic carrying a third-party harness fingerprint now generates an itemized charge that gets sent to the relay operator’s Anthropic invoice, regardless of which subscriber’s token authenticated it. The technical kill became an economic kill. Operators who had been making margin on free attestation traffic suddenly had to pay for it themselves.

The April OpenClaw incident — Anthropic cutting off 135,000 instances of the OpenClaw subscription-sharing tool and temporarily suspending its founder Peter Steinberger from his own personal Anthropic account, then partially reinstating access weeks later with a separately metered surcharge — was the final signal. The legal page was updated. The lawyers were now drafting policy in real time.

The deeper principle is worth stating directly, because it’s the most replicable lesson in this whole story: technical kills get iterated past. Economic kills don’t. A signature check can be reverse-engineered. An attestation algorithm can be leaked, as the Bun incident proved. But a billing line item that charges the operator for every byte of traffic they used to make margin on cannot be patched around. It can only be absorbed or avoided. By making third-party harness traffic expensive instead of forbidden, Anthropic converted enforcement from a technology race into an arithmetic problem, and arithmetic problems don’t have exploits.

The strategic logic of why Anthropic cared this much is similarly clean. The subscription-arbitrage market was suppressing prices everywhere, not just in China. Once a competitive subscription-derived pricing tier exists at $40/seat globally, Anthropic’s enterprise sales team has to compete against it on every deal. Killing the arbitrage market restores pricing power on the enterprise tier. The China collateral damage — the “low hundreds of millions” in lost revenue from the September 2025 entity-controlled ban — is treated as the cost of the cleanup. Anthropic’s leadership has clearly decided that protecting the enterprise pricing model is worth losing a market it was never going to be able to safely operate in anyway.


VIII  ·  WHAT THIS MEANS FOR EVERYONE ELSE

The market isn’t dying. It’s being restructured around a single defensible asset: provable legitimacy.

The picture that emerges from the research isn’t “the AI API black market is dying.” It’s “the only thing left to compete on is the ability to prove you’re not lying.” Two distinct things are happening at once, and they’re both consequences of the same shift.

The legitimate aggregators are winning more than they expected to. OpenRouter’s 5x ARR growth in two quarters wasn’t because they suddenly got better; it was because the broader audience finally needed somebody trustworthy to buy from. The CISPA paper sold OpenRouter’s product better than any of OpenRouter’s own marketing did. SiliconFlow’s enterprise pipeline reportedly tripled in Q2. The legitimate-tier business has gone from “polite alternative” to “obvious default” in about ninety days, and the proximate cause is academic research, not marketing budget.

Meanwhile, the gray-market operators are being forced into a binary choice: pivot to A-mode (where they have no differentiation and razor-thin margins) or stop pretending and become criminals (where gross margins rise but so does the prison risk). The middle ground — the comfortable gray zone where you charged a premium for a vaguely-defined “Claude access” without specifying which keys, which pool, which jurisdiction — is gone. There is no longer a profitable niche for the ambiguously legitimate.

The structural shift this exposes is what makes the moment interesting. For two and a half years, the AI relay market competed on price, model coverage, and uptime. Starting roughly now, it competes on verifiable legitimacy: the ability to prove cryptographically, behaviorally, or contractually that you are running the model you claim, processing data the way you say, and operating under a license you can actually produce. That asset is hard to fake, slow to build, and — precisely because of those two properties — it’s the only defensible moat left in the category. The implications fall out differently for three audiences.

For developers

BYOK — Bring Your Own Key — is the new default. Whatever convenience you used to get from a shared-pool aggregator is now outweighed by the non-trivial probability that you’re calling a fake model, leaking your prompts into someone’s training pipeline, or executing a malicious tool response in your agent. Hold your own keys, route through a thin proxy you control, and run fingerprint checks on the providers you actually use. If you’re a Chinese developer routing around the geographic block, prefer the compliant aggregators — SiliconFlow, Qiniu Cloud, the Alibaba and Baidu cloud offerings — over the gray-market Telegram-bazaar tier, even at a higher per-token cost. The premium you pay is insurance against an entire class of failure modes you previously had no way to detect.

For researchers

This is an integrity emergency. If 45.83% of shadow APIs serve fake models and 116 of your colleagues’ published papers ran on those endpoints, the field needs a verification standard now, not after the next conference cycle. Benchmark leaderboards need a “model verified” column. Peer review needs to require, alongside the dataset and the code, the API endpoint and a fingerprint hash. The CISPA team has open-sourced the LLMmap toolkit; use it. The longer the community treats “I called this API” as sufficient evidence of “this model produced this number,” the more retractions are queued up behind us.

For founders and builders

The window for compliant aggregation has rarely been wider, but it will not stay open indefinitely. The customers who used to buy on price are now buying on trust, and trust compounds: the first three or four operators who establish public fingerprint audits, zero-log defaults, real SLAs, BYOK as a first-class mode, and an audit trail an enterprise procurement team can read without flinching will lock in the institutional buyers. The grifters will undercut you on price for another twelve months. Then they will die. The customers worth keeping are the ones who care about the things the grifters can’t fake — and those customers are buying decisions for the next five years right now.


IX  ·  HOW TO AUDIT YOUR PROVIDER

Eight checks. The honest providers should make most of them unnecessary.

If you take only one thing from this article, take this list. The eight items below are the minimum due diligence you should run on any AI API provider you’re depending on. The shorter list, though, is the one a genuinely transparent provider would publish themselves — without being asked. If you have to extract these from a vendor with effort, that’s information too. Your selections persist locally as you click; come back later to see where you left off.

0 of 8 completed