INTRODUCTION
Data Is The New Moat
There is a growing consensus that data is not the new oil or the new gold; it’s something better. Data is the new moat.
We’re in the middle of a remarkable land-grab in software. LLMs are changing the way software is built, opening up vast new markets previously untouched by tech (especially in services), and making many incumbents look decidedly long in the tooth. This is a generational opportunity for companies to capture market share, and many startups are doing precisely that.
Startups riding the AI wave are reporting growth rates that have never been seen before. Bolt grew to $20M ARR in 2 months; Cursor went from $1M to $100M ARR in 21 months; OpenAI has billions in revenue (and remember, GPT-3 was released less than 5 years ago). Stories of hyper-growth abound.
But so do cautionary tales. Competition is fierce: if you can build a crazy powerful app in a weekend, so can others. Any capabilities you create might be eaten by the next generation of foundation models. Churn is high, sometimes driven by disillusionment when your product doesn’t live up to the hype, sometimes driven by excitement when a competitor releases something that blows your solution away.
Massive opportunities; killer environment. What’s a company to do? Moats.
In warfare, moats are what prevent a castle from being stormed. In business, they’re what prevent a company from being overrun by competition, engulfment, or slow decay. Moats help you acquire and retain customers; they help you outperform and undercut rivals; they help you buy low and sell high, move fast and play bigger; they help you win, and keep winning.
Some moats are known and loved: network effects, user lock-in and switching costs, brand and positioning, process power, unique IP, economies of scale. Others are obscure or questionable.2
But with the advent of AI, a new kind of moat has come to prominence: the data moat. And this is not a coincidence.
AI companies have a special resonance with data moats, because data and AI are two sides of the same coin. LLMs require vast amounts of data, for training, fine-tuning, learning, reasoning. And LLMs unlock the value of data like almost no technology before. It’s a match made in business-model heaven.3
All the old moats are still valid — brand, for example, or network effects. You can (and should) build them. But they’re orthogonal to and independent of AI, in a way that data moats are not. Data moats reinforce AI advantages, and AI advantages reinforce data moats.
What Even Is A Data Moat?
Everybody’s talking about how to build a data moat. Moats are in the air.4
Unfortunately, much of this conversation is ill-posed. There is an incredible amount of incomplete, inconsistent, or simply outdated thinking about data moats and how they work.
Classic mistakes in this vein include thinking data is a moat when it isn’t; relying too much on weak data moats; confusing other moats (like scale) for data moats; misunderstanding which attributes of data contribute to its “moatiness”; failing to distinguish between software moats and data moats; and not realizing when a data moat has lost its effectiveness. The Underpants Gnomes remain undefeated.5
This essay aims to bring some structure, rigour, and current best practices to the discussion. I’ll define a few different categories of data moats, explain when and how (and if!) they work, and explore some tactics to maximize their potency.
There will be case studies! Rampant speculation! Counter-intuitive conclusions! Clever turns of phrase which I hope go viral! And snarky asides!6 Read on.
Control And Loops
Let’s begin with some taxonomy.
I posit that there are two and exactly two categories of data moat. I call them data control and data loops; every type of data advantage can be put into one or both of these categories.
Data Control. If you have sole control of a critical asset, you have a moat. In the world of data, this control comes in various forms: uniqueness, aggregation, movement, usage, records, action, catalysis and more. We’ll learn about data control in Part One of this essay.
Data Loops. Many well-known business moats rely on positive feedback loops7 that accelerate some core business dynamic — for example, marketplace gravity, user network effects, protocol adoption. A number of data moats exhibit the same pattern. We’ll learn about data loops in Part Two of this essay.
PART ONE: DATA CONTROL
Controlling data really means controlling the value of data; moats arise when nobody else can access this value.
There are a few ways to do this: you can control the production and ownership of (unique) data; you can control the movement of data, either internally or externally; or you can control the usage of data, through technological or other means. Create value; transport value; extract value: each of these approaches can lead to a data moat.
A pre-condition here is that the data being controlled has to be meaningful. Otherwise, “control” is pointless: you can’t make money from tolls on a road going nowhere.
Unique And Proprietary Data
Okay, so you produce, capture or own unique data. Maybe the data’s about product usage, or customer behaviour, or industry dynamics, or something else. Do you have a moat? Probably not.
Unique data is neither necessary nor sufficient to establish a data moat. It’s not necessary because, as we shall see, there are other (often better) methods: controlling movement or usage of data, and building data loops. And it’s not sufficient, thanks to the “meaningfulness” criterion mentioned above. What does this criterion entail?
The data must offer substantial value to either you or your customers. A small delta in value means you can be overtaken by someone who outperforms you on other fronts, even without this data; a large delta means you cannot.
The data must be genuinely rivalrous. Your using it should prevent others from using it, or at least, from getting the same value from it.
The data must have no functional substitutes. Competitors should not be able to achieve similar outcomes no matter what data they use, similar or not.
Most datasets don’t satisfy even one of these conditions, let alone all three. But if all three conditions are met — if you have unique, high-value, irreplaceable data that only you can use — then you may have a moat.8
Historically, there have been a few ways to get such data:
As a by-product of your core business. (This is sometimes called exhaust data, by analogy with exhaust from a combustion engine.) A good example here is stock market data, captured by NYSE and NASDAQ as a by-product of their core exchange business. But this is not a data moat. A larger core business might result in more (or better) exhaust data, but the converse is not true: NYSE’s data sales don’t give their exchange business any “extra” defensibility.9
Via process power. Many well-established data businesses follow this template. Think of Factset’s financial statement data, or Moody’s ratings data, or Nielsen’s media consumption data: they all depend on decades of expertise working with specific datasets and internalizing all their nuances. This is a moat; whether you want to call it a data moat or a process moat is a matter of semantics.
Through brute force investment of time and resources. Some examples of this pattern include search engines crawling the web, logistics and delivery firms mapping roads, and robotaxis recording driver-environment interactions. In each case, companies acquire data that becomes the foundation of their technology and hence their business model, that others cannot easily replicate. These are genuine data moats. But …
Brute Force Is Dead ...
The business theory behind brute force is “my capex is your barrier to entry”. Firms spend time and resources acquiring data before their rivals do, and use that data to gain market dominance.
Unfortunately, this works less well these days:
LLMs make data acquisition easier. Not just a little easier, orders of magnitude easier. You don’t need 100s of curators working 1000s of hours; just tell an AI agent to go get the data for you. Companies that spent years building complex human-mediated data pipelines must now contend with upstarts who can replicate 99% of their work for 1% of the cost. Synthetic data is another end-run around brute-force methods.10
Capital for data acquisition is cheap. Funding markets have fully internalized “the bitter lesson” and “the unreasonable effectiveness of data”. As a result, the capital to finance brute force data acquisition is cheaper than ever. Brute force is ultimately a bet on market timing and the cost of capital; change those, and the tactic vaporizes.11
Knowledge diffuses, capabilities increase. The planet-scale infrastructure that Google built in the 2000s, that allowed them to crawl and index the entire internet fast, was a major moat for them (though PageRank gets all the credit). Today, there are dozens of firms who can do the same, and much cheaper. Knowledge diffuses, tools improve, hyperscalers service-ify anything computable, and Moore’s Law marches inexorably onwards. Yesterday’s edge is today’s commodity.
Using an expansive definition of “data” puts this effect in stark relief. Studio Ghibli spent decades painstakingly perfecting a gorgeous visual style that no other animation studio could replicate: the very essence of brute force to generate unique content.
And then last month, ChatGPT blew the doors open for anybody to create their own Ghibli-fied art.
What this implies for creativity, intellectual property, democratization and artist economics is beyond the scope of this essay, but the underlying theme — that LLMs make mere “effort” less of a moat — is clear.
... Long Live Brute Force!
So are moats from brute force data acquisition a thing of the past? Not quite. There are still a few cases where brute force has its place:
In industries where that last 1% of accuracy or coverage or quality makes a meaningful difference — for example, finance. (And maybe art! TBD.)
Upstream of LLMs — for example, labelling data for LLMs to train on (Scale, Mercor, and friends), synthetic data pipelines, proprietary evals and so on.
In domains where LLMs are currently disadvantaged — for example, ‘real world’ data acquisition — audio, video, physics, bio. (This won’t last, btw, which is also why there’s a land-grab happening here.)
The other interesting play here is to recognize that sure, brute force isn’t a long-term moat, but if you have a funding advantage, you can use it to establish first-mover advantage, and then find your moat elsewhere — perhaps via workflow lock-in, or non-data network effects, or platform status, or brand. We’ll come back to this point.
Fragmented Data
A second powerful way to control data is to be the clearinghouse: the central repository that unifies fragmented data assets or data value.
This is a well-known pattern for data businesses; indeed, it’s probably the default mode for them. Think of Bloomberg, LexisNexis and CoStar: clearinghouses for financial, legal and real-estate data, respectively.
The very best clearinghouses add so much value to their fragmented source data — aggregating, harmonizing, licensing, transforming — that they essentially create new proprietary data value of their own. In other words, datasets that are low-value and commoditized in isolation become high-value and unique in aggregate.
Clearinghouse / unification is a genuine data moat, combining elements of uniqueness, process power and brute force. And it’s not just for specialist data businesses! An interesting recent phenomenon is the emergence of a similar pattern for non-data businesses.
The idea is to aggregate / unify fragmented data, but then monetize it through a software offering. Here are some examples:
Rippling, Gusto and Remote.com aggregate regulatory data on payroll, contractors, benefits, tax etc. from dozens of countries. This allows them to offer “unified global payroll” as a service.
Stripe and Adyen do the same thing for “global payment processing”, again incorporating local rules and regulations on tax, authentication, KYC and AML, data privacy, reporting, dispute resolution etc., not to mention local customer habits and preferences.
Aggregators like Kayak and Skyscanner, and OTAs like Expedia and Booking unify flight information, prices, seats and other travel data from multiple airlines, from GDS providers, and from each other, but they don’t sell the data. Instead, OTAs sell tickets, while aggregators charge referral fees.
Zapier and Plaid aggregate information on “interfaces”, for SaaS tools and bank accounts respectively, which they can then offer unified access to. Plaid wraps multiple bank websites into a single API; Zapier wraps multiple SaaS APIs into a single no-code tool.
Numeral and Quandri, both early stage startups, focus on state-by-state sales tax and personal insurance lines, respectively. These are highly fragmented regimes with lots of interaction effects, making data unification super valuable.
In each case, the data collected is both essential to the business offering, and difficult for others to collect: a data moat.
There’s a subtlety here. Firms like Airbnb, Shopify and Uber also collect data from around the world, but their customers don’t really care: any individual transaction is localized, and doesn’t benefit from aggregation. For clearinghouse effects to kick in, there needs to be a one-to-many relationship between the customer use case and the fragmented data.
Information Produces Action
Zooming out, what are we unifying here? Data, yes, but which data, and why?
Essentially, this is data that is a substrate for action. Rules and regulations have to be complied with, and they’re mostly deterministic: they lead directly to action. Flight info is both necessary and sufficient for the action of booking tickets; sales intel likewise for conducting outreach. Information produces action.12
Data unification is genuinely hard, which makes it a good moat. And data unification with an action layer on top often requires domain specialization, which makes it even moatier.
Sales enrichment shows how the frontier changes with time. The progression from D&B to Clearbit to ZoomInfo to Apollo to Clay is a story of value being captured from unifying fragmented data, making it API-accessible, exploiting network effects, adding workflows, and layering in AI actions, respectively.
Controlling Data Movement
This is a blurry category. Partly because of boundaries: where does data movement end, and data usage begin? And partly because of overlaps: companies with this type of control often achieve it via a combination of tactics, both data and non-data. Let’s look at some examples!
Visa has perhaps the single most famous network effect in all of business13. An intriguing (and perhaps provocative) claim is that their network effect is largely based on the control of data movement.
Consider: when I buy something with my credit card, money doesn’t actually flow. Instead, data flows: customer name and verification, transaction details, credit limits, outstanding balances, merchant and bank IDs, payment schedules. The money flows much later, and not 1-for-1 either.
The Visa network orchestrates all of this — across consumer, merchant, payment gateway, payment processor, acquiring bank, issuing bank, and more. The nodes and edges are the network effect; the knowledge (and control) of these nodes and edges is the data moat.
Any middleman business is vulnerable to disintermediation. In Visa’s case, merchants could communicate directly with banks to inquire about customer credit scores. But they don’t! Visa controls that interaction. Visa controls almost all data movement across the network. Abstracting away the (immense) complexity of the network is precisely what brings participants on board and keeps them from defecting; in a very real way, data control underlies the network effect.
Amadeus and Sabre are like specialized Visas for the travel industry, controlling the flow of data (inventory, bookings, identities) across airlines, hotels, car rentals, travel agents and aggregators, and of course travellers. Change Healthcare is a Visa for the healthcare industry, controlling the flow of data (and payments) between patients, healthcare providers, insurance companies, and government payer programs.
Businesses built on controlling external data movement are lucrative, but rare. More common are businesses that manage the flow of data internally for their clients. Do they have moats?
Usually, no. Managing the movement of data is not the same as controlling it. Moving data around is what 99% of software tools do, and most such tools self-evidently do not have moats.
The exception is data flow in highly regulated industries. In healthcare, for example, patient data is highly sensitive and you can’t just access it or move it around willy-nilly. So you have firms like Epic, that specialize in managing internal data access (this is also a system-of-record effect; see below), and firms like Datavant, that specialize in transporting data between organizations while remaining secure and privacy-compliant (via a data-standards effect; see below again). It’s not so easy to rip these out: there’s limited upside and large downside, so most clients stick instead of twisting.14
Data Usage
The final type of data control is the most powerful: when you control the usage of data. This is a large sub-category, encompassing systems of record and action, catalyst data, and exogenous data moats.
System Of Record
System of Record (“SoR”) is one of the oldest, best-known and most effective data moats.
In any large organization, information is scattered. It’s scattered across excel files and databases, emails and slack channels, PDFs and decks, user manuals and policy handbooks, contracts and filings.
There is substantial defensibility in owning the platform that collates all this scattered information. Such a platform is called a “system of record”, and the goal is for it to become the “single source of truth” for the organization. Data is piped into the SoR; queries are addressed to the SoR; answers come from the SoR. If it’s in the SoR, you can assume it’s true; if it isn’t, you cannot.15
The canonical example of a SoR is Salesforce, which collates everything businesses need to know about their customer and sales pipelines: dates, contact information, interaction history, pipeline stage, expected value, progress from opportunity to close, marketing campaigns, customer service and case management, and more (much much more). Entire sales orgs run on Salesforce; they cannot run without it.
Salesforce has analogies in other domains. Indeed, this is typical for SoRs. It’s impractical and inefficient to unify all an organization’s data in a single SoR; instead, there are separate SoRs for each function. Salesforce is aimed at the sales function; Oracle does financial management, Workday does human resources, Quickbooks does accounting, Ariba does supply chain and so on. Each of these SoRs acts as the single source of truth for their specific function.161718
It’s instructive to look at the founding years of some of these SoR companies:
SAP: 1972
Oracle: 1977
Epic: 1979
Quickbooks: 1983
Ariba: 1996
Zoho: 1996
Salesforce: 1999
That’s pretty staggering. These are incredibly long-lived businesses — especially given the pace of change in software; they must have incredible moats.
Why Are SoRs So Sticky?
Salesforce is not, shall we say, a widely loved product. Nobody is passionate about their Salesforce instance. But they can’t live without it; Salesforce is as sticky as hell.
Why so? Because Salesforce controls data usage. If you need “accurate” data about your prospects and customers, you have to get it from Salesforce19. Without Salesforce, you can’t do anything: you can’t email prospects, update their status, understand their needs, close a contract, support them post-close, model your pipeline, run a campaign, anything. Salesforce has a monopoly on your internal sales data.
What’s more, this privileged position means that most Salesforce instances have years of “workflow barnacles” adhering to them. These are both procedural — rules that sales and marketing and customer success staff have to follow — and technological: Salesforce has a whole app store’s worth of third-party tools to read, write, modify, visualize, present, and analyze its data.
Salesforce isn’t great, but the thought of how much work it would take to rip out an instance, export all its data, load it elsewhere, replicate all that app functionality, transition all the users, and get back to full productive flow, intimidates almost all switchers. And this goes for most at-scale SoRs, which is why they’re so moated.20
Sticky No More?
And then came LLMs. It turns out that exporting data from a SoR is just the kind of tedious task that AI agents excel at.
One of my favourite go-to-market motions from recent years is where would-be SoR disruptors offer to do all the migration work themselves: we’ll export the data, put it into the new system, add in new app hooks, you name it. The key insight here is that this is not a “risk”, it’s just a “cost”, and vendors are happy to pay that cost in return for years of LTV.
Ironically, it’s probably better for this GTM if the cost doesn’t go down too much. The difficulty of migrating SoRs leads to low churn rates and high LTVs, which justifies the investment in doing the migration work as a vendor — but if migration costs drop too far, others can do the same to you; churn will tick back up, meaning LTVs drop, and soon you’re in a race to the bottom. Data viscosity is your friend, until it isn’t.
Systems Of Action
Systems of Record have a monopoly on internal data. This control makes them both valuable and defensible. But what if they could do more?
After all, the value of data lies, solely and entirely, in the value of what can be done with it. You don’t want your SoR to be the place where data goes to die21; you want to act on the data.
This thought leads to the next, perhaps even more powerful type of data moat: the System of Action (“SoA”).
Systems of Action don’t just store data passively; they enable actions on top of it. The key, and what sets SoAs apart from SoRs, is how specific these actions are: how well the action layer coheres with the data layer and the user function.
An example will make this clear. Consider “version control” — the process of maintaining a software code base. This is a classic System of Record: you want a centralized repository of all your source code with no ambiguity about what’s in production, what’s under development, and what’s deprecated. The earliest version control systems did this, and not much more; they were essentially just file management systems.
Over time, it became clear that version control required specialized hooks corresponding to the needs of code management: “locking” (so two programmers don’t work on the same file at the same time), “deltas” (store only the differences between successive versions, for efficiency), “repos” (manage groups of files instead of piecemeal), and “checkin/checkout” (another way to solve the conflicts problem). These were still SoR-ish; still fundamentally based on managing records.
But as codebases (and programming teams) expanded, even more functions became necessary. Modern version control systems offer atomic commits, complex branching logic, pull requests, “blame” and other collaboration features, hosting, CI/CD integrations, social profiles, metadata and asset management, co-pilots, and much more. Calling GitHub or GitLab a mere SoR vastly undersells how tightly embedded — and finely tuned — these systems are in every aspect of a programmer’s productive life. Systems of Action.
Agentic Systems
The obvious next step is for the system to take action itself. Wait ... whose music is that I hear?
Enter LLMs.
Systems-of-record store data; systems-of-action empower humans to act on data; systems-of-agents act on data themselves.
Let’s continue with the example above. What’s the primary action that humans take, on top of a software SoR / code repository? Why, programming, of course!
It turns out that LLMs are very very good at programming.
Possibly the hottest area within the supernova that is AI is LLM code agents. Multiple firms are staking out this space, each with different lines of attack: “virtual junior devs” like Devin, in-IDE tools like Replit’s Ghostwriter, workflow solutions like Cursor, Bolt and Lovable, and of course SoR-level agents like GitHub Copilot.
Which approach will win? That’s the ten-billion-dollar question22. I like this summary of the disruptor thesis:
The counter-thesis is that SoR owners will block these apps, build them on their own, and fast-follow any user-facing improvements; and their head-starts in data and distribution will suffice to win the market.23
Exogenous Control
The next set of data moats is a catch-all category that I call exogenous control. In this pattern, you control the usage of data, not through any attribute of the data itself, nor even of the software that manages the data, but instead through external carrots or sticks. Here are some examples:
IP Rights: it doesn’t matter if the data is unique, or fragmented, in a SoR, protected by process power, or none of the above: if you have exclusive IP rights to its usage, you control it. Consider the S&P 500 Index: the underlying data is public, and the index itself is trivially easy to replicate, yet S&P Global makes ~$1B a year from licensing it — to asset managers (for benchmarking and ETFs), exchanges (for index-linked derivatives) and banks (for structured products).24
Contractual Monopolies: where you corner a primary data source via a favourable contract. Impossible in an efficient market, but data markets are not efficient; datasets are often (massively) mispriced. Unfortunately, this moat is temporary: if the data turns out to hold value, the contract will almost certainly get renegotiated on renewal. The best strategy here is to use your contractual moat to buy time to establish other sources of defensibility; IQVIA did precisely this for pharmacy data, while Neustar failed to do this for telecom data.
Regulatory and Compliance Moats: If the government mandates that people use a particular dataset, that’s a moat for the company owning, controlling, or implementing that data. A good case study here is the ENERGY STAR program: companies like ICFI, Leidos, DNV and Guidehouse make massive annual revenue providing these certifications to both government and private sector customers. Similar patterns exist for other programs like CAFE, air and water quality, FDA labeling ...
You’ll notice that exogenous control is often linked to government action. And government action is strongly inertial: hard to get started, but even harder to stop once started. State-sponsored data moats! 25
Catalyst Data
A specific flavour of unique data that’s worth calling out is catalyst data: data whose value comes from enabling or activating the usage of other data. This category is interesting because it’s a type of “indirect control” — you don’t need to control the enabled data directly, it may not be unique or proprietary to you, but you do control the ability to extract value from it — which means you can capture disproportionate economics.
Here are a few examples; in each case, “activates” is a synonym for “materially increases the value of”.
Google: user intent data activates search results data
Amazon: purchase history data activates product listings data
Acxiom: customer profile data activates basic marketing lists
Any social media company: viewer history activates new content
CUSIP, DUNS, LiveRamp, Datavant: unique identifiers activate siloed intel
FICO, Nielsen, ratings agencies, IQVIA: consensus benchmarks activate unanchored performance data
In each example, the second dataset has some baseline value in itself, but is made very much more valuable by the addition of the first. Indeed, you could argue that the above companies grew to dominate their respective industries precisely because they were the first to figure out how to unlock the value of the “enabled” datasets.26
An interesting aspect of catalyst data is that, empirically, it seems to result in winner-take-all or at least winner-take-most markets. This is partly survivorship bias: after all, you never hear of the catalyst datasets that don’t result in huge outcomes.
But it also reflects two patterns. First, catalyst data, when it works, tends to work really well — it adds substantial value to (untapped but often very lucrative) data assets. Second, catalyst data often works in sync with various data loops: industry standards, consensus benchmarks, user network effects and so on. We’ll explore this in more detail later in this essay.
Data Control, Summarized
Here’s a handy chart summarizing what we’ve learned so far:
PART TWO: DATA LOOPS
The second major category of data moat is the data loop: a positive feedback process that links data and business value in a virtuous cycle. Data improves the business, and the business improves the data (for some value of the word “improve”), and the flywheel spins sufficiently fast that no competitor can catch up.
This, for many, is the most familiar form of data moat27. It’s also the most misunderstood. Some data loops make strong and certain moats. Others are weak, have limited scale, or hidden vulnerabilities. Still others are effective, but they’re not data moats at all; they rely on scale or network effects, and you could take the data part out with no loss.
There are three main families of data loop: the quantity loop, the learning loop, and the usage/value loop. Let’s dig deeper into each of these.
Quantity Loops
The quantity loop is the simplest family of data loop: data attracts data. This can occur through quite a few different mechanisms:
User-Generated-Content (UGC) Loop
This is the loop that drives Facebook, Youtube, Instagram, TikTok, X, and even LinkedIn. All these platforms host user-generated content for free: photos, videos, posts, resumes. This content attracts other users, who post even more content. The more content, the more users; also, the more content, the better the recommendations, and hence again, the more users. The presence of all these users (and their attention!) attracts advertisers, who subsidize all of this.
Content, of course, is just another word for data. This is a perfect data quantity loop, and once it reaches maturity, both lucrative and very hard to displace.
Equally fascinating are the firms that fizzled after reaching non-trivial scale using this loop. Consider MySpace, Tumblr, Quora, Vine, Digg, and perhaps Stack Overflow28. What happened to their (supposed) data moats?
The reasons are varied — failure is overdetermined! — disastrous M&A (MySpace by News Corp, Tumblr by Yahoo), failure to monetize effectively (Quora’s paywall), self-inflicted wounds (Tumblr’s NSFW ban), product missteps and technical debt (most of them), and competing with the apex predator that is Facebook (all of them). 2930
But fundamentally, these are just triggers. The problem with the UGC data loop is that it can reverse just as fast as it builds. Everybody goes where the lights are brightest; and conversely, everybody flees the ghost town. Missteps, if not quickly reversed, become death sentences; momentum is a fickle friend. So this moat is deceptive: more vulnerable than it appears.
Search-Engine-Optimization (SEO) Loop
Once scaled, the classic UGC loop tends to lead to “walled gardens” of content that users never leave, nor want to. But there’s another, very similar loop where users are constantly re-acquired, and SEO is the engine of this re-acquisition.
This is the SEO data loop. Users create content, or the platform itself generates content programmatically; other users, looking for that specific content, find their way to the platform via Google or another search engine.31
The phrase “looking for that specific content” is important. Unlike the UGC loop, the SEO loop is task-oriented, not feed-oriented. The content has to be useful, and actually answer the user’s search query.
What are some examples of useful content? It’s a wide spectrum:
Reddit and Quora answer specific questions
Expedia, Booking, Kayak et al provide travel information and actions
Yelp and TripAdvisor provide service reviews
Zillow does house prices
Glassdoor and LinkedIn cover various aspects of professional life
These companies monetize in different ways. The broad horizontal platforms tend to monetize via ads, while the vertical ones mostly do affiliate or lead-gen. And then there are a few that monetize via subscriptions or services. Lead-gen, in particular, is sufficiently lucrative — think insurance, financial products, legal services, healthcare, education, travel, home maintenance — that there’s a whole cottage industry of service directories that exist solely to aggregate provider data, capture Google traffic, and take an introducer’s fee.
In each case, adding more data gives these sites more search equity, leading to more traffic, and hence (either directly or indirectly) more data. Flywheel unlocked, moat established.
How moaty is this moat? In the Google Era — roughly speaking, 2005 to 2020 — it was pretty damn moaty. Multiple billion-dollar businesses were built, and defended, using this loop.
But this era might be coming to an end, along with this moat. One reason is over-saturation: there’s so much AI slop out there that Google search is simply less useful32 33. Another is dis-intermediation: LLMs are already replacing search for purely informational queries, and agents may soon do the same for navigational and transactional queries. This would bypass the entire search-learn-select-purchase funnel on which the SEO loop is built.34 A final headwind for the SEO loop is the migration of content behind paywalls and logins. People are locking down their data assets; farewell, fully-open web.
Stack Overflow offers a cautionary tale. This graph of monthly questions posted on the popular programming site, created by Gergely Orosz based on data from Theodore Smith, says it all:
There’s an initial steep rise fuelled by search traffic. Then there’s an almost ten-year period of stability, despite minimal changes to the product. Surely competitors should have emerged during that window? The fact that they didn’t is testament to the power of this moat.
And indeed, the final decline isn’t due to direct competition; it’s because the world suddenly bypassed Google + Stack Overflow as the best way to answer coding questions. The dates line up almost perfectly: Stack Overflow’s traffic peaks in May 2020, and then in June, OpenAI releases GPT-3 ...
SaaS Data Gravity
Stars form by accretion. Amidst vast clouds of diffuse cosmic dust, local concentrations of matter attract more matter, in a gravitational spiral that ends with temperatures and pressures high enough to initiate nuclear fusion.
Empirically, software exhibits a similar pattern: diffusion is replaced by concentration. Here’s how it works.
A lot of software is initially sold via a “wedge” — a tightly-scoped, low-cost, low-risk insertion into the target customer’s stack. Step two is to “land-and-expand” into more users, applications, and revenue; often, this entails “going multi-product”.
But which software tools are able to expand, and which ones fizzle? The analogy with star formation provides the answer: the winners are those that already have local concentrations.
Sometimes, it’s a local concentration of “workflows” — customers like the productivity boost from combining multiple tools into one, but they hate switching their processes, so whichever platform owns the most frequent or most important workflows will tend to swallow the others.
And sometimes, it’s a local concentration of “data”. This is data gravity.
Tools that control the most important or valuable data tend to swallow tools whose data is peripheral, less productive, or more poorly integrated.
Toast is a great case study here. Toast started out as a point-of-sale (POS) system for restaurants. Owning front-of-house order data gave them pole position to expand into kitchen display tickets, online and mobile ordering, consumer-facing apps, delivery integrations, gift cards, payments — and then, eventually, into restaurant financing, payroll, HR and more. All those other functions had competitor apps with their own data, but Toast’s POS data was both the most important and the most central; its gravity enabled Toast to swallow the rest.35
Data gravity moats are similar to, but not quite the same as, system-of-record moats. Both moats benefit from workflow lock-in, data viscosity, and controlling usage. Where they differ is in their dynamics.
There’s no inherent requirement or expectation of growth in a SoR: they are perfectly sticky even when static. But growth is everything for a data gravity loop: growth in data coverage, in product use cases, in audience, in surface area.36
Data gravity in vertical SaaS is one of the best data moats out there, not least because it feeds into many other (non-data) moats: workflow, trust, revenue control and network effects.
Give-To-Get (G2G) Loop
This is a well-known pattern for data businesses. In a G2G loop, customers of the business receive data if (and sometimes only if) they also contribute data to the business. The more data the business has, the more attractive it is to customers — and hence the more likely they will sign up and contribute data! Positive feedback thus kicks in above a certain initial critical mass.
A pure give-to-get loop is when the data contributed is the same as the data received; and value is created by the act of aggregation. Waze is a great example: users contribute (individual) and receive (aggregated) traffic data. Geographic aggregation in fact seems to be a common theme for G2G models — think of OpenStreetMap, Weather Underground and GasBuddy.
Data consortia (aka data co-ops) are a special case of G2G where a group of competing businesses decide to pool their data resources to generate insight unavailable to them individually — and thus outperform rivals outside the group. The canonical example here is Verisk. Verisk’s ISO (Insurance Services Office) served as “neutral territory” for a consortium of 7 large insurers; they aggregated and standardized data from these 7 firms, leading to better underwriting for all of them.
Marketplaces with 2-way review systems (like Airbnb — hosts review guests, guests review hosts) are also give-to-get, but here the value isn’t in aggregation, it’s in eliminating selection bias in the reviews, and hence fostering trust.
The theme of eliminating bias is why benchmarking is a very common application area for give-to-get data models. Dun & Bradstreet does this for B2B credit; PayScale does it for salaries; Glassdoor does it for employer reputation; Cambridge Associates does it for private fund performance; and so on.
Of course, the very best examples of benchmarking using a give-to-get data model are the big consulting firms: McKinsey, BCG, Bain et al. The G2G part isn’t explicit, and these certainly aren’t software or data firms, but the loop (and the moat) is very real. (The data in question is sometimes called “best practices”.)
Give-to-get is pretty moaty once you reach critical mass. The hard part is getting there: it’s a classic “cold start” problem, with challenges and solutions that are very reminiscent of the 0-to-1 step for marketplace businesses.37
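The critical-mass dynamic above can be sketched as a toy simulation (the functional forms and parameters below are illustrative assumptions, not a model of any real G2G business): contributions scale super-linearly with the size of the data pool, churn is a constant fraction, and the two balance at an unstable threshold.

```python
def simulate_g2g(initial_data, steps=50, a=0.3, churn=0.10, k=100.0):
    """Toy give-to-get loop. Each period, new contributions scale
    super-linearly with the existing pool (small pools attract few
    contributors), while a fixed fraction of the pool churns away."""
    data = initial_data
    for _ in range(steps):
        contributions = a * data ** 2 / (k + data)  # attraction grows with pool size
        data = data + contributions - churn * data  # net growth or decay
    return data

# Critical mass here sits at churn * k / (a - churn) = 50 units of data:
# start below it and the pool withers; start above it and the loop compounds.
dead_start = simulate_g2g(initial_data=10)    # decays toward zero
live_start = simulate_g2g(initial_data=200)   # compounds away
```

The same arithmetic is the “cold start” problem in miniature: below the threshold, every period makes things worse, which is why G2G businesses often have to seed the initial pool themselves.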
Anonymization
Aggregation is not the only benefit of clearinghouse and give-to-get models. Anonymization also matters. This is for competitive reasons — there will always be specific business details that companies don’t want to share with competitors, but sometimes it’s hard to “give” data without also giving away those details. So you need a neutral party who can navigate those nuances. And it’s for regulatory reasons — privacy laws dictate what information companies can and cannot share, and adding an anonymity layer makes the process much easier, safer, and more compliant. This isn’t really a moat in itself, but it makes the actual moat (clearinghouse or quantity loop) more resilient.
Learning Loops
The learning loop is the next major family of data loop. Businesses use (“learn from”) data to run better; and running better helps them get more and better data. I wrote about this flywheel in Data in the Age of AI; here’s a stylized view:
This loop works, and at scale, it works extraordinarily well. But it is not a moat.
In fact, “data learning loops” are tied with “unique data” for first place in the category of “moats that aren’t really moats”. I already talked about unique data; but why is learning so un-moaty?38
It’s a question of limits. Business efficiency doesn’t scale indefinitely with data inputs; instead, the value of learning plateaus above a certain level. Meanwhile, costs show the opposite pattern: long-tail and edge-case effects kick in at scale, making “move-the-needle” data collection more expensive. Acquisition costs go up and marginal data value goes down; in other words, advantages diminish with scale instead of accelerating. This is no moat.
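A toy model makes the crossover concrete (all functional forms and numbers here are illustrative assumptions): value saturates, acquisition cost rises, and past some scale the marginal data point costs more than it is worth.

```python
import math

def marginal_value(n, v_max=100.0, k=1000.0):
    """Marginal value of the n-th data point, assuming a saturating
    value curve V(n) = v_max * (1 - exp(-n / k))."""
    return (v_max / k) * math.exp(-n / k)

def marginal_cost(n, c0=0.001, alpha=0.5):
    """Marginal acquisition cost, assumed to rise with scale as
    long-tail and edge-case data gets harder to collect."""
    return c0 * (n / 1000.0) ** alpha

# Early on, the next data point is worth far more than it costs;
# deep into the curve, the inequality flips and the loop stops paying.
worth_it_early = marginal_value(100) > marginal_cost(100)        # True
worth_it_late = marginal_value(10_000) > marginal_cost(10_000)   # False
```

The exact crossover point depends entirely on the assumed curves, but the shape of the argument does not: any saturating value function paired with any rising cost function produces one.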
There are two and a half exceptions.
The first is what I call the business model unlock: when your learning loop reaches a magic threshold that enables a business model that simply isn’t possible without that learning. Threshold effects are important here: you need a discontinuity in the customer value function for this moat to work.
Amazon Prime is a good example of the business model unlock. Once Amazon acquired enough data (“learned”) about customer behaviour, purchase patterns, order frequency, warehouse location, inventory management, delivery scheduling, route optimization and so on, it was able to offer “free” 2-day delivery. This unlocked multiple virtuous cycles (in both time — order frequency and packed routes — and space — warehouse and driver density), which demolished every other horizontal marketplace. The data didn’t merely drive an iterative, incremental, quantitative improvement in Amazon’s operations (the classic, un-moaty learning loop); it enabled a whole new service that competitors simply could not match. An incredible moat.
The second exception is data businesses. It’s not that (increasing) cost and (diminishing) value effects don’t apply to data businesses; it’s just that the data gives them access to a bunch of other levers (unique data products, easier GTM, price discrimination, ecosystem tactics) that counter these effects. I wrote about these levers in The Economics of Data Businesses.
The half-exception is AI. In AI, outcomes do appear to scale indefinitely with data inputs, taking away that particular limit. On the other hand, there’s no actual loop here; it all happens during pre-training. Recent advances in test-time inference (and especially the tantalizing prospect of cross-user / cross-session learning) could change this, but we’re not quite there — yet.39
Secondary Learning Loops
There also exist “non-core” or “secondary” data learning loops, which are even weaker and less moaty than core learning loops. I include them here for the sake of completeness:
Data quality loop: somewhat valuable, but quality is not a moat.
Product recommendation loop: eigenvalues and e-commerce are a cool combination, but also not a moat.
Product optimization loop: lol, no. A/B testing is not even a shallow ditch.40
I’ve heard each of these being referred to as a material source of defensibility. Nope.
If all these data learning loops are so weak, why do they have such a hold on the popular imagination? I suspect the reason is largely political. “We win because we use data and technology to provide a better service” is a much better sell than “we win because we charge monopolistic tolls on attention, commerce, and devices”.
Aside: The Barton And Von Ahn Loops
I want to highlight two data-loop entrepreneurs: Rich Barton and Luis von Ahn. They’ve each founded multiple decacorns or decacorn equivalents, using clear and distinctive approaches:
The Rich Barton Playbook: find an industry with knowledge that is valuable, fragmented, and opaque; make this hidden knowledge public; dominate search traffic; own “customer demand” in the industry; repeat. He’s done this for travel (Expedia), employment (Glassdoor), and homebuying (Zillow). Kevin Kwok wrote an excellent essay about this.
The Luis von Ahn Playbook: find a labelling problem where users get utility from the act of labelling — aka the two-sided learning loop. In Duolingo, users learn a language but also translate untranslated books. In reCAPTCHA, users authenticate themselves, but also label ambiguous images. (Mercor, not a von Ahn company, appears to be working along similar lines.)
What other repeatable data-moat playbooks exist? In The Economics of Data Businesses, I hypothesized that we’d see the emergence of lots of niche, vertical-specific, B2B data businesses. A year later, Travis May launched Shaper Capital to incubate precisely that style of business — “solving data fragmentation in various industries”, as he previously did at unicorns LiveRamp (identity) and Datavant (health). I’ll be watching the Travis May Playbook with great interest.
The Bootstrap And Switch
While learning loops may not be a sustainable moat in and of themselves, they are still valuable as the first leg of an extremely effective pattern that I call the bootstrap and switch.
In this pattern, you begin with a data learning loop to improve your product and grow your customer base; but as you reach scale, you build out other network effects and defensibilities that become your long-term moat. This is seen most often in content businesses (Facebook, Netflix, YouTube), and in marketplaces (DoorDash, Uber, Airbnb).
Whenever recommendation or matching plays an important role in the value added by a platform — as in the examples listed above — learning becomes super valuable, and drives rapid growth. But the long-term moats for those businesses come from network density, user trust, economies of scale, and attention-aggregation. Not from learning.
Usage/Value Loops
The next type of data loop, and possibly my favourite, is the usage/value loop: the more widely a particular data asset is used, the more valuable it becomes to its users. Whoever owns or controls this data asset mints money.
This loop comes in a few different flavours:
Data Exchange Standards
You provide an industry-wide “primary key” that can link records held by different companies. FactSet’s CUSIP identifier is a good example: it allows capital markets participants to agree on precisely which securities they’re trading, clearly and unambiguously.
In a financialized world, with multiple debt and equity classes, listing venues, and security types, this is critical: firms simply cannot function in capital markets without CUSIP. And the more people use it, the more valuable / universal / lucrative / essential it becomes. It’s a toll on the industry, with increasing returns to adoption!
(Similar products to CUSIP are DUNS, VIN and ISBN; this category seems to have a fondness for acronyms.)
Business Evaluation Standards
You provide an industry-wide “benchmark” that everybody uses to evaluate or price transactions. S&P’s eponymous S&P 500 Index is a good example: every investment manager (and every LP in an investment manager) compares their performance to this index, and it’s also used to price derivatives, structured products, ETFs, you name it. Again, the more it’s used, the more dominant it becomes. Another tax!
(Similar products: Nielsen’s viewership benchmarks, corporate and sovereign risk ratings, FICO scores.)
Pass-Through Loop
I talked earlier about why most data learning loops aren’t moats: because there are limits to the returns from data. But what if you could explicitly pass through some of those gains back to your customers? For a select few companies, in a select few markets, this means the limit does not exist. The bigger you get, the greater your gains.
For example, Stripe uses its massive transaction volume to flag and prevent fraud more effectively, and they pass that through to merchants in the form of lower take rates and fewer chargebacks. This is better than mere “product improvement” because it is both unbounded and dynamic41, and the more people participate, the better it is for everyone.
“Better underwriting through data” is an interesting and perennially tempting target for a pass-through loop. It all hinges on the word “better”. Most challengers take it to mean “broader”, which is problematic — it means you’re going after the marginal creditor, with all sorts of adverse selection, moral hazard, and short convexity effects. But “better” can also mean “faster”, in which case you’re broadening the market in time not in space: more turns for each dollar lent, hence higher IRR, which you can then pass through. The general (counter-intuitive) lesson here is that if “learning” unlocks “speed”, that’s usually more moaty than unlocking mere “quality”.
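The “faster, not broader” point is just compounding arithmetic. A minimal sketch, with an assumed 2% margin per loan (the numbers are illustrative, not drawn from any real lender):

```python
def annualized_return(margin_per_turn, days_per_turn):
    """Annualized return from recycling the same capital: the same
    per-loan margin, compounded over more turns per year."""
    turns_per_year = 365 / days_per_turn
    return (1 + margin_per_turn) ** turns_per_year - 1

# Same 2% margin per loan; the only difference is velocity.
slow = annualized_return(0.02, days_per_turn=90)  # ~8% annualized
fast = annualized_return(0.02, days_per_turn=30)  # ~27% annualized
```

Cutting the underwriting cycle multiplies turns, and the margin compounds across every extra turn — which is what creates the surplus to pass through to customers.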
Trust Loop
In many industries, “ground truth” isn’t fully objective, it’s subjective and endogenous — it’s defined by what people believe. In this scenario, being the “trusted source of truth” is a usage/value loop: the more people trust you, the more your data or content offering is accepted as ground truth, and the more people should trust you.
Examples include peer-review (Elsevier, Clarivate), security (Verisign), market research (IRI, Kantar), insurance (Verisk), and financial ratings (Fitch, Moody’s, Morningstar). You can bootstrap this trust with features like transparency or auditability, but beyond a certain point, these become self-fulfilling prophecies, hence moats. (And yes, these are all data or content assets.) 42
Implicit Knowledge Capture
This is an emerging loop that combines proprietary data, learning, trust, and usage/value. Many application-layer AI companies follow a certain template:
begin with a foundation model;
fine-tune it on vertical-specific knowledge;
automate basic workflows in that vertical;
discover edge cases and interaction effects and other wrinkles;
incorporate human feedback to get better at these messy “real world” tasks;
gain the trust of their human sponsors;
widen the aperture to replace or assist on ever more workflows;
repeat the above steps;
become irreplaceable.
The usage/value step here is capturing implicit (undocumented) human knowledge, which leads to better outcomes, which leads to more trust, which leads to more surface area. The limiting factor is thus the complexity of the real world, which is much higher than the point at which diminishing returns set in for a vanilla data learning loop.
Network Effects And Scale Effects
A business has a network effect when adding a new node (typically, a customer) to the network benefits all the other nodes (other customers), either directly or indirectly. This is a well-known and well-studied pattern, with examples from Amazon, Alipay and Airbnb to Zoom, Zomato and Zelle.
A data network effect is when the node of interest is data; hence, adding new data A makes data B more valuable, and vice versa.
True data network effects are rarer than once thought.
Sure, there are plenty of companies that exhibit weak data network effects. Several of the loops above exhibit this: they’re tenuous, or they plateau, or they decay with time. Meanwhile, there exist companies that exhibit strong data network effects, but they are often one-of-ones.
But there also exist data scale effects, and some of these can be quite powerful:
fixed cost amortization across a wider base of users and data
advantages in GTM: price discrimination, market coverage, segmentation
brand and trust effects: dominant positioning, category creation, inbound acquisition, efficient conversion, frictionless retention
market power: acquisitions, setting standards, commoditizing the complement
Scale effects aren’t quite as moaty as network effects, but they’re nothing to sneeze at either.
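Of the scale effects above, fixed cost amortization is the most mechanical; here is a one-line sketch (the numbers are illustrative assumptions):

```python
def unit_cost(users, fixed_cost=1_000_000.0, variable_cost=2.0):
    """Per-user cost when a fixed data-asset cost is amortized
    across the whole user base."""
    return fixed_cost / users + variable_cost

# Ten times the users means nearly a tenth the unit cost,
# letting the bigger firm undercut rivals at the same margin.
small_firm = unit_cost(10_000)    # 102.0 per user
big_firm = unit_cost(100_000)     # 12.0 per user
```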
The catch of course is that there are also diseconomies of scale: long-tail effects, diminishing marginal performance, disruption from below, quantity-quality tradeoffs and so on. So it’s a question of whether the lift outweighs the costs, and the answer is usually quite domain and business-model-specific.
Data Loops, Summarized
That brings us to the end of our survey of data loops. Here’s handy chart the second:
Overlaps And Evolution In Data Moats
Two things should be clear by now.
First, moats overlap. Many of the companies listed above have multiple data moats, and they’re often mutually reinforcing. Hybrid moats exist, combining elements of both data control and data loops. Catalyst data can arise from an acquisition loop; implicit knowledge capture can be built on top of a system of action; process power can lead to customer data aggregation; and so on. Looking for a “single” data moat is analytically reductionist, and probably not the best business strategy either.
Second, moats have lifecycles. Many moats only kick in above a certain scale or critical mass: brute force, network knowledge or complexity, clearinghouse, UGC and SEO, data gravity, give-to-get, biz model unlock, exchange and evaluation standards, trust and brand effects. And many moats have upper limits to their potency, thanks to LLM replacement, diminishing marginal value (of data quantity or quality), long-tail effects, and diseconomies of scale. This suggests that in addition to defining your moat, you need to understand where you are in its effectiveness curve.
CONCLUSION
How To Use This Essay
“Philosophers have only interpreted the world. The point, however, is to change it.”
This essay is for founders, investors, and operators who care about data moats: who want to build them for their own companies, or compete with rivals who have moats of their own. But as we’ve seen, “data moats” are a wide and varied concept.
Step 1, then, is to understand what type of moat you have (or can build). Do you have unique, valuable data, acquired in a future-proof manner? Do you unify fragmented data and build products or services on top of it? Do you control a system of record, or control data movement in a complex or regulated network? Do you have exclusive IP or contractual or compliance rights? Do you own data that gets more valuable with adoption, like an exchange or evaluation standard? Do you benefit from data brand effects? Do you have a quantity loop like UGC, SEO, G2G or data gravity? Can you unlock new business models with data? Is your moat future-proof and specifically LLM-proof? Do you have a moat that’s not in the list above?43
And so on. The set of questions is lengthy, but finite; the detailed exploration above should help you identify exactly what moats you have access to.
Step 2 is to understand how strong, and how durable, those moats are, and where you are in their lifecycles. Here’s handy chart the third:
As you can see, not all data moats are created equal44. Few understand this.
And therein lies alpha. Knowing the strengths and weaknesses of your own moats, of your rivals’ moats, and knowing what you (and they) can potentially build towards, is a major strategic advantage. Data moats are real, and they’re powerful, and AI makes them even more so; harness this power!
Have fun and good luck!
Toronto, April 2025.
END NOTES, REFERENCES, BONUS THOUGHTS
Essay Meta
“This tale grew in the telling” — J. R. R. Tolkien, foreword to The Lord of the Rings.
I feel an odd kinship with the good professor. At the start of every essay, I tell myself I’m going to keep it short and sweet this time. But the process of writing uncovers new ideas and insights, new underlying structures and themes, new data points and anecdotes, new illustrations and examples — all of which I want to include, because I think they make my essays stronger.
Today’s essay is no exception. It’s my longest yet, but I also think it’s one of my meatiest. I hope you agree!
My philosophy of essay-writing is “fewer but better”. I publish 2-3 essays a year. Each essay is 5-7k words long, takes 2-8 weeks of active writing, and at least 5-6 months of thinking before that. I only write when I have something meaningful, original, and material to say. I use Claude as a thought partner and research assistant, but every single word (and every em-dash) is written by me.
The algorithms hate this, but it works for me, and I think it selects for a high-quality group of readers who appreciate what I do and the work I put in. Thank you for being one of them!
… And A Request
I hope you found this essay interesting, informative, and useful.
If you did, I would greatly appreciate it if you could help spread it far and wide:
Send it to your colleagues and friends.
Share it on social media — X, LinkedIn, or the platform of your choice.
And of course, subscribe!
Your likes, comments, shares, and subscribes are the best incentive for me to keep writing these pieces, and giving them away for free.45
Tapping A Tuning Fork
After selling Quandl, the company I co-founded, to Nasdaq a few years ago, I’ve been on sabbatical — writing this newsletter, angel investing, and advising / serving on boards. But now I’m feeling the itch to jump back into a more active role, and so I’m tapping a tuning fork to see who resonates.
Specifically, I’m actively exploring startup ideas in the data + defensibility space covered by this essay. If you’re interested in working with me, and especially if you’re technical, please do reach out.
I’m also very open to interesting third-party opportunities. If you have one for me to explore, especially if it touches one or more of data, investing, and startups (the Pivotal trifecta), I’d love to hear from you.
Tapping A Tuning Fork, Part Two
I’m an active angel investor in technology startups, and I put my money where my mouth is: almost all my portfolio companies have (or are building) defensibility through data. Here are just a few examples:
Daloopa is building a foundational data layer for equity investors, involving elements of process power, clearinghouse, knowledge capture, and user trust.
Quandri is building an agentic system for insurance, incorporating a system of action, unique data, knowledge capture, and surface-area trust.
Arima has a unique data asset that is partly synthetic, partly clearinghouse; and is also a catalyst for customers to unlock their own data value.
Setyl is a system of record for IT assets, with a strong data-gravity effect.
Canopy is a system of vision / engagement for multi-family managers, with aspects of give-to-get and benchmarking baked in.
Cascade is a system of record / action for private credit, with data gravity, pass-through value capture, and network / complexity moats.
Citylitics has unique data, industry-specific labelling, implicit knowledge capture in a domain-tuned LLM, workflow embed, and G2G customer data aggregation.
These companies had negligible revenue when I invested; now they’re collectively at around $25M ARR. Data moats work!
If you’re the founder of a company with interesting data defensibility, or if you’re a VC who’d like to jam on these ideas or collaborate, please do reach out.
Companies Mentioned In This Essay
Acxiom, Adyen, Airbnb, Alipay, Amadeus, Amazon, Apollo, Ariba, Arima, Bain, BCG, BenefitFocus, Bolt, Booking.com, Cambridge Associates, Canopy, Cascade, Change Healthcare, Citylitics, Clay, Clearbit, Clarivate, Cursor, Daloopa, Datavant, Devin, Digg, DNV, Doordash, Dun & Bradstreet, Elsevier, Epic, Expedia, Facebook, Factset, FICO, Fitch, GasBuddy, Github, Gitlab, Glassdoor, Google, Greenhouse, Guidehouse, Gusto, ICFI, Instagram, IRI, IQVIA, Kantar, Kayak, Lattice, Leidos, LinkedIn, LiveRamp, Lovable, Mastercard, McKinsey, Mercor, Microsoft, Moody’s, Morningstar, MySpace, NASDAQ, Netflix, Neustar, News Corp, Nielsen, Numeral, NYSE, OpenAI, OpenStreetMap, Oracle, Palantir, PayScale, Plaid, Quandl, Quandri, Quora, Reddit, Remote.com, Replit, Rippling, S&P Global, Sabre, Salesforce, SAP, Scale, Setyl, Shopify, Stack Overflow, Stripe, Studio Ghibli, TikTok, Toast, TripAdvisor, Tumblr, Twitter, Uber, Verisign, Verisk, Vine, Visa, Waze, Weather Underground, Workday, X, Yahoo, Yelp, YouTube, Zapier, Zelle, Zillow, Zomato, Zoho, Zoom, ZoomInfo.
Further Reading
The Power of Data Network Effects — Matt Turck
The Empty Promise of Data Moats — Martin Casado
The Six Moats of Data Businesses — Travis May
Unfair Data Moats and Regulatory Capture — Alex Izydorczyk
Making Uncommon Knowledge Common — Kevin Kwok
Capabilities and Moats — Rick Zullo
DaaS Bible and DaaS Bible 2.0 — Auren Hoffman
The Vertical SaaS Knowledge Project — Dave Yuan and Tidemark
Acknowledgements
As a general principle, I do not share drafts of my essays or ask for feedback before publishing them; I don’t want to dilute my voice or opinions. I haven’t done so for this essay either. However, I do want to acknowledge the many conversations on this topic that I’ve had over the years with people whose thinking I respect:
Tripp Jones, Dave Marquardt, Dave Yuan, Kevin Kwok, Matt Ober, Sheetal Bahl, Bill Dague, Tammer Kamel, Carrie Shaw, Beth Adams, Sergei Ryshkevich, Auren Hoffman, Matt Turck, Greg Neufeld, Jeremy Giffon, Alex Izydorczyk, Evan Reich, Thomas Li, Ahmed Badruddin, Jackson and Jamieson Fregeau, Sunny Juneja, Winston Li, Christopher Batts, Joe Schmidt, Philippe Kwiatkowski, Nick Achkarian, Jeremy Diamond, Ruben Schreurs, Jared Bochner, Alex Niehenke, Leo Polovets, Ben Rollert, Susan Liu, “The Terminalist”, Abhishek Sharma, and the late, much missed Naren Gupta.
Thank you! (All errors are, of course, my own).
— AT.
Footnotes
Did you know that you can click on the footnote number — the 1 to the left of this line — to jump back to your spot in the essay? This will prove useful; read on.
There’s a vast literature around if, and how, and when, various types of moats are effective, with myriad catchy frameworks. 7 powers! 5 forces! 3 rings for the elven kings! You can add the current essay to the list.
“Tools are useless without materials, and materials don’t have value unless worked on with tools.” — a quote from my 2023 essay Data in the Age of AI. I think that essay may also have been one of the earliest to cite Jevons’ Law while analyzing NVDA. Go read it after you finish this one!
Data in the cloud is the new moat in the air. Sorry, not sorry at all.
I first used this meme in a talk I gave almost a decade ago, back when data was the new gold (oh, innocent days). The plan back then was 1. collect data 2. ???? 3. profit !!!! Plus ça change ...
But mostly in the footnotes.
aka virtuous cycles, flywheels, self-fulfilling prophecies, chain reactions, upward spirals — the names are varied, the mechanisms are similar.
See my previous essays The Economics of Data Businesses and especially How to Price a Data Asset for more about the unique dynamics of unique data.
One could argue that making money from exhaust data improves NYSE’s overall finances, and hence allows them to out-compete in their core business. But by that logic, anything that enhances margins is a moat. I think that’s stretching the definition beyond the point of usefulness.
Specialized vertical knowledge or domain expertise is sometimes proffered as a reason why LLMs won’t completely displace brute force, but I think the reality is exactly the opposite: the more specialized your domain, the easier it is for LLMs to grok.
Market timing and cost of capital are, in turn, often a function of geography. Consider ride-share map data: totally commoditized in developed economies, but a genuine moat (brute force, unique, self-improving) in developing countries, where capital remains expensive.
A fun reversal of the standard trope that “action produces information”, though the contexts are completely different.
Why do case studies always mention Visa, and never Mastercard? That’s the power of mythologizing: specifically, Dee Hock’s outstanding book One From Many.
This sometimes works even without external regulations or compliance to act as a forcing function. If the data flow is sufficiently complex, the same upside/downside calculus applies. We’ll see this more clearly in the discussion of system-of-record moats.
Note the similarity with clearinghouse data moats. Instead of unifying fragmented external data, SoRs can be thought of as clearinghouses for fragmented internal data. Systems of record make organizations internally legible.
For some reason, functional SoRs typically have their own category names: HRIS for HR, ERP for budget, CRM for sales and so on.
Each mega provider tries to expand into adjacent functions and become the One SoR to Rule Them All. Hasn’t happened yet.
And simultaneously, each mega provider is constantly undercut by smaller SoRs that try to do better on specific niche tasks. For example: Greenhouse (applicant tracking) and Lattice (performance management) and Benefitfocus (which, ahem, focuses on benefits) within the HRIS complex. Something, something, bundling, unbundling.
“Accurate” here does not mean “possessed of verifiable truth value”, as any experienced CRM user will tell you. It means “you won’t be fired for using this data”. An important distinction that goes to the heart of why a SoR / source-of-truth is so valuable.
Seen in this light, most SoRs establish loyalty through compulsion, not love. It’s a hostage situation, and users who claim to love their SoR software are suffering from Stockholm syndrome ...
There’s a name for that as well: the “Write-Only Ledger”.
The purported valuation of Cursor’s latest funding round. By way of comparison, Github was bought for $7.5B in MSFT stock in 2018, while Gitlab went public in 2021 at a market cap of $11B. Those numbers seem small today. Per BLS data, there are 1.7M software developers in the USA, with a mean annual wage of $140k. That’s roughly $240B of potential labour revenue to be captured.
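The back-of-envelope arithmetic behind that figure, using the footnote’s own BLS inputs (headcount and mean wage):

```python
# Sanity check: US developer labour pool, per the BLS figures
# cited in the footnote (~1.7M developers, ~$140k mean wage).
developers = 1_700_000
mean_wage = 140_000

total_labour_pool = developers * mean_wage
print(f"${total_labour_pool / 1e9:.0f}B")  # → $238B
```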
The counter-counter thesis is that foundation models will eat both the SoR and the app layer, and Sam Altman will fulfill his destiny and become first God-Emperor of humanity.
But why do they pay so much? IP is the enforcement mechanism, but the motivation derives from industry standards and brand moats. We’ll get to these soon.
Alex Izydorczyk has an excellent essay on this: Unfair Data Moats & Regulatory Capture
Don’t confuse software that catalyzes data — for example: data catalogues, taxonomy, and governance solutions — for catalyst data. The former is far less moaty; it tends to be replaced by SoAs — think Palantir — or these days, by LLMs.
Yes, it has taken me 5000 words to get to “the most familiar moat”. This is the Pivotal you all know and love.
It’s a sad indictment of Google that Google+ didn’t even make the list of interesting failures. But don’t feel too sorry for them. Google’s moat is elsewhere: as a system of record for the internet, plus a system of action for commerce. Infinite money glitch.
What might kill Facebook? Absent a new apex predator emerging — an Orca to FB’s Great White Shark — the most likely scenario is just generational transition. Old Facebookers never fade away or churn, they just ... die.
There’s also a sort of Pauli exclusion principle at work here. In any given geography + demographic + use case, the strength of the UGC data loop tends to result in a single, dominant platform. Facebook’s baseline geog-demog-use just happens to be the largest, and they use that position to buy or bully the competition.
i.e., via Google.
“Nobody goes there any more, it’s too crowded”. The SEO loop is a victim of its own success.
Honestly, it’s ludicrous that appending “reddit” to your query still works as well as it does. Major Google fail.
An interesting speculation is whether SEO will be replaced by “LEO” — “LLM-engine-optimization” — and who controls / benefits from transaction activity in that new paradigm: is it site owners, content and attention aggregators, app-layer agents, foundation models, or someone else?
The analogy doesn’t end there. Stars that grow too massive explode. Software platforms that grow too bloated are disrupted “from below” by cheaper, more focused offerings. Something, something, bundling, unbundling, redux.
Of course, the best SoRs grow, and the best gravity-assisted tools eventually become SoR-like sources of truth for their orgs. Indeed, all successful software platforms eventually become SoRs. It’s a form of convergent evolution: SoRcinization?
Single player value, seeding the platform, local concentration, come for the tool etc.
Moated, moatiness, moaty: this field needs a new vocabulary.
Breaking: while I was putting the final touches on this essay, OpenAI released an update to ChatGPT with “memory” across user sessions. The footnote you’re currently reading could well be the most important part of the whole essay.
I will point out here that optimizing a product is not the same as optimizing a business.
As Patrick McKenzie notes, the optimal amount of fraud in a payments system is not zero.
All usage/value loops rely heavily on brand effects. Some, like the trust loop, do so very explicitly; in other cases it’s more implicit. “Nobody gets fired for using dataset X” is the new “nobody gets fired for buying IBM”. This is not entirely exogenous; brand is often downstream of data quality and performance. IBM was pretty good, back in the day. (See also footnote 19).
The goal is “durable defensibility”; h/t Tripp Jones for the phrase.
Twitter used to be a terrific channel to share, discover and amplify long-form content; unfortunately, X just isn’t the same. So it takes a bit more effort; I get by with a little help from my friends.