Data pricing is often hand-wavy and inexact; datasets can be free or worth millions; very few people know how to price a data asset or even what criteria go into such a calculation. This essay is an attempt to change that.
Introduction
Data is the new oil, they say; data is the new gold. Very well, then: oil costs eighty dollars a barrel, and gold costs twenty-three hundred dollars an ounce. How much does data cost?
It’s a meaningless question.
The factors driving oil prices may be complex, but there’s a well-established consensus on transaction criteria: volume, location, grade, and date. There are exchanges which specify delivery rules for benchmark contracts like WTI, Brent and Dubai. When you buy a barrel of crude, you know what you’re getting.
Data ... is not like that. Data is inherently heterogeneous. Dataset A and Dataset B may both be bits on a drive somewhere, but often have absolutely nothing in common beyond that. Different fields, schemas, specs; different themes, coverages, informational content; different consumers, use cases, and value. Every barrel of WTI crude is identical; no two datasets are identical.
Does this mean that data pricing is all art, no science? Not quite. Data’s innate heterogeneity means that no criteria can be absolute; there’s no single formula you can apply. But there are definite principles that generalize across a wide range of data assets.
I was the co-founder and chief data officer of Quandl, a successful data marketplace (now owned by Nasdaq). In that role, I evaluated thousands of data assets and priced hundreds of data products. I can say, with some confidence, that I’ve priced more — and more varied — data products than almost anyone in the world.
In this essay, I’ll share a few of the things I’ve learned. I’ll start with some basic axioms of data value; then I’ll lay out the implications of those axioms. But first, an important aside on why this matters now.
There’s a new buyer in town
Historically, two industries have dominated transactions in data: finance and adtech1. These are the only two industries with multiple buyers, multiple valuable data assets, and multiple use cases, along with the ability to pay consistent, material, recurring revenue for data2.
This has changed. There’s a new buyer in town: AI. AI models have an insatiable appetite for training data — it's almost certainly the current limiting factor for growth in their abilities — and so their sponsors (MSFT, OpenAI, Anthropic and friends) go to great lengths to acquire such data. But the data that they need isn’t necessarily like the data that finance and adtech need, and their utility/value curves are also different. This has implications for the pricing of training data, rendering much past intuition irrelevant. I’ll try to highlight some of these new effects as we go through our list of pricing principles. Let’s go!
PART ONE: AXIOMS
Data has no innate value
Starting with the obvious (but often misunderstood): data has no innate value. The value of data comes from the value of what can be done with it3. Therefore, every discussion of price has to begin with understanding that value: how the data will be used, and by whom.
Data value depends on the use case
It’s meaningless to talk about data value without specifying how the data will be used. Financial statements aren’t useful for an advertising campaign. Audience profiles aren’t useful for equity analysis. But flip those around, and the datasets are not just useful; they’re essential. The use case is everything.
Data value depends on the user
The identical dataset, with the identical use case, may nonetheless offer different value to different users. The value of training data to OpenAI is very different from its value to a solo hacker. The value of capital markets data to Citadel is very different from its value to a retail investor. Every data user is unique.
Some of this is just a scale effect; the same data has more impact when deployed against millions of customers or billions in capital or trillions of parameters. But it's also a capability effect: Citadel and OpenAI can get a lot more out of the same data than a smaller, less sophisticated user. A large part of data pricing is finding useful proxies for these two effects. We'll come back to this idea.
Data is fundamentally additive
Data is additive in a way that software is not. If you have one CRM, you don't need a second; adding HubSpot and Pipedrive to your Salesforce doesn't improve your sales performance. The same goes for your ticket tracking system, your HRIS, your payroll and expense systems, indeed all your software. Duplication is anathema; don’t repeat yourself.
This is not the case for data! Adding more names to a list of prospects makes the list more valuable. And in fact you can add data along more than one dimension: more names, but also more fields for each name, and more details for each field; you can even combine three perfectly overlapping datasets to generate a higher-quality merge. Additive!
Data is actually rivalrous ...
An almost universal misconception is that data is a ‘non-rival good’: person A using a dataset does not prevent person B using the same data. This is incorrect.
It’s incorrect because it focuses purely on the mechanics of data consumption. It’s trivial to duplicate most datasets, and in that sense, yes, A and B can both ‘use’ the same data. But that does not mean they derive the same value from it.
Financial markets provide the clearest example of this. If a particular dataset (satellite images, say, or credit card transactions) holds unpriced information content, then hedge fund A can trade on that data to make money. But once that's done, hedge fund B cannot! The opportunity goes away. The data is effectively rival: only one party can act on it.
The non-rival misconception stems from treating data as having innate value. If that were the case, merely having the data would suffice. But as we know, the value of data lies in the value of what you can do with it. And that’s often contingent on nobody else doing the same thing. In actual data business practice, the more valuable a dataset is, the more effort its owners expend on keeping it exclusive, proprietary and protected. This wouldn’t happen if data were a non-rival good.
Note that with the planet-scale datasets used for AI model training, even the starting assumption — that it’s trivial to duplicate and consume data — is incorrect; these datasets are huge.
... until it isn't
Advantages don’t last forever — not in capital markets, not in AI training, not anywhere. Datasets become commoditized, and substitutes arise. At this point, the data becomes non-rival.
This hurts most data vendors, but is a massive boost for a select few. The dream of every data owner is for their data to become table stakes — commoditized, but essential.
Data assets have well-defined lifecycles
When pricing a data asset for sale, it’s critical to understand where it is in its specific lifecycle.
Early on, neither the dataset nor the market is mature enough to drive value. The dataset is incomplete, inaccurate, slow, irrelevant. The market lacks the tools or sophistication to use the dataset effectively. There may be exploratory activity, but transactions are rare.
The second stage is when early adopters realize there is alpha to be found in the data — better targeted ads, or better model evals, or excess market returns. (I'm using the word alpha here in its broadest sense — an edge over the rest of the industry). At this point the data is super-valuable (and priced accordingly) but the audience is still narrow. Most data assets, by the way, never make it this far.
The third stage is when the dataset becomes widespread in its industry. Substitutes proliferate, and prices decline. With more suppliers and more users of the data, its alpha decays; there's still some left, but not much4.
The final frontier is when the dataset becomes table stakes: when people use it, not because it confers an advantage, but because not using it would put them at a disadvantage. At this point, prices go back up — not as high as during the alpha phase, but higher than during the decay phase — and usage expands dramatically. This is the best position to be in as a data owner; companies that get to this stage can tap years if not decades of revenue from a single asset5.
It's the marginal lift that matters
Combining axiom 1 (the value of data lies in what can be done with it) and axiom 4 (data is fundamentally additive) leads us to a rigorous definition of data value:
The value of data is the value of the marginal change in actions taken after adding the data to your business process.
where that business process could be your model training step, your quant trading strategy, your ad auction, or anything else that’s data-driven.
Finding ways to estimate this marginal value is the key to effective data pricing6 7.
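To make this concrete, here is a minimal sketch of what a marginal-lift calculation might look like. Everything in it is hypothetical (the function names, the toy backtest, the 20% revenue-share assumption); the point is the shape of the calculation: run the buyer's process with and without the candidate dataset, and convert the difference into dollars.

```python
# A minimal sketch of marginal-lift pricing. Every name and number here is
# hypothetical; the point is the shape of the calculation.

def marginal_value(run_process, baseline_data, candidate_data, metric_to_dollars):
    """Estimate what a candidate dataset is worth to a specific buyer.

    run_process       -- the buyer's data-driven process (a training run, a
                         backtest, an ad-targeting simulation), returns a metric
    baseline_data     -- the data the buyer already has
    candidate_data    -- the dataset being priced
    metric_to_dollars -- converts the metric delta into dollars for this buyer
                         (this is where use case and user scale enter)
    """
    baseline_metric = run_process(baseline_data)
    combined_metric = run_process(baseline_data + [candidate_data])
    lift = combined_metric - baseline_metric   # the marginal change in outcomes
    return metric_to_dollars(lift)             # the value of that change


# Toy example: a quant buyer whose metric is annual strategy PnL.
def toy_backtest(datasets):
    # stand-in for a real backtest: more (useful) data, more PnL,
    # with diminishing returns
    return 1_000_000 * len(datasets) ** 0.5

willingness_to_pay = marginal_value(
    run_process=toy_backtest,
    baseline_data=["prices", "fundamentals"],
    candidate_data="credit_card_panel",
    metric_to_dollars=lambda lift: 0.2 * lift,   # buyer keeps 80% of the lift
)
print(f"Willingness to pay: ${willingness_to_pay:,.0f} per year")
```

The hard part, of course, is the metric-to-dollars step; that is exactly the estimate that Google's ad auction extracts automatically (footnote 6).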
PART TWO: IMPLICATIONS
Unique data is uniquely valuable
We now begin to understand why truly unique (aka proprietary) data is so valuable.
First, it's universally additive — you can combine it with almost any existing corpus, and increase its utility, and thus its value8.
Second, the data owner can (attempt to) control its lifecycle. By controlling the pace of commoditization / alpha decay / transition to table stakes, you maximize the area under the [price × transactions] curve.
Third, if and when a unique data product becomes table stakes, its owner has monopoly power; this is tantamount to collecting a tax on an entire industry. Paying for the data is just ‘the cost of doing business’.
Beware of functional substitutes
The value of data is the value of what can be done with it. Therefore, completely different datasets may be competitors, vying to offer the same value! This is ‘functional substitution’, and it throws off many who think their data assets are unique.
Here are two examples:
Foot traffic, email receipts, and credit card transaction logs are all very different datasets, but they offer similar insight: what people are buying at the mall. Your foot traffic data may be proprietary and unique, but credit card transactions are a functional substitute, and so your data isn't really unique in the value it offers.
Demographic profiles, social network behaviour, and search histories are all very different datasets, but they offer similar insight: what people are interested in, and hence what they are likely to buy.
Note two things here. First, functional substitutes can be (and usually are) additive. You can combine them to paint a richer picture of reality, especially if the underlying sources / mechanisms are sufficiently uncorrelated.
Second, there’s usually a hierarchy of value among functional substitutes, and it has to do with which dataset is ‘closest to the sun’ — i.e., most tightly linked to the underlying event of interest. Yes, demographics can sometimes predict buyer intent, but it's not as powerful (or valuable) a signal as e-commerce search activity, because the latter is much closer to the act of purchase.
Standard software pricing fails for data ...
The vast majority of enterprise software today is sold via tiered subscription plans. Tiers — with labels like bronze, silver, gold; or maybe individual, professional, enterprise — are distinguished by features, seat count, usage limits and other proxies for value delivered9.
Unfortunately, most of these proxies don’t work for data10. For example:
Pricing by seat doesn't work, because data value doesn’t scale linearly with user count. Adding one more data user is not like adding one more software user or login account. Data is used by teams; you can sometimes charge per team, but teams are heterogeneous in size, use case, and value generated.
Pricing by feature (in the software sense, not the ML sense) doesn't work. Indeed it’s not clear what a ‘data feature’ would even be. There’s rarely a slider that makes sense here.
Pricing by raw volume (i.e., how many terabytes are you buying) doesn't work unless the data is perfectly fungible, somewhat commoditized, and probably quite unstructured. But note that this actually is the case for some AI training datasets.
Pricing by API call doesn't work unless the data changes / decays extremely rapidly. (But if the API call triggers an action, not just a download, then this can work; see the discussion of wrappers below).
Pricing by download — the equivalent of paying for an on-prem software license — doesn't work because data is trivial to copy, and auditing is hard.
But there are other angles you can take that do work:
Pricing by structured volume works: you pay more for more records, or for more fields (if the data is tabular-ish), or for more coverage, or for more granularity. For example, paying for more profiles (in a marketing database) or longer history (in a financial database).
Pricing by quality kind of works. Note that quality exists across multiple dimensions — accuracy, completeness, annotations, structure. This deserves an entire section to itself; see below.
Pricing by access works. You can charge more for speed, recency, update cadence, exclusivity, and custom usage rights. These are genuine sliders, and many data vendors will explicitly tier their pricing based on what buyers want and need11.
Pricing by use case is something you can do with data that you cannot do with software. Software use cases are largely fixed: DocuSign won’t do your accounts, and Quickbooks won’t manage your signatures12. But the same dataset may help different users perform completely different tasks. (This of course is why data pricing is hard!)
Pricing by customer scale works. This is obvious, but worth being explicit about: large customers get more value from a given data asset than small ones, hence they should be willing to pay more. And this is not just a deep-pockets effect; it's rational behaviour on all sides.
Pricing by business unit is a special (and useful) case of pricing by customer scale; the business unit here could be a team, or a geography, or a product line, or even a specific model generation. Customers are usually more amenable to this slider than paying straight up for scale.
AI changes one thing in all the above, which is that pricing by raw volume now works. Data quantity is unreasonably effective in improving model outcomes, and so it makes sense to not quibble over structure, quality and access, and just pay by the petabyte.
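Put a few of these working sliders together and you get something like the toy calculator below. Every tier name and coefficient is invented for illustration; real data contracts arrive at these numbers through negotiation, and the sliders usually move in discrete jumps rather than continuously.

```python
# An illustrative multi-axis price calculator using the axes that do work for
# data. All tiers and coefficients below are made up.

ACCESS_MULTIPLIER = {          # speed / recency / exclusivity tiers
    "quarterly_file": 1.0,
    "daily_feed": 2.0,
    "realtime_api": 4.0,
    "exclusive": 10.0,
}

USE_CASE_MULTIPLIER = {        # the same records, licensed for different jobs
    "internal_research": 1.0,
    "production_model": 2.5,
    "customer_facing_product": 5.0,
}

def annual_price(records: int, fields: int, access: str,
                 use_case: str, customer_revenue_musd: float) -> float:
    # Structured volume: more records and more fields per record cost more,
    # with diminishing returns on record count.
    volume_component = 10_000 * (records / 1_000_000) ** 0.7 * (fields / 10)

    # Access and use-case rights act as multipliers, not add-ons.
    price = volume_component
    price *= ACCESS_MULTIPLIER[access]
    price *= USE_CASE_MULTIPLIER[use_case]

    # Customer scale: a crude proxy, capped so it does not dominate the price.
    price *= min(1 + customer_revenue_musd / 1_000, 5.0)
    return price

# A mid-size fund buying a daily feed for a production model:
print(round(annual_price(records=5_000_000, fields=20, access="daily_feed",
                         use_case="production_model",
                         customer_revenue_musd=500)))
```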
... unless you wrap the data in software …
A common pattern is to identify the most valuable use case for your data asset yourself; wrap it into a software app; and then sell the app. Google is the canonical example here: one of their data assets is ‘knowledge of customer intent’, which they wrap into their Ads business. Advertisers covet the ability to get their product in front of the ‘right’ customers; they’d do this themselves if they knew who the right customers were; Google has that info, and so does it for them, at a price.
Experian does the same for consumer credit data; Bloomberg built a whole terminal to manage access to its financial data assets; and for that matter ChatGPT is just a user interface, built on GPT-4’s model weights, which in turn rely on oceans of training data. (One could argue that given the pace of progress in both closed- and open-source LLM architectures, it’s the training data — and resulting model weights — that's the true secret sauce here.)
Wrapping your data into software has two major advantages: first, it's now much easier to link data value to value delivered; and second, you can now sell using the traditional software pricing axes. (Bloomberg charges per seat, Experian per inquiry, and Google runs its incredible keyword auction.)
Note, by the way, that monetizing your data asset by wrapping it in software is not at all the same as using data to improve the performance of your software asset. (This is an extremely common category error.) The latter is the classic ‘data learning loop’ used by businesses from Amazon to Zendesk; but data is not the core offering for those companies; data merely helps them optimize their software and operations.
… or you use the data to wrap a service
An increasingly common business model is to deliver what is essentially a service, in the form of a data asset.
Consider Scale.ai or Clearbit or Datavant. They each offer what is fundamentally a service — data labelling, profile enrichment, and medical record anonymization, respectively — but they deliver this service in the form of a data product that you access via download, API call or common key.
The beauty of this is that unlike traditional service models, this approach scales up: perform the service once, but sell it many times. And just as with software wrappers, you can monetize service wrappers along the traditional software pricing axes — by action, or by API call, or even by seat.
Data quality is multi-dimensional
High-quality data costs more. But what exactly do we mean by high quality? It turns out that — as with all things data — the definition of data quality depends on the use to which the data is put.
Quant and systematic hedge funds — perhaps the largest purchasers of raw data in the world — care about accuracy and precision. Their business is predicated on identifying data points that represent violations of efficient markets — outlier prices, predictable patterns, unexpected correlations, internal inconsistencies, misunderstood risk. Bad data points are dangerous because they suggest violations where none exist.
Adtech platforms — the other major buyer/user of data at scale — care about coverage and depth. Outliers and individual bad/missing data points don't matter that much; their business is all about capturing as much of the heart of the distribution as possible, so that when an advertiser defines a target profile, keyword or behaviour, they have a rich audience to offer them.
AI models in need of training data — the new buyer in town — care about structure and internals. As of right now (and this field evolves incredibly rapidly, so who knows how long this will remain true), the quality attributes of training data that appear to have the greatest marginal impact on model performance are structural (as opposed to informational) cleanliness (i.e. deduped, denoised, debiased data); annotations; data diversity; and perhaps some amount of domain specificity.
Note that these quality factors aren't mutually exclusive; ideally you want all of them (and more), no matter what your use case. But their relative importance varies.
How does this affect data pricing? Price discrimination is the obvious answer: the same data asset may have different value to different users, based on its distribution of quality factors. (This is just a restatement of axiom 3 above.)
A subtler point is that many of these quality attributes are improvable. You can annotate raw data; you can combine datasets to increase coverage; you can cross-reference datasets to boost accuracy. Taking these actions is an easy way to boost data value, either for external sale or for internal use.
(Who is best placed to take these actions? Data producers have the advantage that they can amortize the costs of these actions across multiple buyers; data consumers have the advantage that they know their own use cases and can therefore prioritize the most effective actions.)
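As a small illustration of one such action, here is a sketch of cross-referencing two overlapping tables to fill gaps and flag disagreements. The column names and values are invented, but the pattern (fill coverage from a second source, surface conflicts rather than silently resolving them) is the general one.

```python
# A minimal sketch of one quality action: cross-referencing two overlapping
# datasets to fill gaps and flag disagreements. Data is invented.
import pandas as pd

primary = pd.DataFrame({
    "company_id": [1, 2, 3],
    "revenue":    [120.0, None, 95.0],     # missing a value for company 2
})
secondary = pd.DataFrame({
    "company_id": [1, 2, 3],
    "revenue":    [120.0, 78.0, 90.0],     # disagrees on company 3
})

merged = primary.merge(secondary, on="company_id", suffixes=("", "_alt"))

# Fill gaps from the secondary source (boosts coverage / completeness).
merged["revenue"] = merged["revenue"].fillna(merged["revenue_alt"])

# Flag disagreements for review instead of silently picking a winner
# (boosts accuracy, and the flag itself is a sellable quality signal).
merged["conflict"] = (
    merged["revenue_alt"].notna()
    & (merged["revenue"] != merged["revenue_alt"])
)

print(merged[["company_id", "revenue", "conflict"]])
```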
Value also derives from non-quality factors
Depending on the user profile and specific use case, there may be other drivers of dataset value:
Provable compliance becomes more valuable as a data asset traverses its maturity curve. In the early days, when the utility of the data is unknown, participants are less inclined to pay for compliance. But as the ecosystem matures, this becomes more of a priority13.
Data provenance is valuable because it makes compliance easier and it reduces the scope for error introduction. Primary sources are the gold standard here.
Uncontaminated data is incredibly valuable in any use case involving prediction (gen AI, or quant investing). This is data that has never been mined or fitted against, and that therefore has not biased your evals. Uncontaminated data is a non-renewable asset; the minute you use it to either train or test, it loses its virgin status, and all future uses must reckon with the risk of overfitting.
Data fungibility — the idea that every record in your dataset is like every other — is good because it makes the data more tractable, more additive, and more repeatable in its value generation.
These are all internal drivers of value: attributes, not necessarily what you would think of as “quality”, that are nonetheless inherent to a dataset. But there are also external drivers of value, and many of them have to do with table stakes status.
Table stakes data comes in different flavours
Datasets can attain table stakes status in different ways:
Some datasets become standards for data exchange. S&P’s CUSIP for publicly traded securities, D&B’s DUNS for private companies, Datavant’s universal patient key for medical records, and LiveRamp’s RampID for advertiser audiences are all examples here, albeit with varying degrees of “table-stakes-ness / industry adoption”.
Some datasets become evaluation benchmarks. The dream here is to offer “ground truth” that every industry participant has to measure themselves against. S&P, Nasdaq and MSCI do this for investor performance via their market indices. Nielsen does this for TV advertising: both advertiser and network rely on Nielsen ratings to mark their contracts.
Some data owners have quasi-monopoly power, either driven by a data flywheel, or from a non-data network effect of some sort. Meta and Google are examples of the former: their knowledge graphs, constantly updated by user content and intent, dominate online advertising. Bloomberg is an example of the latter, thanks to the ubiquity of its terminal.
Some datasets rely on bundled usage to become table stakes. To work in a given industry, you have to (effectively) purchase a dataset; and conversely, to use the data, you have to be licensed to work in that industry. The relationship between individual Realtors, various Realtor organizations / brokerages / local certification bodies, and the MLS real estate database(s) is a good example of this.
Usage rights are monetizable
The value of data is the value of what you can do with it. Therefore, the more you’re allowed to do — the more usage rights the seller grants you — the more you should be willing to pay! This is a pricing effect that’s completely independent of quality, quantity, dataset internals, or table stakes status.
Common usage rights include:
Scope of use rights: in-house, in-product, customer-facing, full re-distribution
Ownership rights: transfer versus license, explicit versus implicit permissions, approved and forbidden use cases
Audit rights: usage tracking, observation, compliance, post-contract deletion
Derived data rights: modifications, contribution and attribution (e.g. if the data is combined with other datasets), ownership of downstream products
Compliance rights: liability, reps and warranties on the data, legal shields
Sophisticated data contracts delineate very carefully what you can and cannot do with the data you license, and these allowances often have a dollar value attached to them.
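One way to picture this is as a rate card over rights. The sketch below is purely illustrative (the rights listed and the uplift factors are made up), but it captures how sophisticated contracts turn each allowance into a priced line item.

```python
# A hedged sketch of how usage rights might enter a price. The rights and
# uplift factors are illustrative, not a standard rate card.
from dataclasses import dataclass

@dataclass
class LicenseTerms:
    internal_use_only: bool = True        # baseline scope
    in_product_use: bool = False          # data powers a product feature
    redistribution: bool = False          # buyer may pass data to its customers
    derived_data_ownership: bool = False  # buyer owns downstream products
    exclusivity: bool = False             # seller will not license to rivals

    def price_multiplier(self) -> float:
        m = 1.0
        if self.in_product_use:
            m *= 2.0
        if self.redistribution:
            m *= 4.0
        if self.derived_data_ownership:
            m *= 1.5
        if self.exclusivity:
            m *= 10.0
        return m

base_price = 50_000
terms = LicenseTerms(in_product_use=True, derived_data_ownership=True)
print(base_price * terms.price_multiplier())   # 150000.0
```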
Payment-in-kind is an emergent pattern
An interesting recent development is the emergence of payment-in-kind as a compensation pattern for data assets. When (say) OpenAI licenses content from a news media org, part of what they offer is to highlight that organization’s brand in chat conversations and link placements. This has multiple benefits: the media org gets ‘AI-organic’ traffic; the LLM gets fresh data; both sides get to use click-through and engagement data to measure the actual lift and value of the content; and there’s a recurring component to it all. Squint and you can even see analogies to Google’s one-two punch of AdWords and SEO for content creators.
For AI, data quantity matters — a lot
Does a small amount of high-quality (accurate, factual) data outperform a large amount of ‘good enough’ data? For AI use cases, the answer is increasingly in the negative.
We see this in ideas like ‘the unreasonable effectiveness of data’, and ‘scale is all you need’. There seems to be no upper limit to how much better models become, the more training data you throw at them. Fine-tuning and domain-specificity and especially human heuristics tend to plateau; this is ‘the bitter lesson’ that many researchers learned the hard way.
(A telling recent example is the way GPT-4, a one-shot model trained on vastly more tokens, outperforms BloombergGPT on almost all financial analysis tasks.)
One consequence here is that Sturgeon’s Law — “90% of everything is junk” — no longer holds so strongly for data. In the past, the vast majority of datasets in the world held no insight, catalyzed no actions, and had no value. AI’s voracious appetite for all sorts of training data means this is no longer the case: even the junky stuff helps model evals (perhaps not by a lot, but it’s not zero).
So does quantity unequivocally beat quality for AI? It's not quite that simple, because we also care about marginal impact and hence ROI. Sprinkling just a little bit of quality on top of your massive corpus — for example, via simple de-duping — has dramatic effects on model performance. As training sets grow ever larger, it’s often more efficient to do this than to acquire the next token; beyond a certain point, data quality scales better than data quantity14.
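For a sense of how cheap that sprinkle of quality can be, here is a toy near-duplicate filter over a text corpus. The shingling scheme and the 0.8 similarity threshold are arbitrary choices, and production pipelines use much more scalable machinery (MinHash, locality-sensitive hashing), but the principle is the same.

```python
# A toy version of the "sprinkle of quality" step: cheap near-duplicate
# removal over a raw text corpus before training. The shingle size and
# threshold are arbitrary choices for illustration.

def shingles(text: str, n: int = 5) -> set[str]:
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(len(words) - n + 1, 1))}

def jaccard(a: set[str], b: set[str]) -> float:
    return len(a & b) / len(a | b) if a | b else 1.0

def dedupe(corpus: list[str], threshold: float = 0.8) -> list[str]:
    kept, kept_shingles = [], []
    for doc in corpus:
        sh = shingles(doc)
        # keep a document only if it is not too similar to anything kept so far
        if all(jaccard(sh, seen) < threshold for seen in kept_shingles):
            kept.append(doc)
            kept_shingles.append(sh)
    return kept

corpus = [
    "the quick brown fox jumps over the lazy dog near the river bank",
    "The quick brown fox jumps over the lazy dog near the river bank today",
    "an entirely different document about data pricing and marginal lift",
]
print(len(dedupe(corpus)))   # 2: the near-duplicate is dropped
```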
The implications for data pricing are straightforward. Quantity matters; high-quality quantity is even better; and there's no upper limit.
Recurring revenue is always the key
The big challenge with pricing data for AI use cases is that most of the training value comes from the historical corpus, meaning there's limited opportunity for recurring revenue from ongoing updates. Reddit’s decade-plus of archived content is far more interesting to OpenAI than an incremental day, week, or month of new content15.
Recurring revenue is, of course, the secret to effective long-term data pricing and economics. It works best when the data has some combination of short shelf-life, regular update cadence and repeatable usage. AI training datasets typically don’t have these attributes16.
There are ways to mitigate this. You can charge a higher upfront price, of course, to make up for the lack of future payments; or you can try to charge per model or per training run.
The long-term solution here is to build a data flywheel — a ‘perpetual data machine’ that generates or captures a steady stream of new data, or constantly improves your existing data. These flywheels can be based on user behaviour (Reddit’s post data), or business model (Nasdaq’s exchange data), or tech (Google’s content-intent loop), or many other archetypes; the key is that they constantly offer new data to train against.
Synthetic data offers new economics
Synthetic data pipelines offer a way to generate unlimited quantities of high-quality, always-new training data, at far lower costs than acquiring comparable data ‘in the wild’. (Both the quantity and cost effects span orders of magnitude). And early indications are that well-constructed synthetic data is almost as effective as natural data in training frontier models. This offers the tantalizing prospect of bootstrapping the data curve indefinitely: use each generation of LLMs to generate synthetic data to train the next generation17.
The viability of synthetic data and the importance of quantity over quality are both negatives for the price of existing data assets. Proprietary data ain’t what it used to be!
(That being said, purely synthetic data tends to degrade over time, in a sort of entropic drift. A parallel phenomenon may occur in the wild, as LLM output proliferates across the web. The worst-case here is a Gresham effect, whereby bad content drives out good. All of these possibilities argue that provably-human proprietary data has its place after all.)
Sales cycle determines ACV
Normally, ACV determines sales cycle. Your price point decides your go-to-market motion (enterprise sales, inside sales, self-serve etc.), and hence the time it takes to close a deal. This is true for data sales as well, but there’s an additional effect where the causation runs the opposite way: sales cycle determines ACV.
Here’s why. In enterprise software, much of the time taken to close a sale is spent arming your champion, doing feature bakeoffs, getting buy-in from various stakeholders, choosing tiers and negotiating prices, onboarding users and so on. Very little of that process translates to data sales (no tiers, no features, no users).
Instead, much of the time in the data sales cycle is spent onboarding, linking, sampling, testing and using the actual data. And this is expensive; MAD-stack engineers don’t come cheap. The more time and resources a buyer spends on a dataset, the stronger a signal it is that the buyer thinks the dataset is valuable.
Given how opaque data value often is, this signal is meaningful — especially if the buyer is large and sophisticated. Sellers recognize this, and bump up their prices18.
Legibility determines market size
Much of this essay has been about pricing individual data assets or transactions. But to data owners, it’s also important to know the size of the market; after all, revenue is not price alone, it’s price times transactions. The most lucrative data assets are often those with a (relatively) low price and a large transaction base.
What determines market size? Legibility.
The more legible a dataset is — the easier it is to objectively compute its ROI — the larger its market. This manifests as more customers, more types of customer, more usage within a given customer, higher prices and lower acquisition costs19. It’s no coincidence that adtech and finance are the most lucrative data verticals: those are also the industries where it’s easiest to put a dollar value on databases, datasets, and sometimes individual data records. Will AI training data reach the same level of maturity? We shall see!
Conclusion
So there you have it: 5000+ words on data pricing. We’ve covered use cases and users; quality and quantity; internal and external value factors; pricing axes and maturity curves; table stakes and usage rights; and much more. If there’s anything I’ve missed, I’d love to hear from you in the comments. Happy data pricing!
If you liked this essay:
Share it with others who might like it.
Subscribe to my newsletter, and get your friends to subscribe too!
I’m an active early-stage investor in companies with interesting data assets; if this is you, please reach out.
I write infrequently, but I like to think my essays are worth the wait. Your subscriptions, shares, comments and likes are the strongest incentive for me to write more!
Toronto, May 2024
1. I'm using broad definitions here. Finance includes capital markets, insurance, banking and retail financial services. Adtech includes online ads, email, and other digital sales and marketing infra.
2. Medical and healthcare data is sometimes cited as a third such category, but much of the value in that vertical comes from data management, analytics and infrastructure, not the raw data itself.
3. To be fair, this is true of oil and gold as well; their value accrues from the actions they enable (energy/work and monetary exchange).
4. Most datasets in the wild have already decayed, which is why people think data is non-rival and/or not really very valuable. But they're not sharing the good stuff!
5. Within capital markets, for example, unstructured training data is somewhere between stage 1 and 2; a lot of ‘alternative data’ is in stage 3; market data is in stage 4.
6. This is why Google’s ad auction is such a work of genius — Google doesn’t need to estimate marginal value; the auction mechanism forces buyers to reveal it.
7. This is also why table stakes status is so powerful: if you can’t do without the data, then the marginal value is basically your entire business.
8. Within reason, of course. Data on, say, the distribution of shoe sizes in Inner Mongolia is unlikely to add much utility to the typical Wall St quant algo, no matter how unique.
9. The fancy term for this is ‘multi-axis variable pricing’. There are multiple variables whose value you can slide up and down — seats, features, usage — and the price is a function of the combination you choose. In practice, it’s easier for both buyer and seller if the sliders move in discrete jumps rather than continuously; hence tiers. In further practice, the slider-implied price is often just the starting anchor for a negotiation by the sales team.
10. Which is not to say that people don’t try using them, sometimes even with some success. But the sales process tends to be longer and more contentious (on both sides), because of fundamental misalignments between these proxies and actual value.
11. Some people call these data features, but these attributes are typically not features of the data per se, merely of the commercial model wrapped around it, so I think that’s a misnomer.
12. Excel, of course, can do anything.
13. A good example here is early-generation LLMs scraping the web for training data without asking for permission; these days, however, all the big players are signing data usage contracts with content owners.
14. This discussion is partly a reflection of our current historical juncture, where the limiting factors to AI performance seem to be compute, energy and data. We're short on all three of those, while model architectures have spare capacity; research suggests that most current models are over-trained.
15. Alex Izydorczyk of CyberSyn has coined the phrase ‘marginal temporal value’ for this phenomenon, and it’s spot on.
16. The notable exception here is news media. It’s no surprise that AI content-licensing deals in this industry typically include a fixed component (for the archives) and a variable component (for fresh articles). An even more sophisticated spin on this is to link the variable component to reader click-through and engagement — almost a rehash of the Google model, but with LLMs as the intermediary instead of the search bar.
17. The extremely hand-wavy intuition here is that we want models to reason about the world, not merely memorize facts. To achieve this, massive quantities of internally-consistent, well-structured, non-duplicative data are more effective than smaller sets of repetitive and messy real-world knowledge. Hence the efficacy of synthetic data, assuming you've moved beyond a baseline level of knowledge, aka the start of the bootstrap.
18. This also explains the phenomenon whereby messy datasets are often more expensive than easy-to-use ones. Some of that is rational: if you believe you have an edge over your rivals in handling messy data — and many top data firms believe precisely this — then it makes sense to bid more for it. And some of it is irrational: data buyers are not immune to sunk cost fallacy (especially at the organizational level).
19. Perversely, it also manifests in the form of higher churn rates: if the ROI vanishes, customers stop using the data (unless it’s table stakes for their industry). Conversely, there exist ‘illegible’ datasets whose value is impossible to measure — the ROI is opaque, the insights unfalsifiable — and customers for this kind of data almost never churn. But that market is understandably limited.