
AI Has Run Out of Public Training Data. Regulated Firms Hold What Comes Next.


AI Training Data | Data Monetisation | Data Valuation | DataVault | DE Marketplace | Intangible Assets

Data Monetisation | 8 min read | April 2026


[Figure: The AI training data supply gap, with proprietary regulated enterprise data as the emerging supply source for foundation model development]


Goldman Sachs's head of data strategy has stated publicly that AI has already run out of training data. The precise claim is narrower: it has exhausted the freely available, high-quality, human-generated public text and content that has powered foundation model development for the past decade. What has not been exhausted, and in fact has barely been touched, is the corpus of proprietary, structured, and longitudinal data held by regulated enterprises in financial services, healthcare, energy, and retail. This data is categorically different from what is available on the public internet. It is governed, consent-managed, temporally deep, and behavioural in ways that synthetic alternatives cannot replicate. The AI labs know this. The question is whether your organisation does.

The global market for AI training datasets is valued at approximately $4.4 billion in 2026 and is projected to exceed $23 billion by 2034. That growth trajectory is being driven almost entirely by demand for proprietary, domain-specific data from regulated industries: clinical records, financial transaction histories, energy consumption patterns, and retail behaviour datasets that no public internet scrape can supply. Regulated firms are, right now, sitting on the supply side of one of the fastest-growing data markets in history. Most of them have never formally valued what they hold.


Key Takeaways

  • Goldman Sachs's data chief has publicly stated that AI has already run out of freely available public training data; the next supply is proprietary enterprise datasets from regulated industries.
  • The AI training data market is valued at approximately $4.4 billion in 2026 and is projected to reach $23.18 billion by 2034, driven by demand for domain-specific, high-quality data that synthetic alternatives cannot credibly replace.
  • AI-enabled companies with demonstrable proprietary data assets command significant market premiums: AI-native healthcare startups with proprietary clinical data received an 83% valuation premium in venture funding in 2025 over non-AI counterparts lacking comparable data assets.
  • The International Accounting Standards Board moved its intangible assets review, which includes data resources as an explicit test case, from research to active standard-setting in January 2026, raising the prospect that formal data valuations will become a financial reporting requirement.
  • Gartner predicts that by 2026, more than a quarter of Fortune 500 CDAOs will be responsible for at least one top-earning product based on data and analytics, yet most regulated firms lack the valuation infrastructure to support that commercial responsibility.
  • DataEquity's DataVault and DE Marketplace provide the discovery, valuation, and commercial distribution capabilities that allow regulated firms to enter the AI data market as informed, appropriately priced sellers.

The AI Data Scarcity Crisis: What Is Actually Happening

The mainstream narrative around artificial intelligence has concentrated overwhelmingly on model capability: parameter counts, benchmark performance, and the pace of product releases from foundation model providers. The supply-side story has received considerably less attention and is significantly more consequential for regulated enterprises.

Modern large language models and domain-specific AI systems are trained on vast corpora of human-generated text, code, and structured data. For several years, the primary source was the public internet: web crawls, digitised books, academic papers, and publicly available records. Multiple independent analyses now converge on the same conclusion: freely available, human-generated, high-quality public text data is exhausted or close to it. Researchers project that this supply constraint will extend across the 2026 to 2032 period, with different data categories running out at different points and the most severe scarcity in structured, domain-specific content.

The Synthetic Data Problem

The AI industry's first response to this constraint has been synthetic data generation: using existing models to create training data for subsequent models. Synthetic data has genuine utility for certain narrow applications, but it carries a fundamental limitation. Models trained predominantly on synthetic data develop what researchers call model collapse, a progressive degradation in performance as the distribution of synthetic outputs drifts from the original distribution of real-world human behaviour. For applications in financial risk modelling, clinical decision support, or energy demand forecasting, synthetic data cannot replicate the causal, longitudinal signal that only real transactional records carry.
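The mechanism is easy to demonstrate in miniature. The Python sketch below is an illustration, not drawn from any of the analyses cited here: it repeatedly fits a simple Gaussian model to samples generated by the previous generation's model, and over successive generations the fitted distribution drifts away from the original data and tends to lose variance, which is the toy version of model collapse.

```python
import numpy as np

rng = np.random.default_rng(42)

# "Real" data: a stand-in for genuine human-generated records.
real_data = rng.normal(loc=0.0, scale=1.0, size=10_000)

mu, sigma = real_data.mean(), real_data.std()
print(f"generation 0: mean={mu:+.3f}, std={sigma:.3f}")

# Each subsequent generation is trained only on samples drawn from the
# previous generation's model, never on the real data again.
for generation in range(1, 11):
    synthetic = rng.normal(loc=mu, scale=sigma, size=1_000)
    mu, sigma = synthetic.mean(), synthetic.std()
    print(f"generation {generation}: mean={mu:+.3f}, std={sigma:.3f}")

# Over successive generations the fitted distribution drifts and tends to
# lose variance: the tails of real-world behaviour disappear, which is
# precisely the signal that risk and forecasting models depend on.
```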

The World Economic Forum and leading AI researchers have flagged this explicitly: the competitive advantage in AI will accrue to those who control access to curated, real-world, proprietary data, not those who generate the most synthetic output. Goldman Sachs's data chief has framed the same point from the demand side: the data that AI needs is sitting inside regulated enterprises, waiting to be unlocked.

[Figure: The AI training data supply gap: a declining curve of publicly available high-quality text data from 2015 to 2026, overlaid with a rising demand curve from foundation model development; proprietary enterprise data fills the gap from 2024 onwards, with financial services, healthcare, energy, and retail labelled as the primary holders of this premium supply]


Why Regulated Industry Data Commands a Premium

Not all proprietary data is equal in the AI training market. What AI developers require, and are prepared to pay a material premium for, has specific characteristics that regulated industry data uniquely provides.

Regulated financial services data carries properties that are extraordinarily difficult to replicate: it is longitudinal, spanning years or decades of customer behaviour; consent-managed, making it legally usable for certain downstream applications; highly structured, captured in standardised formats under regulatory reporting requirements; and causally rich, linking individual behaviours to verifiable outcomes such as credit default, product uptake, or financial distress. The same structural properties apply to clinical healthcare records, energy consumption data tied to household or industrial characteristics, and retail transaction datasets linked to demographic and geographic signals.

The market is already pricing this premium in ways that are visible and quantifiable. In the healthcare sector, AI companies with demonstrable proprietary data assets commanded an 83% valuation premium in venture funding in the first half of 2025, compared with AI-native companies that lacked comparable data. Tempus, a healthcare AI platform built on 45 million de-identified patient records and over 400 petabytes of clinical data, has become a widely cited example of data-as-asset valuation: the data estate is a central pillar of the company's commercial and investment case, not a byproduct of its clinical operations.

In financial services, equivalent premiums are emerging for firms that can provide structured, consent-managed transaction data, creditworthiness signals, and behavioural datasets to AI developers building risk and advisory products. The challenge for most regulated firms is that they hold the data but lack the commercial infrastructure to monetise it: they cannot demonstrate its quality and governance credentials to a prospective buyer, have no mechanism to price access, and have no channel through which pre-qualified buyers can find them.


The Valuation Gap: Why Firms Cannot Monetise What They Cannot Measure

There is a structural irony at the heart of the current AI data market: the firms that hold the most valuable data are, by and large, the worst positioned to sell it. This is not because of regulatory constraint or institutional reluctance. It is because they have never formally valued what they own.

Valuing a data asset for commercial purposes requires more than knowing that a dataset exists. A credible, buyer-facing valuation must establish the dataset's uniqueness relative to market alternatives, its completeness and temporal depth, the robustness of its consent and governance framework, its technical readiness for third-party consumption, and its alignment with identifiable AI development use cases. Without this foundation, a firm cannot price API access, cannot respond credibly to a data buyer's due diligence questions, and cannot negotiate terms that reflect the genuine worth of the asset.

The scale of this gap in regulated industries is significant. Most financial services, healthcare, and energy organisations have never commissioned a formal data asset valuation and have no dedicated function for data commercialisation. They may be aware, in general terms, that their data has potential value. Without a structured methodology, however, they cannot convert that awareness into actionable intelligence: which datasets are worth the most, to which categories of buyer, under which commercial model, and with what remediation required to reach a state of market readiness.

The On-Premise Constraint

For regulated industries, data discovery and valuation carry an additional complexity that general-purpose cloud tools cannot address. Regulatory obligations, internal security policies, and client confidentiality requirements typically prevent regulated firms from uploading data to external platforms for assessment. Any discovery or valuation methodology that requires data egress creates the very compliance exposure it is supposed to manage.

DataVault, DataEquity's on-premise data discovery and valuation platform, addresses this directly. Its On-Premise Assessment Agent evaluates datasets using a five-lens methodology, covering uniqueness, coverage, regulatory quality, technical readiness, and market demand alignment, without requiring any data to leave the firm's environment. The output is a Data Equity Score and Market Readiness Score for each assessed dataset: a quantified, defensible valuation that forms the evidential basis for every subsequent commercial and governance decision.
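DataEquity's scoring methodology is proprietary, so the following Python sketch is purely hypothetical: it shows the general shape of a five-lens weighted assessment and the kind of composite and readiness outputs such an exercise produces. The lens names follow the article; the weights, threshold, and example scores are invented for illustration.

```python
from dataclasses import dataclass

# Hypothetical illustration only: the lens names follow the article, but the
# weights, threshold, and score formula are assumptions, not DataEquity's
# actual methodology.

LENS_WEIGHTS = {
    "uniqueness": 0.25,
    "coverage": 0.20,
    "regulatory_quality": 0.25,
    "technical_readiness": 0.15,
    "market_demand_alignment": 0.15,
}

@dataclass
class DatasetAssessment:
    name: str
    lens_scores: dict  # each lens scored 0-100 by an assessor or agent

    def composite_score(self) -> float:
        """Weighted composite across the five lenses."""
        return sum(LENS_WEIGHTS[lens] * score
                   for lens, score in self.lens_scores.items())

    def market_ready(self, threshold: float = 70.0) -> bool:
        """Treat a dataset as market-ready only if no single lens falls
        below the threshold: one weak lens (e.g. consent gaps) blocks
        commercialisation regardless of the composite score."""
        return all(score >= threshold for score in self.lens_scores.values())

assessment = DatasetAssessment(
    name="retail_transactions_2015_2025",
    lens_scores={
        "uniqueness": 85,
        "coverage": 78,
        "regulatory_quality": 62,   # consent documentation incomplete
        "technical_readiness": 74,
        "market_demand_alignment": 90,
    },
)

print(f"{assessment.name}: composite {assessment.composite_score():.1f}, "
      f"market ready: {assessment.market_ready()}")
```

The gating design choice in the sketch reflects the point made above: a dataset with strong demand alignment but weak consent documentation is not a saleable asset until the governance gap is remediated.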


The IASB Signal: Data on Balance Sheets Is Coming

The commercial urgency of data valuation is compounded by a governance imperative that regulated firms' finance and audit functions cannot ignore.

In January 2026, the International Accounting Standards Board moved its intangible assets review, which specifically includes data resources as an explicit test case, from its research agenda to its active standard-setting work plan. The distinction is material: standard-setting work can produce exposure drafts and, ultimately, revised accounting standards that alter how companies are required to account for and disclose their data assets.

Current standards under IAS 38 prevent recognition of most internally generated intangible assets, including proprietary data built through operational activity. A financial services firm's decades of transaction records, or a healthcare provider's clinical dataset, appears nowhere on the balance sheet despite representing demonstrable commercial value. The IASB's review signals a serious reconsideration of this position, driven in part by the widening gap between the reported book value and the evident market value of data-intensive enterprises.

For CDOs and finance directors in regulated industries, the implication is clear. If data assets move towards balance sheet recognition, even in an enhanced disclosure form, organisations will need credible, auditable, and methodology-driven valuations that can withstand scrutiny from auditors, regulators, and investors. The UK Endorsement Board has confirmed it is tracking the IASB project closely. Building valuation capability reactively, once a standard is issued, will be considerably more disruptive and costly than building it proactively in 2026. The organisations that establish robust data valuation frameworks now will have a material head start when reporting requirements arrive.


Building a Data Commercialisation Capability

The path from acknowledging that data has commercial value to realising that value requires three sequential capabilities: discovery, valuation, and distribution. Each is necessary; none is sufficient alone.

Discovery is the foundation. A firm cannot commercialise datasets it has not mapped. This requires a systematic inventory of data assets across all operational environments, including legacy systems, acquired portfolios, and operational data warehouses, identifying what exists, where it sits, what governance frameworks cover it, and what technical condition it is in. For regulated firms, this exercise must be conducted within the firm's own security perimeter, and on-premise assessment tooling is the appropriate mechanism.
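As a purely illustrative starting point, the sketch below shows the kind of metadata a first discovery pass might collect for file-based assets inside the firm's own perimeter. The paths, file types, and fields are assumptions; a real exercise would extend to databases, warehouses, and acquired systems, and would feed a far richer governance catalogue.

```python
import os
import csv
from datetime import datetime, timezone

# Illustrative only: a minimal file-level inventory pass that never moves
# any data, only records where assets sit and their basic characteristics.

DATA_ROOTS = ["/srv/data", "/mnt/legacy_exports"]   # assumed locations
EXTENSIONS = {".csv", ".parquet", ".json", ".xml"}

def inventory(roots):
    for root in roots:
        for dirpath, _dirnames, filenames in os.walk(root):
            for filename in filenames:
                if os.path.splitext(filename)[1].lower() not in EXTENSIONS:
                    continue
                path = os.path.join(dirpath, filename)
                stat = os.stat(path)
                yield {
                    "path": path,
                    "size_mb": round(stat.st_size / 1_048_576, 2),
                    "last_modified": datetime.fromtimestamp(
                        stat.st_mtime, tz=timezone.utc).isoformat(),
                }

# Write the inventory to a local catalogue file within the same perimeter.
with open("data_inventory.csv", "w", newline="") as handle:
    writer = csv.DictWriter(handle,
                            fieldnames=["path", "size_mb", "last_modified"])
    writer.writeheader()
    for record in inventory(DATA_ROOTS):
        writer.writerow(record)
```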

Valuation follows discovery. Once a firm knows what data it holds, it can assess commercial worth using a structured methodology. This produces the scoring outputs that anchor every subsequent commercial decision: which datasets are ready to take to market, which require remediation, which have the highest value-to-effort ratio for commercialisation investment, and what price each dataset can credibly sustain in a buyer negotiation.

Distribution is the final step. DataEquity's DE Marketplace provides an AI-driven commercial channel connecting organisations with pre-qualified buyers for their proprietary datasets. Pre-qualification matters acutely in this market: regulated data has a narrow universe of legally eligible, technically capable, and commercially serious buyers. A marketplace that screens for these characteristics eliminates the substantial due diligence burden that would otherwise fall entirely on the selling organisation, whilst reducing the risk of data misuse or contractual disputes.

The three capabilities are sequential but interlocking. A firm that completes all three will have converted a dormant, balance-sheet-invisible asset into a quantified, governed, and commercially distributed revenue stream, and will have done so before any IASB standard makes such valuations a reporting requirement.


Frequently Asked Questions

What does "AI training data" mean in a business context, and why are regulated industries particularly relevant to this market?

AI training data is the corpus of examples used to develop and refine machine learning models. The quality and specificity of training data directly determines the performance and applicability of the resulting model for real-world use cases. Regulated industries are particularly valuable to AI developers because their data carries properties that public internet data lacks: longitudinal depth spanning years of customer or patient behaviour; structured governance with consent frameworks and data lineage; and causal links to verified real-world outcomes such as default, diagnosis, energy consumption, or purchase behaviour. These properties make domain-specific regulated data significantly more valuable for applications in credit risk, clinical decision support, energy demand forecasting, and personalised financial advice than general-purpose public datasets, and command corresponding price premiums.

How do data protection regulations affect a regulated firm's ability to sell proprietary data to AI companies?

Data protection law under the UK GDPR governs the lawful basis for processing and sharing personal data. Selling raw personal data is not, in general, a permissible commercial model. However, regulated firms routinely hold or can create compliant data products: properly anonymised datasets, de-identified records, population-level aggregations, and structured access arrangements that preserve individual privacy whilst providing the statistical properties AI developers require. The critical requirement is that the consent framework, anonymisation methodology, and contractual terms are demonstrably robust and documented to an auditable standard. A formal data valuation exercise conducted within the firm's own environment is also the appropriate moment to assess whether each candidate dataset meets the legal threshold for commercial distribution, before any buyer engagement begins.
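As an illustration of what a compliant data product can look like at its simplest, the sketch below aggregates individual-level records to population level and suppresses any group below a minimum size, in the spirit of k-anonymity. The field names and threshold are invented, and no sketch of this kind substitutes for a documented anonymisation methodology signed off by legal and data protection functions.

```python
import pandas as pd

# Illustrative only: aggregate individual records to population level and
# suppress any group smaller than a minimum size. Real release thresholds
# are typically higher and set with legal and DPO sign-off.

K_MIN = 3  # suppress groups with fewer than K_MIN individuals

transactions = pd.DataFrame({
    "postcode_district": ["SW1", "SW1", "M1", "M1", "M1", "LS2"],
    "age_band": ["25-34", "25-34", "35-44", "35-44", "35-44", "25-34"],
    "spend_gbp": [120.0, 85.5, 42.0, 310.0, 55.0, 18.0],
})

aggregated = (
    transactions
    .groupby(["postcode_district", "age_band"], as_index=False)
    .agg(individuals=("spend_gbp", "size"),
         mean_spend_gbp=("spend_gbp", "mean"))
)

# Drop any cell that could describe a small, potentially identifiable group.
releasable = aggregated[aggregated["individuals"] >= K_MIN]
print(releasable)
```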

What is the typical timeline for taking a proprietary data asset from initial discovery to a completed commercial transaction?

The process runs in three stages. The first is discovery and valuation: a systematic inventory and scoring of the firm's data assets to identify commercially viable candidates and their current market readiness. This typically takes between four and twelve weeks depending on the complexity and scale of the data estate. The second stage is remediation and governance preparation: addressing the gaps identified in the valuation, which may include improving data quality, updating consent frameworks, and creating technical documentation sufficient for buyer due diligence. This stage varies from several weeks to several months depending on severity. The third is commercial distribution: engaging with pre-qualified buyers through a governed marketplace, conducting due diligence, and establishing data-sharing agreements. End-to-end, a firm starting from a standing position can typically complete its first commercial data transaction within six to twelve months of beginning the discovery process.

What are the risks of entering the AI data market without robust data governance in place?

The risks operate on two dimensions simultaneously. On the regulatory side, sharing data under inadequate governance exposes the firm to ICO enforcement action under the UK GDPR, sector-specific sanction from the FCA, CQC, or Ofgem depending on industry, and reputational damage disproportionate to any direct financial penalty. Commercially, the risks are equally significant: a firm that engages buyers without an accurate valuation will consistently undersell its assets, whilst a firm that delivers data of poor or undocumented quality will face buyer disputes, contract terminations, and marketplace reputational damage that restricts future commercial opportunities. The investment in governance and valuation infrastructure is the same investment that creates the commercial premium: the two objectives are not in tension.

How is proprietary regulated data typically priced in the AI training market, and what determines the premium a firm can command?

Pricing in the AI training data market is not commoditised. Premium proprietary datasets are typically structured as bespoke commercial agreements combining upfront licensing fees, usage restrictions tied to specified AI applications, and in some cases revenue-sharing mechanisms linked to the downstream value the data enables. The key determinants of price are the uniqueness of the dataset relative to obtainable alternatives; the longitudinal depth of the historical record; the robustness of the consent and governance framework as evidenced by documentation; and the specificity of the dataset's alignment to the buyer's intended use case. Firms that have completed a formal valuation using a structured methodology can negotiate from a position of documented evidence rather than intuition, and can defend pricing positions under buyer due diligence scrutiny.
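As a purely illustrative piece of arithmetic rather than a market benchmark, the toy model below shows how those four determinants might be combined into an upfront licence fee. The base fee, multiplier ranges, and input values are invented for illustration.

```python
# Toy pricing sketch: invented multipliers applied to an invented base fee,
# purely to show how the determinants named above interact.

def licence_fee(base_fee_gbp: float,
                uniqueness: float,           # 0.0 (commodity) to 1.0 (sole source)
                years_of_history: int,
                governance_documented: bool,
                use_case_fit: float) -> float:  # 0.0 (generic) to 1.0 (exact fit)
    uniqueness_multiplier = 1.0 + 2.0 * uniqueness
    depth_multiplier = 1.0 + min(years_of_history, 20) / 20.0
    governance_multiplier = 1.2 if governance_documented else 0.8
    fit_multiplier = 0.5 + use_case_fit
    return (base_fee_gbp * uniqueness_multiplier * depth_multiplier
            * governance_multiplier * fit_multiplier)

fee = licence_fee(base_fee_gbp=100_000, uniqueness=0.8,
                  years_of_history=12, governance_documented=True,
                  use_case_fit=0.9)
print(f"Illustrative upfront licence fee: £{fee:,.0f}")
```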

What are DataVault and DE Marketplace, and how do they support data commercialisation in a regulated environment?

DataVault is DataEquity's on-premise data discovery and valuation platform. It deploys an On-Premise Assessment Agent that surveys and scores a firm's data assets across five commercial dimensions, covering uniqueness, coverage, regulatory quality, technical readiness, and market demand alignment, without requiring any data to leave the firm's controlled environment. The output is a Data Equity Score and Market Readiness Score for each assessed dataset: a quantified, auditable foundation for every commercial and governance decision that follows. DE Marketplace is DataEquity's AI-driven commercial distribution channel, connecting data asset holders with a pre-qualified pool of buyers for proprietary datasets. The pre-qualification process screens buyers for legal eligibility to use sector-specific regulated data, technical integration capability, and commercial seriousness, materially reducing the due diligence burden on the selling organisation. Together, the two platforms provide the complete pipeline from data estate discovery to completed commercial transaction, designed specifically for the security and governance constraints of regulated industries.


Your Data Has Buyers. The Question Is Whether You Know What It's Worth.

The AI industry's demand for proprietary regulated data is active, funded, and accelerating. The firms that build data discovery, valuation, and commercialisation capabilities in 2026 will enter that market from a position of knowledge; those that wait will be approached by buyers who already understand the value of the asset better than the seller does. Contact the DataEquity team to begin a DataVault assessment and find out what your data estate is worth: https://www.dataequity.io/contact

