Regulon — T-03 · Training Data Disclosure

Core Obligation

Developers must disclose information about the data used to train AI models. Public disclosure obligations require posting documentation on the developer's website covering dataset sources, data types, volume, IP status, personal information presence, processing methods, collection timeframes, and use of synthetic data. Regulator disclosure obligations require submitting similar documentation to a designated authority, which may treat it as confidential.

Sub-Obligations3 sub-obligations

Name & Description

Enacted

Proposed

T-03.1

Regulator disclosure Developers must provide training data documentation to designated regulatory authorities. This disclosure may be treated as confidential and is not required to be made public.

0 enacted

2 proposed

T-03.2

Public disclosure Developers must post training data documentation publicly on their website before making a system available and before each new release or substantial modification. Substantial modification includes retraining and fine-tuning.

1 enacted

3 proposed

T-03.3

Training Data Governance Disclosure to Deployers Developers must disclose to deployers the data governance measures applied to training datasets, including examination of data source suitability, possible biases, and mitigation steps taken, as part of pre-deployment technical documentation.

1 enacted

3 proposed

Bills That Map This Requirement 10 bills

Bill

Status

Sub-Obligations

Section

CA AB 2013 (Training Data Transparency) CA

Enacted 2026-01-01

T-03.2

Civ. Code § 3111(a)(1)-(12), (b)

Plain Language

Developers of generative AI systems or services available to Californians must publish a detailed training data documentation page on their website. The documentation must include a high-level summary covering twelve enumerated categories: dataset sources/owners, purpose alignment, data point counts (general ranges and estimates permitted), data point types, IP status (copyright/trademark/patent or public domain), whether data was purchased or licensed, presence of personal information (per CCPA definition) or aggregate consumer information, any cleaning or processing performed, collection time periods, dates of first use in development, and whether synthetic data generation was used. This obligation applies to any system released on or after January 1, 2022, with initial documentation due by January 1, 2026, and updated documentation required before each new release or substantial modification. Three exemptions apply: systems solely for security and integrity purposes, systems solely for national airspace aircraft operations, and national security/military/defense systems available only to federal entities. Notably, the statute contains no enforcement mechanism or penalties — compliance is effectively self-enforced. This is one of the earliest U.S. state laws requiring public training data disclosure and is considerably more detailed in its enumerated requirements than the EU AI Act's Article 53 training data summary obligation, though weaker in enforcement.

On or before January 1, 2026, and before each time thereafter that a generative artificial intelligence system or service, or a substantial modification to a generative artificial intelligence system or service, released on or after January 1, 2022, is made publicly available to Californians for use, regardless of whether the terms of that use include compensation, the developer of the system or service shall post on the developer's internet website documentation regarding the data used by the developer to train the generative artificial intelligence system or service, including, but not be limited to, all of the following: (a) A high-level summary of the datasets used in the development of the generative artificial intelligence system or service, including, but not limited to: (1) The sources or owners of the datasets. (2) A description of how the datasets further the intended purpose of the artificial intelligence system or service. (3) The number of data points included in the datasets, which may be in general ranges, and with estimated figures for dynamic datasets. (4) A description of the types of data points within the datasets. For purposes of this paragraph, the following definitions apply: (A) As applied to datasets that include labels, "types of data points" means the types of labels used. (B) As applied to datasets without labeling, "types of data points" refers to the general characteristics. (5) Whether the datasets include any data protected by copyright, trademark, or patent, or whether the datasets are entirely in the public domain. (6) Whether the datasets were purchased or licensed by the developer. (7) Whether the datasets include personal information, as defined in subdivision (v) of Section 1798.140. (8) Whether the datasets include aggregate consumer information, as defined in subdivision (b) of Section 1798.140. (9) Whether there was any cleaning, processing, or other modification to the datasets by the developer, including the intended purpose of those efforts in relation to the artificial intelligence system or service. (10) The time period during which the data in the datasets were collected, including a notice if the data collection is ongoing. (11) The dates the datasets were first used during the development of the artificial intelligence system or service. (12) Whether the generative artificial intelligence system or service used or continuously uses synthetic data generation in its development. A developer may include a description of the functional need or desired purpose of the synthetic data in relation to the intended purpose of the system or service. (b) A developer shall not be required to post documentation regarding the data used to train a generative artificial intelligence system or service for any of the following: (1) A generative artificial intelligence system or service whose sole purpose is to help ensure security and integrity. For purposes of this paragraph, "security and integrity" has the same meaning as defined in subdivision (ac) of Section 1798.140, except as applied to any developer or user and not limited to businesses, as defined in subdivision (d) of that section. (2) A generative artificial intelligence system or service whose sole purpose is the operation of aircraft in the national airspace. (3) A generative artificial intelligence system or service developed for national security, military, or defense purposes that is made available only to a federal entity.

CO SB 189 (ADMT in Consequential Decisions) CO

Pending 2027-01-01

T-03.3

C.R.S. § 6-1-1702(1)(a)-(e)

Plain Language

Developers must provide each deployer of their covered ADMT with technical documentation that is reasonably understandable and that protects trade secrets. The documentation must cover: intended uses and known harmful uses, categories of training data (including personal data), known limitations and risks, instructions for appropriate use and human review, and information the deployer needs to comply with its own disclosure obligations. If any information is withheld for trade secret or legal protection reasons, the developer must notify the deployer.

ON AND AFTER JANUARY 1, 2027, A DEVELOPER SHALL MAKE AVAILABLE TO EACH DEPLOYER OF A COVERED ADMT DEVELOPED BY THE DEVELOPER, IN A FORM AND MANNER THAT IS REASONABLY UNDERSTANDABLE TO A DEPLOYER AND THAT PROTECTS TRADE SECRETS OR INFORMATION PROTECTED FROM DISCLOSURE BY STATE OR FEDERAL LAW: (a) A GENERAL STATEMENT DESCRIBING THE INTENDED USES AND KNOWN HARMFUL OR INAPPROPRIATE USES OF THE COVERED ADMT; (b) A DESCRIPTION OF THE CATEGORIES OF DATA, INCLUDING PERSONAL DATA, USED TO TRAIN THE COVERED ADMT, TO THE EXTENT KNOWN; (c) KNOWN LIMITATIONS OF THE COVERED ADMT, INCLUDING KNOWN RISKS AND CIRCUMSTANCES IN WHICH THE COVERED ADMT SHOULD NOT BE USED; (d) INSTRUCTIONS FOR THE DEPLOYER'S APPROPRIATE USE, MONITORING, AND MEANINGFUL HUMAN REVIEW, WHERE APPLICABLE; (e) INFORMATION REASONABLY NECESSARY FOR THE DEPLOYER TO COMPLY WITH SECTION 6-1-1704. IF INFORMATION IS WITHHELD, THE DEVELOPER SHALL NOTIFY THE DEPLOYER.

CO SB 25B-004 (AI Transparency Amendments) CO

Enacted 2026-06-30

T-03.3

C.R.S. § 6-1-1702(2)-(3)(a)

Plain Language

Developers must provide deployers and downstream developers with the documentation and information — such as model cards, dataset cards, and other impact assessment materials — necessary for the deployer or its contracted third party to complete a required impact assessment under § 6-1-1703(3). This is a 'to the extent feasible' obligation. The documentation must be provided at or before the point when the system is made available. This is developer-to-deployer disclosure, not public-facing.

(2) On and after June 30, 2026, and except as provided in subsection (6) of this section, a developer of a high-risk artificial intelligence system shall make available to the deployer or other developer of the high-risk artificial intelligence system: (3) (a) Except as provided in subsection (6) of this section, a developer that offers, sells, leases, licenses, gives, or otherwise makes available to a deployer or other developer a high-risk artificial intelligence system on or after June 30, 2026, shall make available to the deployer or other developer, to the extent feasible, the documentation and information, through artifacts such as model cards, dataset cards, or other impact assessments, necessary for a deployer, or for a third party contracted by a deployer, to complete an impact assessment pursuant to section 6-1-1703 (3).

GA SB 167 (Automated Decision Systems) GA

Pending 2025-07-01

T-03.3

O.C.G.A. § 10-16-2(c)

Plain Language

When a developer distributes an automated decision system to a deployer or another developer, it must share all the documentation it provides to the AG — including training data governance measures and impact assessment artifacts such as model cards and data set cards — to the extent feasible. This ensures deployers have sufficient information to complete their own impact assessments. A developer that is also the deployer of its own system is exempt from generating this documentation unless the system is provided to an unaffiliated deployer.

(1) Except as provided in subsection (f) of this Code section, a developer that offers, sells, leases, licenses, gives, or otherwise makes available to a deployer or other developer an automated decision system shall make available to the deployer or other developer, to the extent feasible, all of the information required to be provided to the Attorney General by subsection (b) of this Code section, as well as the documentation and information, through artifacts such as model cards, data set cards, or other impact assessments, necessary for a deployer or third party contracted by a deployer to complete an impact assessment pursuant to subsection (e) of Code Section 10-16-3. (2) A developer that also serves as a deployer for an automated decision system is not required to generate the documentation required by this subsection unless the automated decision system is provided to an unaffiliated entity acting as a deployer.

NY A 6578 (AI Training Data Transparency Act) NY

Pending 2027-01-01

T-03.2

Gen. Bus. Law § 1432(1)(a)-(l), (2)

Plain Language

Developers of generative AI models or services made publicly available to New Yorkers — whether free or paid — must post detailed training data documentation on their website by January 1, 2027, and again before each subsequent release or substantial modification of any model released on or after January 1, 2022. The required disclosure is a high-level summary covering twelve categories: dataset sources/owners, how datasets serve the model's purpose, data point counts (ranges permitted), data point types, IP status (copyright/trademark/patent or public domain), whether data was purchased or licensed, whether personal information is included, whether aggregate consumer information is included, cleaning/processing modifications, data collection timeframes, dates datasets were first used, and whether synthetic data generation was used. Two narrow exemptions apply: models solely for national airspace aircraft operations, and national security/military/defense models available only to federal entities. Note that 'training' is defined broadly to include testing, validating, and fine tuning.

1. On or before January first, two thousand twenty-seven, and prior to each time thereafter that a generative artificial intelligence model or service, or a substantial modification to a generative artificial intelligence model or service, released on or after January first, two thousand twenty-two, is made publicly available to New Yorkers for use, regardless of whether the terms of such use include compensation, the developer of such model or service shall post on the developer's website documentation regarding the data used by the developer to train the generative artificial intelligence model or service, including a high-level summary of the datasets used in the development of the generative artificial intelligence model or service, including, but not limited to: (a) the sources or owners of the datasets; (b) a description of how the datasets further the intended purpose of the artificial intelligence model or service; (c) the number of data points included in the datasets, which may be in general ranges, and with estimated figures for dynamic datasets; (d) a description of the types of data points within the datasets. For purposes of this paragraph, the following definitions apply: (i) as applied to datasets that include labels, "types of data points" means the types of labels used; and (ii) as applied to datasets without labeling, "types of data points" refers to the general characteristics; (e) whether the datasets include any data protected by copyright, trademark, or patent, or whether the datasets are entirely in the public domain; (f) whether the datasets were purchased or licensed by the developer; (g) whether the datasets include personal information or personal identifying information, as defined in section eight hundred ninety-nine-aaa of this chapter; (h) whether the datasets include aggregate consumer information; (i) whether there was any cleaning, processing, or other modification to the datasets by the developer, including the intended purpose of those efforts in relation to the artificial intelligence model or service; (j) the time period during which the data in the datasets were collected, including a notice if the data collection is ongoing; (k) the dates the datasets were first used during the development of the artificial intelligence model or service; and (l) whether the generative artificial intelligence model or service used or continuously uses synthetic data generation in its development. A developer may include a description of the functional need or desired purpose of the synthetic data in relation to the intended purpose of the model or service. 2. A developer shall not be required to post documentation regarding the data used to train a generative artificial intelligence model or service for any of the following: (a) A generative artificial intelligence model or service whose sole purpose is the operation of aircraft in the national airspace; or (b) A generative artificial intelligence model or service developed for national security, military, or defense purposes that is made available only to a federal entity.

NY A 6578 (AI Training Data Transparency Act) NY

Pending 2027-01-01

Gen. Bus. Law § 1433(1)(a)-(f), (2)

Plain Language

Any entity — including persons, partnerships, government agencies, or corporations — that develops or substantially modifies a generative AI model using data substantially derived from its own employees or contractors must disclose specified information to each employee whose data was used. This obligation applies regardless of whether the resulting model is made publicly available, meaning purely internal AI tools are covered. Required disclosures to affected employees include: the model's intended purpose, how the datasets serve that purpose, types of data points, whether personal information is included, dates the datasets were first used, and the data collection timeframe. The same narrow exemptions apply as for the public disclosure obligation (national airspace operations and national security/defense models for federal entities only). Notably, the obligated entity in this section is not limited to the defined term 'developer' — it uses a broader formulation covering any entity that uses employee data, potentially capturing entities that would not qualify as developers under § 1431(2) because the model is not made available to the public.

1. Any person, partnership, state or local government agency, or corporation that designs, codes, produces, or substantially modifies a generative artificial intelligence model or service using data of which a substantial part is derived from individuals employed or contracted by the entity, regardless if whether the model is made publicly available, shall ensure that the following information is disclosed to each employee whose data is used to train the artificial intelligence model: (a) the intended purpose of the artificial intelligence model or service; (b) a description of how the collected datasets further the intended purpose of the artificial intelligence model or service; (c) a description of the types of data points within the datasets; (d) whether the datasets include personal information or personal identifying information, as defined in section eight hundred ninety-nine-aaa of this chapter; (e) the dates the datasets were first used during the development of the artificial intelligence model or service; and (f) the time period during which the data in the datasets were collected, including a notice if the data collection is ongoing. 2. An entity that uses employee or contractor data to design, code, produce, or substantially modify a generative artificial intelligence model or service shall not be required to disclose the information required by this section if the model or service: (a) is solely intended to be used in the operation of aircraft in the national airspace; or (b) is developed for national security, military, or defense purposes and only made available to a federal entity.

NY S 6955 (AI Training Data Transparency Act) NY

Pending 2027-01-01

T-03.2

Gen. Bus. Law § 1432(1)(a)-(l), (2)

Plain Language

Developers of generative AI models or services that are made publicly available to New Yorkers — whether free or paid — must post detailed training data documentation on the developer's website. The initial deadline is January 1, 2027, and the obligation recurs before each subsequent release or substantial modification of any generative AI model or service released on or after January 1, 2022. The required documentation includes a high-level summary covering twelve enumerated categories: dataset sources or owners, how the data furthers the model's purpose, data point counts (ranges permitted), data types, IP status (copyright/trademark/patent or public domain), whether data was purchased or licensed, presence of personal information, presence of aggregate consumer information, data cleaning or processing applied, data collection timeframes, dates datasets were first used in development, and synthetic data use. Data point counts may be expressed in general ranges, and dynamic datasets may use estimated figures. Two narrow exemptions apply: aviation-only AI and national security AI available exclusively to federal entities. The bill contains no enforcement mechanism or penalties, creating a disclosure obligation without a specified consequence for noncompliance.

1. On or before January first, two thousand twenty-seven, and prior to each time thereafter that a generative artificial intelligence model or service, or a substantial modification to a generative artificial intelligence model or service, released on or after January first, two thousand twenty-two, is made publicly available to New Yorkers for use, regardless of whether the terms of such use include compensation, the developer of such model or service shall post on the developer's website documentation regarding the data used by the developer to train the generative artificial intelligence model or service, including a high-level summary of the datasets used in the development of the generative artificial intelligence model or service, including, but not limited to:
(a) the sources or owners of the datasets;
(b) a description of how the datasets further the intended purpose of the artificial intelligence model or service;
(c) the number of data points included in the datasets, which may be in general ranges, and with estimated figures for dynamic datasets;
(d) a description of the types of data points within the datasets. For purposes of this paragraph, the following definitions apply:
(i) as applied to datasets that include labels, "types of data points" means the types of labels used; and
(ii) as applied to datasets without labeling, "types of data points" refers to the general characteristics;
(e) whether the datasets include any data protected by copyright, trademark, or patent, or whether the datasets are entirely in the public domain;
(f) whether the datasets were purchased or licensed by the developer;
(g) whether the datasets include personal information or personal identifying information, as defined in section eight hundred ninety-nine-aaa of this chapter;
(h) whether the datasets include aggregate consumer information;
(i) whether there was any cleaning, processing, or other modification to the datasets by the developer, including the intended purpose of those efforts in relation to the artificial intelligence model or service;
(j) the time period during which the data in the datasets were collected, including a notice if the data collection is ongoing;
(k) the dates the datasets were first used during the development of the artificial intelligence model or service; and
(l) whether the generative artificial intelligence model or service used or continuously uses synthetic data generation in its development. A developer may include a description of the functional need or desired purpose of the synthetic data in relation to the intended purpose of the model or service.
2. A developer shall not be required to post documentation regarding the data used to train a generative artificial intelligence model or service for any of the following:
(a) A generative artificial intelligence model or service whose sole purpose is the operation of aircraft in the national airspace; or
(b) A generative artificial intelligence model or service developed for national security, military, or defense purposes that is made available only to a federal entity.

PA HB 2288 (AI Training Disclosure) PA

Pending 2026-06-11

Section 9.5(a)-(c)

Plain Language

Any platform that collects user-generated content for AI training must inform each user of that fact. The disclosure must be presented at sign-up and must be separate from the platform's terms of service — it cannot be buried in the TOS. Users must affirmatively acknowledge receipt of the disclosure before they are permitted to post any content. This is a notice-and-acknowledgment obligation, not a consent or opt-out mechanism — the statute requires acknowledgment, not agreement. The obligation applies broadly to any application, website, or interface that collects user content for AI training purposes.

(a) A platform that collects user-generated content for the purpose of training artificial intelligence algorithms shall disclose to the user that the user-generated content may be used for the purpose of training artificial intelligence. (b) The disclosure shall be presented to the user at the time the user signs up for the platform and shall be separate from the platform's terms of service agreement. (c) Each user of a platform must acknowledge receipt of the disclosure before being allowed to post user-generated content on the platform.

RI H 7190 (AI Use by Health Insurers) RI

Pending 2026-01-21

T-03.1T-03.3

R.I. Gen. Laws § 27-84-3(b)(1)-(2)

Plain Language

OHIC and DBR must publish an initial report within 18 months of the chapter's effective date, and annually thereafter, to the governor, senate president, and house speaker. The report must cover, for each insurer: AI model types, the role of AI in claims and coverage decision-making, training data governance and bias mitigation measures, and detailed performance metrics including claim acceptance/denial rates, average review times, appeal rates, and denial reversal rates. While the direct obligation falls on the regulators, this creates an indirect compliance obligation for insurers to supply this information — particularly the training data governance measures, bias analysis, and performance metrics — to OHIC/DBR in a form and on a schedule that enables timely annual reporting. Insurers should anticipate recurring data requests tied to the reporting cycle.

(1) DBR/OHIC shall provide an initial report to the governor, the senate president and the speaker of the house on the use of artificial intelligence by health insurers within eighteen (18) months of the effective date of this chapter and annually thereafter. (2) The annual report shall state how health insurers use artificial intelligence to manage claims and coverage. The report shall state, for each insurer: (i) The types of artificial intelligence models used; (ii) The role of artificial intelligence in the decision-making process to approve or deny healthcare claims or coverage whenever artificial intelligence is used to make, or is a substantial factor in making, a decision on healthcare claims or coverage; (iii) Information regarding training, testing, and risk management including data governance measures used to cover the training data sets and the measures used to examine the suitability of data sources, possible biases and appropriate mitigation; and (iv) Performance metrics including: number of claims; percentage of claims accepted and denied; the average time claim reviewers and medical professional reviewers spend on each claim and on denials of claims; percentage of claims appealed; and percentage of denials reversed.

RI S 2010 (Health Insurer AI Transparency) RI

Pending 2026-01-09

T-03.1

R.I. Gen. Laws § 27-84-3(b)(1)-(2)

Plain Language

OHIC/DBR must produce an initial report to the governor, senate president, and speaker of the house within 18 months of the act's effective date, and annually thereafter, on how health insurers use AI to manage claims and coverage. While this provision is directed at the regulators, it creates a derivative compliance obligation for insurers: the report must include, for each insurer, the types of AI models used, AI's role in decision-making, training data governance measures (including bias assessment and mitigation), and detailed performance metrics such as claim acceptance/denial rates, reviewer time per claim, appeal rates, and denial reversal rates. Insurers must be prepared to produce all of this information to OHIC/DBR on an ongoing basis to support the annual report.

(1) DBR/OHIC shall provide an initial report to the governor, the senate president and the speaker of the house on the use of artificial intelligence by health insurers within eighteen (18) months of the effective date of this chapter and annually thereafter. (2) The annual report shall state how health insurers use artificial intelligence to manage claims and coverage. The report shall state, for each insurer: (i) The types of artificial intelligence models used; (ii) The role of artificial intelligence in the decision-making process to approve or deny healthcare claims or coverage whenever artificial intelligence is used to make, or is a substantial factor in making, a decision on healthcare claims or coverage; (iii) Information regarding training, testing, and risk management including data governance measures used to cover the training data sets and the measures used to examine the suitability of data sources, possible biases and appropriate mitigation; and (iv) Performance metrics including: number of claims; percentage of claims accepted and denied; the average time claim reviewers and medical professional reviewers spend on each claim and on denials of claims; percentage of claims appealed; and percentage of denials reversed.

WA HB 1168 (AI Training Data Transparency) WA

Pending 2026-01-01

T-03.2

Sec. 2(1)(a)(i)-(x), (b)

Plain Language

Developers of generative AI systems or services must post detailed training data documentation on their website by January 1, 2026, and again before each subsequent release or substantial modification of a system released on or after January 1, 2022. The documentation must include a high-level summary covering ten enumerated categories: dataset sources/owners, how datasets relate to the system's intended purpose, data point counts (ranges and estimates permitted), data point types, whether data was purchased/licensed/public, presence of personal information, presence of aggregate consumer information, cleaning or processing applied, training or update dates, and use of synthetic data generation. The definition of 'training' is broad — it includes testing, validating, and fine-tuning. This is a public disclosure obligation, not a regulator-only submission.

(1) On or before January 1, 2026, and before each time thereafter that a generative artificial intelligence system or service, or a substantial modification to a generative artificial intelligence system or service, released on or after January 1, 2022, is made publicly available to Washingtonians for use, regardless of whether the terms of that use include compensation, the developer of the system or service shall post on the developer's internet website documentation regarding the data used by the developer to train the generative artificial intelligence system or service including, but not limited to: (a) A high-level summary of the datasets used in the development of the generative artificial intelligence system or service including, but not limited to: (i) The sources or owners of the datasets; (ii) A description of how the datasets further the intended purpose of the generative artificial intelligence system or service; (iii) The number of data points included in the datasets, which may be in general ranges, and with estimated figures for dynamic datasets; (iv) A description of the types of data points within the datasets; (v) Whether the datasets were purchased or licensed by the developer or if the datasets were publicly available; (vi) Whether the datasets include personal information, as defined in RCW 19.373.010; (vii) Whether the datasets include aggregate consumer information; (viii) Whether there was any cleaning, processing, or other modification to the datasets by the developer, including the intended purpose of those efforts in relation to the generative artificial intelligence system or service; (ix) The dates the datasets were first trained or the date of the last significant update to the datasets during the development of the generative artificial intelligence system or service; and (x) Whether the generative artificial intelligence system or service used or continuously uses synthetic data generation in its development. A developer may include a description of the functional need or desired purpose of the synthetic data in relation to the intended purpose of the system or service. (b) For purposes of this subsection, the following definitions apply: (i) As applied to datasets that include labels, "types of data points" means the types of labels used; and (ii) As applied to datasets without labeling, "types of data points" refers to the general characteristics.

WA HB 1168 (AI Training Data Transparency) WA

Pending 2026-01-01

T-03.2

Sec. 2(2)(a)-(c)

Plain Language

Three categories of generative AI systems are exempt from the training data disclosure requirement: (1) systems whose sole purpose is security and integrity (broadly defined to cover cybersecurity, fraud detection, law enforcement assistance, and physical safety); (2) systems whose sole purpose is aircraft operation in national airspace; and (3) systems developed for national security, military, or defense purposes that are available only to federal entities. The 'sole purpose' limitation is narrow — a multi-purpose system that also serves security functions would not qualify for the exemption. This provision modifies the obligation in Sec. 2(1) and creates no independent compliance obligation.

(2) A developer is not required to post documentation regarding the data used to train a generative artificial intelligence system or service for any of the following: (a) A generative artificial intelligence system or service whose sole purpose is to help ensure security and integrity; (b) A generative artificial intelligence system or service whose sole purpose is the operation of aircraft in the national airspace; and (c) A generative artificial intelligence system or service developed for national security, military, or defense purposes that is made available only to a federal entity.