Feasibility Test: Measuring Governance Signals
Can AI adoption level and code review governance be measured from public repository metadata?
This report documents a feasibility test — not a study — conducted April 13, 2026 across 100 public GitHub repositories. The purpose was to validate that our measurement instruments can detect governance signals from repository metadata, and to identify the methodological requirements for a rigorous Phase I study.
Three instrument capabilities validated
We built a collection pipeline that extracts review governance metrics (comment density, zero-comment PR rate, review latency) and test coverage ratios from GitHub's API, then classifies repositories by AI tool adoption level based on dependency and configuration file analysis. We ran it against 100 public repos to answer one question: do the instruments produce measurable, differentiated signals across cohorts?
They do. The raw data shows variation across AI adoption levels on every metric we measured. This validates the instrument design — it does not validate any hypothesis about AI's effect on governance. The sample is too small, too imbalanced, and has critical confounds that must be addressed in Phase I.
Instrument captures governance variation
The pipeline successfully extracted review comment density, zero-comment PR rates, and review latency from 10,779 merged pull requests across 100 repos. All metrics showed measurable variation across cohorts, confirming the instruments produce usable signal.
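The extraction step can be sketched as follows, assuming a simplified merged-PR record (field names such as `review_comments` and `first_review_at` are placeholders, not the exact GitHub API schema):

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median

@dataclass
class MergedPR:
    """Simplified merged-PR record; real data would come from GitHub's
    pulls/reviews endpoints (these field names are assumptions)."""
    review_comments: int
    created_at: datetime
    first_review_at: datetime | None  # None if merged with no review

def governance_metrics(prs: list[MergedPR]) -> dict:
    """Comment density, zero-comment PR rate, and median review latency."""
    total = len(prs)
    latencies = [
        (p.first_review_at - p.created_at).total_seconds() / 3600
        for p in prs
        if p.first_review_at is not None
    ]
    return {
        "comment_density": sum(p.review_comments for p in prs) / total,   # comments per PR
        "zero_comment_rate": sum(p.review_comments == 0 for p in prs) / total,
        "median_review_latency_h": median(latencies) if latencies else None,
    }
```

Latency is reported as a median in hours because review-response times are heavily right-skewed; a mean would be dominated by a few stale PRs.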
AI classification needs structural parsing
Substring-based detection of AI dependencies (e.g., matching "langchain" in requirements.txt) conflates AI products with repos that use AI tools for development. Langchain, autogen, and aider are themselves AI tools: their presence means the repo IS an AI product, not that the team uses AI to write its code. Phase I requires AST-level dependency parsing and developer survey validation.
Balanced cohort design required for Phase I
This scan produced 88 non-AI repos vs. 12 AI-adopting repos (5 low, 3 medium, 4 high). No statistical comparison is valid at these ratios. Phase I will use stratified sampling to achieve balanced cohorts of 30+ repos per adoption level, with propensity score matching on repo age, team size, and language.
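The stratified draw itself is straightforward; a sketch, where `level_of` and the `per_level` target are placeholders for the Phase I sampling design:

```python
import random
from collections import defaultdict

def stratified_sample(repos, level_of, per_level=30, seed=42):
    """Draw up to `per_level` repos from each AI adoption stratum.

    `level_of` maps a repo to its adoption level ("none", "low",
    "medium", "high"); a seeded RNG keeps the draw reproducible.
    """
    strata = defaultdict(list)
    for repo in repos:
        strata[level_of(repo)].append(repo)
    rng = random.Random(seed)
    return {
        level: rng.sample(members, min(per_level, len(members)))
        for level, members in strata.items()
    }
```

The hard part is not the sampling but filling the strata: as the feasibility counts show, candidate discovery must surface far more than 3–5 repos per adoption level before a balanced draw is possible.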
Uncontrolled metrics by detected AI adoption level
Interpret with caution. The charts below show raw instrument output from an unbalanced, uncontrolled feasibility scan. Cohort sizes range from 3 to 88 repos. AI classification is based on dependency substring matching, which conflates AI products with AI-assisted development. These charts demonstrate that the instruments produce differentiated signal — they do not demonstrate causation or even reliable correlation.
What this test cannot tell us — and what Phase I must address
A feasibility test is valuable precisely because it identifies what a rigorous study requires. Each limitation below maps to a specific Phase I design requirement.
- **Cohort contamination.** The "high AI" cohort contains repos that ARE AI products (langchain, autogen, aider, open-interpreter), not repos where developers use AI tools. These projects have fundamentally different governance patterns for structural reasons unrelated to AI-assisted development. *Phase I design:* structural dependency parsing (AST-level) to distinguish AI-as-product from AI-as-tool; developer survey validation for a subset of repos; explicit exclusion criteria for AI product repos.
- **Sample imbalance.** 88 repos in the baseline cohort vs. 3–5 in each AI-adopting cohort. No statistical test is valid at these ratios; any apparent pattern could be driven by a single outlier repo. *Phase I design:* stratified sampling targeting 30+ repos per AI adoption level; power analysis to determine the minimum sample size for target effect sizes; propensity score matching across confounds.
- **Uncontrolled confounds.** No controls for repo age, team size, organizational maturity, language ecosystem norms, PR size distribution, or contributor count. Any observed governance difference could be explained by these factors rather than by AI adoption. *Phase I design:* collect and control for repo age, team size, PR size distribution, contributor count, organizational affiliation, and language; multivariate regression to isolate the AI adoption effect.
- **Single time-point snapshot.** Data were collected at one moment, so we cannot distinguish whether governance patterns preceded AI adoption, co-evolved with it, or are unrelated trends. *Phase I design:* longitudinal analysis collecting governance metrics at multiple time points relative to the AI tool adoption date; an interrupted time series design for repos with clear adoption events.
- **Public repos only.** Public open-source repos may have fundamentally different governance cultures than private enterprise codebases, which are the primary target population for commercial governance tooling. *Phase I design:* partnerships with 3–5 private engineering organizations; comparative analysis of public vs. private governance patterns within matched cohorts.
- **Discovery bias.** Repos were found via hand-curated search queries (e.g., "AI coding assistant", "copilot integration"), which biases toward repos that self-identify with AI tooling rather than representing the broader population. *Phase I design:* systematic sampling from GitHub's full repository index, stratified by language and star count, with AI detection applied post-selection to avoid discovery bias.
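The matching step referenced above can be illustrated with a simplified greedy nearest-neighbour pairing on standardized raw covariates, a stand-in for full propensity score modelling (the covariate names and dict-based repo records are hypothetical):

```python
def match_controls(treated, controls, covariates, k=1):
    """Pair each treated (AI-adopting) repo with its k closest controls.

    Repos are dicts of numeric covariates (e.g. {"age_years": 3,
    "team_size": 12}). Covariates are z-scored over the pooled sample
    so no single scale dominates the Euclidean distance.
    """
    pool = treated + controls
    stats = {}
    for c in covariates:
        vals = [r[c] for r in pool]
        mean = sum(vals) / len(vals)
        std = (sum((v - mean) ** 2 for v in vals) / len(vals)) ** 0.5
        stats[c] = (mean, std or 1.0)  # guard against zero variance

    def z(repo):
        return [(repo[c] - stats[c][0]) / stats[c][1] for c in covariates]

    available, pairs = list(controls), []
    for t in treated:
        tz = z(t)
        available.sort(key=lambda r: sum((a - b) ** 2 for a, b in zip(z(r), tz)))
        pairs.append((t, available[:k]))   # greedy: take the k nearest
        available = available[k:]          # matching without replacement
    return pairs
```

Greedy matching without replacement is order-dependent and inferior to optimal or propensity-based matching; it is shown here only to make the design requirement concrete.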
Repository-level detail from feasibility test
Raw metrics for the 12 repos classified as AI-adopting, plus aggregated baseline. Note that "high AI" repos (langchain, autogen, aider, open-interpreter) are AI products — they would be excluded from a Phase I study cohort.
| Repository | AI Level | PRs Analyzed | Comment Density | Zero-Comment % | Test Ratio |
|---|---|---|---|---|---|
| Non-AI repos (88 total) | None | 9,434 | 0.75 | 43.1% | 0.23 |
| langchain * | High | 100 | 0.22 | 39.0% | 0.37 |
| autogen * | High | 100 | 0.17 | 61.0% | 0.29 |
| open-interpreter * | High | 100 | 0.35 | 44.0% | 0.05 |
| aider * | High | 100 | 0.20 | 51.0% | 0.46 |
| crewai | Medium | 100 | 0.29 | 14.0% | 0.52 |
| gpt-engineer | Medium | 100 | 0.38 | 32.0% | 0.00 |
| haystack | Medium | 100 | 0.79 | 36.0% | 0.72 |
| guidance | Low | 100 | 0.73 | 4.0% | 0.23 |
| semantic-kernel | Low | 100 | 1.32 | 4.0% | 0.40 |
| lmstudio | Low | 49 | 0.55 | 6.0% | 0.00 |
| promptflow | Low | 100 | 0.99 | 11.0% | 0.00 |
| transformers | Low | 100 | 0.94 | 10.0% | 0.23 |
* AI product repo — would be excluded from Phase I cohort design. Included here to demonstrate instrument detection capability.
What this feasibility test demonstrates
This test validated three things: (1) governance metrics can be extracted at scale from repository metadata via GitHub's API, (2) the instruments produce differentiated signal across AI adoption cohorts, and (3) doing this rigorously — with proper cohort design, confound controls, and longitudinal analysis — requires funded research infrastructure that doesn't exist yet.
The limitations documented above are not flaws in the research question. They are the precise engineering and design challenges that Phase I must solve. A 10,000-repo scan with stratified sampling, structural AI detection, multivariate confound controls, and private organization partnerships would produce the first empirical evidence base for understanding how AI-accelerated development affects engineering governance.