Concordance
Velocity Governance

Feasibility Test: Measuring Governance Signals

Can AI adoption level and code review governance be measured from public repository metadata?

This report documents a feasibility test — not a study — conducted April 13, 2026 across 100 public GitHub repositories. The purpose was to validate that our measurement instruments can detect governance signals from repository metadata, and to identify the methodological requirements for a rigorous Phase I study.

Three instrument capabilities validated

We built a collection pipeline that extracts review governance metrics (comment density, zero-comment PR rate, review latency) and test coverage ratios from GitHub's API, then classifies repositories by AI tool adoption level based on dependency and configuration file analysis. We ran it against 100 public repos to answer one question: do the instruments produce measurable, differentiated signals across cohorts?
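The three governance metrics reduce to simple aggregates over merged-PR records. A minimal sketch of the metric definitions, assuming PR records shaped roughly like GitHub's API response (the field names `review_comments` and `review_latency_hours` are illustrative, not GitHub's exact schema):

```python
from statistics import mean

def governance_metrics(prs):
    """Compute review-governance metrics from a list of merged-PR records.

    Each record is assumed to carry `review_comments` (int) and
    `review_latency_hours` (float, delay until first review) --
    illustrative field names only.
    """
    n = len(prs)
    total_comments = sum(p["review_comments"] for p in prs)
    zero_comment = sum(1 for p in prs if p["review_comments"] == 0)
    return {
        "comment_density": total_comments / n,   # review comments per PR
        "zero_comment_rate": zero_comment / n,   # share of PRs merged with no review comments
        "avg_review_hours": mean(p["review_latency_hours"] for p in prs),
    }
```

In the real pipeline these records would be paginated out of GitHub's REST API; the aggregation step itself is this simple.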

They do. The raw data shows variation across AI adoption levels on every metric we measured. This validates the instrument design — it does not validate any hypothesis about AI's effect on governance. The sample is too small, too imbalanced, and has critical confounds that must be addressed in Phase I.

Instrument captures governance variation

The pipeline successfully extracted review comment density, zero-comment PR rates, and review latency from 10,779 merged pull requests across 100 repos. All metrics showed measurable variation across cohorts, confirming the instruments produce usable signal.

Instrument validated

AI classification needs structural parsing

Substring-based detection of AI dependencies (e.g., "langchain" in requirements.txt) conflates AI products with repos whose teams use AI tools for development. The packages langchain, autogen, and aider are themselves AI tools: their presence means the repo IS an AI product, not that the team uses AI to code. Phase I requires AST-level dependency parsing and developer survey validation.

Known limitation
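The distinction can be sketched in a few lines. A naive substring scan flags any occurrence of a package name anywhere in the file, including comments and forks; parsing each requirement line down to its canonical package name removes that class of false positive. (Simplified illustration only; real Phase I tooling would also cover pyproject.toml, lockfiles, and import-level analysis. The package set is illustrative, not a vetted taxonomy.)

```python
# Illustrative AI-related package names -- not a vetted taxonomy.
AI_PACKAGES = {"langchain", "autogen", "aider", "openai"}

def naive_detect(requirements_text):
    """Substring scan: also matches comments, forks like 'my-langchain-fork'."""
    return any(pkg in requirements_text for pkg in AI_PACKAGES)

def structural_detect(requirements_text):
    """Parse each requirement line down to its canonical package name."""
    found = set()
    for line in requirements_text.splitlines():
        line = line.split("#", 1)[0].strip()     # drop comments
        if not line or line.startswith("-"):     # skip options like -r / -e
            continue
        # Strip version specifiers and extras: "pkg[x]>=1.0" -> "pkg"
        for sep in ("[", "=", ">", "<", "~", "!", ";", " "):
            line = line.split(sep, 1)[0]
        name = line.strip().lower().replace("_", "-")
        if name in AI_PACKAGES:
            found.add(name)
    return found
```

Under the substring scan, a dependency named "my-langchain-fork" counts as AI adoption; under structural parsing it does not.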

Balanced cohort design required for Phase I

This scan produced 88 non-AI repos vs. 12 AI-adopting repos (5 low, 3 medium, 4 high). No statistical comparison is valid at these ratios. Phase I will use stratified sampling to achieve balanced cohorts of 30+ repos per adoption level, with propensity score matching on repo age, team size, and language.

Phase I requirement
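The stratified-sampling step can be sketched as an equal-sized draw per adoption level, with explicit reporting of under-populated strata (the function and its parameters are hypothetical illustrations; Phase I would add propensity-score matching on repo age, team size, and language on top of this):

```python
import random

def stratified_sample(repos, per_level=30, seed=0):
    """Draw an equal-sized cohort per AI adoption level.

    `repos` is a list of dicts with an `ai_level` key. Levels with fewer
    than `per_level` candidates are reported as shortfalls rather than
    silently under-sampled.
    """
    rng = random.Random(seed)
    by_level = {}
    for r in repos:
        by_level.setdefault(r["ai_level"], []).append(r)
    cohorts, shortfalls = {}, {}
    for level, pool in by_level.items():
        if len(pool) < per_level:
            shortfalls[level] = per_level - len(pool)
        cohorts[level] = rng.sample(pool, min(per_level, len(pool)))
    return cohorts, shortfalls
```

Applied to this scan's cohorts (88 / 5 / 3 / 4), the shortfall report would immediately show that every AI-adopting stratum is far below the 30-repo target.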

Uncontrolled metrics by detected AI adoption level

Interpret with caution. The charts below show raw instrument output from an unbalanced, uncontrolled feasibility scan. Cohort sizes range from 3 to 88 repos. AI classification is based on dependency substring matching, which conflates AI products with AI-assisted development. These charts demonstrate that the instruments produce differentiated signal — they do not demonstrate causation or even reliable correlation.

[Charts: Comment Density, Zero-Comment PR Rate, Average Review Hours, and Test-to-Source Ratio, each broken out by detected AI adoption level]

What this test cannot tell us — and what Phase I must address

A feasibility test is valuable precisely because it identifies what a rigorous study requires. Each limitation below maps to a specific Phase I design requirement.

Repository-level detail from feasibility test

Raw metrics for the 12 repos classified as AI-adopting, plus aggregated baseline. Note that "high AI" repos (langchain, autogen, aider, open-interpreter) are AI products — they would be excluded from a Phase I study cohort.

Repository               AI Level  PRs Analyzed  Comment Density  Zero-Comment %  Test Ratio
Non-AI repos (88 total)  None             9,434             0.75           43.1%        0.23
langchain *              High               100             0.22           39.0%        0.37
autogen *                High               100             0.17           61.0%        0.29
open-interpreter *       High               100             0.35           44.0%        0.05
aider *                  High               100             0.20           51.0%        0.46
crewai                   Medium             100             0.29           14.0%        0.52
gpt-engineer             Medium             100             0.38           32.0%        0.00
haystack                 Medium             100             0.79           36.0%        0.72
guidance                 Low                100             0.73            4.0%        0.23
semantic-kernel          Low                100             1.32            4.0%        0.40
lmstudio                 Low                 49             0.55            6.0%        0.00
promptflow               Low                100             0.99           11.0%        0.00
transformers             Low                100             0.94           10.0%        0.23

* AI product repo — would be excluded from Phase I cohort design. Included here to demonstrate instrument detection capability.

What this feasibility test demonstrates

This test validated three things: (1) governance metrics can be extracted at scale from repository metadata via GitHub's API, (2) the instruments produce differentiated signal across AI adoption cohorts, and (3) doing this rigorously — with proper cohort design, confound controls, and longitudinal analysis — requires funded research infrastructure that doesn't exist yet.

The limitations documented above are not flaws in the research question. They are the precise engineering and design challenges that Phase I must solve. A 10,000-repo scan with stratified sampling, structural AI detection, multivariate confound controls, and private organization partnerships would produce the first empirical evidence base for understanding how AI-accelerated development affects engineering governance.