PR Stats Collector - Script Summary

Last Updated: 2025-12-12
Location: D:\dump\py\pr_stats_collector.py

Purpose

Collects pull request statistics from all repositories in the majority-dev GitHub organization. The script analyzes merged PRs created in 2025 and generates CSV reports with detailed metrics.

Current State

What It Does

  • Fetches all repos from the majority-dev organization (excluding be-deployments, be-docs, be-devtools)
  • For each repo, collects merged PRs created in 2025
  • Filters out PRs and comments from bots (renovate, GitHub Copilot)
  • Generates individual CSV files per repo in pr_stats_output/ directory
  • Merges all CSVs into a single timestamped file at the end
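
As a concrete illustration of the first step, here is a minimal sketch of paginating the organization's repository list through the GitHub REST API. It assumes a GITHUB_TOKEN environment variable and the requests library; list_org_repos is an illustrative name, not the script's actual code:

import os
import requests

ORG = "majority-dev"
EXCLUDED = {"be-deployments", "be-docs", "be-devtools"}

def list_org_repos(org: str) -> list[str]:
    """Page through GET /orgs/{org}/repos, 100 repos per request."""
    session = requests.Session()
    session.headers["Authorization"] = f"Bearer {os.environ['GITHUB_TOKEN']}"
    names, page = [], 1
    while True:
        resp = session.get(
            f"https://api.github.com/orgs/{org}/repos",
            params={"per_page": 100, "page": page},
        )
        resp.raise_for_status()
        batch = resp.json()
        if not batch:  # empty page means we've seen every repo
            break
        names.extend(repo["name"] for repo in batch)
        page += 1
    return [name for name in names if name not in EXCLUDED]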

Data Collected Per PR

  • Repository name and PR number
  • PR title, author, created/merged timestamps
  • Files changed count
  • Code changes: additions, deletions, total changes
  • Review comments count (excluding bot comments)
  • Top commenter (user with most comments)
  • Approvers (list of users who approved)
  • PR URL
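
An illustrative column layout for the per-PR record; the script's actual CSV column names may differ:

PR_ROW_FIELDS = [
    "repo", "pr_number", "title", "author", "created_at", "merged_at",
    "files_changed", "additions", "deletions", "total_changes",
    "review_comments", "top_commenter", "approvers", "url",
]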

Key Features & Recent Improvements

1. Resume Capability ✅

  • Checks for existing CSV files before processing repos
  • Skips any repo whose pr_stats_output/pr_stats_{repo}.csv already exists
  • Creates empty CSV files even for repos with no PRs (acts as completion marker)
  • Creates marker files on errors to prevent infinite retries
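
The resume check amounts to a file-existence test plus a marker writer. A minimal sketch, with illustrative function names:

import os

OUTPUT_DIR = "pr_stats_output"

def is_done(repo: str) -> bool:
    # A per-repo CSV, even an empty one, marks that repo as completed
    return os.path.exists(os.path.join(OUTPUT_DIR, f"pr_stats_{repo}.csv"))

def mark_done(repo: str) -> None:
    # Touch an empty marker CSV so errored or PR-less repos are not retried
    os.makedirs(OUTPUT_DIR, exist_ok=True)
    open(os.path.join(OUTPUT_DIR, f"pr_stats_{repo}.csv"), "a").close()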

2. Rate Limit Handling ✅

  • Critical fix: the GitHub API allows 5,000 requests per hour for authenticated users
  • Monitors remaining API calls before processing each repo
  • Automatically waits and retries when the rate limit is hit (403 Forbidden errors)
  • Shows remaining API calls and estimated wait times
  • Exponential backoff on transient failures
  • Prevents the 403 errors that previously caused the script to fail
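
A minimal sketch of the rate-limit guard, based on GitHub's documented X-RateLimit-Remaining and X-RateLimit-Reset response headers; the threshold and safety margin are illustrative:

import time
import requests

def wait_if_rate_limited(resp: requests.Response, min_remaining: int = 50) -> None:
    remaining = int(resp.headers.get("X-RateLimit-Remaining", "1"))
    if remaining >= min_remaining:
        return
    reset_at = int(resp.headers.get("X-RateLimit-Reset", "0"))  # Unix epoch seconds
    wait = max(reset_at - time.time(), 0) + 5  # small safety margin
    print(f"{remaining} API calls left; sleeping {wait:.0f}s until reset")
    time.sleep(wait)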

3. Immediate File Writing ✅

  • Critical fix: CSV files are written immediately after each repo is processed
  • Previously the script waited until all repos were done, so an interruption lost all collected data
  • Now safe to Ctrl+C and resume later
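
The pattern is a complete open/write/close cycle per repo, so each file is flushed to disk before the next repo starts. A sketch with illustrative names:

import csv
import os

def write_repo_csv(repo: str, rows: list[dict], fieldnames: list[str]) -> None:
    os.makedirs("pr_stats_output", exist_ok=True)
    path = os.path.join("pr_stats_output", f"pr_stats_{repo}.csv")
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    # The file is closed here, so the data survives a later Ctrl+C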

Usage

Run on all repos (production mode):

uv run python pr_stats_collector.py

Test mode (single repo - be-lithic):

uv run python pr_stats_collector.py --test

Important Notes:

  • Always use uv run python for this project
  • Script can be safely interrupted and resumed
  • Already processed repos will be skipped automatically
  • Rate limit pauses are automatic and expected behavior

Output Structure

pr_stats_output/
├── pr_stats_android-ci.csv          # Individual repo files
├── pr_stats_cicd.csv
├── pr_stats_limkevin-argo-cd.csv
├── pr_stats_team-majority-android.csv
└── ... (more as repos are processed)

pr_stats_all_repos_YYYYMMDD_HHMMSS.csv  # Final merged file
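
The merge step concatenates the per-repo CSVs into the timestamped file, writing a single header row and skipping empty marker files. A sketch, not the script's exact code:

import csv
import glob
from datetime import datetime

def merge_csvs(output_dir: str = "pr_stats_output") -> str:
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    merged_path = f"pr_stats_all_repos_{stamp}.csv"
    with open(merged_path, "w", newline="", encoding="utf-8") as out:
        writer = None
        for path in sorted(glob.glob(f"{output_dir}/pr_stats_*.csv")):
            with open(path, newline="", encoding="utf-8") as f:
                reader = csv.reader(f)
                header = next(reader, None)
                if header is None:
                    continue  # empty marker file for a repo with no PRs
                if writer is None:
                    writer = csv.writer(out)
                    writer.writerow(header)  # write the header only once
                writer.writerows(reader)
    return merged_path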

Known Limitations & Behavior

  1. API Pagination: Currently fetches up to 100 items per request

    • Comments beyond the first 100 on a PR are not captured
    • Files beyond the first 100 on a PR are not captured
    • This should be sufficient for most PRs
  2. Bot Filtering: Excludes users matching any of the following (see the sketch after this list):

    • Contains "renovate" in username
    • Contains "copilot" in username
    • Username is "github-actions[bot]"
  3. Date Filtering: Only PRs created >= 2025-01-01

    • Sorted by creation date descending
    • Stops early when hitting pre-2025 PRs
  4. Merge Status: Only includes merged PRs

    • Open and closed-but-not-merged PRs are skipped
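
A sketch of the filters in items 2-4; the substring rules, cutoff date, and merge check come from the list above, the PR dict shape follows GitHub's REST response fields, and the function names are illustrative:

from datetime import datetime, timezone

CUTOFF = datetime(2025, 1, 1, tzinfo=timezone.utc)

def is_bot(username: str) -> bool:
    name = username.lower()
    return "renovate" in name or "copilot" in name or name == "github-actions[bot]"

def keep_pr(pr: dict) -> bool:
    # Keep only merged, 2025-created PRs authored by humans
    created = datetime.fromisoformat(pr["created_at"].replace("Z", "+00:00"))
    return (
        pr.get("merged_at") is not None
        and created >= CUTOFF
        and not is_bot(pr["user"]["login"])
    )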

Important Learnings

Rate Limiting is Critical

  • The majority-dev org has many repos with lots of PRs
  • Each PR requires multiple API calls (comments, files, reviews)
  • Without rate limit handling, the script fails partway through with 403 errors
  • Lesson: Always implement rate limit checking for batch GitHub API operations
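
A minimal sketch of the exponential backoff mentioned above; attempt counts and delays are illustrative:

import time
import requests

def get_with_backoff(session: requests.Session, url: str, attempts: int = 5, **kwargs) -> requests.Response:
    delay = 2.0
    for _ in range(attempts):
        try:
            resp = session.get(url, timeout=30, **kwargs)
            if resp.status_code == 200:
                return resp
            # A 403 here usually means rate-limited; fall through and retry
        except requests.RequestException:
            pass  # timeout / connection reset: treat as transient
        time.sleep(delay)
        delay *= 2  # 2s, 4s, 8s, 16s, ...
    raise RuntimeError(f"giving up on {url} after {attempts} attempts")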

Incremental Saves are Essential

  • Processing all repos can take hours
  • Network issues, rate limits, or user interruption can occur
  • Saving after each repo prevents data loss
  • Empty marker files prevent wasted retries on error cases

Resume Logic Saves Time

  • With 50+ repos in majority-dev, reprocessing wastes API quota
  • Simple file existence check enables safe resumption
  • Critical for long-running data collection tasks

Troubleshooting

If you see "403 Forbidden" errors:

  • Fixed in the current version: the script waits automatically
  • The rate limit resets every hour
  • The script prints the remaining call count automatically

If script stops/crashes:

  • Just run again - it will skip completed repos
  • Check pr_stats_output/ for what's been completed
  • A marker file (even an empty one) means that repo is done

If you need to reprocess a specific repo:

  • Delete its CSV file: pr_stats_output/pr_stats_{repo}.csv
  • Run script again
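
For example, on Windows (matching the script's location above), replacing {repo} with the repository name:

del pr_stats_output\pr_stats_{repo}.csv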

Current Run Status (as of 2025-12-12)

  • Completed repos: 4 (android-ci, cicd, limkevin-argo-cd, team-majority-android)
  • Running: Full organization scan in progress
  • Next steps: Will process the remaining 50+ repos with automatic rate limit handling