# PR Stats Collector - Script Summary
**Last Updated:** 2025-12-12
**Location:** `D:\dump\py\pr_stats_collector.py`
## Purpose

Collects pull request statistics from all repositories in the `majority-dev` GitHub organization. The script analyzes merged PRs created in 2025 and generates CSV reports with detailed metrics.
## Current State

### What It Does

- Fetches all repos from the `majority-dev` organization, excluding `be-deployments`, `be-docs`, and `be-devtools` (see the sketch after this list)
- For each repo, collects merged PRs created in 2025
- Filters out PRs and comments from bots (renovate, GitHub Copilot)
- Generates individual CSV files per repo in the `pr_stats_output/` directory
- Merges all CSVs into a single timestamped file at the end
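A minimal sketch of the repo-listing step, assuming the script uses `requests` against the GitHub REST API with a token in a `GITHUB_TOKEN` environment variable; the constant and function names here are illustrative, not confirmed details of the actual script:

```python
import os
import requests

API = "https://api.github.com"
ORG = "majority-dev"
EXCLUDED_REPOS = {"be-deployments", "be-docs", "be-devtools"}
HEADERS = {"Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}"}  # assumed token env var

def list_repos() -> list[str]:
    """Page through the org's repos, skipping the excluded ones."""
    repos, page = [], 1
    while True:
        resp = requests.get(
            f"{API}/orgs/{ORG}/repos",
            headers=HEADERS,
            params={"per_page": 100, "page": page},  # 100 is the API maximum per page
        )
        resp.raise_for_status()
        batch = resp.json()
        if not batch:
            break
        repos.extend(r["name"] for r in batch if r["name"] not in EXCLUDED_REPOS)
        page += 1
    return repos
```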
### Data Collected Per PR
- Repository name and PR number
- PR title, author, created/merged timestamps
- Files changed count
- Code changes: additions, deletions, total changes
- Review comments count (excluding bot comments)
- Top commenter (user with most comments)
- Approvers (list of users who approved)
- PR URL
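One plausible set of CSV columns matching the fields above (the actual header names in the script may differ):

```python
# Illustrative column order for the per-PR rows; the script's real header
# names are not confirmed here.
CSV_FIELDS = [
    "repo", "pr_number", "title", "author",
    "created_at", "merged_at",
    "files_changed", "additions", "deletions", "total_changes",
    "review_comments", "top_commenter", "approvers", "url",
]
```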
## Key Features & Recent Improvements

### 1. Resume Capability ✅

- Checks for existing CSV files before processing repos
- Skips any repo that already has `pr_stats_output/pr_stats_{repo}.csv`
- Creates empty CSV files even for repos with no PRs (acts as a completion marker)
- Creates marker files on errors to prevent infinite retries
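A minimal sketch of the resume check, assuming the `pr_stats_output/pr_stats_{repo}.csv` naming described above; the helper names are hypothetical:

```python
from pathlib import Path

OUTPUT_DIR = Path("pr_stats_output")

def already_done(repo: str) -> bool:
    """A repo is skipped when its CSV already exists on disk."""
    return (OUTPUT_DIR / f"pr_stats_{repo}.csv").exists()

def write_error_marker(repo: str) -> None:
    # An empty CSV marks the repo as handled (no PRs, or failed beyond
    # retry), so the next run does not reprocess it.
    OUTPUT_DIR.mkdir(exist_ok=True)
    (OUTPUT_DIR / f"pr_stats_{repo}.csv").touch()
```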
### 2. Rate Limit Handling ✅

- Critical fix: the GitHub API has a 5,000 requests/hour limit
- Monitors remaining API calls before processing each repo
- Automatically waits and retries when rate limit is hit (403 Forbidden errors)
- Shows remaining API calls and estimated wait times
- Exponential backoff on transient failures
- Prevents the 403 errors that previously caused the script to fail
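A sketch of what this guard could look like, using the standard GitHub REST rate-limit headers (`X-RateLimit-Remaining`, `X-RateLimit-Reset`); the actual script may structure it differently:

```python
import time
import requests

def wait_if_rate_limited(resp: requests.Response, min_remaining: int = 50) -> None:
    """Sleep until the quota resets when remaining calls drop too low."""
    remaining = int(resp.headers.get("X-RateLimit-Remaining", "1"))
    if remaining < min_remaining:
        reset_at = int(resp.headers.get("X-RateLimit-Reset", "0"))
        wait = max(reset_at - time.time(), 0) + 5  # small buffer past the reset
        print(f"{remaining} API calls left; sleeping {wait:.0f}s until reset")
        time.sleep(wait)

def get_with_backoff(url: str, headers: dict, params: dict | None = None, retries: int = 5):
    for attempt in range(retries):
        resp = requests.get(url, headers=headers, params=params)
        if resp.status_code == 403 and resp.headers.get("X-RateLimit-Remaining") == "0":
            wait_if_rate_limited(resp, min_remaining=1)  # limit hit: wait it out
            continue
        if resp.status_code >= 500:
            time.sleep(2 ** attempt)  # exponential backoff on transient failures
            continue
        resp.raise_for_status()
        return resp
    raise RuntimeError(f"giving up on {url} after {retries} attempts")
```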
### 3. Immediate File Writing ✅
- Critical fix: CSV files are written immediately after each repo is processed
- Previously waited until all repos were done (data loss on interruption)
- Now safe to Ctrl+C and resume later
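The write-immediately pattern might look like this, reusing the illustrative `CSV_FIELDS` from the earlier sketch; an empty row list still produces a header-only file, which doubles as the completion marker:

```python
import csv
from pathlib import Path

def write_repo_csv(repo: str, rows: list[dict]) -> None:
    """Flush one repo's rows to disk as soon as they are collected."""
    path = Path("pr_stats_output") / f"pr_stats_{repo}.csv"
    path.parent.mkdir(exist_ok=True)
    with path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=CSV_FIELDS)  # illustrative header from above
        writer.writeheader()
        writer.writerows(rows)  # an empty list still leaves a header-only marker file
```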
## Usage

### Run on all repos (production mode):

```bash
uv run python pr_stats_collector.py
```

### Test mode (single repo - be-lithic):

```bash
uv run python pr_stats_collector.py --test
```
### Important Notes:

- Always use `uv run python` for this project
- Script can be safely interrupted and resumed
- Already processed repos will be skipped automatically
- Rate limit pauses are automatic and expected behavior
## Output Structure

```
pr_stats_output/
├── pr_stats_android-ci.csv                  # Individual repo files
├── pr_stats_cicd.csv
├── pr_stats_limkevin-argo-cd.csv
├── pr_stats_team-majority-android.csv
└── ... (more as repos are processed)

pr_stats_all_repos_YYYYMMDD_HHMMSS.csv       # Final merged file
```
## Known Limitations & Behavior

- **API Pagination:** Currently fetches up to 100 items per request
  - PRs with >100 comments may not capture all comments
  - PRs with >100 files may not capture all files
  - Should be sufficient for most PRs
- **Bot Filtering:** Excludes users matching:
  - Contains "renovate" in username
  - Contains "copilot" in username
  - Username is "github-actions[bot]"
- **Date Filtering:** Only PRs created >= 2025-01-01
  - Sorted by creation date descending
  - Stops early when hitting pre-2025 PRs
- **Merge Status:** Only includes merged PRs
  - Open and closed-but-not-merged PRs are skipped
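A sketch that combines the filtering rules above (bot exclusion, the 2025 cutoff with early stop, merged-only); the predicate names are illustrative:

```python
from datetime import datetime, timezone

CUTOFF = datetime(2025, 1, 1, tzinfo=timezone.utc)

def is_bot(username: str) -> bool:
    name = username.lower()
    return "renovate" in name or "copilot" in name or username == "github-actions[bot]"

def keep_pr(pr: dict) -> bool:
    created = datetime.fromisoformat(pr["created_at"].replace("Z", "+00:00"))
    if created < CUTOFF:
        # The listing is sorted by created date descending, so the caller
        # can stop paging entirely at the first pre-2025 PR.
        return False
    return bool(pr.get("merged_at")) and not is_bot(pr["user"]["login"])
```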
## Important Learnings

### Rate Limiting is Critical
- The majority-dev org has many repos with lots of PRs
- Each PR requires multiple API calls (comments, files, reviews)
- Without rate limit handling, script fails partway through with 403 errors
- Lesson: Always implement rate limit checking for batch GitHub API operations
### Incremental Saves are Essential
- Processing all repos can take hours
- Network issues, rate limits, or user interruption can occur
- Saving after each repo prevents data loss
- Empty marker files prevent wasted retries on error cases
### Resume Logic Saves Time
- With ~50+ repos in majority-dev, reprocessing wastes API quota
- Simple file existence check enables safe resumption
- Critical for long-running data collection tasks
## Troubleshooting

### If you see "403 Forbidden" errors:
- Fixed in current version - script will auto-wait
- Rate limit resets every hour
- Check remaining calls: The script shows this automatically
### If script stops/crashes:

- Just run again - it will skip completed repos
- Check `pr_stats_output/` to see what has been completed
- Marker files (even empty ones) mean that repo is done
### If you need to reprocess a specific repo:

- Delete its CSV file: `pr_stats_output/pr_stats_{repo}.csv`
- Run the script again
## Current Run Status (as of 2025-12-12)
- Completed repos: 4 (android-ci, cicd, limkevin-argo-cd, team-majority-android)
- Running: Full organization scan in progress
- Next steps: Will process remaining ~50+ repos with automatic rate limit handling