PR Stats Collector - Script Summary

Last Updated: 2025-12-12
Location: D:\dump\py\pr_stats_collector.py

Purpose

Collects pull request statistics from all repositories in the majority-dev GitHub organization. The script analyzes merged PRs created in 2025 and generates CSV reports with detailed metrics.

Current State

What It Does

  • Fetches all repos from the majority-dev organization (excluding be-deployments, be-docs, be-devtools)
  • For each repo, collects merged PRs created in 2025
  • Filters out PRs and comments from bots (renovate, GitHub Copilot)
  • Generates individual CSV files per repo in pr_stats_output/ directory
  • Merges all CSVs into a single timestamped file at the end
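
As a concrete illustration of the first step, here is a minimal sketch of paginating the organization's repository list through the GitHub REST API. It assumes a GITHUB_TOKEN environment variable and the requests library; list_org_repos is an illustrative name, not the script's actual code:

import os
import requests

ORG = "majority-dev"
EXCLUDED = {"be-deployments", "be-docs", "be-devtools"}

def list_org_repos(org: str) -> list[str]:
    """Page through GET /orgs/{org}/repos, 100 repos per request."""
    session = requests.Session()
    session.headers["Authorization"] = f"Bearer {os.environ['GITHUB_TOKEN']}"
    names, page = [], 1
    while True:
        resp = session.get(
            f"https://api.github.com/orgs/{org}/repos",
            params={"per_page": 100, "page": page},
        )
        resp.raise_for_status()
        batch = resp.json()
        if not batch:  # empty page means we've seen every repo
            break
        names.extend(repo["name"] for repo in batch)
        page += 1
    return [name for name in names if name not in EXCLUDED]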

Data Collected Per PR

  • Repository name and PR number
  • PR title, author, created/merged timestamps
  • Files changed count
  • Code changes: additions, deletions, total changes
  • Review comments count (excluding bot comments)
  • Top commenter (user with most comments)
  • Approvers (list of users who approved)
  • PR URL
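
An illustrative column layout for the per-PR record; the script's actual CSV column names may differ:

PR_ROW_FIELDS = [
    "repo", "pr_number", "title", "author", "created_at", "merged_at",
    "files_changed", "additions", "deletions", "total_changes",
    "review_comments", "top_commenter", "approvers", "url",
]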

Key Features & Recent Improvements

1. Resume Capability ✅

  • Checks for existing CSV files before processing repos
  • Skips any repo whose pr_stats_output/pr_stats_{repo}.csv already exists
  • Creates empty CSV files even for repos with no PRs (acts as completion marker)
  • Creates marker files on errors to prevent infinite retries
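
The resume check amounts to a file-existence test plus a marker writer. A minimal sketch, with illustrative function names:

import os

OUTPUT_DIR = "pr_stats_output"

def is_done(repo: str) -> bool:
    # A per-repo CSV, even an empty one, marks that repo as completed
    return os.path.exists(os.path.join(OUTPUT_DIR, f"pr_stats_{repo}.csv"))

def mark_done(repo: str) -> None:
    # Touch an empty marker CSV so errored or PR-less repos are not retried
    os.makedirs(OUTPUT_DIR, exist_ok=True)
    open(os.path.join(OUTPUT_DIR, f"pr_stats_{repo}.csv"), "a").close()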

2. Rate Limit Handling ✅

  • Critical fix: the GitHub API allows 5,000 requests per hour for authenticated users
  • Monitors remaining API calls before processing each repo
  • Automatically waits and retries when the rate limit is hit (403 Forbidden errors)
  • Shows remaining API calls and estimated wait times
  • Exponential backoff on transient failures
  • Prevents the 403 errors that previously caused the script to fail
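
A minimal sketch of the rate-limit guard, based on GitHub's documented X-RateLimit-Remaining and X-RateLimit-Reset response headers; the threshold and safety margin are illustrative:

import time
import requests

def wait_if_rate_limited(resp: requests.Response, min_remaining: int = 50) -> None:
    remaining = int(resp.headers.get("X-RateLimit-Remaining", "1"))
    if remaining >= min_remaining:
        return
    reset_at = int(resp.headers.get("X-RateLimit-Reset", "0"))  # Unix epoch seconds
    wait = max(reset_at - time.time(), 0) + 5  # small safety margin
    print(f"{remaining} API calls left; sleeping {wait:.0f}s until reset")
    time.sleep(wait)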

3. Immediate File Writing ✅

  • Critical fix: CSV files are written immediately after each repo is processed
  • Previously the script waited until all repos were done, so an interruption lost all collected data
  • Now safe to Ctrl+C and resume later
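
The pattern is a complete open/write/close cycle per repo, so each file is flushed to disk before the next repo starts. A sketch with illustrative names:

import csv
import os

def write_repo_csv(repo: str, rows: list[dict], fieldnames: list[str]) -> None:
    os.makedirs("pr_stats_output", exist_ok=True)
    path = os.path.join("pr_stats_output", f"pr_stats_{repo}.csv")
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    # The file is closed here, so the data survives a later Ctrl+C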

Usage

Run on all repos (production mode):

uv run python pr_stats_collector.py

Test mode (single repo - be-lithic):

uv run python pr_stats_collector.py --test

Important Notes:

  • Always use uv run python for this project
  • Script can be safely interrupted and resumed
  • Already processed repos will be skipped automatically
  • Rate limit pauses are automatic and expected behavior

Output Structure

pr_stats_output/
├── pr_stats_android-ci.csv          # Individual repo files
├── pr_stats_cicd.csv
├── pr_stats_limkevin-argo-cd.csv
├── pr_stats_team-majority-android.csv
└── ... (more as repos are processed)

pr_stats_all_repos_YYYYMMDD_HHMMSS.csv  # Final merged file
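
The merge step concatenates the per-repo CSVs into the timestamped file, writing a single header row and skipping empty marker files. A sketch, not the script's exact code:

import csv
import glob
from datetime import datetime

def merge_csvs(output_dir: str = "pr_stats_output") -> str:
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    merged_path = f"pr_stats_all_repos_{stamp}.csv"
    with open(merged_path, "w", newline="", encoding="utf-8") as out:
        writer = None
        for path in sorted(glob.glob(f"{output_dir}/pr_stats_*.csv")):
            with open(path, newline="", encoding="utf-8") as f:
                reader = csv.reader(f)
                header = next(reader, None)
                if header is None:
                    continue  # empty marker file for a repo with no PRs
                if writer is None:
                    writer = csv.writer(out)
                    writer.writerow(header)  # write the header only once
                writer.writerows(reader)
    return merged_path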

Known Limitations & Behavior

  1. API Pagination: Currently fetches up to 100 items per request

    • Comments beyond the first 100 on a PR are not captured
    • Files beyond the first 100 on a PR are not captured
    • This should be sufficient for most PRs
  2. Bot Filtering: Excludes users matching any of the following (see the sketch after this list):

    • Contains "renovate" in username
    • Contains "copilot" in username
    • Username is "github-actions[bot]"
  3. Date Filtering: Only PRs created >= 2025-01-01

    • Sorted by creation date descending
    • Stops early when hitting pre-2025 PRs
  4. Merge Status: Only includes merged PRs

    • Open and closed-but-not-merged PRs are skipped
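
A sketch of the filters in items 2-4; the substring rules, cutoff date, and merge check come from the list above, the PR dict shape follows GitHub's REST response fields, and the function names are illustrative:

from datetime import datetime, timezone

CUTOFF = datetime(2025, 1, 1, tzinfo=timezone.utc)

def is_bot(username: str) -> bool:
    name = username.lower()
    return "renovate" in name or "copilot" in name or name == "github-actions[bot]"

def keep_pr(pr: dict) -> bool:
    # Keep only merged, 2025-created PRs authored by humans
    created = datetime.fromisoformat(pr["created_at"].replace("Z", "+00:00"))
    return (
        pr.get("merged_at") is not None
        and created >= CUTOFF
        and not is_bot(pr["user"]["login"])
    )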

Important Learnings

Rate Limiting is Critical

  • The majority-dev org has many repos with lots of PRs
  • Each PR requires multiple API calls (comments, files, reviews)
  • Without rate limit handling, the script fails partway through with 403 errors
  • Lesson: Always implement rate limit checking for batch GitHub API operations
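
A minimal sketch of the exponential backoff mentioned above; attempt counts and delays are illustrative:

import time
import requests

def get_with_backoff(session: requests.Session, url: str, attempts: int = 5, **kwargs) -> requests.Response:
    delay = 2.0
    for _ in range(attempts):
        try:
            resp = session.get(url, timeout=30, **kwargs)
            if resp.status_code == 200:
                return resp
            # A 403 here usually means rate-limited; fall through and retry
        except requests.RequestException:
            pass  # timeout / connection reset: treat as transient
        time.sleep(delay)
        delay *= 2  # 2s, 4s, 8s, 16s, ...
    raise RuntimeError(f"giving up on {url} after {attempts} attempts")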

Incremental Saves are Essential

  • Processing all repos can take hours
  • Network issues, rate limits, or user interruption can occur
  • Saving after each repo prevents data loss
  • Empty marker files prevent wasted retries on error cases

Resume Logic Saves Time

  • With 50+ repos in majority-dev, reprocessing wastes API quota
  • Simple file existence check enables safe resumption
  • Critical for long-running data collection tasks

Troubleshooting

If you see "403 Forbidden" errors:

  • Fixed in the current version: the script waits automatically
  • The rate limit resets every hour
  • The script prints the remaining call count automatically

If script stops/crashes:

  • Just run again - it will skip completed repos
  • Check pr_stats_output/ for what's been completed
  • A marker file (even an empty one) means that repo is done

If you need to reprocess a specific repo:

  • Delete its CSV file: pr_stats_output/pr_stats_{repo}.csv
  • Run script again
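
For example, on Windows (matching the script's location above), replacing {repo} with the repository name:

del pr_stats_output\pr_stats_{repo}.csv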

Current Run Status (as of 2025-12-12)

  • Completed repos: 4 (android-ci, cicd, limkevin-argo-cd, team-majority-android)
  • Running: Full organization scan in progress
  • Next steps: Will process the remaining 50+ repos with automatic rate limit handling