Git Archiver Web

Last updated Feb 2026

Free GitHub repository archive service: back up any public repo forever

Architecture

System Overview

Git-Archiver Web is a serverless GitHub repository archiving service that I designed to run entirely on free-tier cloud services. The architecture prioritizes zero operational cost while maintaining reliability and scalability.

flowchart TB
    subgraph User["User Interface"]
        FE[Static Frontend<br/>GitHub Pages]
    end

    subgraph Submission["Submission Layer"]
        CW[Cloudflare Worker<br/>API Proxy]
    end

    subgraph Processing["Processing Layer"]
        GI[GitHub Issues<br/>Request Queue]
        GA[GitHub Actions<br/>Archive Engine]
    end

    subgraph Storage["Storage Layer"]
        GR[GitHub Releases<br/>Archive Storage]
        IDX[index.json<br/>Master Index]
    end

    subgraph External["External"]
        GHAPI[GitHub API]
        REPO[Target Repository]
    end

    FE -->|POST /submit| CW
    FE -->|GET /index| CW
    CW -->|Create Issue| GI
    CW -->|Fetch Index| IDX
    GI -->|Trigger on label| GA
    GA -->|Clone| REPO
    GA -->|Validate| GHAPI
    GA -->|Upload Archive| GR
    GA -->|Update| IDX
    GA -->|Close| GI
    FE -->|Fetch Releases| GHAPI

Component Architecture

Frontend Layer

flowchart LR
    subgraph Frontend["Static Frontend"]
        HTML[index.html]
        CSS[styles.css]
        APP[app.js<br/>Main Logic]
        API[api.js<br/>API Client]
        UTIL[utils.js<br/>Helpers]
    end

    APP --> API
    APP --> UTIL
    HTML --> CSS
    HTML --> APP

The frontend is a single-page application built with vanilla JavaScript. I deliberately avoided frameworks to minimize bundle size and eliminate build steps. The entire frontend consists of:

index.html: Semantic markup with accessibility considerations
styles.css: Custom CSS with dark theme and responsive design
app.js: Application state management and UI rendering
api.js: API client abstracting all backend communication
utils.js: Pure utility functions for formatting, validation, and DOM manipulation

Worker Layer (Cloudflare)

flowchart TB
    subgraph Worker["Cloudflare Worker"]
        CORS[CORS Handler]
        ROUTE[Router]

        subgraph Endpoints
            SUBMIT[POST /submit]
            BULK[POST /bulk-submit]
            INDEX[GET /index]
            README[GET /readme]
            STATUS[GET /status]
        end

        subgraph Validation
            URL[URL Validator]
            RATE[Rate Limiter]
            SIZE[Size Checker]
            DUP[Duplicate Checker]
        end
    end

    CORS --> ROUTE
    ROUTE --> Endpoints
    SUBMIT --> Validation
    BULK --> Validation

The Cloudflare Worker serves as a secure proxy between the frontend and GitHub API. Key responsibilities:

Token Protection: GitHub PAT never exposed to client
Request Validation: URL format, size limits, duplicate checking
Rate Limiting: IP-based throttling (configurable via KV)
CORS Handling: Enables cross-origin requests from GitHub Pages
Index Proxying: Avoids CORS issues with GitHub release asset redirects
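
As an illustration of the CORS and routing responsibilities above, here is a minimal sketch of a Worker-style request handler. It is not the project's actual code: the allowed origin, the stub handlers, the 202 status, and the response bodies are all assumptions made for the example.

```javascript
// Hypothetical allowed origin; a real deployment would use its own GitHub Pages URL.
const ALLOWED_ORIGIN = "https://example.github.io";

// Headers attached to every response so the browser accepts cross-origin calls.
function corsHeaders() {
  return {
    "Access-Control-Allow-Origin": ALLOWED_ORIGIN,
    "Access-Control-Allow-Methods": "GET, POST, OPTIONS",
    "Access-Control-Allow-Headers": "Content-Type",
  };
}

// Small helper that serializes a JSON body and adds the CORS headers.
function json(body, status = 200) {
  return new Response(JSON.stringify(body), {
    status,
    headers: { "Content-Type": "application/json", ...corsHeaders() },
  });
}

// Route table keyed by "METHOD /path". The handlers are stubs for illustration.
const routes = {
  "GET /health": () => json({ ok: true }),
  "POST /submit": () => json({ queued: true }, 202),
};

function handleRequest(request) {
  // Answer CORS preflight requests before any routing happens.
  if (request.method === "OPTIONS") {
    return new Response(null, { status: 204, headers: corsHeaders() });
  }
  const { pathname } = new URL(request.url);
  const handler = routes[`${request.method} ${pathname}`];
  return handler ? handler(request) : json({ error: "Not found" }, 404);
}

// In a Worker module this would typically be exposed as:
//   export default { fetch: handleRequest };
```

Answering the preflight before routing keeps every endpoint consistent without repeating CORS logic per handler.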

Processing Layer (GitHub Actions)

flowchart TB
    subgraph Archive["archive.yml Workflow"]
        TRIGGER[Issue Trigger]
        PARSE[Parse URL]
        VALIDATE[Validate Repo]
        CLONE[Clone Repository]
        COMPRESS[Create tar.gz]
        HASH[Calculate SHA256]
        CHECK[Check for Changes]
        RELEASE[Create Release]
        UPDATE[Update Index]
        CLOSE[Close Issue]
    end

    TRIGGER --> PARSE
    PARSE --> VALIDATE
    VALIDATE -->|Valid| CLONE
    VALIDATE -->|Invalid| CLOSE
    CLONE --> COMPRESS
    COMPRESS --> HASH
    HASH --> CHECK
    CHECK -->|Changed| RELEASE
    CHECK -->|Unchanged| CLOSE
    RELEASE --> UPDATE
    UPDATE --> CLOSE

I chose GitHub Actions as the archive engine because:

Free compute: Unlimited minutes for public repositories
Native integration: Direct access to GitHub API with built-in tokens
Event-driven: Triggers on issue creation without polling
Reliable: Managed infrastructure; failed runs can be re-run without touching any servers

Storage Layer

flowchart TB
    subgraph Releases["GitHub Releases"]
        IDX_REL[index Release<br/>tag: index]
        REPO_REL[Archive Releases<br/>tag: owner__repo__date]
    end

    subgraph Assets
        IDX_JSON[index.json<br/>Master Index]
        ARCHIVE[repo.tar.gz<br/>Archive File]
        META[metadata.json<br/>Repo Metadata]
        README[README.md<br/>Extracted README]
    end

    IDX_REL --> IDX_JSON
    REPO_REL --> ARCHIVE
    REPO_REL --> META
    REPO_REL --> README

Key Architecture Decisions

Why Serverless?

I chose a serverless architecture for several reasons:

Zero maintenance: No servers to patch, scale, or monitor
Cost efficiency: All services operate within free tiers
Global distribution: Cloudflare and GitHub CDN provide edge caching
Automatic scaling: Handles traffic spikes without configuration

Why GitHub Issues as Queue?

Using GitHub Issues as a job queue was an unconventional but effective choice:

Visibility: Users can track their request status
Auditability: Complete history of all archive requests
Native triggering: GitHub Actions can trigger on issue events
No additional services: Eliminates need for Redis, SQS, etc.

Why GitHub Releases for Storage?

Unlimited storage: No stated limits for public repos
CDN-backed: Fast downloads globally
Versioning: Natural support for multiple archive versions
API accessible: Easy programmatic access to assets

Deduplication Strategy

I implemented content-based deduplication using SHA256 hashes:

Each archive's hash is stored in metadata.json
Before creating a new release, the workflow compares hashes
If unchanged, no new release is created (saves storage)
Daily update job re-archives repos only when content changes
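
The hash comparison at the heart of this strategy can be sketched as a small pure function. This is an illustrative example only; the `sha256` field name inside metadata.json is an assumption, since the exact metadata schema is not shown here.

```javascript
// previousMetadata: the parsed metadata.json of the latest release for a repo,
// or null when the repo has never been archived.
// The `sha256` field name is assumed for this sketch.
function shouldCreateRelease(previousMetadata, newHash) {
  // No earlier archive (or no stored hash): always create the first release.
  if (!previousMetadata || !previousMetadata.sha256) return true;
  // Hex digests are case-insensitive, so normalize before comparing.
  return previousMetadata.sha256.toLowerCase() !== newHash.toLowerCase();
}

// Usage: skip the upload when the freshly computed hash matches the stored one.
const previous = { sha256: "ABC123" };
console.log(shouldCreateRelease(previous, "abc123")); // false (unchanged)
console.log(shouldCreateRelease(previous, "def456")); // true (content changed)
console.log(shouldCreateRelease(null, "abc123"));     // true (first archive)
```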

Security Considerations

Token isolation: GitHub PAT stored in Cloudflare secrets, never in frontend
Input sanitization: Strict URL regex validation
XSS prevention: All user input escaped before rendering
Rate limiting: Prevents abuse of submission endpoint
Size limits: 2GB cap prevents storage abuse
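
Two of the measures above, strict URL validation and output escaping, lend themselves to a short sketch. The regex below is an assumption that only approximates GitHub's naming rules (owner up to 39 characters, repository up to 100); the function names are hypothetical and the project's actual pattern may differ.

```javascript
// Accepts only https://github.com/<owner>/<repo>, with an optional ".git"
// suffix or trailing slash. Extra path segments, other hosts, and other
// schemes are rejected.
const GITHUB_REPO_URL =
  /^https:\/\/github\.com\/([A-Za-z0-9][A-Za-z0-9-]{0,38})\/([A-Za-z0-9._-]{1,100}?)(?:\.git)?\/?$/;

function parseRepoUrl(input) {
  const match = GITHUB_REPO_URL.exec(String(input).trim());
  if (!match) return null;
  const [, owner, repo] = match;
  // "." and ".." are not usable repository names.
  if (repo === "." || repo === "..") return null;
  return { owner, repo };
}

// Escapes the five characters that matter when inserting text into HTML.
const HTML_ESCAPES = { "&": "&amp;", "<": "&lt;", ">": "&gt;", '"': "&quot;", "'": "&#39;" };

function escapeHtml(text) {
  return String(text).replace(/[&<>"']/g, (ch) => HTML_ESCAPES[ch]);
}
```

Validating on the Worker and escaping again at render time gives two independent layers: a malformed URL never reaches the queue, and anything that does reach the page is rendered as text rather than markup.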

Data Flow

Submission Flow

sequenceDiagram
    participant U as User
    participant F as Frontend
    participant W as Worker
    participant G as GitHub API
    participant I as GitHub Issue

    U->>F: Enter repo URL
    F->>W: POST /submit
    W->>G: Validate repo exists
    G-->>W: Repo metadata
    W->>W: Check size < 2GB
    W->>G: Check existing issues
    W->>G: Check today's release
    W->>G: Create issue with label
    G-->>W: Issue created
    W-->>F: Success response
    F-->>U: "Queued for archiving"
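
The order of checks in the diagram above can be sketched as a single decision function. This is a hedged illustration rather than the Worker's real code: the function name, the input shape, and the reason strings are assumptions. The `size` and `private` fields do exist on GitHub REST API repository objects, and `size` is reported in kilobytes.

```javascript
// 2GB expressed in kilobytes, because the GitHub REST API reports `size` in KB.
const MAX_REPO_SIZE_KB = 2 * 1024 * 1024;

// repo: repository object from the GitHub API, or null if the lookup failed.
// hasOpenRequest: true if an open archive-request issue already exists.
// archivedToday: true if a release for this repo was already created today.
function evaluateSubmission({ repo, hasOpenRequest, archivedToday }) {
  if (!repo) return { ok: false, reason: "not_found" };
  if (repo.private) return { ok: false, reason: "private" };
  if (repo.size >= MAX_REPO_SIZE_KB) return { ok: false, reason: "too_large" };
  if (hasOpenRequest) return { ok: false, reason: "already_queued" };
  if (archivedToday) return { ok: false, reason: "already_archived_today" };
  return { ok: true };
}

// Usage: only create the issue when every check passes.
const decision = evaluateSubmission({
  repo: { private: false, size: 51200 },
  hasOpenRequest: false,
  archivedToday: false,
});
console.log(decision); // { ok: true }
```

Running the cheap, local checks before the duplicate lookups keeps GitHub API calls to a minimum for requests that would be rejected anyway.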

Archive Flow

sequenceDiagram
    participant I as GitHub Issue
    participant A as GitHub Action
    participant R as Target Repo
    participant S as GitHub Releases

    I->>A: Issue opened with label
    A->>A: Parse URL from body
    A->>R: Validate repo (API)
    A->>R: Clone (depth 100)
    A->>A: Create tar.gz
    A->>A: Calculate SHA256
    A->>S: Check previous hash
    alt Content changed
        A->>S: Upload archive + metadata
        A->>S: Update index.json
    end
    A->>I: Comment result
    A->>I: Close issue

Limitations

Repository size: 2GB maximum (GitHub release asset limit)
Clone depth: Limited to 100 commits for speed
Private repos: Not supported (intentional)
Rate limits: 10 requests/hour per IP (configurable)
GitHub dependency: Entire system relies on GitHub availability

Tech Stack

Overview

I built Git-Archiver Web as a fully serverless application using only free-tier services. The stack prioritizes simplicity, zero operational cost, and minimal dependencies.

Frontend

Technology	Version	Purpose
HTML5	-	Semantic markup
CSS3	-	Custom styling with CSS variables
JavaScript	ES2020+	Application logic

Frontend Details

No Framework: I intentionally avoided React, Vue, or other frameworks. The entire frontend is ~40KB uncompressed, with no build step required.

Styling Approach:

Custom CSS with CSS variables for theming
Dark theme by default (GitHub-inspired color palette)
Fully responsive design with mobile-first breakpoints
No CSS frameworks (no Tailwind, Bootstrap)

JavaScript Architecture:

app.js (24KB): Main application with state management, event handling, and rendering
api.js (8KB): API client with all HTTP requests abstracted
utils.js (6KB): Pure utility functions (formatting, validation, DOM helpers)

Why Vanilla JS?

No build step: Just edit and deploy
Fast load times: ~40KB total vs 100KB+ for React alone
Simplicity: Easy to understand and modify
Browser support: Works in all modern browsers without transpilation

Backend (Cloudflare Worker)

Technology	Version	Purpose
Cloudflare Workers	V8 Runtime	Serverless API
Wrangler	^3.0.0	CLI for deployment

Worker Details

Runtime: Cloudflare Workers run in V8 isolates (the same engine used by Chrome and Node.js), which keeps cold starts under 5ms.

Code Size: ~700 lines of JavaScript handling:

URL validation and routing
GitHub API integration
CORS handling
Rate limiting (prepared for KV)
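
To illustrate the rate-limiting piece, here is a minimal fixed-window counter using the documented default of 10 requests per hour per IP. It is a sketch under stated assumptions: the function name and key format are hypothetical, and a plain `Map` stands in for Workers KV, which is asynchronous and eventually consistent in practice.

```javascript
const LIMIT = 10;                  // requests allowed per window
const WINDOW_MS = 60 * 60 * 1000;  // one hour

// `store` only needs get(key) and set(key, value); a Map is used as a stand-in.
function checkRateLimit(store, ip, now = Date.now()) {
  // All requests within the same hour share one counter key.
  const windowStart = Math.floor(now / WINDOW_MS) * WINDOW_MS;
  const key = `rate:${ip}:${windowStart}`;
  const count = store.get(key) || 0;

  if (count >= LIMIT) {
    return { allowed: false, remaining: 0, retryAfterMs: windowStart + WINDOW_MS - now };
  }
  store.set(key, count + 1);
  return { allowed: true, remaining: LIMIT - count - 1, retryAfterMs: 0 };
}

// Usage: the 11th request within the same hour is rejected.
const store = new Map();
for (let i = 0; i < 10; i++) checkRateLimit(store, "203.0.113.7", 1000);
console.log(checkRateLimit(store, "203.0.113.7", 1000).allowed); // false
```

A fixed window is the simplest scheme to back with a key-value store, at the cost of allowing short bursts around window boundaries.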

Endpoints:

Endpoint	Method	Description
`/submit`	POST	Submit single repo URL
`/bulk-submit`	POST	Submit up to 20 URLs
`/index`	GET	Fetch master index (proxied)
`/readme`	GET	Fetch archived README
`/status`	GET	Check if original repo exists
`/health`	GET	Health check
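
A client for these endpoints might build its requests along the following lines. This is a sketch only: the function name and the JSON body shapes (`url` and `urls`) are assumptions, since the request format is not documented here. The 20-URL cap comes from the table above.

```javascript
const MAX_BULK_URLS = 20; // documented limit for /bulk-submit

// Builds a POST request for one URL (/submit) or several (/bulk-submit).
// The body field names `url` and `urls` are assumed for illustration.
function buildSubmitRequest(baseUrl, repoUrls) {
  const urls = Array.isArray(repoUrls) ? repoUrls : [repoUrls];
  if (urls.length === 0) throw new Error("At least one URL is required");
  if (urls.length > MAX_BULK_URLS) {
    throw new Error(`At most ${MAX_BULK_URLS} URLs per request`);
  }
  const single = urls.length === 1;
  const endpoint = new URL(single ? "/submit" : "/bulk-submit", baseUrl).href;
  return new Request(endpoint, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(single ? { url: urls[0] } : { urls }),
  });
}

// Usage (the Worker hostname is a placeholder):
const request = buildSubmitRequest("https://worker.example", "https://github.com/torvalds/linux");
console.log(request.method, request.url); // POST https://worker.example/submit
```

Enforcing the bulk limit on the client avoids a round trip for requests the Worker would reject.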

Why Cloudflare Workers?

Free tier: 100,000 requests/day
Global edge: Low latency worldwide
Negligible cold starts: Under 5ms, so responses feel instant
Built-in secrets: Secure token storage

Processing (GitHub Actions)

Component	Purpose
archive.yml	Main archive workflow
update-archives.yml	Daily re-archive job
pages.yml	Frontend deployment

Workflow Details

archive.yml (~470 lines):

Triggers on issue creation with archive-request label
Validates repository existence and size
Clones with depth 100 (balances speed vs history)
Creates tar.gz archive
Calculates SHA256 for deduplication
Uploads to GitHub Releases
Updates master index
Comments on and closes issue

update-archives.yml (~100 lines):

Runs daily at 3 AM UTC
Selects oldest archives for re-checking
Triggers archive workflow via dispatch
SHA256 deduplication skips the release when content is unchanged
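
The "select oldest archives" step could look roughly like the sketch below. The index shape is an assumption: a `repos` array whose entries carry a `lastArchived` ISO 8601 timestamp. The real index.json schema may use different names.

```javascript
// Returns the `count` entries that were archived longest ago.
// `index.repos` and `lastArchived` are assumed field names.
function selectOldest(index, count) {
  return [...index.repos] // copy so the caller's index is not reordered
    .sort((a, b) => Date.parse(a.lastArchived) - Date.parse(b.lastArchived))
    .slice(0, count);
}

// Usage with a hypothetical index:
const index = {
  repos: [
    { name: "a/one", lastArchived: "2026-01-20T03:00:00Z" },
    { name: "b/two", lastArchived: "2025-12-01T03:00:00Z" },
    { name: "c/three", lastArchived: "2026-02-01T03:00:00Z" },
  ],
};
console.log(selectOldest(index, 2).map((r) => r.name)); // [ 'b/two', 'a/one' ]
```

Re-checking the stalest entries first spreads the daily work evenly, so every archive is eventually revisited even when the batch size is small.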

Why GitHub Actions?

Unlimited minutes: Free for public repositories
Native GitHub integration: Built-in GITHUB_TOKEN
Event-driven: Triggers on issues without polling
Powerful runners: Standard GitHub-hosted Linux runners for public repositories provide 4 cores and 16GB RAM

Storage

Service	Purpose	Limits
GitHub Releases	Archive storage	2GB per asset
GitHub Pages	Frontend hosting	100GB/month bandwidth (soft limit)

Storage Structure

Releases/
  index (tag)
    index.json          # Master index of all repos

  owner__repo__date (tag)
    owner_repo.tar.gz   # Archive file
    metadata.json       # Size, hash, stars, etc.
    README.md           # Extracted README
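
The naming scheme above can be expressed as a pair of helper functions. This is an illustrative sketch: the function names are hypothetical and the `YYYY-MM-DD` date format is an assumption, since the layout only says `date`.

```javascript
// Builds the release tag and archive file name for a repository.
function releaseNames(owner, repo, date = new Date()) {
  const day = date.toISOString().slice(0, 10); // YYYY-MM-DD in UTC (assumed format)
  return {
    tag: `${owner}__${repo}__${day}`,
    archive: `${owner}_${repo}.tar.gz`,
  };
}

// Splits a tag back into its parts. GitHub usernames cannot contain
// underscores, so the first "__" always ends the owner; the last "__"
// starts the date, which tolerates "__" inside repository names.
function parseTag(tag) {
  const first = tag.indexOf("__");
  const last = tag.lastIndexOf("__");
  if (first === -1 || first === last) return null;
  return {
    owner: tag.slice(0, first),
    repo: tag.slice(first + 2, last),
    date: tag.slice(last + 2),
  };
}

// Usage:
const names = releaseNames("torvalds", "linux", new Date("2026-02-01T12:00:00Z"));
console.log(names.tag);           // torvalds__linux__2026-02-01
console.log(parseTag(names.tag)); // { owner: 'torvalds', repo: 'linux', date: '2026-02-01' }
```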

Why GitHub Releases?

Free unlimited storage: No explicit limits for public repos
CDN-backed: Fast global downloads
Versioning: Natural support for multiple versions
API accessible: Easy to query and download

Infrastructure

Service	Tier	Cost
GitHub	Free	$0
Cloudflare Workers	Free	$0
Domain (optional)	-	$0 (using github.io)

Deployment

Frontend:

Automatic deployment via GitHub Actions on push to main
GitHub Pages serves from frontend/ directory
No build step required

Worker:

cd worker
npx wrangler deploy

Secrets:

GITHUB_TOKEN: Personal Access Token (repo scope)
GITHUB_OWNER: Repository owner
GITHUB_REPO: Repository name

Key Dependencies

Worker Dependencies

Package	Version	Reason
wrangler	^3.0.0	Cloudflare CLI for development and deployment

I kept dependencies minimal. The worker itself uses no external packages, just vanilla JavaScript with the Workers API.

GitHub Actions

Action	Version	Reason
actions/checkout	v4	Clone repository
softprops/action-gh-release	v1	Create releases
actions/github-script	v7	Issue management
actions/configure-pages	v4	Pages deployment
actions/upload-pages-artifact	v3	Upload static files
actions/deploy-pages	v4	Deploy to Pages

Development Setup

Prerequisites

Node.js 18+
Cloudflare account (free)
GitHub account with PAT

Local Development

# Frontend (any static server)
cd frontend
npx serve .
# Opens at http://localhost:3000

# Worker
cd worker
npm install
npx wrangler dev
# Opens at http://localhost:8787

Environment Variables

Variable	Location	Description
GITHUB_TOKEN	Cloudflare Secrets	PAT with repo scope
GITHUB_OWNER	Cloudflare Secrets	Your GitHub username
GITHUB_REPO	Cloudflare Secrets	Repository name
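
As a sketch of how these three values might be used, the function below builds the GitHub REST API request that creates a queue issue. The endpoint (`POST /repos/{owner}/{repo}/issues`) and the `archive-request` label are real; the function name, the issue title and body format, and the User-Agent string are assumptions for illustration.

```javascript
// env: object carrying the three secrets from the table above.
function buildCreateIssueRequest(env, repoUrl) {
  const endpoint =
    `https://api.github.com/repos/${env.GITHUB_OWNER}/${env.GITHUB_REPO}/issues`;
  return new Request(endpoint, {
    method: "POST",
    headers: {
      // The token stays on the Worker and is never sent to the browser.
      Authorization: `Bearer ${env.GITHUB_TOKEN}`,
      Accept: "application/vnd.github+json",
      "Content-Type": "application/json",
      // GitHub's API rejects requests that lack a User-Agent header.
      "User-Agent": "git-archiver-worker",
    },
    body: JSON.stringify({
      title: `Archive request: ${repoUrl}`, // assumed title format
      body: repoUrl,                        // assumed body format
      labels: ["archive-request"],          // label that triggers archive.yml
    }),
  });
}

// Usage with placeholder values:
const env = { GITHUB_TOKEN: "ghp_example", GITHUB_OWNER: "someone", GITHUB_REPO: "git-archiver-web" };
const request = buildCreateIssueRequest(env, "https://github.com/torvalds/linux");
console.log(request.url); // https://api.github.com/repos/someone/git-archiver-web/issues
```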

Performance Characteristics

Metric	Value
Frontend load time	<1s (40KB total)
Worker cold start	<5ms
Archive creation	2-10 minutes
Index fetch	<100ms

Scalability

Component	Limit	Notes
Worker requests	100K/day	Free tier
Actions minutes	Unlimited	Public repos
Storage	Unlimited	GitHub may contact at scale
Concurrent archives	20	GitHub Actions limit

Future Stack Considerations

If I needed to scale beyond free tiers:

Workers Paid ($5/mo): 10M requests/month included, higher KV limits
GitHub Pro ($4/mo): Private repos, more Actions minutes
Custom domain: More professional appearance
Algolia (free tier): Full-text search
IPFS/Pinata: Redundant storage
