Company Atlas

A unified firmographic data platform with thousands of companies from open-source datasets

Unified Company Data

Company Atlas collects, cleans, and normalizes firmographic data from multiple sources, producing an analytics-ready dataset with thousands of companies worldwide.

📊

Multi-Source

Combines the open-source Kaggle Fortune 1000 dataset with web-crawler enrichment from multiple data sources

🔧

Automated Pipeline

Apache Airflow orchestration with automated data quality checks using dbt tests and Great Expectations

🎯

Unified Schema

Star schema with dimension and fact tables, deduplicated across sources and ready for analytics and visualization
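
As an illustration of cross-source deduplication, the sketch below merges records that share a normalized domain. It is a minimal example under assumptions, not the project's actual logic: the matching key (domain), the function names, the "first non-empty value wins" rule, and the sample rows are all hypothetical.

```python
from urllib.parse import urlparse


def normalize_domain(raw: str) -> str:
    """Reduce a URL or bare domain to a lowercase host without 'www.'."""
    raw = raw.strip().lower()
    # urlparse only fills netloc when a scheme is present, so add one if missing.
    host = urlparse(raw if "://" in raw else f"http://{raw}").netloc
    return host[4:] if host.startswith("www.") else host


def deduplicate(records: list[dict]) -> list[dict]:
    """Merge records sharing a normalized domain; earlier sources win on conflicts."""
    merged: dict[str, dict] = {}
    for record in records:
        key = normalize_domain(record["domain"])
        target = merged.setdefault(key, {"domain": key})
        for field, value in record.items():
            # Keep the first non-empty value seen for each field.
            if field != "domain" and value not in (None, "") and field not in target:
                target[field] = value
    return list(merged.values())


# Hypothetical rows: one from the Kaggle dataset, one from the crawler.
rows = [
    {"domain": "https://www.example.com/", "company_name": "Example Corp", "revenue": None},
    {"domain": "example.com", "company_name": "Example Corporation", "revenue": 1_000_000},
]
unified = deduplicate(rows)
print(unified)
```

Here the crawler row fills in the missing revenue while the first source's company name is kept. A real pipeline would likely need fuzzier matching for companies without a domain.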

Top Companies by Market Cap

Explore profiles of leading companies ranked by market capitalization

Dataset Statistics

Total Companies
Total Revenue
Industries
Avg Employees
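
The four figures above could be derived from company records roughly as follows. This is a sketch under assumptions: the input field names (`revenue`, `employee_count`, `industry`) follow the API example on this page, while the function name, the output keys, and the sample rows are hypothetical.

```python
from statistics import mean


def dataset_statistics(companies: list[dict]) -> dict:
    """Compute the four summary figures, skipping missing values."""
    revenues = [c["revenue"] for c in companies if c.get("revenue") is not None]
    employees = [c["employee_count"] for c in companies if c.get("employee_count") is not None]
    industries = {c["industry"] for c in companies if c.get("industry")}
    return {
        "total_companies": len(companies),
        "total_revenue": sum(revenues),
        "industries": len(industries),
        # Average only over companies that report a headcount.
        "avg_employees": mean(employees) if employees else None,
    }


# Hypothetical sample rows, including one with missing numeric fields.
sample = [
    {"company_name": "A", "industry": "Retail", "revenue": 100.0, "employee_count": 10},
    {"company_name": "B", "industry": "Retail", "revenue": 50.0, "employee_count": 30},
    {"company_name": "C", "industry": "Energy", "revenue": None, "employee_count": None},
]
stats = dataset_statistics(sample)
print(stats)
```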

Architecture & Features

A data pipeline built on widely used tools, covering storage, warehousing, transformation, data quality, orchestration, and API access

Cloud Storage

AWS S3 for raw data storage with CSV and Parquet formats

Data Warehouse

Snowflake for staging and analytics-ready tables

Transformation

dbt for modeling and data transformation with version control

Data Quality

dbt tests and Great Expectations for comprehensive validation
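
To make the kinds of validation concrete, here is a plain-Python sketch of three common checks: not-null, uniqueness, and a value range. It deliberately does not use the dbt or Great Expectations APIs. The column names (`company_id`, `revenue`) and the specific rules are assumptions for illustration only.

```python
def run_quality_checks(rows: list[dict]) -> list[str]:
    """Return human-readable failures; an empty list means every check passed."""
    failures = []
    seen_ids = set()
    for i, row in enumerate(rows):
        company_id = row.get("company_id")
        # Not-null check on the key column.
        if company_id is None:
            failures.append(f"row {i}: company_id is null")
            continue
        # Uniqueness check on the key column.
        if company_id in seen_ids:
            failures.append(f"row {i}: duplicate company_id {company_id}")
        seen_ids.add(company_id)
        # Range check: revenue must be non-negative when present.
        revenue = row.get("revenue")
        if revenue is not None and revenue < 0:
            failures.append(f"row {i}: negative revenue {revenue}")
    return failures


# Hypothetical rows that trigger each rule once.
sample_rows = [
    {"company_id": 1, "revenue": 100.0},
    {"company_id": 1, "revenue": 50.0},
    {"company_id": None, "revenue": 10.0},
    {"company_id": 2, "revenue": -5.0},
]
failures = run_quality_checks(sample_rows)
for failure in failures:
    print(failure)
```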

Orchestration

Apache Airflow for automated workflow scheduling
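
The core idea of workflow orchestration is running tasks in dependency order. The sketch below shows that idea with Python's standard-library `graphlib` rather than Airflow itself. The task names and their ordering are assumptions inferred from the components listed on this page, not the project's actual DAG.

```python
from graphlib import TopologicalSorter

# Hypothetical task names; each maps to the set of tasks it depends on.
dependencies = {
    "load_raw_to_s3": set(),
    "stage_in_snowflake": {"load_raw_to_s3"},
    "run_dbt_models": {"stage_in_snowflake"},
    "run_quality_checks": {"run_dbt_models"},
}

# static_order() yields each task only after all of its dependencies.
execution_order = list(TopologicalSorter(dependencies).static_order())
for step, task in enumerate(execution_order, start=1):
    print(f"{step}. {task}")
```

A scheduler such as Airflow adds scheduling, retries, and monitoring on top of this ordering.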

API & Visualization

FastAPI REST endpoints with interactive web interface

REST API

Access the unified company dataset through a REST API built with FastAPI. The API offers interactive documentation and filtering by company name, industry, and country.

GET /api/v1/companies

Search and retrieve companies with filtering and pagination

Query parameters: page, page_size, company_name, industry, country

GET /api/v1/companies/{id}

Get a specific company by ID

GET /api/v1/statistics

Get dataset statistics and distributions

GET /api/v1/industries

Get list of all industries

GET /api/v1/countries

Get list of all countries
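
A small helper can assemble query strings for the companies endpoint from the parameters listed above. This is a sketch: the base URL comes from the examples on this page, while the helper name, the default values, and the filter values are hypothetical.

```python
from urllib.parse import urlencode

BASE_URL = "http://localhost:8000/api/v1"  # local development URL used on this page


def companies_url(page: int = 1, page_size: int = 10, **filters: str) -> str:
    """Build a /companies URL, rejecting filters the endpoint does not list."""
    allowed = {"company_name", "industry", "country"}
    unknown = set(filters) - allowed
    if unknown:
        raise ValueError(f"unsupported filters: {sorted(unknown)}")
    params = {"page": page, "page_size": page_size}
    params.update({k: v for k, v in filters.items() if v is not None})
    # urlencode percent-encodes values and joins them with '&'.
    return f"{BASE_URL}/companies?{urlencode(params)}"


# Hypothetical filter values.
url = companies_url(industry="Retail", country="United States")
print(url)
```

Rejecting unknown filter names on the client side surfaces typos early, since a server may otherwise ignore an unrecognized query parameter.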

Example Usage

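Python
import requests

# Search for Apple by company name
response = requests.get(
    "http://localhost:8000/api/v1/companies",
    params={
        "company_name": "Apple",
        "page": 1,
        "page_size": 10,
    },
    timeout=10,  # avoid hanging indefinitely if the server is unreachable
)
response.raise_for_status()  # fail fast on HTTP errors instead of parsing an error body

companies = response.json()
print(f"Found {companies['total']} companies")
for company in companies["companies"]:
    print(f"- {company['company_name']} ({company['domain']})")
    print(f"  Industry: {company['industry']}")
    # Numeric fields may be missing for some companies, so guard before formatting
    if company.get("revenue") is not None:
        print(f"  Revenue: ${company['revenue']:,.0f}")
    if company.get("employee_count") is not None:
        print(f"  Employees: {company['employee_count']:,}")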
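
cURL
# Get statistics
curl "http://localhost:8000/api/v1/statistics"

# Search for Apple
curl "http://localhost:8000/api/v1/companies?company_name=Apple"

# Get a specific company by ID (set COMPANY_ID to an ID returned by a search;
# literal braces in the URL would be interpreted by curl's URL globbing)
curl "http://localhost:8000/api/v1/companies/${COMPANY_ID}"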
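
To retrieve more than one page of results, a client can keep requesting pages until it has seen `total` records. The sketch below assumes the response shape from the Python example (`total` and `companies` keys) and 1-indexed pages. The function names are hypothetical, and the HTTP call is replaced by an in-memory stand-in so the example runs offline.

```python
from typing import Callable, Iterator


def iter_companies(
    fetch_page: Callable[[int, int], dict], page_size: int = 10
) -> Iterator[dict]:
    """Yield every company by requesting pages until 'total' records are seen."""
    page, seen = 1, 0
    while True:
        body = fetch_page(page, page_size)
        batch = body["companies"]
        yield from batch
        seen += len(batch)
        # Also stop on an empty page, in case 'total' changes between requests.
        if not batch or seen >= body["total"]:
            break
        page += 1


# Stand-in for the HTTP call, serving 25 fake records.
FAKE_DATA = [{"company_name": f"Company {i}"} for i in range(25)]


def fake_fetch(page: int, page_size: int) -> dict:
    start = (page - 1) * page_size
    return {"total": len(FAKE_DATA), "companies": FAKE_DATA[start:start + page_size]}


all_companies = list(iter_companies(fake_fetch))
print(f"Fetched {len(all_companies)} companies")
```

Against the real API, `fetch_page` would wrap a `requests.get` call to `/api/v1/companies` with the `page` and `page_size` parameters and return `response.json()`.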