AI Product Studio · Est. 2014
AI Research · 2025
Fundamental Research Labs

ML infrastructure for the frontier.

Fundamental Research Labs is the a16z- and Prosus-backed AI research company behind Shortcut.ai, an AI-powered analyst trusted by KKR, Microsoft, OpenAI, and Anthropic. In two months, we built production-grade ML infrastructure for their model research pipeline: a full-stack deployment platform with intelligent routing between inference providers based on model size, an OpenAI-compatible API layer with authentication and rate limiting, and an evaluation pipeline covering 199+ benchmarks across academic, function-calling, and SWE-Gym software engineering tasks, with distributed compute scaling to thousands of concurrent evaluations.

199+
Benchmarks
EleutherAI lm-eval, BFCL, and SWE-Gym tasks
1000s
Concurrent Evals
Distributed SWE-Gym evaluation on Modal
2mo
Delivered In
Complete ML infrastructure from zero
a16z
Backed By
Andreessen Horowitz and Prosus

Overview

Challenge

Fundamental Research Labs' team spans MIT EECS, Stanford NLP Group, Google X, and Citadel. Beyond Shortcut.ai, they're advancing frontier agent research: their Fairies platform powers autonomous AI agents (originally built for Minecraft simulations under their earlier name, LyfeGame), and Project Sid demonstrated thousands of AI agents developing specialized roles, democratic governance, and cultural propagation. Their model research pipeline needed infrastructure that didn't exist off the shelf: a unified platform to deploy model checkpoints, route them to the right GPU infrastructure based on size and architecture, expose them via standard APIs, and evaluate them against academic benchmarks, function-calling leaderboards, and real-world software engineering tasks. The train, deploy, evaluate, iterate cycle had to be fast, reliable, and repeatable, with full environment isolation.

Solution

We built two interconnected systems. A full-stack Next.js deployment platform handles model lifecycle from checkpoint to live inference endpoint. The platform auto-detects model parameters and routes deployments accordingly: smaller models deploy to Modal with vLLM for fast iteration, larger models route to Baseten with Truss for multi-GPU inference. Every deployment exposes an OpenAI-compatible API with a LiteLLM-based gateway handling API key management, per-key rate limiting, and authentication. A custom evaluation workspace with an embedded Monaco Editor lets researchers write evaluators in JavaScript or Python, upload datasets, and track job results. The companion Evaluation Service, built on EleutherAI's lm-evaluation-harness and powered by Apache Airflow, runs 199+ benchmark tasks including 60+ academic benchmarks like HellaSwag and MMLU, the Berkeley Function Call Leaderboard across 10+ datasets, and SWE-Gym with Docker-based isolation and distributed execution on Modal. A professional CLI wraps the entire pipeline for terminal-first researchers.

Outcome

The platform gave researchers a unified workflow: select a checkpoint, deploy with one click, test in the inference playground, then run comprehensive benchmarks from the CLI or dashboard. Auto-detection of model architecture and tool call format eliminated manual configuration. Environment isolation separated dev experiments from production endpoints. The architecture was designed for extensibility so the team could build on the foundation independently.

The Context

Building tools for the toolmakers

Fundamental Research Labs isn't a typical software company. They're an a16z and Prosus-backed AI research lab whose team comes from MIT EECS, Stanford NLP Group, Google X, and Citadel. Their flagship product, Shortcut.ai, is an AI-powered analyst trusted by firms like KKR, Microsoft, OpenAI, and Anthropic. Their Fairies platform powers autonomous AI agents, originally built for Minecraft simulations under their earlier name LyfeGame. Their research program, Project Sid, demonstrated thousands of AI agents developing specialized roles, democratic governance systems, and cultural propagation.

The work they do requires a constant cycle of training model checkpoints, deploying them for inference, testing quality, and evaluating performance against standardized benchmarks. That cycle needs to be fast, reliable, and repeatable. Manual deployment means wasted researcher time. Inconsistent evaluation means unreliable results. No environment isolation means one experiment can break another.

They needed infrastructure purpose-built for ML research workflows: a platform that understands model checkpoints, GPU requirements, inference APIs, and evaluation pipelines as first-class concepts. Not a generic dashboard bolted onto a deployment script. A system where a researcher can go from checkpoint to benchmark results without context-switching between different tools.

That's what we built.

The Engineering

Model deployment is harder than it looks

Deploying an AI model for inference sounds straightforward until you account for the variables. A 7B parameter model needs different infrastructure than a 70B parameter model. Different model architectures use different tool call formats. Some checkpoints live on HuggingFace, others in cloud storage. GPU selection depends on model size and throughput requirements. GPU memory utilization needs to be pushed high enough to justify the cost without risking out-of-memory crashes.

The deployment platform handles this automatically. It detects parameter count from the checkpoint and routes to the appropriate inference provider: Modal with vLLM for smaller models that benefit from fast cold starts, Baseten with Truss for larger models that need multi-GPU setups. Tool call format detection runs at deploy time so the inference API handles function calling correctly regardless of architecture. GPU tier is configurable per deployment with sensible defaults, and auto-scaling brings idle deployments down to zero cost.
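The routing decision can be pictured as a small pure function. This is an illustrative sketch, not the platform's actual code: the 13B/34B thresholds, provider names, and GPU counts here are assumptions standing in for configurable values.

```python
# Hypothetical sketch of size-based provider routing.
# Thresholds, provider names, and GPU counts are illustrative
# assumptions, not the platform's real configuration.
from dataclasses import dataclass


@dataclass
class DeploymentTarget:
    provider: str   # "modal" or "baseten"
    runtime: str    # "vllm" or "truss"
    gpus: int


def route_checkpoint(param_count_b: float) -> DeploymentTarget:
    """Route a checkpoint to an inference provider by parameter count (billions)."""
    if param_count_b <= 13:
        # Smaller models: fast cold starts on Modal with vLLM, single GPU.
        return DeploymentTarget(provider="modal", runtime="vllm", gpus=1)
    # Larger models: multi-GPU serving on Baseten via Truss.
    gpus = 2 if param_count_b <= 34 else 4
    return DeploymentTarget(provider="baseten", runtime="truss", gpus=gpus)
```

The point of isolating the decision like this is that researchers never make it by hand: parameter count is detected from the checkpoint, and the default route can still be overridden per deployment.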

The inference layer is OpenAI-compatible, so any tool that works with GPT works with these endpoints. A LiteLLM-based API gateway handles authentication with per-key rate limiting (requests per minute and tokens per minute), key expiration, and endpoint caching for fast routing.
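The per-key request limiting the gateway applies can be illustrated with a minimal sliding-window counter. This is a simplified sketch of the semantics, not LiteLLM's implementation; the real gateway also enforces token-per-minute budgets and key expiration.

```python
# Minimal per-key requests-per-minute limiter sketch (illustrative only;
# the production gateway is LiteLLM-based and also enforces TPM limits).
import time
from collections import defaultdict, deque


class RpmLimiter:
    def __init__(self, rpm: int):
        self.rpm = rpm
        self.hits = defaultdict(deque)  # api_key -> timestamps of recent requests

    def allow(self, api_key: str, now: float = None) -> bool:
        """Return True if this key may make a request right now."""
        now = time.monotonic() if now is None else now
        window = self.hits[api_key]
        # Evict requests older than the 60-second window, then check the budget.
        while window and now - window[0] > 60:
            window.popleft()
        if len(window) >= self.rpm:
            return False
        window.append(now)
        return True
```

Because the window is tracked per key, one researcher's batch job hitting its limit never throttles another key's traffic.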

Evaluation is equally complex. The service integrates EleutherAI's lm-evaluation-harness for 60+ academic benchmarks, the Berkeley Function Call Leaderboard for tool-use evaluation across multi-turn and language-specific variants, and SWE-Gym for evaluating models on real GitHub issues. SWE-Gym runs in Docker containers for complete environment isolation, with distributed execution on Modal scaling to thousands of concurrent tasks. Airflow DAGs orchestrate different evaluation modes for local models, API-hosted models, and distributed SWE-Gym workloads.
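The split across evaluation modes can be pictured as a simple dispatcher. The DAG identifiers below are hypothetical names for illustration; the real system encodes this branching as separate Airflow DAGs.

```python
# Illustrative dispatcher mirroring the three Airflow DAG families
# described above. The DAG IDs are hypothetical, not the real ones.
def select_dag(model_kind: str, suite: str) -> str:
    """Pick an orchestration path for an evaluation run.

    model_kind: "local" for a checkpoint on disk, "api" for a hosted endpoint.
    suite: the benchmark family being run.
    """
    if suite == "swe_gym":
        # SWE-Gym always fans out to Docker-isolated workers on Modal.
        return "swe_gym_distributed"
    if model_kind == "api":
        # API-hosted models are evaluated over their inference endpoint.
        return "api_model_eval"
    # Local checkpoints run through the lm-evaluation-harness directly.
    return "local_model_eval"
```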

A professional CLI wraps the entire pipeline. Researchers run evaluations, list available benchmarks, test configurations, and manage settings from the terminal. The CLI supports local models, remote API endpoints, quantization options, automatic device detection, and configurable batch sizes.
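Automatic device detection of the kind the CLI performs can be sketched as a best-effort probe. This is a simplified stand-in for the CLI's actual logic, written to degrade gracefully when PyTorch is absent.

```python
# Best-effort accelerator detection, as a CLI might do it.
# Simplified illustration; falls back to CPU when torch is unavailable.
def detect_device() -> str:
    try:
        import torch
        if torch.cuda.is_available():
            return "cuda"
        # MPS is the Metal backend on Apple Silicon (torch >= 1.12).
        mps = getattr(torch.backends, "mps", None)
        if mps is not None and mps.is_available():
            return "mps"
    except ImportError:
        pass
    return "cpu"
```

A `--device` flag can still override the probe, so researchers keep explicit control when they need it.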

The Transformation

What we rebuilt

Intelligent Model Deployment

Select a checkpoint from HuggingFace or cloud storage and deploy with one click. The platform auto-detects parameter count and routes to the optimal provider: Modal with vLLM for smaller models, Baseten with Truss for larger multi-GPU deployments. Configurable GPU tiers with auto-scaling to zero on idle and instant wake on request.

OpenAI-Compatible Inference API

Every deployed model exposes standard chat completion endpoints. A LiteLLM-based API gateway handles authentication with per-key rate limiting (RPM and TPM), key expiration, and endpoint caching. Any tool built for GPT works with these endpoints out of the box.

Custom Evaluation Workspace

Write evaluator code in JavaScript or Python using an embedded Monaco Editor. Pre-built templates for common evaluation patterns. Upload datasets in JSON/JSONL with syntax highlighting and pagination. Track evaluation jobs through their lifecycle and export results as CSV or JSON.

199+ Benchmark Pipeline

Built on EleutherAI's lm-evaluation-harness with 60+ academic benchmarks, 10+ Berkeley Function Call Leaderboard datasets across simple, multi-turn, parallel, and language-specific variants, and SWE-Gym for real-world software engineering tasks. Airflow DAGs orchestrate evaluation runs with results flowing to PostgreSQL.

Distributed SWE-Gym Evaluation

SWE-Gym integration for evaluating models on real GitHub issues with Docker-based environment isolation per task. Modal distributed compute scales to thousands of concurrent evaluations. Multiple prompt strategies (systematic, TDD, scientific, defensive, simple, rapid). Designed for running comprehensive benchmark suites in hours rather than days.

Professional CLI

A terminal-first interface for researchers who prefer the command line. Run evaluations, list available benchmarks and models, test configurations, and manage settings. Supports local models, API endpoints, quantization options, automatic device detection (CUDA, CPU, MPS), and configurable batch sizes.

Architecture

The new foundation

Two interconnected systems: a deployment platform and an evaluation service. The deployment layer intelligently routes models between inference providers based on parameter count. The evaluation layer orchestrates benchmarks through Airflow with distributed execution on Modal. Supabase provides the data layer with environment isolation. A LiteLLM-based gateway handles API authentication and rate limiting. The entire pipeline supports a researcher going from checkpoint to benchmark results in a single workflow.

Next.js Dashboard

Researcher workspace for deployment, evaluation, and inference

React, TypeScript, Radix UI, Monaco Editor

Modal + vLLM

Smaller model deployment with fast cold starts

FastAPI proxy, auto-scaling

Baseten + Truss

Larger model deployment with multi-GPU inference

FastAPI proxy, multi-GPU

LiteLLM API Gateway

API key management, rate limiting, authentication

OpenAI-compatible proxy

Evaluation Harness

199+ benchmarks across academic, tool-use, and SWE tasks

EleutherAI lm-eval, BFCL, SWE-Gym

Apache Airflow

Orchestration for local, API, and distributed evaluations

DAGs, LocalExecutor, PostgreSQL

Modal + SWE-Gym

Distributed SWE-Gym evaluation at scale

Docker isolation, auto-scaling

Supabase PostgreSQL

Model registry, deployments, evaluations, API keys

Auth, database, env isolation

Cloud Storage

Model checkpoint and evaluation dataset storage

Checkpoint and dataset hosting

Technology Stack

The modern toolkit

Next.js · Frontend
React · UI Library
TypeScript · Language
Tailwind CSS · Styling
Python · Backend
PostgreSQL · Database
Docker · Containers
Hugging Face · Model Registry
vLLM · Inference
Modal · Compute
Baseten · Deployment

The Result

See what emerged

Researcher Dashboard

A unified workspace for the entire model research pipeline. Model management, deployment tracking, evaluation results, API key administration, and inference testing. Environment isolation separates dev, preview, and production.

Dashboard Home

Model overview with deployment status, recent evaluations, and quick actions

Model Management

Browse and manage model checkpoints with auto-detected parameter count and architecture info

API Key Management

Create, configure, and monitor API keys with per-key rate limiting, expiration, and usage tracking

Environment Isolation

Separate dev, preview, and production environments with independent deployments and API keys

Model Deployment

One-click deployment with intelligent infrastructure routing. Select a checkpoint, configure GPU and scaling, and the platform handles parameter detection, provider routing, tool call format detection, and API endpoint registration.

Checkpoint Selection

Browse checkpoints with auto-detected parameter count and architecture information

Deployment Configuration

GPU tier selection, memory settings, concurrency limits, and scaling configuration

Deployment Progress

Real-time tracking of deployment status from provisioning through active inference readiness

Endpoint Management

Active deployments with OpenAI-compatible URLs, health status, and auto-scaling config

Evaluation Workspace

Write custom evaluators with Monaco Editor, upload datasets, run evaluation jobs, and analyze results. Pre-built templates and 199+ benchmark tasks available out of the box.

Evaluator Editor

Monaco-powered code editor for writing custom evaluation logic in JavaScript or Python

Evaluation Templates

Pre-built templates for common evaluation patterns and metrics

Dataset Management

Upload JSON/JSONL datasets with syntax highlighting and pagination

Job Tracking

Monitor evaluation runs with status tracking and CSV/JSON export

Results Analysis

Benchmark scores with visualizations for comparing model performance across tasks

Inference Playground

Test deployed models directly in the browser. Send prompts, test function calling, and validate model behavior before running formal evaluations.

Chat Interface

Interactive prompt testing with streaming responses

Tool Call Testing

Test function calling with auto-detected format parsers

CLI & Pipeline

A professional command-line interface for terminal-first researchers. Run benchmarks, manage models, and configure evaluations. Backed by Airflow for orchestrated, repeatable evaluation runs.

Run Evaluations

Execute benchmarks against local or API models with configurable tasks, quantization, and batch size

Browse Benchmarks

List 199+ available benchmark tasks and supported model configurations

Airflow Orchestration

DAGs for local models, API models, and distributed SWE-Gym workloads

Distributed SWE-Gym

Thousands of concurrent evaluations on Modal with Docker isolation and multiple prompt strategies

Only accepting 1 new project in Q2

Ready to build something extraordinary?

Let's discuss how we can bring your vision to life.

Book a Discovery Call