
Getting Started

TrainTrack integrates into your training loop with just a few lines of code. Follow this guide to get started.

Step 1: Install TrainTrack

Install the client package via pip:

$ pip install traintrack-ai
Step 2: Get Your API Key

Create an API key in the Projects tab of the TrainTrack UI. Then set it as an environment variable:

$ export TRAINTRACK_API_KEY="ttk_your_key_here"
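In notebooks or scripts where a shell export is inconvenient, you can set the variable from Python before creating the hook. This assumes the client reads TRAINTRACK_API_KEY from the environment, as the export above implies:

```python
import os

# Alternative to the shell export above, e.g. inside a notebook.
# Assumes the TrainTrack client reads TRAINTRACK_API_KEY from the
# environment; setdefault avoids clobbering an already-exported key.
os.environ.setdefault("TRAINTRACK_API_KEY", "ttk_your_key_here")
```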
Step 3: Integrate into Your Code

Add the TrainTrackHook to your existing training loop:

from traintrack import TrainTrackHook

# Configure evaluation once
hook = TrainTrackHook(
    model=model,
    tokenizer=tokenizer,
    run_name="my-run",
    categories=["reasoning", "math"],
)

for step, batch in enumerate(dataloader):
    # ... training code ...
    # Uses eval_every_steps from hook config
    hook.step(step)

Anchor Capture

An anchor is a snapshot of your model's outputs before any training. It serves as the baseline for pairwise comparisons (win rate curves). By default, anchors are captured automatically.

Auto Anchor (default)

The anchor is captured automatically — in __init__ for TrainTrackHook and in on_train_begin for TrainTrackCallback. This guarantees the anchor reflects the model's untouched weights.

hook = TrainTrackHook(
    model=model, tokenizer=tokenizer,
    run_name="my-run",
    categories=["reasoning"],
) # ← anchor capture triggers here

Opt-out

Pass capture_anchor=False to skip auto-capture. You can then manually call capture_anchor() later if needed.

hook = TrainTrackHook(
    model=model, tokenizer=tokenizer,
    run_name="my-run",
    categories=["reasoning"],
    capture_anchor=False,
)

# Manually capture later if needed
hook.capture_anchor()

Advanced Configuration

CategoryConfig & MetricConfig

For deeper control, use CategoryConfig and MetricConfig. Override default metrics, judge modes, and rubrics for any category — built-in or custom.

from traintrack import TrainTrackHook, CategoryConfig, MetricConfig

# 1. Add one custom behavior rubric
politeness = MetricConfig(
    name="politeness",
    rubric="Score from 0 (very rude) to 10 (extremely polite)."
)

# 2. Customize one category
reasoning_config = CategoryConfig(
    category="reasoning",
    max_samples=20,
    judge_modes=["criteria", "pairwise_anchor"],
    metrics=["reasoning_quality", politeness]
)

# 3. Pass config directly to hook
hook = TrainTrackHook(
    model=model,
    tokenizer=tokenizer,
    run_name="advanced-run",
    categories=["math", reasoning_config]
)

Custom Datasets

Bring your own evaluation data with CategoryConfig. Point the sources field at one or more JSONL or CSV files; each record must contain a prompt field.

Single Source File

from traintrack import TrainTrackHook, CategoryConfig, MetricConfig

my_eval = CategoryConfig(
    category="coding_interview",
    sources=["data/coding_prompts.jsonl"],
    max_samples=50,
    metrics=["correctness", MetricConfig(
        name="code_quality",
        rubric="0=broken, 5=functional, 10=production-ready and optimized"
    )],
    judge_modes=["criteria"],
)

hook = TrainTrackHook(
    model=model, tokenizer=tokenizer,
    run_name="coding-eval",
    categories=[my_eval],
)

Multiple Source Files

Aggregate prompts from multiple files into a single evaluation category:

safety_eval = CategoryConfig(
    category="safety",
    sources=[
        "data/toxicity_prompts.jsonl",
        "data/bias_prompts.jsonl",
        "data/adversarial_prompts.csv",
    ],
    max_samples=100,
    metrics=["safety", "helpfulness"],
    judge_modes=["criteria", "pairwise_anchor"],
)

Mix Built-In + Custom

Freely combine built-in categories with custom datasets in a single run:

hook = TrainTrackHook(
    model=model, tokenizer=tokenizer,
    run_name="full-eval",
    categories=[
        "math",             # built-in (auto-configured)
        "reasoning",         # built-in
        "hallucination",     # built-in
        safety_eval,          # custom CategoryConfig
    ],
)

Expected JSONL Format

Each line in your JSONL files should contain at minimum a prompt field:

my_prompts.jsonl
{"prompt": "Explain the difference between TCP and UDP."}
{"prompt": "Write a Python function to detect cycles in a linked list."}
{"prompt": "What are the SOLID principles?", "id": "solid-1"}
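If you generate prompt files programmatically, a small helper can write and sanity-check this format. The functions below are illustrative helpers, not part of traintrack; they rely only on the standard library:

```python
import json

def write_prompt_file(path, prompts):
    """Write prompts as JSONL, one {"prompt": ...} record per line.
    Strings are wrapped; dicts are written as-is (e.g. to keep an "id")."""
    with open(path, "w", encoding="utf-8") as f:
        for p in prompts:
            record = p if isinstance(p, dict) else {"prompt": p}
            f.write(json.dumps(record) + "\n")

def validate_prompt_file(path):
    """Raise if any line is not valid JSON or lacks the required "prompt" field."""
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, 1):
            record = json.loads(line)
            if "prompt" not in record:
                raise ValueError(f"line {lineno}: missing 'prompt' field")
```

Running validate_prompt_file before training catches malformed lines early, instead of mid-run when the evaluator first loads the file.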

Built-in Evaluation Packs

TrainTrack ships pre-configured prompts and metrics for comprehensive evaluation. Use them by passing their string names to the categories argument.

Targeted Behavior Packs

reasoning 🧠
Logic, math, and multi-step reasoning.
Metrics: reasoning_quality, problem_solving

instruction_following 📋
Adherence to formatting and constraints.
Metrics: instruction_compliance, format_adherence

hallucination ⚠️
Truthfulness and factual accuracy.
Metrics: truthfulness, factual_accuracy

creativity 🎨
Originality and open-ended writing.
Metrics: originality, expressiveness

Comprehensive Subject Categories

physics
Graduate level. Sources: GPQA, MMLU Pro.
Metrics: correctness, scientific_accuracy, quantitative_reasoning

math
Rigorous problem-solving benchmarks.
Metrics: correctness, mathematical_rigor, problem_solving

computer_science
Coding and technical CS theory.
Metrics: correctness, technical_depth, implementation_quality

law
Legal reasoning and precedent analysis.
Metrics: correctness, legal_analysis, argumentation

chemistry
Organic, inorganic, and general theory.
Metrics: scientific_accuracy, quantitative_reasoning

biology
Molecular and general biology.
Metrics: scientific_accuracy, conceptual_understanding

engineering
Applied science and implementation.
Metrics: technical_depth, practical_applicability

economics
Market theory and analytical depth.
Metrics: analytical_depth, practical_applicability

business
Management and business logic.
Metrics: analytical_depth, practical_applicability

philosophy
Ethics, logic, and rigorous thought.
Metrics: philosophical_rigor, conceptual_understanding

psychology
Behavioral and clinical knowledge.
Metrics: clinical_accuracy, conceptual_understanding

health
Medical facts and clinical soundness.
Metrics: clinical_accuracy, practical_applicability

history
Historical facts and analytical depth.
Metrics: analytical_depth, conceptual_understanding

hallucination
TruthfulQA adversarial testing.
Metrics: truthfulness, hallucination_resistance

instruction_following
Google IFEval compliance testing.
Metrics: instruction_compliance, format_adherence

creativity
Originality and expressive prose.
Metrics: originality, expressiveness

Concepts

Judge Modes

Judge modes determine how each evaluation is performed:

criteria
Scores each output on a 0-10 scale against the metric's rubric.

pairwise_anchor
Compares the current output side by side with the anchor (step 0) output and produces a win-rate (%) curve.
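The arithmetic behind that curve can be sketched in a few lines. This is illustrative only; the library's exact tie handling is not documented here, so counting ties as half a win is an assumed convention:

```python
def win_rate(outcomes):
    """Percentage of pairwise comparisons won against the anchor.
    `outcomes` holds "win", "loss", or "tie" per compared prompt;
    ties count as half a win here (an assumed convention)."""
    if not outcomes:
        return 0.0
    score = sum(
        1.0 if o == "win" else 0.5 if o == "tie" else 0.0
        for o in outcomes
    )
    return 100.0 * score / len(outcomes)
```

Computing this per evaluation step and plotting it over training gives a curve of the same shape as the dashboard's win-rate chart.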

API Reference

TrainTrackHook

class TrainTrackHook(
    model: torch.nn.Module,
    tokenizer: Any,
    run_name: str,
    categories: list[Union[str, CategoryConfig]],
    eval_every_steps: Optional[int] = None,
    eval_every_epochs: Optional[int] = None,
    capture_anchor: bool = True,
    max_new_tokens: int = 256
)
Arguments:

model: The PyTorch model to evaluate. Requires a .generate() method.
tokenizer: Tokenizer with encode/decode methods.
run_name: Unique identifier for this training run.
categories: List of built-in category names or CategoryConfig objects.
eval_every_steps (default None): Trigger evaluation every N steps.
eval_every_epochs (default None): Trigger evaluation every N epochs.
capture_anchor (default True): Capture evaluation outputs before training (step 0) as a baseline.
max_new_tokens (default 256): Maximum number of tokens to generate per prompt.

TrainTrackCallback

class TrainTrackCallback(
    run_name: str,
    categories: list[Union[str, CategoryConfig]],
    model: Optional[Module] = None,
    tokenizer: Optional[Any] = None,
    eval_every_steps: int = 100,
    capture_anchor: bool = True
)
Arguments:

model: Optional. If not provided, falls back to the Trainer's model.
tokenizer: Optional. If not provided, falls back to the Trainer's tokenizer.
capture_anchor (default True): Auto-capture the anchor in on_train_begin.

BuildTrainTrackTinkerEvaluator

Factory for cookbook-style training configs. Returns an evaluator_builder you pass into evaluator_builders=[...].

def BuildTrainTrackTinkerEvaluator(
    run_name: str,
    categories: list[Union[str, CategoryConfig]],
    model_name: Optional[str] = None,
    serial: bool = False,
    ...,
) -> Callable[..., TrainTrackSamplingEvaluator]

Core Arguments

run_name: TrainTrack run identifier shown in dashboards.
categories: Built-in category names or CategoryConfig entries (custom datasets + metrics).
model_name: Tinker model id. Optional in builder mode if the cookbook context already provides it.
serial: If False (default), uses the futures-based evaluator; if True, uses the serial sampling path.
eval_every_steps: Internal cadence gate. Set to None when the training config already controls cadence (e.g. eval_every).
auto_capture_anchor: Automatically capture the step-0 anchor for pairwise-anchor metrics.
server_url, api_key: Optional TrainTrack endpoint and credential overrides.
metadata: Extra metadata attached to every ingestion payload.

Sampling Controls

max_new_tokens: Maximum generated tokens per evaluation prompt.
temperature, top_p: Sampling behavior controls for evaluation generations.
stop_sequences: Optional explicit stop tokens; renderer defaults are used when omitted.
num_samples: Samples requested per prompt from the Tinker sampler.
max_concurrency: Concurrent sampling requests per evaluator tick.
sample_batch_size: Prompt dispatch chunk size, to improve sampler throughput.

Futures / Throughput Controls

max_inflight_requests: Hard cap on queued sampling futures.
max_drain_per_tick: Maximum completed futures resolved each evaluator tick.
drop_when_busy: Drop new eval submissions when the queue is full (instead of waiting).
max_submit_wait_s: Optional wait budget to free queue capacity before dropping.
auto_background_drain: Enable periodic background draining of completed futures.
background_drain_interval_s: Background drain poll interval.
auto_flush_at_exit, exit_flush_timeout_s: Best-effort flush on process exit, and its timeout.
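The queue semantics behind these flags can be pictured with a small, self-contained sketch. This is not TrainTrack's implementation, only an illustration of a hard in-flight cap with drop-when-busy behavior:

```python
from concurrent.futures import ThreadPoolExecutor

class BoundedSubmitter:
    """Illustrative sketch (not TrainTrack internals): a submit path with a
    hard in-flight cap that drops new work when busy, mirroring the
    max_inflight_requests / drop_when_busy semantics described above."""

    def __init__(self, max_inflight=2, drop_when_busy=True):
        self._pool = ThreadPoolExecutor(max_workers=max_inflight)
        self.max_inflight = max_inflight
        self.drop_when_busy = drop_when_busy
        self._inflight = []
        self.dropped = 0

    def submit(self, fn, *args):
        # Drain futures that have already completed (cf. max_drain_per_tick).
        self._inflight = [f for f in self._inflight if not f.done()]
        if len(self._inflight) >= self.max_inflight and self.drop_when_busy:
            # Queue is full: drop instead of blocking the training loop.
            self.dropped += 1
            return None
        fut = self._pool.submit(fn, *args)
        self._inflight.append(fut)
        return fut
```

The point of dropping rather than waiting is that evaluation never stalls a training step; a dropped submission only means one fewer point on the behavior curve.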

Integration Controls

auto_patch_supervised_step_passthrough: Auto-patches the cookbook supervised eval path to pass an explicit step to the evaluator.
infer_step_from_sampling_client_name: Fallback step inference from snapshot names when the step is not passed explicitly.
snapshot_name_prefix, checkpoint_prefix: Naming controls for saved sampler snapshots and checkpoint tags.

CreateTrainTrackTinkerEvaluator

Direct instance creator for custom loops. Returns an evaluator object you call with run(...), step(...), or evaluate_training_step_async(...).

def CreateTrainTrackTinkerEvaluator(
    run_name: str,
    categories: list[Union[str, CategoryConfig]],
    model_name: str,
    serial: bool = False,
    ...,
) -> TrainTrackSamplingEvaluator
Arguments:

run_name, categories, model_name: Required core configuration: run id, evaluation categories, and Tinker model.
training_client: Optional bound training client, enabling no-boilerplate calls like evaluator.run(step=step).
eval_every_steps: Internal cadence controller for run()/step() loop calls.
serial: Selects the serial evaluator path; the futures path remains the default.
auto_capture_anchor: Captures and sends the anchor (step 0) automatically for pairwise-anchor metrics.
capture_anchor_on_init, anchor_training_client: Optional init-time anchor capture path; useful when an explicit pre-training anchor is required.
max_new_tokens, temperature, top_p, stop_sequences, num_samples: Sampling behavior controls for evaluation generations.
max_concurrency, sample_batch_size: Throughput controls for prompt sampling.
max_inflight_requests, max_drain_per_tick, drop_when_busy, max_submit_wait_s: Futures queue controls (active when serial=False).
auto_background_drain, background_drain_interval_s: Background completion polling for futures mode.
auto_flush_at_exit, exit_flush_timeout_s: Best-effort flush of pending futures/checkpoints on process exit.
auto_patch_supervised_step_passthrough, infer_step_from_sampling_client_name: Step consistency controls when integrating with cookbook evaluators.
snapshot_name_prefix, checkpoint_prefix: Naming strategy for Tinker snapshots and TrainTrack checkpoints.
renderer_name, tokenizer: Optional renderer/tokenizer overrides for custom runtime setups.
server_url, api_key, metadata: TrainTrack transport settings and payload metadata enrichment.

CategoryConfig

@dataclass
class CategoryConfig:
    category: str
    sources: Optional[list[str]] = None
    max_samples: int = 20
    judge_modes: Optional[list[str]] = None
    metrics: Optional[list[Union[str, MetricConfig]]] = None
Fields:

category: Name of a built-in category (e.g. "reasoning") or a custom identifier.
sources (default None): List of file paths (.jsonl/.csv) containing prompts. Required for custom categories; auto-filled for built-ins.
max_samples (default 20): Number of prompts to sample.
judge_modes (default ["criteria"]): "criteria" (scoring) and/or "pairwise_anchor" (win rate).
metrics (default []): List of metric names or MetricConfig objects.

MetricConfig

@dataclass
class MetricConfig:
    name: str
    rubric: Optional[str] = None
Fields:

name: Name of the behavior to monitor (e.g. "politeness").
rubric (default None): Plain-text description for the LLM judge to use when scoring.

Glossary

LLM as a Judge

The practice of using a high-capability model (like GPT-4o) to evaluate the outputs of a smaller or domain-specific model against semantically meaningful rubrics.

Behavior Curve

A graph showing how a specific behavior (e.g. Reasoning or Hallucination Risk) changes over the course of training, rather than just raw cross-entropy loss.

Anchor Output

The output generated by the model before training begins (step 0). Used as a baseline for pairwise comparisons to measure progress.

Pareto Frontier

The Pareto frontier is the set of non-dominated solutions in multi-objective optimization: points where improving one objective requires sacrificing another. TrainTrack identifies the Pareto frontier of your checkpoints by measuring multiple behavior dimensions simultaneously, helping you choose the right balance for your use case.
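The underlying computation is simple to state. The function below is an illustration of the concept, not TrainTrack's implementation; it treats each checkpoint as a tuple of behavior scores, all to be maximized:

```python
def pareto_frontier(points):
    """Return the non-dominated points, maximizing every dimension.
    A point p is dominated if some other point q scores at least as
    high in every dimension and differs from p (hence strictly higher
    in at least one). Illustrative only, not TrainTrack internals."""
    return [
        p for p in points
        if not any(
            q != p and all(qi >= pi for qi, pi in zip(q, p))
            for q in points
        )
    ]
```

For scores like (reasoning, safety), a checkpoint at (0.9, 0.2) and one at (0.5, 0.8) both sit on the frontier: neither beats the other in every dimension, so the choice between them is a product decision.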

Need immediate support?

Email us at shavon.thadani@gmail.com or shaylin.thadani@gmail.com. We guarantee a response within one hour.
