Getting Started
TrainTrack integrates into your training loop with just a few lines of code. Follow this guide to get started.
Install TrainTrack
Install the client package via pip:
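The exact command depends on how the package is published; assuming the client ships on PyPI under the same name as the Python import (traintrack), a minimal install looks like:

pip install traintrack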
Get Your API Key
Create an API key in the Projects tab of the TrainTrack UI. Then set it as an environment variable:
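For example, in a POSIX shell (the variable name TRAINTRACK_API_KEY is an assumption; use the name shown in the UI when you create the key):

export TRAINTRACK_API_KEY="<your-api-key>"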
Integrate into Your Code
Add the TrainTrackHook to your existing training loop:
from traintrack import TrainTrackHook

# Configure evaluation once
hook = TrainTrackHook(
    model=model,
    tokenizer=tokenizer,
    run_name="my-run",
    categories=["reasoning", "math"],
)

for step, batch in enumerate(dataloader):
    # ... training code ...

    # Uses eval_every_steps from hook config
    hook.step(step)
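The comment above assumes an evaluation cadence was configured on the hook. Per the API reference below, that cadence comes from eval_every_steps (or eval_every_epochs) at construction time; a minimal sketch with an illustrative value:

hook = TrainTrackHook(
    model=model,
    tokenizer=tokenizer,
    run_name="my-run",
    categories=["reasoning", "math"],
    eval_every_steps=100,  # illustrative: evaluate every 100 training steps
)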
Use the TrainTrackCallback with the Hugging Face Trainer:
from traintrack import TrainTrackCallback

# Callback handles anchor + scheduled evals
callback = TrainTrackCallback(
    run_name="my-hf-run",
    categories=["reasoning", "math"],
    model=model,
    tokenizer=tokenizer,
)

trainer = Trainer(
    model=model,
    callbacks=[callback],
    # ... other args
)
For Tinker, you can use a builder (for cookbook configs) or create an evaluator directly (for custom loops):
from traintrack import BuildTrainTrackTinkerEvaluator

# Use builder in tinker_cookbook config
traintrack_eval = BuildTrainTrackTinkerEvaluator(
    run_name="my-tinker-run",
    model_name="meta-llama/Llama-3.1-8B",
    categories=["reasoning"],
)

config = train.Config(
    ...,
    eval_every=10,
    evaluator_builders=[traintrack_eval],
)
from traintrack import CreateTrainTrackTinkerEvaluator

# Use directly in custom loops
evaluator = CreateTrainTrackTinkerEvaluator(
    run_name="my-tinker-run",
    model_name="meta-llama/Llama-3.1-8B",
    categories=["reasoning"],
    eval_every_steps=10,
)

evaluator.run(training_client=training_client, step=step)
Anchor Capture
An anchor is a snapshot of your model's outputs before any training. It serves as the baseline for pairwise comparisons (win rate curves). By default, anchors are captured automatically.
Auto Anchor (default)
The anchor is captured automatically: in __init__ for TrainTrackHook and in on_train_begin for TrainTrackCallback. This guarantees the anchor reflects the model's untouched weights.
hook = TrainTrackHook(
    model=model,
    tokenizer=tokenizer,
    run_name="my-run",
    categories=["reasoning"],
)  # ← anchor capture triggers here
Opt-out
Pass capture_anchor=False to skip auto-capture. You can then manually call capture_anchor() later if needed.
hook = TrainTrackHook(
    model=model,
    tokenizer=tokenizer,
    run_name="my-run",
    categories=["reasoning"],
    capture_anchor=False,
)

# Manually capture later if needed
hook.capture_anchor()
Advanced Configuration
CategoryConfig & MetricConfig
For deeper control, use CategoryConfig and MetricConfig. Override default metrics, judge modes, and rubrics for any category, built-in or custom.
from traintrack import TrainTrackHook, CategoryConfig, MetricConfig

# 1. Add one custom behavior rubric
politeness = MetricConfig(
    name="politeness",
    rubric="Score from 0 (very rude) to 10 (extremely polite).",
)

# 2. Customize one category
reasoning_config = CategoryConfig(
    category="reasoning",
    max_samples=20,
    judge_modes=["criteria", "pairwise_anchor"],
    metrics=["reasoning_quality", politeness],
)

# 3. Pass config directly to hook
hook = TrainTrackHook(
    model=model,
    tokenizer=tokenizer,
    run_name="advanced-run",
    categories=["math", reasoning_config],
)
from traintrack import TrainTrackCallback, CategoryConfig, MetricConfig

# 1. Add one custom behavior rubric
politeness = MetricConfig(
    name="politeness",
    rubric="Score from 0 (very rude) to 10 (extremely polite).",
)

# 2. Customize one category
reasoning_config = CategoryConfig(
    category="reasoning",
    metrics=[politeness],
)

# 3. Pass config to callback
callback = TrainTrackCallback(
    run_name="hf-advanced-run",
    categories=[reasoning_config],
    model=model,
    tokenizer=tokenizer,
)

trainer = Trainer(model=model, callbacks=[callback])
from traintrack import CategoryConfig, MetricConfig, BuildTrainTrackTinkerEvaluator

# 1. Define a custom metric rubric
reasoning_clarity = MetricConfig(
    name="reasoning_clarity",
    rubric="0=unclear, 10=clear and logically ordered.",
)

# 2. Build custom category config
reasoning_cfg = CategoryConfig(
    category="reasoning",
    judge_modes=["criteria", "pairwise_anchor"],
    metrics=["reasoning_quality", reasoning_clarity],
)

# 3. Register builder in tinker config
traintrack_eval = BuildTrainTrackTinkerEvaluator(
    run_name="tinker-advanced-run",
    model_name="meta-llama/Llama-3.1-8B",
    categories=[reasoning_cfg],
)
from traintrack import CategoryConfig, MetricConfig, CreateTrainTrackTinkerEvaluator

# 1. Define a custom metric rubric
reasoning_clarity = MetricConfig(
    name="reasoning_clarity",
    rubric="0=unclear, 10=clear and logically ordered.",
)

# 2. Configure custom category
reasoning_cfg = CategoryConfig(
    category="reasoning",
    metrics=[reasoning_clarity],
)

# 3. Use evaluator directly in loop
evaluator = CreateTrainTrackTinkerEvaluator(
    run_name="tinker-advanced-run",
    model_name="meta-llama/Llama-3.1-8B",
    categories=[reasoning_cfg],
    eval_every_steps=10,
)
Custom Datasets
Bring your own evaluation data with CategoryConfig. Point the sources field to one or more JSONL or CSV files. Each file should contain a prompt field.
Single Source File
from traintrack import TrainTrackHook, CategoryConfig, MetricConfig

my_eval = CategoryConfig(
    category="coding_interview",
    sources=["data/coding_prompts.jsonl"],
    max_samples=50,
    metrics=[
        "correctness",
        MetricConfig(
            name="code_quality",
            rubric="0=broken, 5=functional, 10=production-ready and optimized",
        ),
    ],
    judge_modes=["criteria"],
)

hook = TrainTrackHook(
    model=model,
    tokenizer=tokenizer,
    run_name="coding-eval",
    categories=[my_eval],
)
Multiple Source Files
Aggregate prompts from multiple files into a single evaluation category:
safety_eval = CategoryConfig(
    category="safety",
    sources=[
        "data/toxicity_prompts.jsonl",
        "data/bias_prompts.jsonl",
        "data/adversarial_prompts.csv",
    ],
    max_samples=100,
    metrics=["safety", "helpfulness"],
    judge_modes=["criteria", "pairwise_anchor"],
)
Mix Built-In + Custom
Freely combine built-in categories with custom datasets in a single run:
hook = TrainTrackHook(
    model=model,
    tokenizer=tokenizer,
    run_name="full-eval",
    categories=[
        "math",           # built-in (auto-configured)
        "reasoning",      # built-in
        "hallucination",  # built-in
        safety_eval,      # custom CategoryConfig
    ],
)
Expected JSONL Format
Each line in your JSONL files should contain at minimum a prompt field:
{"prompt": "Explain the difference between TCP and UDP."}
{"prompt": "Write a Python function to detect cycles in a linked list."}
{"prompt": "What are the SOLID principles?", "id": "solid-1"}
Built-in Evaluation Packs
TrainTrack includes pre-configured prompts and metrics for comprehensive evaluation.
You can use these by simply passing their string names to the categories argument.
Targeted Behavior Packs

- Reasoning: logic, math, and multi-step reasoning.
- Instruction following: adherence to formatting and constraints.
- Hallucination: truthfulness and factual accuracy.
- Creative writing: originality and open-ended writing.

Comprehensive Subject Categories

Graduate level. Source: GPQA, MMLU Pro.

- Mathematics: rigorous problem-solving benchmarking.
- Computer science: coding and technical CS theory.
- Law: legal reasoning and precedent analysis.
- Chemistry: organic, inorganic, and general theory.
- Biology: molecular and general biology.
- Engineering: applied science and implementation.
- Economics: market theory and analytical depth.
- Business: management and business logic.
- Philosophy: ethics, logic, and rigorous thought.
- Psychology: behavioral and clinical knowledge.
- Medicine: medical facts and clinical soundness.
- History: historical facts and analytical depth.
- TruthfulQA adversarial testing.
- Google IFEval compliance testing.
- Originality and expressive prose.
Concepts
Judge Modes
Judge modes determine how evaluations are performed:

- criteria: scores outputs on a 0-10 scale based on the metric's rubric.
- pairwise_anchor: compares the current output side-by-side with the anchor (step 0) output and generates a Win Rate % curve.
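Both modes can be enabled for the same category via CategoryConfig; a short sketch using the fields documented in the API reference below:

from traintrack import CategoryConfig

reasoning_cfg = CategoryConfig(
    category="reasoning",
    judge_modes=["criteria", "pairwise_anchor"],  # rubric scores plus win rate vs. the anchor
)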
API Reference
TrainTrackHook

TrainTrackHook(
    model: torch.nn.Module,
    tokenizer: Any,
    run_name: str,
    categories: list[Union[str, CategoryConfig]],
    eval_every_steps: Optional[int] = None,
    eval_every_epochs: Optional[int] = None,
    capture_anchor: bool = True,
    max_new_tokens: int = 256,
)
| Argument | Description |
|---|---|
| model | The PyTorch model to evaluate. Requires a .generate() method. |
| tokenizer | Tokenizer with encode/decode methods. |
| run_name | Unique identifier for this training run. |
| categories | List of built-in category names or CategoryConfig objects. |
| eval_every_steps Default: None | Trigger evaluation every N steps. |
| eval_every_epochs Default: None | Trigger evaluation every N epochs. |
| capture_anchor Default: True | If True, captures evaluation outputs before training (step 0) as a baseline. |
| max_new_tokens Default: 256 | Maximum number of tokens to generate per prompt. |
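For example, to evaluate on an epoch schedule rather than a step schedule, the documented arguments can be combined like this (values are illustrative):

hook = TrainTrackHook(
    model=model,
    tokenizer=tokenizer,
    run_name="my-run",
    categories=["reasoning"],
    eval_every_epochs=1,   # evaluate once per epoch
    max_new_tokens=512,    # allow longer generations per prompt
)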
TrainTrackCallback

TrainTrackCallback(
    run_name: str,
    categories: list[Union[str, CategoryConfig]],
    model: Optional[Module] = None,
    tokenizer: Optional[Any] = None,
    eval_every_steps: int = 100,
    capture_anchor: bool = True,
)
| Argument | Description |
|---|---|
| model | Optional. If not provided, tries to use the Trainer's model. |
| tokenizer | Optional. If not provided, tries to use the Trainer's tokenizer. |
| capture_anchor Default: True | Auto-capture anchor in on_train_begin. |
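Since model and tokenizer fall back to the Trainer's own objects, the callback can be constructed from run metadata alone. A sketch under that assumption (training_args and train_dataset are placeholders):

from traintrack import TrainTrackCallback
from transformers import Trainer

callback = TrainTrackCallback(
    run_name="my-hf-run",
    categories=["reasoning"],
    eval_every_steps=200,  # documented default is 100
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    callbacks=[callback],  # model/tokenizer are picked up from the Trainer
)
trainer.train()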
BuildTrainTrackTinkerEvaluator
Factory for cookbook-style training configs. Returns an evaluator_builder you pass into evaluator_builders=[...].

BuildTrainTrackTinkerEvaluator(
    run_name: str,
    categories: list[Union[str, CategoryConfig]],
    model_name: Optional[str] = None,
    serial: bool = False,
    ...,
) -> Callable[..., TrainTrackSamplingEvaluator]
Core Arguments
| Argument | Description |
|---|---|
| run_name | TrainTrack run identifier shown in dashboards. |
| categories | Built-in category names or CategoryConfig entries (custom datasets + metrics). |
| model_name | Tinker model id. Optional in builder mode if the cookbook context already provides it. |
| serial | If False (default), uses the futures-based evaluator; if True, uses the serial sampling path. |
| eval_every_steps | Internal cadence gate. Set to None when the training config already controls cadence (e.g. eval_every). |
| auto_capture_anchor | Automatically capture step-0 anchor for pairwise-anchor metrics. |
| server_url, api_key | Optional TrainTrack endpoint + credential overrides. |
| metadata | Extra metadata attached to every ingestion payload. |
Sampling Controls
| Argument | Description |
|---|---|
| max_new_tokens | Maximum generated tokens per evaluation prompt. |
| temperature, top_p | Sampling behavior controls for evaluation generations. |
| stop_sequences | Optional explicit stop tokens; renderer defaults are used when omitted. |
| num_samples | Samples requested per prompt from Tinker sampler. |
| max_concurrency | Concurrent sampling requests per evaluator tick. |
| sample_batch_size | Prompt dispatch chunk size to improve sampler throughput. |
Futures / Throughput Controls
| Argument | Description |
|---|---|
| max_inflight_requests | Hard cap for queued sampling futures. |
| max_drain_per_tick | Max completed futures resolved each evaluator tick. |
| drop_when_busy | Drop new eval submissions when queue is full (instead of waiting). |
| max_submit_wait_s | Optional wait budget to free queue capacity before dropping. |
| auto_background_drain | Enable periodic background draining of completed futures. |
| background_drain_interval_s | Background drain poll interval. |
| auto_flush_at_exit, exit_flush_timeout_s | Best-effort flush on process exit and its timeout. |
Integration Controls
| Argument | Description |
|---|---|
| auto_patch_supervised_step_passthrough | Auto-patches cookbook supervised eval path to pass explicit step to evaluator. |
| infer_step_from_sampling_client_name | Fallback step inference from snapshot names when step is not passed explicitly. |
| snapshot_name_prefix, checkpoint_prefix | Naming controls for saved sampler snapshots and checkpoint tags. |
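A sketch combining a few of these arguments (argument names are taken from the tables above; the values are illustrative):

from traintrack import BuildTrainTrackTinkerEvaluator

traintrack_eval = BuildTrainTrackTinkerEvaluator(
    run_name="tinker-throughput-run",
    model_name="meta-llama/Llama-3.1-8B",
    categories=["reasoning"],
    eval_every_steps=None,     # let the training config's eval_every drive cadence
    max_new_tokens=256,
    temperature=0.7,
    max_concurrency=8,
    max_inflight_requests=32,
    drop_when_busy=True,       # drop new eval submissions rather than block training
    auto_background_drain=True,
)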
CreateTrainTrackTinkerEvaluator
Direct instance creator for custom loops. Returns an evaluator object you call with run(...), step(...), or evaluate_training_step_async(...).

CreateTrainTrackTinkerEvaluator(
    run_name: str,
    categories: list[Union[str, CategoryConfig]],
    model_name: str,
    serial: bool = False,
    ...,
) -> TrainTrackSamplingEvaluator
| Argument | Description |
|---|---|
| run_name, categories, model_name | Required core configuration for run id, evaluation dataset categories, and Tinker model. |
| training_client | Optional bound training client, enabling no-boilerplate calls like evaluator.run(step=step). |
| eval_every_steps | Internal cadence controller for run()/step() loop calls. |
| serial | Selects serial evaluator path; futures path remains default. |
| auto_capture_anchor | Captures and sends anchor (step 0) automatically for pairwise-anchor metrics. |
| capture_anchor_on_init, anchor_training_client | Optional init-time anchor capture path; useful when explicit pre-training anchor capture is required. |
| max_new_tokens, temperature, top_p, stop_sequences, num_samples | Sampling behavior controls used for evaluation generations. |
| max_concurrency, sample_batch_size | Throughput controls for prompt sampling. |
| max_inflight_requests, max_drain_per_tick, drop_when_busy, max_submit_wait_s | Futures queue controls (active when serial=False). |
| auto_background_drain, background_drain_interval_s | Background completion polling for futures mode. |
| auto_flush_at_exit, exit_flush_timeout_s | Best-effort flush of pending futures/checkpoints on process exit. |
| auto_patch_supervised_step_passthrough, infer_step_from_sampling_client_name | Step consistency controls when integrating with cookbook evaluators. |
| snapshot_name_prefix, checkpoint_prefix | Naming strategy for Tinker snapshots and TrainTrack checkpoints. |
| renderer_name, tokenizer | Optional renderer/tokenizer overrides for custom runtime setups. |
| server_url, api_key, metadata | TrainTrack transport settings and payload metadata enrichment. |
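With a bound training_client, the per-step call reduces to evaluator.run(step=step), as noted above. A sketch of a custom loop under that assumption (num_steps is a placeholder):

from traintrack import CreateTrainTrackTinkerEvaluator

evaluator = CreateTrainTrackTinkerEvaluator(
    run_name="my-tinker-run",
    model_name="meta-llama/Llama-3.1-8B",
    categories=["reasoning"],
    eval_every_steps=50,
    training_client=training_client,  # bound once, reused on every call
)

for step in range(num_steps):
    # ... training update ...
    evaluator.run(step=step)  # only evaluates when the cadence gate fires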
CategoryConfig
class CategoryConfig:
    category: str
    sources: Optional[list[str]] = None
    max_samples: int = 20
    judge_modes: Optional[list[str]] = None
    metrics: Optional[list[Union[str, MetricConfig]]] = None
| Field | Description |
|---|---|
| category | Name of built-in category (e.g. "reasoning") or custom identifier. |
| sources Default: None | List of file paths (.jsonl/.csv) containing prompts. Required for custom categories. Auto-filled for built-ins. |
| max_samples Default: 20 | Number of prompts to sample. |
| judge_modes Default: ["criteria"] | "criteria" (scoring) and/or "pairwise_anchor" (win rate). |
| metrics Default: [] | List of metric names or MetricConfig objects. |
MetricConfig
class MetricConfig:
    name: str
    rubric: Optional[str] = None
| Field | Description |
|---|---|
| name | Name of the behavior to monitor (e.g. "politeness"). |
| rubric Default: None | Plain-text description for the LLM judge to use when scoring. |
Glossary
LLM as a Judge
The practice of using a high-capability model (like GPT-4o) to evaluate the outputs of a smaller or domain-specific model against semantically meaningful rubrics.
Behavior Curve
A graph showing how a specific behavior (e.g. Reasoning or Hallucination Risk) changes over the course of training, rather than just raw cross-entropy loss.
Anchor Output
The output generated by the model before training begins (step 0). Used as a baseline for pairwise comparisons to measure progress.
Pareto Frontier
The Pareto frontier is the set of optimal, non-dominated solutions in multi-objective optimization, where improving one objective requires sacrificing another. TrainTrack identifies the Pareto frontier of your models by measuring multiple behavior dimensions simultaneously, helping you choose the right balance for your specific use case.
Need immediate support?
Email us at shavon.thadani@gmail.com or shaylin.thadani@gmail.com. We guarantee a response in less than 1 hour.