Getting Started
TrainTrack integrates into your training loop with just a few lines of code. Follow this guide to get started.
Install TrainTrack
Install the client package via pip:
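The exact command depends on how the package is published; assuming the client ships on PyPI under the same name as the Python import (traintrack), a minimal install looks like:

pip install traintrack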
Get Your API Key
Create an API key in the Projects tab of the TrainTrack UI. Then set it as an environment variable:
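For example, in a POSIX shell (the variable name TRAINTRACK_API_KEY is an assumption; use the name shown in the UI when you create the key):

export TRAINTRACK_API_KEY="<your-api-key>"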
Integrate into Your Code
Add the TrainTrackHook to your existing training loop:
from traintrack import TrainTrackHook

# Configure evaluation once
hook = TrainTrackHook(
    model=model,
    tokenizer=tokenizer,
    run_name="my-run",
    categories=["reasoning", "math"],
)

for step, batch in enumerate(dataloader):
    # ... training code ...

    # Uses eval_every_steps from hook config
    hook.step(step)
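The comment above assumes an evaluation cadence was configured on the hook. Per the API reference below, that cadence comes from eval_every_steps (or eval_every_epochs) at construction time; a minimal sketch with an illustrative value:

hook = TrainTrackHook(
    model=model,
    tokenizer=tokenizer,
    run_name="my-run",
    categories=["reasoning", "math"],
    eval_every_steps=100,  # illustrative: evaluate every 100 training steps
)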
Use the TrainTrackCallback with the Hugging Face Trainer:
from traintrack import TrainTrackCallback

# Callback handles anchor + scheduled evals
callback = TrainTrackCallback(
    run_name="my-hf-run",
    categories=["reasoning", "math"],
    model=model,
    tokenizer=tokenizer,
)

trainer = Trainer(
    model=model,
    callbacks=[callback],
    # ... other args
)
For Tinker, you can use a builder (for cookbook configs) or create an evaluator directly (for custom loops):
from traintrack import BuildTrainTrackTinkerEvaluator

# Use builder in tinker_cookbook config
traintrack_eval = BuildTrainTrackTinkerEvaluator(
    run_name="my-tinker-run",
    model_name="meta-llama/Llama-3.1-8B",
    categories=["reasoning"],
)

config = train.Config(
    ...,
    eval_every=10,
    evaluator_builders=[traintrack_eval],
)
from traintrack import CreateTrainTrackTinkerEvaluator

# Use directly in custom loops
evaluator = CreateTrainTrackTinkerEvaluator(
    run_name="my-tinker-run",
    model_name="meta-llama/Llama-3.1-8B",
    categories=["reasoning"],
    eval_every_steps=10,
)

evaluator.run(training_client=training_client, step=step)
Anchor Capture
An anchor is a snapshot of your model's outputs before any training. It serves as the baseline for pairwise comparisons (win rate curves). By default, anchors are captured automatically.
Auto Anchor (default)
The anchor is captured automatically: in __init__ for TrainTrackHook and in on_train_begin for TrainTrackCallback. This guarantees the anchor reflects the model's untouched weights.
hook = TrainTrackHook(
    model=model,
    tokenizer=tokenizer,
    run_name="my-run",
    categories=["reasoning"],
)  # ← anchor capture triggers here
Opt-out
Pass capture_anchor=False to skip auto-capture. You can then manually call capture_anchor() later if needed.
hook = TrainTrackHook(
    model=model,
    tokenizer=tokenizer,
    run_name="my-run",
    categories=["reasoning"],
    capture_anchor=False,
)

# Manually capture later if needed
hook.capture_anchor()
Advanced Configuration
CategoryConfig & MetricConfig
For deeper control, use CategoryConfig and MetricConfig. Override default metrics, judge modes, and rubrics for any category, built-in or custom.
from traintrack import TrainTrackHook, CategoryConfig, MetricConfig

# 1. Add one custom behavior rubric
politeness = MetricConfig(
    name="politeness",
    rubric="Score from 0 (very rude) to 10 (extremely polite).",
)

# 2. Customize one category
reasoning_config = CategoryConfig(
    category="reasoning",
    max_samples=20,
    judge_modes=["criteria", "pairwise_anchor"],
    metrics=["reasoning_quality", politeness],
)

# 3. Pass config directly to hook
hook = TrainTrackHook(
    model=model,
    tokenizer=tokenizer,
    run_name="advanced-run",
    categories=["math", reasoning_config],
)
from traintrack import TrainTrackCallback, CategoryConfig, MetricConfig

# 1. Add one custom behavior rubric
politeness = MetricConfig(
    name="politeness",
    rubric="Score from 0 (very rude) to 10 (extremely polite).",
)

# 2. Customize one category
reasoning_config = CategoryConfig(
    category="reasoning",
    metrics=[politeness],
)

# 3. Pass config to callback
callback = TrainTrackCallback(
    run_name="hf-advanced-run",
    categories=[reasoning_config],
    model=model,
    tokenizer=tokenizer,
)

trainer = Trainer(model=model, callbacks=[callback])
from traintrack import CategoryConfig, MetricConfig, BuildTrainTrackTinkerEvaluator

# 1. Define a custom metric rubric
reasoning_clarity = MetricConfig(
    name="reasoning_clarity",
    rubric="0=unclear, 10=clear and logically ordered.",
)

# 2. Build custom category config
reasoning_cfg = CategoryConfig(
    category="reasoning",
    judge_modes=["criteria", "pairwise_anchor"],
    metrics=["reasoning_quality", reasoning_clarity],
)

# 3. Register builder in tinker config
traintrack_eval = BuildTrainTrackTinkerEvaluator(
    run_name="tinker-advanced-run",
    model_name="meta-llama/Llama-3.1-8B",
    categories=[reasoning_cfg],
)
from traintrack import CategoryConfig, MetricConfig, CreateTrainTrackTinkerEvaluator

# 1. Define a custom metric rubric
reasoning_clarity = MetricConfig(
    name="reasoning_clarity",
    rubric="0=unclear, 10=clear and logically ordered.",
)

# 2. Configure custom category
reasoning_cfg = CategoryConfig(
    category="reasoning",
    metrics=[reasoning_clarity],
)

# 3. Use evaluator directly in loop
evaluator = CreateTrainTrackTinkerEvaluator(
    run_name="tinker-advanced-run",
    model_name="meta-llama/Llama-3.1-8B",
    categories=[reasoning_cfg],
    eval_every_steps=10,
)
Custom Datasets
Bring your own evaluation data with CategoryConfig. Point the sources field to one or more JSONL or CSV files. Each file should contain a prompt field.
Single Source File
from traintrack import TrainTrackHook, CategoryConfig, MetricConfig

my_eval = CategoryConfig(
    category="coding_interview",
    sources=["data/coding_prompts.jsonl"],
    max_samples=50,
    metrics=[
        "correctness",
        MetricConfig(
            name="code_quality",
            rubric="0=broken, 5=functional, 10=production-ready and optimized",
        ),
    ],
    judge_modes=["criteria"],
)

hook = TrainTrackHook(
    model=model,
    tokenizer=tokenizer,
    run_name="coding-eval",
    categories=[my_eval],
)
Multiple Source Files
Aggregate prompts from multiple files into a single evaluation category:
safety_eval = CategoryConfig(
    category="safety",
    sources=[
        "data/toxicity_prompts.jsonl",
        "data/bias_prompts.jsonl",
        "data/adversarial_prompts.csv",
    ],
    max_samples=100,
    metrics=["safety", "helpfulness"],
    judge_modes=["criteria", "pairwise_anchor"],
)
Mix Built-In + Custom
Freely combine built-in categories with custom datasets in a single run:
hook = TrainTrackHook(
    model=model,
    tokenizer=tokenizer,
    run_name="full-eval",
    categories=[
        "math",           # built-in (auto-configured)
        "reasoning",      # built-in
        "hallucination",  # built-in
        safety_eval,      # custom CategoryConfig
    ],
)
Expected JSONL Format
Each line in your JSONL files should contain at minimum a prompt field:
{"prompt": "Explain the difference between TCP and UDP."}
{"prompt": "Write a Python function to detect cycles in a linked list."}
{"prompt": "What are the SOLID principles?", "id": "solid-1"}
Built-in Evaluation Packs
TrainTrack includes pre-configured prompts and metrics for comprehensive evaluation.
You can use these by simply passing their string names to the categories argument.
Targeted Behavior Packs

- Reasoning: logic, math, and multi-step reasoning.
- Instruction following: adherence to formatting and constraints.
- Hallucination: truthfulness and factual accuracy.
- Creative writing: originality and open-ended writing.

Comprehensive Subject Categories

Graduate level. Source: GPQA, MMLU Pro.

- Mathematics: rigorous problem-solving benchmarking.
- Computer science: coding and technical CS theory.
- Law: legal reasoning and precedent analysis.
- Chemistry: organic, inorganic, and general theory.
- Biology: molecular and general biology.
- Engineering: applied science and implementation.
- Economics: market theory and analytical depth.
- Business: management and business logic.
- Philosophy: ethics, logic, and rigorous thought.
- Psychology: behavioral and clinical knowledge.
- Medicine: medical facts and clinical soundness.
- History: historical facts and analytical depth.
- TruthfulQA adversarial testing.
- Google IFEval compliance testing.
- Originality and expressive prose.
Concepts
Judge Modes
Judge modes determine how evaluations are performed:

- criteria: scores outputs on a 0-10 scale based on the metric's rubric.
- pairwise_anchor: compares the current output side-by-side with the anchor (step 0) output and generates a Win Rate % curve.
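Both modes can be enabled for the same category via CategoryConfig; a short sketch using the fields documented in the API reference below:

from traintrack import CategoryConfig

reasoning_cfg = CategoryConfig(
    category="reasoning",
    judge_modes=["criteria", "pairwise_anchor"],  # rubric scores plus win rate vs. the anchor
)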
API Reference
TrainTrackHook

TrainTrackHook(
    model: torch.nn.Module,
    tokenizer: Any,
    run_name: str,
    categories: list[Union[str, CategoryConfig]],
    eval_every_steps: Optional[int] = None,
    eval_every_epochs: Optional[int] = None,
    capture_anchor: bool = True,
    max_new_tokens: int = 256,
)
| Argument | Description |
|---|---|
| model | The PyTorch model to evaluate. Requires a .generate() method. |
| tokenizer | Tokenizer with encode/decode methods. |
| run_name | Unique identifier for this training run. |
| categories | List of built-in category names or CategoryConfig objects. |
| eval_every_steps Default: None | Trigger evaluation every N steps. |
| eval_every_epochs Default: None | Trigger evaluation every N epochs. |
| capture_anchor Default: True | If True, captures evaluation outputs before training (step 0) as a baseline. |
| max_new_tokens Default: 256 | Maximum number of tokens to generate per prompt. |
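For example, to evaluate on an epoch schedule rather than a step schedule, the documented arguments can be combined like this (values are illustrative):

hook = TrainTrackHook(
    model=model,
    tokenizer=tokenizer,
    run_name="my-run",
    categories=["reasoning"],
    eval_every_epochs=1,   # evaluate once per epoch
    max_new_tokens=512,    # allow longer generations per prompt
)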
TrainTrackCallback

TrainTrackCallback(
    run_name: str,
    categories: list[Union[str, CategoryConfig]],
    model: Optional[Module] = None,
    tokenizer: Optional[Any] = None,
    eval_every_steps: int = 100,
    capture_anchor: bool = True,
)
| Argument | Description |
|---|---|
| model | Optional. If not provided, tries to use the Trainer's model. |
| tokenizer | Optional. If not provided, tries to use the Trainer's tokenizer. |
| capture_anchor Default: True | Auto-capture anchor in on_train_begin. |
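Since model and tokenizer fall back to the Trainer's own objects, the callback can be constructed from run metadata alone. A sketch under that assumption (training_args and train_dataset are placeholders):

from traintrack import TrainTrackCallback
from transformers import Trainer

callback = TrainTrackCallback(
    run_name="my-hf-run",
    categories=["reasoning"],
    eval_every_steps=200,  # documented default is 100
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    callbacks=[callback],  # model/tokenizer are picked up from the Trainer
)
trainer.train()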
BuildTrainTrackTinkerEvaluator
Factory for cookbook-style training configs. Returns an evaluator_builder you pass into evaluator_builders=[...].

BuildTrainTrackTinkerEvaluator(
    run_name: str,
    categories: list[Union[str, CategoryConfig]],
    model_name: Optional[str] = None,
    serial: bool = False,
    ...,
) -> Callable[..., TrainTrackSamplingEvaluator]
Core Arguments
| Argument | Description |
|---|---|
| run_name | TrainTrack run identifier shown in dashboards. |
| categories | Built-in category names or CategoryConfig entries (custom datasets + metrics). |
| model_name | Tinker model id. Optional in builder mode if the cookbook context already provides it. |
| serial | If False (default), uses the futures-based evaluator; if True, uses the serial sampling path. |
| eval_every_steps | Internal cadence gate. Set to None when the training config already controls cadence (e.g. eval_every). |
| auto_capture_anchor | Automatically capture step-0 anchor for pairwise-anchor metrics. |
| server_url, api_key | Optional TrainTrack endpoint + credential overrides. |
| metadata | Extra metadata attached to every ingestion payload. |
Sampling Controls
| Argument | Description |
|---|---|
| max_new_tokens | Maximum generated tokens per evaluation prompt. |
| temperature, top_p | Sampling behavior controls for evaluation generations. |
| stop_sequences | Optional explicit stop tokens; renderer defaults are used when omitted. |
| num_samples | Samples requested per prompt from Tinker sampler. |
| max_concurrency | Concurrent sampling requests per evaluator tick. |
| sample_batch_size | Prompt dispatch chunk size to improve sampler throughput. |
Futures / Throughput Controls
| Argument | Description |
|---|---|
| max_inflight_requests | Hard cap for queued sampling futures. |
| max_drain_per_tick | Max completed futures resolved each evaluator tick. |
| drop_when_busy | Drop new eval submissions when queue is full (instead of waiting). |
| max_submit_wait_s | Optional wait budget to free queue capacity before dropping. |
| auto_background_drain | Enable periodic background draining of completed futures. |
| background_drain_interval_s | Background drain poll interval. |
| auto_flush_at_exit, exit_flush_timeout_s | Best-effort flush on process exit and its timeout. |
Integration Controls
| Argument | Description |
|---|---|
| auto_patch_supervised_step_passthrough | Auto-patches cookbook supervised eval path to pass explicit step to evaluator. |
| infer_step_from_sampling_client_name | Fallback step inference from snapshot names when step is not passed explicitly. |
| snapshot_name_prefix, checkpoint_prefix | Naming controls for saved sampler snapshots and checkpoint tags. |
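A sketch combining a few of these arguments (argument names are taken from the tables above; the values are illustrative):

from traintrack import BuildTrainTrackTinkerEvaluator

traintrack_eval = BuildTrainTrackTinkerEvaluator(
    run_name="tinker-throughput-run",
    model_name="meta-llama/Llama-3.1-8B",
    categories=["reasoning"],
    eval_every_steps=None,     # let the training config's eval_every drive cadence
    max_new_tokens=256,
    temperature=0.7,
    max_concurrency=8,
    max_inflight_requests=32,
    drop_when_busy=True,       # drop new eval submissions rather than block training
    auto_background_drain=True,
)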
CreateTrainTrackTinkerEvaluator
Direct instance creator for custom loops. Returns an evaluator object you call with run(...), step(...), or evaluate_training_step_async(...).

CreateTrainTrackTinkerEvaluator(
    run_name: str,
    categories: list[Union[str, CategoryConfig]],
    model_name: str,
    serial: bool = False,
    ...,
) -> TrainTrackSamplingEvaluator
| Argument | Description |
|---|---|
| run_name, categories, model_name | Required core configuration for run id, evaluation dataset categories, and Tinker model. |
| training_client | Optional bound training client, enabling no-boilerplate calls like evaluator.run(step=step). |
| eval_every_steps | Internal cadence controller for run()/step() loop calls. |
| serial | Selects serial evaluator path; futures path remains default. |
| auto_capture_anchor | Captures and sends anchor (step 0) automatically for pairwise-anchor metrics. |
| capture_anchor_on_init, anchor_training_client | Optional init-time anchor capture path; useful when explicit pre-training anchor capture is required. |
| max_new_tokens, temperature, top_p, stop_sequences, num_samples | Sampling behavior controls used for evaluation generations. |
| max_concurrency, sample_batch_size | Throughput controls for prompt sampling. |
| max_inflight_requests, max_drain_per_tick, drop_when_busy, max_submit_wait_s | Futures queue controls (active when serial=False). |
| auto_background_drain, background_drain_interval_s | Background completion polling for futures mode. |
| auto_flush_at_exit, exit_flush_timeout_s | Best-effort flush of pending futures/checkpoints on process exit. |
| auto_patch_supervised_step_passthrough, infer_step_from_sampling_client_name | Step consistency controls when integrating with cookbook evaluators. |
| snapshot_name_prefix, checkpoint_prefix | Naming strategy for Tinker snapshots and TrainTrack checkpoints. |
| renderer_name, tokenizer | Optional renderer/tokenizer overrides for custom runtime setups. |
| server_url, api_key, metadata | TrainTrack transport settings and payload metadata enrichment. |
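With a bound training_client, the per-step call reduces to evaluator.run(step=step), as noted above. A sketch of a custom loop under that assumption (num_steps is a placeholder):

from traintrack import CreateTrainTrackTinkerEvaluator

evaluator = CreateTrainTrackTinkerEvaluator(
    run_name="my-tinker-run",
    model_name="meta-llama/Llama-3.1-8B",
    categories=["reasoning"],
    eval_every_steps=50,
    training_client=training_client,  # bound once, reused on every call
)

for step in range(num_steps):
    # ... training update ...
    evaluator.run(step=step)  # only evaluates when the cadence gate fires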
CategoryConfig
class CategoryConfig:
    category: str
    sources: Optional[list[str]] = None
    max_samples: int = 20
    judge_modes: Optional[list[str]] = None
    metrics: Optional[list[Union[str, MetricConfig]]] = None
| Field | Description |
|---|---|
| category | Name of built-in category (e.g. "reasoning") or custom identifier. |
| sources Default: None | List of file paths (.jsonl/.csv) containing prompts. Required for custom categories. Auto-filled for built-ins. |
| max_samples Default: 20 | Number of prompts to sample. |
| judge_modes Default: ["criteria"] | "criteria" (scoring) and/or "pairwise_anchor" (win rate). |
| metrics Default: [] | List of metric names or MetricConfig objects. |
MetricConfig
class MetricConfig:
    name: str
    rubric: Optional[str] = None
| Field | Description |
|---|---|
| name | Name of the behavior to monitor (e.g. "politeness"). |
| rubric Default: None | Plain-text description for the LLM judge to use when scoring. |
Glossary
LLM as a Judge
The practice of using a high-capability model (like GPT-4o) to evaluate the outputs of a smaller or domain-specific model against semantically meaningful rubrics.
Behavior Curve
A graph showing how a specific behavior (e.g. Reasoning or Hallucination Risk) changes over the course of training, rather than just raw cross-entropy loss.
Anchor Output
The output generated by the model before training begins (step 0). Used as a baseline for pairwise comparisons to measure progress.
Pareto Frontier
The Pareto frontier is the set of optimal, non-dominated solutions in multi-objective optimization, where improving one objective requires sacrificing another. TrainTrack identifies the Pareto frontier of your models by measuring multiple behavior dimensions simultaneously, helping you choose the right balance for your specific use case.
Need immediate support?
Email us at shavon.thadani@gmail.com or shaylin.thadani@gmail.com. We guarantee a response in less than 1 hour.