Why ACT Excels at Bimanual Tasks

ACT (Action Chunking with Transformers) was originally developed for bimanual manipulation research. Its core insight — that predicting sequences of future actions (chunks) rather than single-step actions reduces compounding error — is especially valuable for bimanual tasks, where a small error in one arm's trajectory can cascade into failures in the other arm's execution.

The action chunking mechanism effectively gives the policy a planning horizon. Instead of committing to a single joint command at each 50 Hz timestep, ACT plans 100 steps ahead and smooths the execution. For a handoff task, this means the policy can "see" the approach of both arms toward the handoff point as part of a planned sequence, rather than reacting to each frame independently. Empirically, this halves the rate of mid-transfer failures compared to non-chunked approaches on bimanual datasets.
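At inference time, the smoothing comes from temporal ensembling: every timestep the policy predicts a fresh chunk, and the action actually executed is a weighted blend of all still-valid predictions for that timestep. A minimal numpy sketch of the idea (function name and data layout are illustrative, not LeRobot's API; the exponential weighting follows the ACT paper, with the oldest prediction weighted most):

```python
import numpy as np

def temporal_ensemble(chunk_history, t, m=0.01):
    """Blend every still-valid prediction for timestep t into one action.

    chunk_history: list of (start_step, chunk) pairs, oldest first, where
    chunk is a (chunk_size, action_dim) array predicted at start_step.
    Weights follow exp(-m * i) with i = 0 for the oldest prediction.
    """
    preds, weights = [], []
    for start, chunk in chunk_history:  # oldest chunks first
        offset = t - start
        if 0 <= offset < len(chunk):
            preds.append(chunk[offset])
            weights.append(np.exp(-m * len(weights)))
    w = np.array(weights) / np.sum(weights)
    return (np.stack(preds) * w[:, None]).sum(axis=0)
```

With `m=0.01` the blend changes slowly, which is what produces the smooth handoff approach; a larger `m` makes the policy more reactive to the newest prediction.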

One caution: ACT assumes the demonstrations in your dataset represent a consistent strategy. If different demos show fundamentally different ways of executing the handoff — a different initiating arm, a different handoff height — the CVAE component will struggle to encode a single style. All 100 of your demos should execute the same motion strategy.
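You can sanity-check this before training with a rough consistency probe that tallies which arm initiates in each demo. A sketch under an assumed (T, 14) array layout (dims 0-5 left arm, 6-11 right arm, 12-13 grippers); adapt the slicing and threshold to your actual dataset format:

```python
import numpy as np

def first_mover(demo, move_thresh=0.02):
    """Which arm starts moving first in one demo.

    Assumed (hypothetical) layout: demo is a (T, 14) array with dims 0-5
    the left arm, 6-11 the right arm, 12-13 the grippers.
    """
    left = np.abs(np.diff(demo[:, :6], axis=0)).sum(axis=1)
    right = np.abs(np.diff(demo[:, 6:12], axis=0)).sum(axis=1)
    # argmax returns the first timestep where motion exceeds the threshold
    return "left" if np.argmax(left > move_thresh) <= np.argmax(right > move_thresh) else "right"

def strategy_counts(demos):
    """Tally the initiating arm across all demos. A lopsided split like
    {'left': 97, 'right': 3} points directly at the outlier demos to cull."""
    movers = [first_mover(d) for d in demos]
    return {m: movers.count(m) for m in set(movers)}
```

The same pattern extends to other strategy fingerprints, e.g. binning the gripper height at the moment of transfer to catch inconsistent handoff heights.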

Training Command

```bash
source ~/dk1-env/bin/activate

# policy.action_dim=14 tells ACT the action space is 14-dimensional
# (6 + 6 joints + 2 grippers).
# Run this before sleeping — checkpoints save every 5k steps.
python -m lerobot.scripts.train \
  --policy-type act \
  --dataset-repo-id cube-handoff-v1 \
  --root ~/dk1-datasets \
  --output-dir ~/dk1-policies/cube-handoff-v1 \
  --config-overrides \
  policy.action_dim=14 \
  policy.chunk_size=100 \
  policy.n_action_steps=100 \
  policy.dim_feedforward=3200 \
  policy.n_heads=8 \
  policy.n_encoder_layers=4 \
  policy.n_decoder_layers=7 \
  training.num_steps=80000 \
  training.eval_freq=5000 \
  training.save_freq=5000 \
  training.batch_size=16
```
GPU required for practical training time: on an RTX 3080 (10 GB), 80,000 steps takes approximately 90 minutes; on an RTX 4090, approximately 50 minutes; on CPU, expect 10–14 hours. Use the `--device cuda` flag if you have a GPU. Cloud GPU options (Lambda Labs, Vast.ai) run about $0.50–1.50/hr for the hardware needed.
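Before launching an 80k-step run, it is worth confirming that training will actually land on the GPU. A small helper (plain PyTorch, nothing LeRobot-specific; the guarded import is just so the check degrades gracefully):

```python
def pick_device():
    """Return 'cuda' when a usable GPU is visible, else 'cpu'."""
    try:
        import torch
    except ImportError:
        # No PyTorch at all -- definitely no GPU training
        return "cpu"
    if torch.cuda.is_available():
        name = torch.cuda.get_device_name(0)
        mem_gb = torch.cuda.get_device_properties(0).total_memory / 1e9
        print(f"Using {name} ({mem_gb:.1f} GB)")
        return "cuda"
    print("No CUDA device found; expect 10-14 hours on CPU")
    return "cpu"
```

If this prints the CPU warning on a machine that has a GPU, the usual culprit is a CPU-only PyTorch install or a driver/CUDA version mismatch.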

Reading Bimanual Training Curves

Bimanual training curves differ from single-arm curves in one important way: you have two action spaces, and the policy must learn to coordinate them. Watch for these patterns in your loss curves (view them in TensorBoard with `tensorboard --logdir ~/dk1-policies/`):

L_reconstruction (overall action loss)

Should decrease from ~3.0 to below 0.4 by 60,000 steps. A plateau above 0.7 after 40,000 steps indicates dataset quality issues — likely too much variance in the handoff timing or position.

L_kl (CVAE regularization)

Starts near 0 and rises slowly to 5–15. If it rises above 30, the CVAE is struggling to find a compact style embedding. This often means your demonstrations have too much behavioral diversity. Consider culling the bottom 20% least consistent demos and retraining.

Action error: left vs. right

If you enable per-arm action error logging (via the `training.log_per_action_dim=true` override), you will see separate loss curves for the left and right action dimensions. A large persistent gap between the two indicates one arm's demonstrations are more consistent than the other's — review your Unit 4 quality checklist for the lagging arm.
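The same per-arm comparison can be reproduced offline from predicted and target action arrays. A sketch, assuming the 14-dim layout of dims 0-5 left, 6-11 right, 12-13 grippers (verify the split against your robot config):

```python
import numpy as np

# Assumed (hypothetical) layout of the 14-dim DK1 action vector:
# dims 0-5 left arm, 6-11 right arm, 12-13 grippers.
LEFT, RIGHT = slice(0, 6), slice(6, 12)

def per_arm_l1(pred, target):
    """Mean L1 action error split by arm; pred/target are (batch, chunk, 14)."""
    err = np.abs(pred - target)
    return float(err[..., LEFT].mean()), float(err[..., RIGHT].mean())
```

Running this over a held-out batch per checkpoint gives you the left/right gap directly, even if you forgot to enable the logging override during training.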

Bimanual-Specific Hyperparameters

| Parameter | Default (single-arm) | DK1 bimanual recommended | Why |
|---|---|---|---|
| `action_dim` | 7 | 14 | Two 6-DOF arms + 2 grippers = 14 action dimensions |
| `chunk_size` | 100 | 100 | Unchanged; action chunking is already well-suited to bimanual coordination timescales |
| `dim_feedforward` | 3200 | 3200 | No change needed; the larger action space is handled by the action head, not the transformer width |
| `num_steps` | 50000 | 80000 | Bimanual coordination requires more training steps to converge reliably; 80k is the practical minimum for 100 demos |
| `batch_size` | 32 | 16 | Reduced to fit the larger bimanual dataset samples (dual camera feeds) in GPU memory |
| `kl_weight` | 10 | 10 | Default works well; increase to 20 only if L_kl stays near zero after 30k steps (CVAE not learning) |

Checkpoint Selection

Save checkpoints every 5,000 steps (training.save_freq=5000). Do not assume the final checkpoint is the best. Bimanual policies can overfit at high step counts — the policy learns to reproduce training demonstrations perfectly but loses generalization to the slight real-world variations you will encounter during evaluation.

Select the checkpoint at the step where L_reconstruction reached its minimum before starting to plateau or slightly increase. Usually this is in the 60,000–80,000 step range for 100-demo bimanual datasets. Deploy two checkpoints (the minimum-loss checkpoint and the final one) and compare their real-world performance in Unit 6.
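Picking that checkpoint can be scripted once you have the eval losses out of TensorBoard. A sketch over an assumed `{step: loss}` dict (the export format and function name are illustrative; adapt to however you scrape your scalars):

```python
def best_checkpoint(eval_losses, save_freq=5000):
    """Pick the saved checkpoint nearest the L_reconstruction minimum.

    eval_losses: {step: eval loss} scraped from your training logs
    (a hypothetical format -- adapt to your TensorBoard export).
    """
    best_step = min(eval_losses, key=eval_losses.get)
    # Snap to the checkpoint grid, since weights only exist every save_freq steps
    return max(save_freq, round(best_step / save_freq) * save_freq)
```

Since eval runs every 5,000 steps here (matching `save_freq`), the snap is a no-op for this run, but it keeps the helper correct if you later evaluate more often than you save.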

Unit 5 Complete When...

- Training has completed 80,000 steps and checkpoints are saved at ~/dk1-policies/cube-handoff-v1/.
- The final L_reconstruction value is below 0.5.
- You have identified your best checkpoint based on the loss curves.
- You understand why the L_kl curve behaves as it does in your run.
- You are ready to deploy to real hardware in Unit 6. Target success rate on the cube handoff is >60% (bimanual is harder than single-arm, and this is a strong first-run result).