Logbook
RL Arm Tracking
Learning RL by building a 3D trajectory-tracking policy for a simulated Franka arm. PPO via Stable-Baselines3 + MuJoCo.
10 entries
- May 2026
-
Submission and future scope of work
Polished the GitHub repo for submission (library code into an `arm/` package, dead code removed, proper README pointing at this build log). Then some thoughts on GPU training, ripping out IK, learning physics directly, and comments on RL's observation space.
- reflection
-
v9 has arrived
v9 trains with an action delta penalty (β=0.1) and a 5 s episode duration. Jerk dropped 25%, mean error dropped 13%, and time in the 5 cm band jumped from 54% to 73%. Shipping v9 as the submission model!
- experiment
-
New metrics for v5 and planning v9
Passed on MuJoCo Warp (too much JAX rewrite for the time left), added action and tracking-lag metrics, and used them to characterise v5 as "noisy but on-time" versus the classical controller's "clean but late". Set the stage for v9 with an action delta penalty.
- research
-
Acceleration in obs and accidentally rewarding termination
Audited yesterday's collision penalty, removed it, and the eval mean error dropped by ~3 cm. Added target acceleration to the observation space, and v5 now beats the classical baseline by 4.55 cm with time-in-band more than doubled. Then tried letting RL learn collision avoidance with early termination, which caused a "model races to terminate" failure mode.
- experiment
-
Comparing classical vs RL, then mixing in new trajectories
Got the first side-by-side classical vs RL comparison out the door: RL is ~30% tighter on mean error and spends 15 percentage points more time in the 5 cm band, but is jerkier. Moved the tracked point to a green tip site past the gripper, tried damped least squares (DLS) for the IK folding issue (no effect), then added a figure-8 and a random-walk "fly" trajectory and kicked off a 5M-step retrain with a self-collision penalty.
- experiment
- decision
-
Wiring the RL pipeline end-to-end
Finished `trajectory.py` and wired `env.py`, `train.py`, `eval.py` end-to-end. Drilled the RL terminology along the way and watched a PPO policy learn to lead moving targets over a 1M-step training run. Self-collisions and end-effector (EE) visibility flagged for the next session.
- research
- experiment
-
Scoping RL and mapping the arm's reach
Settled on approach iv (RL on top of IK+PD) as the starting scope. Mapped the Panda arm's reachable workspace from 300,000 sampled joint configurations and laid out a plan to get training running.
- decision
- experiment
-
Turning physics on, and a nagging decision
Turned dynamics on for the first time with a classical IK + PD controller tracking a moving circle, and found the Panda's actuators are already PD-tuned. Now weighing how much RL to take on before the deadline.
- research
- experiment
-
Implementing inverse kinematics
Built the Jacobian pseudo-inverse IK solver. It drives the Panda's wrist to a hardcoded 3D target, and degrades as expected when the target is out of reach. This is the foundation for both the classical baseline and the RL Cartesian action wrapper.
- research
- experiment
-
Challenge accepted.
Kickoff for an 11-day RL challenge: train a policy to track a 3D Cartesian trajectory on a simulated Franka arm. Scoped the problem, set up the repo with uv, and got Franka loaded in MuJoCo for first signs of life.
- research
- experiment