ME 326: Collaborative Robotics | Stanford University | Winter 2026
We built an intelligent robotic assistant using the TidyBot++ bimanual mobile manipulator. Our system interprets natural language commands — via voice or text — to perceive its environment, plan actions, and autonomously execute manipulation tasks. The robot can retrieve objects on command, perform multi-step sequential tasks, and open articulated furniture like drawers.
"Find the banana and bring it here." A voice command triggers the full pipeline: scan the scene, locate the object, navigate to it, pick it up, and return.
"Find the banana and place it in the basket." Multi-step natural language commands that chain perception, navigation, and manipulation.
Open a drawer in a cabinet using one arm, demonstrating dexterous manipulation of articulated objects.
Our system follows a modular perception-planning-control architecture, orchestrated by an LLM-based planner.
The ROS2 node graph showing how the StateManager orchestrates Base, ArmPlanner, Vision (SAM3), and Voice nodes via topics and service calls.
We use Segment Anything Model 3 (SAM3) combined with the Intel RealSense D435 depth camera to detect and localize objects in 3D space. Given a text prompt (e.g., "banana"), the vision pipeline segments the object in the RGB image and projects it into a segmented 3D point cloud.
We apply PCA to the segmented point cloud to obtain its major and normal axes, which we use to construct a grasp pose centered at the object's centroid and aligned with its shortest width.
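The PCA step above can be sketched as follows. This is a minimal illustration, not the project's actual code: the function name and return convention are assumptions, and the grasp is described only by the centroid plus the principal axes (the gripper closes along the minor, shortest-width axis).

```python
import numpy as np

def grasp_pose_from_points(points):
    """Estimate a grasp pose from a segmented 3D point cloud via PCA.

    points: (N, 3) array of object points in the camera frame.
    Returns (centroid, major_axis, minor_axis); the gripper closes
    along the minor axis (the object's shortest width), centered at
    the centroid.
    """
    points = np.asarray(points, dtype=float)
    centroid = points.mean(axis=0)
    centered = points - centroid
    # PCA: eigen-decomposition of the covariance matrix.
    cov = np.cov(centered.T)
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues ascending
    minor_axis = eigvecs[:, 0]  # least variance -> shortest width
    major_axis = eigvecs[:, 2]  # most variance  -> object's long axis
    return centroid, major_axis, minor_axis
```

Eigenvector signs are arbitrary, so downstream code should orient the grasp axis consistently (e.g. flip it to point toward the camera).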
The camera is mounted on a 2-DOF pan-tilt unit, enabling the robot to scan its environment by sweeping across the scene.
Natural language commands are processed by a Gemini-based planner that decomposes high-level instructions into a sequence of robot actions through automatic function calling. Available tools include scan, navigate_to, pick_up, place_at, open_door, and move_arm.
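The function-calling pattern can be sketched as a dispatch table mapping the tool names above onto robot actions. This is a hedged illustration of the dispatch loop only, not the Gemini SDK itself; `make_toolbox`, `execute_plan`, and the `robot` interface are hypothetical.

```python
def make_toolbox(robot):
    """Map the tool names exposed to the planner onto robot methods.
    `robot` is any object providing these methods (hypothetical stub)."""
    return {
        "scan": robot.scan,
        "navigate_to": robot.navigate_to,
        "pick_up": robot.pick_up,
        "place_at": robot.place_at,
        "open_door": robot.open_door,
        "move_arm": robot.move_arm,
    }

def execute_plan(calls, toolbox):
    """Run a planner-emitted sequence of (tool_name, kwargs) calls
    in order, returning each call's result."""
    return [toolbox[name](**kwargs) for name, kwargs in calls]
```

In the real system the LLM emits these calls one at a time and observes each result before choosing the next; the list form here just shows the dispatch.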
Users can interact via text or voice (using the Gemini Live API for real-time speech understanding). A ROS2 voice node passively listens for an activation phrase ("Hey Robot!"), then actively records the request via Google Cloud Speech-to-Text.
The planner orchestrates task execution through a state machine that sequences perception, navigation, and manipulation actions, with error handling (retry on failure, replan after max retries).
The 3-DOF holonomic base (Phoenix6 motors over CAN) performs 3-phase position control to goal poses, publishing a goal-reached signal on completion. It supports both relative and world-frame commands.
An action queue service processes high-level commands (Grab, Release, Move) by decomposing them into IK-solved waypoint sequences. Non-blocking service calls allow smooth chaining of multi-step motions.
Coordinated gripper open/close with force feedback for reliable grasping. Both arms can be controlled independently or in coordination for tasks like drawer articulation.
ROS2 service: given a text prompt, SAM3 segments the object and PCA analysis on the 3D point cloud yields a grasp pose. For Task 3, an additional "shiny gold knob" prompt localizes the drawer handle.
Google Gemini with automatic function calling dispatches actions to a state machine that waits for hardware-ready signals at each step. Error handling retries failed actions and replans after max retries.
The same action queue service handles arm motion here: Grab, Release, and Move commands are decomposed into IK-solved waypoint sequences, and non-blocking service calls chain multi-step motions smoothly.
The 3-DOF holonomic base performs 3-phase position control (rotate toward the goal heading, translate, rotate to the final orientation) and publishes a goal-reached signal on completion.
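The three phases can be computed directly from the current and goal poses. This is a minimal sketch of the decomposition only (no motor control); the function names are assumptions.

```python
import math

def wrap(a):
    """Wrap an angle into [-pi, pi)."""
    return (a + math.pi) % (2 * math.pi) - math.pi

def three_phase_plan(x, y, th, gx, gy, gth):
    """Decompose a world-frame goal pose into the base's 3-phase
    sequence: (1) rotate toward the goal, (2) translate straight,
    (3) rotate to the final orientation.

    Returns (turn1, distance, turn2) in radians and meters.
    """
    heading = math.atan2(gy - y, gx - x)   # direction of travel
    return (wrap(heading - th),            # phase 1: face the goal
            math.hypot(gx - x, gy - y),    # phase 2: drive straight
            wrap(gth - heading))           # phase 3: final heading
```

Wrapping each turn keeps the base from spinning the long way around when the shorter rotation crosses the +/- pi boundary.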
A ROS2 node with Google Cloud Speech-to-Text passively listens for an activation phrase ("Hey Robot!"), then actively records the user's request and matches it to a task for the State Manager.
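Once speech-to-text returns a transcript, the activation-phrase step is simple string matching. The sketch below assumes the phrase and the punctuation-stripping behavior; it does not touch the Google Cloud Speech-to-Text API itself.

```python
ACTIVATION = "hey robot"

def extract_request(transcript):
    """Return the user's request if the transcript contains the
    activation phrase, else None. Matching is case-insensitive and
    tolerant of punctuation (e.g. "Hey Robot!")."""
    lowered = transcript.lower()
    idx = lowered.find(ACTIVATION)
    if idx == -1:
        return None                      # keep passively listening
    request = transcript[idx + len(ACTIVATION):]
    return request.strip(" ,.!?") or None
```

In the real node, a non-None result would be matched to a task and forwarded to the State Manager.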
The state machine for sequential pick-and-place tasks, with event-driven transitions based on base_ready and arms_ready signals and grasp retry logic.
The state machine for opening drawers, with handle detection retry and grasp failure recovery.
The robot successfully interprets voice commands to locate, navigate to, and retrieve target objects in a cluttered environment.
Given multi-step natural language instructions, the robot decomposes and executes chained actions.
The robot opens a drawer using one arm, demonstrating dexterous manipulation of articulated objects. Key design considerations include handle geometry, static/kinetic friction during pulling, and constraining the end-effector along the drawer rail axis to avoid lateral torque.
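The rail-axis constraint amounts to commanding end-effector waypoints that move purely along the drawer's travel direction. The sketch below is illustrative: the function name, pull distance, and waypoint count are assumptions.

```python
import numpy as np

def drawer_pull_waypoints(handle_pos, rail_axis, pull_dist=0.15, n=6):
    """End-effector waypoints for opening a drawer.

    The gripper moves purely along the (unit-normalized) rail axis
    from the grasped handle position, so no lateral torque is applied
    to the drawer slides.
    """
    rail_axis = np.asarray(rail_axis, dtype=float)
    rail_axis /= np.linalg.norm(rail_axis)
    handle_pos = np.asarray(handle_pos, dtype=float)
    # Evenly spaced targets from the handle outward along the rail.
    return [handle_pos + rail_axis * d
            for d in np.linspace(0.0, pull_dist, n)]
```

Keeping the end-effector orientation fixed across these waypoints is what avoids twisting the handle against the rail.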
Five non-blocking service calls from the high-level planner, with the action queue continually publishing queue length to indicate completion.
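The completion-by-queue-length pattern can be sketched with a plain FIFO queue. This is a simplified stand-in for the ROS2 service (class and method names are assumptions): enqueue calls return immediately, and the published length reaching zero signals completion to the planner.

```python
from collections import deque

class ActionQueue:
    """Minimal sketch of the arm action queue: high-level commands
    are enqueued by non-blocking calls, and the queue length is
    published so the planner can detect completion (length 0)."""

    def __init__(self):
        self._queue = deque()

    def enqueue(self, command):
        self._queue.append(command)   # non-blocking: returns at once

    def length(self):
        return len(self._queue)       # value published on a status topic

    def step(self):
        """Execute (pop) the next queued command; returns it, or
        None when the queue is empty."""
        return self._queue.popleft() if self._queue else None
```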
Integrate GraspAnything for 6-DOF grasp pose prediction. Add force feedback and tactile sensors for more reliable manipulation.
Real-time failure detection via memory or a VLA planner to auto-recover if actions fail, reducing the need for manual intervention.
Safety-critical and robust SLAM with onboard LiDAR for obstacle avoidance and persistent environment mapping.
Extend dual-arm coordination to folding, packing, and tool use tasks.
Hand objects to humans, respond to gestures, operate in shared spaces. Intent estimation for anticipating human needs.
Learn manipulation skills from human demonstrations via behavior cloning, imitation learning, ACT, and diffusion policy.
Instructor: Professor Monroe Kennedy
GitHub: github.com/ChristopherLuey/collaborative-robotics-2026