Kuan-Chia Chen

Robotics & Reinforcement Learning Researcher

About Me

Hi, I am a Master's student in Computer Science at the Dynamic Robotics and Artificial Intelligence (DRAIL) Laboratory, advised by Professor Alan Fern. My research focuses on robotics and reinforcement learning, including humanoid locomotion, robot manipulation, and simulation-to-real transfer.


Research

Humanoid Hanoi: Investigating Shared Whole-Body Control for Skill-Based Box Rearrangement

We investigate a skill-based framework for humanoid box rearrangement that enables long-horizon execution by sequencing reusable skills at the task level. In our architecture, all skills execute through a shared, task-agnostic whole-body controller (WBC), providing a consistent closed-loop interface for skill composition, in contrast to non-shared designs that use separate low-level controllers per skill.

We find that naively reusing the same pretrained WBC can reduce robustness over long horizons, as new skills and their compositions induce shifted state and command distributions. We address this with a simple data aggregation procedure that augments shared-WBC training with rollouts from closed-loop skill execution under domain randomization.

To evaluate the approach, we introduce Humanoid Hanoi, a long-horizon Tower-of-Hanoi box rearrangement benchmark, and report results in simulation and on the Digit V3 humanoid robot, demonstrating fully autonomous rearrangement over extended horizons and quantifying the benefits of the shared-WBC approach over non-shared baselines.

Projects

Point Cloud to Action

This project is a hierarchical learning system for robotic manipulation using raw 3D point cloud observations. The goal is to train a robot to robustly grasp boxes across a wide range of positions, orientations, and sizes without relying on predefined object models or QR codes.

To achieve this, I developed a system trained in the Isaac Lab simulation environment using a Digit V3 humanoid robot. At the core of the system is a policy trained with Proximal Policy Optimization (PPO), combined with a two-layer Long Short-Term Memory (LSTM) network. This design allows the model to capture temporal dependencies and maintain internal state across time steps, which is particularly important for sequential manipulation tasks.

The extracted point cloud features are concatenated with the robot state and fed into the LSTM policy network. The policy then outputs continuous control commands for the robot.
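The data flow above (point cloud features concatenated with the robot state, passed through a recurrent policy that emits continuous commands) can be sketched as follows. This is a minimal single-layer numpy illustration, not the actual two-layer policy; all dimensions and weights are hypothetical placeholders.

```python
import numpy as np

def lstm_step(x, h, c, W, U, b):
    """One LSTM cell step: gates computed from input x and hidden state h."""
    z = W @ x + U @ h + b                        # stacked gate pre-activations
    H = h.size
    i = 1 / (1 + np.exp(-z[:H]))                 # input gate
    f = 1 / (1 + np.exp(-z[H:2 * H]))            # forget gate
    o = 1 / (1 + np.exp(-z[2 * H:3 * H]))        # output gate
    g = np.tanh(z[3 * H:])                       # candidate cell update
    c = f * c + i * g
    h = o * np.tanh(c)
    return h, c

rng = np.random.default_rng(0)
PC_FEAT, STATE, HIDDEN, ACT = 64, 30, 128, 12    # illustrative dimensions

W = rng.normal(0, 0.1, (4 * HIDDEN, PC_FEAT + STATE))
U = rng.normal(0, 0.1, (4 * HIDDEN, HIDDEN))
b = np.zeros(4 * HIDDEN)
W_out = rng.normal(0, 0.1, (ACT, HIDDEN))        # action head

h, c = np.zeros(HIDDEN), np.zeros(HIDDEN)
for t in range(5):                               # a short rollout
    pc_feat = rng.normal(size=PC_FEAT)           # stand-in point cloud features
    state = rng.normal(size=STATE)               # stand-in robot proprioception
    obs = np.concatenate([pc_feat, state])       # concatenated observation
    h, c = lstm_step(obs, h, c, W, U, b)         # recurrent state carried over
    action = W_out @ h                           # continuous control command
print(action.shape)
```

Because the hidden and cell states persist across steps, the policy can condition each command on the recent observation history rather than a single frame.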

Diffusion Policy for Robot Control

This project is a diffusion-based policy for robotic manipulation, where the model learns to generate control actions conditioned on object properties such as initial position, orientation, and size. Training data was collected from a real-world VR interface, for a total of 100 human-demonstrated trajectories. The learned policy enables the robot to robustly pick up boxes despite the small dataset.

The inputs to the model include box position, size, and robot state. These inputs are passed through a diffusion U-Net, which generates 8-dimensional control actions; only the first 4 dimensions are used to control the robot.
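The generation step works roughly like this: starting from Gaussian noise, the model iteratively denoises an action vector conditioned on the box and robot state, then only the leading dimensions are sent to the robot. Below is a minimal DDPM-style sketch in numpy; the noise predictor is a fixed stand-in for the learned conditional U-Net, and the conditioning values and schedule are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
T, ACT_DIM = 50, 8                      # diffusion steps, action dimension

betas = np.linspace(1e-4, 0.02, T)      # linear noise schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def eps_model(x, t, cond):
    """Stand-in for the conditional U-Net: predicts the noise in x.
    The real model is learned; this fixed linear map just keeps the demo runnable."""
    return 0.1 * x + 0.01 * np.sum(cond)

# conditioning: box position (3), box size (3), robot state (stand-in, 6)
cond = np.concatenate([[0.4, 0.1, 0.8], [0.2, 0.15, 0.1], np.zeros(6)])

x = rng.normal(size=ACT_DIM)            # start from pure Gaussian noise
for t in reversed(range(T)):            # reverse (denoising) process
    eps = eps_model(x, t, cond)
    x = (x - betas[t] / np.sqrt(1 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
    if t > 0:
        x += np.sqrt(betas[t]) * rng.normal(size=ACT_DIM)

robot_command = x[:4]                   # only the first 4 dims drive the robot
print(robot_command.shape)
```

Slicing the 8-dimensional output to 4 command dimensions happens only at execution time; the model is still trained to reconstruct the full action vector.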

Results show that the learned policy successfully picks up the box in the MuJoCo simulation environment.

Keypoint Mimic Policy for VR Control

I developed a keypoint mimic framework for robot teleoperation using a VR headset. The system maps human hand keypoints and locomotion inputs, captured via a VR headset and controllers, into robot control commands. This enables the robot to execute complex, coordinated behaviors. The setup also serves as an efficient pipeline for collecting high-quality demonstration data.

The policy inputs include two hand XYZ keypoints, a height command, and locomotion velocities (x, y, and turning).
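Packing those signals into a single observation vector is straightforward; here is a small sketch, assuming a 10-dimensional layout (two hand XYZ keypoints, one height command, three locomotion velocities). The function name, ordering, and example values are hypothetical.

```python
import numpy as np

def build_policy_input(left_hand_xyz, right_hand_xyz, height_cmd, vel_xy_yaw):
    """Pack VR-derived signals into the 10-D policy input:
    two hand XYZ keypoints (6), height command (1), locomotion velocities (3)."""
    return np.concatenate([
        np.asarray(left_hand_xyz, dtype=float),
        np.asarray(right_hand_xyz, dtype=float),
        [float(height_cmd)],
        np.asarray(vel_xy_yaw, dtype=float),
    ])

obs = build_policy_input(
    left_hand_xyz=(0.30, 0.25, 1.10),   # metres in the robot frame (illustrative)
    right_hand_xyz=(0.30, -0.25, 1.10),
    height_cmd=0.95,                    # desired base height
    vel_xy_yaw=(0.2, 0.0, 0.1),         # forward, lateral, turning velocities
)
print(obs.shape)
```

A fixed layout like this also makes logged teleoperation data directly reusable as supervised training input.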

We obtain real-world hand positions from the VR headset. This policy helps us explore what types of tasks the robot can perform and enables efficient dataset collection for supervised learning.

Object-Aware Real-Time Video Stylization

In this project, we propose a framework that combines zero-shot video object segmentation with fast neural style transfer to achieve real-time, object-aware stylization. Our method uses a Transformer-based model to segment and preserve foreground objects, while a feed-forward network applies artistic style transformations to the background.
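The final compositing step (original pixels kept where the foreground mask fires, stylized pixels elsewhere) reduces to a per-pixel blend. A minimal numpy sketch, with stand-in frames and a hypothetical rectangular mask in place of the real segmentation output:

```python
import numpy as np

def composite(frame, styled, mask):
    """Keep original pixels where the foreground mask is 1;
    use the style-transferred frame everywhere else."""
    m = mask[..., None].astype(float)        # broadcast mask over RGB channels
    return m * frame + (1.0 - m) * styled

H, W = 4, 6
frame = np.full((H, W, 3), 0.8)              # stand-in camera frame
styled = np.full((H, W, 3), 0.2)             # stand-in stylized frame
mask = np.zeros((H, W))
mask[1:3, 2:4] = 1                           # stand-in person segmentation

out = composite(frame, styled, mask)
print(out[1, 2], out[0, 0])                  # foreground kept, background stylized
```

Because the blend is a single vectorized operation per frame, it adds negligible cost on top of segmentation and style transfer.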

The system handles videos with people moving unpredictably and runs at interactive speeds (~15 FPS), making it suitable for applications such as augmented reality, virtual content creation, and live video processing.

Paper

Performance Comparison of CPU, Multithreaded CPU, and GPU (CUDA) in Real-Time Gaming

In this project, I developed a shooting game to evaluate the performance differences between three computation modes: normal CPU, multithreaded CPU, and GPU parallelization using CUDA.

The goal was to analyze how parallel computing impacts real-time rendering and compute performance. I implemented separate pipelines for each mode and compared their efficiency in handling intensive game logic and rendering tasks.
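The comparison methodology boils down to running the same per-frame workload through each pipeline and averaging frames per second over a fixed window. A minimal Python sketch of that harness (the game itself used C++/CUDA; the workload here is a hypothetical stand-in):

```python
import time

def measure_fps(step, frames=200):
    """Run `step` (one frame of game logic + rendering) repeatedly
    and report the average frames per second."""
    start = time.perf_counter()
    for _ in range(frames):
        step()
    elapsed = time.perf_counter() - start
    return frames / elapsed

def dummy_frame():
    # stand-in workload for one frame's update loop
    total = 0
    for i in range(10_000):
        total += i * i
    return total

fps = measure_fps(dummy_frame)
print(f"{fps:.1f} FPS")
```

Averaging over many frames, rather than timing a single one, smooths out scheduler jitter and gives comparable numbers across the three modes.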

Results show that GPU parallelization achieved 58 FPS, significantly outperforming both standard CPU execution (4 FPS) and multithreaded CPU execution (10 FPS), making the game smoother and more responsive.

Skills

Awards