AI robotics startup Physical Intelligence says it saw improvements in its vision-language-action model by including human video data in the fine-tuning process
Mehdi / @bettercallmedhi: Physical Intelligence may have just triggered the real inflection point for robotics. π0.5 shows emergent human-to-robot transfer: once the model is big enough, it can align human videos with robot actions without being explicitly taught. That means all the useless human videos
Josh Wolfe / @wolfejosh: Robots can learn by watching humans do stuff. Lux family co @physical_int with MAJOR game-changing insight
Ali / @aliuahma: this is some really great work to quantify how egocentric human data can help boost policy performance, and it's a really interesting finding that as the pre-training data scales, the model is able to align human and robot data in the post-training stage. i'm not super caught
@bercankilic: If you are a robotics lab and you are looking to acquire real-world, diverse, in-the-wild egocentric (RGB-D) and wrist cam data with tons of annotations, feel free to DM. Surely, we can scale our operations for you too.
@physical_int: If we use our full pre-trained π0.5 model, simply finetuning with human video data can double the performance on tasks that are depicted in the human videos! [image]
Naveen / @naveen_ing: what kind of data is actually useful for robot learning? the answer has been changing every few weeks within the community. considered the lowest tier of data until now, this is egocentric demonstration data's week!
Karl Pertsch / @karlpertsch: I tried to get human-to-robot transfer via simple co-training to work as a student researcher with @hausman_k back in RT-1 days. @simar_kareer and @SurajNair_1 finally show some exciting signs of transfer; turns out a different level of scale was needed to get it to work! :)
Matthew Gunton / @matthewjgunton: @physical_int Seeing the activations between the human and robot actions get closer and closer is incredible. Hats off to the team for presenting their generalization evidence in a visually compelling way! [image]
@physical_int: We set out with the goal of understanding what it would take to make human data useful for VLAs like π0.5. We record egocentric human data with wearable cameras, and then include it in a co-training recipe with hand poses serving as actions. [image]
@tylerwillis: Now seems likely that there will be billions of dollars spent on paying humans to record training videos for robots.
@physical_int: We were surprised, and wanted to understand why. What about π0.5 enabled emergent human-robot transfer? We ran an experiment to test if it only appears above a certain scale. Turns out human transfer scales with the amount & diversity of robot data in VLA pre-training! [image]
Danfei Xu / @danfei_xu: Most past work throws human data into a pretraining mix. EgoMimic showed that, with proper alignment, you can co-train with human data. In his internship project at Pi, @simar_kareer took this a step further and showed that human data can “post-train” VLAs. This enables robots
Kun Lei / @kunlei15: Very impressive results. Robot data grounds actions/motion, while human ego video adds semantics; together they enable better scene/environment generalization for VLA policies. [image]
Fangzhou Hong / @hongfz16: Encouraging to see this. As VLAs scale, human video and robot data begin to align, making human experience a usable training signal. This is exactly what I've believed in, and what we've been building at @ropedia_ai.
Jie Wang / @jiewang_zjui: Very cool emergent capability of human-robot co-training!
But I have to point out: we haven't gotten a free lunch of learning from in-the-wild YouTube videos. 1) It's still human augmentation. 2) The secret of VLAs is always the wrist camera. 3) Teleoperators have to shape their hands like grippers, which restricts flexibility and dexterity. It may work because the VLA's resolution is low enough...
Oliver / @xos9000: Basically an even more scalable way to collect data than Sunday. Need to fine-tune the model on robot-specific data at some point, but it's still an even cheaper way to collect large amounts of data. You don't get the same quality of data, but you get more general data.
@yash_347: we are finally gonna normalize recording for the sake of data generation; everything is gonna be recorded to the last bit
Jamin Ball / @jaminball: Awesome to see the team at @physical_int continue to push the boundaries. Pumped to be partnered with them at @AltimeterCap!
@rhizonymph: This is so cool. A big problem with getting data to train VLAs is that recording efficient trajectories is nontrivial because of the robot and teleop setup needed for each collector, which requires skill to control well or a complex setup like VR with FBT.
Kun Lei / @kunlei15: Interesting scaling trend: robot-only saturates, while human+robot keeps improving with more robot pretraining. Would love an additional baseline with semantic-matched robot demos (e.g., robot color-sorting) or human demos from different scenes to disentangle task-semantic [image]
Suraj Nair / @surajnair_1: At the start of @simar_kareer's internship this year we set out to use human data to make VLAs better. It turned out that once your VLA is pre-trained on enough diverse robot data... the simple thing just works!
Karol Hausman / @hausman_k: In https://pi.website/..., we show that, at the scale of robot data, human data acts as another embodiment. This animation shows how human and robot data align with enough robot data diversity 🤯 Full thread: https://x.com/... [video]
Jon Miller Schwartz / @jonmschwartz: People in the robotics industry like to argue about what type of data collection will get us to generally capable robots. It's starting to look like the answer is all of it: teleop, direct capture (UMI), human data, and sim. In hindsight, feels a bit obvious?
Russell Mendonca / @mendonca_rl: Sensorized human collection is the most promising path for data scaling to enable generalist robots. Interesting insights on the critical mass of robot diversity needed for the model to start using human data!
Donald / @donaldjewkes: more research should be beautiful. @physical_int and @thinkymachines have well-crafted blogs; who else? [image]
Tarasha Khurana / @tarashakhurana: A human is just another robot with a different form factor 🤷🏻♀️
Brian Cheung / @thisismyhat: There's no such thing as bad data, only bad learning.
@physical_int: We discovered an emergent property of VLAs like π0/π0.5/π0.6: as we scale up pre-training, the model learns to align human videos and robot data! This gives us a simple way to leverage human videos. Once π0.5 knows how to control robots, it can naturally learn from human video. [video]
Tom Zhang / @tom_jiahao: Stunning discovery. What's more exciting: only 3D hand positions from egocentric videos are used as the “action” by @physical_int. So this is only the beginning. Way more value is to be unpacked from egocentric videos. Also it seems that @eddybuild has made the right bet! [image]
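To make the co-training recipe discussed above concrete, here is a minimal, hypothetical sketch: a single policy is trained on a mix of robot demonstrations (robot actions as labels) and egocentric human clips (3D hand poses projected into the same action space as labels). Every name, tensor shape, and the tiny network below is an illustrative assumption, not Physical Intelligence's code or the actual π0.5 architecture.

```python
# Minimal co-training sketch, assuming hand poses can be mapped into the same
# action space as robot end-effector actions. Synthetic data stands in for
# real robot demos and egocentric human video; all of this is illustrative.
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader, ConcatDataset

ACTION_DIM = 7  # assumed: 6-DoF end-effector pose + gripper


class RobotDemos(Dataset):
    """Teleoperated robot trajectories: camera frame -> robot action."""
    def __init__(self, n=256):
        self.obs = torch.randn(n, 3, 64, 64)   # synthetic camera frames
        self.act = torch.randn(n, ACTION_DIM)  # synthetic robot actions

    def __len__(self):
        return len(self.obs)

    def __getitem__(self, i):
        return self.obs[i], self.act[i]


class HumanEgoClips(Dataset):
    """Egocentric human video: frame -> 3D hand pose used as the action label."""
    def __init__(self, n=256):
        self.obs = torch.randn(n, 3, 64, 64)   # synthetic egocentric frames
        self.act = torch.randn(n, ACTION_DIM)  # hand pose mapped into the shared action space

    def __len__(self):
        return len(self.obs)

    def __getitem__(self, i):
        return self.obs[i], self.act[i]


class TinyPolicy(nn.Module):
    """Stand-in for a VLA policy: shared visual encoder, one action head."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(32, ACTION_DIM)

    def forward(self, x):
        return self.head(self.encoder(x))


def cotrain(epochs=2):
    # Co-training: both data sources supervise the same action head, so human
    # clips and robot demos are treated as two embodiments of one policy.
    data = ConcatDataset([RobotDemos(), HumanEgoClips()])
    loader = DataLoader(data, batch_size=32, shuffle=True)
    policy = TinyPolicy()
    opt = torch.optim.Adam(policy.parameters(), lr=1e-3)
    for _ in range(epochs):
        for obs, act in loader:
            loss = nn.functional.mse_loss(policy(obs), act)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return policy


if __name__ == "__main__":
    cotrain()
```

The only point of the sketch is that the two data sources share one action head; the actual recipe described in the thread uses a large pre-trained VLA backbone and real hand-pose annotations rather than this toy setup.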