>>97238335
https://storage.googleapis.com/deepmind-media/gemini-robotics/gemini_robotics_report.pdf
>The first is Gemini Robotics, an advanced vision-language-action (VLA) model that was built on Gemini 2.0 with the addition of physical actions as a new output modality for the purpose of directly controlling robots.
>The second is Gemini Robotics-ER, a Gemini model with advanced spatial understanding, enabling roboticists to run their own programs using Gemini’s embodied reasoning (ER) abilities.
The ER model is capable of plotting grasp points and angles, and can map points between two views of the same scene. They distilled the ER model into a smaller model that can be inferenced fast enough to be useful.
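IIRC the pointing outputs come back as 2D points in (y, x) order normalized to a 0-1000 range (same convention Gemini 2.0 uses for pointing), so if you want to play with the grasp-point stuff yourself you basically just need a small conversion helper. Quick sketch, the JSON schema here is my own guess for illustration, not their actual API:

import json

def parse_points(model_response: str, img_w: int, img_h: int) -> dict:
    # Assumed response format (made up for this example):
    # [{"point": [430, 770], "label": "banana"}], with (y, x) normalized to 0-1000.
    points = {}
    for entry in json.loads(model_response):
        y_norm, x_norm = entry["point"]
        points[entry["label"]] = (round(x_norm / 1000 * img_w), round(y_norm / 1000 * img_h))
    return points

print(parse_points('[{"point": [430, 770], "label": "banana"}]', 640, 480))
# {'banana': (493, 206)}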
They do this weird 'split inference' trick where they do most of the inference in the cloud and then do the decoding locally. Apparently this reduces latency? picrel suggests the decoding considers the most recent robot data. They say there's a query-to-response latency of 250ms, yet they also say that by chunking actions together they get an 'effective' control frequency of 50 Hz. Could this solve the latency bottleneck that Vedal talked about regarding LLMs playing time-sensitive games, by making pipelining work?
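Toy sketch of how I read the chunking + pipelining thing (numbers eyeballed from the paper, this is not their actual stack): you keep stepping through the current action chunk at 50 Hz while the next cloud query is already in flight, so the ~250ms round trip stays hidden as long as one chunk covers more wall-clock time than one query takes.

import time
from concurrent.futures import ThreadPoolExecutor

CONTROL_HZ = 50          # effective control rate from stepping through chunked actions
QUERY_LATENCY_S = 0.25   # ballpark query-to-response latency from the paper

def query_model(observation):
    # Stand-in for the cloud backbone + local decoder: returns a chunk of actions.
    time.sleep(QUERY_LATENCY_S)                 # pretend network + inference time
    return [f"action_{i}" for i in range(25)]   # 25 actions = 0.5 s of motion at 50 Hz

def control_loop(steps=200):
    with ThreadPoolExecutor(max_workers=1) as pool:
        chunk = query_model(observation=None)                  # first chunk, blocking
        pending = pool.submit(query_model, observation=None)   # next query already in flight
        i = 0
        for _ in range(steps):
            action = chunk[min(i, len(chunk) - 1)]  # repeat last action if a chunk runs dry
            i += 1
            time.sleep(1 / CONTROL_HZ)              # "send" action to the robot at 50 Hz
            if pending.done():                      # new chunk landed: swap it in, query again
                chunk, i = pending.result(), 0
                pending = pool.submit(query_model, observation=None)

control_loop()

With 0.5s chunks and 0.25s latency the next chunk always arrives before the current one runs out, which I assume is the whole point, and also why it might work for time-sensitive games: the model only has to be fast enough per chunk, not per frame.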
>robot modality
Contrary to what I expected, it just generates single-use Python snippets using an API provided by the robot. They don't mention the modality in the paper.
>Gemini Robotics-ER and Gemini Robotics were trained on datasets comprised of images, text, and robot sensor and action data
I wonder if there are any open source datasets for these specific challenges (e.g. finding grasp points, 3D bounding box estimation). They seem rather special-purpose (as opposed to regular LLMs, which engorge themselves on pure streams of text).
>Example robot control output:
banana_name = "banana"
grasp_position, grasp_orientation = robot.get_grasp_position_and_euler_orientation(gripper=Gripper.LEFT,
    object_name=banana_name)
robot.move_gripper_to(position=grasp_position, orientation=grasp_orientation, gripper=Gripper.LEFT)
robot.close_gripper(gripper=Gripper.LEFT)