The convergence of embodied agents and large language models (LLMs) has brought significant advancements to embodied instruction following. In particular, the strong reasoning capabilities of LLMs make it possible for robots to perform long-horizon tasks without expensive annotated demonstrations. However, public benchmarks for testing the long-horizon reasoning capabilities of language-conditioned robots in various scenarios are still missing. To fill this gap, this work focuses on the tabletop manipulation task and releases a simulation benchmark, LoHoRavens, which covers various long-horizon reasoning aspects spanning color, size, space, arithmetic, and reference. Furthermore, long-horizon manipulation with LLMs faces a key modality bridging problem that prior work has studied less: how to incorporate the observation feedback produced during robot execution into the LLM's closed-loop planning. We investigate two methods of bridging the modality gap: caption generation and a learnable interface, which incorporate explicit and implicit observation feedback into the LLM, respectively. These methods serve as the two baselines for our proposed benchmark. Experiments show that both methods struggle to solve some tasks, indicating that long-horizon manipulation tasks are still challenging for current popular models. We expect that the proposed public benchmark and baselines will help the community develop better models for long-horizon tabletop manipulation tasks.
Using an LLM as the planner for a robot's execution has become a mainstream method. However, how to incorporate real-time visual observation feedback into the LLM's input is still an under-explored problem. This modality gap is especially severe for long-horizon robotic tasks because an execution error at any step can affect all the steps that follow. To solve this modality bridging problem, we propose two baseline methods that translate the visual observation into feedback the LLM can understand for its closed-loop planning. We use the Planner-Actor-Reporter paradigm to unify the two baselines: in each of them, the feedback generation model serves as the Reporter module.
Inner Monologue demonstrated that human-provided language feedback can significantly improve high-level instruction completion on robotic manipulation tasks. However, human-written language feedback is too expensive to scale. We therefore explore a caption-generation-based model as an automatic, training-free way to generate language feedback.
As shown in the above figure, we use Llama 2 and the trained pick-and-place CLIPort primitive as the Planner and Actor, respectively. For the Reporter, we use the VLM OpenFlamingo with few-shot prompting to generate two types of feedback: observation state feedback, which describes the objects on the table and their potential changes, and action success state feedback, which describes whether the last instruction was executed successfully.
After each step's action is executed, the simulator renders a top-down RGB image. The VLM, acting as the Reporter module, generates the caption feedback from either the current image or the whole image history. This caption feedback is sent to the LLM for its next-step planning. The Planner-Actor-Reporter closed loop is executed iteratively until the high-level goal is achieved or the maximum number of trial steps is exceeded.
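The closed loop described above can be sketched in a few lines of Python. This is a minimal illustration rather than the benchmark's implementation: `run_closed_loop`, `Feedback`, and the toy `demo` stubs are hypothetical names, and the "image" is just a list of objects still on the table, standing in for the rendered top-down RGB image. In the real system the three callables would wrap Llama 2, CLIPort, and OpenFlamingo.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple


@dataclass
class Feedback:
    """The two feedback types produced by the Reporter (hypothetical container)."""
    observation: str      # objects on the table and their changes
    action_success: str   # whether the last instruction succeeded


def run_closed_loop(
    goal: str,
    initial_image: Any,
    planner: Callable[[str, List[Tuple[str, Feedback]]], str],
    actor: Callable[[str], Any],
    reporter: Callable[[List[Any]], Feedback],
    goal_reached: Callable[[Any], bool],
    max_steps: int = 10,
) -> Tuple[bool, List[Tuple[str, Feedback]]]:
    """Iterate Planner -> Actor -> Reporter until the goal is met or steps run out."""
    images = [initial_image]
    history: List[Tuple[str, Feedback]] = []
    for _ in range(max_steps):
        instruction = planner(goal, history)   # LLM plans the next step
        images.append(actor(instruction))      # primitive executes, simulator renders
        feedback = reporter(images)            # VLM captions the image history
        history.append((instruction, feedback))
        if goal_reached(images[-1]):
            return True, history
    return False, history


def demo(max_steps: int = 10):
    """Toy stand-ins: the first pick attempt fails, later attempts succeed."""
    remaining = ["red block", "blue block"]
    attempts = {"count": 0}

    def planner(goal, history):
        return f"pick up the {remaining[0]} and place it in the bowl"

    def actor(instruction):
        attempts["count"] += 1
        if attempts["count"] > 1:              # simulate one execution error
            remaining.pop(0)
        return list(remaining)                 # snapshot standing in for an image

    def reporter(images):
        before, after = images[-2], images[-1]
        success = "succeeded" if len(after) < len(before) else "failed"
        return Feedback(
            observation=f"on the table: {', '.join(after) or 'nothing'}",
            action_success=success,
        )

    return run_closed_loop(
        "put all blocks in the bowl", list(remaining),
        planner, actor, reporter, lambda image: len(image) == 0, max_steps,
    )
```

In this toy run the failed first step is reported back to the planner, which simply retries. That retry is the behavior the feedback is meant to enable, and it costs one of the available trial steps.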
Explicitly converting an image to language captions is simple and straightforward. However, it typically causes information loss and exaggerates bias present in the training data. On the other hand, training an end-to-end multimodal LLM would be too expensive. A common alternative used in many VLMs is therefore a learnable interface, such as a projection-based interface or a group of learnable query tokens, that connects the vision and language modalities while the parameters of the LLM and the visual encoder stay frozen.
We use LLaVA for this second baseline. LLaVA uses the simple projection-based scheme as the learnable interface between the vision model and the pretrained LLM. As shown in the above figure, the pretrained CLIP visual encoder ViT-L/14 encodes the observation image to visual embeddings. A single-layer MLP as the learnable interface then translates the visual embeddings to the LLM's token embedding space. The LLM will generate the next-step plan conditioned on the language instruction prompts and the translated visual embeddings. LLaVA uses LLaMA as the LLM.
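To make the projection-based interface concrete, the sketch below applies one linear map to each visual embedding and prepends the results to the text token embeddings. It is a dependency-free toy illustration, not LLaVA's code: the function names, the tiny dimensions, and the choice to place visual tokens before the text tokens are assumptions made for this example.

```python
from typing import List

Vector = List[float]


def project(visual_tokens: List[Vector], weight: List[Vector], bias: Vector) -> List[Vector]:
    """Map each visual embedding into the LLM token-embedding space: y = W x + b."""
    return [
        [sum(w * x for w, x in zip(row, token)) + b for row, b in zip(weight, bias)]
        for token in visual_tokens
    ]


def build_llm_input(
    visual_tokens: List[Vector],
    text_embeddings: List[Vector],
    weight: List[Vector],
    bias: Vector,
) -> List[Vector]:
    """Concatenate projected visual tokens with text embeddings (order is an assumption)."""
    return project(visual_tokens, weight, bias) + text_embeddings


# Toy sizes: visual dim 2, LLM embedding dim 3. Only `weight` and `bias`
# would be trained; the visual encoder and the LLM stay frozen.
weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
bias = [0.0, 0.0, 1.0]
visual_tokens = [[2.0, 3.0], [0.5, -0.5]]
text_embeddings = [[0.1, 0.2, 0.3]]

sequence = build_llm_input(visual_tokens, text_embeddings, weight, bias)
```

After projection, every element of the sequence has the LLM's embedding dimension, so the frozen LLM can attend over visual and text tokens alike.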
To fine-tune LLaVA, for each step of each task instance in the training set, we use the simulator's oracle program to generate the image before the step and the language instruction for the step; together these form one training pair. At inference time, LLaVA receives the image generated after each step's execution, just as the caption-generation-based model does, and outputs the next-step language instruction, which CLIPort then executes.
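The data construction can be sketched as a flattening of oracle trajectories into (image, instruction) pairs. The `OracleStep` type, the episode layout, and the file names below are hypothetical. Including the high-level goal as the prompt is also an assumption of this sketch, since the text specifies only the image and the step instruction.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Tuple


@dataclass
class OracleStep:
    """One step of an oracle trajectory (hypothetical structure)."""
    image_before: Any   # observation rendered before the step is executed
    instruction: str    # step-level language instruction from the oracle


def build_training_pairs(
    episodes: List[Tuple[str, List[OracleStep]]],
) -> List[Dict[str, Any]]:
    """Turn (goal, oracle steps) episodes into one training example per step."""
    pairs = []
    for goal, steps in episodes:
        for step in steps:
            pairs.append({
                "image": step.image_before,   # model input: pre-step image
                "prompt": goal,               # assumption: goal used as the text prompt
                "target": step.instruction,   # supervision: next-step instruction
            })
    return pairs


# Placeholder strings stand in for rendered images.
episodes = [
    ("put all blocks in the bowl", [
        OracleStep("img_0.png", "pick up the red block and place it in the bowl"),
        OracleStep("img_1.png", "pick up the blue block and place it in the bowl"),
    ]),
]
pairs = build_training_pairs(episodes)
```

Each episode with n oracle steps yields n examples, and the image in each example shows the state before the target instruction is carried out.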