RoboTHOR comes packaged with AI2-THOR, which can be installed in Python using pip:
pip install ai2thor
After installing AI2-THOR, we can initialize a controller, which will allow us to execute actions in the environment.
from ai2thor.controller import Controller
controller = Controller(
agentMode="locobot",
visibilityDistance=1.5,
scene="FloorPlan_Train1_3",
gridSize=0.25,
movementGaussianSigma=0.005,
rotateStepDegrees=90,
rotateGaussianSigma=0.5,
renderDepthImage=False,
renderInstanceSegmentation=False,
width=300,
height=300,
fieldOfView=60
)
Always set this to "locobot" for RoboTHOR.
A RuntimeError is raised if agentMode is not specified.
Used to set the maximum goal distance, in meters, between the agent and an object. This affects each object's visible property, where an object can only be marked visible when it is at most visibilityDistance meters away from the agent.
The name of the scene to initialize. Valid RoboTHOR scenes are in the form FloorPlan_Train{1:12}_{1:5} and FloorPlan_Val{1:3}_{1:5}.
The mean amount, in meters, that the agent moves after calling a move action (i.e., MoveAhead or MoveBack).
The standard deviation, in meters, controlling the amount of noise in a move action (i.e., MoveAhead or MoveBack).
The mean amount, in degrees, that the agent rotates after calling a rotate action (i.e., RotateRight or RotateLeft).
The standard deviation, in degrees, controlling the amount of noise in a rotate action (i.e., RotateRight or RotateLeft).
When True, a depth frame (see depth_frame) is rendered alongside the RGB frame.
We require this to be explicitly passed in because rendering depth takes longer than rendering RGB alone.
When True, an instance segmentation frame (see instance_segmentation_frame) is rendered alongside the RGB frame.
We require this to be explicitly passed in because rendering instance segmentation takes longer than rendering RGB alone.
The number of horizontally sampled pixels for each frame. This affects every rendered image frame (e.g., RGB, depth, BGR).
The number of vertically sampled pixels for each frame. This affects every rendered image frame (e.g., RGB, depth, BGR).
Changes the camera's optical field of view, in degrees. Valid values are in the domain (0, 180). The default field of view for the LoCoBot is 60 degrees.
Not all agent modes have the same default field of view. For instance, the drone's field of view is 150 degrees, by default.
Any of these initialization parameters can also be changed when resetting the scene:
controller.reset(scene="FloorPlan_Train7_5", rotateStepDegrees=30)
The values will default to what they were upon the most recent reset or initialization. For instance, if you initialized with rotateStepDegrees=30 and then reset without passing rotateStepDegrees, it would remain at 30.
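For example, a small sketch (using the controller initialized above and hypothetical scene choices):
# rotateStepDegrees was set to 30 at initialization (or on a previous reset).
controller.reset(scene="FloorPlan_Train2_1")
# rotateStepDegrees is still 30 here, since it was not passed in again.

controller.reset(scene="FloorPlan_Train2_1", rotateStepDegrees=90)
# rotateStepDegrees is now back to 90.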
The movement and rotation actions mimic what is possible with a physical LoCoBot. We also provide the ability to override the step size and rotation increment on a per-action basis, as described below.
Since the LoCoBot can only move forward and backward, the supported movement actions are MoveAhead and MoveBack:
controller.step(
action="MoveAhead",
moveMagnitude=0.25
)
# Other supported direction
controller.step("MoveBack")
Deterministically overrides the initialized step size, which otherwise defaults to a sample from a Gaussian with mean gridSize and standard deviation movementGaussianSigma. The units are specified in meters.
Changes the yaw rotation of the agent.
controller.step(
action="RotateRight",
degrees=90
)
# Other supported direction
controller.step("RotateLeft")
Deterministically overrides the initialized rotation increment, without noise. It otherwise defaults to a sample from a Gaussian with mean rotateStepDegrees and standard deviation rotateGaussianSigma.
We can adjust the camera's pitch by utilizing the LookUp and LookDown actions:
controller.step("LookUp")
controller.step("LookDown")
controller.step(
action="Teleport",
position=dict(x=0.999, y=1.01, z=-0.3541),
rotation=dict(x=0, y=90, z=0),
horizon=30
)
The rotation of the agent's body in global 3D space. Here, the rotation changes the agent's yaw rotation.
If unspecified, the rotation of the agent will remain unchanged.
The LoCoBot's body cannot change its pitch or roll; hence, the x and z rotation components should always be 0.
The horizon rotates the camera's pitch, in degrees.
If unspecified, the camera's horizon will remain unchanged.
Since the agent looks up and down in 30 degree increments, it is most common for the horizon to be a multiple of 30 (e.g., -30, 0, 30, or 60).
As illustrated in the figure below, negative camera horizon values correspond to the agent looking up, whereas positive values correspond to looking down.
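Since unspecified parameters keep their current values, a common pattern is to save the agent's pose from the metadata and teleport back to it later, for example when resetting an episode. A minimal sketch, assuming the controller from above:
# Save the agent's current pose.
agent = controller.last_event.metadata["agent"]
saved_pose = dict(
    position=agent["position"],
    rotation=agent["rotation"],
    horizon=agent["cameraHorizon"]
)

# ... take some actions ...

# Restore the saved pose.
controller.step(action="Teleport", **saved_pose)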
In addition to Teleport, we provide TeleportFull, which requires every pose parameter to be explicitly passed in:
controller.step(
action="TeleportFull",
position=dict(x=0.999, y=1.01, z=-0.3541),
rotation=dict(x=0, y=90, z=0),
horizon=30
)
The rotation of the agent's body in global 3D space. Here, the rotation changes the agent's yaw rotation.
The LoCoBot's body cannot change its pitch or roll; hence, the x and z rotation components should always be 0.
The horizon rotates the camera's pitch, in degrees.
Since the agent looks up and down in 30 degree increments, it is most common for the horizon to be a multiple of 30 (e.g., -30, 0, 30, or 60).
As illustrated in the figure below, negative camera horizon values correspond to the agent looking up, whereas positive values correspond to looking down.
It is often useful to randomize the position of the agent in the scene before starting an episode. Here, we can use GetReachablePositions.
The process is illustrated below:
positions = controller.step(
action="GetReachablePositions"
).metadata["actionReturn"]
[
dict(x=(...), y=(...), z=(...)),
dict(x=(...), y=(...), z=(...)),
dict(x=(...), y=(...), z=(...)),
{...}
dict(x=(...), y=(...), z=(...)),
]
import random
position = random.choice(positions)
controller.step(
action="Teleport",
position=position
)
A list of every position that the agent can stand at without colliding with any object or going beyond the walls.
The coordinate space for each RoboTHOR scene is shown below. Notice that y corresponds to the upward coordinate in 3D space.
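A fuller randomization might also sample the agent's yaw; the multiples of 90 degrees below are just an illustrative choice:
import random

positions = controller.step(
    action="GetReachablePositions"
).metadata["actionReturn"]

controller.step(
    action="Teleport",
    position=random.choice(positions),
    rotation=dict(x=0, y=random.choice([0, 90, 180, 270]), z=0),
    horizon=0
)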
The Done action does nothing to the state of the environment.
controller.step(action="Done")
It is often used in the definition of a successful navigation task (see Anderson et al.), where the agent must call the done action to signal that it knows that it's done, rather than arbitrarily or biasedly guessing.
It also returns an event containing up-to-date metadata, so it is often used to fetch a cleaned up version of the metadata without otherwise affecting the environment.
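For instance, a navigation episode might be driven by a policy that eventually calls Done to commit to its final position. The sketch below assumes a hypothetical choose_action function that maps the most recent event to one of the supported action names:
MAX_STEPS = 200

for _ in range(MAX_STEPS):
    action = choose_action(controller.last_event)  # hypothetical policy
    event = controller.step(action=action)
    if action == "Done":
        # The environment is unchanged; the event simply carries fresh metadata.
        break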
The figure above shows AI2-THOR's agent-simulator interaction loop. Our backend scenes, actions, and observations are stored within Unity, a powerful real-time game engine.
From the Python side, we sequentially interact with Unity by taking actions using AI2-THOR's Python Controller.
The returned data after executing any action is an Event object:
event = controller.step(...)
controller.last_event
<ai2thor.server.Event
.metadata: {...}
.frame: {...}
.cv2img: {...}
.depth_frame: {...}
.instance_segmentation_frame: {...}
.instance_masks: {...}
.instance_detections2D: {...}
.color_to_object_id: {...}
.object_id_to_color: {...}
/>
Contains information about the action and how it affected the environment. See the Metadata section for more.
Same as frame (the rendered RGB image, stored as a numpy.uint8 array of size (height, width, 3)), except the image channels are in BGR ordering. This is often useful with Python's OpenCV module (i.e., cv2), which expects images in BGR ordering.
Measures the distance, in meters, from the agent to each pixel in the frame. The size of the frame is (height, width) stored in the numpy.float32 format.
Upon initialization or reset, renderDepthImage must be set to True; see the combined example after these attribute descriptions.
Segments every object instance in the scene. The size of the frame is (height, width, 3) stored in the numpy.uint8 format. Each unique pixel color corresponds to a different object, which are indexable with color_to_object_id or object_id_to_color.
Upon initialization or reset, renderInstanceSegmentation must be set to True.
Segments every object instance in the scene, storing each segmentation as a boolean image mask.
The keys are object IDs visible in the frame and each value is the boolean numpy.ndarray of size (height, width) that segments the image.
Upon initialization or reset, renderInstanceSegmentation must be set to True.
Provides the 2D bounding box around every object instance in the scene.
The keys are object IDs and the values are [Upper Left x, Upper Left y, Lower Right x, Lower Right y], where each element is the number of pixels it is from the top left corner of the image.
Upon initialization or reset, renderInstanceSegmentation must be set to True.
Provides a mapping from each pixel's color, represented as a hashable tuple, in the instance_segmentation_frame to what object it represents.
Same as color_to_object_id, except for the keys and values being swapped.
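The sketch below ties several of these attributes together, assuming the controller from above and that opencv-python is installed. It re-initializes rendering with depth and instance segmentation enabled, saves the BGR frame with OpenCV, reads the depth at the center pixel, and lists the objects segmented in the frame:
import cv2

controller.reset(
    renderDepthImage=True,
    renderInstanceSegmentation=True
)
event = controller.step("Done")  # a fresh event with every frame rendered

# Save the BGR frame directly with OpenCV.
cv2.imwrite("frame.png", event.cv2img)

# Depth, in meters, at the center pixel of the frame.
height, width = event.depth_frame.shape
print(event.depth_frame[height // 2, width // 2])

# Object IDs segmented somewhere in the frame, and the segmentation color of one of them.
print(list(event.instance_masks.keys()))
some_object_id = next(iter(event.object_id_to_color))
print(event.object_id_to_color[some_object_id])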
The metadata dictionary contains summary information about the state of the environment.
event.metadata
{
"errorMessage": {...}
"lastActionSuccess": {...}
"actionReturn": {...}
"agent": {...}
"objects": {...}
{...}
}
If an action fails, feedback is provided as to what caused the failure.
States whether the last action was able to successfully execute.
An exception will not be raised upon invalid states. For instance, if an agent is standing right in front of a wall and tries to MoveAhead, the action will fail, lastActionSuccess will be False, and errorMessage will explain what went wrong, as sketched below.
If an action is unsuccessful, the state of the environment will not have changed.
Holds the resulting data for actions that query the environment, such as GetReachablePositions.
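For example (a sketch using the controller from above), a blocked movement can be detected and inspected without any exception handling, and query results can be read from actionReturn:
event = controller.step("MoveAhead")

if not event.metadata["lastActionSuccess"]:
    # The environment is unchanged; the message explains what blocked the agent.
    print(event.metadata["errorMessage"])

# Query actions place their results in actionReturn.
reachable = controller.step(action="GetReachablePositions").metadata["actionReturn"]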
Within the metadata dictionary, the agent's key contains the pose of the agent after the action has executed.
event.metadata["agent"]
{
"cameraHorizon": {...},
"position": {...},
"rotation": {...},
{...}
}
The angle in degrees that the camera's pitch is rotated.
As illustrated in the figure below, negative camera horizon values correspond to the agent looking up, whereas positive values correspond to looking down.
The global position of the agent, with keys for x, y, and z.
The coordinate space for each RoboTHOR scene is shown below. Notice that y corresponds to the upward coordinate in 3D space.
The local rotation of the agent's body, with keys for x (pitch), y (yaw), and z (roll). Since the LoCoBot's body can only change its yaw rotation, both x and z will always be approximately 0.
Object Metadata
event.metadata["objects"]
{
"objectId": {...},
"objectType": {...}
"distance": {...},
"visible": {...},
"position": {...},
"rotation": {...},
{...}
}
The unique ID of each object in the scene. It is generated at runtime and composed of the object's type and its position in the scene (e.g., an ID of the form ObjectType|x|y|z).
The annotated type of the object. Each type is specified in the Object Types section.
The Euclidean distance from near the center-point of the object to the agent's camera.
Indicates whether the object is visible, and within the initialized visibility distance of the agent.
The visible property does not mean the object is literally visible in the frame. Rather, the object both has to be visible and at a distance of less than visibilityDistance away. For more information, refer to iTHOR's concepts page.
The global position of the object, with keys for x, y, and z.
The coordinate space for each RoboTHOR scene is shown below. Notice that y corresponds to the upward coordinate in 3D space.
The local rotation of the object, with keys for x (pitch), y (yaw), and z (roll).
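As an example of working with this metadata (a sketch, assuming the controller from above), we can find the closest currently visible object of a given type with a simple filter:
objects = controller.last_event.metadata["objects"]

visible_mugs = [
    obj for obj in objects
    if obj["objectType"] == "Mug" and obj["visible"]
]

if visible_mugs:
    closest = min(visible_mugs, key=lambda obj: obj["distance"])
    print(closest["objectId"], closest["distance"])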
This section details several utility functions that may help when evaluating navigation agents in RoboTHOR. We also provide all object types that appear in RoboTHOR. For ObjectNav, object types often appear as navigation targets.
TARGET_OBJECT_TYPES = {
"AlarmClock,"
"Apple,"
"BaseballBat,"
"BasketBall,"
"Bowl,"
"GarbageCan,"
"HousePlant,"
"Laptop,"
"Mug,"
"RemoteControl,"
"SprayBottle,"
"Television,"
"Vase"
}
Target object types are guaranteed to have exactly one instance of that type in each scene. Thus, they are well suited to use as target objects in an ObjectNav task.
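For example (a sketch assuming the controller from above and a hypothetical helper name), the single instance of a target type can be looked up directly from the object metadata:
def get_target_object(controller, object_type):
    # Return the metadata entry of the single instance of object_type in the scene.
    matches = [
        obj for obj in controller.last_event.metadata["objects"]
        if obj["objectType"] == object_type
    ]
    assert len(matches) == 1
    return matches[0]

target = get_target_object(controller, "Television")
print(target["position"])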
BACKGROUND_OBJECT_TYPES = {
"ArmChair",
"Bed",
"Book",
"Bottle",
"Box",
"ButterKnife",
"Candle",
"CD",
"CellPhone",
"Chair",
"CoffeeTable",
"Cup",
"DeskLamp",
"Desk",
"DiningTable",
"Drawer",
"Dresser",
"FloorLamp",
"Fork",
"Newspaper",
"Painting",
"Pencil",
"Pen",
"PepperShaker",
"Pillow",
"Plate",
"Pot",
"SaltShaker",
"Shelf",
"SideTable",
"Sofa",
"Statue",
"TeddyBear",
"TennisRacket",
"TVStand",
"Watch"
}
Background object types are then defined as the set of object types that are not target objects. Put differently, these objects appear in at least 1 scene, but do not appear in all of them.
We can use get_shortest_path_to_object_type to compute the shortest path from an initial position to the nearest object of a given type:
from ai2thor.util.metrics import (
get_shortest_path_to_object_type
)
path = get_shortest_path_to_object_type(
controller=controller,
object_type="Apple",
initial_position=dict(
x=0,
y=0.9,
z=0.25
)
)
[
dict(x={...}, y={...}, z={...}),
dict(x={...}, y={...}, z={...}),
dict(x={...}, y={...}, z={...}),
{...}
dict(x={...}, y={...}, z={...}),
]
An initialized AI2-THOR Controller object.
The name of the target object type. An exception is raised if no path to an object of that type can be found.
The starting position of the agent. Positions can be accessed from GetReachablePositions, for instance.
We can use path_distance to compute the length of a path, in meters:
from ai2thor.util.metrics import (
path_distance
)
path_distance(path)
A list of dictionaries with keys for x, y, and z agent positions along the path. This may also be the response from the shortest path utility above, for instance.
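Combining the two utilities gives the shortest path length, in meters, from the agent's current position to the nearest object of a type. A sketch, assuming the controller from above:
from ai2thor.util.metrics import (
    get_shortest_path_to_object_type,
    path_distance
)

start = controller.last_event.metadata["agent"]["position"]

shortest_path = get_shortest_path_to_object_type(
    controller=controller,
    object_type="Apple",
    initial_position=start
)

shortest_path_length = path_distance(shortest_path)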
SPL, Success weighted by (normalized inverse) Path Length, is a quick and common navigation metric (see Anderson et al. and Batra et al.), defined as

$$\text{SPL} = \frac{1}{N} \sum_{i=1}^{N} S_i \frac{L_i}{\max(P_i, L_i)},$$

where, for a single episode $i$, $S_i$ is the binary indicator variable denoting if the episode was successful, $L_i$ is the shortest path length (in meters) from the agent's starting position to the target, and $P_i$ is the path length (in meters) that the agent took. The metric ranges inclusively from 0 to 1.
It may look weird that we compute $\max(P_i, L_i)$. Since $L_i$ is defined as the shortest path length, one may think $L_i \le P_i$ for all paths. But $L_i$ is defined as the shortest path length directly to the object, without the minimum cushion of distance, whereas the agent may act optimally along its path $P_i$ and then stop, say, half a meter short of the target with the episode still being considered successful. So, here, we'd have $P_i < L_i$.
from ai2thor.util.metrics import (
compute_single_spl
)
compute_single_spl(
path,
shortest_path,
successful_path
)
The $P_i$ term in the SPL definition.
A list of dictionaries with keys for x, y, and z agent positions along the path.
The $L_i$ term in the SPL definition.
For ObjectNav, it is most easily calculated using the Shortest Path utility.
The $S_i$ term in the SPL definition.
For instance, with ObjectNav, we often define an episode of navigation to be successful if all of the following criteria are met:
1. The agent signals that it has found the target object by calling the Done action.
2. The target object is within 1 meter of the agent.
3. The target object is within the agent's field of view.
We can use the visible object metadata property, along with initializing visibilityDistance to 1 meter, to ensure that (2) and (3) have been satisfied.
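Putting the pieces together, a single ObjectNav episode's SPL might be computed as sketched below. It assumes the controller was initialized with visibilityDistance=1, that the target type was "Apple", that agent_path is the list of agent positions recorded over the episode, and that the agent ended the episode by calling Done:
from ai2thor.util.metrics import (
    get_shortest_path_to_object_type,
    compute_single_spl
)

# Shortest possible path from the episode's starting position to the target.
shortest_path = get_shortest_path_to_object_type(
    controller=controller,
    object_type="Apple",
    initial_position=agent_path[0]
)

# Criteria (2) and (3): the target is in view and within 1 meter at the end.
success = any(
    obj["objectType"] == "Apple" and obj["visible"]
    for obj in controller.last_event.metadata["objects"]
)

spl = compute_single_spl(agent_path, shortest_path, success)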