RoboTHOR API Reference

agentMode

: str = "default"

required

Always set this to

locobot

in order to use the LoCoBot agent, which was designed for RoboTHOR.

Warning

If unset, the agent will be the default iTHOR agent, which is not the LoCoBot, and thus has different properties than these docs.

A RuntimeError is raised if

agentMode

is not set to "default", "locobot", "drone", or "arm".

visibilityDistance

: float = 1.5

Sets the maximum distance, in meters, at which an object can be considered visible to the agent. This affects each object's visible property: when it is

True

, the agent is within

visibilityDistance

of that object and the object appears in the agent's current egocentric frame.

Remark

Visibility is defined on iTHOR's concepts page.

scene

: Optional[str] = None

The name of the scene to initialize. Valid RoboTHOR scenes are in the form

FloorPlan_Train{1:12}_{1:5}

FloorPlan_Val{1:3}_{1:5}

, such as

FloorPlan_Train3_5

FloorPlan_Val1_2
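As a quick illustration of the naming scheme, the sketch below enumerates every scene name, assuming that `{1:12}` and `{1:5}` denote inclusive integer ranges. The helper name `robothor_scene_names` is hypothetical and not part of AI2-THOR.

```python
def robothor_scene_names():
    """Enumerate scene names, assuming {a:b} means the inclusive range a..b."""
    train = [
        f"FloorPlan_Train{i}_{j}"
        for i in range(1, 13)  # Train1 .. Train12
        for j in range(1, 6)   # _1 .. _5
    ]
    val = [
        f"FloorPlan_Val{i}_{j}"
        for i in range(1, 4)   # Val1 .. Val3
        for j in range(1, 6)   # _1 .. _5
    ]
    return train, val


train_scenes, val_scenes = robothor_scene_names()
```

Under that assumption there are 60 training scene names and 15 validation scene names.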

gridSize

: float = 0.25

The mean amount, in meters, that the agent moves after calling a move action (i.e.,

MoveAhead

and

MoveBack

). The full amount that the agent moves is sampled from

\mathcal N(\mu=\texttt{gridSize}, \sigma=\texttt{movementGaussianSigma})

movementGaussianSigma

: float = 0.005

The standard deviation, in meters, controlling the amount of noise in a move action (i.e.,

MoveAhead

and

MoveBack

). The full amount that the agent moves is sampled from

\mathcal N(\mu=\texttt{gridSize}, \sigma=\texttt{movementGaussianSigma})

rotateStepDegrees

: float = 90

The mean amount, in degrees, that the agent rotates after calling a rotate action (i.e.,

RotateLeft

RotateRight

). The full amount that the agent rotates is then sampled from

\mathcal N(\mu=\texttt{rotateStepDegrees}, \sigma=\texttt{rotateGaussianSigma})

rotateGaussianSigma

: float = 0.5

The standard deviation, in degrees, controlling the amount of noise in a rotate action (i.e.,

RotateLeft

RotateRight

). The full amount that the agent rotates is then sampled from

\mathcal N(\mu=\texttt{rotateStepDegrees}, \sigma=\texttt{rotateGaussianSigma})
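To make the noise model concrete, here is a minimal sketch of how a single move or rotation amount could be drawn from the normal distributions above. It only mirrors the formulas; the function names are hypothetical, and the simulator's internal sampling may differ.

```python
import random


def sample_move_distance(grid_size=0.25, movement_gaussian_sigma=0.005, rng=random):
    """Draw one move distance (meters) from N(gridSize, movementGaussianSigma)."""
    return rng.gauss(grid_size, movement_gaussian_sigma)


def sample_rotation_degrees(rotate_step_degrees=90.0, rotate_gaussian_sigma=0.5, rng=random):
    """Draw one rotation amount (degrees) from N(rotateStepDegrees, rotateGaussianSigma)."""
    return rng.gauss(rotate_step_degrees, rotate_gaussian_sigma)


# A seeded generator keeps the illustration reproducible.
rng = random.Random(0)
moves = [sample_move_distance(rng=rng) for _ in range(1000)]
mean_move = sum(moves) / len(moves)
```

With the default values, the sample mean stays very close to 0.25 meters, and setting a sigma to 0 removes the noise entirely.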

renderDepthImage

: bool = False

When

True

, a depth frame is rendered and made available as an event.depth_frame attribute.

We require this to be explicitly passed in because rendering depth takes longer than rendering RGB alone.

renderInstanceSegmentation

: bool = False

When

True

, an instance segmentation frame is rendered and made available as an

Event.instance_segmentation_frame

attribute. Instance segmentation provides segmentation by object instance, where all objects in view are distinguishable.

We require this to be explicitly passed in because rendering instance segmentation takes longer than rendering RGB alone.

width

: int = 300

The number of horizontal sampled pixels for each frame. This affects every rendered image frame (e.g., RGB, depth, BGR).

height

: int = 300

The number of vertical sampled pixels for each frame. This affects every rendered image frame (e.g., RGB, depth, BGR).

fieldOfView

: int = 60

Changes the camera's optical field of view, in degrees. Valid values are in the domain $(0:180)$ . The default field of view for the LoCoBot is 60 degrees.

Remark

Not all agent modes have the same default field of view. For instance, the drone's field of view is 150 degrees, by default.

Reset

Any

Controller

initialization parameter can later be changed by calling the reset method. For instance, we can call:

controller.reset(scene="FloorPlan_Train7_5", rotateStepDegrees=30)

The values will default to what they were upon the most recent reset or initialization. For instance, if you initialized with

fieldOfView=45

, then called

reset

with

fieldOfView=60

, calling

reset

again, without passing in

fieldOfView

will keep

fieldOfView=60
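The persistence rule described above can be pictured as a dictionary that each reset updates in place of starting over. The sketch below is only a model of that behaviour; `apply_reset` is a hypothetical helper and does not call the simulator.

```python
def apply_reset(current_params, **overrides):
    """Return the parameters in effect after a reset (hypothetical model).

    Parameters that are not passed keep their most recent value.
    """
    updated = dict(current_params)
    updated.update(overrides)
    return updated


# Initialization, followed by two resets that each change one parameter.
params = {"fieldOfView": 45, "rotateStepDegrees": 90}
params = apply_reset(params, fieldOfView=60)
params = apply_reset(params, rotateStepDegrees=30)
```

After the second reset, `fieldOfView` is still 60 even though it was not passed again.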

Navigation

The movement and rotation actions mimic what is possible from a physical LoCoBot. We also provide the ability to

Teleport

and query the environment with

GetReachablePositions

Movement

Since the LoCoBot can only

MoveAhead

and

MoveBack

in real life, its simulated counterpart does not support actions like

MoveLeft

and

MoveRight

controller.step(
    action="MoveAhead",
    moveMagnitude=0.25
)

# Other supported direction
controller.step("MoveBack")

Movement Parameters

moveMagnitude

: Optional[float] = None

Deterministically overrides the initialized step size; no noise is applied. If unset, the step size is sampled from $\mathcal N(\mu=\texttt{gridSize}, \sigma=\texttt{movementGaussianSigma})$ . The units are specified in meters.

Agent Rotation

Changes the yaw rotation of the agent.

controller.step(
    action="RotateRight",
    degrees=90
)

# Other supported direction
controller.step("RotateLeft")

Agent Rotation Parameters

degrees

: Optional[float] = None

Deterministically overrides the initialized rotation increment; no noise is applied. If unset, the rotation is sampled from $\mathcal N(\mu=\texttt{rotateStepDegrees}, \sigma=\texttt{rotateGaussianSigma})$ . The units are specified in degrees.

Camera Rotation

We can adjust the camera's pitch by utilizing the

LookUp

and

LookDown

commands. Both increment in

30^\circ

intervals, with angles being clamped between

30^\circ

in the downward direction and

30^\circ

in the upward direction.

controller.step("LookUp")
controller.step("LookDown")
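The increment-and-clamp behaviour described above can be sketched as a small pure function. It assumes the sign convention used elsewhere in these docs (negative horizon looks up, positive looks down); the function name `look` is hypothetical.

```python
def look(horizon, action, step=30.0, limit=30.0):
    """Return the camera horizon after a LookUp or LookDown (illustrative model).

    Negative horizon values look up; positive values look down.
    """
    if action == "LookUp":
        horizon -= step
    elif action == "LookDown":
        horizon += step
    else:
        raise ValueError(f"unsupported action: {action}")
    # Clamp to the range [-limit, limit].
    return max(-limit, min(limit, horizon))
```

For example, looking up from a level camera gives a horizon of -30, and looking up again leaves it at -30.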

Teleport

Teleport allows the agent to set its pose anywhere in the scene in a single step. Valid poses do not place the agent outside the outer boundaries of the scene or in an area where it collides with an object.

controller.step(
    action="Teleport",
    position=dict(x=0.999, y=1.01, z=-0.3541),
    rotation=dict(x=0, y=90, z=0),
    horizon=30
)

Teleport Parameters

position

: Optional[dict[str, float]] = None

The global position of the agent, with keys for

x

y

, and

z

. If

position

is unspecified, the agent's position will remain unchanged.

The coordinate space for each RoboTHOR scene is shown below. Notice that

y

corresponds to the upward coordinate in 3D space.

rotation

: Optional[dict[str, float]] = None

The rotation of the agent's body in global 3D space. Here, the rotation changes the agent's yaw rotation.

If

rotation

is unspecified, the agent's rotation will remain unchanged.

Remark

The LoCoBot's body cannot change its pitch or roll; hence, $x$ and $z$ are both 0.

horizon

: Optional[float] = None

The

horizon

changes the camera's rotation. Values are clamped between

[-30:30]

. If

horizon

is not specified, the horizon will remain unchanged.

Since the agent looks up and down in $30^\circ$ increments, it is most common for the horizon to be in $\lbrace -30, 0, 30\rbrace$ .

Warning

As illustrated in the figure below, negative camera

horizon

values correspond to the agent looking up, whereas positive

horizon

values correspond to the agent looking down.
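Because every Teleport parameter is optional, it can help to assemble the action from only the values that should change. The sketch below builds such an action as a plain dictionary; `make_teleport_action` is a hypothetical helper, and it applies the documented horizon range and the zero pitch and roll of the LoCoBot body on the Python side as an assumption.

```python
def make_teleport_action(position=None, yaw=None, horizon=None):
    """Build Teleport keyword arguments, omitting anything left unspecified."""
    action = {"action": "Teleport"}
    if position is not None:
        action["position"] = dict(position)
    if yaw is not None:
        # The LoCoBot body only changes yaw, so x and z stay at 0.
        action["rotation"] = {"x": 0, "y": yaw, "z": 0}
    if horizon is not None:
        # Keep the horizon inside the documented [-30:30] range.
        action["horizon"] = max(-30, min(30, horizon))
    return action


turn_only = make_teleport_action(yaw=90)
```

The resulting dictionary could then be unpacked into a step call, for instance `controller.step(**turn_only)`, which would leave the position and horizon unchanged.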

In addition to being able to

Teleport

, we also have an equivalent action

TeleportFull

that strictly requires every degree of freedom to be specified. The strictness of this action is often useful in large projects, where we don't want any part of the agent's pose to be set implicitly.

Warning

TeleportFull

does not guarantee backwards compatibility in future releases. If a new degree of freedom is added to the agent, it will be added as a required parameter. If this is an issue, we recommend using

Teleport

controller.step(
    action="TeleportFull",
    position=dict(x=0.999, y=1.01, z=-0.3541),
    rotation=dict(x=0, y=90, z=0),
    horizon=30
)

TeleportFull Parameters

position

required

The global position of the agent, with keys for

x

y

, and

z

. Because every degree of freedom is required, position must always be specified.

The coordinate space for each RoboTHOR scene is shown below. Notice that

y

corresponds to the upward coordinate in 3D space.

rotation

required

The rotation of the agent's body in global 3D space. Here, the rotation changes the agent's yaw rotation.

The LoCoBot's body cannot change its pitch or roll; hence, $x$ and $z$ are both 0.

horizon

: float

required

The

horizon

changes the camera's rotation. Values are clamped between [-30:30].

Since the agent looks up and down in $30^\circ$ increments, it is most common for the horizon to be in $\lbrace -30, 0, 30 \rbrace$ .

Warning

As illustrated in the figure below, negative camera

horizon

values correspond to the agent looking up, whereas positive

horizon

values correspond to the agent looking down.

Get Reachable Positions

It is often useful to randomize the position of the agent in the scene before starting an episode. Here, we can use:

GetReachablePositions
, which does an optimized BFS over a deterministic grid spaced out by the initialized
gridSize
. The valid positions are then added to a list and returned.
Teleport
, which takes a given position and moves our agent to that position.

The process is illustrated below:

Step 1: Get the Positions

positions = controller.step(
    action="GetReachablePositions"
).metadata["actionReturn"]

Response

[
    dict(x=(...), y=(...), z=(...)),
    dict(x=(...), y=(...), z=(...)),
    dict(x=(...), y=(...), z=(...)),
    {...}
    dict(x=(...), y=(...), z=(...)),
]

Step 2: Teleport to a Position

import random
position = random.choice(positions)
controller.step(
    action="Teleport",
    position=position
)

GetReachablePositions Response

actionReturn

: list[dict[str, float]]

A list of each position that the agent can be at, without colliding into any other objects or going beyond the walls.

The coordinate space for each RoboTHOR scene is shown below. Notice that $y$ corresponds to the upward coordinate in 3D space.
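To give a feel for what a breadth-first search over a `gridSize`-spaced grid looks like, here is a toy sketch on a tiny 2D room. It is not the simulator's implementation: the collision test `is_free`, the integer cell indexing, and the function name are all assumptions made for illustration.

```python
from collections import deque


def reachable_positions(start, is_free, grid_size=0.25, y=0.0):
    """BFS over grid cells, starting from a cell assumed to be free.

    `start` is an (ix, iz) cell index and `is_free(cell)` is a hypothetical
    collision test. Returns positions in meters as dicts with x, y, z keys.
    """
    start = tuple(start)
    seen = {start}
    queue = deque([start])
    while queue:
        ix, iz = queue.popleft()
        # Expand to the four axis-aligned neighbours of the current cell.
        for dx, dz in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            cell = (ix + dx, iz + dz)
            if cell not in seen and is_free(cell):
                seen.add(cell)
                queue.append(cell)
    return [dict(x=ix * grid_size, y=y, z=iz * grid_size) for ix, iz in sorted(seen)]


def toy_room(cell):
    """A 3 x 3 block of cells with the centre cell blocked by an obstacle."""
    ix, iz = cell
    return 0 <= ix < 3 and 0 <= iz < 3 and cell != (1, 1)


positions = reachable_positions((0, 0), toy_room)
```

In this toy room, eight of the nine cells are reachable, and the blocked centre cell never appears in the returned list.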

Done

The

Done

action does nothing to the state of the environment. However, it returns a cleaned-up event with respect to the metadata.

controller.step(action="Done")

It is often used in the definition of a successful navigation task (see Anderson et al.), where the agent must call the done action to signal that it knows that it's done, rather than arbitrarily or biasedly guessing.

Warning

The

Done

action does literally nothing to the state of the environment. For instance, if the agent calls

Done

and then

MoveAhead

, the

Done

action will have no effect on preventing a

MoveAhead

action from executing.

It is often used to return a cleaned up version of the metadata.

Environment State

The figure above shows AI2-THOR's agent-simulator interaction loop. Our backend scenes, actions, and observations are stored within Unity, a powerful real-time game engine.

From the Python side, we sequentially interact with Unity by taking actions using AI2-THOR's Python

Controller

. We have already seen several actions, such as

MoveAhead

RotateRight

, and

LookUp

Events

The returned data after executing any action is an

Event

. Each

Event

contains a ton of information to encode the state of the environment after the action has been executed:

event = controller.step(...)

References the last executed Event

controller.last_event

Response

<ai2thor.server.Event
    .metadata: {...}

    .frame: {...}
    .cv2img: {...}

    .depth_frame: {...}

    .instance_segmentation_frame: {...}
    .instance_masks: {...}
    .instance_detections2D: {...}

    .color_to_object_id: {...}
    .object_id_to_color: {...}
/>

Event Response

metadata

: dict[str, any]

Contains information about the action and how it affected the environment. See the Metadata section for more.

frame

: numpy.ndarray

RGB frame of the scene from the agent's egocentric perspective. The size of the frame is (height, width, 3) stored in the numpy.uint8 format.

cv2img

: numpy.ndarray

Same as the frame, except image channels are in BGR ordering. This is often useful with Python's OpenCV module (i.e., cv2), which expects images with BGR orderings.

depth_frame

: Optional[numpy.ndarray]

Measures the distance, in meters, from the agent to each pixel in the frame. The size of the frame is (height, width) stored in the numpy.float32 format.

Remark

Upon initialization or reset, renderDepthImage must be set to

True

in order for the depth frame to appear.

instance_segmentation_frame

: Optional[numpy.ndarray]

Segments every object instance in the scene. The size of the frame is (height, width, 3) stored in the numpy.uint8 format. Each unique pixel color corresponds to a different object, which are indexable with color_to_object_id or object_id_to_color.

Remark

Upon initialization or reset, renderInstanceSegmentation must be set to

True

in order for the instance segmentation frame to appear.

instance_masks

: Optional[dict[str, numpy.ndarray]]

Segments every object instance in the scene, storing each segmentation as a boolean image mask.

The keys are object IDs visible in the frame and each value is the boolean numpy.ndarray of size (height, width) that segments the image.

Remark

Upon initialization or reset, renderInstanceSegmentation must be set to

True

in order for the instance masks to appear.

instance_detections2D

: Optional[dict[str, numpy.ndarray]]

Provides the 2D bounding box around every object instance in the scene.

The keys are object IDs and the values are [Upper Left $x$ , Upper Left $y$ , Lower Right $x$ , Lower Right $y$ ], where each element is the number of pixels it is from the top left corner of the image.

Remark

Upon initialization or reset, renderInstanceSegmentation must be set to

True

in order for the 2D instance detections to appear.
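As an illustration of the bounding box layout described above, the following sketch derives the center and pixel area of one box. The box values are made up for the example, and `box_center_and_area` is a hypothetical helper.

```python
def box_center_and_area(box):
    """Return ((center_x, center_y), area) for [x1, y1, x2, y2] in pixels.

    (x1, y1) is the upper left corner and (x2, y2) is the lower right corner,
    both measured from the top left corner of the image.
    """
    x1, y1, x2, y2 = box
    center = ((x1 + x2) / 2, (y1 + y2) / 2)
    area = max(0, x2 - x1) * max(0, y2 - y1)
    return center, area


# Made-up detection, keyed by an object ID in the documented format.
detections = {"AlarmClock|-02.08|+00.94|-03.62": [100, 120, 140, 180]}
center, area = box_center_and_area(detections["AlarmClock|-02.08|+00.94|-03.62"])
```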

color_to_object_id

: dict[tuple[int, int, int], str]

Provides a mapping from each pixel's color, represented as a hashable tuple, in the instance_segmentation_frame to what object it represents.

object_id_to_color

: dict[str, tuple[int, int, int]]

Same as color_to_object_id, except for the keys and values being swapped.
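The relationship between the two mappings, and how a pixel color resolves to an object ID, can be sketched with plain Python containers. The colors, the second object ID, and the tiny nested-list "frame" are made-up stand-ins for real event data, and `object_at_pixel` is a hypothetical helper.

```python
# Made-up mapping in the documented shape: (r, g, b) tuple -> object ID.
color_to_object_id = {
    (12, 200, 33): "AlarmClock|-02.08|+00.94|-03.62",
    (250, 10, 90): "Apple|+01.00|+00.90|-00.50",
}

# Swapping keys and values gives the reverse lookup.
object_id_to_color = {obj_id: color for color, obj_id in color_to_object_id.items()}


def object_at_pixel(segmentation_frame, row, col, color_to_object_id):
    """Return the object ID at a pixel, or None if the color is unmapped."""
    # Convert the pixel to a hashable tuple of ints before the lookup.
    color = tuple(int(c) for c in segmentation_frame[row][col])
    return color_to_object_id.get(color)


# A 1 x 3 stand-in for an instance segmentation frame.
frame = [[[12, 200, 33], [250, 10, 90], [0, 0, 0]]]
hit = object_at_pixel(frame, 0, 1, color_to_object_id)
```

Converting each pixel to a tuple matters because lists and arrays are not hashable and so cannot be used as dictionary keys.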

Metadata

The metadata dictionary contains summary information about the state of the environment.

event.metadata

Response

{
    "errorMessage": {...}
    "lastActionSuccess": {...}
    "actionReturn": {...}

    "agent": {...}
    "objects": {...}

    {...}
}

Metadata Response

errorMessage

If an action fails, feedback is provided as to what caused the failure.

lastActionSuccess

: bool

States whether the last action was able to successfully execute.

An exception will not be raised upon invalid states. For instance, if an agent is standing right in front of a wall and tries to

MoveAhead

, it will fail, but an exception will not be raised.

If an action is unsuccessful, the state of the environment will not have changed.

actionReturn

: any

Holds the resulting data for actions that query the environment, such as Get Reachable Positions.
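Since failed actions do not raise exceptions, callers typically need to inspect the metadata themselves. Below is a minimal sketch of such a check, run on made-up metadata dictionaries; `summarize_step` is a hypothetical helper, and the error text is invented for the example.

```python
def summarize_step(metadata):
    """Describe the outcome of the last action from its metadata dict."""
    if metadata["lastActionSuccess"]:
        return "ok"
    # Fall back to a generic note if no error message was provided.
    return "failed: " + (metadata.get("errorMessage") or "no error message")


# Made-up metadata for a successful and a failed action.
succeeded = {"lastActionSuccess": True, "errorMessage": "", "actionReturn": None}
blocked = {"lastActionSuccess": False, "errorMessage": "blocked by wall", "actionReturn": None}

results = [summarize_step(succeeded), summarize_step(blocked)]
```

In practice, the same function could be applied to `event.metadata` after each `controller.step(...)` call.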

Agent Metadata

Within the metadata dictionary, the agent key contains the pose of the agent after the action has executed.

event.metadata["agent"]

Response

{
    "cameraHorizon": {...},
    "position": {...},
    "rotation": {...},
    {...}
}

Agent Metadata Response

cameraHorizon

: float

The angle in degrees that the camera's pitch is rotated.

Warning

As illustrated in the figure below, negative camera

horizon

values correspond to the agent looking up, whereas positive

horizon

values correspond to the agent looking down.

position

The global position of the agent, with keys for $x$ , $y$ , and $z$ .

The coordinate space for each RoboTHOR scene is shown below. Notice that $y$ corresponds to the upward coordinate in 3D space.

rotation

The local rotation of the agent's body, with keys for $x$ (pitch), $y$ (yaw), and $z$ (roll). Since the LoCoBot's body can only change its yaw rotation, both $x$ and $z$ will always be approximately 0.

Object Metadata

event.metadata["objects"]

Response

{
    "objectId": {...},
    "objectType": {...}

    "distance": {...},
    "visible": {...},

    "position": {...},
    "rotation": {...},

    {...}
}

Object Metadata Response

objectId

The unique ID of each object in the scene. It is generated at runtime and composed of an object's

objectType

and

position

Example:

AlarmClock|-02.08|+00.94|-03.62

objectType

The annotated type of the object. Each type is specified in the Object Types section.

distance

: float

The Euclidean distance from near the center-point of the object to the agent's camera.

visible

: bool

Indicates whether the object is visible, and within the initialized visibility distance of the agent.

Warning

The visible property does not mean the object is literally visible in the frame. Rather, the object both has to be visible and at a distance of less than visibilityDistance away. For more information, refer to iTHOR's concepts page.

position

The global position of the object, with keys for $x$ , $y$ , and $z$ .

The coordinate space for each RoboTHOR scene is shown below. Notice that $y$ corresponds to the upward coordinate in 3D space.

rotation

: dict[str, float]

The local rotation of the object, with keys for $x$ (pitch), $y$ (yaw), and $z$ (roll).
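A common use of the object metadata is to pick out objects of a given type that are currently visible. The sketch below does this over a made-up list shaped like the documented fields; `nearest_visible` is a hypothetical helper, and the object IDs are placeholders.

```python
def nearest_visible(objects, object_type):
    """Return the closest visible object of the given type, or None."""
    candidates = [
        obj for obj in objects
        if obj["objectType"] == object_type and obj["visible"]
    ]
    return min(candidates, key=lambda obj: obj["distance"], default=None)


# Made-up object metadata entries using the documented keys.
objects = [
    {"objectId": "Apple|placeholder-1", "objectType": "Apple", "distance": 2.4, "visible": False},
    {"objectId": "Chair|placeholder-1", "objectType": "Chair", "distance": 1.1, "visible": True},
    {"objectId": "Chair|placeholder-2", "objectType": "Chair", "distance": 0.7, "visible": True},
]

closest_chair = nearest_visible(objects, "Chair")
```

Here the apple is excluded because its visible flag is False, so a query for "Apple" would return None.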

Evaluation

This section details several utility functions that may help when evaluating navigation agents in RoboTHOR. We also provide all object types that appear in RoboTHOR. For ObjectNav, object types often appear as navigation targets.

Object Types

TARGET_OBJECT_TYPES = {
    "AlarmClock",
    "Apple",
    "BaseballBat",
    "BasketBall",
    "Bowl",
    "GarbageCan",
    "HousePlant",
    "Laptop",
    "Mug",
    "RemoteControl",
    "SprayBottle",
    "Television",
    "Vase"
}

Target Object Types

Target object types are guaranteed to have exactly one instance of that type in each scene. Thus, they are well suited to use as target objects in an ObjectNav task.

BACKGROUND_OBJECT_TYPES = {
    "ArmChair",
    "Bed",
    "Book",
    "Bottle",
    "Box",
    "ButterKnife",
    "Candle",
    "CD",
    "CellPhone",
    "Chair",
    "CoffeeTable",
    "Cup",
    "DeskLamp",
    "Desk",
    "DiningTable",
    "Drawer",
    "Dresser",
    "FloorLamp",
    "Fork",
    "Newspaper",
    "Painting",
    "Pencil",
    "Pen",
    "PepperShaker",
    "Pillow",
    "Plate",
    "Pot",
    "SaltShaker",
    "Shelf",
    "SideTable",
    "Sofa",
    "Statue",
    "TeddyBear",
    "TennisRacket",
    "TVStand",
    "Watch"
}

Background Object Types

Background object types are then defined as the set of object types that are not target objects. Put differently, these objects appear in at least 1 scene, but do not appear in all of them.

Shortest Path

We can use

get_shortest_path_to_object_type

to get the shortest path to a target object type. It returns a list of positions, representing the corner points on the path.

from ai2thor.util.metrics import (
    get_shortest_path_to_object_type
)

path = get_shortest_path_to_object_type(
    controller=controller,
    object_type="Apple",
    initial_position=dict(
        x=0,
        y=0.9,
        z=0.25
    )
)

Response

[
    dict(x={...}, y={...}, z={...}),
    dict(x={...}, y={...}, z={...}),
    dict(x={...}, y={...}, z={...}),
    {...}
    dict(x={...}, y={...}, z={...}),
]

get_shortest_path_to_object_type Parameters

controller

: ai2thor.controller.Controller

required

An initialized AI2-THOR

Controller

object.

object_type

required

The name of the target object type. A

ValueError

is raised if

object_type

is not a target object type.

initial_position

: dict

required

The starting position of the agent. Positions can be accessed from

event.metadata["agent"]["position"]

Path Distance

We can use

path_distance

to compute the summation over all the edges in our

path

. An edge's value is defined as the L2 norm between 2 consecutive corner points.

from ai2thor.util.metrics import (
    path_distance
)

path_distance(path)

path_distance Parameters

path

required

A list of dictionaries with keys for $x$ , $y$ , and $z$ agent positions along the path. This may also be the response from Shortest Path, for instance.
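The edge-sum definition above can be written out directly. The sketch below is an independent illustration rather than the library's code, so details such as how the $y$ coordinate is treated in `ai2thor.util.metrics` may differ; the name `path_length` is hypothetical.

```python
import math


def path_length(path):
    """Sum the Euclidean distances between consecutive corner points."""
    return sum(
        math.dist((a["x"], a["y"], a["z"]), (b["x"], b["y"], b["z"]))
        for a, b in zip(path, path[1:])
    )


# Made-up path with two straight segments of 3 m and 4 m.
example_path = [
    dict(x=0.0, y=0.9, z=0.0),
    dict(x=3.0, y=0.9, z=0.0),
    dict(x=3.0, y=0.9, z=4.0),
]
total = path_length(example_path)
```

A path with fewer than two points has no edges, so its length under this definition is 0.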

SPL

SPL, Success weighted by (normalized inverse) Path Length, is a quick and common navigation metric (see Anderson et al. and Batra et al.), defined as

\text{SPL}_i = S_i \cdot \frac{\ell_i}{\max(p_i, \ell_i)},

where, for a single episode $i$ , $S_i$ is the binary indicator variable denoting if the episode was successful, $\ell_i$ is the shortest path length (in meters) from the agent's starting position to the target, and $p_i$ is the path length (in meters) that the agent took. The metric ranges inclusively from $[0:1]$ .

Remark

It may look weird that we compute $\max(p_i, \ell_i)$ . Since $\ell_i$ is defined as the shortest path length, one may think $\max(p_i, \ell_i) = p_i$ for all paths $p_i$ . But $\ell_i$ is defined as the shortest path length directly to the object, without the minimum cushion of distance, whereas on its own path, $p_i$ , the agent may act optimally and then stop, say, half a meter short of the target, with the episode still being considered successful. So, here, we'd have $\ell_i \gt p_i$ .
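As a worked illustration of the formula, the sketch below computes SPL for one episode from the two path lengths, with $\ell_i$ as the shortest path length and $p_i$ as the length of the path the agent took. The function `single_spl` is a hypothetical stand-in that takes lengths directly, and the handling of the zero-length case is an assumption.

```python
def single_spl(success, shortest_path_length, agent_path_length):
    """SPL_i = S_i * l_i / max(p_i, l_i) for a single episode."""
    if not success:
        return 0.0
    denominator = max(agent_path_length, shortest_path_length)
    if denominator == 0:
        # Assumption: a successful zero-length episode counts as optimal.
        return 1.0
    return shortest_path_length / denominator


longer_than_optimal = single_spl(True, 4.0, 5.0)  # 4 / max(5, 4)
stopped_short = single_spl(True, 4.0, 3.5)        # 4 / max(3.5, 4)
failed = single_spl(False, 4.0, 5.0)
```

The second call shows why the maximum is needed: an agent that stops short of the target can travel less than $\ell_i$ , and the score is capped at 1.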

from ai2thor.util.metrics import (
    compute_single_spl
)

compute_single_spl(
    path,
    shortest_path,
    successful_path
)

compute_single_spl Parameters

path

required

The $p_i$ term in the SPL definition.

A list of dictionaries with keys for $x$ , $y$ , and $z$ agent positions along the path.

shortest_path