The figure above shows AI2-THOR's agent-simulator interaction loop. Our backend scenes, actions, and observations are stored within Unity, a powerful real-time game engine. From the Python side, we sequentially interact with Unity by taking actions through AI2-THOR's Python Controller.
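As a minimal sketch of that loop (the scene name and action below are illustrative), a controller can be created and stepped from Python as follows:
from ai2thor.controller import Controller

controller = Controller(scene="FloorPlan1")    # launches the Unity backend
event = controller.step(action="RotateRight")  # executes an action in Unity
print(event.metadata["lastActionSuccess"])     # inspect the returned Event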
The returned data after executing any action is an Event object:
event = controller.step(...)
controller.last_event
<ai2thor.server.Event
.metadata: {...}
.frame: {...}
.cv2img: {...}
.depth_frame: {...}
.instance_segmentation_frame: {...}
.instance_masks: {...}
.instance_detections2D: {...}
.color_to_object_id: {...}
.object_id_to_color: {...}
/>
Contains information about the action and how it affected the environment. See the metadata section below for more.
RGB frame of the scene from the agent's egocentric perspective. The size of the frame is the initialized (height, width, 3).
Same as the frame, except the channels are in BGR order for direct use with OpenCV.
Measures the distance, in meters, from the agent to each pixel in the frame. The size of the frame is the initialized (height, width).
Upon initialization, renderDepthImage must have been set to True.
Segments every object instance in the scene. The size of the frame is (height, width, 3).
Upon initialization, renderInstanceSegmentation must have been set to True.
Segments every object instance in the scene, storing each segmentation as a boolean image mask.
The keys are object IDs and the values are boolean image masks of size (height, width).
Upon initialization, renderInstanceSegmentation must have been set to True.
Provides the 2D bounding box around every object instance in the scene.
The keys are object IDs and the values are [Upper Left x, Upper Left y, Lower Right x, Lower Right y], where each element is the number of pixels it is from the top left corner of the image.
Upon initialization, renderInstanceSegmentation must have been set to True.
Provides a mapping from each pixel's color in the instance_segmentation_frame, represented as a hashable tuple, to the ID of the object it belongs to.
Same as color_to_object_id, but with the keys and values swapped.
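As a sketch of how the optional depth and segmentation frames above are enabled (the width and height values are illustrative), the corresponding render flags can be passed at initialization:
from ai2thor.controller import Controller

controller = Controller(
    scene="FloorPlan1",
    width=300,
    height=300,
    renderDepthImage=True,
    renderInstanceSegmentation=True,
)
event = controller.step(action="Pass")
rgb = event.frame             # (300, 300, 3) uint8 RGB image
depth = event.depth_frame     # (300, 300) float32 distances, in meters
masks = event.instance_masks  # {objectId: boolean mask}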
The metadata dictionary contains summary information about the state of the environment.
event.metadata
{
"errorMessage": {...}
"lastActionSuccess": {...}
"actionReturn": {...}
"lastAction": {...}
"sceneName": {...}
"sceneBounds": {...}
"agent": {...}
"objects": {...}
"fov": {...}
"screenWidth": {...}
"screenHeight": {...}
{...}
}
If an action fails, feedback is provided as to what caused the failure.
States whether the last action was able to successfully execute.
An exception will not be raised upon invalid states. For instance, if an agent is standing right in front of a wall and tries to MoveAhead, the action will simply fail, with lastActionSuccess set to False.
If an action is unsuccessful, the state of the environment will not have changed.
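For example, a failed action can be detected and diagnosed from the metadata rather than from an exception (a minimal sketch, assuming controller is an initialized Controller):
event = controller.step(action="MoveAhead")
if not event.metadata["lastActionSuccess"]:
    # the action failed; the environment state is unchanged
    print("MoveAhead failed:", event.metadata["errorMessage"])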
Holds the resulting data for actions that query the environment, such as GetReachablePositions.
The name of the action passed into the controller.
The name of the scene that the agent is currently located in.
This object provides all coordinates that are within the bounds of the scene. This can be used in tandem with actions that place the agent, objects, or cameras at global coordinates.
{
"center": {
'x': -1.45,
'y': 1.407,
'z': 0.2
},
"cornerPoints": [
[ 1.5, 2.92, 3.2],
[ 1.5, 2.92, -2.8],
[ 1.5, -0.106, 3.2],
[ 1.5, -0.106, -2.8],
[-4.4, 2.92, 3.2],
[-4.4, 2.92, -2.8],
[-4.4, -0.106, 3.2],
[-4.4, -0.106, -2.8]
],
"size": {
"x": 5.9,
"y": 3.02,
"z": 6.0
}
}
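A short sketch of reading the scene bounds from the most recent event:
bounds = controller.last_event.metadata["sceneBounds"]
print(bounds["center"], bounds["size"])  # center and extents of the scene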
Information about the pose of the agent. See Agent Metadata for more.
Within the metadata dictionary, the agent key provides information about the agent's pose.
event.metadata["agent"]
{
cameraHorizon: {...},
isStanding: {...},
position: {...},
rotation: {...},
{...}
}
The angle in degrees that the camera's pitch is rotated.
Negative cameraHorizon values correspond to the agent looking up, whereas positive values correspond to the agent looking down.
The isStanding flag is True when the agent is standing and False when it is crouching.
The global position of the agent, with keys for x, y, and z.
The local rotation of the agent's body, with keys for x (pitch), y (yaw), and z (roll). Since the agent's body only rotates about its yaw axis, the x and z components are typically 0.
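A sketch of reading the agent's pose from the metadata:
agent = controller.last_event.metadata["agent"]
print(agent["position"])         # {'x': ..., 'y': ..., 'z': ...}
print(agent["rotation"]["y"])    # yaw, in degrees
print(agent["cameraHorizon"])    # camera pitch, in degrees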
Each object has a plethora of information exposed about it in each event.
event.metadata["objects"][i]
{
"objectId": {...},
"objectType": {...},
"name": {...},
"distance": {...},
"visible": {...},
"position": {...},
"rotation": {...},
"axisAlignedBoundingBox": {...},
"objectOrientedBoundingBox": {...},
"mass": {...},
"salientMaterials": {...},
"parentReceptacles": {...},
"receptacle": {...},
"receptacleObjectIds": {...},
"ObjectTemperature": {...},
"canChangeTempToHot": {...},
"canChangeTempToCold": {...},
"moveable": {...},
"isMoving": {...},
"pickupable": {...},
"isPickedUp": {...},
{...Object State Changes...}
}
The unique ID of each object in the scene. It is generated at runtime and composed of an object's type and its starting position in the scene.
Example: Apple|+1.25|+0.25|-0.75
The annotated type of the object. Each type is specified in the Object Types documentation.
Name of the object in the Unity scene. These names are unique within any individual scene.
The Euclidean distance from near the center-point of the object to the agent's camera.
Indicates whether the object is visible, and within the initialized visibility distance of the agent.
The visible property does not just mean the object appears in the frame. Rather, the object must both be in view and within the initialized visibilityDistance of the agent's camera.
The global position of the object, with keys for x, y, and z.
Note that y corresponds to the upward coordinate in 3D space.
The local rotation of the object, with keys for x (pitch), y (yaw), and z (roll).
Returns an axisAlignedBoundingBox object that includes an 8x3 matrix of xyz coordinates that represent the 8 corners of the box, an xyz dictionary for the coordinates of the center of the box, and an xyz dictionary for the size (extents) of the box. This axis-aligned bounding box is the smallest box that completely encloses the sim object while staying aligned to the world axes. This means that if the object is rotated or moved, the overall size and volume of this axis-aligned box will change in order to remain aligned to the static world axes. This is best used for rough approximations of the area/volume an object takes up. Do note that large, oddly shaped objects like CounterTops that wrap around kitchens may have awkwardly defined axis-aligned bounds. Example:
{
"center": {
"x": -1.336,
"y": 1.098,
"z": 0.221
},
"cornerPoints": [
[-1.232, 1.277, 0.319],
[-1.232, 1.277, 0.124],
[-1.232, 0.919, 0.319],
[-1.232, 0.919, 0.124],
[ -1.44, 1.277, 0.319],
[ -1.44, 1.277, 0.124],
[ -1.44, 0.919, 0.319],
[ -1.44, 0.919, 0.124],
],
"size": {
"x": 0.208,
"y": 0.358,
"z": 0.195
}
}
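For instance, the size entry can be used for a rough volume estimate (a sketch; indexing the first object is only illustrative):
obj = controller.last_event.metadata["objects"][0]
size = obj["axisAlignedBoundingBox"]["size"]
volume = size["x"] * size["y"] * size["z"]  # approximate volume, in cubic meters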
Returns an objectOrientedBoundingBox object that includes an 8x3 matrix of xyz coordinates that represent the 8 corners of the object-oriented box. This object-oriented bounding box is a box that completely encloses a sim object. The difference between this object-oriented box and the axis-aligned box above is that this box's dimensions are static relative to the object's rotation and position. This means this object-oriented box will always have the same total volume regardless of how the object is manipulated, rotated, or moved. Note that only Pickupable objects have an objectOrientedBoundingBox at this time. Example:
{
'cornerPoints': [
[-1.445, 0.910, 0.115],
[-1.228, 0.910, 0.115],
[-1.228, 0.910, 0.328],
[-1.445, 0.910, 0.328],
[-1.445, 1.284, 0.115],
[-1.228, 1.284, 0.115],
[-1.228, 1.284, 0.328],
[-1.445, 1.284, 0.328]
]
}
The mass of a Pickupable or Moveable sim object, in kilograms.
Array of strings listing the salient materials a pickupable object is composed of. Valid strings are: Metal, Wood, Plastic, Glass, Ceramic, Stone, Fabric, Rubber, Food, Paper, Wax, Soap, Sponge, and Organic.
A list of objectId strings of all receptacles that contain this object.
If True, this object has the Receptacle property and can contain other objects.
If the object is a receptacle, this is an array of objectIds that the receptacle contains.
String that indicates what this object’s current abstracted temperature is. Valid strings are: Hot, Cold, RoomTemp.
If True, this object is a source of Heat and can contextually change other objects' Temperature to Hot.
If True, this object is a source of Cold and can contextually change other objects' Temperature to Cold.
If True, this object has the Moveable property and can be moved around the environment with actions like PushObject.
A bool tracking if this object is actively in motion. This can be useful when tracking object behaviors while using PausePhysicsAutoSim and AdvancePhysicsStep actions.
If True, this object has the Pickupable property and can be picked up by the agent with the PickupObject action.
Only used if this object is pickupable = True. If True, this object is currently being picked up by the agent. If False, the object is not being held by the agent.
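As a sketch of how these fields are commonly used, the object list can be filtered, for example to collect the IDs of visible, pickupable objects:
objects = controller.last_event.metadata["objects"]
visible_pickupables = [
    obj["objectId"]
    for obj in objects
    if obj["visible"] and obj["pickupable"]
]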
Environment queries are actions that query the environment to extract additional metadata regarding the current state. Since there is a performance cost that comes from calculating each environment query, they are not automatically provided in the metadata after each action.
Query actions do not alter the state of the environment. Thus, they are often substantially faster than non-query actions (e.g., MoveAhead, RotateRight), since image frames and object metadata can be reused from the previous Event.
This action can be used in tandem with object interaction actions, where an objectId is often a required parameter.
The x coordinate from the current image frame, corresponding to the relative distance from the left of the frame. Valid values are in [0, 1].
The y coordinate from the current image frame, corresponding to the relative distance from the top of the frame. Valid values are in [0, 1].
If
The x coordinate from the current image frame, corresponding to the relative distance from the left of the frame. Valid values are in [0, 1].
The y coordinate from the current image frame, corresponding to the relative distance from the top of the frame. Valid values are in [0, 1].
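One query action that takes these normalized frame coordinates is GetObjectInFrame; assuming that action, a usage sketch that queries the object under the center of the frame might look like:
query = controller.step(
    action="GetObjectInFrame",
    x=0.50,
    y=0.50,
)
object_id = query.metadata["actionReturn"]  # objectId under the queried pixel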
positions = controller.step(
action="GetReachablePositions"
).metadata["actionReturn"]
[
dict(x=(...), y=(...), z=(...)),
dict(x=(...), y=(...), z=(...)),
dict(x=(...), y=(...), z=(...)),
{...}
dict(x=(...), y=(...), z=(...)),
]
A list of positions that the agent can reach in the scene.
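For example, a sketch of teleporting the agent to a randomly chosen reachable position:
import random

target = random.choice(positions)
controller.step(action="Teleport", position=target)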
In order for an object to be interactable, the agent typically has to be near it, with the object in view. This action returns the agent poses from which a given object is interactable.
The action also provides the ability to restrict which poses should be searched, for each degree of freedom. For instance, one can restrict the search to poses where the agent is standing.
This action is expected to be used solely in tandem with TeleportFull. Hence, in future releases, if the agent is given extra degrees of freedom, those degrees of freedom will be included in each pose and will thus change the returned poses. Therefore, we recommend not indexing into any pose (i.e., accessing individual keys such as pose["x"]) and instead passing each full pose into TeleportFull, as shown below.
[
dict(
x=(...),
y=(...),
z=(...),
horizon=(...),
rotation=(...),
standing=(...)
),
{...},
dict(
x=(...),
y=(...),
z=(...),
horizon=(...),
rotation=(...),
standing=(...)
)
]
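A sketch of the call that produces such a list of poses (the objectId below is illustrative; use one taken from the scene's object metadata):
poses = controller.step(
    action="GetInteractablePoses",
    objectId="Apple|+1.25|+0.25|-0.75",
).metadata["actionReturn"]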
import random
pose = random.choice(poses)
controller.step("TeleportFull", **pose)
The objectId of the object with which the interactable poses will be queried.
Restricts which positions should be searched. If not specified, all positions from GetReachablePositions are searched.
Restricts which rotation values should appear in the returned response. For instance, if rotations=[0, 180], only poses with the agent rotated to 0 or 180 degrees are returned. If not specified, the rotations default to multiples of the rotation step offset from the agent's current rotation (i.e., r, r + s, r + 2s, ... mod 360), where r is the current rotation of the agent and s is the initialized rotateStepDegrees.
An exception is thrown if 360 mod the initialized rotateStepDegrees is not 0.
Restricts which horizons should be searched. For instance, if horizons=[0, 30], only poses with the camera level or tilted 30 degrees downward are returned.
Each horizon must be in [-30, 60].
Restricts which standing values should be searched (i.e., whether the agent is standing or crouching).
The global position of where the camera will be placed.
The global rotation of how the camera will be oriented.
Changes the camera's optical field of view, in degrees. Valid values are in the domain (0, 180).
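Assuming these parameters belong to the AddThirdPartyCamera action, a sketch with illustrative pose values:
controller.step(
    action="AddThirdPartyCamera",
    position=dict(x=-1.25, y=1.0, z=-1.0),
    rotation=dict(x=30, y=90, z=0),
    fieldOfView=90,
)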
Image frames are then accessible with the third_party_camera_frames attribute on the returned Event:
<ai2thor.server.Event
.third_party_camera_frames
{...}
/>
A list of RGB frames rendered from each third-party camera. Images are stored as numpy.uint8 ndarrays of size (height, width, 3), where the height and width are the same as the agent's egocentric image height and width.
Camera frames in the list appear in the order that they were added to the scene.
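For example, the frame of the most recently added camera can be read with (sketch):
frame = controller.last_event.third_party_camera_frames[-1]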
Targets which third party camera to modify. Third party camera ids are based on the order that they were added to the scene.
Valid values are in [0, n), where n is the number of third-party cameras currently in the scene.
The global position of where the camera will be placed.
The global rotation of how the camera will be oriented.
Changes the camera's optical field of view, in degrees. Valid values are in the domain (0, 180).
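Assuming these parameters belong to the UpdateThirdPartyCamera action, a sketch that updates the first camera with illustrative values:
controller.step(
    action="UpdateThirdPartyCamera",
    thirdPartyCameraId=0,
    position=dict(x=-1.25, y=1.5, z=-1.0),
    rotation=dict(x=45, y=90, z=0),
    fieldOfView=60,
)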