Catalog

rerun.catalog

IndexValuesLike: TypeAlias = npt.NDArray[np.int_] | npt.NDArray[np.datetime64] | pa.Int64Array module-attribute

A type alias for index values.

This can be any numpy-compatible array of integers or datetime64 values, or a pyarrow.Int64Array.
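
For illustration, a few values that satisfy this alias (variable names are hypothetical):

>>> import numpy as np
>>> import pyarrow as pa
>>> seq_values = np.arange(0, 100, 10)                              # integer sequence indices
>>> time_values = np.array([0, 1_000_000_000], dtype="datetime64[ns]")  # temporal indices
>>> arrow_values = pa.array([0, 10, 20], type=pa.int64())           # pyarrow.Int64Array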

VectorDistanceMetricLike: TypeAlias = VectorDistanceMetric | Literal['L2', 'Cosine', 'Dot', 'Hamming'] module-attribute

A type alias for vector distance metrics.

class AlreadyExistsError

Bases: Exception

Raised when trying to create a resource that already exists.

class DataframeQueryView

View into a remote dataset acting as a DataFusion table provider.

def df()

Register this view with the global DataFusion context and return a DataFrame.

def fill_latest_at()

Populate any null values in a row with the latest valid data according to the index.

RETURNS DESCRIPTION
RecordingView

A new view with the null values filled in.

The original view will not be modified.

def filter_index_values(values)

Filter the view to only include data at the provided index values.

The index values returned will be the intersection between the provided values and the original index values.

This requires index values to be a precise match. Index values in Rerun are represented as i64 sequence counts or nanoseconds. This API does not expose an interface in floating point seconds, as the numerical conversion would risk false mismatches.

PARAMETER DESCRIPTION
values

The index values to filter by.

TYPE: IndexValuesLike

RETURNS DESCRIPTION
RecordingView

A new view containing only the data at the specified index values.

The original view will not be modified.
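
A minimal sketch, assuming view is an existing DataframeQueryView over a sequence index:

>>> import numpy as np
>>> filtered = view.filter_index_values(np.array([0, 10, 20], dtype=np.int64))
>>> # `filtered` is a new view; `view` itself is unchanged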

def filter_is_not_null(column)

Filter the view to only include rows where the given component column is not null.

This corresponds to rows for index values where this component was provided to Rerun explicitly via .log() or .send_columns().

PARAMETER DESCRIPTION
column

The component column to filter by.

TYPE: AnyComponentColumn

RETURNS DESCRIPTION
RecordingView

A new view containing only the data where the specified component column is not null.

The original view will not be modified.
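
A minimal sketch; the string form of the component column selector shown here is illustrative, not guaranteed by this reference:

>>> filtered = view.filter_is_not_null("/world/points:Position3D")  # illustrative selector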

def filter_partition_id(partition_id, *args)

Filter by one or more partition IDs. If none are specified, all partition IDs are included.

def filter_range_nanos(start, end)

Filter the view to only include data between the given index values expressed as nanoseconds.

This range is inclusive and will contain both the value at the start and the value at the end.

The view must be of a temporal index type to use this method.

PARAMETER DESCRIPTION
start

The inclusive start of the range.

TYPE: int

end

The inclusive end of the range.

TYPE: int

RETURNS DESCRIPTION
RecordingView

A new view containing only the data within the specified range.

The original view will not be modified.

def filter_range_secs(start, end)

Filter the view to only include data between the given index values expressed as seconds.

This range is inclusive and will contain both the value at the start and the value at the end.

The view must be of a temporal index type to use this method.

PARAMETER DESCRIPTION
start

The inclusive start of the range.

TYPE: int

end

The inclusive end of the range.

TYPE: int

RETURNS DESCRIPTION
RecordingView

A new view containing only the data within the specified range.

The original view will not be modified.
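
A minimal sketch, assuming view has a temporal index; filter_range_nanos and filter_range_sequence follow the same pattern:

>>> ranged = view.filter_range_secs(1, 5)  # inclusive on both ends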

def filter_range_sequence(start, end)

Filter the view to only include data between the given index sequence numbers.

This range is inclusive and will contain both the value at the start and the value at the end.

The view must be of a sequential index type to use this method.

PARAMETER DESCRIPTION
start

The inclusive start of the range.

TYPE: int

end

The inclusive end of the range.

TYPE: int

RETURNS DESCRIPTION
RecordingView

A new view containing only the data within the specified range.

The original view will not be modified.

def to_arrow_reader()

Convert this view to a pyarrow.RecordBatchReader.

def using_index_values(values)

Create a new view that contains the provided index values.

Index values that exist in the original data are selected; for the remaining values, empty rows are added to the view.

The output view will always have the same number of rows as the provided values, even if those rows are empty. Use with .fill_latest_at() to populate these rows with the most recent data.

PARAMETER DESCRIPTION
values

The index values to use.

TYPE: IndexValuesLike

RETURNS DESCRIPTION
RecordingView

A new view containing the provided index values.

The original view will not be modified.
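
A minimal sketch combining this with .fill_latest_at() to resample a view at fixed index steps (assuming a sequence-indexed view):

>>> import numpy as np
>>> resampled = view.using_index_values(np.arange(0, 100, 10)).fill_latest_at()
>>> # one row per requested index value, padded with the latest earlier data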

class DatasetEntry

Bases: Entry

A dataset entry in the catalog.

catalog: CatalogClientInternal property

The catalog client that this entry belongs to.

created_at: datetime property

The entry's creation date and time.

id: EntryId property

The entry's id.

kind: EntryKind property

The entry's kind.

manifest_url: str property

Return the dataset manifest URL.

name: str property

The entry's name.

updated_at: datetime property

The entry's last updated date and time.

def arrow_schema()

Return the Arrow schema of the data contained in the dataset.

def blueprint_dataset()

The associated blueprint dataset, if any.

def blueprint_dataset_id()

The ID of the associated blueprint dataset, if any.

def create_fts_index(*, column, time_index, store_position=False, base_tokenizer='simple')

Create a full-text search index on the given column.
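
A hypothetical sketch; the column selector string and the IndexColumnSelector construction are assumptions for illustration:

>>> dataset.create_fts_index(
...     column="/docs/pages:Text",            # illustrative column selector
...     time_index=IndexColumnSelector("log_time"),
... )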

def create_vector_index(*, column, time_index, num_partitions=None, target_partition_num_rows=None, num_sub_vectors=16, distance_metric=...)

Create a vector index on the given column.

This will enable indexing and build the vector index over all existing values in the specified component column.

Results can be retrieved using the search_vector API, which will include the time-point on the indexed timeline.

Only one index can be created per component column -- executing this a second time for the same component column will replace the existing index.

PARAMETER DESCRIPTION
column

The component column to create the index on.

TYPE: AnyComponentColumn

time_index

Which timeline this index will map to.

TYPE: IndexColumnSelector

num_partitions

The number of partitions to create for the index. (Deprecated, use target_partition_num_rows instead)

TYPE: int | None DEFAULT: None

target_partition_num_rows

The target size (in number of rows) for each partition. Defaults to 4096 if neither this nor num_partitions is specified.

TYPE: int | None DEFAULT: None

num_sub_vectors

The number of sub-vectors to use when building the index.

TYPE: int DEFAULT: 16

distance_metric

The distance metric to use for the index. ("L2", "Cosine", "Dot", "Hamming")

TYPE: VectorDistanceMetricLike DEFAULT: ...
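
A hypothetical sketch; the column selector string and the IndexColumnSelector construction are assumptions, while the remaining parameters follow the descriptions above:

>>> dataset.create_vector_index(
...     column="/world/frames:Embedding",     # illustrative column selector
...     time_index=IndexColumnSelector("log_time"),
...     target_partition_num_rows=4096,
...     num_sub_vectors=16,
...     distance_metric="Cosine",
... )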

def dataframe_query_view(*, index, contents, include_semantically_empty_columns=False, include_tombstone_columns=False)

Create a DataframeQueryView of the recording according to a particular index and content specification.

The only type of index currently supported is the name of a timeline, or None (see below for details).

The view will only contain a single row for each unique value of the index that is associated with a component column that was included in the view. Component columns that are not included via the view contents will not impact the rows that make up the view. If the same entity / component pair was logged to a given index multiple times, only the most recent row will be included in the view, as determined by the row_id column. This will generally be the last value logged, as row_ids are guaranteed to be monotonically increasing when data is sent from a single process.

If None is passed as the index, the view will contain only static columns (among those specified) and no index columns. It will also contain a single row per partition.

PARAMETER DESCRIPTION
index

The index to use for the view. This is typically a timeline name. Use None to query static data only.

TYPE: str | None

contents

The content specification for the view.

This can be a single string content-expression such as: "world/cameras/**", or a dictionary specifying multiple content-expressions and a respective list of components to select within that expression such as {"world/cameras/**": ["ImageBuffer", "PinholeProjection"]}.

TYPE: ViewContentsLike

include_semantically_empty_columns

Whether to include columns that are semantically empty, by default False.

Semantically empty columns are components that are null or empty [] for every row in the recording.

TYPE: bool DEFAULT: False

include_tombstone_columns

Whether to include tombstone columns, by default False.

Tombstone columns are components used to represent clears. However, even without the clear tombstone columns, the view will still apply the clear semantics when resolving row contents.

TYPE: bool DEFAULT: False

RETURNS DESCRIPTION
DataframeQueryView

The view of the dataset.
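
A minimal sketch, reusing the content-expression from the parameter description above (the timeline name is illustrative):

>>> view = dataset.dataframe_query_view(
...     index="log_time",
...     contents={"world/cameras/**": ["ImageBuffer", "PinholeProjection"]},
... )
>>> df = view.df()  # register with DataFusion and get a queryable DataFrame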

def default_blueprint_partition_id()

The default blueprint partition ID for this dataset, if any.

def delete()

Delete this entry from the catalog.

def delete_indexes(column)

Deletes all user-defined indexes for the specified column.

def do_maintenance(optimize_indexes=False, retrain_indexes=False, compact_fragments=False, cleanup_before=None, unsafe_allow_recent_cleanup=False)

Perform maintenance tasks on the datasets.

def download_partition(partition_id)

Download a partition from the dataset.

def list_indexes()

List all user-defined indexes in this dataset.

def manifest()

Return the dataset manifest as a Datafusion table provider.

def partition_ids()

Returns a list of partition IDs for the dataset.

def partition_table()

Return the partition table as a Datafusion table provider.

def partition_url(partition_id, timeline=None, start=None, end=None)

Return the URL for the given partition.

PARAMETER DESCRIPTION
partition_id

The ID of the partition to get the URL for.

TYPE: str

timeline

The name of the timeline to display.

TYPE: str | None DEFAULT: None

start

The start time for the partition. Integer for ticks, or datetime/nanoseconds for timestamps.

TYPE: datetime | int | None DEFAULT: None

end

The end time for the partition. Integer for ticks, or datetime/nanoseconds for timestamps.

TYPE: datetime | int | None DEFAULT: None

Examples:

With ticks:
>>> start_tick, end_tick = 0, 10
>>> dataset.partition_url("some_id", "log_tick", start_tick, end_tick)

With timestamps:
>>> from datetime import datetime, timedelta
>>> start_time, end_time = datetime.now() - timedelta(seconds=4), datetime.now()
>>> dataset.partition_url("some_id", "real_time", start_time, end_time)

RETURNS DESCRIPTION
str

The URL for the given partition.

def register(recording_uri, *, recording_layer='base', timeout_secs=60)

Register a RRD URI to the dataset and wait for completion.

This method registers a single recording to the dataset and blocks until the registration is complete or the timeout is reached (in which case a TimeoutError is raised).

PARAMETER DESCRIPTION
recording_uri

The URI of the RRD to register.

TYPE: str

recording_layer

The layer to which the recording will be registered.

TYPE: str DEFAULT: 'base'

timeout_secs

The timeout after which this method raises a TimeoutError if the task is not completed.

TYPE: int DEFAULT: 60

RETURNS DESCRIPTION
partition_id

The partition ID of the registered RRD.

TYPE: str
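
A minimal sketch, with an illustrative recording URI:

>>> partition_id = dataset.register(
...     "s3://my-bucket/recordings/run_001.rrd",
...     timeout_secs=120,
... )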

def register_batch(recording_uris, *, recording_layers=[])

Register a batch of RRD URIs to the dataset and return a handle to the tasks.

This method initiates the registration of multiple recordings to the dataset, and returns the corresponding task ids in a [Tasks] object.

PARAMETER DESCRIPTION
recording_uris

The URIs of the RRDs to register.

TYPE: list[str]

recording_layers

The layers to which the recordings will be registered:

* When empty, this defaults to ["base"].
* If longer than recording_uris, recording_layers will be truncated.
* If shorter than recording_uris, recording_layers will be extended by repeating its last value. I.e. an empty recording_layers will result in "base" being repeated len(recording_uris) times.

TYPE: list[str] DEFAULT: []
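
A minimal sketch with illustrative URIs; how the returned Tasks handle is awaited is an assumption mirroring Task.wait:

>>> tasks = dataset.register_batch([
...     "s3://my-bucket/recordings/run_001.rrd",
...     "s3://my-bucket/recordings/run_002.rrd",
... ])  # both registered to the "base" layer by default
>>> tasks.wait(timeout_secs=300)  # hypothetical; mirrors Task.wait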

def register_prefix(recordings_prefix, layer_name=None)

Register all RRDs under a given prefix to the dataset and return a handle to the tasks.

A prefix is a directory-like path in an object store (e.g. an S3 bucket or ABS container). All RRDs that are recursively found under the given prefix will be registered to the dataset.

This method initiates the registration of the recordings to the dataset, and returns the corresponding task ids in a [Tasks] object.

PARAMETER DESCRIPTION
recordings_prefix

The prefix under which to register all RRDs.

TYPE: str

layer_name

The layer to which the recordings will be registered. If None, this defaults to "base".

TYPE: str | None DEFAULT: None
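
A minimal sketch, with an illustrative object-store prefix:

>>> tasks = dataset.register_prefix("s3://my-bucket/recordings/")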

def schema()

Return the schema of the data contained in the dataset.

def search_fts(query, column)

Search the dataset using a full-text search query.
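
A minimal sketch; the query string and column selector are illustrative:

>>> results = dataset.search_fts("warning", "/docs/pages:Text")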

def search_vector(query, column, top_k)

Search the dataset using a vector search query.
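
A minimal sketch; the query-vector dimensionality and column selector are illustrative:

>>> import numpy as np
>>> results = dataset.search_vector(
...     np.random.rand(128).astype(np.float32),  # query vector
...     "/world/frames:Embedding",               # illustrative column selector
...     top_k=5,
... )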

def set_default_blueprint_partition_id(partition_id)

Set the default blueprint partition ID for this dataset.

Pass None to clear the blueprint. This fails if the change cannot be applied on the remote server.

def update(*, name=None)

Update this entry's properties.

PARAMETER DESCRIPTION
name

New name for the entry.

TYPE: str | None DEFAULT: None

class CatalogClient

Client for a remote Rerun catalog server.

Note: the datafusion package is required to use this client. Initialization will fail with an error if the package is not installed.
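
A minimal connection sketch; the constructor argument and address scheme are assumptions for illustration:

>>> from rerun.catalog import CatalogClient
>>> client = CatalogClient("rerun+http://localhost:51234")  # address is illustrative
>>> client.entry_names()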

ctx: datafusion.SessionContext property

Returns a DataFusion session context for querying the catalog.

url: str property

Returns the catalog URL.

def all_entries()

Returns a list of all entries in the catalog.

def append_to_table(table_name, **named_params)

Convert Python objects into columns of data and append them to a table.

This is a convenience method to quickly turn Python objects into rows of data. You may pass in any parameter name which will be used for the column name. If you need more control over the data written to the server, you can also use [CatalogClient.write_table] to write record batches to the server.

If you wish to send multiple rows at once, then all parameters should be a list of the same length. This function will query the table to determine the schema and attempt to coerce data types as appropriate.

PARAMETER DESCRIPTION
table_name

The name of the table entry to write to. This table must already exist.

TYPE: str

named_params

Pairwise combinations of column names and the data to write. For example if you pass age=3 it will attempt to create a column named age and cast the value 3 to the appropriate type.

TYPE: Any DEFAULT: {}
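
Two sketches grounded in the description above (the table name is illustrative):

>>> # Single row: one scalar per column
>>> client.append_to_table("my_table", name="sensor_a", age=3)
>>> # Multiple rows: equal-length lists per column
>>> client.append_to_table("my_table", name=["sensor_a", "sensor_b"], age=[3, 4])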

def create_dataset(name)

Creates a new dataset with the given name.

def create_table_entry(name, schema, url)

Create and register a new table.

PARAMETER DESCRIPTION
name

The name of the table entry to create. It must be unique within all entries in the catalog. An exception will be raised if an entry with the same name already exists.

TYPE: str

schema

The schema of the table to create.

TYPE: Schema

url

The URL of the directory for where to store the Lance table.

TYPE: str
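
A minimal sketch; the entry name and storage URL are illustrative:

>>> import pyarrow as pa
>>> schema = pa.schema([("name", pa.string()), ("age", pa.int64())])
>>> table = client.create_table_entry(
...     "my_table", schema, "s3://my-bucket/tables/my_table"
... )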

def dataset_entries()

Returns a list of all dataset entries in the catalog.

def dataset_names()

Returns a list of all dataset names in the catalog.

def datasets()

Returns a DataFrame containing all dataset entries in the catalog.

def do_global_maintenance()

Perform maintenance tasks on the whole system.

def entries()

Returns a DataFrame containing all entries in the catalog.

def entry_names()

Returns a list of all entry names in the catalog.

def get_dataset(*, id=None, name=None)

Returns a dataset by its ID or name.

Note: This is currently an alias for get_dataset_entry. In the future, it will return a data-oriented dataset object instead.

def get_dataset_entry(*, id=None, name=None)

Returns a dataset by its ID or name.

def get_table(*, id=None, name=None)

Returns a table by its ID or name.

def get_table_entry(*, id=None, name=None)

Returns a table by its ID or name.

def register_table(name, url)

Registers a foreign Lance table (identified by its URL) as a new table entry with the given name.

PARAMETER DESCRIPTION
name

The name of the table entry to create. It must be unique within all entries in the catalog. An exception will be raised if an entry with the same name already exists.

TYPE: str

url

The URL of the Lance table to register.

TYPE: str

def table_entries()

Returns a list of all table entries in the catalog.

def table_names()

Returns a list of all table names in the catalog.

def tables()

Returns a DataFrame containing all table entries in the catalog.

def write_table(name, batches, insert_mode)

Writes record batches into an existing table.

PARAMETER DESCRIPTION
name

The name of the table entry to write to. This table must already exist.

TYPE: str

batches

One or more record batches to write into the table. For convenience, you can pass in a record batch, list of record batches, list of list of batches, or a [pyarrow.RecordBatchReader].

TYPE: RecordBatchReader | RecordBatch | Sequence[RecordBatch] | Sequence[Sequence[RecordBatch]]

insert_mode

Determines how rows should be added to the existing table.

TYPE: TableInsertMode
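
A minimal sketch; the TableInsertMode variant shown is an assumption, so consult the enum for the actual options:

>>> import pyarrow as pa
>>> batch = pa.RecordBatch.from_pydict({"name": ["sensor_a"], "age": [3]})
>>> client.write_table("my_table", batch, TableInsertMode.APPEND)  # variant is illustrative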

class Entry

An entry in the catalog.

catalog: CatalogClientInternal property

The catalog client that this entry belongs to.

created_at: datetime property

The entry's creation date and time.

id: EntryId property

The entry's id.

kind: EntryKind property

The entry's kind.

name: str property

The entry's name.

updated_at: datetime property

The entry's last updated date and time.

def delete()

Delete this entry from the catalog.

def update(*, name=None)

Update this entry's properties.

PARAMETER DESCRIPTION
name

New name for the entry.

TYPE: str | None DEFAULT: None

class EntryId

A unique identifier for an entry in the catalog.

def __init__(id)

Create a new EntryId from a string.

def __str__()

Return str(self).

class EntryKind

The kinds of entries that can be stored in the catalog.

def __int__()

Return int(self).

def __str__()

Return str(self).

class NotFoundError

Bases: Exception

Raised when the requested resource is not found.

class TableEntry

Bases: Entry

A table entry in the catalog.

Note: this object acts as a table provider for DataFusion.

catalog: CatalogClientInternal property

The catalog client that this entry belongs to.

created_at: datetime property

The entry's creation date and time.

id: EntryId property

The entry's id.

kind: EntryKind property

The entry's kind.

name: str property

The entry's name.

storage_url: str property

The table's storage URL.

updated_at: datetime property

The entry's last updated date and time.

def __datafusion_table_provider__()

Returns a DataFusion table provider capsule.

def delete()

Delete this entry from the catalog.

def df()

Registers the table with the DataFusion context and returns a DataFrame.

def to_arrow_reader()

Convert this table to a pyarrow.RecordBatchReader.

def update(*, name=None)

Update this entry's properties.

PARAMETER DESCRIPTION
name

New name for the entry.

TYPE: str | None DEFAULT: None

class Task

A handle on a remote task.

id: str property

The task id.

def wait(timeout_secs)

Block until the task is completed or the timeout is reached.

A TimeoutError is raised if the timeout is reached.

class VectorDistanceMetric

Bases: Enum

Which distance metric to use for a vector index.