Catalog
rerun.catalog
IndexValuesLike: TypeAlias = npt.NDArray[np.int_] | npt.NDArray[np.datetime64] | pa.Int64Array
module-attribute
A type alias for index values.
This can be any numpy-compatible array of integers, or a `pyarrow.Int64Array`.
VectorDistanceMetricLike: TypeAlias = VectorDistanceMetric | Literal['L2', 'Cosine', 'Dot', 'Hamming']
module-attribute
A type alias for vector distance metrics.
class AlreadyExistsError
Bases: Exception
Raised when trying to create a resource that already exists.
class DataframeQueryView
View into a remote dataset acting as a DataFusion table provider.
def df()
Register this view with the global DataFusion context and return a DataFrame.
def fill_latest_at()
Populate any null values in a row with the latest valid data according to the index.
| RETURNS | DESCRIPTION |
|---|---|
| `RecordingView` | A new view with the null values filled in. The original view will not be modified. |
def filter_index_values(values)
Filter the view to only include data at the provided index values.
The index values returned will be the intersection between the provided values and the original index values.
This requires index values to be a precise match. Index values in Rerun are represented as i64 sequence counts or nanoseconds. This API does not expose an interface in floating point seconds, as the numerical conversion would risk false mismatches.
| PARAMETER | DESCRIPTION |
|---|---|
| `values` | The index values to filter by. |

| RETURNS | DESCRIPTION |
|---|---|
| `RecordingView` | A new view containing only the data at the specified index values. The original view will not be modified. |
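A minimal usage sketch for this method; the `view` handle and index values are illustrative, assuming a DataframeQueryView obtained from DatasetEntry.dataframe_query_view() over a sequence timeline:

```python
import pyarrow as pa

# Assumed: `view` is a DataframeQueryView indexed on a sequence timeline (e.g. "frame").
frames = pa.array([0, 10, 20], type=pa.int64())

# Keep only the rows whose index value is exactly 0, 10, or 20.
filtered = view.filter_index_values(frames)
df = filtered.df()  # query the filtered view via DataFusion
```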
def filter_is_not_null(column)
Filter the view to only include rows where the given component column is not null.
This corresponds to rows for index values where this component was provided to Rerun explicitly
via .log() or .send_columns().
| PARAMETER | DESCRIPTION |
|---|---|
| `column` | The component column to filter by. |

| RETURNS | DESCRIPTION |
|---|---|
| `RecordingView` | A new view containing only the data where the specified component column is not null. The original view will not be modified. |
def filter_range_nanos(start, end)
Filter the view to only include data between the given index values expressed as nanoseconds.
This range is inclusive and will contain both the value at the start and the value at the end.
The view must be of a temporal index type to use this method.
| PARAMETER | DESCRIPTION |
|---|---|
| `start` | The inclusive start of the range. |
| `end` | The inclusive end of the range. |

| RETURNS | DESCRIPTION |
|---|---|
| `RecordingView` | A new view containing only the data within the specified range. The original view will not be modified. |
def filter_range_secs(start, end)
Filter the view to only include data between the given index values expressed as seconds.
This range is inclusive and will contain both the value at the start and the value at the end.
The view must be of a temporal index type to use this method.
| PARAMETER | DESCRIPTION |
|---|---|
| `start` | The inclusive start of the range. |
| `end` | The inclusive end of the range. |

| RETURNS | DESCRIPTION |
|---|---|
| `RecordingView` | A new view containing only the data within the specified range. The original view will not be modified. |
def filter_range_sequence(start, end)
Filter the view to only include data between the given index sequence numbers.
This range is inclusive and will contain both the value at the start and the value at the end.
The view must be of a sequential index type to use this method.
| PARAMETER | DESCRIPTION |
|---|---|
| `start` | The inclusive start of the range. |
| `end` | The inclusive end of the range. |

| RETURNS | DESCRIPTION |
|---|---|
| `RecordingView` | A new view containing only the data within the specified range. The original view will not be modified. |
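A short sketch contrasting the three range filters; the timeline names are illustrative, and each call assumes the view's index is of the matching type:

```python
# Assumed: `seq_view` is indexed on a sequence timeline, `time_view` on a temporal one.

# Sequential index: keep sequence numbers 100 through 200, inclusive.
by_frame = seq_view.filter_range_sequence(100, 200)

# Temporal index, expressed in seconds: keep the first five seconds.
by_secs = time_view.filter_range_secs(0.0, 5.0)

# The same temporal window, expressed exactly in nanoseconds.
by_nanos = time_view.filter_range_nanos(0, 5_000_000_000)
```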
def filter_segment_id(segment_id, *args)
Filter by one or more segment IDs. All segment IDs are included if none are specified.
def to_arrow_reader()
Convert this view to a pyarrow.RecordBatchReader.
def using_index_values(values)
Create a new view that contains the provided index values.
If they exist in the original data they are selected, otherwise empty rows are added to the view.
The output view will always have the same number of rows as the provided values, even if
those rows are empty. Use with .fill_latest_at()
to populate these rows with the most recent data.
| PARAMETER | DESCRIPTION |
|---|---|
| `values` | The index values to use. |

| RETURNS | DESCRIPTION |
|---|---|
| `RecordingView` | A new view containing the provided index values. The original view will not be modified. |
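A sketch of resampling with using_index_values() followed by fill_latest_at(); the view and the nanosecond timestamps are illustrative:

```python
import numpy as np

# Assumed: `view` is a DataframeQueryView indexed on a temporal timeline.
# Resample at 10 Hz over the first second (index values in nanoseconds).
timestamps = np.arange(0, 1_000_000_000, 100_000_000, dtype=np.int64)

# Rows for timestamps absent from the data start out empty...
resampled = view.using_index_values(timestamps)

# ...and are then populated with the latest data at or before each timestamp.
df = resampled.fill_latest_at().df()
```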
class DatasetEntry
Bases: Entry[DatasetEntryInternal]
A dataset entry in the catalog.
catalog: CatalogClient
property
The catalog client that this entry belongs to.
created_at: datetime
property
The entry's creation date and time.
id: EntryId
property
The entry's id.
kind: EntryKind
property
The entry's kind.
manifest_url: str
property
Return the dataset manifest URL.
name: str
property
The entry's name.
updated_at: datetime
property
The entry's last updated date and time.
def arrow_schema()
Return the Arrow schema of the data contained in the dataset.
def blueprint_dataset()
The associated blueprint dataset, if any.
def blueprint_dataset_id()
The ID of the associated blueprint dataset, if any.
def create_fts_index(*, column, time_index, store_position=False, base_tokenizer='simple')
Create a full-text search index on the given column.
def create_vector_index(*, column, time_index, target_partition_num_rows=None, num_sub_vectors=16, distance_metric='Cosine')
Create a vector index on the given column.
This will enable indexing and build the vector index over all existing values in the specified component column.
Results can be retrieved using the search_vector API, which will include
the time-point on the indexed timeline.
Only one index can be created per component column; executing this a second time for the same component column will replace the existing index.
| PARAMETER | DESCRIPTION |
|---|---|
| `column` | The component column to create the index on. |
| `time_index` | Which timeline this index will map to. |
| `target_partition_num_rows` | The target size (in number of rows) for each partition. The underlying indexer (Lance) will pick a default when no value is specified (today this is 8192). It will also cap the maximum number of partitions independently of this setting (currently 4096). |
| `num_sub_vectors` | The number of sub-vectors to use when building the index. |
| `distance_metric` | The distance metric to use for the index ("L2", "Cosine", "Dot", or "Hamming"). |
def dataframe_query_view(*, index, contents, include_semantically_empty_columns=False, include_tombstone_columns=False)
Create a DataframeQueryView of the recording according to a particular index and content specification.
The only type of index currently supported is the name of a timeline, or None (see below
for details).
The view will only contain a single row for each unique value of the index
that is associated with a component column that was included in the view.
Component columns that are not included via the view contents will not
impact the rows that make up the view. If the same entity / component pair
was logged to a given index multiple times, only the most recent row will be
included in the view, as determined by the row_id column. This will
generally be the last value logged, as row_ids are guaranteed to be
monotonically increasing when data is sent from a single process.
If None is passed as the index, the view will contain only static columns (among those
specified) and no index columns. It will also contain a single row per segment.
| PARAMETER | DESCRIPTION |
|---|---|
| `index` | The index to use for the view. This is typically a timeline name. Use `None` to create a view of static data only (see above). |
| `contents` | The content specification for the view. This can be a single string content-expression (such as an entity path expression) or a mapping of entity paths to component selections. |
| `include_semantically_empty_columns` | Whether to include columns that are semantically empty. Defaults to `False`. Semantically empty columns are components that are null or empty (`[]`) for every row. |
| `include_tombstone_columns` | Whether to include tombstone columns. Defaults to `False`. Tombstone columns are components used to represent clears. However, even without the clear tombstone columns, the view will still apply the clear semantics when resolving row contents. |
| RETURNS | DESCRIPTION |
|---|---|
| `DataframeQueryView` | The view of the dataset. |
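An end-to-end sketch from client to DataFrame. The server URL, dataset name, timeline, and content expression are illustrative, and constructing CatalogClient directly from a URL is an assumption:

```python
from rerun.catalog import CatalogClient

client = CatalogClient("rerun+http://localhost:51234")  # illustrative URL
dataset = client.get_dataset(name="my_dataset")         # illustrative name

view = dataset.dataframe_query_view(
    index="log_time",            # illustrative timeline name
    contents="/world/robot/**",  # illustrative content expression
)

# Fill nulls from earlier index values, then query via DataFusion.
df = view.fill_latest_at().df()
print(df.limit(5))
```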
def default_blueprint_partition_id()
Deprecated: Use default_blueprint_segment_id() instead.
The default blueprint partition ID for this dataset, if any.
def default_blueprint_segment_id()
The default blueprint segment ID for this dataset, if any.
def delete()
Delete this entry from the catalog.
def delete_indexes(column)
Deletes all user-defined indexes for the specified column.
def do_maintenance(optimize_indexes=False, retrain_indexes=False, compact_fragments=False, cleanup_before=None, unsafe_allow_recent_cleanup=False)
Perform maintenance tasks on the dataset.
def download_partition(partition_id)
Deprecated: Use download_segment() instead.
Download a partition from the dataset.
def download_segment(segment_id)
Download a segment from the dataset.
def list_indexes()
List all user-defined indexes in this dataset.
def manifest()
Return the dataset manifest as a DataFusion table provider.
def partition_ids()
Deprecated: Use segment_ids() instead.
Returns a list of partition IDs for the dataset.
def partition_table()
Deprecated: Use segment_table() instead.
Return the partition table as a DataFusion table provider.
def partition_url(partition_id, timeline=None, start=None, end=None)
Deprecated: Use segment_url() instead.
Return the URL for the given partition.
def register(recording_uri, *, recording_layer='base', timeout_secs=60)
Register a RRD URI to the dataset and wait for completion.
This method registers a single recording to the dataset and blocks until the registration is
complete, or after a timeout (in which case, a TimeoutError is raised).
| PARAMETER | DESCRIPTION |
|---|---|
| `recording_uri` | The URI of the RRD to register. |
| `recording_layer` | The layer to which the recording will be registered. |
| `timeout_secs` | The timeout (in seconds) after which this method raises a `TimeoutError`. |

| RETURNS | DESCRIPTION |
|---|---|
| `segment_id` | The segment ID of the registered RRD. |
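A minimal registration sketch; the `dataset` handle and the RRD URI are illustrative:

```python
# Assumed: `dataset` is a DatasetEntry; the URI is illustrative.
segment_id = dataset.register(
    "s3://my-bucket/recordings/run_001.rrd",
    recording_layer="base",
    timeout_secs=60,  # raises TimeoutError if registration takes longer
)
print(f"registered as segment {segment_id}")
```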
def register_batch(recording_uris, *, recording_layers=None)
Register a batch of RRD URIs to the dataset and return a handle to the tasks.
This method initiates the registration of multiple recordings to the dataset, and returns the corresponding task ids in a `Tasks` object.
| PARAMETER | DESCRIPTION |
|---|---|
| `recording_uris` | The URIs of the RRDs to register. |
| `recording_layers` | The layers to which the recordings will be registered. When empty, this defaults to the `base` layer for all recordings. |
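A batch-registration sketch; the URIs are illustrative, and the blocking `wait()` on the returned `Tasks` handle is an assumption mirroring Task.wait() documented at the end of this page:

```python
# Assumed: `dataset` is a DatasetEntry; the URIs are illustrative.
uris = [
    "s3://my-bucket/recordings/run_001.rrd",
    "s3://my-bucket/recordings/run_002.rrd",
]

# Registration is initiated asynchronously; a Tasks handle is returned.
tasks = dataset.register_batch(uris)

# Assumption: Tasks exposes a blocking wait analogous to Task.wait().
tasks.wait(timeout_secs=120)
```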
def register_prefix(recordings_prefix, layer_name=None)
Register all RRDs under a given prefix to the dataset and return a handle to the tasks.
A prefix is a directory-like path in an object store (e.g. an S3 bucket or ABS container). All RRDs that are recursively found under the given prefix will be registered to the dataset.
This method initiates the registration of the recordings to the dataset, and returns the corresponding task ids in a `Tasks` object.
| PARAMETER | DESCRIPTION |
|---|---|
| `recordings_prefix` | The prefix under which to register all RRDs. |
| `layer_name` | The layer to which the recordings will be registered. If `None`, the default (`base`) layer is used. |
def schema()
Return the schema of the data contained in the dataset.
def search_fts(query, column)
Search the dataset using a full-text search query.
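A hedged full-text search sketch; the component column and timeline names are hypothetical, and the shape of the returned results is not specified here:

```python
# Assumed: `dataset` is a DatasetEntry; column and timeline names are hypothetical.
dataset.create_fts_index(
    column="/notes:Text",   # hypothetical component column
    time_index="log_time",  # illustrative timeline
)

# Results include the time-point on the indexed timeline.
results = dataset.search_fts("red traffic light", "/notes:Text")
```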
def search_vector(query, column, top_k)
Search the dataset using a vector search query.
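And a matching vector-search sketch under the same assumptions; the embedding column and query dimensionality are illustrative:

```python
import numpy as np

# Assumed: `dataset` is a DatasetEntry; the column holds fixed-size embeddings.
dataset.create_vector_index(
    column="/frames:Embedding",  # hypothetical component column
    time_index="log_time",       # illustrative timeline
    num_sub_vectors=16,
    distance_metric="Cosine",
)

query = np.random.rand(768).astype(np.float32)  # illustrative dimensionality
results = dataset.search_vector(query, "/frames:Embedding", top_k=5)
```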
def segment_ids()
Returns a list of segment IDs for the dataset.
def segment_table()
Return the segment table as a DataFusion table provider.
def segment_url(segment_id, timeline=None, start=None, end=None)
Return the URL for the given segment.
| PARAMETER | DESCRIPTION |
|---|---|
| `segment_id` | The ID of the segment to get the URL for. |
| `timeline` | The name of the timeline to display. |
| `start` | The start time for the segment. Integer for ticks, or datetime/nanoseconds for timestamps. |
| `end` | The end time for the segment. Integer for ticks, or datetime/nanoseconds for timestamps. |
Examples:
With ticks
>>> start_tick, end_tick = 0, 10
>>> dataset.segment_url("some_id", "log_tick", start_tick, end_tick)
With timestamps
>>> from datetime import datetime, timedelta
>>> start_time, end_time = datetime.now() - timedelta(seconds=4), datetime.now()
>>> dataset.segment_url("some_id", "real_time", start_time, end_time)
| RETURNS | DESCRIPTION |
|---|---|
| `str` | The URL for the given segment. |
def set_default_blueprint_partition_id(partition_id)
Deprecated: Use set_default_blueprint_segment_id() instead.
Set the default blueprint partition ID for this dataset.
def set_default_blueprint_segment_id(segment_id)
Set the default blueprint segment ID for this dataset.
Pass `None` to clear the blueprint. The call fails if the change cannot be made on the remote server.
def update(*, name=None)
Update this entry's properties.
| PARAMETER | DESCRIPTION |
|---|---|
| `name` | New name for the entry. |
class CatalogClient
Client for a remote Rerun catalog server.
Note: the datafusion package is required to use this client. Initialization will fail with an error if the package
is not installed.
ctx: datafusion.SessionContext
property
Returns a DataFusion session context for querying the catalog.
url: str
property
Returns the catalog URL.
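A hedged discovery sketch; the server URL is illustrative, and constructing the client directly from a URL is an assumption:

```python
from rerun.catalog import CatalogClient

client = CatalogClient("rerun+http://localhost:51234")  # illustrative URL

print(client.url)            # the catalog URL
print(client.entry_names())  # all entry names, hidden ones excluded by default
print(client.dataset_names(include_hidden=True))  # include blueprint datasets

ctx = client.ctx  # DataFusion session context for SQL over catalog entries
```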
def all_entries()
Deprecated: Use entries() instead.
Returns a list of all entries in the catalog.
def append_to_table(table_name, batches=None, **named_params)
Deprecated: Use TableEntry.append() instead.
Append record batches to an existing table.
| PARAMETER | DESCRIPTION |
|---|---|
| `table_name` | The name of the table entry to write to. This table must already exist. |
| `batches` | One or more record batches to write into the table. |
| `**named_params` | Named parameters to write to the table as columns. |
def create_dataset(name)
Creates a new dataset with the given name.
def create_table(name, schema, url)
Create and register a new table.
| PARAMETER | DESCRIPTION |
|---|---|
| `name` | The name of the table entry to create. It must be unique within all entries in the catalog. An exception will be raised if an entry with the same name already exists. |
| `schema` | The schema of the table to create. |
| `url` | The URL of the directory where the Lance table will be stored. |
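A table-creation sketch; the entry name, schema, and Lance storage URL are illustrative, and `client` is assumed to be a connected CatalogClient:

```python
import pyarrow as pa

# Illustrative schema for the new table.
schema = pa.schema([
    pa.field("id", pa.int64()),
    pa.field("label", pa.string()),
])

# The name must be unique among catalog entries; the URL is illustrative.
client.create_table("my_labels", schema, "s3://my-bucket/tables/my_labels")

table = client.get_table(name="my_labels")
table.append(id=[1, 2], label=["car", "bike"])  # named parameters become columns
```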
def create_table_entry(name, schema, url)
Deprecated: Use create_table() instead.
Create and register a new table.
def dataset_entries()
Deprecated: Use datasets() instead.
Returns a list of all dataset entries in the catalog.
def dataset_names(*, include_hidden=False)
Returns a list of all dataset names in the catalog.
| PARAMETER | DESCRIPTION |
|---|---|
| `include_hidden` | If True, include blueprint datasets. |
def datasets(*, include_hidden=False)
Returns a list of all dataset entries in the catalog.
| PARAMETER | DESCRIPTION |
|---|---|
| `include_hidden` | If True, include blueprint datasets. |
def do_global_maintenance()
Perform maintenance tasks on the whole system.
def entries(*, include_hidden=False)
Returns a list of all entries in the catalog.
| PARAMETER | DESCRIPTION |
|---|---|
| `include_hidden` | If True, include hidden entries (blueprint datasets and system tables). |
def entry_names(*, include_hidden=False)
Returns a list of all entry names in the catalog.
| PARAMETER | DESCRIPTION |
|---|---|
| `include_hidden` | If True, include hidden entries (blueprint datasets and system tables). |
def get_dataset(name=None, *, id=None)
Returns a dataset by its ID or name.
Exactly one of id or name must be provided.
| PARAMETER | DESCRIPTION |
|---|---|
| `name` | The name of the dataset. |
| `id` | The unique identifier of the dataset. Can be an `EntryId` or its string representation. |
def get_dataset_entry(*, id=None, name=None)
Deprecated: Use get_dataset() instead.
Returns a dataset by its ID or name.
def get_table(name=None, *, id=None)
Returns a table by its ID or name.
def get_table_entry(*, id=None, name=None)
Deprecated: Use get_table() instead.
Returns a table by its ID or name.
def register_table(name, url)
Registers a foreign Lance table (identified by its URL) as a new table entry with the given name.
| PARAMETER | DESCRIPTION |
|---|---|
| `name` | The name of the table entry to create. It must be unique within all entries in the catalog. An exception will be raised if an entry with the same name already exists. |
| `url` | The URL of the Lance table to register. |
def table_entries()
Deprecated: Use tables() instead.
Returns a list of all table entries in the catalog.
def table_names(*, include_hidden=False)
Returns a list of all table names in the catalog.
| PARAMETER | DESCRIPTION |
|---|---|
| `include_hidden` | If True, include system tables. |
def tables(*, include_hidden=False)
Returns a list of all table entries in the catalog.
| PARAMETER | DESCRIPTION |
|---|---|
| `include_hidden` | If True, include system tables. |
def update_table(table_name, batches=None, **named_params)
Deprecated: Use TableEntry.upsert() instead.
Upsert record batches to an existing table.
| PARAMETER | DESCRIPTION |
|---|---|
| `table_name` | The name of the table entry to write to. This table must already exist. |
| `batches` | One or more record batches to write into the table. |
| `**named_params` | Named parameters to write to the table as columns. |
def write_table(name, batches, insert_mode)
Deprecated: Use TableEntry.append(), overwrite(), or upsert() instead.
Writes record batches into an existing table.
| PARAMETER | DESCRIPTION |
|---|---|
| `name` | The name of the table entry to write to. This table must already exist. |
| `batches` | One or more record batches to write into the table. For convenience, you can pass in a record batch, a list of record batches, a list of lists of record batches, or a `RecordBatchReader`. |
| `insert_mode` | Determines how rows should be added to the existing table. |
class Entry
Bases: ABC, Generic[InternalEntryT]
An entry in the catalog.
catalog: CatalogClient
property
The catalog client that this entry belongs to.
created_at: datetime
property
The entry's creation date and time.
id: EntryId
property
The entry's id.
kind: EntryKind
property
The entry's kind.
name: str
property
The entry's name.
updated_at: datetime
property
The entry's last updated date and time.
def delete()
Delete this entry from the catalog.
def update(*, name=None)
Update this entry's properties.
| PARAMETER | DESCRIPTION |
|---|---|
| `name` | New name for the entry. |
class EntryId
A unique identifier for an entry in the catalog.
def __init__(id)
Create a new EntryId from a string.
def __str__()
Return str(self).
class EntryKind
The kinds of entries that can be stored in the catalog.
def __int__()
Return int(self).
def __str__()
Return str(self).
class NotFoundError
Bases: Exception
Raised when the requested resource is not found.
class TableEntry
Bases: Entry[TableEntryInternal]
A table entry in the catalog.
Note: this object acts as a table provider for DataFusion.
catalog: CatalogClient
property
The catalog client that this entry belongs to.
created_at: datetime
property
The entry's creation date and time.
id: EntryId
property
The entry's id.
kind: EntryKind
property
The entry's kind.
name: str
property
The entry's name.
storage_url: str
property
The table's storage URL.
updated_at: datetime
property
The entry's last updated date and time.
def __datafusion_table_provider__()
Returns a DataFusion table provider capsule.
def append(batches=None, **named_params)
Append to the Table.
| PARAMETER | DESCRIPTION |
|---|---|
| `batches` | Arrow data to append to the table. Can be a `RecordBatchReader`, a single `RecordBatch`, a list of `RecordBatch`es, or a list of lists of `RecordBatch`es. |
| `**named_params` | Each named parameter corresponds to a column in the table. |
def arrow_schema()
Returns the Arrow schema of the table.
def delete()
Delete this entry from the catalog.
def df()
Registers the table with the DataFusion context and returns a DataFrame.
def overwrite(batches=None, **named_params)
Overwrite the Table with new data.
| PARAMETER | DESCRIPTION |
|---|---|
| `batches` | Arrow data to overwrite the table with. Can be a `RecordBatchReader`, a single `RecordBatch`, a list of `RecordBatch`es, or a list of lists of `RecordBatch`es. |
| `**named_params` | Each named parameter corresponds to a column in the table. |
def to_arrow_reader()
Convert this table to a pyarrow.RecordBatchReader.
def update(*, name=None)
Update this entry's properties.
| PARAMETER | DESCRIPTION |
|---|---|
| `name` | New name for the entry. |
def upsert(batches=None, **named_params)
Upsert data into the Table.
To use upsert, the table must contain a column whose metadata includes `{"rerun:is_table_index": "true"}`.
Any row with a matching index value will have the new data inserted; any row without a matching index value will be appended as a new row.
| PARAMETER | DESCRIPTION |
|---|---|
| `batches` | Arrow data to upsert into the table. Can be a `RecordBatchReader`, a single `RecordBatch`, a list of `RecordBatch`es, or a list of lists of `RecordBatch`es. |
| `**named_params` | Each named parameter corresponds to a column in the table. |
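An upsert sketch; the table name, schema, and storage URL are illustrative, and the index column is flagged through Arrow field metadata as described above:

```python
import pyarrow as pa

# Assumed: `client` is a connected CatalogClient.
# Illustrative schema: "id" is the index column, flagged via field metadata.
schema = pa.schema([
    pa.field("id", pa.int64(), metadata={"rerun:is_table_index": "true"}),
    pa.field("label", pa.string()),
])
client.create_table("labels", schema, "s3://my-bucket/tables/labels")  # illustrative URL

table = client.get_table(name="labels")
table.append(id=[1, 2], label=["car", "bike"])

# id=2 matches an existing row and is updated; id=3 is appended as a new row.
table.upsert(id=[2, 3], label=["truck", "bus"])
```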
class Task
A handle on a remote task.
id: str
property
The task id.
def wait(timeout_secs)
Block until the task is completed or the timeout is reached.
A TimeoutError is raised if the timeout is reached.
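A small sketch of blocking on a task; obtaining the `task` handle from a registration call is assumed:

```python
# Assumed: `task` is a Task handle obtained from a registration call.
try:
    task.wait(timeout_secs=60)  # blocks until completion
    print(f"task {task.id} completed")
except TimeoutError:
    print(f"task {task.id} did not finish within 60 seconds")
```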