Catalog
rerun.catalog
IndexValuesLike: TypeAlias = npt.NDArray[np.int_] | npt.NDArray[np.datetime64] | pa.Int64Array
module-attribute
A type alias for index values.
This can be any numpy-compatible array of integers or datetime64 values, or a `pyarrow.Int64Array`.
VectorDistanceMetricLike: TypeAlias = VectorDistanceMetric | Literal['L2', 'Cosine', 'Dot', 'Hamming']
module-attribute
A type alias for vector distance metrics.
class AlreadyExistsError
Bases: Exception
Raised when trying to create a resource that already exists.
class DataframeQueryView
View into a remote dataset acting as a DataFusion table provider.
def df()
Register this view with the global DataFusion context and return a DataFrame.
def fill_latest_at()
Populate any null values in a row with the latest valid data according to the index.
| RETURNS | DESCRIPTION |
|---|---|
| `RecordingView` | A new view with the null values filled in. The original view will not be modified. |
def filter_index_values(values)
Filter the view to only include data at the provided index values.
The index values returned will be the intersection between the provided values and the original index values.
This requires index values to be a precise match. Index values in Rerun are represented as i64 sequence counts or nanoseconds. This API does not expose an interface in floating point seconds, as the numerical conversion would risk false mismatches.
| PARAMETER | DESCRIPTION |
|---|---|
| `values` | The index values to filter by. |

| RETURNS | DESCRIPTION |
|---|---|
| `RecordingView` | A new view containing only the data at the specified index values. The original view will not be modified. |
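A minimal sketch of filtering on explicit index values; the view, timeline name, and index values are illustrative:

```python
import numpy as np

# Hypothetical view over a sequence timeline named "frame".
view = dataset.dataframe_query_view(index="frame", contents="/**")

# Keep only frames 0, 10 and 20; values missing from the original index are simply dropped.
filtered = view.filter_index_values(np.array([0, 10, 20], dtype=np.int64))
```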
def filter_is_not_null(column)
Filter the view to only include rows where the given component column is not null.
This corresponds to rows for index values where this component was provided to Rerun explicitly
via .log() or .send_columns().
| PARAMETER | DESCRIPTION |
|---|---|
| `column` | The component column to filter by. |

| RETURNS | DESCRIPTION |
|---|---|
| `RecordingView` | A new view containing only the data where the specified component column is not null. The original view will not be modified. |
def filter_partition_id(partition_id, *args)
Filter by one or more partition ids. All partition ids are included if not specified.
def filter_range_nanos(start, end)
Filter the view to only include data between the given index values expressed as nanoseconds.
This range is inclusive and will contain both the value at the start and the value at the end.
The view must be of a temporal index type to use this method.
| PARAMETER | DESCRIPTION |
|---|---|
| `start` | The inclusive start of the range. |
| `end` | The inclusive end of the range. |

| RETURNS | DESCRIPTION |
|---|---|
| `RecordingView` | A new view containing only the data within the specified range. The original view will not be modified. |
def filter_range_secs(start, end)
Filter the view to only include data between the given index values expressed as seconds.
This range is inclusive and will contain both the value at the start and the value at the end.
The view must be of a temporal index type to use this method.
| PARAMETER | DESCRIPTION |
|---|---|
| `start` | The inclusive start of the range. |
| `end` | The inclusive end of the range. |

| RETURNS | DESCRIPTION |
|---|---|
| `RecordingView` | A new view containing only the data within the specified range. The original view will not be modified. |
def filter_range_sequence(start, end)
Filter the view to only include data between the given index sequence numbers.
This range is inclusive and will contain both the value at the start and the value at the end.
The view must be of a sequential index type to use this method.
| PARAMETER | DESCRIPTION |
|---|---|
| `start` | The inclusive start of the range. |
| `end` | The inclusive end of the range. |

| RETURNS | DESCRIPTION |
|---|---|
| `RecordingView` | A new view containing only the data within the specified range. The original view will not be modified. |
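A short sketch contrasting the three range filters; the views, timeline types, and bounds are illustrative:

```python
# "frame_view" is assumed to be indexed on a sequence timeline,
# "time_view" on a temporal timeline; all bounds are inclusive.
frames = frame_view.filter_range_sequence(100, 200)                    # frames 100..=200
seconds = time_view.filter_range_secs(10.0, 15.0)                      # seconds 10..=15
nanos = time_view.filter_range_nanos(10_000_000_000, 15_000_000_000)   # the same window in ns
```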
def to_arrow_reader()
Convert this view to a pyarrow.RecordBatchReader.
def using_index_values(values)
Create a new view that contains the provided index values.
If they exist in the original data they are selected, otherwise empty rows are added to the view.
The output view will always have the same number of rows as the provided values, even if
those rows are empty. Use with .fill_latest_at()
to populate these rows with the most recent data.
| PARAMETER | DESCRIPTION |
|---|---|
| `values` | The index values to use. |

| RETURNS | DESCRIPTION |
|---|---|
| `RecordingView` | A new view containing the provided index values. The original view will not be modified. |
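A sketch of resampling a view onto a fixed grid of index values and filling the resulting empty rows; the sampling grid is illustrative:

```python
import numpy as np

# Sample every 100 ms over the first 10 s of a nanosecond-indexed view, then
# fill each (possibly empty) row with the latest data at or before that time.
sample_times = np.arange(0, 10_000_000_000, 100_000_000, dtype=np.int64)
resampled = view.using_index_values(sample_times).fill_latest_at()
```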
class DatasetEntry
Bases: Entry
A dataset entry in the catalog.
catalog: CatalogClientInternal
property
The catalog client that this entry belongs to.
created_at: datetime
property
The entry's creation date and time.
id: EntryId
property
The entry's id.
kind: EntryKind
property
The entry's kind.
manifest_url: str
property
Return the dataset manifest URL.
name: str
property
The entry's name.
updated_at: datetime
property
The entry's last updated date and time.
def arrow_schema()
Return the Arrow schema of the data contained in the dataset.
def blueprint_dataset()
The associated blueprint dataset, if any.
def blueprint_dataset_id()
The ID of the associated blueprint dataset, if any.
def create_fts_index(*, column, time_index, store_position=False, base_tokenizer='simple')
Create a full-text search index on the given column.
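A sketch of building and then querying a full-text index; the component-column selector and timeline name are assumptions, not part of this reference:

```python
# Build the index over an assumed text component column, mapped to "log_time".
dataset.create_fts_index(
    column="/notes:TextLog:text",  # hypothetical component-column selector
    time_index="log_time",
)

# Query the indexed column.
hits = dataset.search_fts("emergency stop", column="/notes:TextLog:text")
```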
def create_vector_index(*, column, time_index, num_partitions=None, target_partition_num_rows=None, num_sub_vectors=16, distance_metric=...)
Create a vector index on the given column.
This will enable indexing and build the vector index over all existing values in the specified component column.
Results can be retrieved using the search_vector API, which will include
the time-point on the indexed timeline.
Only one index can be created per component column -- executing this a second time for the same component column will replace the existing index.
| PARAMETER | DESCRIPTION |
|---|---|
| `column` | The component column to create the index on. |
| `time_index` | Which timeline this index will map to. |
| `num_partitions` | The number of partitions to create for the index. (Deprecated, use `target_partition_num_rows` instead.) |
| `target_partition_num_rows` | The target size (in number of rows) for each partition. Defaults to 4096 if neither this nor `num_partitions` is specified. |
| `num_sub_vectors` | The number of sub-vectors to use when building the index. |
| `distance_metric` | The distance metric to use for the index (`"L2"`, `"Cosine"`, `"Dot"`, or `"Hamming"`). |
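A sketch of building a vector index and querying it with `search_vector`; the column name, embedding size, and metric are illustrative:

```python
import numpy as np

# Building the index replaces any existing index on the same component column.
dataset.create_vector_index(
    column="/camera/image:Embedding",  # hypothetical component-column selector
    time_index="log_time",
    target_partition_num_rows=4096,
    num_sub_vectors=16,
    distance_metric="Cosine",
)

# Query with an embedding of the same dimensionality as the indexed vectors.
query = np.random.rand(768).astype(np.float32)
results = dataset.search_vector(query, column="/camera/image:Embedding", top_k=10)
```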
def dataframe_query_view(*, index, contents, include_semantically_empty_columns=False, include_tombstone_columns=False)
Create a DataframeQueryView of the recording according to a particular index and content specification.
The only type of index currently supported is the name of a timeline, or None (see below
for details).
The view will only contain a single row for each unique value of the index
that is associated with a component column that was included in the view.
Component columns that are not included via the view contents will not
impact the rows that make up the view. If the same entity / component pair
was logged to a given index multiple times, only the most recent row will be
included in the view, as determined by the row_id column. This will
generally be the last value logged, as row_ids are guaranteed to be
monotonically increasing when data is sent from a single process.
If None is passed as the index, the view will contain only static columns (among those
specified) and no index columns. It will also contain a single row per partition.
| PARAMETER | DESCRIPTION |
|---|---|
| `index` | The index to use for the view. This is typically a timeline name. Use `None` to include only static columns (see above). |
| `contents` | The content specification for the view. This can be a single string content-expression (an entity path expression such as `"/world/**"`), or a dictionary mapping entity path expressions to lists of component names. |
| `include_semantically_empty_columns` | Whether to include columns that are semantically empty; `False` by default. Semantically empty columns are components that are `null` or empty (`[]`) for every row. |
| `include_tombstone_columns` | Whether to include tombstone columns; `False` by default. Tombstone columns are components used to represent clears. However, even without the clear tombstone columns, the view will still apply the clear semantics when resolving row contents. |

| RETURNS | DESCRIPTION |
|---|---|
| `DataframeQueryView` | The view of the dataset. |
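A sketch of a typical query chain; the timeline, entity paths, partition ID, and component column are illustrative:

```python
# Build a view over a single entity subtree, indexed on "log_time".
view = dataset.dataframe_query_view(
    index="log_time",
    contents={"/world/robot/**": ["Position3D"]},
)

# Narrow it down, fill gaps with the latest values, and materialize a DataFrame.
df = (
    view.filter_partition_id("some_partition_id")
    .filter_is_not_null("/world/robot:Position3D")  # hypothetical component column
    .fill_latest_at()
    .df()
)

# Or stream the same view as Arrow record batches.
reader = view.to_arrow_reader()
```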
def default_blueprint_partition_id()
The default blueprint partition ID for this dataset, if any.
def delete()
Delete this entry from the catalog.
def delete_indexes(column)
Deletes all user-defined indexes for the specified column.
def do_maintenance(optimize_indexes=False, retrain_indexes=False, compact_fragments=False, cleanup_before=None, unsafe_allow_recent_cleanup=False)
Perform maintenance tasks on the datasets.
def download_partition(partition_id)
Download a partition from the dataset.
def list_indexes()
List all user-defined indexes in this dataset.
def manifest()
Return the dataset manifest as a DataFusion table provider.
def partition_ids()
Returns a list of partition IDs for the dataset.
def partition_table()
Return the partition table as a DataFusion table provider.
def partition_url(partition_id, timeline=None, start=None, end=None)
Return the URL for the given partition.
| PARAMETER | DESCRIPTION |
|---|---|
| `partition_id` | The ID of the partition to get the URL for. |
| `timeline` | The name of the timeline to display. |
| `start` | The start time for the partition. Integer for ticks, or datetime/nanoseconds for timestamps. |
| `end` | The end time for the partition. Integer for ticks, or datetime/nanoseconds for timestamps. |
Examples:
With ticks
>>> start_tick, end_tick = 0, 10
>>> dataset.partition_url("some_id", "log_tick", start_tick, end_tick)
With timestamps
>>> from datetime import datetime, timedelta
>>> start_time, end_time = datetime.now() - timedelta(seconds=4), datetime.now()
>>> dataset.partition_url("some_id", "real_time", start_time, end_time)
| RETURNS | DESCRIPTION |
|---|---|
| `str` | The URL for the given partition. |
def register(recording_uri, *, recording_layer='base', timeout_secs=60)
Register a RRD URI to the dataset and wait for completion.
This method registers a single recording to the dataset and blocks until the registration is
complete, or after a timeout (in which case, a TimeoutError is raised).
| PARAMETER | DESCRIPTION |
|---|---|
| `recording_uri` | The URI of the RRD to register. |
| `recording_layer` | The layer to which the recording will be registered. |
| `timeout_secs` | The timeout after which this method raises a `TimeoutError`. |

| RETURNS | DESCRIPTION |
|---|---|
| `partition_id` | The partition ID of the registered RRD. |
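A sketch of registering a single recording; the URI is illustrative:

```python
# Blocks until registration completes, or raises TimeoutError after 60 s.
partition_id = dataset.register(
    "s3://my-bucket/recordings/run_001.rrd",
    recording_layer="base",
    timeout_secs=60,
)
print(f"registered as partition {partition_id}")
```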
def register_batch(recording_uris, *, recording_layers=[])
Register a batch of RRD URIs to the dataset and return a handle to the tasks.
This method initiates the registration of multiple recordings to the dataset, and returns the corresponding task IDs in a `Tasks` object.
| PARAMETER | DESCRIPTION |
|---|---|
| `recording_uris` | The URIs of the RRDs to register. |
| `recording_layers` | The layers to which the recordings will be registered. When empty, this defaults to the `"base"` layer for all recordings. |
def register_prefix(recordings_prefix, layer_name=None)
Register all RRDs under a given prefix to the dataset and return a handle to the tasks.
A prefix is a directory-like path in an object store (e.g. an S3 bucket or ABS container). All RRDs that are recursively found under the given prefix will be registered to the dataset.
This method initiates the registration of the recordings to the dataset, and returns the corresponding task IDs in a `Tasks` object.
| PARAMETER | DESCRIPTION |
|---|---|
| `recordings_prefix` | The prefix under which to register all RRDs. |
| `layer_name` | The layer to which the recordings will be registered. If `None`, the recordings are registered to the default (`"base"`) layer. |
def schema()
Return the schema of the data contained in the dataset.
def search_fts(query, column)
Search the dataset using a full-text search query.
def search_vector(query, column, top_k)
Search the dataset using a vector search query.
def set_default_blueprint_partition_id(partition_id)
Set the default blueprint partition ID for this dataset.
Pass None to clear the blueprint. This fails if the change cannot be made on the remote server.
def update(*, name=None)
Update this entry's properties.
| PARAMETER | DESCRIPTION |
|---|---|
| `name` | New name for the entry. |
class CatalogClient
Client for a remote Rerun catalog server.
Note: the datafusion package is required to use this client. Initialization will fail with an error if the package
is not installed.
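A minimal connection sketch, assuming a catalog server reachable at the given address (the URL and dataset name are illustrative):

```python
from rerun.catalog import CatalogClient

client = CatalogClient("rerun+http://localhost:51234")  # address of the catalog server

print(client.entry_names())                      # names of all entries in the catalog
dataset = client.get_dataset(name="my_dataset")  # look up a dataset by name
```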
ctx: datafusion.SessionContext
property
Returns a DataFusion session context for querying the catalog.
url: str
property
Returns the catalog URL.
def all_entries()
Returns a list of all entries in the catalog.
def append_to_table(table_name, **named_params)
Convert Python objects into columns of data and append them to a table.
This is a convenience method to quickly turn Python objects into rows
of data. You may pass in any parameter name which will be used for the
column name. If you need more control over the data written to the
server, you can also use [CatalogClient.write_table] to write record
batches to the server.
If you wish to send multiple rows at once, all parameters should be lists of the same length. This function will query the table to determine the schema and attempt to coerce data types as appropriate.
| PARAMETER | DESCRIPTION |
|---|---|
| `table_name` | The name of the table entry to write to. This table must already exist. |
| `named_params` | Pairwise combinations of column names and the data to write. For example, passing `score=[0.1, 0.9]` appends a column named `score` with those two values. |
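A sketch of appending rows; the table and column names are illustrative:

```python
# Appends two rows to an existing table, one column per keyword argument.
client.append_to_table(
    "my_table",
    label=["a", "b"],
    score=[0.1, 0.9],
)
```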
def create_dataset(name)
Creates a new dataset with the given name.
def create_table_entry(name, schema, url)
Create and register a new table.
| PARAMETER | DESCRIPTION |
|---|---|
| `name` | The name of the table entry to create. It must be unique within all entries in the catalog. An exception will be raised if an entry with the same name already exists. |
| `schema` | The schema of the table to create. |
| `url` | The URL of the directory in which to store the Lance table. |
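A sketch of creating a new Lance-backed table; the entry name, columns, and storage URL are illustrative:

```python
import pyarrow as pa

# Define the table schema and register a new, empty table entry backed by that URL.
schema = pa.schema([("label", pa.string()), ("score", pa.float64())])
table = client.create_table_entry("my_table", schema, "s3://my-bucket/tables/my_table")
```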
def dataset_entries()
Returns a list of all dataset entries in the catalog.
def dataset_names()
Returns a list of all dataset names in the catalog.
def datasets()
Returns a DataFrame containing all dataset entries in the catalog.
def do_global_maintenance()
Perform maintenance tasks on the whole system.
def entries()
Returns a DataFrame containing all entries in the catalog.
def entry_names()
Returns a list of all entry names in the catalog.
def get_dataset(*, id=None, name=None)
Returns a dataset by its ID or name.
Note: This is currently an alias for get_dataset_entry. In the future, it will return a data-oriented dataset
object instead.
def get_dataset_entry(*, id=None, name=None)
Returns a dataset by its ID or name.
def get_table(*, id=None, name=None)
Returns a table by its ID or name.
def get_table_entry(*, id=None, name=None)
Returns a table by its ID or name.
def register_table(name, url)
Registers a foreign Lance table (identified by its URL) as a new table entry with the given name.
| PARAMETER | DESCRIPTION |
|---|---|
| `name` | The name of the table entry to create. It must be unique within all entries in the catalog. An exception will be raised if an entry with the same name already exists. |
| `url` | The URL of the Lance table to register. |
def table_entries()
Returns a list of all table entries in the catalog.
def table_names()
Returns a list of all table names in the catalog.
def tables()
Returns a DataFrame containing all table entries in the catalog.
def write_table(name, batches, insert_mode)
Writes record batches into an existing table.
| PARAMETER | DESCRIPTION |
|---|---|
| `name` | The name of the table entry to write to. This table must already exist. |
| `batches` | One or more record batches to write into the table. For convenience, you can pass in a single record batch, a list of record batches, a list of lists of batches, or a `pyarrow.RecordBatchReader`. |
| `insert_mode` | Determines how rows should be added to the existing table. |
class Entry
An entry in the catalog.
catalog: CatalogClientInternal
property
The catalog client that this entry belongs to.
created_at: datetime
property
The entry's creation date and time.
id: EntryId
property
The entry's id.
kind: EntryKind
property
The entry's kind.
name: str
property
The entry's name.
updated_at: datetime
property
The entry's last updated date and time.
def delete()
Delete this entry from the catalog.
def update(*, name=None)
Update this entry's properties.
| PARAMETER | DESCRIPTION |
|---|---|
| `name` | New name for the entry. |
class EntryId
A unique identifier for an entry in the catalog.
def __init__(id)
Create a new EntryId from a string.
def __str__()
Return str(self).
class EntryKind
The kinds of entries that can be stored in the catalog.
def __int__()
Return int(self).
def __str__()
Return str(self).
class NotFoundError
Bases: Exception
Raised when the requested resource is not found.
class TableEntry
Bases: Entry
A table entry in the catalog.
Note: this object acts as a table provider for DataFusion.
catalog: CatalogClientInternal
property
The catalog client that this entry belongs to.
created_at: datetime
property
The entry's creation date and time.
id: EntryId
property
The entry's id.
kind: EntryKind
property
The entry's kind.
name: str
property
The entry's name.
storage_url: str
property
The table's storage URL.
updated_at: datetime
property
The entry's last updated date and time.
def __datafusion_table_provider__()
Returns a DataFusion table provider capsule.
def delete()
Delete this entry from the catalog.
def df()
Register this table with the DataFusion context and return a DataFrame.
def to_arrow_reader()
Convert this table to a pyarrow.RecordBatchReader.
def update(*, name=None)
Update this entry's properties.
| PARAMETER | DESCRIPTION |
|---|---|
| `name` | New name for the entry. |
class Task
A handle on a remote task.
id: str
property
The task id.
def wait(timeout_secs)
Block until the task is completed or the timeout is reached.
A TimeoutError is raised if the timeout is reached.
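A minimal usage sketch, assuming `task` is a `Task` handle obtained from a previous call:

```python
try:
    task.wait(timeout_secs=60)  # blocks until the remote task completes
except TimeoutError:
    print(f"task {task.id} did not complete within 60 s")
```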