Catalog

rerun.catalog

IndexValuesLike: TypeAlias = npt.NDArray[np.int_] | npt.NDArray[np.datetime64] | pa.Int64Array module-attribute

A type alias for index values.

This can be a numpy array of integers or datetime64 values, or a pyarrow.Int64Array.
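
Example (a sketch of values accepted wherever IndexValuesLike is expected; the reader() call in the final comment is described under DatasetEntry below):

import numpy as np
import pyarrow as pa

# Integer index values, e.g. frame numbers or ticks:
frames = np.arange(0, 100, dtype=np.int64)

# Timestamp index values for a temporal timeline:
times = np.array(["2024-01-01T00:00:00", "2024-01-01T00:00:01"], dtype="datetime64[ns]")

# The equivalent pyarrow form:
frames_pa = pa.array(range(100), type=pa.int64())

# Any of these can then be passed as, for example:
# dataset.reader(index="frame", using_index_values=frames, fill_latest_at=True)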

VectorDistanceMetricLike: TypeAlias = VectorDistanceMetric | Literal['L2', 'Cosine', 'Dot', 'Hamming'] module-attribute

A type alias for vector distance metrics.

class Schema

The schema representing a set of available columns for a dataset.

A schema contains both index columns (timelines) and component columns (entity/component data).

def __eq__(other)

Check equality with another Schema.

def __init__(inner)

Create a new Schema wrapper.

PARAMETER DESCRIPTION
inner

The internal schema object from the bindings.

TYPE: SchemaInternal

def __iter__()

Iterate over all column descriptors in the schema (index columns first, then component columns).

def __repr__()

Return a string representation of the schema.

def column_for(entity_path, component)

Look up the column descriptor for a specific entity path and component.

PARAMETER DESCRIPTION
entity_path

The entity path to look up.

TYPE: str

component

The component to look up. Example: Points3D:positions.

TYPE: str

RETURNS DESCRIPTION
ComponentColumnDescriptor | None

The column descriptor, if it exists.
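
Example (a sketch; the entity path and component name are illustrative, and `dataset` is assumed to be an existing DatasetEntry):

schema = dataset.schema()

desc = schema.column_for("/points", "Points3D:positions")
if desc is not None:
    print(desc.name, desc.entity_path, desc.is_static)
else:
    print("column not found")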

def column_for_selector(selector)

Look up the column descriptor for a specific selector.

PARAMETER DESCRIPTION
selector

The selector to look up.

String arguments are expected to follow the format: "<entity_path>:<component_type>"

TYPE: str | ComponentColumnSelector | ComponentColumnDescriptor

RETURNS DESCRIPTION
ComponentColumnDescriptor

The column descriptor.

RAISES DESCRIPTION
LookupError

If the column is not found.

ValueError

If the string selector format is invalid or the input type is unsupported.

Note: if the input is already a `ComponentColumnDescriptor`, it is
returned directly without checking for existence.

def column_names()

Return a list of all column names in the schema.

RETURNS DESCRIPTION
The names of all columns (index columns first, then component columns).

def component_columns()

Return a list of all the component columns in the schema.

Component columns contain the data for a specific component of an entity.

def index_columns()

Return a list of all the index columns in the schema.

Index columns contain the index values for when the data was updated. They generally correspond to Rerun timelines.
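
Example (a sketch of inspecting a schema, assuming `dataset` is an existing DatasetEntry):

schema = dataset.schema()

for index_col in schema.index_columns():
    print("index:", index_col.name)

for component_col in schema.component_columns():
    print("component:", component_col.entity_path, component_col.component)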

class ComponentColumnDescriptor

The descriptor of a component column.

Component columns contain the data for a specific component of an entity.

Column descriptors are used to describe the columns in a Schema. They are read-only. To select a component column, use ComponentColumnSelector.

archetype: str property

The archetype name, if any.

This property is read-only.

component: str property

The component.

This property is read-only.

component_type: str | None property

The component type, if any.

This property is read-only.

entity_path: str property

The entity path.

This property is read-only.

is_static: bool property

Whether the column is static.

This property is read-only.

name: str property

The name of this column.

This property is read-only.

class ComponentColumnSelector

A selector for a component column.

Component columns contain the data for a specific component of an entity.

component: str property

The component.

This property is read-only.

entity_path: str property

The entity path.

This property is read-only.

def __init__(entity_path, component)

Create a new ComponentColumnSelector.

PARAMETER DESCRIPTION
entity_path

The entity path to select.

TYPE: str

component

The component to select. Example: Points3D:positions.

TYPE: str

class IndexColumnDescriptor

The descriptor of an index column.

Index columns contain the index values for when the data was updated. They generally correspond to Rerun timelines.

Column descriptors are used to describe the columns in a Schema. They are read-only. To select an index column, use IndexColumnSelector.

is_static: bool property

Part of the generic ColumnDescriptor interface; always False for index columns.

name: str property

The name of the index.

This property is read-only.

class IndexColumnSelector

A selector for an index column.

Index columns contain the index values for when the data was updated. They generally correspond to Rerun timelines.

name: str property

The name of the index.

This property is read-only.

def __init__(index)

Create a new IndexColumnSelector.

PARAMETER DESCRIPTION
index

The name of the index to select. Usually the name of a timeline.

TYPE: str

class AlreadyExistsError

Bases: Exception

Raised when trying to create a resource that already exists.

class CatalogClient

Client for a remote Rerun catalog server.

Note: the datafusion package is required to use this client. Initialization will fail with an error if the package is not installed.

ctx: datafusion.SessionContext property

Returns a DataFusion session context for querying the catalog.

url: str property

Returns the catalog URL.
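
Example (a minimal connection sketch; the constructor argument and server address shown here are assumptions, substitute your own server URL):

from rerun.catalog import CatalogClient

client = CatalogClient("rerun+http://localhost:51234")  # hypothetical address

print(client.url)
for entry in client.entries():
    print(entry.name, entry.kind)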

def all_entries() deprecated
Deprecated

Use entries() instead

Returns a list of all entries in the catalog.

def append_to_table(table_name, batches=None, **named_params) deprecated
Deprecated

Use TableEntry.append() instead

Append record batches to an existing table.

PARAMETER DESCRIPTION
table_name

The name of the table entry to write to. This table must already exist.

TYPE: str

batches

One or more record batches to write into the table.

TYPE: RecordBatchReader | RecordBatch | Sequence[RecordBatch] | Sequence[Sequence[RecordBatch]] | None DEFAULT: None

**named_params

Named parameters to write to the table as columns.

TYPE: Any DEFAULT: {}

def create_dataset(name)

Creates a new dataset with the given name.

def create_table(name, schema, url)

Create and register a new table.

PARAMETER DESCRIPTION
name

The name of the table entry to create. It must be unique within all entries in the catalog. An exception will be raised if an entry with the same name already exists.

TYPE: str

schema

The schema of the table to create.

TYPE: Schema

url

The URL of the directory for where to store the Lance table.

TYPE: str
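
Example (a sketch; the table name, columns, and storage URL are illustrative, and the schema is assumed here to be expressible as a pyarrow schema):

import pyarrow as pa

schema = pa.schema(
    [
        ("rerun_segment_id", pa.string()),
        ("success", pa.bool_()),
    ]
)

table = client.create_table(
    name="episode_metadata",
    schema=schema,
    url="s3://my-bucket/tables/episode_metadata",  # hypothetical location
)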

def create_table_entry(name, schema, url) deprecated
Deprecated

Use create_table() instead

Create and register a new table.

def dataset_entries() deprecated
Deprecated

Use datasets() instead

Returns a list of all dataset entries in the catalog.

def dataset_names(*, include_hidden=False)

Returns a list of all dataset names in the catalog.

PARAMETER DESCRIPTION
include_hidden

If True, include blueprint datasets.

TYPE: bool DEFAULT: False

def datasets(*, include_hidden=False)

Returns a list of all dataset entries in the catalog.

PARAMETER DESCRIPTION
include_hidden

If True, include blueprint datasets.

TYPE: bool DEFAULT: False

def do_global_maintenance()

Perform maintenance tasks on the whole system.

def entries(*, include_hidden=False)

Returns a list of all entries in the catalog.

PARAMETER DESCRIPTION
include_hidden

If True, include hidden entries (blueprint datasets and system tables like __entries).

TYPE: bool DEFAULT: False

def entry_names(*, include_hidden=False)

Returns a list of all entry names in the catalog.

PARAMETER DESCRIPTION
include_hidden

If True, include hidden entries (blueprint datasets and system tables like __entries).

TYPE: bool DEFAULT: False

def get_dataset(name=None, *, id=None)

Returns a dataset by its ID or name.

Exactly one of id or name must be provided.

PARAMETER DESCRIPTION
name

The name of the dataset.

TYPE: str | None DEFAULT: None

id

The unique identifier of the dataset. Can be an EntryId object or its string representation.

TYPE: EntryId | str | None DEFAULT: None
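
Example (a sketch; the dataset name and ID are illustrative):

dataset = client.get_dataset(name="droid")

# Or equivalently by ID (an EntryId object or its string representation):
dataset = client.get_dataset(id="1234567890abcdef")  # hypothetical ID
print(dataset.name, dataset.id)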

def get_dataset_entry(*, id=None, name=None) deprecated
Deprecated

Use get_dataset() instead

Returns a dataset by its ID or name.

def get_table(name=None, *, id=None)

Returns a table by its ID or name.

Exactly one of id or name must be provided.

PARAMETER DESCRIPTION
name

The name of the table.

TYPE: str | None DEFAULT: None

id

The unique identifier of the table. Can be an EntryId object or its string representation.

TYPE: EntryId | str | None DEFAULT: None

def get_table_entry(*, id=None, name=None) deprecated
Deprecated

Use get_table() instead

Returns a table by its ID or name.

def register_table(name, url)

Registers a foreign Lance table (identified by its URL) as a new table entry with the given name.

PARAMETER DESCRIPTION
name

The name of the table entry to create. It must be unique within all entries in the catalog. An exception will be raised if an entry with the same name already exists.

TYPE: str

url

The URL of the Lance table to register.

TYPE: str

def table_entries() deprecated
Deprecated

Use tables() instead

Returns a list of all table entries in the catalog.

def table_names(*, include_hidden=False)

Returns a list of all table names in the catalog.

PARAMETER DESCRIPTION
include_hidden

If True, include system tables (e.g., __entries).

TYPE: bool DEFAULT: False

def tables(*, include_hidden=False)

Returns a list of all table entries in the catalog.

PARAMETER DESCRIPTION
include_hidden

If True, include system tables (e.g., __entries).

TYPE: bool DEFAULT: False

def update_table(table_name, batches=None, **named_params) deprecated
Deprecated

Use TableEntry.upsert() instead

Upsert record batches to an existing table.

PARAMETER DESCRIPTION
table_name

The name of the table entry to write to. This table must already exist.

TYPE: str

batches

One or more record batches to write into the table.

TYPE: RecordBatchReader | RecordBatch | Sequence[RecordBatch] | Sequence[Sequence[RecordBatch]] | None DEFAULT: None

**named_params

Named parameters to write to the table as columns.

TYPE: Any DEFAULT: {}

def write_table(name, batches, insert_mode) deprecated
Deprecated

Use TableEntry.append(), overwrite(), or upsert() instead

Writes record batches into an existing table.

PARAMETER DESCRIPTION
name

The name of the table entry to write to. This table must already exist.

TYPE: str

batches

One or more record batches to write into the table. For convenience, you can pass a single record batch, a list of record batches, a list of lists of record batches, or a pyarrow.RecordBatchReader.

TYPE: RecordBatchReader | RecordBatch | Sequence[RecordBatch] | Sequence[Sequence[RecordBatch]]

insert_mode

Determines how rows should be added to the existing table.

TYPE: TableInsertMode

class DatasetEntry

Bases: Entry[DatasetEntryInternal]

A dataset entry in the catalog.

catalog: CatalogClient property

The catalog client that this entry belongs to.

created_at: datetime property

The entry's creation date and time.

id: EntryId property

The entry's id.

kind: EntryKind property

The entry's kind.

manifest_url: str property

Return the dataset manifest URL.

name: str property

The entry's name.

updated_at: datetime property

The entry's last updated date and time.

def __eq__(other)

Compare this entry to another object.

Supports comparison with str and EntryId to enable the following patterns:

"entry_name" in client.entries()
entry_id in client.entries()

def arrow_schema()

Return the Arrow schema of the data contained in the dataset.

def blueprint_dataset()

The associated blueprint dataset, if any.

def blueprints()

Lists all blueprints currently registered with this dataset.

def create_fts_index(*, column, time_index, store_position=False, base_tokenizer='simple') deprecated
Deprecated

Use create_fts_search_index() instead

def create_fts_search_index(*, column, time_index, store_position=False, base_tokenizer='simple')

Create a full-text search index on the given column.

def create_vector_index(*, column, time_index, target_partition_num_rows=None, num_sub_vectors=16, distance_metric='Cosine') deprecated
Deprecated

Use create_vector_search_index() instead

def create_vector_search_index(*, column, time_index, target_partition_num_rows=None, num_sub_vectors=16, distance_metric='Cosine')

Create a vector index on the given column.

This will enable indexing and build the vector index over all existing values in the specified component column.

Results can be retrieved using the search_vector API, which will include the time-point on the indexed timeline.

Only one index can be created per component column; executing this a second time for the same component column will replace the existing index.

PARAMETER DESCRIPTION
column

The component column to create the index on.

TYPE: str | ComponentColumnSelector | ComponentColumnDescriptor

time_index

Which timeline this index will map to.

TYPE: IndexColumnSelector

target_partition_num_rows

The target size (in number of rows) for each partition. The underlying indexer (Lance) picks a default when no value is specified (currently 8192) and independently caps the maximum number of partitions (currently 4096).

TYPE: int | None DEFAULT: None

num_sub_vectors

The number of sub-vectors to use when building the index.

TYPE: int DEFAULT: 16

distance_metric

The distance metric to use for the index. ("L2", "Cosine", "Dot", "Hamming")

TYPE: VectorDistanceMetric | str DEFAULT: 'Cosine'
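
Example (a sketch; the embedding column and timeline names are illustrative):

from rerun.catalog import IndexColumnSelector

dataset.create_vector_search_index(
    column="/embeddings:Tensor:data",  # hypothetical component column
    time_index=IndexColumnSelector("real_time"),
    num_sub_vectors=16,
    distance_metric="Cosine",
)

# Results can later be retrieved with search_vector, e.g.:
# dataset.search_vector(query=query_embedding, column="/embeddings:Tensor:data", top_k=10)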

def default_blueprint()

Return the name of the currently set default blueprint.

def default_blueprint_partition_id() deprecated
Deprecated

Use default_blueprint_segment_id() instead

The default blueprint partition ID for this dataset, if any.

def delete()

Delete this entry from the catalog.

def delete_indexes(column) deprecated
Deprecated

Use delete_search_indexes() instead

def delete_search_indexes(column)

Deletes all user-defined indexes for the specified column.

def do_maintenance(optimize_indexes=False, retrain_indexes=False, compact_fragments=False, cleanup_before=None, unsafe_allow_recent_cleanup=False)

Perform maintenance tasks on the dataset.

def download_partition(partition_id) deprecated
Deprecated

Use download_segment() instead

Download a partition from the dataset.

def download_segment(segment_id)

Download a segment from the dataset.

def filter_contents(exprs)

Return a new DatasetView filtered to the given entity paths.

Entity path expressions support wildcards:

- "/points/**" matches all entities under /points
- "-/text/**" excludes all entities under /text

PARAMETER DESCRIPTION
exprs

Entity path expression or list of entity path expressions.

TYPE: str | Sequence[str]

RETURNS DESCRIPTION
DatasetView

A new view filtered to the matching entity paths.

Examples:

# Filter to a single entity path
view = dataset.filter_contents("/points/**")

# Filter to specific entity paths
view = dataset.filter_contents(["/points/**"])

# Exclude certain paths
view = dataset.filter_contents(["/points/**", "-/text/**"])

# Chain with segment filters
view = dataset.filter_segments(["recording_0"]).filter_contents("/points/**")

def filter_segments(segment_ids)

Return a new DatasetView filtered to the given segment IDs.

PARAMETER DESCRIPTION
segment_ids

A segment ID string, a list of segment ID strings, or a DataFusion DataFrame with a column named 'rerun_segment_id'. When passing a DataFrame, if there are additional columns, they will be ignored.

TYPE: str | Sequence[str] | DataFrame

RETURNS DESCRIPTION
DatasetView

A new view filtered to the given segments.

Examples:

# Filter to a single segment
view = dataset.filter_segments("recording_0")

# Filter to specific segments
view = dataset.filter_segments(["recording_0", "recording_1"])

# Filter using a DataFrame
good_segments = segment_table.filter(col("success"))
view = dataset.filter_segments(good_segments)

# Read data from the filtered view
df = view.reader(index="timeline")

def list_indexes() deprecated
Deprecated

Use list_search_indexes() instead

def list_search_indexes()

List all user-defined indexes in this dataset.

def manifest()

Return the dataset manifest as a DataFusion DataFrame.

def partition_ids() deprecated
Deprecated

Use segment_ids() instead

Returns a list of partition IDs for the dataset.

def partition_url(partition_id, timeline=None, start=None, end=None) deprecated
Deprecated

Use segment_url() instead

Return the URL for the given partition.

def reader(index, *, include_semantically_empty_columns=False, include_tombstone_columns=False, fill_latest_at=False, using_index_values=None)

Create a reader over this dataset.

Returns a DataFusion DataFrame.

Server-side filters

The returned DataFrame supports server-side filtering on both the rerun_segment_id column and the index (timeline) column, which can greatly improve performance. For example, the following filters are pushed down to and handled by the Rerun server.

dataset.reader(index="real_time").filter(col("rerun_segment_id") == "aabbccddee")
dataset.reader(index="real_time").filter(col("real_time") == "1234567890")
dataset.reader(index="real_time").filter(
    (col("rerun_segment_id") == "aabbccddee") & (col("real_time") == "1234567890")
)
PARAMETER DESCRIPTION
index

The index (timeline) to use for the view. Pass None to read only static data.

TYPE: str | None

include_semantically_empty_columns

Whether to include columns that are semantically empty.

TYPE: bool DEFAULT: False

include_tombstone_columns

Whether to include tombstone columns.

TYPE: bool DEFAULT: False

fill_latest_at

Whether to fill null values with the latest valid data.

TYPE: bool DEFAULT: False

using_index_values

If provided, specifies the exact index values to sample for all segments. Can be a numpy array (datetime64[ns] or int64), a pyarrow Array, or a sequence. Use with fill_latest_at=True to populate rows with the most recent data.

TYPE: IndexValuesLike | None DEFAULT: None

RETURNS DESCRIPTION
DataFrame

A DataFusion DataFrame.
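
Example (a sketch, assuming the datafusion package is installed and the dataset has a timeline named "real_time"):

from datafusion import col

df = dataset.reader(index="real_time", fill_latest_at=True)

# This filter is pushed down to the server (see "Server-side filters" above):
df = df.filter(col("rerun_segment_id") == "recording_0")
print(df.to_pandas())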

def register(recording_uri, *, layer_name='base')

Register RRD URIs to the dataset and return a handle to track progress.

This method initiates the registration of recordings to the dataset, and returns a handle that can be used to wait for completion or iterate over results.

PARAMETER DESCRIPTION
recording_uri

The URI(s) of the RRD(s) to register. Can be a single URI string or a sequence of URIs.

TYPE: str | Sequence[str]

layer_name

The layer(s) to which the recordings will be registered. Can be a single layer name (applied to all recordings) or a sequence of layer names (must match the length of recording_uri). Defaults to "base".

TYPE: str | Sequence[str] DEFAULT: 'base'

RETURNS DESCRIPTION
RegistrationHandle

A handle to track and wait on the registration tasks.
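
Example (a sketch; the recording URIs are illustrative):

handle = dataset.register(
    [
        "s3://my-bucket/recordings/episode_0.rrd",
        "s3://my-bucket/recordings/episode_1.rrd",
    ],
    layer_name="base",
)

# Block until both registrations complete (see RegistrationHandle below):
result = handle.wait(timeout_secs=600)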

def register_blueprint(uri, set_default=True)

Register an existing .rbl blueprint file that is visible to the server.

By default, the blueprint is also set as this dataset's default.

def register_prefix(recordings_prefix, layer_name=None)

Register all RRDs under a given prefix to the dataset and return a handle to track progress.

A prefix is a directory-like path in an object store (e.g. an S3 bucket or ABS container). All RRDs that are recursively found under the given prefix will be registered to the dataset.

This method initiates the registration of the recordings to the dataset, and returns a handle that can be used to wait for completion or iterate over results.

PARAMETER DESCRIPTION
recordings_prefix

The prefix under which to register all RRDs.

TYPE: str

layer_name

The layer to which the recordings will be registered. If None, this defaults to "base".

TYPE: str | None DEFAULT: None

RETURNS DESCRIPTION
A handle to track and wait on the registration tasks.

def schema()

Return the schema of the data contained in the dataset.

def search_fts(query, column)

Search the dataset using a full-text search query.

def search_vector(query, column, top_k)

Search the dataset using a vector search query.

def segment_ids()

Returns a list of segment IDs for the dataset.

def segment_table(join_meta=None, join_key='rerun_segment_id')

Return the segment table as a DataFusion DataFrame.

The segment table contains metadata about each segment in the dataset, including segment IDs, layer names, storage URLs, and size information.

PARAMETER DESCRIPTION
join_meta

Optional metadata table or DataFrame to join with the segment table. If a TableEntry is provided, it will be converted to a DataFrame using reader().

TYPE: TableEntry | DataFrame | None DEFAULT: None

join_key

The column name to use for joining, defaults to "rerun_segment_id". Both the segment table and join_meta must contain this column.

TYPE: str DEFAULT: 'rerun_segment_id'

RETURNS DESCRIPTION
DataFrame

The segment metadata table, optionally joined with join_meta.
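
Example (a sketch, assuming a metadata table keyed by "rerun_segment_id" exists in the catalog):

from datafusion import col

meta = client.get_table(name="episode_metadata")  # hypothetical table

# Segment metadata joined with the per-episode metadata:
df = dataset.segment_table(join_meta=meta)

# Filter on a joined column and reuse the result as a segment filter:
good = df.filter(col("success"))
view = dataset.filter_segments(good)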

def segment_url(segment_id, timeline=None, start=None, end=None)

Return the URL for the given segment.

PARAMETER DESCRIPTION
segment_id

The ID of the segment to get the URL for.

TYPE: str

timeline

The name of the timeline to display.

TYPE: str | None DEFAULT: None

start

The selected start time for the segment. Integer for ticks, or datetime/nanoseconds for timestamps.

TYPE: datetime | int | None DEFAULT: None

end

The selected end time for the segment. Integer for ticks, or datetime/nanoseconds for timestamps.

TYPE: datetime | int | None DEFAULT: None

Examples:

With ticks
>>> start_tick, end_tick = 0, 10
>>> dataset.segment_url("some_id", "log_tick", start_tick, end_tick)

With timestamps
>>> start_time, end_time = datetime.now() - timedelta(seconds=4), datetime.now()
>>> dataset.segment_url("some_id", "real_time", start_time, end_time)

RETURNS DESCRIPTION
str

The URL for the given segment.

def set_default_blueprint(blueprint_name)

Set an already-registered blueprint as default for this dataset.

def set_default_blueprint_partition_id(partition_id) deprecated
Deprecated

Use set_default_blueprint_segment_id() instead

Set the default blueprint partition ID for this dataset.

def update(*, name=None)

Update this entry's properties.

PARAMETER DESCRIPTION
name

New name for the entry

TYPE: str | None DEFAULT: None

class DatasetView

A filtered view over a dataset in the catalog.

A DatasetView provides lazy filtering over a dataset's segments and entity paths. Filters are composed lazily and only applied when data is actually read.

Create a DatasetView by calling filter_segments() or filter_contents() on a DatasetEntry.

Examples:

# Filter to specific segments
view = dataset.filter_segments(["recording_0", "recording_1"])

# Filter to specific entity paths
view = dataset.filter_contents(["/points/**"])

# Chain filters
view = dataset.filter_segments(["recording_0"]).filter_contents(["/points/**"])

# Read data
df = view.reader(index="timeline")

def __init__(internal)

Create a new DatasetView wrapper.

PARAMETER DESCRIPTION
internal

The internal Rust-side DatasetView object.

TYPE: DatasetViewInternal

def __repr__()

Return a string representation of the DatasetView.

def arrow_schema()

Return the filtered Arrow schema for this view.

RETURNS DESCRIPTION
Schema

The filtered Arrow schema.

def download_segment(segment_id) deprecated
Deprecated

This method is deprecated and will be removed in a future release

Download a specific segment from the dataset.

PARAMETER DESCRIPTION
segment_id

The ID of the segment to download.

TYPE: str

RETURNS DESCRIPTION
Recording

The downloaded recording.

def filter_contents(exprs)

Return a new DatasetView filtered to the given entity paths.

Entity path expressions support wildcards:

- "/points/**" matches all entities under /points
- "-/text/**" excludes all entities under /text

PARAMETER DESCRIPTION
exprs

Entity path expression or list of entity path expressions.

TYPE: str | Sequence[str]

RETURNS DESCRIPTION
DatasetView

A new view filtered to the matching entity paths.

def filter_segments(segment_ids)

Return a new DatasetView filtered to the given segment IDs.

Filters are composed: if this view already has a segment filter, the result is the intersection of both filters.

PARAMETER DESCRIPTION
segment_ids

A segment ID string, a list of segment ID strings, or a DataFusion DataFrame with a column named 'rerun_segment_id'.

TYPE: str | Sequence[str] | DataFrame

RETURNS DESCRIPTION
DatasetView

A new view filtered to the given segments.

def reader(index, *, include_semantically_empty_columns=False, include_tombstone_columns=False, fill_latest_at=False, using_index_values=None)

Create a reader over this DatasetView.

Returns a DataFusion DataFrame.

Server-side filters

The returned DataFrame supports server-side filtering on both the rerun_segment_id column and the index (timeline) column, which can greatly improve performance. For example, the following filters are pushed down to and handled by the Rerun server.

view.reader(index="real_time").filter(col("rerun_segment_id") == "aabbccddee")
view.reader(index="real_time").filter(col("real_time") == "1234567890")
view.reader(index="real_time").filter(
    (col("rerun_segment_id") == "aabbccddee") & (col("real_time") == "1234567890")
)
PARAMETER DESCRIPTION
index

The index (timeline) to use for the view. Pass None to read only static data.

TYPE: str | None

include_semantically_empty_columns

Whether to include columns that are semantically empty.

TYPE: bool DEFAULT: False

include_tombstone_columns

Whether to include tombstone columns.

TYPE: bool DEFAULT: False

fill_latest_at

Whether to fill null values with the latest valid data.

TYPE: bool DEFAULT: False

using_index_values

If provided, specifies the exact index values to sample for all segments. Can be a numpy array (datetime64[ns] or int64), a pyarrow Array, or a sequence. Use with fill_latest_at=True to populate rows with the most recent data.

TYPE: IndexValuesLike | None DEFAULT: None

RETURNS DESCRIPTION
A DataFusion DataFrame.

def schema()

Return the filtered schema for this view.

The schema reflects any content filters applied to the view.

RETURNS DESCRIPTION
Schema

The filtered schema.

def segment_ids()

Return the segment IDs for this view.

If segment filters have been applied, only matching segments are returned.

RETURNS DESCRIPTION
list[str]

The list of segment IDs.

def segment_table(join_meta=None, join_key='rerun_segment_id')

Return the segment table as a DataFusion DataFrame.

The segment table contains metadata about each segment in the dataset, including segment IDs, layer names, storage URLs, and size information.

Only segments matching this view's filters are included.

PARAMETER DESCRIPTION
join_meta

Optional metadata table or DataFrame to join with the segment table. If a TableEntry is provided, it will be converted to a DataFrame using reader().

TYPE: TableEntry | DataFrame | None DEFAULT: None

join_key

The column name to use for joining, defaults to "rerun_segment_id". Both the segment table and join_meta must contain this column.

TYPE: str DEFAULT: 'rerun_segment_id'

RETURNS DESCRIPTION
DataFrame

The segment metadata table, optionally joined with join_meta.

class Entry

Bases: ABC, Generic[InternalEntryT]

An entry in the catalog.

catalog: CatalogClient property

The catalog client that this entry belongs to.

created_at: datetime property

The entry's creation date and time.

id: EntryId property

The entry's id.

kind: EntryKind property

The entry's kind.

name: str property

The entry's name.

updated_at: datetime property

The entry's last updated date and time.

def __eq__(other)

Compare this entry to another object.

Supports comparison with str and EntryId to enable the following patterns:

"entry_name" in client.entries()
entry_id in client.entries()

def delete()

Delete this entry from the catalog.

def update(*, name=None)

Update this entry's properties.

PARAMETER DESCRIPTION
name

New name for the entry

TYPE: str | None DEFAULT: None

class EntryId

A unique identifier for an entry in the catalog.

def __init__(id)

Create a new EntryId from a string.

def __str__()

Return str(self).

class EntryKind

The kinds of entries that can be stored in the catalog.

def __int__()

Return int(self).

def __str__()

Return str(self).

class NotFoundError

Bases: Exception

Raised when the requested resource is not found.

class RegistrationHandle

Handle to track and wait on segment registration tasks.

def iter_results(timeout_secs=None)

Stream completed registrations as they finish.

Uses the server's streaming API to yield results as tasks complete. Each result is yielded exactly once when its task completes (success or error).

PARAMETER DESCRIPTION
timeout_secs

Timeout in seconds, or None to block indefinitely. Note that None does not guarantee that a TimeoutError will never be raised for long-running tasks. Setting a timeout and polling is recommended when monitoring very large registration batches.

TYPE: int | None DEFAULT: None

YIELDS DESCRIPTION
SegmentRegistrationResult

The result of each completed registration.

RAISES DESCRIPTION
TimeoutError

If the timeout is reached before all tasks complete.

def wait(timeout_secs=None)

Block until all registrations complete.

PARAMETER DESCRIPTION
timeout_secs

Timeout in seconds, or None to block indefinitely. Note that None does not guarantee that a TimeoutError will never be raised for long-running tasks. Setting a timeout and polling is recommended when monitoring very large registration batches.

TYPE: int | None DEFAULT: None

RETURNS DESCRIPTION
RegistrationResult

The result containing the list of segment IDs in registration order.

RAISES DESCRIPTION
ValueError

If any registration fails.

TimeoutError

If the timeout is reached before all tasks complete.
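
Example (a sketch of streaming results as they complete; `uris` is assumed to be a list of RRD URIs):

handle = dataset.register(uris)

for result in handle.iter_results(timeout_secs=60):
    if result.is_success:
        print("registered", result.uri, "->", result.segment_id)
    else:
        print("failed", result.uri, ":", result.error)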

class SegmentRegistrationResult dataclass

Result of a completed segment registration.

error: str | None instance-attribute

Error message if registration failed, or None if successful.

is_error: bool property

Returns True if the registration failed.

is_success: bool property

Returns True if the registration was successful.

segment_id: str | None instance-attribute

The resulting segment ID. May be None if registration failed.

uri: str instance-attribute

The source URI that was registered.

class TableEntry

Bases: Entry[TableEntryInternal]

A table entry in the catalog.

Note: this object acts as a table provider for DataFusion.

catalog: CatalogClient property

The catalog client that this entry belongs to.

created_at: datetime property

The entry's creation date and time.

id: EntryId property

The entry's id.

kind: EntryKind property

The entry's kind.

name: str property

The entry's name.

storage_url: str property

The table's storage URL.

updated_at: datetime property

The entry's last updated date and time.

def __datafusion_table_provider__()

Returns a DataFusion table provider capsule.

def __eq__(other)

Compare this entry to another object.

Supports comparison with str and EntryId to enable the following patterns:

"entry_name" in client.entries()
entry_id in client.entries()

def append(batches=None, **named_params)

Append to the Table.

PARAMETER DESCRIPTION
batches

Arrow data to append to the table. Can be a RecordBatchReader, a single RecordBatch, a list of RecordBatches, or a list of lists of RecordBatches (as returned by datafusion.DataFrame.collect()).

TYPE: _BatchesType | None DEFAULT: None

**named_params

Each named parameter corresponds to a column in the table.

TYPE: Any DEFAULT: {}
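
Example (a sketch; the column names are illustrative):

import pyarrow as pa

# Append an explicit record batch:
batch = pa.RecordBatch.from_pydict({"rerun_segment_id": ["recording_2"], "success": [True]})
table.append(batch)

# Or append a row via named parameters, one per column:
table.append(rerun_segment_id=["recording_3"], success=[False])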

def arrow_schema()

Returns the Arrow schema of the table.

def delete()

Delete this entry from the catalog.

def overwrite(batches=None, **named_params)

Overwrite the Table with new data.

PARAMETER DESCRIPTION
batches

Arrow data to overwrite the table with. Can be a RecordBatchReader, a single RecordBatch, a list of RecordBatches, or a list of lists of RecordBatches (as returned by datafusion.DataFrame.collect()).

TYPE: _BatchesType | None DEFAULT: None

**named_params

Each named parameter corresponds to a column in the table.

TYPE: Any DEFAULT: {}

def reader()

Registers the table with the DataFusion context and returns a DataFrame.

def to_arrow_reader()

Convert this table to a pyarrow.RecordBatchReader.

def update(*, name=None)

Update this entry's properties.

PARAMETER DESCRIPTION
name

New name for the entry

TYPE: str | None DEFAULT: None

def upsert(batches=None, **named_params)

Upsert data into the Table.

To use upsert, the table must contain a column with the metadata:

    {"rerun:is_table_index" = "true"}

Any row with a matching index value will have the new data inserted. Any row without a matching index value will be appended as a new row.

PARAMETER DESCRIPTION
batches

Arrow data to upsert into the table. Can be a RecordBatchReader, a single RecordBatch, a list of RecordBatches, or a list of lists of RecordBatches (as returned by datafusion.DataFrame.collect()).

TYPE: _BatchesType | None DEFAULT: None

**named_params

Each named parameter corresponds to a column in the table

TYPE: Any DEFAULT: {}
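
Example (a sketch, assuming "rerun_segment_id" is the table's index column, i.e. it carries the rerun:is_table_index metadata):

import pyarrow as pa

batch = pa.RecordBatch.from_pydict(
    {
        "rerun_segment_id": ["recording_0", "recording_9"],
        "success": [False, True],
    }
)

# "recording_0" is updated in place; "recording_9" is appended as a new row.
table.upsert(batch)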

class VectorDistanceMetric

Bases: Enum

Which distance metric to use for the vector index.