Catalog
rerun.catalog
IndexValuesLike: TypeAlias = npt.NDArray[np.int_] | npt.NDArray[np.datetime64] | pa.Int64Array
module-attribute
A type alias for index values.
This can be a numpy array of integers or datetime64 values, or a pyarrow.Int64Array.
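For illustration, each of the following is a valid IndexValuesLike value (a minimal sketch using numpy and pyarrow):

```python
import numpy as np
import pyarrow as pa

# Integer index values, e.g. ticks on a sequence timeline.
tick_values = np.arange(0, 100, dtype=np.int64)

# Timestamp index values as nanosecond-precision datetime64.
time_values = np.array(
    ["2024-01-01T00:00:00", "2024-01-01T00:00:01"], dtype="datetime64[ns]"
)

# The equivalent pyarrow representation.
arrow_values = pa.array([0, 1, 2], type=pa.int64())
```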
VectorDistanceMetricLike: TypeAlias = VectorDistanceMetric | Literal['L2', 'Cosine', 'Dot', 'Hamming']
module-attribute
A type alias for vector distance metrics.
class Schema
The schema representing a set of available columns for a dataset.
A schema contains both index columns (timelines) and component columns (entity/component data).
def __eq__(other)
Check equality with another Schema.
def __init__(inner)
Create a new Schema wrapper.
| PARAMETER | DESCRIPTION |
|---|---|
| `inner` | The internal schema object from the bindings. |
def __iter__()
Iterate over all column descriptors in the schema (index columns first, then component columns).
def __repr__()
Return a string representation of the schema.
def column_for(entity_path, component)
Look up the column descriptor for a specific entity path and component.
| PARAMETER | DESCRIPTION |
|---|---|
| `entity_path` | The entity path to look up. |
| `component` | The component to look up. |

| RETURNS | DESCRIPTION |
|---|---|
| `ComponentColumnDescriptor \| None` | The column descriptor, if it exists. |
def column_for_selector(selector)
Look up the column descriptor for a specific selector.
| PARAMETER | DESCRIPTION |
|---|---|
| `selector` | The selector to look up. String arguments are expected to follow the documented selector format. |

| RETURNS | DESCRIPTION |
|---|---|
| `ComponentColumnDescriptor` | The column descriptor. |

| RAISES | DESCRIPTION |
|---|---|
| `LookupError` | If the column is not found. |
| `ValueError` | If the string selector format is invalid or the input type is unsupported. |

Note: if the input is already a `ComponentColumnDescriptor`, it is returned directly without checking for existence.
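A sketch of the lookup styles described above. The entity path, component name, and selector string are illustrative placeholders, not guaranteed to match your data:

```python
schema = dataset.schema()

# Iterate all columns: index columns first, then component columns.
for column in schema:
    print(column.name)

# Structured lookup: returns None if the column does not exist.
descriptor = schema.column_for("/points", "Points3D:positions")  # placeholder names

# Selector-based lookup: raises LookupError if the column is not found.
descriptor = schema.column_for_selector("/points:Points3D:positions")  # placeholder selector
```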
def column_names()
Return a list of all column names in the schema.
| RETURNS | DESCRIPTION |
|---|---|
| `list[str]` | The names of all columns (index columns first, then component columns). |
def component_columns()
Return a list of all the component columns in the schema.
Component columns contain the data for a specific component of an entity.
def index_columns()
Return a list of all the index columns in the schema.
Index columns contain the index values for when the data was updated. They generally correspond to Rerun timelines.
class ComponentColumnDescriptor
The descriptor of a component column.
Component columns contain the data for a specific component of an entity.
Column descriptors are used to describe the columns in a
Schema. They are read-only. To select a component
column, use ComponentColumnSelector.
archetype: str
property
The archetype name, if any.
This property is read-only.
component: str
property
The component.
This property is read-only.
component_type: str | None
property
The component type, if any.
This property is read-only.
entity_path: str
property
The entity path.
This property is read-only.
is_static: bool
property
Whether the column is static.
This property is read-only.
name: str
property
The name of this column.
This property is read-only.
class ComponentColumnSelector
A selector for a component column.
Component columns contain the data for a specific component of an entity.
component: str
property
The component.
This property is read-only.
entity_path: str
property
The entity path.
This property is read-only.
class IndexColumnDescriptor
The descriptor of an index column.
Index columns contain the index values for when the data was updated. They generally correspond to Rerun timelines.
Column descriptors are used to describe the columns in a
Schema. They are read-only. To select an index
column, use IndexColumnSelector.
is_static: bool
property
Part of the generic ColumnDescriptor interface; always False for index columns.
name: str
property
The name of the index.
This property is read-only.
class IndexColumnSelector
A selector for an index column.
Index columns contain the index values for when the data was updated. They generally correspond to Rerun timelines.
name: str
property
The name of the index.
This property is read-only.
def __init__(index)
Create a new IndexColumnSelector.
| PARAMETER | DESCRIPTION |
|---|---|
| `index` | The name of the index to select. Usually the name of a timeline. |
class AlreadyExistsError
Bases: Exception
Raised when trying to create a resource that already exists.
class CatalogClient
Client for a remote Rerun catalog server.
Note: the datafusion package is required to use this client. Initialization will fail with an error if the package
is not installed.
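A minimal connection sketch. The URL is a placeholder, and constructing the client directly from a server URL is an assumption here:

```python
import rerun as rr

# Connect to a catalog server (placeholder URL; assumed constructor).
client = rr.catalog.CatalogClient("rerun+http://localhost:51234")

print(client.url)
print(client.entry_names())

# The DataFusion session context can be used to run SQL queries
# against entries registered in the catalog.
ctx = client.ctx
```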
ctx: datafusion.SessionContext
property
Returns a DataFusion session context for querying the catalog.
url: str
property
Returns the catalog URL.
def all_entries()
deprecated
Deprecated
Use entries() instead
Returns a list of all entries in the catalog.
def append_to_table(table_name, batches=None, **named_params)
deprecated
Deprecated
Use TableEntry.append() instead
Append record batches to an existing table.
| PARAMETER | DESCRIPTION |
|---|---|
| `table_name` | The name of the table entry to write to. This table must already exist. |
| `batches` | One or more record batches to write into the table. |
| `**named_params` | Named parameters to write to the table as columns. |
def create_dataset(name)
Creates a new dataset with the given name.
def create_table(name, schema, url)
Create and register a new table.
| PARAMETER | DESCRIPTION |
|---|---|
| `name` | The name of the table entry to create. It must be unique within all entries in the catalog. An exception will be raised if an entry with the same name already exists. |
| `schema` | The schema of the table to create. |
| `url` | The URL of the directory where the Lance table will be stored. |
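A sketch of creating a table, with a hypothetical schema and storage URL:

```python
import pyarrow as pa

# Hypothetical table schema and Lance storage location.
schema = pa.schema(
    [
        pa.field("rerun_segment_id", pa.string()),
        pa.field("label", pa.string()),
    ]
)

table = client.create_table(
    name="episode_labels",  # must be unique within the catalog
    schema=schema,
    url="s3://my-bucket/tables/episode_labels",  # placeholder URL
)
```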
def create_table_entry(name, schema, url)
deprecated
Deprecated
Use create_table() instead
Create and register a new table.
def dataset_entries()
deprecated
Deprecated
Use datasets() instead
Returns a list of all dataset entries in the catalog.
def dataset_names(*, include_hidden=False)
Returns a list of all dataset names in the catalog.
| PARAMETER | DESCRIPTION |
|---|---|
| `include_hidden` | If True, include blueprint datasets. |
def datasets(*, include_hidden=False)
Returns a list of all dataset entries in the catalog.
| PARAMETER | DESCRIPTION |
|---|---|
| `include_hidden` | If True, include blueprint datasets. |
def do_global_maintenance()
Perform maintenance tasks on the whole system.
def entries(*, include_hidden=False)
Returns a list of all entries in the catalog.
| PARAMETER | DESCRIPTION |
|---|---|
| `include_hidden` | If True, include hidden entries (blueprint datasets and system tables). |
def entry_names(*, include_hidden=False)
Returns a list of all entry names in the catalog.
| PARAMETER | DESCRIPTION |
|---|---|
| `include_hidden` | If True, include hidden entries (blueprint datasets and system tables). |
def get_dataset(name=None, *, id=None)
Returns a dataset by its ID or name.
Exactly one of id or name must be provided.
| PARAMETER | DESCRIPTION |
|---|---|
| `name` | The name of the dataset. |
| `id` | The unique identifier of the dataset. Can be an `EntryId` or a string. |
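A get-or-create sketch; that a missing name raises NotFoundError is an assumption based on the exception documented below:

```python
from rerun.catalog import NotFoundError

try:
    dataset = client.get_dataset("my_dataset")
except NotFoundError:  # assumed behavior for a missing name
    dataset = client.create_dataset("my_dataset")
```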
def get_dataset_entry(*, id=None, name=None)
deprecated
Deprecated
Use get_dataset() instead
Returns a dataset by its ID or name.
def get_table(name=None, *, id=None)
Returns a table by its ID or name.
Exactly one of id or name must be provided.
def get_table_entry(*, id=None, name=None)
deprecated
Deprecated
Use get_table() instead
Returns a table by its ID or name.
def register_table(name, url)
Registers a foreign Lance table (identified by its URL) as a new table entry with the given name.
| PARAMETER | DESCRIPTION |
|---|---|
| `name` | The name of the table entry to create. It must be unique within all entries in the catalog. An exception will be raised if an entry with the same name already exists. |
| `url` | The URL of the Lance table to register. |
def table_entries()
deprecated
Deprecated
Use tables() instead
Returns a list of all table entries in the catalog.
def table_names(*, include_hidden=False)
Returns a list of all table names in the catalog.
| PARAMETER | DESCRIPTION |
|---|---|
| `include_hidden` | If True, include system tables. |
def tables(*, include_hidden=False)
Returns a list of all table entries in the catalog.
| PARAMETER | DESCRIPTION |
|---|---|
| `include_hidden` | If True, include system tables. |
def update_table(table_name, batches=None, **named_params)
deprecated
Deprecated
Use TableEntry.upsert() instead
Upsert record batches to an existing table.
| PARAMETER | DESCRIPTION |
|---|---|
| `table_name` | The name of the table entry to write to. This table must already exist. |
| `batches` | One or more record batches to write into the table. |
| `**named_params` | Named parameters to write to the table as columns. |
def write_table(name, batches, insert_mode)
deprecated
Deprecated
Use TableEntry.append(), overwrite(), or upsert() instead
Writes record batches into an existing table.
| PARAMETER | DESCRIPTION |
|---|---|
| `name` | The name of the table entry to write to. This table must already exist. |
| `batches` | One or more record batches to write into the table. For convenience, you can pass in a single record batch, a list of record batches, a list of lists of record batches, or a `RecordBatchReader`. |
| `insert_mode` | Determines how rows should be added to the existing table. |
class DatasetEntry
Bases: Entry[DatasetEntryInternal]
A dataset entry in the catalog.
catalog: CatalogClient
property
The catalog client that this entry belongs to.
created_at: datetime
property
The entry's creation date and time.
id: EntryId
property
The entry's id.
kind: EntryKind
property
The entry's kind.
manifest_url: str
property
Return the dataset manifest URL.
name: str
property
The entry's name.
updated_at: datetime
property
The entry's last updated date and time.
def __eq__(other)
Compare this entry to another object.
Supports comparison with str and EntryId to enable the following patterns:
"entry_name" in client.entries()
entry_id in client.entries()
def arrow_schema()
Return the Arrow schema of the data contained in the dataset.
def blueprint_dataset()
The associated blueprint dataset, if any.
def blueprints()
Lists all blueprints currently registered with this dataset.
def create_fts_index(*, column, time_index, store_position=False, base_tokenizer='simple')
deprecated
Deprecated
Use create_fts_search_index() instead
def create_fts_search_index(*, column, time_index, store_position=False, base_tokenizer='simple')
Create a full-text search index on the given column.
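A sketch, with an illustrative component column name:

```python
# Index a text column for full-text search (the "column" value is a placeholder).
dataset.create_fts_search_index(
    column="/descriptions:Text",  # placeholder component column
    time_index="log_time",
    store_position=False,
    base_tokenizer="simple",
)
```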
def create_vector_index(*, column, time_index, target_partition_num_rows=None, num_sub_vectors=16, distance_metric='Cosine')
deprecated
Deprecated
Use create_vector_search_index() instead
def create_vector_search_index(*, column, time_index, target_partition_num_rows=None, num_sub_vectors=16, distance_metric='Cosine')
Create a vector index on the given column.
This will enable indexing and build the vector index over all existing values in the specified component column.
Results can be retrieved using the search_vector API, which will include
the time-point on the indexed timeline.
Only one index can be created per component column -- executing this a second time for the same component column will replace the existing index.
| PARAMETER | DESCRIPTION |
|---|---|
| `column` | The component column to create the index on. |
| `time_index` | Which timeline this index will map to. |
| `target_partition_num_rows` | The target size (in number of rows) for each partition. The underlying indexer (Lance) will pick a default when no value is specified (currently 8192), and will cap the maximum number of partitions independently of this setting (currently 4096). |
| `num_sub_vectors` | The number of sub-vectors to use when building the index. |
| `distance_metric` | The distance metric to use for the index ("L2", "Cosine", "Dot", or "Hamming"). |
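A sketch, with an illustrative embedding column:

```python
# Build a vector index over an embedding column (names are placeholders).
dataset.create_vector_search_index(
    column="/camera:embedding",  # placeholder component column
    time_index="log_time",
    num_sub_vectors=16,
    distance_metric="Cosine",
)
```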
def default_blueprint()
Return the name of the currently set default blueprint.
def default_blueprint_partition_id()
deprecated
Deprecated
Use default_blueprint_segment_id() instead
The default blueprint partition ID for this dataset, if any.
def delete()
Delete this entry from the catalog.
def delete_indexes(column)
deprecated
Deprecated
Use delete_search_indexes() instead
def delete_search_indexes(column)
Deletes all user-defined indexes for the specified column.
def do_maintenance(optimize_indexes=False, retrain_indexes=False, compact_fragments=False, cleanup_before=None, unsafe_allow_recent_cleanup=False)
Perform maintenance tasks on this dataset.
def download_partition(partition_id)
deprecated
Deprecated
Use download_segment() instead
Download a partition from the dataset.
def download_segment(segment_id)
Download a segment from the dataset.
def filter_contents(exprs)
Return a new DatasetView filtered to the given entity paths.
Entity path expressions support wildcards:
- "/points/**" matches all entities under /points
- "-/text/**" excludes all entities under /text
| PARAMETER | DESCRIPTION |
|---|---|
| `exprs` | Entity path expression or list of entity path expressions. |

| RETURNS | DESCRIPTION |
|---|---|
| `DatasetView` | A new view filtered to the matching entity paths. |
Examples:
# Filter to a single entity path
view = dataset.filter_contents("/points/**")
# Filter to specific entity paths
view = dataset.filter_contents(["/points/**"])
# Exclude certain paths
view = dataset.filter_contents(["/points/**", "-/text/**"])
# Chain with segment filters
view = dataset.filter_segments(["recording_0"]).filter_contents("/points/**")
def filter_segments(segment_ids)
Return a new DatasetView filtered to the given segment IDs.
| PARAMETER | DESCRIPTION |
|---|---|
| `segment_ids` | A segment ID string, a list of segment ID strings, or a DataFusion DataFrame with a column named 'rerun_segment_id'. When passing a DataFrame, any additional columns are ignored. |

| RETURNS | DESCRIPTION |
|---|---|
| `DatasetView` | A new view filtered to the given segments. |
Examples:
# Filter to a single segment
view = dataset.filter_segments("recording_0")
# Filter to specific segments
view = dataset.filter_segments(["recording_0", "recording_1"])
# Filter using a DataFrame
good_segments = segment_table.filter(col("success"))
view = dataset.filter_segments(good_segments)
# Read data from the filtered view
df = view.reader(index="timeline")
def list_indexes()
deprecated
Deprecated
Use list_search_indexes() instead
def list_search_indexes()
List all user-defined indexes in this dataset.
def manifest()
Return the dataset manifest as a DataFusion DataFrame.
def partition_ids()
deprecated
Deprecated
Use segment_ids() instead
Returns a list of partition IDs for the dataset.
def partition_url(partition_id, timeline=None, start=None, end=None)
deprecated
Deprecated
Use segment_url() instead
Return the URL for the given partition.
def reader(index, *, include_semantically_empty_columns=False, include_tombstone_columns=False, fill_latest_at=False, using_index_values=None)
Create a reader over this dataset.
Returns a DataFusion DataFrame.
Server-side filters
The returned DataFrame supports server-side filtering on both the rerun_segment_id column and the index (timeline) column, which can greatly improve performance. For example, the following filters will be handled by the Rerun server.
dataset.reader(index="real_time").filter(col("rerun_segment_id") == "aabbccddee")
dataset.reader(index="real_time").filter(col("real_time") == "1234567890")
dataset.reader(index="real_time").filter(
(col("rerun_segment_id") == "aabbccddee") & (col("real_time") == "1234567890")
)
| PARAMETER | DESCRIPTION |
|---|---|
| `index` | The index (timeline) to use for the view. |
| `include_semantically_empty_columns` | Whether to include columns that are semantically empty. |
| `include_tombstone_columns` | Whether to include tombstone columns. |
| `fill_latest_at` | Whether to fill null values with the latest valid data. |
| `using_index_values` | If provided, specifies the exact index values to sample for all segments. Can be a numpy array (datetime64[ns] or int64), a pyarrow Array, or a sequence. |
| RETURNS | DESCRIPTION |
|---|---|
| `DataFrame` | A DataFusion DataFrame. |
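A usage sketch; the timeline name and segment ID are placeholders:

```python
import numpy as np
from datafusion import col

# Filters on the segment ID and index columns are pushed to the server.
df = dataset.reader(index="log_time").filter(
    col("rerun_segment_id") == "aabbccddee"  # placeholder segment ID
)

# Sample explicit index values, filling gaps with the latest valid data.
sample_times = np.array(["2024-01-01T00:00:00"], dtype="datetime64[ns]")
df = dataset.reader(
    index="log_time",
    using_index_values=sample_times,
    fill_latest_at=True,
)
```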
def register(recording_uri, *, layer_name='base')
Register RRD URIs to the dataset and return a handle to track progress.
This method initiates the registration of recordings to the dataset, and returns a handle that can be used to wait for completion or iterate over results.
| PARAMETER | DESCRIPTION |
|---|---|
| `recording_uri` | The URI(s) of the RRD(s) to register. Can be a single URI string or a sequence of URIs. |
| `layer_name` | The layer(s) to which the recordings will be registered. Can be a single layer name (applied to all recordings) or a sequence of layer names (which must match the length of `recording_uri`). |

| RETURNS | DESCRIPTION |
|---|---|
| `RegistrationHandle` | A handle to track and wait on the registration tasks. |
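A sketch; the URIs are placeholders:

```python
# Kick off registration of two recordings (placeholder URIs).
handle = dataset.register(
    [
        "s3://my-bucket/recordings/run_0.rrd",
        "s3://my-bucket/recordings/run_1.rrd",
    ]
)

# Block until both registrations complete; raises ValueError on failure.
result = handle.wait(timeout_secs=600)
```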
def register_blueprint(uri, set_default=True)
Register an existing .rbl file that is visible to the server.
By default, the blueprint is also set as this dataset's default.
def register_prefix(recordings_prefix, layer_name=None)
Register all RRDs under a given prefix to the dataset and return a handle to track progress.
A prefix is a directory-like path in an object store (e.g. an S3 bucket or ABS container). All RRDs that are recursively found under the given prefix will be registered to the dataset.
This method initiates the registration of the recordings to the dataset, and returns a handle that can be used to wait for completion or iterate over results.
| PARAMETER | DESCRIPTION |
|---|---|
| `recordings_prefix` | The prefix under which to register all RRDs. |
| `layer_name` | The layer to which the recordings will be registered. If `None`, the default layer is used. |

| RETURNS | DESCRIPTION |
|---|---|
| `RegistrationHandle` | A handle to track and wait on the registration tasks. |
def schema()
Return the schema of the data contained in the dataset.
def search_fts(query, column)
Search the dataset using a full-text search query.
def search_vector(query, column, top_k)
Search the dataset using a vector search query.
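A sketch of both search entry points; the column names are placeholders and assume the corresponding indexes were created beforehand:

```python
import numpy as np

# Full-text search over an FTS-indexed column (placeholder name).
hits = dataset.search_fts("red traffic light", column="/descriptions:Text")

# Vector search over a vector-indexed column (placeholder name and dimension).
query = np.random.rand(128).astype(np.float32)
hits = dataset.search_vector(query, column="/camera:embedding", top_k=10)
```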
def segment_ids()
Returns a list of segment IDs for the dataset.
def segment_table(join_meta=None, join_key='rerun_segment_id')
Return the segment table as a DataFusion DataFrame.
The segment table contains metadata about each segment in the dataset, including segment IDs, layer names, storage URLs, and size information.
| PARAMETER | DESCRIPTION |
|---|---|
| `join_meta` | Optional metadata table or DataFrame to join with the segment table. |
| `join_key` | The column name to use for joining; defaults to "rerun_segment_id". Both the segment table and `join_meta` must contain this column. |

| RETURNS | DESCRIPTION |
|---|---|
| `DataFrame` | The segment metadata table, optionally joined with `join_meta`. |
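A sketch; `episode_meta` is a hypothetical DataFusion DataFrame holding per-segment metadata:

```python
# Per-segment metadata (IDs, layers, storage URLs, sizes).
segments = dataset.segment_table()

# Join against an external metadata table on the shared segment ID column.
joined = dataset.segment_table(join_meta=episode_meta, join_key="rerun_segment_id")
```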
def segment_url(segment_id, timeline=None, start=None, end=None)
Return the URL for the given segment.
| PARAMETER | DESCRIPTION |
|---|---|
| `segment_id` | The ID of the segment to get the URL for. |
| `timeline` | The name of the timeline to display. |
| `start` | The start selected time for the segment. Integer for ticks, or datetime/nanoseconds for timestamps. |
| `end` | The end selected time for the segment. Integer for ticks, or datetime/nanoseconds for timestamps. |

Examples:
With ticks
>>> start_tick, end_tick = 0, 10
>>> dataset.segment_url("some_id", "log_tick", start_tick, end_tick)
With timestamps
>>> start_time, end_time = datetime.now() - timedelta(seconds=4), datetime.now()
>>> dataset.segment_url("some_id", "real_time", start_time, end_time)

| RETURNS | DESCRIPTION |
|---|---|
| `str` | The URL for the given segment. |
def set_default_blueprint(blueprint_name)
Set an already-registered blueprint as default for this dataset.
def set_default_blueprint_partition_id(partition_id)
deprecated
Deprecated
Use set_default_blueprint_segment_id() instead
Set the default blueprint partition ID for this dataset.
def update(*, name=None)
Update this entry's properties.
| PARAMETER | DESCRIPTION |
|---|---|
| `name` | New name for the entry. |
class DatasetView
A filtered view over a dataset in the catalog.
A DatasetView provides lazy filtering over a dataset's segments and entity paths.
Filters are composed lazily and only applied when data is actually read.
Create a DatasetView by calling filter_segments() or filter_contents() on a
DatasetEntry.
Examples:
# Filter to specific segments
view = dataset.filter_segments(["recording_0", "recording_1"])
# Filter to specific entity paths
view = dataset.filter_contents(["/points/**"])
# Chain filters
view = dataset.filter_segments(["recording_0"]).filter_contents(["/points/**"])
# Read data
df = view.reader(index="timeline")
def __init__(internal)
Create a new DatasetView wrapper.
| PARAMETER | DESCRIPTION |
|---|---|
| `internal` | The internal Rust-side DatasetView object. |
def __repr__()
Return a string representation of the DatasetView.
def arrow_schema()
Return the filtered Arrow schema for this view.
| RETURNS | DESCRIPTION |
|---|---|
| `Schema` | The filtered Arrow schema. |
def download_segment(segment_id)
deprecated
def filter_contents(exprs)
Return a new DatasetView filtered to the given entity paths.
Entity path expressions support wildcards:
- "/points/**" matches all entities under /points
- "-/text/**" excludes all entities under /text
| PARAMETER | DESCRIPTION |
|---|---|
| `exprs` | Entity path expression or list of entity path expressions. |

| RETURNS | DESCRIPTION |
|---|---|
| `DatasetView` | A new view filtered to the matching entity paths. |
def filter_segments(segment_ids)
Return a new DatasetView filtered to the given segment IDs.
Filters are composed: if this view already has a segment filter, the result is the intersection of both filters.
| PARAMETER | DESCRIPTION |
|---|---|
| `segment_ids` | A segment ID string, a list of segment ID strings, or a DataFusion DataFrame with a column named 'rerun_segment_id'. |

| RETURNS | DESCRIPTION |
|---|---|
| `DatasetView` | A new view filtered to the given segments. |
def reader(index, *, include_semantically_empty_columns=False, include_tombstone_columns=False, fill_latest_at=False, using_index_values=None)
Create a reader over this DatasetView.
Returns a DataFusion DataFrame.
Server-side filters
The returned DataFrame supports server-side filtering on both the rerun_segment_id column and the index (timeline) column, which can greatly improve performance. For example, the following filters will be handled by the Rerun server.
view.reader(index="real_time").filter(col("rerun_segment_id") == "aabbccddee")
view.reader(index="real_time").filter(col("real_time") == "1234567890")
view.reader(index="real_time").filter(
(col("rerun_segment_id") == "aabbccddee") & (col("real_time") == "1234567890")
)
| PARAMETER | DESCRIPTION |
|---|---|
| `index` | The index (timeline) to use for the view. |
| `include_semantically_empty_columns` | Whether to include columns that are semantically empty. |
| `include_tombstone_columns` | Whether to include tombstone columns. |
| `fill_latest_at` | Whether to fill null values with the latest valid data. |
| `using_index_values` | If provided, specifies the exact index values to sample for all segments. Can be a numpy array (datetime64[ns] or int64), a pyarrow Array, or a sequence. |
| RETURNS | DESCRIPTION |
|---|---|
| `DataFrame` | A DataFusion DataFrame. |
def schema()
Return the filtered schema for this view.
The schema reflects any content filters applied to the view.
| RETURNS | DESCRIPTION |
|---|---|
| `Schema` | The filtered schema. |
def segment_ids()
Returns a list of segment IDs matching this view's filters.
def segment_table(join_meta=None, join_key='rerun_segment_id')
Return the segment table as a DataFusion DataFrame.
The segment table contains metadata about each segment in the dataset, including segment IDs, layer names, storage URLs, and size information.
Only segments matching this view's filters are included.
| PARAMETER | DESCRIPTION |
|---|---|
| `join_meta` | Optional metadata table or DataFrame to join with the segment table. |
| `join_key` | The column name to use for joining; defaults to "rerun_segment_id". Both the segment table and `join_meta` must contain this column. |

| RETURNS | DESCRIPTION |
|---|---|
| `DataFrame` | The segment metadata table, optionally joined with `join_meta`. |
class Entry
Bases: ABC, Generic[InternalEntryT]
An entry in the catalog.
catalog: CatalogClient
property
The catalog client that this entry belongs to.
created_at: datetime
property
The entry's creation date and time.
id: EntryId
property
The entry's id.
kind: EntryKind
property
The entry's kind.
name: str
property
The entry's name.
updated_at: datetime
property
The entry's last updated date and time.
def __eq__(other)
Compare this entry to another object.
Supports comparison with str and EntryId to enable the following patterns:
"entry_name" in client.entries()
entry_id in client.entries()
def delete()
Delete this entry from the catalog.
def update(*, name=None)
Update this entry's properties.
| PARAMETER | DESCRIPTION |
|---|---|
| `name` | New name for the entry. |
class EntryId
A unique identifier for an entry in the catalog.
def __init__(id)
Create a new EntryId from a string.
def __str__()
Return str(self).
class EntryKind
The kinds of entries that can be stored in the catalog.
def __int__()
Return int(self).
def __str__()
Return str(self).
class NotFoundError
Bases: Exception
Raised when the requested resource is not found.
class RegistrationHandle
Handle to track and wait on segment registration tasks.
def iter_results(timeout_secs=None)
Stream completed registrations as they finish.
Uses the server's streaming API to yield results as tasks complete. Each result is yielded exactly once when its task completes (success or error).
| PARAMETER | DESCRIPTION |
|---|---|
| `timeout_secs` | Timeout in seconds, or None to block. Note that using None doesn't guarantee that a TimeoutError will never eventually be raised for long-running tasks. Setting a timeout and polling is recommended for monitoring very large registration batches. |

| YIELDS | DESCRIPTION |
|---|---|
| `SegmentRegistrationResult` | The result of each completed registration. |

| RAISES | DESCRIPTION |
|---|---|
| `TimeoutError` | If the timeout is reached before all tasks complete. |
def wait(timeout_secs=None)
Block until all registrations complete.
| PARAMETER | DESCRIPTION |
|---|---|
| `timeout_secs` | Timeout in seconds, or None to block. Note that using None doesn't guarantee that a TimeoutError will never eventually be raised for long-running tasks. Setting a timeout and polling is recommended for monitoring very large registration batches. |

| RETURNS | DESCRIPTION |
|---|---|
| `RegistrationResult` | The result containing the list of segment IDs in registration order. |

| RAISES | DESCRIPTION |
|---|---|
| `ValueError` | If any registration fails. |
| `TimeoutError` | If the timeout is reached before all tasks complete. |
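A streaming sketch, assuming `handle` was returned by register() or register_prefix():

```python
# Consume registration results as they complete.
for result in handle.iter_results(timeout_secs=60):
    if result.is_success:
        print(f"registered {result.uri} as segment {result.segment_id}")
    else:
        print(f"failed to register {result.uri}: {result.error}")
```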
class SegmentRegistrationResult
dataclass
Result of a completed segment registration.
error: str | None
instance-attribute
Error message if registration failed, or None if successful.
is_error: bool
property
Returns True if the registration failed.
is_success: bool
property
Returns True if the registration was successful.
segment_id: str | None
instance-attribute
The resulting segment ID. May be None if registration failed.
uri: str
instance-attribute
The source URI that was registered.
class TableEntry
Bases: Entry[TableEntryInternal]
A table entry in the catalog.
Note: this object acts as a table provider for DataFusion.
catalog: CatalogClient
property
The catalog client that this entry belongs to.
created_at: datetime
property
The entry's creation date and time.
id: EntryId
property
The entry's id.
kind: EntryKind
property
The entry's kind.
name: str
property
The entry's name.
storage_url: str
property
The table's storage URL.
updated_at: datetime
property
The entry's last updated date and time.
def __datafusion_table_provider__()
Returns a DataFusion table provider capsule.
def __eq__(other)
Compare this entry to another object.
Supports comparison with str and EntryId to enable the following patterns:
"entry_name" in client.entries()
entry_id in client.entries()
def append(batches=None, **named_params)
Append to the Table.
| PARAMETER | DESCRIPTION |
|---|---|
| `batches` | Arrow data to append to the table. Can be a RecordBatchReader, a single RecordBatch, a list of RecordBatches, or a list of lists of RecordBatches. |
| `**named_params` | Each named parameter corresponds to a column in the table. |
def arrow_schema()
Returns the Arrow schema of the table.
def delete()
Delete this entry from the catalog.
def overwrite(batches=None, **named_params)
Overwrite the Table with new data.
| PARAMETER | DESCRIPTION |
|---|---|
| `batches` | Arrow data to overwrite the table with. Can be a RecordBatchReader, a single RecordBatch, a list of RecordBatches, or a list of lists of RecordBatches. |
| `**named_params` | Each named parameter corresponds to a column in the table. |
def reader()
Registers the table with the DataFusion context and returns a DataFrame.
def to_arrow_reader()
Convert this table to a pyarrow.RecordBatchReader.
def update(*, name=None)
Update this entry's properties.
| PARAMETER | DESCRIPTION |
|---|---|
| `name` | New name for the entry. |
def upsert(batches=None, **named_params)
Upsert data into the Table.
To use upsert, the table must contain a column with the metadata:
{"rerun:is_table_index" = "true"}
Any row with a matching index value will have the new data inserted. Any row without a matching index value will be appended as a new row.
| PARAMETER | DESCRIPTION |
|---|---|
| `batches` | Arrow data to upsert into the table. Can be a RecordBatchReader, a single RecordBatch, a list of RecordBatches, or a list of lists of RecordBatches. |
| `**named_params` | Each named parameter corresponds to a column in the table. |
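A write sketch against a hypothetical table with a `rerun_segment_id` index column and a `label` column:

```python
import pyarrow as pa

table = client.get_table("episode_labels")  # hypothetical table

# Append a record batch...
batch = pa.RecordBatch.from_pydict(
    {"rerun_segment_id": ["recording_0"], "label": ["good"]}
)
table.append(batch)

# ...or append columns via named parameters.
table.append(rerun_segment_id=["recording_1"], label=["bad"])

# Upsert overwrites rows whose index value matches, and appends the rest.
table.upsert(rerun_segment_id=["recording_0"], label=["revised"])
```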