Persistence

The persistence subpackage handles data storage and retrieval, mainly to support backtesting.

class BaseDataCatalog ( * args , ** kw )

Bases: ABC

Provides an abstract base class for a queryable data catalog.

class FeatherFile ( path , class_name )

Bases: NamedTuple

path : str

Alias for field number 0

class_name : str

Alias for field number 1

count ( value , / )

Return number of occurrences of value.

index ( value , start = 0 , stop = 9223372036854775807 , / )

Return first index of value.

Raises ValueError if the value is not present.

class ParquetDataCatalog ( * args , ** kw )

Bases: BaseDataCatalog

Provides a queryable data catalog persisted to files in Parquet (Arrow) format.

Parameters :
  • path ( PathLike [ str ] | str ) – The root path for this data catalog. Must exist and must be an absolute path.

  • fs_protocol ( str , default 'file' ) – The filesystem protocol used by fsspec to handle file operations. This determines how the data catalog interacts with storage, be it local filesystem, cloud storage, or others. Common protocols include ‘file’ for local storage, ‘s3’ for Amazon S3, and ‘gcs’ for Google Cloud Storage. If not provided, it defaults to ‘file’, meaning the catalog operates on the local filesystem.

  • fs_storage_options ( dict , optional ) – The storage options passed to the fsspec filesystem implementation.

  • min_rows_per_group ( int , default 0 ) – The minimum number of rows per group. When the value is greater than 0, the dataset writer will batch incoming data and only write the row groups to the disk when sufficient rows have accumulated.

  • max_rows_per_group ( int , default 5000 ) – The maximum number of rows per group. If the value is greater than 0, then the dataset writer may split up large incoming batches into multiple row groups. If this value is set, then min_rows_per_group should also be set; otherwise the dataset could end up with very small row groups.

  • show_query_paths ( bool , default False ) – If globbed query paths should be printed to stdout.

Warning

The data catalog is not threadsafe. Using it in a multithreaded environment can lead to unexpected behavior.

Notes

For more details about fsspec and its filesystem protocols, see https://filesystem-spec.readthedocs.io/en/latest/ .

classmethod from_env ( ) → ParquetDataCatalog

Create a data catalog instance by accessing the ‘NAUTILUS_PATH’ environment variable.

Returns :

ParquetDataCatalog

Raises :

OSError – If the ‘NAUTILUS_PATH’ environment variable is not set.
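The lookup performed by from_env() can be sketched in plain Python (a sketch of the documented behavior; the helper name is hypothetical and this is not the library's actual implementation):

```python
import os

def catalog_path_from_env() -> str:
    # Hypothetical helper mirroring the documented behavior of
    # ParquetDataCatalog.from_env(): read 'NAUTILUS_PATH' and raise
    # OSError when it is unset.
    path = os.environ.get("NAUTILUS_PATH")
    if path is None:
        raise OSError("'NAUTILUS_PATH' environment variable is not set")
    return path
```

With NAUTILUS_PATH exported in the shell, the catalog root resolves to that directory; otherwise construction fails loudly rather than silently writing somewhere unexpected.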

classmethod from_uri ( uri : str ) → ParquetDataCatalog

Create a data catalog instance from the given uri .

Parameters :

uri ( str ) – The URI string for the backing path.

Returns :

ParquetDataCatalog

write_data ( data : list [ nautilus_trader.core.data.Data | nautilus_trader.core.message.Event ] | list [ Union [ nautilus_trader.core.nautilus_pyo3.model.OrderBookDelta , nautilus_trader.core.nautilus_pyo3.model.OrderBookDepth10 , nautilus_trader.core.nautilus_pyo3.model.QuoteTick , nautilus_trader.core.nautilus_pyo3.model.TradeTick , nautilus_trader.core.nautilus_pyo3.model.Bar ] ] , basename_template : str = 'part-{i}' , ** kwargs : Any ) → None

Write the given data to the catalog.

The function categorizes the data by class name and, when applicable, by associated instrument ID, then delegates the actual writing to the write_chunk method.

Parameters :
  • data ( list [ Data | Event ] ) – The data or event objects to be written to the catalog.

  • basename_template ( str , default 'part-{i}' ) – A template string used to generate basenames of written data files. The token ‘{i}’ will be replaced with an automatically incremented integer as files are partitioned. If not specified, it defaults to ‘part-{i}’ + the default extension ‘.parquet’.

  • kwargs ( Any ) – Additional keyword arguments to be passed to the write_chunk method.

Warning

Any existing data under the same filename will be overwritten. If a basename_template is not provided, it is very likely that existing data for the data type and instrument ID will be overwritten. To prevent data loss, ensure that the basename_template (or the default naming scheme) generates unique filenames for different data sets.

Notes

  • All data of the same type is expected to be monotonically increasing, or non-decreasing

  • The data is sorted and grouped based on its class name and instrument ID (if applicable) before writing

  • Instrument-specific data should have either an instrument_id attribute or be an instance of Instrument

  • The Bar class is treated as a special case, being grouped based on its bar_type attribute

  • The input data list must be non-empty, and all data items must be of the appropriate class type

Raises :

ValueError – If data of the same type is not monotonically increasing (or non-decreasing) based on ts_init .
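Because write_data raises ValueError on out-of-order data, the ordering can be validated up front. A minimal sketch of such a check (a hypothetical helper, not part of the catalog API):

```python
def is_non_decreasing(ts_inits: list[int]) -> bool:
    # True when each ts_init is >= its predecessor, i.e. the data
    # satisfies the catalog's non-decreasing ordering requirement.
    return all(a <= b for a, b in zip(ts_inits, ts_inits[1:]))
```

Sorting the input by ts_init before calling write_data avoids the ValueError entirely.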

class BarDataWrangler ( BarType bar_type , Instrument instrument )

Bases: object

Provides a means of building lists of Nautilus Bar objects.

Parameters :
  • bar_type ( BarType ) – The bar type for the wrangler.

  • instrument ( Instrument ) – The instrument for the wrangler.

process ( self , data: pd.DataFrame , double default_volume: float = 1000000.0 , int ts_init_delta: int = 0 )

Process the given bar dataset into Nautilus Bar objects.

Expects columns [‘open’, ‘high’, ‘low’, ‘close’, ‘volume’] with a ‘timestamp’ index. Note: The ‘volume’ column is optional; if it is not present, the default_volume will be used.

Parameters :
  • data ( pd.DataFrame ) – The data to process.

  • default_volume ( float ) – The default volume for each bar (if not provided).

  • ts_init_delta ( int ) – The difference in nanoseconds between the data timestamps and the ts_init value. Can be used to represent/simulate latency between the data source and the Nautilus system.

Returns :

list[Bar]

Raises :

ValueError – If data is empty.

class OrderBookDeltaDataWrangler ( Instrument instrument )

Bases: object

Provides a means of building lists of Nautilus OrderBookDelta objects.

Parameters :

instrument ( Instrument ) – The instrument for the data wrangler.

process ( self , data: pd.DataFrame , int ts_init_delta: int = 0 , bool is_raw=False )

Process the given order book dataset into Nautilus OrderBookDelta objects.

Parameters :
  • data ( pd.DataFrame ) – The data to process.

  • ts_init_delta ( int ) – The difference in nanoseconds between the data timestamps and the ts_init value. Can be used to represent/simulate latency between the data source and the Nautilus system.

  • is_raw ( bool , default False ) – If the data is scaled to the Nautilus fixed precision.

Raises :

ValueError – If data is empty.

class QuoteTickDataWrangler ( Instrument instrument )

Bases: object

Provides a means of building lists of Nautilus QuoteTick objects.

Parameters :

instrument ( Instrument ) – The instrument for the data wrangler.

process ( self , data: pd.DataFrame , double default_volume: float = 1000000.0 , int ts_init_delta: int = 0 )

Process the given tick dataset into Nautilus QuoteTick objects.

Expects columns [‘bid_price’, ‘ask_price’] with a ‘timestamp’ index. Note: The ‘bid_size’ and ‘ask_size’ columns are optional; if not present, the default_volume will be used.

Parameters :
  • data ( pd.DataFrame ) – The tick data to process.

  • default_volume ( float ) – The default volume for each tick (if not provided).

  • ts_init_delta ( int ) – The difference in nanoseconds between the data timestamps and the ts_init value. Can be used to represent/simulate latency between the data source and the Nautilus system. Cannot be negative.

Returns :

list[QuoteTick]

process_bar_data ( self , bid_data: pd.DataFrame , ask_data: pd.DataFrame , double default_volume: float = 1000000.0 , int ts_init_delta: int = 0 , int offset_interval_ms: int = 100 , bool timestamp_is_close: bool = True , random_seed: int | None = None , bool is_raw: bool = False , bool sort_data: bool = True )

Process the given bar datasets into Nautilus QuoteTick objects.

Expects columns [‘open’, ‘high’, ‘low’, ‘close’, ‘volume’] with a ‘timestamp’ index. Note: The ‘volume’ column is optional; if not present, the default_volume will be used.

Parameters :
  • bid_data ( pd.DataFrame ) – The bid bar data.

  • ask_data ( pd.DataFrame ) – The ask bar data.

  • default_volume ( float ) – The volume per tick if not available from the data.

  • ts_init_delta ( int ) – The difference in nanoseconds between the data timestamps and the ts_init value. Can be used to represent/simulate latency between the data source and the Nautilus system.

  • offset_interval_ms ( int , default 100 ) – The number of milliseconds to offset each tick from the bar timestamp. If timestamp_is_close is True, negative offsets are used; otherwise positive offsets are used (see also timestamp_is_close ).

  • random_seed ( int , optional ) – The random seed for shuffling order of high and low ticks from bar data. If random_seed is None then won’t shuffle.

  • is_raw ( bool , default False ) – If the data is scaled to the Nautilus fixed precision.

  • timestamp_is_close ( bool , default True ) – If bar timestamps are at the close. If True then open, high, low timestamps are offset before the close timestamp. If False then high, low, close timestamps are offset after the open timestamp.

  • sort_data ( bool , default True ) – If the data should be sorted by timestamp.

class TradeTickDataWrangler ( Instrument instrument )

Bases: object

Provides a means of building lists of Nautilus TradeTick objects.

Parameters :

instrument ( Instrument ) – The instrument for the data wrangler.

process ( self , data: pd.DataFrame , int ts_init_delta: int = 0 , bool is_raw=False )

Process the given trade tick dataset into Nautilus TradeTick objects.

Parameters :
  • data ( pd.DataFrame ) – The data to process.

  • ts_init_delta ( int ) – The difference in nanoseconds between the data timestamps and the ts_init value. Can be used to represent/simulate latency between the data source and the Nautilus system.

  • is_raw ( bool , default False ) – If the data is scaled to the Nautilus fixed precision.

Raises :

ValueError – If data is empty.

process_bar_data ( self , data: pd.DataFrame , int ts_init_delta: int = 0 , int offset_interval_ms: int = 100 , bool timestamp_is_close: bool = True , random_seed: int | None = None , bool is_raw: bool = False , bool sort_data: bool = True )

Process the given bar dataset into Nautilus TradeTick objects.

Expects columns [‘open’, ‘high’, ‘low’, ‘close’, ‘volume’] with a ‘timestamp’ index. Note: The ‘volume’ column is optional.

Parameters :
  • data ( pd.DataFrame ) – The trade bar data.

  • ts_init_delta ( int ) – The difference in nanoseconds between the data timestamps and the ts_init value. Can be used to represent/simulate latency between the data source and the Nautilus system.

  • offset_interval_ms ( int , default 100 ) – The number of milliseconds to offset each tick from the bar timestamp. If timestamp_is_close is True, negative offsets are used; otherwise positive offsets are used (see also timestamp_is_close ).

  • random_seed ( int , optional ) – The random seed for shuffling order of high and low ticks from bar data. If random_seed is None then won’t shuffle.

  • is_raw ( bool , default False ) – If the data is scaled to the Nautilus fixed precision.

  • timestamp_is_close ( bool , default True ) – If bar timestamps are at the close. If True then open, high, low timestamps are offset before the close timestamp. If False then high, low, close timestamps are offset after the open timestamp.

  • sort_data ( bool , default True ) – If the data should be sorted by timestamp.

align_bid_ask_bar_data ( bid_data : pd.DataFrame , ask_data : pd.DataFrame )

Merge bid and ask data into a single DataFrame with prefixed column names.

Parameters :
  • bid_data ( pd.DataFrame ) – The DataFrame containing bid data.

  • ask_data ( pd.DataFrame ) – The DataFrame containing ask data.

Returns :

pd.DataFrame – A merged DataFrame with columns prefixed by ‘bid_’ for bid data and ‘ask_’ for ask data, joined on their indexes.
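The prefix-and-join behavior can be illustrated with plain dictionaries keyed by timestamp (a simplified sketch; the actual function joins pandas DataFrames on their indexes, and the join semantics shown here, keeping only shared timestamps, are an assumption):

```python
def align_bid_ask(bid_rows: dict, ask_rows: dict) -> dict:
    # Keep only timestamps present on both sides (an inner join on the
    # index, assumed for this sketch) and prefix each column name with
    # 'bid_' or 'ask_'.
    merged = {}
    for ts in sorted(bid_rows.keys() & ask_rows.keys()):
        row = {f"bid_{k}": v for k, v in bid_rows[ts].items()}
        row.update({f"ask_{k}": v for k, v in ask_rows[ts].items()})
        merged[ts] = row
    return merged
```

Prefixing keeps the bid and ask columns distinguishable after the merge, which is what allows downstream quote construction to read both sides from a single frame.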

calculate_bar_price_offsets ( num_records , timestamp_is_close: bool , int offset_interval_ms: int , random_seed=None )

Calculate and potentially randomize the time offsets for bar prices based on the closeness of the timestamp.

Parameters :
  • num_records ( int ) – The number of records for which offsets are to be generated.

  • timestamp_is_close ( bool ) – A flag indicating whether the timestamp is close to the trading time.

  • offset_interval_ms ( int ) – The offset interval in milliseconds to be applied.

  • random_seed ( Optional [ int ] ) – The seed for random number generation to ensure reproducibility.

Returns :

dict – A dictionary with arrays of offsets for open, high, low, and close prices. If random_seed is provided, the high and low offsets are randomized.
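A simplified sketch of this offset scheme, using plain millisecond integers instead of numpy timedelta arrays (the specific offset multiples of offset_interval_ms are illustrative assumptions, not the library's exact values):

```python
import random

def bar_price_offsets(num_records, timestamp_is_close, offset_interval_ms,
                      random_seed=None):
    # Sketch: with close-stamped bars, the open/high/low prices are
    # offset backwards in time; with open-stamped bars, the high/low/close
    # prices are offset forwards. When a seed is given, the high and low
    # offsets are shuffled per record for reproducible randomization.
    rng = random.Random(random_seed)
    if timestamp_is_close:
        open_off, close_off = -3 * offset_interval_ms, 0
        hl = [-2 * offset_interval_ms, -1 * offset_interval_ms]
    else:
        open_off, close_off = 0, 3 * offset_interval_ms
        hl = [1 * offset_interval_ms, 2 * offset_interval_ms]
    offsets = {"open": [], "high": [], "low": [], "close": []}
    for _ in range(num_records):
        if random_seed is not None:
            high_off, low_off = rng.sample(hl, 2)
        else:
            high_off, low_off = hl[0], hl[1]
        offsets["open"].append(open_off)
        offsets["high"].append(high_off)
        offsets["low"].append(low_off)
        offsets["close"].append(close_off)
    return offsets
```

The point of the offsets is to give the four synthetic ticks derived from each bar distinct, correctly ordered timestamps.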

calculate_volume_quarter ( volume: np.ndarray , int precision: int )

Convert raw volume data to quarter precision.

Parameters :
  • volume ( np.ndarray ) – An array of volume data to be processed.

  • precision ( int ) – The decimal precision to which the volume data is rounded, adjusted by subtracting 9.

Returns :

np.ndarray – The volume data adjusted to quarter precision.

prepare_event_and_init_timestamps ( index: pd.DatetimeIndex , int ts_init_delta: int )

preprocess_bar_data ( data : pd.DataFrame , is_raw : bool )

Preprocess financial bar data to a standardized format.

Ensures the DataFrame index is labeled as “timestamp”, converts the index to UTC, removes time zone awareness, drops rows with NaN values in critical columns, and optionally scales the data.

Parameters :
  • data ( pd.DataFrame ) – The input DataFrame containing financial bar data.

  • is_raw ( bool ) – If the data is already scaled to the Nautilus fixed precision. If False, the data is scaled by 1e9.

Returns :

pd.DataFrame – The preprocessed DataFrame with a cleaned and standardized structure.

class StreamingFeatherWriter ( path : str , fs_protocol : str | None = 'file' , flush_interval_ms : int | None = None , replace : bool = False , include_types : list [ type ] | None = None )

Bases: object

Provides a stream writer of Nautilus objects into feather files.

Parameters :
  • path ( str ) – The path to persist the stream to.

  • fs_protocol ( str , default 'file' ) – The fsspec file system protocol.

  • flush_interval_ms ( int , optional ) – The flush interval (milliseconds) for writing chunks.

  • replace ( bool , default False ) – If existing files at the given path should be replaced.

  • include_types ( list [ type ] , optional ) – A list of Arrow serializable types to write. If this is specified then only the included types will be written.

property is_closed : bool

Return whether all file streams are closed.

Returns :

bool

write ( obj : object ) → None

Write the object to the stream.

Parameters :

obj ( object ) – The object to write.

Raises :

ValueError – If obj is None .

check_flush ( ) → None

Flush all stream writers if the current time exceeds the next flush interval.

flush ( ) → None

Flush all stream writers.

close ( ) → None

Flush and close all stream writers.
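The flush-interval mechanism can be sketched with a monotonic clock (a hypothetical stand-in class, not the writer's actual implementation):

```python
import time

class IntervalFlusher:
    # Sketch: flush only when flush_interval_ms has elapsed since the
    # previous flush, mirroring the behavior described for check_flush().
    def __init__(self, flush_interval_ms: int):
        self.flush_interval_s = flush_interval_ms / 1000.0
        self._last_flush = time.monotonic()
        self.flush_count = 0

    def check_flush(self) -> None:
        now = time.monotonic()
        if now - self._last_flush >= self.flush_interval_s:
            self.flush()
            self._last_flush = now

    def flush(self) -> None:
        # A real writer would flush every open stream here; the counter
        # stands in for that side effect.
        self.flush_count += 1
```

Batching flushes this way amortizes I/O across many small writes instead of hitting the filesystem on every object.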

generate_signal_class ( name : str , value_type : type ) → type

Dynamically create a Data subclass for this signal.

Parameters :
  • name ( str ) – The name of the signal data.

  • value_type ( type ) – The type for the signal data value.

Returns :

SignalData
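Dynamic creation of such a subclass can be illustrated with the built-in type() (a sketch assuming a minimal stand-in Data base class; the real SignalData also carries Nautilus timestamp fields, which are omitted here):

```python
class Data:
    # Stand-in for nautilus_trader's Data base class (an assumption
    # made so the sketch is self-contained).
    pass

def generate_signal_class(name: str, value_type: type) -> type:
    # Build a Data subclass named 'Signal<Name>' whose constructor
    # coerces and stores a single typed value.
    def __init__(self, value):
        self.value = value_type(value)

    return type(f"Signal{name.title()}", (Data,), {"__init__": __init__})
```

For example, generate_signal_class("alpha", float) yields a SignalAlpha class whose instances are Data objects holding a float value.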