Loading External Data

This notebook runs through an example of loading raw data (external to Nautilus) into the NautilusTrader ParquetDataCatalog , for use in backtesting.

The DataCatalog

The data catalog is a central store for Nautilus data, persisted in the Parquet file format.

We have chosen parquet as the storage format for the following reasons:

  • It performs much better than CSV/JSON/HDF5/etc in terms of compression ratio (storage size) and read performance

  • It does not require any separately running components (for example a database)

  • It is quick and simple to get up and running with

Getting some sample raw data

Before we start the notebook - as a once off we need to download some sample data for loading.

For this notebook we will use FX data from histdata.com , simply go to https://www.histdata.com/download-free-forex-historical-data/?/ascii/tick-data-quotes/ and select a Forex pair and one or more months of data to download.

Once you have downloaded the data, set the variable input_files below to the path containing the data. You can also use a glob to select multiple files, for example "~/Downloads/HISTDATA_COM_ASCII_AUDUSD_*.zip" .

import fsspec
fs = fsspec.filesystem("file")

input_files = "~/Downloads/HISTDATA_COM_ASCII_AUDUSD_T202001.zip"

Run the cell below; you should see the files that you downloaded:

# Simple check that the file path is correct
assert len(fs.glob(input_files)), f"Could not find files with {input_files=}"

Loading data via Reader classes

We can load data from various sources into the data catalog using helper methods in the nautilus_trader.persistence.external.readers module. The module contains methods for reading various data formats (CSV, JSON, text), minimising the amount of code required to get data loaded correctly into the data catalog.

There are a handful of readers available, some notes on when to use which:

  • CSVReader - use when your data is CSV (comma separated values) and has a header row. Each row of the data typically is one “entry” and is linked to the header.

  • TextReader - similar to CSVReader, however used when data may container multiple ‘entries’ per line. For example, JSON data with multiple order book or trade ticks in a single line. This data typically does not have a header row, and field names come from some external definition.

  • ParquetReader - for parquet files, will read chunks of the data and process similar to CSVReader .

Each of the Reader classes takes a line_parser or block_parser function, a user defined function to convert a line or block (chunk / multiple rows) of data into Nautilus object(s) (for example QuoteTick or TradeTick ).

Writing the parser function

The FX data from histdata is stored in CSV (plain text) format, with fields timestamp, bid_price, ask_price .

For this example, we will use the CSVReader class, where we need to manually pass a header (as the files do not contain one). The CSVReader has a couple of options, we’ll be setting chunked=False to process the data line-by-line, and as_dataframe=False to process the data as a string rather than DataFrame. See the API Reference for more details.

import datetime
import pandas as pd
from nautilus_trader.model.data.tick import QuoteTick
from nautilus_trader.model.objects import Price, Quantity
from nautilus_trader.core.datetime import dt_to_unix_nanos

def parser(data, instrument_id):
    """ Parser function for hist_data FX data, for use with CSV Reader """
    dt = pd.Timestamp(datetime.datetime.strptime(data['timestamp'].decode(), "%Y%m%d %H%M%S%f"), tz='UTC')
    yield QuoteTick(
        instrument_id=instrument_id,
        bid=Price.from_str(data['bid'].decode()),
        ask=Price.from_str(data['ask'].decode()),
        bid_size=Quantity.from_int(100_000),
        ask_size=Quantity.from_int(100_000),
        ts_event=dt_to_unix_nanos(dt),
        ts_init=dt_to_unix_nanos(dt),
    )

Creating a new DataCatalog

If a ParquetDataCatalog does not already exist, we can easily create one. Now that we have our parser function, we instantiate a ParquetDataCatalog (passing in a directory where to store the data, by default we will just use the current directory):

import os, shutil
CATALOG_PATH = os.getcwd() + "/catalog"

# Clear if it already exists, then create fresh
if os.path.exists(CATALOG_PATH):
    shutil.rmtree(CATALOG_PATH)
os.mkdir(CATALOG_PATH)
# Create an instance of the ParquetDataCatalog
from nautilus_trader.persistence.catalog import ParquetDataCatalog
catalog = ParquetDataCatalog(CATALOG_PATH)

Instruments

Nautilus needs to link market data to an instrument ID, and an instrument ID to an Instrument definition. This can be done at any time, although typically it makes sense when you are loading market data into the catalog.

For our example, Nautilus contains some helpers for creating FX pairs, which we will use. If however, you were adding data for financial or crypto markets, you would need to create (and add to the catalog) an instrument corresponding to that instrument ID. Definitions for other instruments (of various asset classes) can be found in nautilus_trader.model.instruments .

See Instruments for more details on creating other instruments.

from nautilus_trader.persistence.external.core import process_files, write_objects
from nautilus_trader.backtest.data.providers import TestInstrumentProvider

# Use nautilus test helpers to create a EUR/USD FX instrument for our purposes
instrument = TestInstrumentProvider.default_fx_ccy("EUR/USD")

We can now add our new instrument to the ParquetDataCatalog :

from nautilus_trader.persistence.external.core import write_objects

write_objects(catalog, [instrument])

And check its existence:

catalog.instruments()

Loading the files

One final note: our parsing function takes an instrument_id argument, as in our case with hist_data, however the actual file does not contain information about the instrument, only the file name does. In our instance, we would likely need to split our loading per FX pair, so we can determine which instrument we are loading. We will use a simple lambda function to pass our instrument ID to the parsing function.

We can now use the process_files function to load one or more files using our Reader class and parsing function as shown below. This function will loop over many files, as well as breaking up large files into chunks (protecting us from out of memory errors when reading large files) and save the results to the ParquetDataCatalog .

For the hist_data, it should take less than a minute or two to load each FX file (a progress bar will appear below):

from nautilus_trader.persistence.external.core import process_files
from nautilus_trader.persistence.external.readers import CSVReader


process_files(
    glob_path=input_files,
    reader=CSVReader(
        block_parser=lambda x: parser(x, instrument_id=instrument.id), 
        header=['timestamp', 'bid', 'ask', 'volume'],
        chunked=False, 
        as_dataframe=False,
    ),
    catalog=catalog,
)

Using the ParquetDataCatalog

Once data has been loaded into the catalog, the catalog instance can be used for loading data into the backtest engine, or simple for research purposes. It contains various methods to pull data from the catalog, such as quote_ticks , for example:

import pandas as pd
from nautilus_trader.core.datetime import dt_to_unix_nanos

start = dt_to_unix_nanos(pd.Timestamp("2020-01-01", tz="UTC"))
end =  dt_to_unix_nanos(pd.Timestamp("2020-01-02", tz="UTC"))

catalog.quote_ticks(start=start, end=end)

Finally, clean up the catalog

if os.path.exists(CATALOG_PATH):
    shutil.rmtree(CATALOG_PATH)