This notebook runs through an example of loading raw data (external to Nautilus) into the NautilusTrader ParquetDataCatalog, for use in backtesting.

## The DataCatalog

The data catalog is a central store for Nautilus data, persisted in the Parquet file format.

We have chosen Parquet as the storage format for the following reasons:

• It performs much better than CSV, JSON, HDF5 and similar formats in terms of compression ratio (storage size) and read performance

• It does not require any separately running components (for example a database)

• It is quick and simple to get up and running with

### Getting some sample raw data

For this notebook we will use FX data from histdata.com. Go to https://www.histdata.com/download-free-forex-historical-data/?/ascii/tick-data-quotes/ and select a Forex pair and one or more months of data to download.

Once you have downloaded the data, set the variable input_files below to the path containing the data. You can also use a glob to select multiple files, for example "~/Downloads/HISTDATA_COM_ASCII_AUDUSD_*.zip".

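import fsspec
fs = fsspec.filesystem("file")
# Example glob from the text above; point it at your own downloaded files
input_files = "~/Downloads/HISTDATA_COM_ASCII_AUDUSD_*.zip"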



Run the cell below to check that the path matches the files you downloaded:

# Simple check that the file path is correct
assert len(fs.glob(input_files)), f"Could not find files with {input_files=}"
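
To see how a glob pattern selects several monthly files at once, here is a minimal sketch that uses the standard library glob module as a stand-in for fs.glob. The file names are made up for illustration; they only mimic the HISTDATA_COM_ASCII_AUDUSD_*.zip pattern mentioned above, and matching_files is a hypothetical helper.

```python
import glob
import os
import tempfile


def matching_files(pattern):
    """Expand '~' and return the sorted list of paths matching the glob."""
    return sorted(glob.glob(os.path.expanduser(pattern)))


# Create a few empty placeholder files (hypothetical names) in a temp directory.
tmp_dir = tempfile.mkdtemp()
for name in (
    "HISTDATA_COM_ASCII_AUDUSD_T202001.zip",
    "HISTDATA_COM_ASCII_AUDUSD_T202002.zip",
    "HISTDATA_COM_ASCII_EURUSD_T202001.zip",
):
    open(os.path.join(tmp_dir, name), "w").close()

# Only the two AUDUSD files match; the EURUSD file is left out.
found = matching_files(os.path.join(tmp_dir, "HISTDATA_COM_ASCII_AUDUSD_*.zip"))
print([os.path.basename(p) for p in found])
```

An empty result from a check like this usually means the path or the pattern is wrong, which is what the assertion above guards against.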


We can load data from various sources into the data catalog using helper methods in the nautilus_trader.persistence.external.readers module. The module contains methods for reading various data formats (CSV, JSON, text), minimising the amount of code required to get data loaded correctly into the data catalog.

There are a handful of readers available. Some notes on when to use which:

• CSVReader - use when your data is CSV (comma-separated values) and has a header row. Each row of the data is typically one "entry" and is linked to the header.

• TextReader - similar to CSVReader, but used when the data may contain multiple "entries" per line. For example, JSON data with multiple order book or trade ticks in a single line. This data typically does not have a header row, and field names come from some external definition.

• ParquetReader - for Parquet files; reads chunks of the data and processes them similarly to CSVReader.

Each of the Reader classes takes a line_parser or block_parser function, a user-defined function that converts a line or block (chunk / multiple rows) of data into Nautilus object(s) (for example QuoteTick or TradeTick).
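
The relationship between a reader and its parser function can be sketched in plain Python. This is an illustration of the pattern only, not the Nautilus implementation: ToyCSVReader and toy_parser are hypothetical names, the sample rows are invented, and the parser yields plain tuples rather than Nautilus objects.

```python
import csv
import io


class ToyCSVReader:
    """Hypothetical stand-in for a reader: NOT the Nautilus implementation."""

    def __init__(self, line_parser, header):
        self.line_parser = line_parser
        self.header = header

    def parse(self, raw: bytes):
        """Yield every object the parser produces, one input line at a time."""
        text = io.StringIO(raw.decode())
        for row in csv.reader(text):
            if not row:
                continue  # skip blank lines
            # Link each value to its column name from the supplied header.
            record = dict(zip(self.header, row))
            yield from self.line_parser(record)


def toy_parser(record):
    """Convert one record into a simple (timestamp, bid, ask) tuple."""
    yield (record["timestamp"], float(record["bid"]), float(record["ask"]))


# Invented sample rows in a timestamp,bid,ask,volume layout.
raw = b"20200101 170000123,1.12115,1.12155,0\n20200101 170001456,1.12116,1.12157,0\n"
reader = ToyCSVReader(line_parser=toy_parser, header=["timestamp", "bid", "ask", "volume"])
ticks = list(reader.parse(raw))
print(ticks)
```

Writing the parser as a generator lets a single line produce zero, one, or many objects, which is what makes the same pattern usable for data with several entries per line.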

### Writing the parser function

The FX data from histdata is stored in CSV (plain text) format, with fields timestamp, bid_price, ask_price.

For this example, we will use the CSVReader class, where we need to manually pass a header (as the files do not contain one). The CSVReader has a couple of options: we will set chunked=False to process the data line by line, and as_dataframe=False to process the data as a string rather than a DataFrame. See the API Reference for more details.

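import datetime

import pandas as pd

# Import paths are assumed for releases that ship nautilus_trader.persistence.external
from nautilus_trader.core.datetime import dt_to_unix_nanos
from nautilus_trader.model.data.tick import QuoteTick
from nautilus_trader.model.objects import Price, Quantity


def parser(data, instrument_id):
    """ Parser function for hist_data FX data, for use with CSV Reader """
    dt = pd.Timestamp(datetime.datetime.strptime(data['timestamp'].decode(), "%Y%m%d %H%M%S%f"), tz='UTC')
    yield QuoteTick(
        instrument_id=instrument_id,
        bid=Price.from_str(data['bid'].decode()),
        ask=Price.from_str(data['ask'].decode()),  # assumes the header names this column 'ask'
        bid_size=Quantity.from_int(100_000),
        ask_size=Quantity.from_int(100_000),
        ts_event=dt_to_unix_nanos(dt),
        ts_init=dt_to_unix_nanos(dt),
    )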
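
The timestamp handling in the parser above can be checked in isolation. The following is a minimal sketch using only the standard library, so it does not call dt_to_unix_nanos itself; histdata_to_unix_nanos is a hypothetical helper and the sample timestamp is invented. Like the parser, it treats the timestamp as UTC.

```python
import datetime

HISTDATA_FORMAT = "%Y%m%d %H%M%S%f"  # same format string as in the parser above
EPOCH = datetime.datetime(1970, 1, 1, tzinfo=datetime.timezone.utc)


def histdata_to_unix_nanos(stamp: str) -> int:
    """Parse a timestamp string as UTC and return integer nanoseconds since the epoch."""
    naive = datetime.datetime.strptime(stamp, HISTDATA_FORMAT)
    aware = naive.replace(tzinfo=datetime.timezone.utc)
    # Integer division of timedeltas avoids the rounding error of float seconds.
    micros = (aware - EPOCH) // datetime.timedelta(microseconds=1)
    return micros * 1_000


# The trailing "123" is read by %f as a fraction of a second, i.e. 123 milliseconds.
print(histdata_to_unix_nanos("20200101 170000123"))  # 1577898000123000000
```

Note that %f pads short fractions on the right, so a three-digit suffix is interpreted as milliseconds rather than microseconds.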


### Creating a new DataCatalog

If a ParquetDataCatalog does not already exist, we can easily create one. Now that we have our parser function, we instantiate a ParquetDataCatalog, passing in the directory in which to store the data (by default we will just use the current directory):

import os
import shutil

from nautilus_trader.persistence.catalog import ParquetDataCatalog

CATALOG_PATH = os.path.join(os.getcwd(), "catalog")

# Clear if it already exists, then create fresh
if os.path.exists(CATALOG_PATH):
    shutil.rmtree(CATALOG_PATH)
os.mkdir(CATALOG_PATH)

# Create an instance of the ParquetDataCatalog
catalog = ParquetDataCatalog(CATALOG_PATH)


### Instruments

Nautilus needs to link market data to an instrument ID, and an instrument ID to an Instrument definition. This can be done at any time, although typically it makes sense to do it when you are loading market data into the catalog.

For our example, Nautilus contains some helpers for creating FX pairs, which we will use. If, however, you were adding data for other financial or crypto markets, you would need to create (and add to the catalog) an instrument corresponding to that instrument ID. Definitions for other instruments (of various asset classes) can be found in nautilus_trader.model.instruments.

See Instruments for more details on creating other instruments.

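from nautilus_trader.persistence.external.core import process_files, write_objects
# Import path assumed; older releases expose this helper under nautilus_trader.backtest.data.providers
from nautilus_trader.test_kit.providers import TestInstrumentProvider

# Use Nautilus test helpers to create a EUR/USD FX instrument for our purposes
instrument = TestInstrumentProvider.default_fx_ccy("EUR/USD")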


We can now add our new instrument to the ParquetDataCatalog:

from nautilus_trader.persistence.external.core import write_objects

write_objects(catalog, [instrument])


And check its existence:

catalog.instruments()


One final note: our parsing function takes an instrument_id argument because, as with the hist_data files, the file contents carry no information about the instrument; only the file name does. In practice we would likely split the loading per FX pair, so that we know which instrument each file belongs to. We will use a simple lambda function to pass our instrument ID to the parsing function.
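
Since only the file name identifies the pair, one way to split loading per FX pair is to derive the symbol from the name and bind it to the parser. This is a sketch under assumptions: the file-name pattern is inferred from the glob example earlier in this notebook, the T202001 suffix is invented, and pair_from_filename and toy_parser are hypothetical helpers. It uses functools.partial, which binds the argument in the same way the lambda does.

```python
import functools
import os
import re

# Assumed file-name pattern, based on the glob example earlier in this notebook.
PAIR_PATTERN = re.compile(r"HISTDATA_COM_ASCII_([A-Z]{3})([A-Z]{3})_")


def pair_from_filename(path: str) -> str:
    """Return a symbol such as 'AUD/USD' extracted from the file name."""
    match = PAIR_PATTERN.search(os.path.basename(path))
    if match is None:
        raise ValueError(f"Cannot determine FX pair from {path!r}")
    base, quote = match.groups()
    return f"{base}/{quote}"


def toy_parser(line, instrument_id):
    """Hypothetical parser that tags each line with its instrument."""
    yield (instrument_id, line)


symbol = pair_from_filename("~/Downloads/HISTDATA_COM_ASCII_AUDUSD_T202001.zip")
# Bind instrument_id once; the reader then only needs to pass the line.
bound_parser = functools.partial(toy_parser, instrument_id=symbol)
print(list(bound_parser("20200101 170000123,0.70150,0.70160,0")))
```

Raising an error for unrecognised names is a deliberate choice here: loading ticks under the wrong instrument would be harder to notice later than a failed import.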

We can now use the process_files function to load one or more files using our Reader class and parsing function, as shown below. This function loops over the files, breaks large files into chunks (protecting us from out-of-memory errors) and saves the results to the ParquetDataCatalog.

For the hist_data, it should take no more than a minute or two to load each FX file (a progress bar will appear below):

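from nautilus_trader.persistence.external.core import process_files
from nautilus_trader.persistence.external.readers import CSVReader

process_files(
    glob_path=input_files,
    reader=CSVReader(
        block_parser=lambda x: parser(x, instrument_id=instrument.id),
        # The files have no header row, so column names are supplied here.
        # Names are assumed from the keys used in parser(), plus a trailing volume column.
        header=["timestamp", "bid", "ask", "volume"],
        chunked=False,
        as_dataframe=False,
    ),
    catalog=catalog,
)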
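
The idea of breaking a large file into chunks can be sketched without Nautilus. The generator below is an illustration of the general technique, not how process_files is implemented; iter_line_blocks is a hypothetical name and the sample data is invented. It reads a fixed number of bytes at a time and only hands complete lines to the caller.

```python
import io


def iter_line_blocks(stream, chunk_size=64):
    """Yield blocks of complete lines, reading at most chunk_size bytes at a time."""
    leftover = b""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        buffer = leftover + chunk
        # Keep any trailing partial line for the next iteration.
        complete, newline, leftover = buffer.rpartition(b"\n")
        if newline:
            yield complete + newline
    if leftover:
        yield leftover  # final line without a trailing newline


# Ten invented rows, read 32 bytes at a time.
data = b"".join(b"row-%d,1.0,2.0\n" % i for i in range(10))
blocks = list(iter_line_blocks(io.BytesIO(data), chunk_size=32))
print(len(blocks), "blocks")
```

Because each block ends on a line boundary, a parser never sees half a row, and only a small part of the file needs to be held in memory at once.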


## Using the ParquetDataCatalog

Once data has been loaded into the catalog, the catalog instance can be used for loading data into the backtest engine, or simply for research purposes. It contains various methods to pull data from the catalog, such as quote_ticks, for example:

import pandas as pd

if os.path.exists(CATALOG_PATH):