# Cache Files
Data pipelines, by design, are optimized for moving data: done well, that means data laid out for sequential reads and good compression.
Training models, by contrast, requires data optimized for fast random access.
## The Transformation Process
When you upload a file to the platform, all data is read sequentially once and only once. During this single pass, the data is:
- Transformed into a format optimized for fast random access
- Validated with integrity checks for data quality issues
- Summarized with pre-computed statistics (roll-ups, standard deviation, min/max/mean) stored for instant access
This transformation happens automatically during upload, or can be manually triggered via the web app, CLI, or Python library.
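As a sketch of the statistics step, a single sequential pass can produce min, max, mean, and standard deviation at once; the function name is illustrative, and the real pipeline computes these alongside the format conversion, but the single-pass idea is the same (here via Welford's online algorithm):

```python
import math

def single_pass_stats(values):
    """One sequential pass over the data: min, max, mean, and
    population standard deviation via Welford's online algorithm."""
    count, mean, m2 = 0, 0.0, 0.0
    lo, hi = math.inf, -math.inf
    for x in values:
        count += 1
        lo, hi = min(lo, x), max(hi, x)
        delta = x - mean
        mean += delta / count          # running mean update
        m2 += delta * (x - mean)       # running sum of squared deviations
    std = math.sqrt(m2 / count) if count else 0.0
    return {"min": lo, "max": hi, "mean": mean, "std": std}

stats = single_pass_stats([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
# mean = 5.0, population std = 2.0
```

Because every statistic falls out of the same pass, the upload cost stays one sequential read regardless of how many summaries are stored.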
## HDF5 Structure
Raw data (CSV or compressed CSV) is converted to optimized HDF5 files whose structure supports multiple access patterns:
```
file.h5
├── timestamp[]        # Nanosecond timestamps for precise indexing
├── Price[]            # Primary data values
├── bars/
│   ├── index/
│   │   ├── reference  # First timestamp (anchor point)
│   │   ├── m[]        # Minute bar index
│   │   ├── h[]        # Hour bar index
│   │   └── D[]        # Day bar index
│   └── m[], h[], D[]  # Pre-computed OHLCV bars at each timeframe
└── accelerators/
    ├── p10[]          # Every 10 records: [max, min, time]
    ├── p100[]         # Every 100 records
    └── p1000[]        # Every 1000 records
```
## Accelerators: Skip-List Architecture
Many ML features need to answer questions like "when does price first cross threshold X?" In raw data, this requires scanning potentially millions of records—an O(n) operation that would make builds impossibly slow.
The accelerator system solves this with pre-computed min/max values at multiple granularities:
How it works:
- Check the p1000 accelerator: if the max in this 1000-record block is below the ceiling AND the min is above the floor, skip the entire block
- Check the p100 accelerator: skip 100 records at a time
- Check the p10 accelerator: skip 10 records at a time
- Walk record-by-record only in the final small block
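The skip logic above can be sketched in a few lines. This is a simplified model, not the platform's implementation: it searches for the first upward crossing only, so each accelerator level is reduced to a per-block max (the real accelerators also carry min and time), and all names are illustrative:

```python
import numpy as np

def block_max(values, size):
    """Pre-compute the per-block max for one accelerator level."""
    pad = (-len(values)) % size
    padded = np.pad(values, (0, pad), constant_values=-np.inf)
    return padded.reshape(-1, size).max(axis=1)

def first_above(prices, threshold, p10, p100, p1000):
    """First index where price exceeds threshold, skipping any block
    whose pre-computed max proves no crossing can occur inside it."""
    i, n = 0, len(prices)
    while i < n:
        if i % 1000 == 0 and p1000[i // 1000] <= threshold:
            i += 1000                       # skip a whole 1000-record block
        elif i % 100 == 0 and p100[i // 100] <= threshold:
            i += 100                        # skip 100 records
        elif i % 10 == 0 and p10[i // 10] <= threshold:
            i += 10                         # skip 10 records
        elif prices[i] > threshold:
            return i                        # record-by-record in the last block
        else:
            i += 1
    return -1

prices = np.arange(5000.0)  # toy monotone series
accels = {s: block_max(prices, s) for s in (10, 100, 1000)}
idx = first_above(prices, 3210.5, accels[10], accels[100], accels[1000])
# idx == 3211, the same answer a full O(n) scan would give
```

Because each level only skips a block when its pre-computed max rules out a crossing, the coarse levels never change the answer, only the amount of data touched.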
Result: O(n) searches become O(log n). Finding price crossings in 1.3 billion records takes milliseconds instead of minutes.
This architecture is why LIT can load 13 years of compressed trade data in under 20 milliseconds—the heavy computation happened once during upload, and every subsequent access benefits from the pre-computed structure.