# Cache Files
Data pipelines, by design, are optimized for moving data: done well, that means data laid out for sequential reads and good compression.
Training models, by contrast, requires data optimized for fast random access.
## The Transformation Process
When you upload a file to the platform, all data is read sequentially once and only once. During this single pass, the data is:
- Transformed into a format optimized for fast random access
- Validated with integrity checks for data quality issues
- Summarized with pre-computed statistics (roll-ups, standard deviation, min/max/mean) stored for instant access
This transformation happens automatically during upload, or can be manually triggered via the web app, CLI, or Python library.
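As a sketch of the statistics step, a single sequential pass can produce min, max, mean, and standard deviation at once; the function name is illustrative, and the real pipeline computes these alongside the format conversion, but the single-pass idea is the same (here via Welford's online algorithm):

```python
import math

def single_pass_stats(values):
    """One sequential pass over the data: min, max, mean, and
    population standard deviation via Welford's online algorithm."""
    count, mean, m2 = 0, 0.0, 0.0
    lo, hi = math.inf, -math.inf
    for x in values:
        count += 1
        lo, hi = min(lo, x), max(hi, x)
        delta = x - mean
        mean += delta / count          # running mean update
        m2 += delta * (x - mean)       # running sum of squared deviations
    std = math.sqrt(m2 / count) if count else 0.0
    return {"min": lo, "max": hi, "mean": mean, "std": std}

stats = single_pass_stats([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
# mean = 5.0, population std = 2.0
```

Because every statistic falls out of the same pass, the upload cost stays one sequential read regardless of how many summaries are stored.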
## HDF5 Structure
Raw data (CSV or compressed CSV) is converted to optimized HDF5 files whose structure supports multiple access patterns:
```
file.h5
├── timestamp[]        # Nanosecond timestamps for precise indexing
├── Price[]            # Primary data values
├── bars/
│   ├── index/
│   │   ├── reference  # First timestamp (anchor point)
│   │   ├── m[]        # Minute bar index
│   │   ├── h[]        # Hour bar index
│   │   └── D[]        # Day bar index
│   └── m[], h[], D[]  # Pre-computed OHLCV bars at each timeframe
└── accelerators/
    ├── p10[]          # Every 10 records: [max, min, time]
    ├── p100[]         # Every 100 records
    └── p1000[]        # Every 1000 records
```
## Accelerators: Skip-List Architecture
Many ML features need to answer questions like "when does price first cross threshold X?" In raw data, this requires scanning potentially millions of records—an O(n) operation that would make builds impossibly slow.
The accelerator system solves this with pre-computed min/max values at multiple granularities:
How it works:
- Check the p1000 accelerator: if the max in this 1000-record block is below the ceiling AND the min is above the floor, skip the entire block
- Check the p100 accelerator: skip 100 records at a time
- Check the p10 accelerator: skip 10 records at a time
- Walk record-by-record only in the final small block
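The skip logic above can be sketched in a few lines. This is a simplified model, not the platform's implementation: it searches for the first upward crossing only, so each accelerator level is reduced to a per-block max (the real accelerators also carry min and time), and all names are illustrative:

```python
import numpy as np

def block_max(values, size):
    """Pre-compute the per-block max for one accelerator level."""
    pad = (-len(values)) % size
    padded = np.pad(values, (0, pad), constant_values=-np.inf)
    return padded.reshape(-1, size).max(axis=1)

def first_above(prices, threshold, p10, p100, p1000):
    """First index where price exceeds threshold, skipping any block
    whose pre-computed max proves no crossing can occur inside it."""
    i, n = 0, len(prices)
    while i < n:
        if i % 1000 == 0 and p1000[i // 1000] <= threshold:
            i += 1000                       # skip a whole 1000-record block
        elif i % 100 == 0 and p100[i // 100] <= threshold:
            i += 100                        # skip 100 records
        elif i % 10 == 0 and p10[i // 10] <= threshold:
            i += 10                         # skip 10 records
        elif prices[i] > threshold:
            return i                        # record-by-record in the last block
        else:
            i += 1
    return -1

prices = np.arange(5000.0)  # toy monotone series
accels = {s: block_max(prices, s) for s in (10, 100, 1000)}
idx = first_above(prices, 3210.5, accels[10], accels[100], accels[1000])
# idx == 3211, the same answer a full O(n) scan would give
```

Because each level only skips a block when its pre-computed max rules out a crossing, the coarse levels never change the answer, only the amount of data touched.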
Result: O(n) searches become O(log n). Finding price crossings in 1.3 billion records takes milliseconds instead of minutes.
This architecture is why LIT can load 13 years of compressed trade data in under 20 milliseconds—the heavy computation happened once during upload, and every subsequent access benefits from the pre-computed structure.