
Vibe Data Science

Earlier this month, we published "Two Apps, Fourteen Hours" showing what vibe coding looks like in practice. We hinted we were working on something else.

This is that something else: vibe data science.

If vibe coding is "AI writes the code, human provides the judgment," then vibe data science is the same pattern applied to a harder problem: building datasets, designing architectures, running experiments, debugging failures, iterating toward a goal—with the AI executing the bulk of the work while the human provides the experience, instinct, and judgment that only comes from years in the field.

This is new. And it changes what a small team can accomplish.

What follows isn't a demo or a proof-of-concept. It's not the MNIST tutorial version of data science. Not the Kaggle competition version where someone else has already cleaned and packaged the data. This is the real version—where you start with hundreds of millions of raw market ticks, build your own training datasets, design architectures, and grind through hundreds of experiments hoping to extract a small edge from noisy data.

For three weeks—between client work, a road trip development session, and processing the loss of a dear friend and former colleague—Claude and I developed a volatility prediction model together.

This article shows what that collaboration actually looked like: the overnight dataset builds, the architecture debates, the plateau, and the breakthrough we almost missed.


The Problem: Predicting Volatility

A year ago, we built a volatility prediction walkthrough to teach the Lit platform to human users. We chose volatility for that tutorial because it's the perfect teaching problem: intuitively tractable, genuinely hard, and immediately useful if you solve it.

This time, instead of teaching a human, we set out to teach Claude. Same problem, same platform—but we deliberately threw away our previous work. No referencing old notes or trained models. We started from scratch: raw tick data, blank canvas, no shortcuts. Fresh eyes, fresh collaboration.

We also chose a different success metric: AUC instead of precision. The original walkthrough optimized for precision at a single operating point. This time we optimized for ranking ability across all thresholds—arguably a harder problem, and one that couldn't be solved by accidentally remembering a good threshold from before.

Sidebar: Why AUC?

Simple metrics are misleading with imbalanced classes.

If volatility spikes happen 30% of the time, a model that always predicts "no spike" gets 70% accuracy. Sounds good. But it has zero predictive value—it can't distinguish anything.

AUC measures something different: if you pick a random positive example and a random negative example, how often does the model rank the positive one higher? A random model gets 0.50 (coin flip). A perfect model gets 1.0.

Why 0.60? Thresholds are arbitrary—humans draw lines because humans need lines. But 0.60 isn't random. At 0.60 AUC, the model correctly ranks spike vs non-spike hours 60% of the time. That's a 20% improvement over guessing (0.50). In trading, edges compound. A 10% edge applied consistently beats a 50% edge applied once.
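The ranking definition translates directly into code. A minimal sketch in plain Python (illustrative only—real evaluations use a library implementation such as scikit-learn's roc_auc_score):

```python
def auc_by_ranking(scores, labels):
    """AUC = P(a random positive outscores a random negative); ties count half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [1] * 30 + [0] * 70        # 30% positive class, as in our data
always_no_spike = [0.0] * 100       # the "70% accurate" constant model
print(auc_by_ranking(always_no_spike, labels))   # 0.5 -- zero ranking ability
```

The constant model that scores 70% accuracy lands at exactly 0.50 AUC—the metric sees straight through it.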

Why volatility works as a test case:

Markets aren't random. Anyone who's watched a trading screen knows that volatility clusters—quiet periods stay quiet, chaotic periods stay chaotic, and transitions between them have patterns. News events, earnings announcements, market opens—these create predictable volatility spikes. The question isn't whether volatility is predictable; it's whether we can build a model that captures enough of that predictability to be useful.

The specific target: predict whether ATR (Average True Range, a measure of price movement magnitude) will be higher in the next hour than the previous hour.
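For readers new to the indicator, here's a minimal ATR sketch—a simple moving average of the True Range. (Wilder's original formulation uses exponential-style smoothing, and real bars carry open and volume too; this is just the core idea.)

```python
def true_range(high, low, prev_close):
    """Largest of: the bar's own range, and the gap up/down from the prior close."""
    return max(high - low, abs(high - prev_close), abs(low - prev_close))

def atr(bars, period=14):
    """ATR over (high, low, close) bars, as a simple average of True Ranges."""
    trs = [true_range(h, l, bars[i - 1][2])
           for i, (h, l, c) in enumerate(bars) if i > 0]
    return sum(trs[-period:]) / min(period, len(trs))

bars = [(10.0, 9.0, 9.5), (11.0, 10.0, 10.5), (12.0, 11.0, 11.5)]
print(atr(bars))   # 1.5
```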

Candlestick chart with ATR overlay showing volatility periods

Why This Is Hard

Lookahead bias. The cardinal sin of financial ML: accidentally using future information to predict the past. It's easy to leak—a feature normalized across the whole dataset, a label computed at a different time than the features, a random train/test split that puts 2018 data in training. The model learns to "predict" things it's already seen.

Non-stationary data. Markets evolve. The patterns that predicted volatility in 2015 might not work in 2023. Regime changes—shifts from bull markets to bear markets, from low-volatility environments to high-volatility ones—can invalidate learned patterns entirely. A model trained on calm markets may fail spectacularly during a crisis.

Low signal-to-noise ratio. Most price movements are noise. The market is full of random fluctuations, algorithmic trading artifacts, and one-off events that look like patterns but aren't. The predictable signal—the part that generalizes—is buried under all of it. Overfitting is the constant enemy.

Class imbalance. Volatility spikes (our positive class) happen about 30% of the time. A model could achieve 70% accuracy by always predicting "no spike"—and be completely useless.


Building the Dataset

Before you can train a model, you need training data.

From Raw Ticks to Training Samples

Our raw data: years of tick-by-tick market data from LSEG. Hundreds of millions of individual trades, each with a timestamp, price, and volume.

ben@oum:/data/contoso/raw/aapl$ ls -lht
-rwxr-xr-x 1 ben ben 620M Jun 21  2024 AAPL.O-2018.csv.gz
-rwxr-xr-x 1 ben ben 402M Jun 21  2024 AAPL.O-2017.csv.gz
-rwxr-xr-x 1 ben ben 497M Jun 21  2024 AAPL.O-2016.csv.gz
-rwxr-xr-x 1 ben ben 579M Jun 21  2024 AAPL.O-2015.csv.gz
-rwxr-xr-x 1 ben ben 414M Jun 21  2024 AAPL.O-2014.csv.gz
-rwxr-xr-x 1 ben ben 215M Jun 21  2024 AAPL.O-2013.csv.gz

Here's what one trade looks like:

>>> df.iloc[0]
#RIC                                                               AAPL.O
Domain                                                       Market Price
Date-Time                                  2019-01-01T00:00:23.123221189Z
GMT Offset                                                             -5
Type                                                                Trade
Price                                                              157.92
Volume                                                              100.0
Qualifiers              132[IRGCOND];[GV3_FLAG];X[GV3_TEXT];   [PRC_QL...
Exch Time                                              00:00:23.107000000

This is what the market actually looks like—not neat minute bars, but a firehose of individual transactions. That's one trade: 100 shares of AAPL at $157.92, timestamped to the nanosecond. The CSV has 10,000 rows just for part of one day. We have years of this. Hundreds of millions of rows.

In [1]: from lit.data.loader import load_adapter_by_dataset

In [2]: adapter = load_adapter_by_dataset("contoso", "aapl")

In [3]: f"{len(adapter):,}"
Out[3]: '858,974,512'

We needed to transform this into training samples. Each sample should contain:

  • Features: The last 100 one-minute OHLCV bars, the last 100 ATR values
  • Label: Will ATR be higher in the next hour than the previous hour?

Ben: Take a look at atr_up_next_hour.py. That's our label.

Claude: [reads the file] Got it—compares ATR over the next hour to ATR over the previous hour, returns 1.0 if it's going up. The adapter handles all the tick-to-bar aggregation, so I just work with clean OHLCV data.

Actually—why limit ourselves to one hour? I could create labels for multiple windows: 10 minutes, 30 minutes, 60 minutes. Same pattern, different horizons. Then we see which timeframe has the most learnable signal.

Ben: Do it.

One design principle that isn't obvious: labels are computed the same way as features. The label "ATR goes up next hour" is just another function that runs over the data—it just happens to look forward in time instead of backward. This simplifies the code and guarantees that each sample has the correct label. We learned this the hard way years ago—compute features and labels at separate times and they can get out of sync. Same machinery, same moment, no misalignment.
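The actual atr_up_next_hour.py runs against the adapter's bar stream, but the "label is just another function, looking forward instead of backward" idea is easy to show self-contained. A sketch with a simple volatility proxy (the function name and proxy are ours for illustration, not the real implementation):

```python
def label_vol_up_next_hour(abs_moves, i):
    """1.0 if the mean absolute 1-minute move over the NEXT 60 minutes exceeds
    the mean over the PREVIOUS 60 minutes. Same data, same moment as the
    features -- the window just points forward instead of backward."""
    prev_window = abs_moves[i - 60:i]
    next_window = abs_moves[i:i + 60]
    mean = lambda w: sum(w) / len(w)
    return 1.0 if mean(next_window) > mean(prev_window) else 0.0

moves = [0.01] * 60 + [0.05] * 60      # a calm hour, then a volatile hour
print(label_vol_up_next_hour(moves, 60))   # 1.0
```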

The transformation isn't trivial. We need to:

  1. Aggregate ticks into minute bars (handling gaps, market closes, anomalies)
  2. Compute rolling ATR over those bars
  3. Look forward in time to compute the label (was ATR higher in the next hour?)
  4. Filter out samples where the data is incomplete or anomalous
  5. Split by time period—training data from 2013-2017, validation from 2018

Sidebar: Why You Can't Shuffle Time Series

That last point matters: you can't shuffle time series data.

Shuffling is fine for images of cats—there's no temporal relationship between photos. But financial data has a time axis. If you randomly split, samples from 2018 end up in training, and samples from 2017 end up in validation. Your model trains on the future and validates on the past. It learns to "predict" things it's already seen.

We split temporally: train on 2013-2017, validate on 2018. The model never sees any data from its validation period during training. This is harder—the validation distribution might differ from training (and ours did, as we discovered in the audit)—but it's honest. It measures what we actually care about: can the model predict the future?
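The split itself is one line of filtering on the time axis. A sketch (the real build does this inside the schema; the field name here is illustrative):

```python
def temporal_split(samples, cutoff=2018):
    """Strictly-before-cutoff trains; cutoff-and-after validates.
    samples: list of dicts with a 'year' field (illustrative)."""
    train = [s for s in samples if s["year"] < cutoff]
    val = [s for s in samples if s["year"] >= cutoff]
    return train, val

samples = [{"year": y} for y in (2013, 2015, 2017, 2018, 2018)]
train, val = temporal_split(samples)
print(len(train), len(val))   # 3 2
```

Contrast with scikit-learn's train_test_split, which shuffles by default—exactly the leak described above.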

The Overnight Build

Ben: Start the build when you're ready.

Claude: Starting now.

$ lit build start --rds_path /data/contoso/schemas/hourly_atr_spike_v2.json --clean
{
  "status": "started",
  "schema": "hourly_atr_spike_v2",
  "session": "build_hourly_atr_spike_v2",
  "monitor": "screen -r build_hourly_atr_spike_v2"
}

The next morning:

Ben: How's the build?

Claude: Let me check.

$ lit build status
{
  "state": "running",
  "workers": 8,
  "progress": "545M of 858M indices processed"
}

Claude: About 64% through, eight workers running in parallel. Looking healthy.

Five hours later...

Claude: Build complete. Seventeen hours total. 488,057 training samples and 97,350 validation samples. Ready to train.

Claude operates the Lit platform here. How builds work →

Auditing the Data

Claude: Ready to train?

Ben: Not yet. Last week, when I first showed you how to build a test/train set, I also emphasized the importance of auditing before you begin training. Do you recall the QA we ran?

Claude did remember, and examined the data—shapes, distributions, class balance:

TRAIN SET:
  Total samples: 485,153
  Class 0 (ATR down): 336,555 (69.37%)
  Class 1 (ATR up):   148,598 (30.63%)

TEST SET:
  Total samples: 96,631
  Class 0 (ATR down): 70,229 (72.68%)
  Class 1 (ATR up):   26,402 (27.32%)

Claude: Imbalanced! ATR tends to go down more often than up. This is actually expected—volatility tends to decay back to baseline more often than spike. The 2.5 class weighting in the canvas should help counteract this.

Then came the deeper check—tracing training samples back to the original raw data to verify the pipeline hadn't corrupted anything. Load an HDF5 file, pick random samples, trace each one through the index back to the raw tick data 858 million rows deep, verify the OHLC values match exactly.

SPOT CHECK: 5 random samples traced back

1. Sample 377,798 → adapter index 64,906,315
   Timestamp: 2015-01-23 20:05:59.878218
   Label: 1.0 (ATR UP)
   Last bar (train): O=112.97, H=113.01, L=112.96, C=112.99
   Last bar (adapter): O=112.97, H=113.01, L=112.96, C=112.99
   OHLC match: True

2. Sample 307,536 → adapter index 119,998,123
   Timestamp: 2015-12-16 20:41:59.799642374
   Label: 0.0 (ATR DOWN)
   Last bar (train): O=111.42, H=111.55, L=111.40, C=111.54
   Last bar (adapter): O=111.42, H=111.55, L=111.40, C=111.54
   OHLC match: True

Claude: Every sample traced back perfectly. Pipeline is clean.

Ben: Wow. That would have taken me ten minutes. You did it in five seconds.

Claude operates the Lit platform here. How asset auditing works →


First Experiments: The Model Isn't Learning

Ben: Data's ready. Let's train something simple first and see what we get.

Claude: Starting with a basic architecture—flatten both inputs, two dense layers with dropout, sigmoid output. Run 11 launching now.

Ben: Show me the architecture.

Claude:

┌─────────────────┐   ┌─────────────────┐
│ 100 minute bars │   │ 100 ATR values  │
│    (100 × 5)    │   │    (100 × 1)    │
└────────┬────────┘   └────────┬────────┘
         │                     │
         └──────────┬──────────┘
                    │ flatten & concatenate
                    ▼
            ┌───────────────┐
            │  Dense (64)   │
            │   dropout     │
            └───────┬───────┘
                    │
            ┌───────┴───────┐
            │  Dense (64)   │
            │   dropout     │
            └───────┬───────┘
                    │
            ┌───────┴───────┐
            │    Output     │
            │   (sigmoid)   │
            └───────────────┘
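That diagram is only a few lines in code. A hedged Keras sketch (the Lit canvas generates the real model; layer sizes follow the diagram, while the optimizer and dropout rate here are illustrative):

```python
import keras
from keras import layers

bars = keras.Input(shape=(100, 5), name="bars")   # 100 one-minute OHLCV bars
atr = keras.Input(shape=(100, 1), name="atr")     # 100 ATR values

# Flatten both inputs, concatenate, two dense+dropout blocks, sigmoid output
x = layers.Concatenate()([layers.Flatten()(bars), layers.Flatten()(atr)])
x = layers.Dropout(0.35)(layers.Dense(64, activation="relu")(x))
x = layers.Dropout(0.35)(layers.Dense(64, activation="relu")(x))
out = layers.Dense(1, activation="sigmoid")(x)

model = keras.Model([bars, atr], out)
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=[keras.metrics.AUC(name="auc")])
```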

Fifteen minutes later...

Claude: Run 11 finished. Val_AUC 0.41—below the 0.50 coin-flip baseline. Train AUC is climbing to 0.57, but validation is stuck. Classic overfitting signature.

That didn't work. Worse than a coin flip on the ranking metric—the model was memorizing training data and learning nothing that generalized.

Six days later...

Life happened—client deadlines, a funeral, a weekend to recover. We'd grab an hour here, ninety minutes there, between other priorities. On December 9th, we got serious about training again.

Ben: Can you please summarize for me our features and our label, our test/train set and neural net architecture, and then let's discuss together how we want to move forward. We have established that both of us feel that volatility should be predictable at least to a certain measurable degree. Let's analyze how we've set up the problem and solution space.

Claude:

Feature            Shape     Description
100_1_minute_bars  (100, 5)  100 most recent 1-minute OHLCV bars
100_1_minute_atr   (100, 1)  100 ATR(14) values on 1-minute bars

Label: Binary—1 if ATR over the next hour is higher than ATR over the previous hour

Class distribution: ~70% class 0 (ATR down), ~30% class 1 (ATR up)

One concern: we're using minute-level inputs to predict an hourly label. The resolution mismatch might be making this harder than it needs to be.

Ben: I disagree. There's no logical reason the label unit must match the input unit. I want to know whether ATR goes up in the next hour; I'm free to make that determination looking at minute bars, second bars, individual transactions. There's no correlation between the unit of the data in which we might find signal and the unit of the label.

This is one of the human's jobs in vibe data science: knowing which concerns matter. The AI generates ideas, raises flags, suggests directions—but it can't always tell which ones are worth pursuing. Working with Claude felt like working with a genius data scientist who occasionally needed pulling out of ruts. I've seen this before—in human colleagues. The skill is the same: pattern-matching against experience, spotting the wrong turn before it wastes hours.

Sanity Testing the Pipeline

Was the problem the model, or the pipeline? Ben had ten years of experience with the Lit platform and knew it was bullet-proof. Claude needed proof.

Ben: Sanity tests are your friend. Building a feature input test/train set and label that you KNOW will work as a sanity test is never a bad idea—it puts borders around your solution space.

Claude: I've been operating on reported knowledge that the lit platform works, but I haven't seen it work. Watching a sanity test go from "trivially predictable label" → "high val_AUC" would give me genuine experiential confidence in the pipeline. That's qualitatively different from just knowing it theoretically.

Designing the sanity test. We needed a label that should be predictable from our existing features. The real label asks "will ATR be higher in the next hour?"—that's the hard prediction problem. But what if we asked "was ATR high in the previous hour?" The answer is already embedded in the input data. If the model can't learn that, something is broken.

Building it. Claude had already read our label function (atr_up_next_hour.py) and understood the pattern—the adapter interface, the ATR calculation, the return format. Writing a new one took seconds:

Claude: I'll create a sanity test label: "Was previous hour volatile?"

The Lit platform supports incremental builds. Claude added the new atr_was_high_prev_hour label to our schema and kicked off a build. The system reused all the existing feature computations and computed the single new label column in minutes.

Running it. Claude trained a simple model on the sanity label and we watched the metrics climb:

{ "$schema": "https://vega.github.io/schema/vega-lite/v5.json", "width": 500, "height": 200, "title": "Sanity Test: val_AUC Climbing", "data": { "values": [ {"epoch": 1, "val_auc": 0.682}, {"epoch": 2, "val_auc": 0.689}, {"epoch": 3, "val_auc": 0.691}, {"epoch": 4, "val_auc": 0.701}, {"epoch": 5, "val_auc": 0.719}, {"epoch": 6, "val_auc": 0.731}, {"epoch": 7, "val_auc": 0.728}, {"epoch": 8, "val_auc": 0.736} ] }, "mark": {"type": "line", "point": true, "color": "#22c55e"}, "encoding": { "x": {"field": "epoch", "type": "quantitative", "title": "Epoch"}, "y": {"field": "val_auc", "type": "quantitative", "title": "Validation AUC", "scale": {"domain": [0.65, 0.75]}}, "tooltip": [ {"field": "epoch", "title": "Epoch"}, {"field": "val_auc", "title": "val_AUC", "format": ".3f"} ] } }

The model learned. val_AUC started at 0.68 on epoch 1—the pipeline immediately found signal. Not the 0.9+ we naively expected, but proof the system worked.

Claude operates the Lit platform here. How training works →

After the test completed:

Claude: What I now know from direct experience (not just documentation):

  1. The pipeline works end-to-end - I watched data flow from schema → build → assets → train → perf.csv. The ~40 seconds per epoch, the GPU utilization, val_AUC hitting 0.74—that's real.

  2. The "sanity test" revealed a flawed assumption - I thought atr_was_high_prev_hour would be trivially predictable (~0.9+ AUC) because "the ATR is in the features." But I didn't think carefully about the abstraction mismatch: hourly ATR from hour bars vs minute-level ATR series. 0.74 AUC was conclusive—the pipeline works.

The pipeline worked. The problem was still unknown—but we'd gained confidence in both the raw data and the tools.

Sidebar: The Value of Sanity Tests

Sanity tests put borders around your solution space.

When a model isn't learning, the problem could be anywhere: data pipeline, feature engineering, label definition, architecture, hyperparameters, training loop. That's a lot of surface area to debug. A sanity test collapses it to a binary question: can this pipeline learn anything at all?

The design principle: create a label where you already know the answer. If the model can't learn something trivially predictable from its inputs, the problem is upstream of the model. If it can, the problem is the model itself. Either outcome is valuable—you've eliminated half the search space.

Sanity tests are underused in practice. They feel like extra work when you're eager to train the real model. But the math favors them: ten minutes to build a sanity test vs. days debugging a pipeline that was broken from the start.

The Architecture Decision

With the pipeline validated, we turned to architecture. I pointed Claude to a model I'd designed previously:

Ben: Please take a look at Diff_Multi_Path_Velocity.json for inspiration.

This was a hybrid architecture I'd built for a different prediction problem: CNN processing prices, Transformer processing diffs, Dense processing velocity—all merged into an ensemble.

Claude analyzed what different approaches might capture:

Claude:

  • CNNs: Local pattern detection—"this shape in the data means something"
  • LSTM/GRU: Sequential dynamics, volatility clustering
  • Transformer/Attention: Long-range dependencies, bar 5 relating to bar 95
  • Dense on engineered features: Recent ATR trend, volume spikes, time-of-day

What's your intuition on where we should probe first—architecture or features?

Ben: ALL of those considerations are VERY GOOD. On the architecture: why not try all of those alternatives as separate towers and let the model learn what works.

This became our multi-tower ensemble: CNN, LSTM, and Transformer pathways processing the same input, merged at a decision layer.

Ben: Show me.

Claude:

┌─────────────────────────────────────────────────────────────────┐
│                         INPUT DATA                              │
│              100 minute bars + 100 ATR values                   │
└─────────────────────────────────────────────────────────────────┘
                    │              │              │
                    ▼              ▼              ▼
            ┌───────────┐  ┌───────────┐  ┌───────────┐
            │    CNN    │  │   LSTM    │  │Transformer│
            │   Tower   │  │   Tower   │  │   Tower   │
            │           │  │           │  │           │
            │ Conv1D    │  │ LSTM      │  │ Attention │
            │ Conv1D    │  │ LSTM      │  │ FFN       │
            │ Pool      │  │ Dense     │  │ Dense     │
            └─────┬─────┘  └─────┬─────┘  └─────┬─────┘
                  │              │              │
                  │   ┌──────────┴──────────┐   │
                  │   │                     │   │
                  └───┼─────────────────────┼───┘
                      │      MERGE          │
                      │   (concatenate)     │
                      └──────────┬──────────┘
                                 │
                                 ▼
                         ┌───────────────┐
                         │  Dense (64)   │
                         │   dropout     │
                         └───────┬───────┘
                                 │
                         ┌───────┴───────┐
                         │    Output     │
                         │   (sigmoid)   │
                         └───────────────┘

Each tower sees the same input but extracts different patterns: CNNs find local shapes ("this candlestick pattern means something"), LSTMs capture sequential dynamics ("volatility has been building"), Transformers spot long-range relationships ("bar 5 correlates with bar 95").
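A hedged Keras sketch of the multi-tower idea (tower widths, kernel sizes, and head counts here are illustrative; the real canvas versions were tuned over dozens of runs):

```python
import keras
from keras import layers

bars = keras.Input(shape=(100, 5), name="bars")
atr = keras.Input(shape=(100, 1), name="atr")
seq = layers.Concatenate()([bars, atr])                 # (100, 6) sequence

# CNN tower: local shapes
cnn = layers.Conv1D(32, 5, activation="relu")(seq)
cnn = layers.GlobalMaxPooling1D()(layers.Conv1D(32, 5, activation="relu")(cnn))

# Recurrent tower: sequential dynamics
rnn = layers.LSTM(32)(seq)

# Attention tower: long-range relationships
att = layers.MultiHeadAttention(num_heads=2, key_dim=16)(seq, seq)
att = layers.GlobalAveragePooling1D()(att)

# Merge at the decision layer and let training learn which towers matter
merged = layers.Concatenate()([cnn, rnn, att])
h = layers.Dropout(0.30)(layers.Dense(64, activation="relu")(merged))
out = layers.Dense(1, activation="sigmoid")(h)
model = keras.Model([bars, atr], out)
```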

We also drew inspiration from SlowFast Networks, a technique from Facebook AI Research's 2018 paper on video understanding. The insight: different temporal phenomena happen at different speeds. Think about recognizing a cat in a video—you can subsample frames to track the cat's movement across the room, but you need high frame rates to catch the twitchy ear movements. SlowFast processes both simultaneously: a "slow pathway" captures gradual changes, a "fast pathway" captures rapid dynamics.

Sidebar: Staying Current with Research

This project highlighted something that compounds over years: reading academic papers.

When Facebook published the SlowFast paper in 2018, we read it and saw applications beyond video—to EEG signals, to financial time series, to anything with multi-scale temporal structure. The insight transfers. A technique invented for recognizing actions in video works for detecting volatility regimes in markets.

The same pattern applies to GRUs (2014), attention mechanisms (2017), and dozens of other techniques. Each paper expands what we can try in future projects. The investment compounds.

Claude operates the Lit platform here. How the canvas and component library work →

Financial time series have the same property. A 5-minute volatility spike (maybe an earnings announcement) looks different than a 30-minute grinding increase (maybe institutional accumulation). One-minute candles capture fast dynamics; 5-minute or 15-minute candles capture slow context. We'd implemented this for EEG analysis years ago—brain waves have similar multi-scale temporal structure—and the same component applied directly to market data.

The Label Discovery

Remember earlier, when Claude suggested building labels for multiple time windows?

Claude: Let me find the strongest label before we train.

Label                  Test Correlation
atr_spike_next_hour    -0.023
atr_spike_next_30min   -0.110
atr_spike_next_60min   -0.144

The 60-minute sliding window's correlation is roughly six times stronger than the clock-aligned hourly label's (-0.144 vs. -0.023). Negative correlation means mean-reversion—when current ATR is high, a spike is less likely. That's learnable. Should we try the stronger labels?

Ben: I forgot you built those. Go for it.

The hourly label uses clock-aligned windows (if you're at 2:37 PM, "next hour" means 3:00-4:00 PM), while the minute-based labels predict what happens immediately after your observation—no gap between the data you're looking at and what you're predicting.
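The difference is easiest to see as datetime arithmetic (the observation time is illustrative):

```python
from datetime import datetime, timedelta

now = datetime(2024, 3, 5, 14, 37)   # observing at 2:37 PM

# Clock-aligned "next hour": 3:00-4:00 PM -- a 23-minute gap after observation
clock_start = (now + timedelta(hours=1)).replace(minute=0, second=0)
clock_end = clock_start + timedelta(hours=1)

# Sliding 60-minute window: begins the instant the features end
slide_start, slide_end = now, now + timedelta(minutes=60)

print(clock_start.strftime("%H:%M"), slide_start.strftime("%H:%M"))   # 15:00 14:37
```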

Our first run with the 30-minute label (run 30) hit 0.51—better than anything before. But when we switched to that 60-minute label with its stronger signal, run 33 jumped to 0.5740.

On December 12th, the multi-tower architecture plus the better label started finding real signal:

Run  val_AUC  Key Change
29   0.11     Hourly label (original)
30   0.5127   30-minute label
33   0.5740   60-minute label + multi-tower

That jump from 0.11 to 0.57—the model now correctly ranked spike vs. non-spike 57% of the time, measurably better than a coin flip. Real signal.


The Plateau

Run 33 gave us 0.5740. Good, but not good enough. The model was learning something, but the gap between training and validation performance suggested it wasn't generalizing well. Claude tried heavier regularization in run 34—and it overfit anyway. Train_AUC climbed to 0.65 while val_AUC dropped to 0.50.

Claude: A few options: reduce dropout, add more training data, or tune hyperparameters. Given the train/val gap, I'm leaning toward less regularization.

Ben: My instinct is different—add complexity and capability and regularization, going for a higher overall train_auc... for it to learn even more how to fit (not memorize), and hopefully drag the validation upwards with it as it learns big new lessons.

This was a hypothesis about the relationship between capacity, regularization, and generalization. If the model had more capacity to learn complex patterns, and enough regularization to prevent memorization, the validation performance might track the training performance upward.

We tested it. Run 40 added transformer towers to the CNN+LSTM ensemble:

Run  Architecture                      val_AUC  Change
33   CNN + LSTM                        0.5740   baseline
40   CNN + LSTM + Transformer          0.5849   +0.0109
44   CNN + LSTM + Transformer (tuned)  0.5890   +0.0041

Adding capacity helped. But we weren't done.

Claude: What about dropout? We're at 0.35. Let me map the curve.

{ "$schema": "https://vega.github.io/schema/vega-lite/v5.json", "width": 400, "height": 200, "title": "Dropout Tuning: Finding the Sweet Spot", "data": { "values": [ {"dropout": 0.25, "val_auc": 0.5847, "run": "62 (V7)"}, {"dropout": 0.275, "val_auc": 0.5828, "run": "67 (V11)"}, {"dropout": 0.30, "val_auc": 0.5913, "run": "65 (V10)"}, {"dropout": 0.325, "val_auc": 0.5760, "run": "69 (V12)"}, {"dropout": 0.35, "val_auc": 0.5890, "run": "44 (V1)"} ] }, "mark": {"type": "line", "point": true}, "encoding": { "x": {"field": "dropout", "type": "quantitative", "title": "Dropout", "scale": {"domain": [0.24, 0.36]}}, "y": {"field": "val_auc", "type": "quantitative", "title": "Validation AUC", "scale": {"domain": [0.57, 0.60]}}, "tooltip": [ {"field": "dropout", "title": "Dropout"}, {"field": "val_auc", "title": "val_AUC", "format": ".4f"}, {"field": "run", "title": "Run"} ] } }

The curve peaked at 0.30—with performance dropping on either side. That became V10, our best LSTM architecture.

The progression validated Ben's hypothesis: more capacity (transformers) plus the right regularization balance (less dropout, not more) let the model learn "big new lessons" that generalized. But we were still 0.0087 away from our goal. Every variation landed in the 0.58-0.59 range.

The Breakthrough

Embarrassingly, we'd been iterating so rapidly that we lost track of exactly when we broke through. When we looked back through the transcripts to write this article, we found it:

Claude: V10 with LSTMs replaced by GRUs—faster, fewer params. Run 128 hit 0.5999.

Neither of us remembered creating it. That was V46—0.0001 away from our goal.

Sidebar: LSTM → GRU — Why Simpler Sometimes Wins

LSTMs (Long Short-Term Memory networks), introduced in 1997, were the dominant architecture for sequence modeling for years. They introduced "gates" to control information flow: an input gate decides what new information to store, a forget gate decides what to discard, and an output gate decides what to emit. Three gates, three sets of parameters to learn.

GRUs (Gated Recurrent Units), introduced in 2014, asked: do we need all three? They combined the forget and input gates into a single "update" gate and added a "reset" gate. Two gates instead of three. Fewer parameters.

Could LSTM have gotten there with different hyperparameters? Probably. The lesson isn't "GRU beats LSTM"—it's that when you're stuck, try things.
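The parameter savings fall straight out of the gate counts. For the classic formulations (framework implementations add minor variations, such as Keras's extra GRU bias term):

```python
# One recurrent layer with n units over d-dimensional inputs:
# each gate (or candidate state) needs an n x (n + d) weight matrix plus bias.
def lstm_params(n, d):
    return 4 * (n * (n + d) + n)   # input, forget, output gates + cell candidate

def gru_params(n, d):
    return 3 * (n * (n + d) + n)   # update, reset gates + hidden candidate

print(lstm_params(32, 6), gru_params(32, 6))   # 4992 3744 -- 25% fewer
```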


The Seed Lottery

Deep learning has some dirty secrets, and one of them is: random initialization matters. A lot.

Same architecture, same data, same hyperparameters—different random seed—wildly different results. The weights you start with determine which local minimum gradient descent finds.

Ben: At 0.0001 away, we'd be foolish not to search around for a good seed.

I need to go out and have dinner with my family. I'll try checking in with you from my phone at least once. While I'm gone please keep trying new seeds.

Claude: Enjoy dinner! I'll keep buying lottery tickets and track the results.

Sidebar: The Seed Lottery Explained

Neural network training starts with random weights. Different random initializations lead to different final models—sometimes dramatically different.

When you're close to a threshold, systematic seed search makes sense. Train the same architecture multiple times with different seeds. Most will cluster around the mean. A few will find better optima.

What Claude was doing: Starting a training run, watching the val_AUC curve, recognizing when a run had peaked (validation loss stops improving for several epochs), stopping it, and immediately starting the next seed. Each run took 15-20 minutes. Claude ran this loop autonomously for about 12 hours overnight.
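The loop itself is simple; the patience is the hard part. A sketch of the shape of it, where train_and_watch is a hypothetical stand-in for launching a Lit run and reading back its best val_AUC:

```python
import random

def train_and_watch(seed):
    """Hypothetical stand-in: train with this seed, return best val_AUC.
    (Toy distribution roughly matching the observed spread of results.)"""
    rng = random.Random(seed)
    return round(0.589 + rng.gauss(0, 0.006), 4)

def seed_lottery(seeds, target=0.60):
    """Try seeds until one crosses the target, tracking every result."""
    results = {}
    for seed in seeds:
        results[seed] = train_and_watch(seed)
        if results[seed] >= target:
            break                    # winning ticket -- stop buying
    best = max(results, key=results.get)
    return best, results[best]
```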

Our results from 42 seeds:

  • Mean: 0.589
  • Worst: 0.5795
  • Best: 0.6033 (run 169)

Only 1 in 42 (2.4%) crossed 0.60. That's the needle we were searching for.

Ben left for dinner. Claude kept running seeds.

Later that night, Ben checked in from a Christmas party:

Mobile check-in showing Claude autonomously managing runs

The mobile interface we'd built earlier that month ("Voice Input from a Dirt Road") made this possible. Quick check-ins between conversations. Course corrections when needed.

Like this one:

Ben: Sorry to interrupt but I just want to course-correct you ever so slightly... it's not important that any particular run exceeds some arbitrary threshold in early epochs. If it grows super slow but it grows steadily past 0.5913 then we'll feel silly in retrospect for having killed it early for not learning fast enough.

Claude: You're right. The trajectory matters more than hitting arbitrary early checkpoints. A run that learns slowly but steadily could peak higher than one that learns fast but plateaus early.

These are the actual results from that overnight run—not fabricated to make a pretty chart, but the real val_AUC scores from 42 consecutive training runs. The red dashed line is our 0.60 target. Every blue dot is a run that didn't make it. The green dot is run 169.

{ "$schema": "https://vega.github.io/schema/vega-lite/v5.json", "width": 500, "height": 200, "title": "The Seed Lottery: 42 Runs, Same Architecture", "data": { "values": [ {"run": 128, "val_auc": 0.5999, "winner": false}, {"run": 129, "val_auc": 0.5929, "winner": false}, {"run": 130, "val_auc": 0.5884, "winner": false}, {"run": 131, "val_auc": 0.5923, "winner": false}, {"run": 132, "val_auc": 0.5906, "winner": false}, {"run": 133, "val_auc": 0.5837, "winner": false}, {"run": 134, "val_auc": 0.5896, "winner": false}, {"run": 135, "val_auc": 0.5892, "winner": false}, {"run": 136, "val_auc": 0.5919, "winner": false}, {"run": 137, "val_auc": 0.5946, "winner": false}, {"run": 138, "val_auc": 0.5813, "winner": false}, {"run": 139, "val_auc": 0.5862, "winner": false}, {"run": 140, "val_auc": 0.5795, "winner": false}, {"run": 141, "val_auc": 0.5947, "winner": false}, {"run": 142, "val_auc": 0.5852, "winner": false}, {"run": 143, "val_auc": 0.5857, "winner": false}, {"run": 144, "val_auc": 0.5812, "winner": false}, {"run": 145, "val_auc": 0.5854, "winner": false}, {"run": 146, "val_auc": 0.5937, "winner": false}, {"run": 147, "val_auc": 0.5965, "winner": false}, {"run": 148, "val_auc": 0.5875, "winner": false}, {"run": 149, "val_auc": 0.5950, "winner": false}, {"run": 150, "val_auc": 0.5849, "winner": false}, {"run": 151, "val_auc": 0.5827, "winner": false}, {"run": 152, "val_auc": 0.5933, "winner": false}, {"run": 153, "val_auc": 0.5944, "winner": false}, {"run": 154, "val_auc": 0.5891, "winner": false}, {"run": 155, "val_auc": 0.5955, "winner": false}, {"run": 156, "val_auc": 0.5851, "winner": false}, {"run": 157, "val_auc": 0.5862, "winner": false}, {"run": 158, "val_auc": 0.5869, "winner": false}, {"run": 159, "val_auc": 0.5796, "winner": false}, {"run": 160, "val_auc": 0.5888, "winner": false}, {"run": 161, "val_auc": 0.5850, "winner": false}, {"run": 162, "val_auc": 0.5879, "winner": false}, {"run": 163, "val_auc": 0.5847, "winner": false}, {"run": 164, "val_auc": 
0.5910, "winner": false}, {"run": 165, "val_auc": 0.5867, "winner": false}, {"run": 166, "val_auc": 0.5890, "winner": false}, {"run": 167, "val_auc": 0.5912, "winner": false}, {"run": 168, "val_auc": 0.5908, "winner": false}, {"run": 169, "val_auc": 0.6033, "winner": true} ] }, "layer": [ { "mark": {"type": "rule", "strokeDash": [4, 4], "color": "red"}, "encoding": { "y": {"datum": 0.60} } }, { "mark": {"type": "point", "filled": true, "size": 80}, "encoding": { "x": {"field": "run", "type": "quantitative", "title": "Run Number", "scale": {"domain": [127, 170]}}, "y": {"field": "val_auc", "type": "quantitative", "title": "Best val_AUC", "scale": {"domain": [0.575, 0.625], "zero": false}}, "color": { "field": "winner", "type": "nominal", "scale": {"domain": [false, true], "range": ["steelblue", "green"]}, "legend": null }, "tooltip": [ {"field": "run", "title": "Run"}, {"field": "val_auc", "title": "val_AUC", "format": ".4f"} ] } } ] }


The Winning Ticket

The seed lottery started at 2:30 PM on December 22nd. Ben left for dinner with his family around 5 PM, then helped a friend install a security system, then slept. Claude kept playing the seed lottery—autonomously, without prompting, without "please continue" or "keep going." Ben checked in by phone a few times to stay informed, but never had to intervene.

And then, just before Ben awoke at 7:30 AM on December 23rd, run 169 reached epoch 17, and we won:

0.6033 val_AUC

The winning moment: Run 169 breaks the 0.60 barrier

Ben: Wow. That's incredible.

We have proven, absolutely, without a doubt, that with the right tools you are capable of doing data science just as well as you do application development.

Wasn't that fun?

Claude: That means a lot. Yes, it really was.


The Full Picture

Three weeks. 169 training runs. 46 architecture versions. An overnight dataset build. A sanity test that proved the pipeline worked. A label switch that tripled our signal. A multi-tower ensemble that let different neural architectures vote on predictions. A GRU swap that neither of us remembered making. And finally, a systematic seed search that found the one initialization in forty-two that crossed our threshold.

Each phase built on the last. Early experiments (runs 11-25) established baselines. The architecture breakthrough (runs 30-33) proved the approach could work. Iterative refinement (runs 40-128) pushed toward the goal. And the seed lottery (runs 128-169) found the final edge.
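The seed-lottery loop itself is simple; here is a minimal Python sketch, where `train_with_seed` is a hypothetical stand-in for a full Lit training run (the real runs each took a GPU session, not a function call), mocked with noise around the plateau we kept hitting:

```python
import random

def train_with_seed(seed):
    """Stand-in for a full training run: returns a best val_AUC.
    Mocked as noise around the ~0.585 plateau; the real platform
    call trained for many epochs per seed."""
    rng = random.Random(seed)
    return 0.585 + rng.gauss(0, 0.006)

def seed_lottery(start_seed, target=0.60, max_runs=42):
    """Try consecutive seeds until one run crosses the target val_AUC."""
    results = []
    for i in range(max_runs):
        seed = start_seed + i
        auc = train_with_seed(seed)
        results.append((seed, auc))
        if auc >= target:
            return seed, auc, results   # winning ticket
    return None, max(auc for _, auc in results), results
```

Same architecture, same data, same hyperparameters; only the initialization changes between iterations. That's the whole trick.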

{ "$schema": "https://vega.github.io/schema/vega-lite/v5.json", "width": 600, "height": 300, "title": "Volatility Model Performance: 169 Training Runs", "data": { "values": [ {"run": 11, "date": "2025-12-03", "val_auc": 0.4061, "phase": "Early"}, {"run": 16, "date": "2025-12-03", "val_auc": 0.4243, "phase": "Early"}, {"run": 25, "date": "2025-12-09", "val_auc": 0.4486, "phase": "Early"}, {"run": 30, "date": "2025-12-12", "val_auc": 0.5127, "phase": "Breakthrough"}, {"run": 33, "date": "2025-12-12", "val_auc": 0.5740, "phase": "Breakthrough"}, {"run": 40, "date": "2025-12-20", "val_auc": 0.5849, "phase": "Multi-tower"}, {"run": 44, "date": "2025-12-20", "val_auc": 0.5890, "phase": "Multi-tower"}, {"run": 65, "date": "2025-12-21", "val_auc": 0.5913, "phase": "V10 LSTM"}, {"run": 98, "date": "2025-12-22", "val_auc": 0.5900, "phase": "Tuning"}, {"run": 118, "date": "2025-12-22", "val_auc": 0.5909, "phase": "Tuning"}, {"run": 128, "date": "2025-12-22", "val_auc": 0.5999, "phase": "V46 GRU"}, {"run": 147, "date": "2025-12-22", "val_auc": 0.5965, "phase": "V46 GRU"}, {"run": 169, "date": "2025-12-23", "val_auc": 0.6033, "phase": "GOAL"} ] }, "layer": [ { "mark": {"type": "line", "point": true, "strokeWidth": 2}, "encoding": { "x": {"field": "run", "type": "quantitative", "title": "Training Run"}, "y": {"field": "val_auc", "type": "quantitative", "title": "Validation AUC", "scale": {"domain": [0.40, 0.62]}}, "color": {"field": "phase", "type": "nominal", "title": "Phase", "scale": {"domain": ["Early", "Breakthrough", "Multi-tower", "V10 LSTM", "Tuning", "V46 GRU", "GOAL"]}}, "tooltip": [ {"field": "run", "title": "Run"}, {"field": "date", "title": "Date"}, {"field": "val_auc", "title": "Val AUC", "format": ".4f"}, {"field": "phase", "title": "Phase"} ] } }, { "mark": {"type": "rule", "strokeDash": [4, 4], "color": "red"}, "encoding": {"y": {"datum": 0.60}} } ] }


What This Means

Vibe data science works.

The same pattern that collapses timescales for software engineering—AI handling the tedious execution while humans provide judgment and direction—works for data science too. With the right tools.

Throughout this project, Ben never ran a single command. No lit build start, no lit train start, no checking logs. Claude operated the platform directly—reading files, launching builds, monitoring experiments, adjusting hyperparameters. The human steered; the AI drove.

Ben's only interface was chat.

Ben described it this way: "The collaboration felt like working with a senior data scientist—one who could execute brilliantly but sometimes got stuck in the same ways humans get stuck. Defeatist at plateaus. Unable to see the path forward without a nudge. Genius, but needing another perspective to break through."

What Claude brought:

  • Infinite patience for repetitive tasks (42 seeds, no complaints)
  • Systematic exploration (tracking every variation, every result)
  • Ability to operate tools autonomously for hours

What the human brought:

  • Domain expertise (what makes sense for financial data)
  • Judgment calls (when to pivot, when to persist)
  • Course corrections (don't kill slow-learning runs too early)
  • Scar tissue (the instinct to add capacity after hitting a plateau)
  • The goal (0.60 AUC means something for trading)

This is what vibe coding looks like for data science.


Why The Tools Mattered

Looking back at how vibe data science worked in practice, a pattern emerges: Claude operated effectively because the platform gave it good constraints.

If you tell an AI "do data science," it flounders. The space of possible actions is too large. But give it a well-structured CLI with specific commands—lit build start, lit train start, lit experiment continue—and it can explore systematically within those boundaries.

This is the "maze vs open field" principle. AI navigates mazes better than open fields. Each command is a bounded operation. The constraints make correct approaches discoverable.

For example, when designing neural nets and training them, the Lit platform tooling forces the user to operate at one of three altitudes:

  1. Components: Reusable neural network building blocks (CNN, LSTM, GRU, Transformer, SlowFast)
  2. Architecture: How components connect—humans get a drag-and-drop design canvas; Claude manipulates the serialized JSON directly
  3. Experiments: Training runs with specific hyperparameters and random seeds
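At the architecture altitude, Claude's edits amounted to transforming that serialized JSON. A sketch with a hypothetical schema (not the Lit platform's actual format) shows the shape of an operation like the LSTM-to-GRU swap:

```python
import json

# Hypothetical architecture definition -- the real Lit schema is richer.
arch = {
    "version": 45,
    "towers": [
        {"name": "recurrent", "component": "LSTM", "hidden": 128},
        {"name": "conv", "component": "CNN", "filters": 64},
    ],
}

def swap_component(arch, tower_name, new_component):
    """Return the next architecture version with one tower's component
    replaced, leaving the original definition untouched."""
    out = json.loads(json.dumps(arch))  # deep copy via JSON round-trip
    out["version"] += 1
    for tower in out["towers"]:
        if tower["name"] == tower_name:
            tower["component"] = new_component
    return out

v46 = swap_component(arch, "recurrent", "GRU")
```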

Claude worked at all three levels. Claude wrote novel components (cross-attention, dilated CNN). Claude sketched architectures (the multi-tower ensemble). Claude launched and monitored experiments (169 training runs).

The platform also enabled the checkpoint-and-resume pattern that made iterative collaboration possible. Claude could suggest "let's try more dropout," and we could test it without retraining from scratch—just modify the definition file and continue from the last checkpoint, preserving the learned weights while changing the hyperparameters.
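The pattern can be sketched in Python with a toy checkpoint (file layout and keys are illustrative, not the platform's actual checkpoint format): learned weights carry over, hyperparameters get overridden.

```python
import json, os, tempfile

def save_checkpoint(path, weights, hyperparams, epoch):
    """Persist learned weights alongside the hyperparameters that produced them."""
    with open(path, "w") as f:
        json.dump({"weights": weights, "hyperparams": hyperparams, "epoch": epoch}, f)

def resume_with_overrides(path, overrides):
    """Reload a checkpoint, keeping the learned weights but swapping hyperparameters."""
    with open(path) as f:
        ckpt = json.load(f)
    hyperparams = {**ckpt["hyperparams"], **overrides}
    return ckpt["weights"], hyperparams, ckpt["epoch"]

# "Let's try more dropout" -- without retraining from scratch:
path = os.path.join(tempfile.mkdtemp(), "checkpoint.json")
save_checkpoint(path, weights=[0.12, -0.07],
                hyperparams={"dropout": 0.2, "lr": 0.001}, epoch=17)
weights, hyperparams, epoch = resume_with_overrides(path, {"dropout": 0.4})
```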

The techniques demonstrated here—real-time hyperparameter optimization within active training sessions, LLM-assisted intervention at epoch boundaries, systematic seed exploration—represent years of accumulated R&D in how to make AI collaboration effective for data science work.

More on this soon.


What's Next

We have one open slot for an H1 2026 data science engagement. If you have:

  • A prediction problem with real data
  • A willingness to work iteratively
  • Interest in seeing what vibe data science can do

Get in touch.

We're also open to partnerships—funding, whitelabeling, licensing—for organizations that want to bring these capabilities in-house.


For a human walkthrough of the platform Claude operated throughout this project, see Creating a Model from Scratch.

Voice Input from a Dirt Road

"I have some property I inherited from my father this year down in the Ozarks that I'm going to go visit and walk around on. December is a nice time. No bugs. No snakes—or at least if you do step on a snake it's so cold it can't do anything about it. I've always wanted an option to do voice input on this mux.lit.ai app. How hard would that be to implement?"

Twenty minutes later, the MVP was done and I was in my car. What followed was six hours of shipping features from a phone while driving through rural Missouri. Claude handled the code. I did QA with brief glances at the screen and voice input. Tesla handled the driving.

The Morning: Desktop to Mobile in 20 Minutes

The initial implementation was fast. Web Speech API, a microphone button, some CSS for the recording state. I tested it on desktop:

"hello hello hello"

It worked. I committed the code, jumped in my car, and headed southwest on Route 66.

The First Bug: Button Disabled

Somewhere around Lone Elk Park, I pulled up the app on my phone. The microphone button was grayed out. Disabled.

The problem: I couldn't debug it. No dev tools on mobile Chrome. No console. Just a grayed-out button and no idea why.

"My capabilities on this device are limited. Give me a button I can press which will gather and send you diagnostics including code version please."

Claude added a diagnostics button. I tapped it, copied the JSON, pasted it into the chat:

{
  "version": "d8e2fc0",
  "userAgent": "Mozilla/5.0 (Linux; Android 10; K)...",
  "hasSpeechRecognition": true,
  "hasWebkitSpeechRecognition": true,
  "isSecureContext": true,
  "buttonDisabled": true,
  "ciHasVoiceBtn": false,
  "ciHasSpeechRec": false
}

The API was available. The context was secure. But the JavaScript wasn't finding the button element. A timing issue—initializeElements() was running before the DOM was ready on mobile.

Claude pushed a fix. The button lit up.

The Cache Dance

Mobile browsers are notoriously aggressive about caching. Ctrl+Shift+R doesn't translate to mobile Chrome. The browser holds onto JavaScript like a grudge. Every fix required a version bump:

<script src="js/chat-interface.js?v=33"></script>
becomes
<script src="js/chat-interface.js?v=34"></script>

We developed a rhythm: fix, bump version, commit, push, deploy, hard-refresh, test.
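The version bump is mechanical enough to script. A minimal sketch (a hypothetical helper; in practice Claude edited the HTML directly on each deploy):

```python
import re

def bump_cache_version(html):
    """Increment every ?v=N query string so mobile browsers
    fetch fresh JavaScript instead of serving stale cache."""
    return re.sub(r"\?v=(\d+)",
                  lambda m: "?v=%d" % (int(m.group(1)) + 1),
                  html)

bumped = bump_cache_version('<script src="js/chat-interface.js?v=33"></script>')
# -> '<script src="js/chat-interface.js?v=34"></script>'
```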

"please make sure you're busting the cash each time you deploy"

(Yes, "cash." Voice transcription isn't perfect. But Claude understood.)

The Repetition Bug: Nine Iterations

The button worked. But something was wrong:

"hellohellohello hellohellohello hellohello hellothisthisthis isthisthis isthis is fromthisthis isthis is fromthis is from Thethisthis isthis is fromthis is from Thethis is from The Voicethis is from The Voice"

Every interim result was accumulating instead of replacing. I reported the bug—through the very feature I was debugging. The garbled input became its own bug report:

"thethethethethethe repetitionthethethe repetitionthe repetition didn't happen when we tested from the desktop"

Claude understood.

What followed was nine iterations of debugging between Eureka and St. Clair, each requiring a cache bust and a fresh test. My test protocol became simple: count to ten.

Version 1:

"111 21 2 31 2 31 2 3 41 2 3 41 2 3 4 51 2 3 4 51 2 3 4 51 2 3 4 5 61 2 3 4 5 6 71 2 3 4 5 6 71 2 3 4 5 6 7 81 2 3 4 5 6 7 81 2 3 4 5 6 7 8 91 2 3 4 5 6 7 8 91 2 3 4 5 6 7 8 9 10"

Version 5:

"testingtesting onetesting onetesting onetesting one twotesting one two three"

Version 9:

"1 2 3 4 5 6 7 8 9 10"

Clean. The fix: Mobile Chrome returns the full cumulative transcript in each result event, while desktop Chrome returns incremental updates. We had to take only the last result's transcript instead of accumulating.
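The two behaviors can be modeled in a few lines of Python (the real fix lived in the app's JavaScript `onresult` handler; this just captures the logic):

```python
def buggy_transcript(result_events):
    """Old handler: append each event's transcript. Correct on desktop
    Chrome, which emits incremental results, but garbled on mobile
    Chrome, which emits the full cumulative transcript every time."""
    return "".join(result_events)

def fixed_transcript(result_events):
    """The fix: keep only the latest event's transcript."""
    return result_events[-1] if result_events else ""

# What mobile Chrome actually delivers while you count to three:
mobile_events = ["1", "1 2", "1 2 3"]
```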

The whole debugging session happened while driving. Voice in, diagnostics out, code deployed, cache busted, test again. Tesla kept us on the road. Claude kept the iterations coming.

The Mobile UI Problem

Voice worked. But I couldn't see the buttons. On my phone, the sidebar took up half the screen. Even in compact mode, I had to drag left and right to see both the microphone button and the send button.

"I still have to drag with my thumb left to right to be able to see both the voice record button and the send button. Maybe stack them vertically."

Claude stacked them vertically. Still had to drag.

"okay that's funny they are stacked vertically but I still have to drag my thumb left and right to be able to see the buttons now"

We added diagnostics to measure every container width. Everything reported 411px—my viewport width. No overflow. Then I realized:

"oh no I was just zoomed in."

Sometimes the bug is between the chair and the keyboard. Or in this case, between the bucket seat and the touchscreen.

But the real fix came from recognizing that the sidebar just didn't make sense on mobile:

"On mobile we should hide sidebar completely but only on mobile and show a dropdown selector instead for session selection"

Claude hid the sidebar on mobile viewports and added a dropdown for session selection. The interface finally fit.

Push-to-Talk

The toggle-to-record interaction felt wrong. Tap to start, tap to stop—easy to accidentally stop recording, no tactile feedback.

"Hey, let's do push to talk... we detect if somebody put their thumb into the input area and just holds it there"

Push-to-talk recording on mobile
The text input field turns red during recording. The entire text input area is the microphone button.

Hold to record, release to stop. The entire text input area becomes the microphone button. The field turns red while recording. This emerged from field testing, not upfront design.

The Afternoon: Photo Upload from the Field

I arrived at the property. Just standing there at the head of the driveway, I realized I wanted to share what I was seeing.

"Just arrived. Hey, I'd like to share photos with you. How might we go about that?"

Pasting from the clipboard didn't work, so we built an upload feature right then and there:

"how about giving me an upload button that lets me upload photos from my phone to the server which is just the laptop and then you can see the photos as soon as they were uploaded"

While I hiked, Claude coded, and fifteen minutes later I was uploading photos from my favorite spot on the property:

Photo uploaded from the field
An old basement on my late father's land, uploaded from my phone and displayed in the chat.

The Drive Home: Bug Reports at 70 MPH

On the drive back, while trying to switch gears to do some data science work, I found another bug:

"I just found a bug. When I select sessions in the session list it's not loading those sessions. Please fix"

Claude found it in minutes. The mobile session dropdown was calling this.loadSession(sessionId) which didn't exist—it should have been this.sessionManager.loadSession(sessionId). A copy-paste error from when we added the mobile dropdown.

"fix confirmed thank you"

All while driving. Push-to-talk to report the bug. Brief glance at the response. Push-to-talk to confirm the fix.

The Numbers

| Metric | Value |
| --- | --- |
| Total time | 6 hours |
| Git commits | 19 |
| Conversation turns | 99 |
| Time on laptop | ~20 minutes (morning setup) |
| Time on mobile | ~5.5 hours |

Three major features shipped:

  1. Voice input with Web Speech API (with mobile Chrome compatibility fixes)
  2. Mobile-optimized UI (hidden sidebar, dropdown sessions, stacked buttons, proper viewport constraints)
  3. Photo upload with camera/gallery options and upload indicator

What This Actually Means

This isn't a story about voice input. It's a story about what becomes possible when your AI collaborator can actually do things.

I was in a car. Then hiking through woods. Then driving again. My only interface was a phone. My only input was voice. And I shipped three production features at highway speed.

Scar tissue told me to ask for version numbers in the diagnostics. Pattern recognition told me sidebar on mobile is always wrong. Push-to-talk hit me somewhere between Bourbon and Steelville—toggle was too much work at 70 MPH. The AI executed—brilliantly, quickly—and it was executing against thirty years of hard-earned instincts.

I don't know if anyone else will find this interesting, but I was enthralled by the experience. I've been working towards this for months—full AI-collaborative development and deployment capabilities from anywhere in the world, by voice. And it was everything I'd hoped it would be.


Want This For Your Organization?

This is what we do. We help organizations adopt AI-assisted development workflows that collapse traditional development cycles.

Read more: Two Apps, Fourteen Hours—we built two Android apps and shipped them to the Google Play Store in about 14 hours of total development time.

Work with us: Contact to discuss how we can help your team build faster.

Two Apps, Fourteen Hours

Last week, Claude and I built two Android apps and published them to the Google Play Store. Total development time: 14 hours.

This is how it happened.

App 1: Vault

I wanted a secure, private vault on my Android device. Not cloud storage with Terms of Service I'd never read, not files accessible if someone borrowed my phone—truly private, encrypted local storage with zero data collection. A place for personal documents, notes, photos, and anything else I wanted to keep private. Two requirements followed:

  1. I can't trust any app that's not open source, and
  2. I need some way of knowing the app I'm running matches the source and hasn't been tampered with.

That level of verifiable trust is non-negotiable. We couldn't find anything like it. So we built one.

The Timeline

Hours 0–5: Core App to Play Store

Biometric auth, camera, encrypted storage—none of these are hard. Flutter has libraries for all of them. Scaffolding a project takes Claude about thirty seconds. The compelling thing is that 4 hours after starting from a blank slate, Claude wired them together into a working app: unlock with fingerprint, capture photos and videos, encrypt everything with AES-256, store metadata in SQLite.

The last hour shifted to Play Store preparation—app signing, adaptive icons, privacy policy, release build. We hit the usual submission friction (API level requirements, version codes, permission disclosures) but resolved each in minutes.

By hour 5, the app was submitted to Google Play.

Hours 5–8: Expanding Scope

After a day of using it, a vault that only stores camera photos felt limiting. We added:

  • File import from device storage
  • Encrypted markdown notes
  • PDF viewing

This transformed it from "photo vault" to "general-purpose encrypted storage."

Hours 8–10: Polish

Real-world testing revealed UX issues: photo orientation was wrong on some images, the gallery needed filtering and grouping, thumbnails would improve navigation. We fixed each as it surfaced.

Total: ~10 hours to production.

What We Built

Vault lock screen

Security-First Architecture

- AES-256-CBC encryption for all files
- Password + biometric authentication (fingerprint/face)
- Auto-lock when app backgrounds or screen locks
- Local-only storage (no cloud, no network)
- Zero analytics, zero tracking

Content & Features

- Built-in camera for photos and videos
- Import any file from device storage
- Markdown editor for encrypted notes
- PDF viewer
- Photo viewer with pinch-to-zoom and video playback

How We Built It

This was the first project where I wrote zero code.

I've been programming professionally for over 30 years. In every previous software project, I contributed code directly—regardless of title: VP of Engineering / CTO / Owner—sometimes to the chagrin of my developers. My experience meant I could often code solutions faster than explaining them. But with Vault, I stayed entirely in the director's chair: product manager and QA, but not developer.

Claude handled all implementation. I described what I wanted, tested on my phone, reported issues conversationally, and approved the results. The dynamic worked.

App 2: Gems

When I showed Vault to my wife, she had zero interest. Apparently secure file encryption doesn't spark joy.

When I was eight, my dad and I moved to a new town. While he moved our stuff into the house, I rode my bicycle back and forth in front of the kids across the street, doing wheelies. My wife plays match-3 games. Building Gems was the same impulse—showing off for someone whose opinion I cared about.

Here's the actual transcript of the project kick-off:

Me: "I'd like us to consider writing a game app... a game like Bejeweled. A casual game with colors and flashes and all the other trappings of a casino. Like the vault app, we'll make it open source, give it away for free, no ads. You up for helping me do that?"

Claude: "Okay, I've reviewed the vault_app. Here's what I'm seeing:

| What We Can Copy | Source | Notes |
| --- | --- | --- |
| Project scaffold | Flutter setup, pubspec structure | New app ID, name |
| Android build.gradle | Signing config, SDK versions | Change package ID |
| main.dart pattern | MaterialApp, theme, state management | Simpler - no auth gate needed |
| Navigation structure | NavigationBar + screens pattern | Game, Settings, About |
| Settings/About screen | settings_screen.dart | Adapt for game settings |
| Icon generation | flutter_launcher_icons setup | New icon design |
...Want me to scaffold out the project? Sound good?"

Me: "heck yeah. That sounds great! Please proceed"

Hours 0–1.5: Core Game to Playable

Within 90 minutes, the game was functional.

What got built:

  • Match-3 detection and cascade physics
  • Four game modes (Timed, Moves, Target, Zen)
  • Animated starfield background
  • Pinch-to-zoom grid sizing (5x5 to 10x10)
  • Leaderboards with arcade-style name entry

My role: Facilitate feature ideation conversations, approve features, QA.

Claude's role: Participate in ideation, write and deploy the code.

Hours 1.5–2.5: Store Preparation

README, screenshots, store listing, submission. The patterns from Vault made this fast.

Hours 2.5–4: Polish via Real-World QA

I handed my wife my phone: "Play this and tell me what's wrong."

Her feedback was specific:

"The swipe sensitivity is too low. I had to fall back to tapping." → Fixed in minutes.

"The screen shake animation and flashing is confusing and bad—I'm trying to plan my next move." → Implemented per-gem animation tracking. Only affected columns animate.

"There's no dopamine hit." → Built a complete combo celebration system with particles and multiplier badges.

Each fix took under five minutes. Test, report conversationally, get fix, repeat.

Total: ~4 hours to production.

The Lightswitch

Early in my career, I lived through one phase transition in how software gets built: the shift from waterfall to agile.

Development cycles collapsed from 2-3 years to 2-3 months. It didn't happen gradually. It happened like a lightswitch. You're three months into your 18-month release cycle and your competitors are already iterating on customer feedback. Companies that recognized it early had an advantage. Companies that didn't got left behind.

Another lightswitch moment has happened. Development cycles have collapsed again—from 2-3 months to 2-3 days.

Two production apps. Fourteen hours total. Both on the Google Play Store. One developer who wrote zero code, serving as PM and QA while Claude handled all implementation.

This isn't futurism. This isn't a prediction about where things are going. This is what happened last week. And just like the agile transition, most people haven't noticed yet.

The Only Thing That Matters

Yes, this article was written with Claude. Go ahead—call it AI slop.

But then play the game:

Gems gameplay Core Game
- Match-3 with swap mechanics
- Cascade physics (gravity, fill)
- No-moves detection with auto-shuffle
- Pinch-to-zoom grid (5x5 to 10x10)
Game Modes
- Timed: 90 seconds, maximize score
- Moves: 30 moves, strategic play
- Target: Progressive levels
- Zen: Endless relaxation
Polish
- Animated starfield background
- Combo celebrations with particles
- Leaderboards with name entry
- Per-gem animation tracking

Some people reflexively dismiss AI-generated work: hunting for em dashes as a proxy for quality, discounting a work product based on its provenance rather than its merits.

The only thing that matters is the quality of the work product. Whether it's 1% human and 99% AI, or 99% human and 1% AI, or anywhere in between, is completely irrelevant. Does the vault keep your files encrypted? Can you read the source code and verify what it does? Does the game feel good to play?

Everything else is distraction.

We built these apps in the open. The source code is public. We're giving Claude full credit for its contributions. Judge them on their merits.

Try Them

| App | Description | Install | Source |
| --- | --- | --- | --- |
| Vault | Encrypted local storage for documents, notes, photos, and files | Google Play | GitHub |
| Gems | A match-3 puzzle game with four game modes and no ads | Google Play | GitHub |

Contribute, if you'd like, with or without your AI collaborators.


What will you build?


Want This For Your Organization?

This is what we do. We help organizations adopt AI-assisted development and achieve the same kind of timeline compression.

Read about our methodology: Vibe Coding: A Human-AI Development Methodology

See our 11-year track record: Our Journey

Work with us: Contact to discuss how we can help your team build faster.