ENTER THE HYDRA
The paradigm shift from numerical precision to contextual intelligence.
Foundation V5: Multi-Task Learning.
FOUNDATION V5 "HYDRA"
Status: Active | Core: Multi-Task Learning | Focus: Context & Safety
1. INTRODUCTION: THE PRECISION TRAP
Until now, ModelMango had been stuck in a trap: the obsession with price precision.
For months we chased the absolute minimum error. We introduced Ranger (V3) and then Sniper (V4), a model we called an "Idiot Savant": it predicted price with pinpoint precision (MAPE 1.6%) but was blind to macro risks.
In the private beta we noticed a critical problem: despite the numerical precision, the system suggested buying during crashes, because the math said "rebound" while the world was screaming "crisis".
It became clear that predicting a number was not enough; the model needed to understand context.
2. THE 4 HEADS OF HYDRA
Hydra V5 is not a simple evolution; it is a paradigm shift. We abandoned pure single-task price regression for a Multi-Task Learning architecture based on Nested Learning.
While Sniper only asked itself "How much will this asset be worth tomorrow?", Hydra asks itself 4 questions simultaneously, thanks to its 4 neural "Heads":
PRICE HEAD (The Sniper)
What is the target price? Maintains the pinpoint precision of V4.
REGIME HEAD (The Context)
Is the market "Bull", "Bear", or is a "Crash" coming? This head sees the danger.
DURATION HEAD (The Tactic)
How long to hold the position? Optimizes the exit timing (T+1 vs T+2).
CONFIDENCE HEAD (Metacognition)
"How much do I trust my own prediction?" A level of model self-awareness.
3. METACOGNITION & SAFETY
What more do users get today? Context and actionable information, not just numbers.
- Safety (Crash Recall 81%): Hydra knows how to recognize a "toxic" market. If the Regime Head detects danger, the model does not generate positive signals, even if the price seems attractive.
- Robustness: Thanks to Nested Learning (inherited from V2), the model learns in real-time, updating its internal memory before every prediction. This solves the problem of "model decay" at the root.
- Capital Protection: Hydra no longer suggests blind entries. The system uses logic free of human heuristics: if the internal Confidence Head is low (< 0.5), the output suggests caution (Cash), as in the sketch below.
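A minimal sketch of this safety gating, assuming head outputs like those above. Only the 0.5 confidence floor comes from the text; the field names and the return-based rule are hypothetical:

```python
def final_signal(prediction: dict, confidence_floor: float = 0.5) -> str:
    """Safety gating: regime and confidence can veto the price signal."""
    if prediction["regime"] == "Crash":
        return "CASH"  # Regime Head sees danger: never emit a positive signal
    if prediction["confidence"] < confidence_floor:
        return "CASH"  # the model does not trust its own prediction
    # Only a trusted, non-toxic context lets the price signal through.
    return "LONG" if prediction["expected_return"] > 0 else "CASH"

# Example: attractive expected return, but low confidence -> caution wins.
print(final_signal({"regime": "Bull", "confidence": 0.3, "expected_return": 0.02}))  # CASH
```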
We moved from an algorithm that, although very precise, "took guesses", to an infrastructure that thinks and assesses risk. Welcome, Hydra. 🐲
V2: TITANS ARCHITECTURE (PREVIOUS)
Note: This architecture laid the foundation for Hydra's engine (Nested Learning) but operated as a single task (price only).
4. TITANS ARCHITECTURE & NESTED LEARNING
The technological core of V2 (which remains as the engine in V5) is based on two advanced research paradigms that redefine the concept of memory in neural networks:
- Titans Architecture (Memory as Context): Unlike classic Transformers that have a limited context window, the Titans architecture introduces a long-term neural memory. This allows the model to "remember" significant historical events (such as liquidity crises or structural rallies) and use them as context for current analysis, without the computational limits of traditional sliding windows.
- Nested Learning (Test-Time Training - TTT): This is the fundamental innovation. The model is not "frozen" after initial training. Using the Nested Learning paradigm, the system runs a continuous internal training loop during the inference phase (live). For every new market data point received, the model calculates gradients and updates its internal weights instantly.
In summary: the model learns while it operates, adapting to the specific volatility of the asset under examination at that precise historical moment.
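As an illustration of the Test-Time Training idea (not ModelMango's actual code), here is a sketch of an inner update loop in PyTorch; the function name, learning rate, and step count are assumptions:

```python
import torch

def predict_with_test_time_update(model, x_new, recent_x, recent_y,
                                  lr: float = 1e-4, inner_steps: int = 1):
    """Hypothetical TTT loop: before predicting on x_new, take a few
    gradient steps on the most recent observed data so the weights
    reflect the current market regime."""
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = torch.nn.HuberLoss()
    model.train()
    for _ in range(inner_steps):
        optimizer.zero_grad()
        loss = loss_fn(model(recent_x), recent_y)  # loss on the latest data point(s)
        loss.backward()                            # gradients computed at inference time
        optimizer.step()                           # weights updated before predicting
    model.eval()
    with torch.no_grad():
        return model(x_new)
```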
References:
- Titans: Learning to Memorize at Test Time — https://arxiv.org/abs/2501.00663
- Introducing Nested Learning — Google Research Blog
5. THE NEURAL MODEL V2
The current architecture is built upon the proprietary TransformerTimeSeriesMACModel class, integrating "Meta-Learning" mechanisms directly into the data flow:
- Continuum Memory System (CMS): The system's brain. Instead of static memory, we use a hierarchy of Functional Memory Modules. These are "stateless" modules operating on different time frequencies (short, medium, long term).
- Meta-Weights & Fast Weights: The model does not memorize raw prices. It memorizes the synaptic weights that generate the market's transition function. During inference, the system updates its "Fast Weights" instantly to adapt to the current market regime, while "Meta-Weights" (slow weights learned during global training) ensure structural stability.
- Instance Normalization: To make the model "absolute price agnostic" (working on both $10 and $30,000 assets), each time window is statistically normalized against its own local mean and standard deviation (Z-Score), allowing the network to focus purely on relative dynamics.
- Attention Pooling: Rather than performing a simple average of historical data, the model uses an Attention Pooling mechanism that assigns a "relevance weight" to each memory fragment, actively filtering market noise from valid signals.
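Two of these components are easy to sketch. Below is an illustrative PyTorch version of per-window instance normalization and of attention pooling; shapes and names are assumptions, not the proprietary implementation:

```python
import torch
import torch.nn.functional as F

def instance_normalize(window: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Z-score each window against its own local mean/std, making the
    model agnostic to absolute price level ($10 or $30,000)."""
    mean = window.mean(dim=-1, keepdim=True)
    std = window.std(dim=-1, keepdim=True)
    return (window - mean) / (std + eps)

class AttentionPooling(torch.nn.Module):
    """Weighted pooling: a learned query scores each timestep's relevance
    instead of averaging all of history equally."""
    def __init__(self, d_model: int):
        super().__init__()
        self.query = torch.nn.Parameter(torch.randn(d_model))

    def forward(self, h):                  # h: (batch, time, d_model)
        scores = h @ self.query            # relevance score per timestep
        weights = F.softmax(scores, dim=-1)
        return (weights.unsqueeze(-1) * h).sum(dim=1)  # (batch, d_model)
```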
6. TRAINING & DATASET
V2 training was performed on high-performance NVIDIA GPUs, utilizing a robust loss function (HuberLoss) to minimize the impact of extreme outliers typical of market crashes.
The data management strategy is rigorous to prevent "Data Leakage":
- Chronological Split: We do not use random shuffling. For every asset, the dataset is rigidly cut: the oldest 80% for Training, the most recent 20% for Validation. The model never sees the future during training.
- Extreme Clipping: During normalization, we apply 10-sigma clipping. This prevents "Black Swan" events from destroying gradients during backpropagation, keeping training stable.
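For illustration, a minimal NumPy sketch of both rules (the function names and the z-score input are assumptions):

```python
import numpy as np

def chronological_split(series: np.ndarray, train_frac: float = 0.8):
    """Split without shuffling: oldest 80% for training, newest 20% for
    validation, so the model never sees the future."""
    cut = int(len(series) * train_frac)
    return series[:cut], series[cut:]

def clip_extremes(z_scores: np.ndarray, sigma: float = 10.0) -> np.ndarray:
    """10-sigma clipping on normalized features: Black Swan spikes are
    capped so they cannot blow up gradients during backpropagation."""
    return np.clip(z_scores, -sigma, sigma)
```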
The training universe spans multiple asset classes:
- Risk & Momentum: BTC, ETH, SOL, BNB
- Market Health (Indices): S&P 500, Dow Jones, Nasdaq 100, DAX, Nikkei 225
- Market Movers (Big Tech): NVIDIA, Apple, Microsoft, Google, Amazon, Tesla, Meta
- Forex & Macro: EUR/USD, USD/JPY, GBP/USD, AUD/USD, USD/CAD
- Commodities & Energy: Gold, Silver, Crude Oil
- Macro Drivers: VIX (Fear Index), DXY (Dollar Index), Treasury Yields (TNX)
7. PERFORMANCE V2
Current performance, measured on the global validation dataset (data never seen by the model during training), shows an exceptional level of predictive accuracy for a non-linear financial system.
- High Price MAPE: 1.48%
- Low Price MAPE: 1.20%
- Close Price MAPE: 1.05%

*MAPE: Mean Absolute Percentage Error. A value of 1.05% on the Close means that, on average, the model's prediction deviates by only about 1% from the actual closing price of the next day.
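For reference, the metric itself is straightforward to compute; a minimal NumPy sketch:

```python
import numpy as np

def mape(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Mean Absolute Percentage Error, as reported above (in %)."""
    return float(np.mean(np.abs((y_true - y_pred) / y_true)) * 100)

# Example: a forecast of 101.0 against an actual close of 100.0 is a 1.0% error.
assert abs(mape(np.array([100.0]), np.array([101.0])) - 1.0) < 1e-9
```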
V1: CLASSIC MODEL (DEPRECATED)
Note: This section describes the previous technology, decommissioned in Q3 2024 due to limitations in handling regime changes (static nature).
8. ARCHITECTURE V1
The system primarily operates through two integrated logical components:
MODELMANGO PREDICTION
Base AI Prediction Model:
- Responsible for loading historical OHLCV (Open, High, Low, Close, Volume) data for a specific asset.
- Performs a preprocessing and feature engineering phase, transforming raw data to extract meaningful information.
- Uses a pre-trained transformer model called TransformerTimeSeriesMACModel to generate High, Low, and Close price predictions for the next day (T+1).
MODELMANGO STRATEGY
AI Model for direct stock market operations:
- Loads its specific configuration and parameters.
- Uses the base HLC predictions generated by the previous model as fundamental input.
- Loads historical OHLCV data and the predictions previously generated by the base model (the full history of MODELMANGO PREDICTION outputs for a given asset, since its listing date) to create a richer feature set.
- Performs even more advanced feature engineering.
- Prepares data for inference, including a placeholder for day T+1 filled with the base predictions.
- Loads the trained strategy model.
- Performs inference to obtain:
  - An adjustment delta for the entry price (relative to the predicted Low).
  - An adjustment delta for an implicit exit price (relative to the predicted Close).
  - A volatility prediction specific to the strategy model.
- Calculates operational levels:
  - Optimized Entry Price.
  - Stop Loss Price (based on predicted volatility and potentially adaptive).
  - Take Profit Price (based on volatility, desired Risk/Reward, and potentially the exit delta).
- Applies decision logic (based on configurable thresholds for exit delta, volatility, and R/R) to determine the final signal: LONG ENTRY or HOLD/NO ENTRY.
- Returns a structured dictionary with the decision and all calculated parameters.
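The exact formulas for the operational levels are not published; the following is a hypothetical reconstruction of how such levels could be derived from the strategy outputs (every name and formula here is an assumption, not the shipped logic):

```python
def operational_levels(pred_low: float, pred_close: float,
                       entry_delta: float, exit_delta: float,
                       pred_volatility: float, risk_reward: float = 2.0) -> dict:
    """Sketch: turn the strategy model's deltas and volatility into levels."""
    entry = pred_low * (1 + entry_delta)          # optimized entry near the predicted Low
    stop_loss = entry * (1 - pred_volatility)     # stop scaled by predicted volatility
    exit_target = pred_close * (1 + exit_delta)   # implicit exit near the predicted Close
    risk = entry - stop_loss
    take_profit = max(exit_target, entry + risk_reward * risk)  # honor desired R/R
    return {"entry": entry, "stop_loss": stop_loss, "take_profit": take_profit}

# Example: entry delta +0.5% above predicted Low, 2% predicted volatility.
print(operational_levels(98.0, 101.0, 0.005, 0.002, 0.02))
```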
9. BASE MODEL V1
The core of the system is a modified Transformer model, specifically designed for financial time series:
- Transformer Architecture: Leverages the self-attention mechanisms of Transformers, excellent at identifying complex relationships and long-term dependencies in sequential data.
- Memory as Context (MAC): Implements a memory mechanism inspired by recent research to improve long-term context management and adaptation:
  - Persistent Memory: "Learnable" tokens that maintain general and stable information over time.
  - Memory Module (M): A deep MLP (Multi-Layer Perceptron) that learns to map contextual queries to relevant memory representations (u_C).
  - Online Update: The memory M is updated during inference using the gradient of the loss between the memory retrieved for a key derived from the current chunk and the value associated with that chunk. This allows the model to quickly adapt to the recent dynamics of the specific asset being examined, even if it was not part of the main training. It uses momentum (eta), gradient intensity (theta), and forgetting (alpha) for stable updates (see the sketch after this list).
  - Long-Term Memory: A buffer that accumulates representations of past chunks, queried to retrieve additional context.
- 1D Depthwise-Separable Convolutions: Applied to query, key, and value projections before attention and memory update. This helps capture local and spatial patterns within features efficiently.
- Chunking: The input sequence is processed in segments ("chunks") to handle long sequences and allow the integration of the MAC mechanism at each step.
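A minimal PyTorch sketch of what the eta/theta/alpha online update could look like; only the three coefficients come from the text, while the signature, loss choice (MSE), and default values are assumptions:

```python
import torch
import torch.nn.functional as F

def mac_online_update(memory_M, key, value, momentum_state,
                      eta=0.9, theta=0.01, alpha=0.001) -> float:
    """Sketch: the 'surprise' is the gradient of the loss between what M
    retrieves for the chunk's key and the chunk's value; eta (momentum),
    theta (gradient intensity), and alpha (forgetting) stabilize the update."""
    retrieved = memory_M(key)                    # M's current answer for this key
    loss = F.mse_loss(retrieved, value)
    grads = torch.autograd.grad(loss, list(memory_M.parameters()))
    with torch.no_grad():
        for p, g, s in zip(memory_M.parameters(), grads, momentum_state):
            s.mul_(eta).add_(g, alpha=-theta)    # momentum on the surprise signal
            p.mul_(1 - alpha).add_(s)            # forgetting, then apply the update
    return loss.item()

# momentum_state would be initialized once, e.g.:
# momentum_state = [torch.zeros_like(p) for p in memory_M.parameters()]
```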
10. PERFORMANCE & GENERALIZATION V1
Despite primary training on Bitcoin and the relatively small size of the model (4,426,509 parameters), the system demonstrates remarkable performance across a wide range of global assets (around 10,000 assets including stocks, cryptocurrencies, indices, Forex), as highlighted below by the recent average MAPE (Mean Absolute Percentage Error) metrics on the base HLC predictions:
- Average High Price MAPE: 1.26%
- Average Low Price MAPE: 1.31%
- Average Close Price MAPE: 1.38%
Across 41 different assets, including: AAPL.US, MSFT.US, GOOGL.US, BTC-USD.CC, ETH-USD.CC, XOM.US, EURUSD.FOREX, GSPC.INDX, etc.
(The model's performance is constantly updated and available at this address: https://www.modelmango.co/performance on a basket of 41 assets divided into categories)
How is this performance possible with a "small" model trained on only one asset?
Several factors contribute to explaining this surprising generalization capability:
- Robust Feature Engineering: Preprocessing and feature creation transform raw prices into more abstract representations that capture behavioral dynamics and patterns (trend, momentum, mean reversion, volatility) rather than absolute price levels (see the sketch after this list). These underlying dynamics are often "universal" across different financial markets, even if they manifest with different intensities and scales.
- Learning Fundamental Patterns: By training on 5900 days of Bitcoin, an asset that has gone through multiple market regimes (bull, bear, sideways, high/low volatility), the model had the opportunity to learn these fundamental "price action" patterns and the relationships between derived technical indicators.
- Power of Transformers: Even with "only" 4.4M parameters, the Transformer architecture is inherently powerful in modeling complex, non-linear dependencies within the engineered feature sequences.
- Adaptation via MAC: The Memory-as-Context (MAC) mechanism, particularly its "online update" during inference, plays a crucial role. It allows the model, trained on Bitcoin's general patterns, to dynamically adapt to the specifics of the asset it is analyzing. When processing a new chunk (a portion of the input data) of, for example, Apple stock (using historical prices obtained daily via the EODHD API), the M memory updates slightly to reflect recent dynamics, improving the relevance of predictions for that specific asset. The model not only "remembers" the past but "learns how to learn" from the current context.
- Lower Risk of Specific Overfitting: A smaller model might be less prone to excessively "memorizing" the specific idiosyncrasies of the training dataset (Bitcoin) compared to huge models (billions of parameters). This can foster generalization, as the model is forced to focus on the most robust and transferable patterns.
- Focus on Relative Predictions (Strategy): The strategy model, instead, does not predict absolute prices but "adjustment deltas" and "volatility". This additional level of abstraction can make the strategy more robust against errors still present in the base model's price prediction.
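As an illustration of the kind of scale-free features described above (the actual feature set is proprietary; these are standard stand-ins in pandas):

```python
import pandas as pd

def engineer_features(ohlcv: pd.DataFrame) -> pd.DataFrame:
    """Illustrative features that capture dynamics, not price levels."""
    f = pd.DataFrame(index=ohlcv.index)
    ret = ohlcv["close"].pct_change()
    f["return_1d"] = ret                               # daily return (trend direction)
    f["momentum_10d"] = ohlcv["close"].pct_change(10)  # medium-term momentum
    f["volatility_20d"] = ret.rolling(20).std()        # realized volatility
    ma20 = ohlcv["close"].rolling(20).mean()
    f["dist_from_ma20"] = ohlcv["close"] / ma20 - 1    # mean-reversion pressure
    return f.dropna()
```

Because every column is a ratio or a statistic of ratios, the same features are comparable across a $10 stock and a $30,000 crypto asset, which is exactly what makes cross-asset transfer plausible.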
In summary, MODELMANGO's ability to generalize so effectively with extremely limited resources stems from a combination of extracting universal patterns through feature engineering, the Transformer's capacity to model these features, and an adaptive memory mechanism (MAC) that allows for "on-the-fly" specialization to the current asset during inference.