How Machine Learning Uses Alternative Data in Trading

Disclaimer

All articles are for educational purposes only and are not to be taken as advice to buy or sell. Please do your own due diligence before committing to any trade or investment.

Machine learning is changing trading by using alternative data – non-traditional information like social media, satellite imagery, and web traffic. Here’s the key takeaway: traders who use alternative data with machine learning can gain faster, more detailed insights than those relying on conventional sources alone. For instance, a 2024 J.P. Morgan study found that blending alternative data with traditional signals led to up to 3% higher annual returns for hedge funds.

Key Points:

  • What is Alternative Data?
    It includes consumer transactions, social media sentiment, satellite images, and more. These sources reveal trends earlier than official reports.
    Example: Satellite data tracks port activity, while social media sentiment features have been reported to reach forecast accuracies as high as 87% for event-driven signals.
  • How Machine Learning Helps:
    Machine learning models process this data to uncover patterns humans might miss. Techniques like NLP (Natural Language Processing) analyse text, while deep learning handles image and time-series data.
  • Singapore-Specific Insights:
    Traders in Singapore can leverage data like mall footfall, Changi Airport traffic, and digital payment trends. However, compliance with the Personal Data Protection Act (PDPA) is crucial.
  • Steps to Use Alternative Data:
    1. Acquire data through APIs, vendors, or scraping (ensure legal compliance).
    2. Clean and preprocess data – handle missing values, remove duplicates, and align timestamps.
    3. Engineer features, such as sentiment scores or satellite image metrics, to create trading signals.
    4. Train machine learning models while avoiding overfitting by using time-based validation methods.
  • Combining Data for Better Results:
    Blending alternative data with conventional indicators (e.g., momentum, valuation ratios) often leads to stronger trading strategies.

Alternative data is reshaping trading, especially in Singapore, where digital and regional data provide a competitive edge. By following a structured approach – cleaning data, engineering features, and training models – you can transform raw information into actionable trading insights.

Preparing Alternative Data for Machine Learning Models

For machine learning (ML) models to generate reliable trading signals, alternative data must be clean and well-structured. Although this process can be time-intensive, it lays the groundwork for effective feature engineering and model training.

How to Acquire Alternative Data

Traders can access alternative data through various channels, each with its own advantages and challenges:

  • APIs and Data Feeds: These provide structured access to sources like social media sentiment, news, ESG scores, and web traffic metrics. While APIs often require managing rate limits and authentication, the data is usually delivered in predictable formats like JSON or CSV, making it easier to integrate into workflows.
  • Commercial Providers: Vendors offer curated datasets, including credit card transactions, geolocation signals, and satellite imagery. Although these datasets come partially cleaned, traders still need to validate them for specific use cases and ensure compliance with Singapore’s regulatory requirements, such as the Personal Data Protection Act (PDPA).
  • Web Scraping: By using tools like Python’s requests and BeautifulSoup, traders can collect data from e-commerce sites, job postings, and company websites. However, it’s crucial to respect robots.txt files, implement rate limiting, and store raw HTML logs for reproducibility. For Singapore-based traders, ensuring scraping activities comply with local regulations and website terms of service is essential.
  • Local Proprietary Data: Data from sources like Collin Seow Trading Academy webinars offers tailored insights into Singapore’s market. Metrics such as course engagement data or order-book microstructure information can provide an edge.
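
As a rough illustration of the scraping safeguards mentioned above, the sketch below checks a robots.txt policy and spaces out requests using only the Python standard library. The robots.txt text, the bot name, and the example.com URLs are hypothetical placeholders, and the helper names `is_allowed` and `RateLimiter` are ours rather than part of any library.

```python
import time
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the supplied robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


class RateLimiter:
    """Enforce a minimum gap, in seconds, between consecutive requests."""

    def __init__(self, min_interval: float) -> None:
        self.min_interval = min_interval
        self._last_call = None

    def wait(self) -> None:
        """Sleep just long enough to respect the minimum interval."""
        if self._last_call is not None:
            remaining = self.min_interval - (time.monotonic() - self._last_call)
            if remaining > 0:
                time.sleep(remaining)
        self._last_call = time.monotonic()


# Hypothetical robots.txt content, for illustration only.
EXAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""
```

In a real scraper you would call `limiter.wait()` before each `requests.get(...)` and skip any URL for which `is_allowed` returns False. Passing a robots.txt check does not replace a review of the site's terms of service.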

Cleaning and Preprocessing Data

Raw alternative data is often messy and unstructured, so standardising it is key. This includes aligning column names, types, encodings, and units to enable seamless dataset merging. For example, transaction amounts should be converted to Singapore dollars (S$) with consistent formatting (e.g., 1,234.56).

Handle missing values with methods suited to the data type. For quarterly ESG scores, forward-filling within reasonable limits might work, while high-frequency data like web traffic may require interpolation or the creation of flags for missing entries.
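
A minimal pandas sketch of the two approaches just described, assuming the data arrives as a `pd.Series`. The function names and the default limit of two periods are illustrative choices, not recommendations.

```python
import pandas as pd


def fill_low_frequency(scores: pd.Series, max_periods: int = 2) -> pd.DataFrame:
    """Forward-fill gaps for at most max_periods rows and flag what was missing."""
    return pd.DataFrame(
        {
            "score": scores.ffill(limit=max_periods),
            "was_missing": scores.isna(),
        }
    )


def interpolate_high_frequency(visits: pd.Series) -> pd.DataFrame:
    """Linearly interpolate interior gaps and flag what was missing.

    Interpolation looks at the next observed value, so it suits offline
    cleaning; inside a live signal it would leak future information.
    """
    return pd.DataFrame(
        {
            "visits": visits.interpolate(method="linear", limit_area="inside"),
            "was_missing": visits.isna(),
        }
    )
```

Keeping the `was_missing` flag as its own column lets a model learn whether the absence of data is itself informative.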

Other essential steps include:

  • Standardisation and Outlier Handling: Use winsorisation to cap extreme values, or z-score filtering to flag and remove them.
  • Duplicate Removal: Eliminate repeated text entries, spam, or bot activity in social media feeds to ensure sentiment analysis reflects genuine signals.
  • Text Data Preprocessing: Apply natural language processing (NLP) techniques to extract sentiment scores. Transformer-based embeddings can help capture deeper semantic meanings.
  • Sensor and Image Data: For satellite imagery or footfall sensors in locations like Singapore’s shopping malls, preprocess data using geospatial alignment, artefact removal, normalisation, and feature extraction through convolutional neural networks (CNNs) or handcrafted metrics.
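
Two of the steps above, winsorisation and duplicate removal, can be sketched as follows. The percentile bounds and the whitespace and case normalisation rule are assumptions for illustration; real spam and bot filtering usually needs more than exact-match deduplication.

```python
import numpy as np


def winsorise(values: np.ndarray, lower_pct: float = 1.0, upper_pct: float = 99.0) -> np.ndarray:
    """Cap values below/above the given percentiles at those percentiles."""
    low, high = np.percentile(values, [lower_pct, upper_pct])
    return np.clip(values, low, high)


def drop_duplicate_posts(posts: list) -> list:
    """Keep the first copy of each post, ignoring case and extra whitespace."""
    seen = set()
    unique = []
    for text in posts:
        key = " ".join(text.lower().split())
        if key not in seen:
            seen.add(key)
            unique.append(text)
    return unique
```

To avoid look-ahead bias, the percentile bounds should be estimated on the training window only and then reused on later data.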

Finally, synchronise timestamps with market events to ensure data aligns with trading activities.

Aligning Alternative Data with Market Data

Proper time alignment is critical to avoid look-ahead bias. Alternative data should be consistently timestamped – usually in UTC – while also accounting for local trading schedules, such as SGX trading hours (UTC+8). For instance, social media sentiment collected up to 17:00 SGT can inform the next trading day’s opening prices if appropriately lagged.

Resampling helps transform irregular data into formats that match your trading horizon. High-frequency data like web traffic can be aggregated into hourly or daily intervals using metrics like sums, averages, or volatility estimates. Lower-frequency data, such as weekly satellite imagery, may be forward-filled, with the original update frequency clearly noted. Event-based data, like news or corporate filings, can be converted into features such as event counts over a set period, time elapsed since the last event, or rolling averages of sentiment scores.
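
The resampling ideas above might look like this in pandas, assuming each series carries a `DatetimeIndex`. The column names and the six-day staleness limit are illustrative assumptions.

```python
import pandas as pd


def daily_traffic_features(hourly_visits: pd.Series) -> pd.DataFrame:
    """Aggregate an hourly series into daily sum, mean and standard deviation."""
    daily = hourly_visits.resample("D")
    return pd.DataFrame(
        {
            "visits_sum": daily.sum(),
            "visits_mean": daily.mean(),
            "visits_std": daily.std(),
        }
    )


def weekly_to_daily(weekly: pd.Series, max_stale_days: int = 6) -> pd.DataFrame:
    """Forward-fill a weekly series onto a daily grid and record its staleness."""
    daily = weekly.resample("D").asfreq()
    observed = daily.notna()
    filled = daily.ffill(limit=max_stale_days)
    # Count days since the last genuine observation (0 on update days).
    days_since = observed.groupby(observed.cumsum()).cumcount()
    return pd.DataFrame({"value": filled, "days_since_update": days_since})
```

The `days_since_update` column keeps the original update frequency visible to the model, so a forward-filled value is not mistaken for a fresh one.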

To ensure models use only historically available data, adapt resampling methods to SGX trading hours and local holidays. Maintaining a central trading calendar that programmatically accounts for these nuances ensures that models operate on realistic and reliable data.
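
One way to encode the lagging rule and trading calendar described above is sketched below with the standard library. It assumes a conservative daily batching convention: anything observed after the 17:00 SGT cutoff is treated as belonging to the following day's collection window. The holiday set is a hypothetical input that you would populate from the official SGX calendar.

```python
from datetime import date, datetime, time, timedelta, timezone

SGT = timezone(timedelta(hours=8))  # Singapore has no daylight saving time
CUTOFF = time(17, 0)  # 17:00 SGT, taken from the example in the text


def next_trading_day(day: date, holidays: frozenset = frozenset()) -> date:
    """Return the first weekday after `day` that is not a listed holiday."""
    candidate = day + timedelta(days=1)
    while candidate.weekday() >= 5 or candidate in holidays:
        candidate += timedelta(days=1)
    return candidate


def first_usable_session(ts_utc: datetime, holidays: frozenset = frozenset()) -> date:
    """Map a UTC observation timestamp to the first session allowed to use it."""
    local = ts_utc.astimezone(SGT)
    collection_day = local.date()
    if local.time() > CUTOFF:
        # Past the cutoff: the observation joins the next day's batch.
        collection_day += timedelta(days=1)
    return next_trading_day(collection_day, holidays)
```

For example, a post stamped Thursday 16:00 SGT maps to Friday's session, while one stamped Friday 16:00 SGT maps to Monday's.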

Feature Engineering for Alternative Data

Once you’ve cleaned and aligned your alternative data, the next step is feature engineering – turning raw inputs into meaningful signals that guide trading decisions. This process involves converting data into numeric features suitable for machine learning, a step that plays a crucial role in how effectively your model identifies trading signals.

Creating Features from Structured Data

Structured data, like credit card transactions or web traffic, often arrives in a tabular format. From this, you can derive metrics such as growth rates (e.g., month-over-month changes) or normalised indicators (e.g., z-scores) to uncover shifts in consumer behaviour. For example, with credit card data, you might calculate revenue growth proxies by aggregating transaction volumes by company or sector. Then, compute z-scores against historical averages to identify anomalies. Similarly, for web traffic, you could create features like daily visit z-scores, growth rates (e.g., a 20% spike might signal pre-earnings hype), or app download trends. Research suggests that combining web traffic data with market data can improve revenue prediction accuracy by about 10%.

Beyond basic metrics, advanced techniques like rolling averages and volatility measures can help smooth out noise and highlight demand stability. Percentile ranks are another useful tool, allowing you to compare a company’s current activity against its historical performance or its peers. For traders in Singapore, tracking regional e-commerce trends during events like Singles’ Day could reveal early signals, such as changes in basket size or customer churn patterns, even before official earnings reports.
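
A small pandas sketch of the growth-rate and z-score features discussed above, assuming one aggregated spend figure per period. The 12-period window is an arbitrary illustrative default.

```python
import pandas as pd


def growth_and_zscore(spend: pd.Series, window: int = 12) -> pd.DataFrame:
    """Period-over-period growth plus a z-score against a trailing baseline."""
    growth = spend.pct_change()
    # shift(1) so the baseline at time t uses only observations before t.
    baseline_mean = spend.shift(1).rolling(window).mean()
    baseline_std = spend.shift(1).rolling(window).std()
    zscore = (spend - baseline_mean) / baseline_std
    return pd.DataFrame({"spend": spend, "growth": growth, "zscore": zscore})
```

Excluding the current observation from its own baseline keeps a genuine spike from dampening its own z-score. A flat baseline with zero standard deviation produces infinite or NaN values, which should be handled before modelling.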

The next challenge lies in transforming unstructured text into actionable quantitative signals.

Extracting Features from Text Data

Unstructured text – whether from social media posts, news articles, or earnings call transcripts – requires natural language processing (NLP) to extract trading signals. Sentiment analysis is a common technique here. Tools like VADER or FinBERT can score text sentiment on a scale from -1 (negative) to +1 (positive), capturing the overall market mood. You can then aggregate these scores into daily averages, measure their volatility (via standard deviation), or calculate volume-adjusted sentiment (multiplying the number of posts by the average score). Studies show that social media sentiment features can achieve forecast accuracies as high as 87% for event-driven signals.
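
The aggregation step can be sketched as below, assuming each post has already been scored between -1 and +1 by whichever sentiment tool you use. The scoring itself is not shown, and the `(day, score)` input format is a hypothetical convention.

```python
from collections import defaultdict
from statistics import mean, pstdev


def daily_sentiment_features(scored_posts) -> dict:
    """Aggregate (day, score) pairs into per-day sentiment features."""
    by_day = defaultdict(list)
    for day, score in scored_posts:
        by_day[day].append(score)

    features = {}
    for day, scores in sorted(by_day.items()):
        avg = mean(scores)
        features[day] = {
            "n_posts": len(scores),
            "mean": avg,
            "dispersion": pstdev(scores),  # population standard deviation
            "volume_adjusted": len(scores) * avg,  # post count x average score
        }
    return features
```

Note that post count multiplied by the average score equals the sum of scores, so this feature rewards days with many posts that agree in direction.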

To go deeper, word embeddings like Word2Vec, GloVe, or BERT convert text into numeric vectors that capture semantic relationships. Finance-specific models like FinBERT are especially adept at interpreting terms like “liquidity” or “default risk” in context. Topic modelling techniques, such as LDA or BERTopic, can identify recurring themes – like supply chain disruptions or regulatory changes – and quantify the proportion of each theme in a document. These features can help detect shifts in corporate narratives or market focus that often precede price movements.

For traders in Singapore analysing SGX announcements or regional news, entity-level sentiment (scoring mentions of specific companies or sectors) and tone-based features (e.g., uncertainty, forward-looking statements, or litigation-related terms) add depth. Combining sentiment analysis with traditional signals often yields better results than relying on either source alone.

Finally, let’s explore how sensor and image data can provide unique insights.

Converting Sensor and Image Data into Features

Sensor and image data require more advanced processing techniques. Convolutional neural networks (CNNs) are the go-to method here, as they apply layers of filters to extract patterns from images. For instance, satellite images of Singapore’s port facilities or retail car parks can be analysed using CNNs to count visible objects like ships, cars, or containers – proxies for economic activity. Studies have shown that parking lot car counts can improve earnings estimate accuracy by about 18%, as they reflect foot traffic ahead of official reports.

Rather than relying on raw counts, you can use change metrics (e.g., day-to-day or week-to-week differences) to capture shifts that might impact revenue. Alternatively, extract high-level features from the CNN’s final hidden layer – these latent features encapsulate complex visual patterns without requiring manual interpretation. For IoT sensor data, such as temperature readings from shipping vessels or footfall sensors in malls, traditional time-series techniques apply. Rolling means, volatility, Fourier transforms for periodic patterns, and lag features are all effective ways to process this data. These engineered vectors can serve as real-time indicators of supply chain activity or consumer demand across Singapore and the ASEAN region.
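
For the time-series side, here is a hedged NumPy sketch of three of the techniques named above: a rolling mean, a lag feature, and a Fourier transform used to find the dominant cycle. It assumes evenly spaced samples with no gaps, and the function names are illustrative.

```python
import numpy as np


def rolling_mean(values: np.ndarray, window: int) -> np.ndarray:
    """Trailing mean over `window` samples; the first window-1 entries are NaN."""
    out = np.full(len(values), np.nan)
    out[window - 1:] = np.convolve(values, np.ones(window) / window, mode="valid")
    return out


def lagged(values: np.ndarray, lag: int) -> np.ndarray:
    """Shift the series back by lag >= 1 samples, padding the start with NaN."""
    return np.concatenate([np.full(lag, np.nan), values[:-lag]])


def dominant_period(values: np.ndarray) -> float:
    """Period, in samples, of the strongest cycle found by a real FFT."""
    centred = values - values.mean()
    spectrum = np.abs(np.fft.rfft(centred))
    freqs = np.fft.rfftfreq(len(values), d=1.0)
    peak = int(np.argmax(spectrum[1:])) + 1  # skip the zero-frequency bin
    return float(1.0 / freqs[peak])
```

On daily mall footfall, a dominant period close to 7 would point to a weekly cycle that you may want to remove or model explicitly before looking for anomalies.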

Hedge funds that integrate multi-modal features – combining metrics like credit card growth, sentiment scores, and parking lot occupancy – have reported up to 3% higher annual returns than peers relying solely on traditional data. The key is ensuring all features are aligned to the same timestamps and rigorously validated on out-of-sample data to avoid overfitting.

Training Machine Learning Models with Alternative Data

Once your features are ready, the next step is building models that can effectively transform alternative data into actionable trading signals. The success of your strategy depends on the choice of model, validation process, and how predictions are converted into trades. These steps are crucial to ensure your strategy holds up in real-world market conditions.

Choosing the Right Machine Learning Models

The type of model you choose should align with the nature of your alternative data. Tree-based models like Random Forest, XGBoost, and LightGBM are ideal for structured data such as credit card transactions, web traffic, and sentiment scores. These models handle non-linear relationships and missing values efficiently, requiring minimal preprocessing. For traders in Singapore working with data from regional e-commerce platforms or SGX-listed companies, tree-based models strike a good balance between performance and training speed.

Linear models, such as Logistic Regression or Elastic Net, are great starting points when dealing with high-dimensional, sparse data like text-based features from social media. They are fast, interpretable, and helpful for identifying key features – an important factor when explaining strategies to stakeholders or meeting MAS transparency requirements. For more complex data, deep learning models shine. LSTMs and Temporal Convolutional Networks are well-suited for sequential time-series data like high-frequency IoT sensor streams, while CNNs excel at processing visual data such as satellite images. Transformer-based models like FinBERT are particularly effective for analysing large volumes of financial text, including SGX announcements, earnings transcripts, or regional news, to extract sentiment features.

Training and Validating Models

Once you’ve selected a model, rigorous training and validation are essential. When working with alternative data, maintaining strict time-series discipline is critical to avoid look-ahead bias. Always split your data chronologically – train on historical data, validate on the next time period, and test on unseen data. Avoid random splits, as they risk leaking future information into the training set. Instead, use methods like walk-forward or expanding-window cross-validation to simulate real-time performance. For instance, train on data from 2020–2022, validate on Q1 2023, and test on Q2 2023, rolling forward as needed.
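
Expanding-window validation can be sketched as a small generator over row positions, assuming the rows are already sorted by time. The function name and parameters are illustrative; scikit-learn's `TimeSeriesSplit` offers a comparable ready-made splitter.

```python
def expanding_window_splits(n_samples: int, initial_train: int, test_size: int):
    """Yield (train_indices, test_indices) pairs in chronological order.

    The training window always starts at row 0 and grows by test_size
    each step, so every test block lies strictly after its training data.
    """
    train_end = initial_train
    while train_end + test_size <= n_samples:
        yield range(0, train_end), range(train_end, train_end + test_size)
        train_end += test_size
```

With 10 rows, an initial training window of 4 and a test size of 2, this yields three folds: train on rows 0 to 3 and test on 4 to 5, then train on 0 to 5 and test on 6 to 7, then train on 0 to 7 and test on 8 to 9.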

When assessing performance, don’t just rely on traditional metrics like accuracy or AUC. Focus on trading-specific metrics such as the Sharpe ratio (aim for >1.5 for risk-adjusted returns), maximum drawdown (keep it under 20% to manage volatility), Sortino ratio, hit rate, and turnover. Account for transaction costs, bid-ask spreads, and slippage – factors that are particularly relevant for SGX and regional markets. Backtests should also cover multiple market conditions, such as the 2008 financial crisis, the 2020 COVID crash, and the 2022 rate hikes, to ensure the model performs well across different scenarios and isn’t overfitted to a single environment. This disciplined approach ensures that your predictions are both reliable and actionable when converted into trading signals.
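
Two of the metrics above, plus a simple cost adjustment, can be computed as follows. This is a sketch that assumes daily returns expressed as decimals and 252 trading days per year; the linear cost model is a deliberate simplification of spreads and slippage.

```python
import math
from statistics import mean, stdev


def annualised_sharpe(returns, periods_per_year: int = 252, risk_free: float = 0.0) -> float:
    """Mean excess return over its sample std, scaled to an annual figure."""
    excess = [r - risk_free for r in returns]
    return mean(excess) / stdev(excess) * math.sqrt(periods_per_year)


def max_drawdown(returns) -> float:
    """Largest peak-to-trough fall of the compounded equity curve (0.2 = 20%)."""
    equity, peak, worst = 1.0, 1.0, 0.0
    for r in returns:
        equity *= 1.0 + r
        peak = max(peak, equity)
        worst = max(worst, 1.0 - equity / peak)
    return worst


def net_returns(gross, turnover, cost_per_unit: float):
    """Subtract a cost proportional to each period's turnover."""
    return [g - t * cost_per_unit for g, t in zip(gross, turnover)]
```

Computing the Sharpe ratio and drawdown on `net_returns` rather than gross returns gives a more realistic view of a high-turnover signal.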

Converting Model Predictions to Trading Signals

Once your model is trained, the next challenge is turning its predictions into trades. Raw outputs – whether probabilities, expected returns, or alpha scores – need to be translated into actionable positions. For classification models, set thresholds to determine long (e.g., >60% probability), flat (40%–60%), or short (<40%) positions. Fine-tune these thresholds through backtesting to optimise metrics like the Sharpe ratio. For regression models, rank stocks by predicted alpha and allocate capital proportionally, with higher scores receiving larger allocations, all within defined risk limits.
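
The threshold rule for classification models can be written as a small function. The 60% and 40% defaults simply mirror the example figures in the text and would normally be tuned through backtesting.

```python
def probability_to_position(p_up: float, long_above: float = 0.60, short_below: float = 0.40) -> int:
    """Map a predicted probability of an up-move to +1 (long), 0 (flat) or -1 (short)."""
    if p_up > long_above:
        return 1
    if p_up < short_below:
        return -1
    return 0
```

A wider flat zone trades less often, which usually lowers turnover and costs at the expense of fewer signals.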

Position sizing rules are key to managing exposure. Options include targeting annualised volatility (e.g., 10%), using Kelly criterion variants, or applying fixed fractional sizing (risking 1–2% per trade). Research shows that sizing based on prediction confidence – such as applying the Kelly formula with win probabilities derived from sentiment data (up to 87% forecast accuracy for event-driven signals) – can improve returns while reducing drawdowns.
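
The three sizing rules can be sketched as below. The half-Kelly default, the leverage cap, and the stop-based definition of risk per share are assumptions chosen for illustration, not figures from the research mentioned above.

```python
def kelly_fraction(win_prob: float, win_loss_ratio: float, scale: float = 0.5) -> float:
    """Scaled Kelly fraction, scale * (p - (1 - p) / b), floored at zero."""
    full_kelly = win_prob - (1.0 - win_prob) / win_loss_ratio
    return max(0.0, scale * full_kelly)


def fixed_fractional_shares(equity: float, risk_fraction: float,
                            entry_price: float, stop_price: float) -> int:
    """Whole shares such that hitting the stop loses about risk_fraction of equity."""
    risk_per_share = abs(entry_price - stop_price)
    return int((equity * risk_fraction) // risk_per_share)


def vol_target_leverage(target_vol: float, realised_vol: float, max_leverage: float = 2.0) -> float:
    """Scale exposure so expected volatility matches the target, up to a cap."""
    return min(max_leverage, target_vol / realised_vol)
```

For example, with S$100,000 of equity, 1% risk per trade, an entry at S$10.00 and a stop at S$9.50, `fixed_fractional_shares` returns 2,000 shares. Kelly sizing is very sensitive to the estimated win probability, which is why a fractional scale is commonly applied.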

Finally, integrate these signals into your portfolio using methods like mean-variance optimisation, risk-parity weighting, or constraint-based approaches. For Singapore-based traders managing multi-asset portfolios across US, China, and ASEAN markets, consider sector and country caps as well as maximum exposure limits per stock to maintain diversification. Ensure all P&L calculations are normalised to SGD, and monitor signal decay closely, as the competitive edge of alternative data can erode over time as more traders adopt similar datasets. Keeping detailed documentation of your data sources, feature engineering, and validation processes will help meet MAS regulatory standards and strengthen your risk management framework.

For those interested in advancing their systematic trading skills, Collin Seow Trading Academy offers courses and resources on strategy design, risk management, and model-based trading frameworks tailored to workflows involving alternative data.

Integrating Machine Learning Signals into Systematic Trading

Combining Alternative Data with Traditional Signals

When working with systematic trading strategies, alternative data should complement, not replace, traditional signals. By integrating machine learning-driven insights – like web traffic, sentiment, or transaction data – into a broader multi-factor framework, you can create a more balanced and effective model. For instance, combining these alternative signals with traditional indicators such as momentum, valuation ratios, and macroeconomic variables can yield stronger results. Start by standardising all signals into a common scale, like z-scores or percentile ranks, so that different metrics (e.g., sentiment scores and momentum indicators) are comparable across stocks and time periods.
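
A minimal sketch of the standardise-then-combine step, assuming each signal is a list with one value per stock for a single date. The signal names and weights are hypothetical.

```python
from statistics import mean, pstdev


def zscores(values: list) -> list:
    """Cross-sectional z-scores: subtract the mean, divide by the population std."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def combine_signals(signals: dict, weights: dict) -> list:
    """Weighted sum of z-scored signals, giving one composite score per stock."""
    standardised = {name: zscores(vals) for name, vals in signals.items()}
    n = len(next(iter(standardised.values())))
    return [
        sum(weights[name] * standardised[name][i] for name in standardised)
        for i in range(n)
    ]
```

Because every input is z-scored first, the weights express relative importance rather than compensating for different units.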

For those trading SGX-listed equities in Singapore, it’s wise to allocate weights between traditional and alternative factors based on the prevailing market conditions. A practical approach could involve going long on STI component stocks where both price momentum and machine learning sentiment scores rank in the top decile, provided macroeconomic conditions are also supportive. To avoid overfitting, begin with simple linear combinations or ranking methods. Only consider more complex machine learning meta-models after gathering enough live performance data to validate their effectiveness. Interestingly, 65% of hedge funds now incorporate alternative data, and a 2024 J.P. Morgan study revealed that blending it with traditional signals led to up to 3% higher annual returns.
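
The top-decile rule could be expressed as a simple ranking filter like the one below. The ticker names are placeholders, and the `macro_ok` flag stands in for whatever macroeconomic check you apply.

```python
from bisect import bisect_right


def percentile_ranks(scores: dict) -> dict:
    """Map each name to the fraction of names scoring at or below it."""
    ordered = sorted(scores.values())
    n = len(ordered)
    return {name: bisect_right(ordered, value) / n for name, value in scores.items()}


def long_candidates(momentum: dict, sentiment: dict, macro_ok: bool = True, cutoff: float = 0.9) -> list:
    """Names ranked above `cutoff` on both signals, or nothing if macro is unsupportive."""
    if not macro_ok:
        return []
    m, s = percentile_ranks(momentum), percentile_ranks(sentiment)
    return sorted(name for name in m if name in s and m[name] > cutoff and s[name] > cutoff)
```

Requiring agreement between two independent rankings reduces the number of trades, so it is worth checking how often the filter returns an empty list over your backtest period.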

Once these signals are integrated, it’s essential to evaluate their effectiveness across different asset classes and regions.

Cross-Asset and Regional Considerations

The impact of alternative data varies depending on the asset class and geographic region. For equities, consumer-focused data like credit card transactions, web traffic, and social media sentiment can enhance earnings forecasts. Commodities, on the other hand, benefit from data sources like satellite imagery, vessel IoT data, and weather forecasts, which help identify supply-chain disruptions or crop conditions. In FX trading, macro-sensitive data such as cross-border payment flows and trade activity often serve as short-term signals layered on top of carry and fundamental models.

For Singapore and broader APAC markets, regional consumer and trade data hold particular relevance. Metrics such as e-commerce activity, shipping flows, port traffic, and tourism figures directly influence SGX equities, REITs, and SGD currency pairs. Given the high rates of mobile and e-commerce usage across ASEAN, app usage and geolocation data can provide valuable insights. When trading in North Asian markets, it’s crucial to adopt language-specific natural language processing (NLP) models that can accurately process news and social sentiment in Chinese, Japanese, or Korean. Ensure that less mature data sources undergo thorough preprocessing to maintain data quality.

While signal development is key, operational and governance practices are equally important for systematic trading success.

Operational and Governance Best Practices

Data licensing and compliance must always come first. Maintain a centralised inventory detailing each data vendor’s licensing terms and any geographic restrictions. Ensure that all data is collected with proper user consent and complies with Singapore’s Personal Data Protection Act (PDPA) as well as MAS guidelines on technology risk. Partner only with data providers that explicitly grant rights for investment use and adhere to regulatory standards. Smaller funds in Singapore might consider starting with affordable trial datasets to test their effectiveness in controlled experiments before committing to more expensive premium feeds.

To manage operations effectively, set up real-time monitoring dashboards that track:

  • The freshness of your data and the distribution of features
  • Model outputs, with alerts for missing data or sudden distribution shifts
  • Anomalies in signal frequency

Evaluate signal performance by breaking down profit and loss (P&L) contributions from traditional factors versus alternative data components. Use walk-forward validation and shadow models to test new machine learning techniques in a paper trading environment before risking actual capital. Establish clear escalation protocols, such as reducing position sizes or temporarily deactivating a signal, when live performance deviates significantly from expectations. This structured approach ensures capital protection while diagnosing issues and aligns with the rule-based strategies emphasised by platforms like Collin Seow Trading Academy for Singapore traders adopting new signals.
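
Two of the monitoring checks listed above, data freshness and distribution shift, can be approximated with the standard library. The standard-error test and the threshold of 3 are illustrative assumptions; production systems often use richer drift measures.

```python
from datetime import datetime, timedelta, timezone
from statistics import mean, stdev


def is_stale(last_update: datetime, now: datetime, max_age: timedelta) -> bool:
    """True when the newest data point is older than the allowed age."""
    return now - last_update > max_age


def mean_shift_alert(reference: list, recent: list, z_threshold: float = 3.0) -> bool:
    """Flag when the recent mean sits too many standard errors from the reference mean."""
    standard_error = stdev(reference) / len(recent) ** 0.5
    z = (mean(recent) - mean(reference)) / standard_error
    return abs(z) > z_threshold
```

An alert from either check could feed the escalation protocol described above, for example by cutting position sizes until the feed is confirmed healthy.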

Conclusion

Machine learning is reshaping how traders leverage alternative data. By focusing on thorough data acquisition, cleaning, and feature engineering, traders can turn predictions into actionable signals. When paired with traditional indicators like momentum and valuation ratios, these strategies have proven effective. For instance, a 2024 study by J.P. Morgan revealed that hedge funds using such methods achieved up to 3% higher annual returns.

For Singapore traders dealing with SGX-listed equities and regional ASEAN markets, alternative data opens up real-time insights into areas like shipping routes, supply chains, and e-commerce trends. With the alternative data market forecasted to hit $273 billion by 2032, emerging sources such as IoT sensors and crypto data are creating fresh opportunities. However, success in this field requires strict adherence to MAS regulations, PDPA requirements, and protocols for identifying material non-public information.

The technical challenges of handling biased datasets, crafting meaningful features, and translating model outputs into reliable signals make education a key factor. Platforms like Collin Seow Trading Academy provide valuable resources for Singapore traders. These include free e-courses such as “Market Timing 101”, live webinars on “Systematic Trading Profits”, and the book The Systematic Trader v.2. These tools are designed to help traders bridge the gap between understanding machine learning concepts and applying them in a disciplined and risk-aware manner.

To get started, begin small. Use free datasets to test your methods, build dependable data pipelines with Python libraries, and set up real-time monitoring dashboards. Before scaling up, validate your approach through walk-forward testing and shadow models within paper trading environments. By combining technological advancements with a disciplined trading strategy, Singapore traders can position themselves for long-term success in the market.

“Success in trading is not just about making decisions; it’s about making informed decisions.” – Collin Seow

FAQs

How does machine learning use alternative data to enhance trading strategies?

Machine learning taps into alternative data sources like social media trends, satellite images, and transaction records to detect patterns and forecast market shifts. By analysing massive and complex datasets, it reveals insights that may escape traditional analysis.

The process starts with data preprocessing, where raw data is cleaned and structured. Next comes feature selection, which narrows down the most relevant variables for analysis. Finally, model training develops algorithms capable of producing precise trading signals. Together, these steps equip traders with the tools to make smarter decisions and capitalise on market opportunities.

How can traders in Singapore ensure compliance with data protection regulations?

To align with Singapore’s Personal Data Protection Act (PDPA), businesses need to follow several important steps to handle personal data responsibly. This starts with securing clear and explicit consent from individuals before collecting or using their personal information. Additionally, it’s crucial to ensure that the data collected is both accurate and relevant to its intended purpose.

Businesses must also put in place strong security measures to safeguard sensitive information. Access to personal data should be limited to authorised personnel only, with well-defined protocols to manage any potential data breaches swiftly and effectively.

Regular reviews of compliance practices are equally important to ensure your business stays up-to-date with any changes in the regulations. Staying proactive in this area helps maintain trust and adherence to the PDPA requirements.

How can traders combine alternative data with traditional trading signals effectively?

To make the most of alternative data alongside traditional trading signals, the first step is to preprocess and clean the data. This ensures that the information is accurate and relevant. Once the data is refined, traders can identify key features – like social media trends or satellite imagery – and combine them with technical and fundamental analysis.

Using machine learning models can reveal patterns and relationships that would otherwise go unnoticed. These insights not only sharpen decision-making but also improve timing for trades. Additionally, this method enhances risk management by helping traders fine-tune position sizing, ultimately boosting overall trading performance. Merging data-driven insights with a disciplined strategy reflects time-tested approaches for navigating fast-changing markets.

Bryan Ang

Bryan Ang is a financial expert with a passion for investing and trading. He is an avid reader and researcher who has built an impressive library of books and articles on the subject.
