01 // The Challenge

Threat intelligence platforms like AlienVault OTX serve up huge volumes of indicators of compromise — malicious IPs, URLs, domains, file hashes. Raw is easy. Actionable is the hard part. Analysts need the data shaped, filtered, and visualized before it's worth anything, and doing that by hand is expensive and error-prone.

The brief I set for myself: an end-to-end pipeline that collects, transforms, stores, visualizes, and interprets threat intel with as little human-in-the-loop as possible.

splunk · dashboard
Splunk — Intel Overview Dashboard · live visualization of 412K+ threat indicators

02 // The Pipeline

▸ extract
AlienVault OTX · Python SDK pulls pulses (threat metadata) and indicators (IoCs).
▸ transform
Pandas · Flatten and normalize OTX's nested JSON into relational shape.
▸ load
PostgreSQL 17 · Relational schema for pulses + indicators, tuned for query speed.
▸ visualize
Splunk · DB Connect feeds an Intel Overview Dashboard with 5 interactive panels.
▸ analyze
Grok + Claude · SQL metrics piped to two LLMs for independent, structured insight.
▸ scale
412,985 indicators · 7,128 pulses, 99.96% active — handled comfortably end-to-end.
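The extract and transform steps above reduce to one core move: flattening OTX's nested pulse JSON into two relational frames. A minimal sketch with `pandas.json_normalize`, using a simplified pulse shape (real OTX pulses carry many more fields, and the live pull goes through the OTXv2 SDK rather than a literal):

```python
import pandas as pd

# Simplified pulse shape, modeled on OTX's nested JSON (assumption:
# real pulses carry many more fields than shown here).
SAMPLE_PULSES = [
    {
        "id": "p1",
        "name": "Phishing campaign",
        "created": "2025-04-01T00:00:00",
        "indicators": [
            {"indicator": "1.2.3.4", "type": "IPv4"},
            {"indicator": "evil.example", "type": "domain"},
        ],
    },
]

def flatten_pulses(pulses):
    """Flatten nested pulse JSON into two relational frames:
    one row per pulse, one row per indicator (keyed back to its pulse)."""
    pulse_df = pd.json_normalize(pulses).drop(columns=["indicators"])
    indicator_df = pd.json_normalize(
        pulses, record_path="indicators", meta=["id"], meta_prefix="pulse_"
    )
    return pulse_df, indicator_df

pulse_df, indicator_df = flatten_pulses(SAMPLE_PULSES)
```

The two frames map directly onto the `pulses` and `indicators` tables in the load step, with `pulse_id` as the foreign key.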

03 // The Splunk Dashboard

The centerpiece is the Intel Overview Dashboard — five interactive panels, each answering a specific analyst question.

A TLP (Traffic Light Protocol) filter lets different analyst roles drill down to the data their clearance permits.

pycharm · ide
PyCharm IDE — ETL orchestration scripts under active development

04 // SQL → LLM Pipeline

Instead of letting an LLM loose on the raw database, I precompute the interesting metrics in SQL and feed the structured result to the model. Same prompt, same data, sent to both Grok and Claude — so the output is two independent interpretations I can cross-check against each other.
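The fan-out to the two models is a thin wrapper around their API clients; the part worth sketching is how the precomputed metrics get packed into one shared prompt. A hedged sketch — the metric names and prompt wording here are illustrative, not the exact ones in `src/sql_queries.py`:

```python
import json

def build_prompt(metrics: dict) -> str:
    """Render precomputed SQL metrics as one structured prompt, sent
    verbatim to both models so their answers are directly comparable."""
    return (
        "You are a threat intelligence analyst. Interpret these "
        "precomputed metrics and flag the top risks.\n\n"
        + json.dumps(metrics, indent=2)
    )

# Illustrative metric payload (assumed shape, not real query output).
metrics = {
    "top_countries": {"US": 1200, "UA": 800},
    "active_indicators": 412830,
    "expired_indicators": 155,
}
prompt = build_prompt(metrics)
```

Because the prompt is deterministic for a given metric payload, any divergence between the two models' answers reflects the models, not the input.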

Representative queries from src/sql_queries.py:

-- Top targeted countries (pulses)
SELECT country, COUNT(*) AS count
FROM (SELECT jsonb_array_elements_text(targeted_countries) AS country FROM pulses) AS sub
GROUP BY country ORDER BY count DESC LIMIT 5;

-- Active vs. expired indicators
SELECT
  SUM(CASE WHEN expiration < NOW() THEN 1 ELSE 0 END) AS expired,
  SUM(CASE WHEN expiration >= NOW() OR expiration IS NULL THEN 1 ELSE 0 END) AS active
FROM indicators;

-- Pulse trends over the last 6 months
SELECT DATE_TRUNC('month', created) AS month, COUNT(*) AS pulse_count
FROM pulses GROUP BY DATE_TRUNC('month', created)
ORDER BY month DESC LIMIT 6;
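The active-vs-expired split is easy to sanity-check outside the database. A pandas equivalent of that query, with column names assumed to match the `indicators` table (note the NULL-expiration rows count as active, mirroring the `OR expiration IS NULL` branch in the SQL):

```python
import pandas as pd

# Toy stand-in for the indicators table (assumed column names).
indicators = pd.DataFrame({
    "indicator": ["1.2.3.4", "evil.example", "old.example"],
    "expiration": pd.to_datetime(["2099-01-01", None, "2020-01-01"]),
})

now = pd.Timestamp.now()
# NaT < now is False, so NULL expirations never land in the expired bucket.
expired = int((indicators["expiration"] < now).sum())
# NULL expiration counts as active, matching the SQL above.
active = int(((indicators["expiration"] >= now) | indicators["expiration"].isna()).sum())
```

Running both against the same snapshot is a cheap cross-check that the SQL and the load step agree.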

05 // Results · May 15, 2025 snapshot

Grok's read emphasized hash-based indicator dominance, geopolitical targeting, and industry-specific risk. Claude leaned into critical-infrastructure targeting, sophistication signals (PowerShell + APT tags clustered), and the monthly pulse spike in April 2025. Both models independently flagged phishing and APT activity against government and finance as the top concern.

python · llm output
Grok + Claude — side-by-side LLM analysis of pre-computed SQL metrics

06 // What I Took From It

The biggest lesson is the one that keeps repeating across every project: modularity compounds. Separate modules for extract / transform / load / query / LLM analysis meant every piece could be swapped or extended without touching the rest.

07 // Try It

Open-source at github.com/marky224/Threat-Intel-ETL. Full setup: AlienVault OTX API key, PostgreSQL 17, Splunk with DB Connect, Java JRE 11, and API keys for Grok and Claude if you want the analysis layer. Clone, configure src/config.py, run setup_db.py, then python main.py.