sql · final table
Clean venture funding data retrieved via SQL query
// the destination — typed, queryable records of every scraped funding round

01 // The Goal

Venture capital is where tomorrow's tech giants show up first — Google, Tesla, Uber, and DoorDash all moved through multiple VC rounds before they were household names. Tracking that activity reveals developing trends in technology, culture, and market direction earlier than almost anywhere else.

VCNewsDaily publishes free details on the 30 most recent funding rounds at any time, with a paid catalog behind that. I set out to sidestep that limit: programmatically read the public HTML, clean it, and store it in a format that supports real analysis — SQL tables with proper datatypes.

02 // Why It Matters

Once this data is warehoused cleanly, teams across multiple disciplines can put it to use.

This project isn't the analysis layer. It's the scaffold that makes the analysis layer possible.

03 // The Source

VCNewsDaily's website is well-suited for visual browsing — but without buying the catalog, their search tools cap you at the last 30 rounds. Scraping the raw HTML gets past that ceiling.

04 // Extract · Three Pages per Round

Each funding round on VCNewsDaily spans three linked pages: a general summary, the funding-round details, and the company profile. The Python scraper walks all three per round and stitches the fields back together by key.
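A minimal sketch of that walk — the URL paths, page names, and two-column table layout are assumptions for illustration, not VCNewsDaily's actual markup:

```python
from urllib.request import urlopen

from bs4 import BeautifulSoup

BASE = "https://vcnewsdaily.com"  # real site; the path layout below is hypothetical


def parse_fields(html: str) -> dict:
    """Collect label/value pairs from a page's two-column detail table."""
    soup = BeautifulSoup(html, "html.parser")
    fields = {}
    for row in soup.select("table tr"):
        cells = [td.get_text(strip=True) for td in row.find_all("td")]
        if len(cells) == 2:  # label in the first cell, value in the second
            fields[cells[0]] = cells[1]
    return fields


def scrape_round(slug: str) -> dict:
    """Walk the three linked pages for one round, stitching fields by slug."""
    record = {"slug": slug}
    for page in ("summary", "funding", "profile"):  # hypothetical page names
        with urlopen(f"{BASE}/{page}/{slug}") as resp:
            record.update(parse_fields(resp.read().decode("utf-8")))
    return record
```

The shared slug acts as the join key, so fields scattered across the three pages collapse into one record per round.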

05 // Transform · From Raw to Clean

Raw scrape output gets merged row-by-row into a single raw dataframe — messy strings, inconsistent date formats, currency mixed into text, everything still looking like HTML.
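The merge itself is plain pandas. A sketch with made-up sample rows — column names and the `slug` key are illustrative, not the project's actual schema:

```python
import pandas as pd

# one toy record per page type, all keyed by the same round slug
summaries = pd.DataFrame([{"slug": "acme-series-a", "company": "Acme Inc.", "headline": "Acme Raises $5M"}])
fundings = pd.DataFrame([{"slug": "acme-series-a", "amount": "$5 million", "round": "Series A"}])
profiles = pd.DataFrame([{"slug": "acme-series-a", "category": "Fintech", "city": "Austin"}])

# inner-join the three page frames into one raw row per funding round
raw = summaries.merge(fundings, on="slug").merge(profiles, on="slug")
```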

pandas · raw data
Raw merged dataframe before cleaning
// raw stage — all three pages merged, nothing normalized yet

Pandas then normalizes dates, strips currency symbols to numeric amounts, standardizes company names and categories, and drops duplicates. The cleaned dataframe is ready for warehousing.
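The cleaning steps look roughly like this — the column names, date format, and the assumption that amounts are quoted in millions are all illustrative:

```python
import pandas as pd

# toy raw rows, including a duplicate to show the dedupe step
raw = pd.DataFrame({
    "company": ["Acme Inc.", "Acme Inc."],
    "date": ["Jan 5, 2021", "Jan 5, 2021"],
    "amount": ["$5 million", "$5 million"],
})

clean = raw.drop_duplicates().copy()

# normalize dates into real datetimes
clean["date"] = pd.to_datetime(clean["date"], format="%b %d, %Y")

# strip currency text down to a numeric amount
# (assumes the source always quotes amounts in millions)
clean["amount"] = (
    clean["amount"].str.replace(r"[^\d.]", "", regex=True).astype(float) * 1_000_000
)

# standardize company names: trim whitespace and a trailing "Inc." suffix
clean["company"] = (
    clean["company"].str.strip().str.replace(r"\s+Inc\.?$", "", regex=True)
)
```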

pandas · clean data
Cleaned dataframe ready for SQL load
// clean stage — typed, deduped, analysis-ready

06 // Load · Into SQL Server

The clean dataframe is written into MS SQL Server with explicit datatypes — dates as datetime, funding amounts as decimal, categorical fields as nvarchar with bounded length. Typed storage is what makes the next step (calculations, predictive modeling, joins with external datasets) actually practical.
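A sketch of the load using `DataFrame.to_sql` with an explicit dtype map — the table and column names are hypothetical, and the connection string is swapped for in-memory SQLite here so the snippet runs without a SQL Server instance:

```python
import pandas as pd
import sqlalchemy as sa

# in production this would be an mssql+pyodbc://... connection string
engine = sa.create_engine("sqlite:///:memory:")

# explicit datatypes: datetime for dates, decimal for amounts,
# bounded nvarchar for categorical fields
dtypes = {
    "company": sa.types.NVARCHAR(length=255),
    "category": sa.types.NVARCHAR(length=100),
    "date": sa.types.DateTime(),
    "amount": sa.types.Numeric(18, 2),
}

clean = pd.DataFrame({
    "company": ["Acme"],
    "category": ["Fintech"],
    "date": pd.to_datetime(["2021-01-05"]),
    "amount": [5_000_000.0],
})

clean.to_sql("funding_rounds", engine, if_exists="replace", index=False, dtype=dtypes)
```

Passing the `dtype` map keeps column types under the pipeline's control instead of letting pandas infer them at load time.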

07 // What I Took From It

The arc from this project to later work (the Threat-Intel ETL, the KPMG analysis) is a straight line: same three stages, bigger data, better tooling.

08 // Try It

Source notebook on github.com/marky224. Requires Python 3, BeautifulSoup, pandas, and a SQL Server instance (or SQLite with minor adjustments).