sql · final table
Clean venture funding data retrieved via SQL query
// the destination — typed, queryable records of every scraped funding round

01 // The Goal

Venture capital is where tomorrow's tech giants show up first — Google, Tesla, Uber, and DoorDash all moved through multiple VC rounds before they were household names. Tracking that activity reveals developing trends in technology, culture, and market direction earlier than almost anywhere else.

VCNewsDaily publishes free details on the 30 most recent funding rounds at any time, with a paid catalog behind that. I set out to sidestep that limit: programmatically read the public HTML, clean it, and store it in a format that supports real analysis — SQL tables with proper datatypes.

02 // Why It Matters

Once this data is warehoused cleanly, teams across multiple disciplines can put it to use.

This project isn't the analysis layer. It's the scaffold that makes the analysis layer possible.

03 // The Source

VCNewsDaily's website is well-suited for visual browsing — but without buying the catalog, their search tools cap you at the last 30 rounds. Scraping the raw HTML gets past that ceiling.

04 // Extract · Three Pages per Round

Each funding round on VCNewsDaily spans three linked pages: a general summary, the funding-round details, and the company profile. The Python scraper walks all three per round and stitches the fields back together by key.
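A minimal sketch of that walk — the URL paths, page names, and two-column table layout are assumptions for illustration, not VCNewsDaily's actual markup:

```python
from urllib.request import urlopen

from bs4 import BeautifulSoup

BASE = "https://vcnewsdaily.com"  # real site; the path layout below is hypothetical


def parse_fields(html: str) -> dict:
    """Collect label/value pairs from a page's two-column detail table."""
    soup = BeautifulSoup(html, "html.parser")
    fields = {}
    for row in soup.select("table tr"):
        cells = [td.get_text(strip=True) for td in row.find_all("td")]
        if len(cells) == 2:  # label in the first cell, value in the second
            fields[cells[0]] = cells[1]
    return fields


def scrape_round(slug: str) -> dict:
    """Walk the three linked pages for one round, stitching fields by slug."""
    record = {"slug": slug}
    for page in ("summary", "funding", "profile"):  # hypothetical page names
        with urlopen(f"{BASE}/{page}/{slug}") as resp:
            record.update(parse_fields(resp.read().decode("utf-8")))
    return record
```

The shared slug acts as the join key, so fields scattered across the three pages collapse into one record per round.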

05 // Transform · From Raw to Clean

Raw scrape output gets merged row-by-row into a single raw dataframe — messy strings, inconsistent date formats, currency mixed into text, everything still looking like HTML.
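The merge itself is plain pandas. A sketch with made-up sample rows — column names and the `slug` key are illustrative, not the project's actual schema:

```python
import pandas as pd

# one toy record per page type, all keyed by the same round slug
summaries = pd.DataFrame([{"slug": "acme-series-a", "company": "Acme Inc.", "headline": "Acme Raises $5M"}])
fundings = pd.DataFrame([{"slug": "acme-series-a", "amount": "$5 million", "round": "Series A"}])
profiles = pd.DataFrame([{"slug": "acme-series-a", "category": "Fintech", "city": "Austin"}])

# inner-join the three page frames into one raw row per funding round
raw = summaries.merge(fundings, on="slug").merge(profiles, on="slug")
```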

pandas · raw data
Raw merged dataframe before cleaning
// raw stage — all three pages merged, nothing normalized yet

Pandas then normalizes dates, strips currency symbols to numeric amounts, standardizes company names and categories, and drops duplicates. The cleaned dataframe is ready for warehousing.
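The cleaning steps look roughly like this — the column names, date format, and the assumption that amounts are quoted in millions are all illustrative:

```python
import pandas as pd

# toy raw rows, including a duplicate to show the dedupe step
raw = pd.DataFrame({
    "company": ["Acme Inc.", "Acme Inc."],
    "date": ["Jan 5, 2021", "Jan 5, 2021"],
    "amount": ["$5 million", "$5 million"],
})

clean = raw.drop_duplicates().copy()

# normalize dates into real datetimes
clean["date"] = pd.to_datetime(clean["date"], format="%b %d, %Y")

# strip currency text down to a numeric amount
# (assumes the source always quotes amounts in millions)
clean["amount"] = (
    clean["amount"].str.replace(r"[^\d.]", "", regex=True).astype(float) * 1_000_000
)

# standardize company names: trim whitespace and a trailing "Inc." suffix
clean["company"] = (
    clean["company"].str.strip().str.replace(r"\s+Inc\.?$", "", regex=True)
)
```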

pandas · clean data
Cleaned dataframe ready for SQL load
// clean stage — typed, deduped, analysis-ready

06 // Load · Into SQL Server

The clean dataframe is written into MS SQL Server with explicit datatypes — dates as datetime, funding amounts as decimal, categorical fields as nvarchar with bounded length. Typed storage is what makes the next step (calculations, predictive modeling, joins with external datasets) actually practical.
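A sketch of the load using `DataFrame.to_sql` with an explicit dtype map — the table and column names are hypothetical, and the connection string is swapped for in-memory SQLite here so the snippet runs without a SQL Server instance:

```python
import pandas as pd
import sqlalchemy as sa

# in production this would be an mssql+pyodbc://... connection string
engine = sa.create_engine("sqlite:///:memory:")

# explicit datatypes: datetime for dates, decimal for amounts,
# bounded nvarchar for categorical fields
dtypes = {
    "company": sa.types.NVARCHAR(length=255),
    "category": sa.types.NVARCHAR(length=100),
    "date": sa.types.DateTime(),
    "amount": sa.types.Numeric(18, 2),
}

clean = pd.DataFrame({
    "company": ["Acme"],
    "category": ["Fintech"],
    "date": pd.to_datetime(["2021-01-05"]),
    "amount": [5_000_000.0],
})

clean.to_sql("funding_rounds", engine, if_exists="replace", index=False, dtype=dtypes)
```

Passing the `dtype` map keeps column types under the pipeline's control instead of letting pandas infer them at load time.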

07 // What I Took From It

The arc from this project to later work (the Threat-Intel ETL, the KPMG analysis) is a straight line: same three stages, bigger data, better tooling.

08 // Try It

Source notebook on github.com/marky224. Requires Python 3, BeautifulSoup, pandas, and a SQL Server instance (or SQLite with minor adjustments).