
NLP and text analytics at scale with PySpark and notebooks


Who's who in a developer community, what do they discuss, and with whom? This project, built on Apache Spark, provides Python pipelines for scraping, parsing, and analyzing the discussion forums of a given Apache developer community, along with analysis of related meetup events and conference talks.

Messages are parsed with NLTK and TextBlob, then represented as JSON. Analytics pipelines, organized as notebooks, produce leaderboards with Spark SQL, predictive models with MLlib, and visualizations in Seaborn, with the data stored in Parquet. Code is available on GitHub.
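To give a feel for the parse-then-rank flow described above, here is a minimal sketch using only the Python standard library. It is not the project's code: the regex tokenizer is a stand-in for NLTK and TextBlob, the Counter-based ranking is a stand-in for a Spark SQL leaderboard query, and the function names, record fields, and sample messages are all hypothetical.

```python
import json
import re
from collections import Counter
from email import message_from_string


def parse_message(raw):
    """Turn one raw RFC 822-style message into a JSON-serializable dict."""
    msg = message_from_string(raw)
    body = msg.get_payload()
    # Naive regex tokenizer; an assumption standing in for NLTK/TextBlob.
    tokens = re.findall(r"[a-z0-9']+", body.lower())
    return {
        "sender": msg["From"],
        "subject": msg["Subject"],
        "tokens": tokens,
    }


def leaderboard(records, top_n=3):
    """Rank senders by message count (stand-in for a Spark SQL GROUP BY)."""
    counts = Counter(record["sender"] for record in records)
    return counts.most_common(top_n)


# Hypothetical sample messages, invented for illustration only.
RAW_MESSAGES = [
    "From: alice@example.org\nSubject: Parquet schema question\n\n"
    "How do I merge Parquet schemas?",
    "From: bob@example.org\nSubject: Re: Parquet schema question\n\n"
    "Try reading both files and comparing columns.",
    "From: alice@example.org\nSubject: Re: Parquet schema question\n\n"
    "Thanks, that worked.",
]

records = [parse_message(raw) for raw in RAW_MESSAGES]
json_lines = [json.dumps(record) for record in records]  # one JSON doc per message
top_posters = leaderboard(records)

if __name__ == "__main__":
    for line in json_lines:
        print(line)
    print(top_posters)
```

At scale, the same shape would presumably map onto a distributed DataFrame of JSON records with the ranking expressed as a grouped, ordered query; the single-machine version here only illustrates the data flow.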


