Trendster — Rawanyat

Demystifying trends and tracking topics across verticals.

Since news is structured in a rigid categorical format, it may be difficult to follow a topic of interest over time in one snapshot. Generally, if one is interested in a topic, it tends to not exhibit this type of compartmentalization, and is actually more fluid in nature, sitting at the intersection of these categorical verticals. My idea was to create a set of tools and visualizations to help track the evolution of a topic of interest in the media over time. The topic that I chose to explore was gender equality (aka feminism).

Pipeline

Filtering

I had to think extensively about how I would collect articles about my topic. The easiest solution would have been to perform a keyword search and pull out articles that mention any of the keywords. However, not only is it difficult to build a concept/theme around keywords, but depending on the keywords I chose, my subset was at risk of being biased. Given my time limit, a I chose to train a classifier on articles about my topic in order to filter out similar articles from my large corpus of NYT articles (which took 3 painful weeks to scrape from the NYT website).

• • •

I hand labeled 900 articles (100 articles about gender equality), split my labeled data into stratified train and test sets, and used the term frequency matrix of my train set to train a gradient boosting classifier. My model was comprised of 100 weak learners. It achieved a mean recall score of 0.78 and a mean precision score of 0.83 through cross validation. It achieved a recall and precision score of 0.85 when tested on the test set.

• • •

I used the term frequency matrix that was fitted on my training data to transform the rest of my New York Times articles and passed the matrix through my trained (and pickled) gradient boosting model. After carefully inspecting my data and adjusting my acceptance threshold to 0.67, my model classified 18,000 articles as relating to gender equality. I decided to increase my classification threshold as, in my case, I cared more about having a lower false negative rate (some true positives were worth the sacrifice).

• • •

The rows of the term frequency matrix represent the document space and the columns represent the term space. Each term frequency is normalized against the amount of terms in a given document. I used a term frequency matrix and not a term frequency inverse document frequency matrix (tfidf), which is further normalized on the number of documents in which a term occurs, as there is great variance in words used in articles.

Topic Modeling

To extract subtopics from my corpus of articles about gender equality, I used a non-negative matrix factorization model (NMF).

Key Takeaways

Relevant and time dependent categories.

My model was able to detect relevant and time dependent categories. In the context of gender equality and between the years of 1992 and 2004, the above subtopics were top of mind.

My model showed that these topics were talked about pretty consistently over these years.

These are some of the headlines that came up during those years.

"Clinton" was a big topic during that time, which makes sense that they would be top of mind due to Bill Clinton's presidency and Hilary Clinton's time in the senate. However, specifically within the context of gender equality was due to President Clinton's sexual harassment lawsuits, which started in 1997, and Hilary being the first first lady to serve in the senate.

Meaningful nuances between topics

My model was able to detect meaningful nuances between subtopics. For example, it made a distinction between articles about lawsuits about sexual assault, and actual reportings of rape and sexual assault

Github