What’s in a headline? NLP analysis in news

First post in my Newspaper Headlines Analysis NLP project series. I’ll explain how I picked the project, my plan and end goal. I’ll also cover the initial setup, project structure and tools.

“Disinformation does not mean false information. It means misleading information–misplaced, irrelevant, fragmented or superficial information–information that creates the illusion of knowing something but which in fact leads one away from knowing.”

Neil Postman, Amusing Ourselves to Death: Public Discourse in the Age of Show Business

My parents have watched the news for as long as I can remember, and I bet many can relate to that. I’ve never really seen the point of it. Once I asked my mom why she watches, and she told me she likes to know what is happening in the world. It’s a fair point, but… in the age of social media, clicks and likes, and 24-hour TV news, are we really being informed about anything, or does it just feel like that? I don’t know. Is there a way I can quantify all the skewness, irrelevance and volatility I (and maybe others) perceive? I don’t know that either, but I think it’ll be fun to try.

So this is the first post in a series about my Newspaper Headlines Analysis project (project repo). It’s my first time doing Natural Language Processing (NLP). Here I’ll explain how I picked the project (though you may have an idea already), what my plan and end goal are, how I did the initial setup, how I organise my projects, and which libraries I’m planning to use.

Table of Contents

  1. Crafting the project
  2. Getting started

Crafting the project

Ever since I started my Data Analysis journey, I’ve heard that it’s better to build your own projects when you want to truly learn something. The question then becomes: “How do you choose one?”. Often you can draw on the industry you’re in: maybe you have a dataset you want to explore, or you know of a problem at work that could be solved with data. But to be honest, I think the best projects sometimes come from sheer curiosity.

In the beginning, that was all I had: I wanted to do something around my country’s news coverage. To narrow the idea down, I asked myself a few questions, in no particular order.

  1. What kind of media am I going to analyse?
  2. Is there available technology to do the analysis with?
  3. And the data, where do I get it from?
  4. What methods do I want to use?
  5. What will be my end product? Will I have one?

In the end, I chose to focus on written media, because I wanted to do something with text. That settles questions 1, 4 and maybe 2. The analysis would be done using NLP, and a quick Google search introduced me to NLTK, spaCy and a world of tutorials. (At the end of the post I’ll share a few resources that helped me get started and broadened my understanding of the NLP field.)

Without data, there is no project

Now, the data. At first I thought about scraping the newspapers’ web pages, which meant using tools like BeautifulSoup or Selenium. It would also mean building some kind of automated job, perhaps with AWS Lambda, to run every day and collect the headlines and articles. That’s a lot of work just to get started, although taking that route would sharpen my skills in collecting data from multiple sources (a bunch of dirty data at that).

But then I remembered Twitter and noticed that all the main newspapers in my country post their headlines there. It has a good, easy-to-access API, which means the data won’t be extremely messy: it comes in clean JSON. That does take away some of the initial fun and complexity of the project by making the collection more straightforward, but on the flip side it means I can gather larger volumes of data more easily. So I decided to go with that.
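
As an illustration, here is a minimal sketch of what pulling a newspaper’s recent tweets could look like with Tweepy; the bearer token and the account handle are placeholders, not credentials or accounts from the project.

Python
# A rough sketch of pulling a newspaper account's recent tweets with Tweepy.
# The bearer token and the handle below are placeholders.
import tweepy

client = tweepy.Client(bearer_token="YOUR_BEARER_TOKEN")

# Look up the account, then fetch its latest tweets with some useful fields.
user = client.get_user(username="some_newspaper")
response = client.get_users_tweets(
    id=user.data.id,
    max_results=100,
    tweet_fields=["created_at", "public_metrics"],
)

for tweet in response.data or []:
    print(tweet.created_at, tweet.text)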

Edit note:

While working on the following posts in this series, I learned that Twitter’s API is no longer free to use. This means that, moving forward, I’ll be relying on BeautifulSoup and Selenium to get the data.
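
To give an idea of that route, below is a minimal scraping sketch with Requests and BeautifulSoup; the URL and the CSS selector are hypothetical, since every newspaper’s front page needs its own selectors (and some sites will need Selenium to render JavaScript first).

Python
# A minimal sketch of scraping headlines with Requests + BeautifulSoup.
# The URL and the selector are hypothetical; adapt them per newspaper.
import requests
from bs4 import BeautifulSoup

URL = "https://www.example-newspaper.com"

response = requests.get(URL, timeout=10)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")

# Grab whatever element the site uses to mark headlines, e.g. <h2 class="headline">.
headlines = [h.get_text(strip=True) for h in soup.select("h2.headline")]

for headline in headlines:
    print(headline)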

What do I do with this much data?

So, I had a way to collect the data but no idea how to even start an NLP project. To fix that, I found a great presentation from the PyOhio 2018 conference that covers the basics of NLP projects (link to video). From there I got a rough sketch of what I needed to do.

  • Data retrieval: Getting the data from the API
  • Data cleaning: Removing stop words and useless tweets (a small sketch of this step follows the list)
  • Feature engineering: Transforming the data into formats required for further analysis
  • Exploratory Data Analysis: Exploring the data and getting a good overview of what we have. This also helps me to see if I need anything else, or if something has to be cleaned
  • Sentiment Analysis: Exploring the sentiment of the tweets or responses
  • Topic Modeling: What topics come out and how do they change over time
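
As a taste of that cleaning step, here is a minimal sketch using spaCy’s small Spanish pipeline (I’m assuming Spanish-language headlines, in line with the Spanish sentiment library I mention later); it presumes the es_core_news_sm model has been downloaded and simply drops stop words, punctuation and URLs, keeping lemmas.

Python
# A minimal cleaning sketch with spaCy's Spanish pipeline.
# Assumes: python -m spacy download es_core_news_sm
import spacy

nlp = spacy.load("es_core_news_sm")

def clean_headline(text: str) -> list[str]:
    """Lowercase lemmas, with stop words, punctuation and URLs dropped."""
    doc = nlp(text)
    return [
        token.lemma_.lower()
        for token in doc
        if not token.is_stop and not token.is_punct and not token.like_url
    ]

# Example headline (made up).
print(clean_headline("El presidente anunció nuevas medidas económicas https://t.co/abc"))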

Is there anything I want to accomplish?

From the possibilities above, I figured I’d take a more exploratory route and play with all of them. This was a good decision at the start, since I had no idea what I would find or what questions I could even ask of the data.

Further along, as the project matured and I became more comfortable with the tools and methods, my goal changed and I opted to build a dashboard. It would showcase what I found and let others use it as a tool to reach their own conclusions. It would also give me a deployable product to add to my portfolio. The point is that it doesn’t matter if your initial route doesn’t work or if your project mutates while you are working on it, especially in this kind of learning, exploratory project.

Below is the design of the dashboard (I’m currently in the process of building it). It will have four tabs, one for each part of the analysis; a minimal layout sketch follows the list.

  • Number of tweets
  • Engagement stats
  • Top 30 words per week
  • Sentiment Analysis
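
To make the plan a bit more concrete, here is a minimal sketch of that tab layout in Dash; the tab labels mirror the list above and the contents are placeholders until the real figures are wired in.

Python
# A minimal sketch of the planned dashboard layout using Dash tabs.
# Tab contents are placeholders; real Plotly figures will replace them.
from dash import Dash, dcc, html

app = Dash(__name__)

app.layout = html.Div(
    [
        html.H1("Newspaper Headlines Analysis"),
        dcc.Tabs(
            [
                dcc.Tab(label="Number of tweets", children=html.Div("Tweet counts go here")),
                dcc.Tab(label="Engagement stats", children=html.Div("Likes and retweets go here")),
                dcc.Tab(label="Top 30 words per week", children=html.Div("Word counts go here")),
                dcc.Tab(label="Sentiment Analysis", children=html.Div("Sentiment plots go here")),
            ]
        ),
    ]
)

if __name__ == "__main__":
    app.run(debug=True)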

Getting started

Tools and Libraries

Now the tech stuff. I decided to use Python for the project, although R also has good NLP support, so it could have been either. Any Python project of mine begins with a new virtual environment; for that, my tools are Pyenv and Pyenv-virtualenv, with Poetry to handle dependencies. As for libraries, the main workhorses are Pandas for data loading and wrangling, and Plotly and Dash for graphics. On top of those, each step has its own third-party libraries:

  • General: python-dotenv, to manage environment variables and keep them out of version control; boto3, for integration with AWS
  • Data retrieval: Requests, to interact with the API in a sane way
  • Feature engineering: spaCy, the main tool I’ll be using for language processing
  • Sentiment Analysis: PySentimiento, a library for doing sentiment analysis in Spanish
  • Topic Modelling: Gensim, a library with tools for topic modelling
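
As a quick example of the sentiment step, here is roughly how PySentimiento’s Spanish analyzer can be used; the headline is made up, and the first run downloads a transformer model, so it takes a while.

Python
# A quick sentiment-analysis sketch with PySentimiento's Spanish analyzer.
# The example headline is invented; the first call downloads a model.
from pysentimiento import create_analyzer

analyzer = create_analyzer(task="sentiment", lang="es")

result = analyzer.predict("El gobierno anuncia un nuevo paquete de medidas económicas")
print(result.output)  # POS, NEG or NEU
print(result.probas)  # probability assigned to each label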

Project Layout

I’ve always wondered how to structure my projects in a way that resembles what I’d find in the field. While learning to build web applications with the Django framework, I came across the Cookiecutter project, a utility that builds projects from templates so you don’t have to start from scratch. There are many different cookiecutters covering most common types of projects and tool sets. For Data Science I found a particularly good one, the Data Science Cookiecutter: it might be a little opinionated, but the documentation is great, and it has helped me structure my code in an easy, sensible way across my projects.

To get started you need to install the Cookiecutter package, which is available from pip or conda. Since I use Cookiecutter across multiple projects, I installed it with pipx, so the tool is available everywhere on my system and I don’t have to install it in each virtual environment. Once it’s installed, run the following command in the folder where you want to store the project; an interactive script will guide you through the setup.

ShellScript
cookiecutter -c v1 https://github.com/drivendata/cookiecutter-data-science

Once that’s done, go into the new directory and you’ll find this file structure (taken from the documentation).

File Structure
├── LICENSE
├── Makefile           <- Makefile with commands like `make data` or `make train`
├── README.md          <- The top-level README for developers using this project.
├── data
│   ├── external       <- Data from third party sources.
│   ├── interim        <- Intermediate data that has been transformed.
│   ├── processed      <- The final, canonical data sets for modeling.
│   └── raw            <- The original, immutable data dump.

├── docs               <- A default Sphinx project; see sphinx-doc.org for details

├── models             <- Trained and serialized models, model predictions, or model summaries

├── notebooks          <- Jupyter notebooks. Naming convention is a number (for ordering),
│                         the creator's initials, and a short `-` delimited description, e.g.
│                         `1.0-jqp-initial-data-exploration`.

├── references         <- Data dictionaries, manuals, and all other explanatory materials.

├── reports            <- Generated analysis as HTML, PDF, LaTeX, etc.
│   └── figures        <- Generated graphics and figures to be used in reporting

├── requirements.txt   <- The requirements file for reproducing the analysis environment, e.g.
│                         generated with `pip freeze > requirements.txt`

├── setup.py           <- makes project pip installable (pip install -e .) so src can be imported
├── src                <- Source code for use in this project.
│   ├── __init__.py    <- Makes src a Python module
│   │
│   ├── data           <- Scripts to download or generate data
│   │   └── make_dataset.py
│   │
│   ├── features       <- Scripts to turn raw data into features for modeling
│   │   └── build_features.py
│   │
│   ├── models         <- Scripts to train models and then use trained models to make
│   │   │                 predictions
│   │   ├── predict_model.py
│   │   └── train_model.py
│   │
│   └── visualization  <- Scripts to create exploratory and results oriented visualizations
│       └── visualize.py

└── tox.ini            <- tox file with settings for running tox; see tox.readthedocs.io

Now I’m ready to start working on the project! I hope this look behind the curtains helps you come up with your own projects. In the next article in this series, I’ll cover Data Retrieval and Storage.

Further reading

One last thing before I finish today’s post. Below is the list of NLP resources I’ve read or am currently reading for this project.

Books

  1. Natural Language Processing with Python and spaCy by Yuli Vasiliev
  2. Getting Started with Natural Language Processing by Ekaterina Kochmar

Articles

  1. Basic tweet preprocessing in Python by Parthvy Shah
  2. How to Easily Cluster Textual Data in Python by James Asher

Thanks for reading, see you in the next article.

May the data be most of the time in your favour,

Andrea