
From Messy to Meaningful: A Practical Guide to Preprocessing Raw Text Data

Learn about preprocessing Twitter data for NLP analysis with this practical guide, including tools and techniques for cleaning and transforming your data.


Photo by Towfiqu Barbhuiya

Garbage in, garbage out.

Widely attributed to George Fuechsel, an IBM programmer and instructor

This is one of the first phrases I heard when I started learning about NLP, and it made me a bit paranoid about my data cleaning process. But the thing that stuck with me the most was the importance of thinking about what your analysis goal is, and then tailoring your preprocessing to that purpose. That is the topic of today’s post.

A small note:

I’ll be covering data retrieval in a future post, since the data originally used for the project came from Twitter’s API (which is no longer free) and I’m currently working on a solution for collecting new data.

This is the second post in my NLP series (you can find the first one here) and it will cover the process from the raw data obtained from the API to the final formats that we will use to explore and analyse our data. Also, all the code for this post can be found here.

Loading the Data

Let’s begin by loading the data. During data collection, the tweets were stored straight from the API in JSON format. Below is an example tweet.

Example Tweet
{
    "possibly_sensitive": false,
    "created_at": "2022-07-14T23:59:57.000Z",
    "text": "Más de 324.000 vehículos usan GNV en el país, según Infogas https://t.co/PztJXMGSJM",
    "id": "1547732663062081539",
    "public_metrics": {
        "retweet_count": 2,
        "reply_count": 1,
        "like_count": 7,
        "quote_count": 0
    }
},
JSON

These JSON files were stored in an AWS S3 bucket as well as locally, but for the sake of this post I’ll work from the local data using the pandas library, more specifically the json_normalize function. I’ll cover the integration with AWS in a future post, along with the building of the final dashboard. In this example, the data comes from weeks 35, 40, 45 and 50 of 2022, and week 3 of 2023.
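
A minimal sketch of that loading step could look like the following; the folder layout and the assumption that each file holds a plain list of tweet objects are mine, not necessarily the project’s exact setup.

import json
from pathlib import Path

import pandas as pd

# Read every weekly JSON file and flatten the nested tweet objects into one data frame
frames = []
for path in Path("data/raw").glob("*.json"):  # hypothetical local folder
    with open(path, encoding="utf-8") as file:
        tweets = json.load(file)
    frames.append(pd.json_normalize(tweets))

data = pd.concat(frames, ignore_index=True)
Python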

Once the data is loaded into memory, the result is a pandas data frame with the following structure.

| | created_at | possibly_sensitive | id | text | retweet_count | reply_count | like_count | quote_count | referenced_tweets | newspaper | edit_history_tweet_ids | impression_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2022-08-28 23:57:24+00:00 | False | 1564039479391838209 | Venezuela y Colombia retoman relaciones diplomáticas rotas hace tres años https://t.co/L6uVA6LcEE | 0 | 0 | 6 | 1 | nan | elcomercio_peru | nan | nan |
| 1 | 2022-08-28 23:49:59+00:00 | False | 1564037610393280512 | “Me dijeron que estaba llevando vergüenza a la universidad”: la profesora obligada a renunciar por postear fotos en bikini https://t.co/zAe98GI7W2 | 0 | 0 | 5 | 1 | nan | elcomercio_peru | nan | nan |
| 2 | 2022-08-28 23:29:00+00:00 | False | 1564032331706470401 | AMLO afirma que familias ya aceptaron plan de rescate de 10 mineros https://t.co/dG3VJXWgNa | 0 | 0 | 2 | 0 | nan | elcomercio_peru | nan | nan |
| 3 | 2022-08-28 23:14:11+00:00 | False | 1564028601053347843 | Zelensky: los ocupantes rusos sentirán las consecuencias de “futuras acciones” https://t.co/mNJTLz0SS7 | 6 | 7 | 18 | 1 | nan | elcomercio_peru | nan | nan |
| 4 | 2022-08-28 23:09:07+00:00 | False | 1564027328157683713 | Essalud: realizan con éxito operativo de donación de órganos para salvar vida de siete pacientes en espera https://t.co/3sDo7q9Nuu | 1 | 0 | 11 | 0 | nan | elcomercio_peru | nan | nan |

Now, for the rest of this post we will focus mainly on the text column: I’ll be cleaning both the text strings themselves and their contents in the context of the dataset as a whole.

Cleaning and Preprocessing Techniques

Removing unnecessary tweets

The first step toward a more cohesive dataset is to clean the data as a whole. In this case, that means removing tweets that do not contribute to the analysis. To do that, I went into the Twitter feed of each newspaper to find which posts repeat every day and don’t contribute to the narrative. Since the focus of my project is on the narrative of the newspapers, posts that repeat every day add nothing to it and can be removed. Those types of posts were:

  • Daily horoscope
  • Daily newspaper cover
  • Daily caricature
  • Ongoing contests

On the coding side, I decided to iterate over a list of identifier strings. That allows me to edit the list whenever a newspaper decides to change the way they make their posts, and I can also save the list in a separate file to load when I do the integration. Below you can see the code snippet used.

# Drop any tweet whose text matches one of the recurring-post identifiers (case-insensitive)
for identifier in identifier_strings:
    data.drop(data.loc[data["text"].str.contains(identifier, flags=re.IGNORECASE, regex=True)].index, inplace=True)
Python

Extracting some features

Now, we need to be aware that what we are working with here are tweets from specific users, newspapers at that. This means many of them share some characteristics: things like URLs (redirecting you to the full article), hashtags, mentions, emojis, etc. Some of those things might be worth looking at, so prior to any cleaning we need to extract them. For that I’ll continue using common string methods, but while reading Parthvi Shah’s article on tweet preprocessing [1] she mentioned a package called tweet-preprocessor which you might want to check out. For this project, I decided to look at hashtags and mentions.

data["mentions"] = data["text"].apply(lambda x: re.findall("@(\w+)", x))
data["hasthags"] = data["text"].apply(lambda x: re.findall("#(\w+)", x))
Python

Preprocessing tweets

With that, we are ready to process all the tweets. But before we begin, I’ll talk a little bit about cleaning and preprocessing data, along with some common techniques used in that process. Preprocessing is a crucial part of any data project, and it is especially important when working with text data. This is because analysing text data means we need to convert highly unstructured data into input features, which are numeric in value. So, how do we do that? In his book “Text Analytics with Python: A Practitioner’s Guide to Natural Language Processing”, Dipanjan Sarkar [2] lists the most popular and widely used techniques for text preprocessing, which are the following:

  • Removing HTML tags, URLs and noisy characters
  • Tokenization
  • Removing unnecessary tokens and stopwords
  • Handling contractions
  • Correcting spelling errors
  • Stemming
  • Lemmatization
  • Tagging
  • Chunking
  • Parsing

We will be using most of these techniques, except for handling contractions, because the tweets are in Spanish and Spanish doesn’t have contractions.

Removing HTML tags, URLs, and other noisy characters

The first technique we are going to use is, in plain words, the removal of clutter. For that, I’ll group my cleaning into four passes, each addressing certain aspects of the text:

  • Numbers: I would normally remove numbers altogether, but while researching for this article I came across a package called num2words, and I decided to implement a function that replaces numbers with their word counterparts. This package supports multiple languages.
  • First pass: Very basic things like changing everything to lower case, removing punctuation, removing URLs, unicode and escape characters.
  • Emojis: I’m going to eliminate emojis altogether. For that I’ll use the emoji library.
  • Second pass: Removes calls to action and phrases that don’t add meaning to the newspaper discourse.

Here are the functions used.

import re
from num2words import num2words


def number_processing(text: str) -> str:
    """Takes a string, finds numbers in it, converts them to words and returns the string with the numbers replaced.

    Args:
        text (str): text string to be processed

    Returns:
        str: string with numbers processed
    """
    numbers = re.findall(r"\b\d+\b", text)

    # An empty list means there is nothing to replace
    if not numbers:
        return text

    for number in numbers:
        word_number = num2words(float(number), lang="es")
        text = re.sub(number, word_number, text)

    return text
Python
import re
import string


def clean_text_first_pass(text):
    """Get rid of URLs, punctuation and other non-sensical text identified.

    Args:
        text (string): text to be processed.
    """
    text = text.lower()
    text = re.sub(r"http[s]?(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+", "", text)  # Eliminates URLs
    text = re.sub("[%s]" % re.escape(string.punctuation), "", text)  # Eliminates punctuation
    text = re.sub("[‘’“”…«»►¿¡|│`]", "", text)  # Eliminates extra punctuation-like characters
    text = re.sub("\n", " ", text)  # Replaces newlines with spaces

    return text
Python
data["text_clean"] = data["text_clean"].apply(lambda x: emoji.replace_emoji(x, ""))
Python
def clean_text_second_pass(text):
    """Get rid of calls to action and other filler phrases that don't add meaning.

    Args:
        text (string): text to be processed.
    """
    text = re.sub("click aquí", "", text)
    text = re.sub("opinión", "", text)
    text = re.sub("rt ", "", text)
    text = re.sub('lee aquí el blog de', '', text)
    text = re.sub('vía gestionpe', '', text)
    text = re.sub('entrevista exclusiva', '', text)
    text = re.sub('en vivo', '', text)
    text = re.sub('entérate más aquí', '', text)
    text = re.sub('lee la columna de', '', text)
    text = re.sub('lee y comenta', '', text)
    text = re.sub('lea hoy la columna de', '', text)
    text = re.sub('escrito por', '', text)
    text = re.sub('lee la nota aquí', '', text)
    text = re.sub('una nota de', '', text)
    text = re.sub('aquí la nota', '', text)
    text = re.sub('nota completa aquí', '', text)
    text = re.sub('nota completa', '', text)
    text = re.sub('lee más', '', text)
    text = re.sub('lee aquí', '', text)

    text = re.sub("  ", " ", text)
    text = re.sub(" \w ", " ", text)
    text = re.sub("^(plusg)", "", text)
    text = re.sub("( video )$", "", text)
    text = re.sub("( lee )$", "", text)
    text = re.sub("( lee la )$", "", text)

    return text
Python

A note on the function for the second pass: the first part of the function could be refactored into a loop that takes a list of registered strings and applies the substitution for each of them, roughly as sketched below.
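
A rough sketch of that refactor, assuming the phrase list (shortened here) would eventually live in a separate file, just like the identifier strings:

import re

# Calls to action and filler phrases that add nothing to the newspapers' discourse (shortened list)
FILLER_PHRASES = [
    "click aquí",
    "lee aquí el blog de",
    "vía gestionpe",
    "nota completa",
]


def remove_filler_phrases(text: str) -> str:
    """Strip every registered filler phrase from the text."""
    for phrase in FILLER_PHRASES:
        text = re.sub(phrase, "", text)
    return text
Python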

After running through these steps we end up with two columns: one with the raw tweet straight from the API, and another with the text free of clutter. Below is a small sample.

| | text | text_clean |
|---|---|---|
| 150 | ¿Sueles hablar de tus exparejas? Cuidado con esa manía, ¡suelta el pasado! https://t.co/0TujehPOmB | sueles hablar de tus exparejas cuidado con esa manía suelta el pasado |
| 232 | ???? Durand critica al Congreso y usuarios la ‘trolean’: «Te quedaste sin chamba» https://t.co/Cp9baGoXPg | durand critica al congreso usuarios la trolean te quedaste sin chamba |

A small tip:

Instead of using the head function of pandas to check how the functions are working, the sample function might be a better choice, since it shows you a different set of rows every time, giving you a broader picture of how the dataset is being processed.
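
For instance, something along these lines pulls a fresh handful of rows on every run:

# Spot-check a random handful of raw vs. cleaned tweets
data[["text", "text_clean"]].sample(5)
Python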

Text normalisation: Tokenisation, lemmatisation, stemming and stop-word removal

Now that we have uncluttered text, we can start converting the “clean text” into a format that will (after some more work) become the numerical representations we need for further analysis. There are different libraries available for the task; the most popular ones are NLTK and SpaCy. For this project I chose SpaCy: it has good support for Spanish and I was also curious to try it. After the package has been installed (here’s the documentation for installation), I need to download the model for the language I need, which in this case is Spanish.

A small caveat for the installation. Since I’m using Poetry for dependency management, I won’t use the pip command to install the package, but the poetry add command.

poetry add -G dev spacy
python -m spacy download es_core_news_sm
Bash

These commands will add SpaCy to the dev group in your pyproject.toml file and download the small Spanish model. After that, we are ready to use the package to process our data into formats that are ready for analysis. Now, there is something I would like to point out: the selection of techniques depends both on the dataset and on what we intend to do with it, and we also need to consider the computational cost of each technique.

In this project I’ll be using tokenisation, lemmatisation and stop-word removal, which can all be achieved with SpaCy. I decided against stemming since I’ll be doing sentiment analysis as a part of the project, and while lemmatisation is more computationally expensive than stemming, it yields more accurate results [3]. Below is the code used.

First, we need to create a doc object for each text.
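
This assumes the Spanish pipeline downloaded above has already been loaded into an nlp object, roughly like this:

import spacy

# Load the small Spanish model downloaded with the command above
nlp = spacy.load("es_core_news_sm")
Python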

data_dtm["doc"] = data_dtm["corpus"].apply(lambda x: nlp(x))
Python

Then we can move on to tokenisation and lemmatisation, which we will need in order to build the document-term matrix for topic modelling. This can be done as follows.

data_dtm["token"] = data_dtm["doc"].apply(lambda doc: [t.orth_ for t in doc if not t.is_punct | t.is_stop | t.is_space])
data_dtm["lemma"] = data_dtm["doc"].apply(lambda doc: [t.lemma_ for t in doc if not t.is_punct | t.is_stop | t.is_space])
Python
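
To make the filtering above more concrete, here is a small illustrative snippet of the token properties it relies on (the sentence is just a sample, not a row from the dataset):

doc = nlp("Más de 324000 vehículos usan GNV en el país")

# Every token exposes flags we can use to filter out stop words, punctuation and whitespace
for token in doc:
    print(token.orth_, token.lemma_, token.is_stop, token.is_punct, token.is_space)
Python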

As we’ve seen above, when using SpaCy the removal of stop-words happens while doing the tokenisation and lemmatisation. This is because each token in the doc object has properties that indicate what type of element it is. Check the results of the previous steps in the table below.

| | id | text | created_at | newspaper | corpus | doc | token | lemma |
|---|---|---|---|---|---|---|---|---|
| 0 | 1564039479391838209 | Venezuela y Colombia retoman relaciones diplomáticas rotas hace tres años https://t.co/L6uVA6LcEE | 2022-08-28 23:57:24+00:00 | elcomercio_peru | venezuela colombia retoman relaciones diplomáticas rotas hace tres años | venezuela colombia retoman relaciones diplomáticas rotas hace tres años | ['venezuela', 'colombia', 'retoman', 'relaciones', 'diplomáticas', 'rotas', 'años'] | ['venezuela', 'colombia', 'retomar', 'relación', 'diplomático', 'roto', 'año'] |
| 1 | 1564032331706470401 | AMLO afirma que familias ya aceptaron plan de rescate de 10 mineros https://t.co/dG3VJXWgNa | 2022-08-28 23:29:00+00:00 | elcomercio_peru | amlo afirma que familias ya aceptaron plan de rescate de diez mineros | amlo afirma que familias ya aceptaron plan de rescate de diez mineros | ['amlo', 'afirma', 'familias', 'aceptaron', 'plan', 'rescate', 'mineros'] | ['amlo', 'afirmar', 'familia', 'aceptar', 'plan', 'rescate', 'minero'] |
| 2 | 1564028601053347843 | Zelensky: los ocupantes rusos sentirán las consecuencias de “futuras acciones” https://t.co/mNJTLz0SS7 | 2022-08-28 23:14:11+00:00 | elcomercio_peru | zelensky los ocupantes rusos sentirán las consecuencias de futuras acciones | zelensky los ocupantes rusos sentirán las consecuencias de futuras acciones | ['zelensky', 'ocupantes', 'rusos', 'sentirán', 'consecuencias', 'futuras', 'acciones'] | ['zelensky', 'ocupante', 'ruso', 'sentir', 'consecuencia', 'futuro', 'acción'] |
| 3 | 1564023766937731073 | Autoridades confirman transmisión comunitaria de viruela del mono en Panamá https://t.co/EBFcdrHz4Y | 2022-08-28 22:54:58+00:00 | elcomercio_peru | autoridades confirman transmisión comunitaria de viruela del mono en panamá | autoridades confirman transmisión comunitaria de viruela del mono en panamá | ['autoridades', 'confirman', 'transmisión', 'comunitaria', 'viruela', 'mono', 'panamá'] | ['autoridad', 'confirmar', 'transmisión', 'comunitario', 'viruela', 'mono', 'panamá'] |
| 4 | 1564017585561141248 | Las imágenes de los enfrentamientos entre seguidores de Cristina Kirchner y la policía en Argentina https://t.co/BYalmVyPBF | 2022-08-28 22:30:25+00:00 | elcomercio_peru | las imágenes de los enfrentamientos entre seguidores de cristina kirchner la policía en argentina | las imágenes de los enfrentamientos entre seguidores de cristina kirchner la policía en argentina | ['imágenes', 'enfrentamientos', 'seguidores', 'cristina', 'kirchner', 'policía', 'argentina'] | ['imagen', 'enfrentamiento', 'seguidor', 'cristina', 'kirchner', 'policía', 'argentina'] |
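
With the lemma column in place, the document-term matrix itself can be assembled, for example with scikit-learn’s CountVectorizer. This is only a sketch of one way to do it, not necessarily the exact code used in the project:

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Join each tweet's lemmas back into a single string so the vectoriser can consume them
lemmatised_docs = data_dtm["lemma"].apply(" ".join)

vectoriser = CountVectorizer()
dtm = vectoriser.fit_transform(lemmatised_docs)

# Rows are tweets, columns are lemmas, values are raw counts
dtm_df = pd.DataFrame(
    dtm.toarray(),
    columns=vectoriser.get_feature_names_out(),
    index=data_dtm.index,
)
Python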

Part-of-speech tagging

The last technique we could use during the cleaning stage is part-of-speech tagging (POS tagging), which allows us to analyse the structure of the text. In this project I decided not to use it, since tweets are small pieces of text and, from an exploratory perspective, it doesn’t offer much additional information. Also, another thing to take into account is the end product of this project, which is a dashboard showing insights into the tweets and their performance.

Still, if we were working on a different application, such as a tweet generator, POS tagging is something that should be done at this stage. Now I’m feeling tempted to give that a try… (maybe in a future post)
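
For completeness, here is roughly how it could be done with the doc objects we already have; a sketch only, since it is not part of this project’s pipeline:

# Keep (word, part-of-speech) pairs for each tweet, skipping punctuation and whitespace
data_dtm["pos"] = data_dtm["doc"].apply(
    lambda doc: [(token.orth_, token.pos_) for token in doc if not (token.is_punct or token.is_space)]
)
Python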

Evaluation and Iteration

Preprocessing text data is an iterative process, so in order to measure the success of the techniques applied, some EDA is required. The first way to get a small peek is through the .sample() function on the pandas data frame, but there are other checks as well, such as word counts and looking for leftover unicode characters. It’s also very important to keep the cleaning functions modular and independent; this way we can make changes without disturbing the flow of the data and keep adapting to new inputs.
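
As an example of such a check, counting token frequencies across the cleaned tweets is a quick way to spot leftover clutter (a sketch using the column names from above):

from collections import Counter

# Frequency of every token across the cleaned corpus; suspicious high-frequency entries hint at missed clutter
token_counts = Counter(token for tokens in data_dtm["token"] for token in tokens)
print(token_counts.most_common(20))
Python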

For this project my process flow can be seen in the following image.

Data preprocessing flow chart.

The main points where I check on the success of the preprocessing are after the data cleaning step and on the columns of the document-term matrix. Bear in mind that rebuilding the document-term matrix is an expensive process, so it is important to catch as many errors as possible before working on a redo.

Key takeaways

In this part of the series, I covered the preprocessing of the dataset and the treatment of the tweet texts themselves: what techniques I’ve used, why, and how I’m planning to fit this stage into the pipeline. With that, I’ll leave you with a couple of things to keep in mind when you work on your own projects:

  • Select the techniques based on the final application of your project
  • Keep it modular because it’s a very iterative process and you will have to do this multiple times.

In the next post in the series I’ll go through the EDA process, as well as the creation of the graphics for three of the tabs in the final dashboard.

Until then, stay curious and read!

Andrea.

References

[1] P. Shah, ‘Basic Tweet Preprocessing in Python’, Medium, Jun. 07, 2020. https://towardsdatascience.com/basic-tweet-preprocessing-in-python-efd8360d529e (accessed May 03, 2023).

[2] D. Sarkar, Text Analytics with Python: A Practitioner’s Guide to Natural Language Processing. Berkeley, CA: Apress, 2019. doi: 10.1007/978-1-4842-4354-1.

[3] R. Singh, ‘The Ultimate Preprocessing Pipeline for Your NLP Models’, Medium, May 08, 2023. https://towardsdatascience.com/the-ultimate-preprocessing-pipeline-for-your-nlp-models-80afd92650fe (accessed May 10, 2023).