Garbage in, garbage out.
widely attributed to George Fuechsel, an IBM programmer and instructor
This is one of the first phrases I heard when I started learning about NLP analysis, and it made me a bit paranoid about my data cleaning process. But the thing that stuck with me the most was the importance of thinking about your analysis goal, and then tailoring your preprocessing to that purpose. This is the topic of today’s post.
A small note:
I’ll be covering data retrieval in a future post, since the data that was originally used for the project came from Twitter’s API (which is no longer free) and I’m currently working on a solution to be able to collect new information.
This is the second post in my NLP series (you can find the first one here) and will cover the process from the raw data obtained from the API to the final formats we will use to explore and analyse our data. Also, all the code for this post can be found here.
Loading the Data
Let’s begin by loading the data. During data collection, the data was stored straight from the API, and it came in JSON format. Below is an example tweet.
{
"possibly_sensitive": false,
"created_at": "2022-07-14T23:59:57.000Z",
"text": "Más de 324.000 vehículos usan GNV en el país, según Infogas https://t.co/PztJXMGSJM",
"id": "1547732663062081539",
"public_metrics": {
"retweet_count": 2,
"reply_count": 1,
"like_count": 7,
"quote_count": 0
}
},
These JSON files were stored in an AWS S3 bucket as well as locally, but for the sake of this post I’ll work from local data using the pandas library, more specifically the json_normalize function. I’ll cover integration with AWS in a future post, along with the building of the final dashboard. Also, in this example the data comes from weeks 35, 40, 45 and 50 of 2022, and week 3 of 2023.
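As a rough sketch of that loading step, here is json_normalize applied to an inline sample in the shape the API returns (the inline list stands in for the stored files, whose paths I’m not reproducing here). The nested public_metrics dictionary gets flattened into dotted column names:

```python
import pandas as pd

# One tweet in the shape returned by the API (see the example above);
# in the real project the list would be read from the stored JSON files.
raw = [
    {
        "possibly_sensitive": False,
        "created_at": "2022-07-14T23:59:57.000Z",
        "text": "Más de 324.000 vehículos usan GNV en el país https://t.co/PztJXMGSJM",
        "id": "1547732663062081539",
        "public_metrics": {"retweet_count": 2, "reply_count": 1, "like_count": 7, "quote_count": 0},
    }
]

# Nested fields become dotted columns, e.g. "public_metrics.retweet_count"
data = pd.json_normalize(raw)
print(data.columns.tolist())
```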
Once the data is loaded into memory, the result is a pandas data frame with the following structure.
created_at | possibly_sensitive | id | text | retweet_count | reply_count | like_count | quote_count | referenced_tweets | newspaper | edit_history_tweet_ids | impression_count | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2022-08-28 23:57:24+00:00 | False | 1564039479391838209 | Venezuela y Colombia retoman relaciones diplomáticas rotas hace tres años https://t.co/L6uVA6LcEE | 0 | 0 | 6 | 1 | nan | elcomercio_peru | nan | nan |
1 | 2022-08-28 23:49:59+00:00 | False | 1564037610393280512 | “Me dijeron que estaba llevando vergüenza a la universidad”: la profesora obligada a renunciar por postear fotos en bikini https://t.co/zAe98GI7W2 | 0 | 0 | 5 | 1 | nan | elcomercio_peru | nan | nan |
2 | 2022-08-28 23:29:00+00:00 | False | 1564032331706470401 | AMLO afirma que familias ya aceptaron plan de rescate de 10 mineros https://t.co/dG3VJXWgNa | 0 | 0 | 2 | 0 | nan | elcomercio_peru | nan | nan |
3 | 2022-08-28 23:14:11+00:00 | False | 1564028601053347843 | Zelensky: los ocupantes rusos sentirán las consecuencias de “futuras acciones” https://t.co/mNJTLz0SS7 | 6 | 7 | 18 | 1 | nan | elcomercio_peru | nan | nan |
4 | 2022-08-28 23:09:07+00:00 | False | 1564027328157683713 | Essalud: realizan con éxito operativo de donación de órganos para salvar vida de siete pacientes en espera https://t.co/3sDo7q9Nuu | 1 | 0 | 11 | 0 | nan | elcomercio_peru | nan | nan |
Now, for the rest of this post we will focus mainly on the text column: I’ll be cleaning the text strings themselves, as well as their contents in the context of the dataset.
Cleaning and Preprocessing Techniques
Removing unnecessary tweets
The first step to getting a more cohesive dataset is to clean the data as a whole. In this case, that means removing tweets that do not contribute to the analysis. To do that, I went into the Twitter feeds of each newspaper to find which posts repeat every day. Since the focus of my project is the newspapers’ narrative, posts that repeat every day do not contribute to it and can be removed. Those types of posts were:
- Daily horoscope
- Daily newspaper cover
- Daily caricature
- Ongoing contests
From the coding side, I decided to iterate over a list of identifier strings. That allows me to edit the list whenever a newspaper decides to change the way they make their posts, and I can also save the list in a separate file to load when I do the integration. Below you can see the code snippet used.
for identifier in identifier_strings:
    data.drop(data.loc[data["text"].str.contains(identifier, flags=re.IGNORECASE, regex=True)].index, inplace=True)
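An equivalent, vectorized way to do the same filtering is to join the identifiers into a single alternation pattern and keep only the rows that don’t match. The identifier strings and sample rows below are made up for illustration:

```python
import re

import pandas as pd

# Hypothetical identifiers; in the project they would come from the saved list.
identifier_strings = ["horóscopo", "portada de hoy"]

data = pd.DataFrame({"text": [
    "Horóscopo de hoy, 28 de agosto",
    "Venezuela y Colombia retoman relaciones diplomáticas",
]})

# Build one pattern and keep rows that do NOT contain any identifier.
pattern = "|".join(identifier_strings)
data = data[~data["text"].str.contains(pattern, flags=re.IGNORECASE, regex=True)]
print(len(data))  # 1: only the news tweet survives
```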
Extracting some features
Now we need to be aware that what we are working with here are tweets from specific users, newspapers at that. This means many of them share some characteristics, things like URLs (redirecting you to the full article), hashtags, mentions, emojis, etc. And some of those things might be worth looking at, so prior to any cleaning we need to extract them. For that, I’ll continue using common string methods, but while reading Parthvi Shah’s article on tweet preprocessing [1] she mentioned a package called tweet-preprocessor, which you might want to check out. For this project, I decided to look at hashtags and mentions.
data["mentions"] = data["text"].apply(lambda x: re.findall(r"@(\w+)", x))
data["hashtags"] = data["text"].apply(lambda x: re.findall(r"#(\w+)", x))
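To see what those two regexes capture, here’s a quick run on a made-up tweet:

```python
import re

tweet = "Entrevista con @elcomercio_peru sobre el #GNV en Lima"

# The capture group keeps only the handle / tag, dropping the @ and # markers.
mentions = re.findall(r"@(\w+)", tweet)
hashtags = re.findall(r"#(\w+)", tweet)
print(mentions)  # ['elcomercio_peru']
print(hashtags)  # ['GNV']
```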
Preprocessing tweets
With that, we are ready to process all the tweets. But before we begin, I’ll talk a little bit about cleaning and preprocessing data, along with some common techniques used in that process. Preprocessing is a crucial part of the pipeline, and it is especially important when working with text data, because analysing text means converting highly unstructured data into input features, which are numeric in value. So, how do we do that? In his book “Text Analytics with Python: A Practitioner’s Guide to Natural Language Processing”, Dipanjan Sarkar [2] lists the most popular and widely used techniques for text preprocessing, which are the following:
- Removing HTML tags, URLs and noisy characters
- Tokenization
- Removing unnecessary tokens and stopwords
- Handling contractions
- Correcting spelling errors
- Stemming
- Lemmatization
- Tagging
- Chunking
- Parsing
We will be using most of these techniques, except handling contractions, because the tweets are in Spanish and Spanish doesn’t have contractions.
Removing HTML tags, URLs, and other noisy characters
The first technique we are going to use is, in plain words, the removal of clutter. For that, I’ll group my cleaning into four passes, each addressing certain aspects of the text:
- Numbers: I would normally remove numbers altogether, but while researching for this article I came across a package called num2words, and I decided to implement a function that replaces numbers with their word counterparts. This package supports multiple languages.
- First pass: very basic things like changing everything to lower case, and removing punctuation, URLs, unicode and escape characters.
- Emojis: I’m going to eliminate emojis altogether. For that I’ll use the emoji library.
- Second pass: removes calls to action and phrases that don’t add meaning to the newspaper discourse.
Here are the functions used.
def number_processing(text: str) -> str:
    """Takes a string, finds numbers in it, converts them to words and returns the string with the numbers replaced.

    Args:
        text (str): text string to be processed

    Returns:
        str: string with numbers processed
    """
    numbers = re.findall(r"\b\d+\b", text)
    if not numbers:
        return text
    for number in numbers:
        word_number = num2words(float(number), lang="es")
        text = re.sub(number, word_number, text)
    return text
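A quick aside on the emptiness check in that function: testing with truthiness (`if not numbers:`) is the reliable way, because an identity comparison like `numbers is []` can never be true, since `is` compares object identity and each `[]` literal creates a brand-new list object:

```python
import re

numbers = re.findall(r"\b\d+\b", "sin números aquí")
print(numbers == [])  # True: the result is an empty list
print(not numbers)    # True: the idiomatic emptiness check
```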
def clean_text_first_pass(text):
    """Get rid of URLs, punctuation and non-sensical text identified.

    Args:
        text (string): text to be processed.
    """
    text = text.lower()
    text = re.sub(r"http[s]?(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+", "", text)  # Eliminates URLs
    text = re.sub("[%s]" % re.escape(string.punctuation), "", text)  # Eliminates punctuation
    text = re.sub("[‘’“”…«»►¿¡|│`]", "", text)
    text = re.sub("\n", " ", text)
    return text
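As a quick sanity check of the URL pattern used above, here it is applied on its own to a sample tweet from the dataset (already lower-cased):

```python
import re

# The URL pattern from clean_text_first_pass, in isolation.
url_pattern = r"http[s]?(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+"
tweet = "más de 324.000 vehículos usan gnv en el país https://t.co/PztJXMGSJM"
cleaned = re.sub(url_pattern, "", tweet).strip()
print(cleaned)  # más de 324.000 vehículos usan gnv en el país
```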
data["text_clean"] = data["text_clean"].apply(lambda x: emoji.replace_emoji(x, ""))
def clean_text_second_pass(text):
    """Get rid of calls to action and non-sensical text identified.

    Args:
        text (string): text to be processed.
    """
    text = re.sub("click aquí", "", text)
    text = re.sub("opinión", "", text)
    text = re.sub("rt ", "", text)
    text = re.sub("lee aquí el blog de", "", text)
    text = re.sub("vía gestionpe", "", text)
    text = re.sub("entrevista exclusiva", "", text)
    text = re.sub("en vivo", "", text)
    text = re.sub("entérate más aquí", "", text)
    text = re.sub("lee la columna de", "", text)
    text = re.sub("lee y comenta", "", text)
    text = re.sub("lea hoy la columna de", "", text)
    text = re.sub("escrito por", "", text)
    text = re.sub("lee la nota aquí", "", text)
    text = re.sub("una nota de", "", text)
    text = re.sub("aquí la nota", "", text)
    text = re.sub("nota completa aquí", "", text)
    text = re.sub("nota completa", "", text)
    text = re.sub("lee más", "", text)
    text = re.sub("lee aquí", "", text)
    text = re.sub("  ", " ", text)
    text = re.sub(r" \w ", " ", text)
    text = re.sub("^(plusg)", "", text)
    text = re.sub("( video )$", "", text)
    text = re.sub("( lee )$", "", text)
    text = re.sub("( lee la )$", "", text)
    return text
A note on the function for the second pass: the first part of the function could be refactored into a loop that takes a list of registered strings and performs the substitution for each of them.
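That refactor might look something like this (the phrase list is abridged, and the final strip is my addition):

```python
import re

# Abridged list of boilerplate phrases; in practice it could be loaded
# from the same external file as the identifier strings.
BOILERPLATE_PHRASES = [
    "click aquí",
    "entérate más aquí",
    "lee la nota aquí",
    "nota completa aquí",
]

def clean_text_second_pass(text: str) -> str:
    for phrase in BOILERPLATE_PHRASES:
        text = re.sub(phrase, "", text)
    text = re.sub(r"\s{2,}", " ", text)  # collapse spaces left by the removals
    return text.strip()

print(clean_text_second_pass("nueva ley aprobada lee la nota aquí"))
# nueva ley aprobada
```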
After running through these steps we end up with two columns, one with the raw tweet straight from the API, and another with the text free of clutter. Below is a small sample.
text | text_clean | |
---|---|---|
150 | ¿Sueles hablar de tus exparejas? Cuidado con esa manía, ¡suelta el pasado! https://t.co/0TujehPOmB | sueles hablar de tus exparejas cuidado con esa manía suelta el pasado |
232 | ???? Durand critica al Congreso y usuarios la ‘trolean’: «Te quedaste sin chamba» https://t.co/Cp9baGoXPg | durand critica al congreso usuarios la trolean te quedaste sin chamba |
A small tip:
Instead of using the head function of pandas to check how the functions are working, the sample function might be a better choice, since it shows you a different set of rows every time, giving you a broader picture of how the dataset is being processed.
Text normalisation: Tokenisation, lemmatisation, stemming and stop-word removal
Now that we have uncluttered text, we can start converting the “clean text” into a format that will (after some more work) become the numerical representations we need for further analysis. There are different libraries available for the task; the most popular ones are NLTK and SpaCy. For this project I chose SpaCy: it has good support for Spanish, and I was also curious to try it. After the package has been installed (here’s the documentation for installation), I need to download the model for the language I need, which in this case is Spanish.
A small caveat for the installation: since I’m using Poetry for dependency management, I won’t use the pip command to install the package, but the poetry add command.
poetry add -G dev spacy
python -m spacy download es_core_news_sm
These commands will add SpaCy to the dev group in your pyproject.toml file, and download the small Spanish model. After that, we are ready to use the package to process our data into formats that are ready for analysis. Now, there is something I would like to point out: the selection of techniques depends both on the dataset and on what we intend to do with it. And we also need to consider the computational cost of each technique.
In this project I’ll be using tokenisation, lemmatisation and stop-word removal, which can all be achieved with SpaCy. I decided against stemming since I’ll be doing sentiment analysis as part of the project, and while lemmatisation is more computationally expensive than stemming, it yields more accurate results [3]. Below is the code used.
First, we need to create a doc object for each text entry.
data_dtm["doc"] = data_dtm["corpus"].apply(lambda x: nlp(x))
Then we can go on to tokenisation and lemmatisation, which we will need in order to build the document-term matrix for topic modelling. This can be done as follows.
data_dtm["token"] = data_dtm["doc"].apply(lambda doc: [t.orth_ for t in doc if not (t.is_punct or t.is_stop or t.is_space)])
data_dtm["lemma"] = data_dtm["doc"].apply(lambda doc: [t.lemma_ for t in doc if not (t.is_punct or t.is_stop or t.is_space)])
As we’ve seen above, when using SpaCy the removal of stop-words happens during tokenisation and lemmatisation. This is because each token of the doc object carries attributes that indicate what type of element we are dealing with. Check the results of the previous steps in the table below.
id | text | created_at | newspaper | corpus | doc | token | lemma | |
---|---|---|---|---|---|---|---|---|
0 | 1564039479391838209 | Venezuela y Colombia retoman relaciones diplomáticas rotas hace tres años https://t.co/L6uVA6LcEE | 2022-08-28 23:57:24+00:00 | elcomercio_peru | venezuela colombia retoman relaciones diplomáticas rotas hace tres años | venezuela colombia retoman relaciones diplomáticas rotas hace tres años | [‘venezuela’, ‘colombia’, ‘retoman’, ‘relaciones’, ‘diplomáticas’, ‘rotas’, ‘años’] | [‘venezuela’, ‘colombia’, ‘retomar’, ‘relación’, ‘diplomático’, ‘roto’, ‘año’] |
1 | 1564032331706470401 | AMLO afirma que familias ya aceptaron plan de rescate de 10 mineros https://t.co/dG3VJXWgNa | 2022-08-28 23:29:00+00:00 | elcomercio_peru | amlo afirma que familias ya aceptaron plan de rescate de diez mineros | amlo afirma que familias ya aceptaron plan de rescate de diez mineros | [‘amlo’, ‘afirma’, ‘familias’, ‘aceptaron’, ‘plan’, ‘rescate’, ‘mineros’] | [‘amlo’, ‘afirmar’, ‘familia’, ‘aceptar’, ‘plan’, ‘rescate’, ‘minero’] |
2 | 1564028601053347843 | Zelensky: los ocupantes rusos sentirán las consecuencias de “futuras acciones” https://t.co/mNJTLz0SS7 | 2022-08-28 23:14:11+00:00 | elcomercio_peru | zelensky los ocupantes rusos sentirán las consecuencias de futuras acciones | zelensky los ocupantes rusos sentirán las consecuencias de futuras acciones | [‘zelensky’, ‘ocupantes’, ‘rusos’, ‘sentirán’, ‘consecuencias’, ‘futuras’, ‘acciones’] | [‘zelensky’, ‘ocupante’, ‘ruso’, ‘sentir’, ‘consecuencia’, ‘futuro’, ‘acción’] |
3 | 1564023766937731073 | Autoridades confirman transmisión comunitaria de viruela del mono en Panamá https://t.co/EBFcdrHz4Y | 2022-08-28 22:54:58+00:00 | elcomercio_peru | autoridades confirman transmisión comunitaria de viruela del mono en panamá | autoridades confirman transmisión comunitaria de viruela del mono en panamá | [‘autoridades’, ‘confirman’, ‘transmisión’, ‘comunitaria’, ‘viruela’, ‘mono’, ‘panamá’] | [‘autoridad’, ‘confirmar’, ‘transmisión’, ‘comunitario’, ‘viruela’, ‘mono’, ‘panamá’] |
4 | 1564017585561141248 | Las imágenes de los enfrentamientos entre seguidores de Cristina Kirchner y la policía en Argentina https://t.co/BYalmVyPBF | 2022-08-28 22:30:25+00:00 | elcomercio_peru | las imágenes de los enfrentamientos entre seguidores de cristina kirchner la policía en argentina | las imágenes de los enfrentamientos entre seguidores de cristina kirchner la policía en argentina | [‘imágenes’, ‘enfrentamientos’, ‘seguidores’, ‘cristina’, ‘kirchner’, ‘policía’, ‘argentina’] | [‘imagen’, ‘enfrentamiento’, ‘seguidor’, ‘cristina’, ‘kirchner’, ‘policía’, ‘argentina’] |
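To make the filtering predicate concrete without loading a model, here is the same logic applied to a minimal stand-in token class; spaCy’s real Token exposes the same boolean attributes, which is all the comprehensions above rely on:

```python
from dataclasses import dataclass

@dataclass
class Tok:
    """Minimal stand-in for spaCy's Token, with only the attributes used here."""
    orth_: str
    lemma_: str
    is_punct: bool = False
    is_stop: bool = False
    is_space: bool = False

doc = [
    Tok("los", "el", is_stop=True),
    Tok("ocupantes", "ocupante"),
    Tok(",", ",", is_punct=True),
    Tok("rusos", "ruso"),
]

# Same predicate as the list comprehensions above.
tokens = [t.orth_ for t in doc if not (t.is_punct or t.is_stop or t.is_space)]
lemmas = [t.lemma_ for t in doc if not (t.is_punct or t.is_stop or t.is_space)]
print(tokens)  # ['ocupantes', 'rusos']
print(lemmas)  # ['ocupante', 'ruso']
```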
Part-of-speech tagging
The last technique we could use during the cleaning stage is part-of-speech tagging (POS tagging), which allows the analysis of the structure of the text. In this project I decided against it, since tweets are small pieces of text and, from an exploratory perspective, it doesn’t offer any additional information. Another thing to take into account is the end product of this project, which is a dashboard showing insights into the tweets and their performance.
Still, if we were working on a different application, such as a tweet generator, POS tagging is something that should be done at this stage. Now I’m feeling tempted to give that a try… (maybe in a future post)
Evaluation and Iteration
Preprocessing text data is an iterative process, so in order to measure the success of the techniques applied, EDA is required. The first way to get a small peek is through the .sample() function on the pandas data frame, but there are other checks as well, such as word counts and searching for leftover unicode characters. It’s also very important to keep the cleaning functions modular and independent; this way we can make changes without disturbing the flow of the data, and continue to adapt to new inputs.
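One such check is a simple token-frequency count over the cleaned corpus: the top tokens should be meaningful words rather than leftover boilerplate or stray characters. A sketch, with made-up corpus lines:

```python
from collections import Counter

corpus = [
    "venezuela colombia retoman relaciones diplomáticas",
    "amlo afirma familias aceptaron plan rescate",
    "venezuela anuncia nuevas relaciones comerciales",
]

# Count every whitespace-separated token across the corpus.
counts = Counter(word for doc in corpus for word in doc.split())
print(counts.most_common(2))  # [('venezuela', 2), ('relaciones', 2)]
```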
For this project my process flow can be seen in the following image.
The main points where I check on the success of the preprocessing are after the data cleaning step and on the columns of the document-term matrix. Bear in mind that rebuilding the document-term matrix is an expensive process, so it is important to catch as many errors as possible before working on a redo.
Key takeaways
In this part of the series, I covered the preprocessing of the dataset and the treatment of the tweet texts themselves: which techniques I’ve used, why, and how I’m planning to fit this stage into the pipeline. With that, I’ll leave you with a couple of things to keep in mind when you work on your own projects:
- Select the techniques based on the final application of your project
- Keep it modular, because preprocessing is a very iterative process and you will have to do this multiple times.
In the next post in the series I’ll go through the EDA process, as well as the creation of the graphics for three of the tabs in the final dashboard.
Until then, stay curious and read!
Andrea.
References
[1] P. Shah, ‘Basic Tweet Preprocessing in Python’, Medium, Jun. 07, 2020. https://towardsdatascience.com/basic-tweet-preprocessing-in-python-efd8360d529e (accessed May 03, 2023).
[2] D. Sarkar, Text Analytics with Python: A Practitioner’s Guide to Natural Language Processing. Berkeley, CA: Apress, 2019. doi: 10.1007/978-1-4842-4354-1.
[3] R. Singh, ‘The Ultimate Preprocessing Pipeline for Your NLP Models’, Medium, May 08, 2023. https://towardsdatascience.com/the-ultimate-preprocessing-pipeline-for-your-nlp-models-80afd92650fe (accessed May 10, 2023).