
Making the Leap to Kedro: Beyond Notebooks

Unlock the power of structured data science projects: discover how Kedro transforms Jupyter explorations into seamless data pipelines.


Kedro pipelines flowchart


One of the main challenges I had when I started working on Data Science projects was making the jump from notebook-based data exploration to production-ready pipelines. To me, those were two different ways of thinking about problems: product-based development projects on one side, and Data Science explorations on the other. The problem arose when I started building a portfolio: how do I showcase my skills as a Data Scientist? This is especially important to me as a mostly self-taught Data Scientist. My first instinct was to build an end product, but that didn't show my ability to carry a project from start to finish or to fit into an existing working team.

So, after hearing about the Cookiecutter project, I found the Cookiecutter for Data Science, and it changed the way I approached and structured my projects. Now I had a place for most things, but this brought a small detail to my attention… I had no idea how to build a Data Science pipeline. While trying to learn about it, I found this article in Towards Data Science about Kedro. I was reluctant at first, but after getting stuck I decided to give it a try. Once I finished the migration, everything made a lot more sense. I could showcase my process much more straightforwardly, I finally had an idea of how a project fits together, and it even helped me write the documentation.

In this post, I'll show you how I migrated my NLP Newspapers Headlines Project to Kedro and what I learned about structuring pipelines in general, because now, if I ever have to work on a project that doesn't use Kedro, I think I'll still be able to find my way around.

The Jupyter Exploration Dilemma

Jupyter Notebooks in Context

Jupyter Notebooks are an excellent tool for interactive data exploration. I can run, debug and correct code as I go, without having to run and rerun a script over and over again, or type the same command endless times in a REPL (no hate for REPLs here, I love them and use them a bunch). Jupyter Notebooks can also serve as a reporting tool thanks to their Markdown capabilities.

And while all that is great, especially if you are very random and whimsical in your exploration, it can get tricky when you start to think in terms of production. For example, my NLP Newspapers Analysis project started in a very exploratory way, with the tutorial from Alice Zhao, as a way of figuring out everything that can be done with NLP. But after the exploration was done, I started to wonder what I could actually accomplish that was worth showing.

In the beginning, I thought about building things such as dashboards… reports, maybe? But once I started putting together a portfolio, I figured it's not just about the end product, but also about my process and my ability to adapt to an existing team. That's when I knew I had to switch from notebook-based work to script-based pipelines that could be run in sequence. And the truth is, I had no idea where to start.

Challenges in Transitioning

In that sense, finding the Cookiecutter for Data Science was a blessing. Its structure has a place for data, notebooks, reports, and source code, which means that, once the exploration is done, I can go into the source folder and turn the different steps of the process into modules. However, I had no idea how to build those modules, because my exploration process tends to be very fluid, with the different parts bleeding into one another. I had the data transformation on one side and utilities (settings, I/O) on the other, but no real idea of how to put together a cohesive module ready to show. So, I had the space, but I didn't know how to organize the information. Enter Kedro.

Introducing Kedro for Data Pipelines

I found out about Kedro in 2021 from this article in Towards Data Science and, while I found it interesting, I know I'm a sucker for shiny new things, so I opted for restraint and didn't try it. However, I was still looking for a way to showcase my Data Science skills, so I decided to build a dashboard from the project, and that's when I realized I needed a reproducible way of producing outputs. I tried again with the pipelines, and when I hit the wall again, I decided to give Kedro a chance.

What is Kedro?

Kedro is a Python framework that helps people build modular, maintainable, and production-ready Data Science projects. It takes inspiration from the Cookiecutter for Data Science, as it also has a set structure with a place for notebooks, source code, data, etc. But it's built in a way that makes the process of creating a useful Data Science project very intuitive, and the tutorial is very helpful if you are just starting. If you are a bit lost after exploration, it can show you a path and help you figure out what comes next.

In general, using Kedro can help you understand how a Data Science project works, especially if you, like me, have never thought about it as part of a production pipeline. It teaches you how to build source code out of data explorations, which is very powerful: if you end up in a job that already has a working in-house pipeline, you will have an idea of how things fit together.

Advantages of Kedro

Kedro brings the advantages of regular scripted source code to Data Science: reproducibility, pluggability, and versioning, plus a general blueprint for your project. It all stems from its structure and the way everything is tied together. The way I explain it is by thinking in terms of inputs and outputs. To build a project with Kedro, you write functions that move and transform the data as it flows along, and together they turn into a pipeline. It's a very result-driven process and, after the exploration stage, it can be very helpful to map out the different processes and the results they produce.
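As a hypothetical minimal sketch of that inputs-and-outputs idea (the functions and dataset names below are made up for illustration and are not part of my project), each function's output name becomes the next function's input name:

Toy pipeline sketch
from kedro.pipeline import node, pipeline


def clean(raw_headlines):
    # Toy transformation: drop rows with missing values.
    return raw_headlines.dropna()


def count_words(clean_headlines):
    # Toy transformation: count the words in each headline.
    return clean_headlines["headline"].str.split().str.len()


# Each node declares which catalog entries it reads and writes,
# and Kedro chains them into a pipeline based on those names.
example_pipeline = pipeline(
    [
        node(clean, inputs="raw_headlines", outputs="clean_headlines"),
        node(count_words, inputs="clean_headlines", outputs="word_counts"),
    ]
)
Python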

To tie everything together, there is the CLI, which you can use to run the pipelines, slice them, debug them, and more.

Migrating to Kedro

Now, for the rest of the post, I'll explain how I migrated my existing NLP Newspapers Analysis project to Kedro.

Project Initialization

Installing Kedro is a very straightforward process: you can use Pip, Poetry or any package manager in your project's virtual environment, or use Pipx for a system-wide installation and inject it into your environment. When I migrated the project, the first thing I did was visually chart the process. By that time I already had a clear idea of the general flow, and I had some functions built from my previous attempt. So, I took inspiration from Kedro-Viz and charted the data flow and functions.

Kedro pipelines flowchart

And this is the first difference from what I had done before. It was the graph above that helped me distinguish the different pipelines: Data Cleaning, Feature Engineering, Machine Learning and EDA. To separate them, I thought about what I want to happen when I run:

kedro run --pipeline={pipeline name}
ShellScript

Defining the Data Catalog

I built the Data Catalog first and, since I had the flowchart with the names of each data stage already assigned, the process was very straightforward: it made it easy to give everything a name and a location. For the NLP Newspapers Analysis project, the data comes as a series of files, one per week, because before Kedro the project consisted of a process that was run once a week. I haven't given much thought to whether I'll change that, or how I would configure the project to act as a stream. These kinds of datasets are handled by Kedro as Partitioned Datasets and are declared as follows.

Register Template
{name_in_catalog}:
  type: PartitionedDataset
  path: data/{folder_in_data}/{subfolder}
  dataset: {type_of_dataset}
YAML

For example, the dataset for my raw results from the API is as follows.

Partitioned Dataset Example
newspapers_raw_tweets:
  type: PartitionedDataset
  path: data/01_raw/tweets
  dataset: json.JSONDataset
YAML

There are several types of datasets, which you can find in the kedro-datasets documentation.

Creating Nodes and Pipelines

After that, it was time to create the functions that connect the points in the Data Catalog. For that, I created each pipeline with the following command.

kedro pipeline create <NAME>
ShellScript

That will create a directory with the name of the pipeline, with four files inside (README.md, __init__.py, nodes.py, and pipeline.py). It's inside nodes.py that we create the functions that will transform the data. Once the functions are created, we tie them together in the pipeline.py file. Below is an example of my feature_engineering pipeline.

pipeline.py
from kedro.pipeline import Pipeline, pipeline, node

from .nodes import make_corpus, make_data_dtm, make_dtm, make_dtm_newspaper


def create_pipeline(**kwargs) -> Pipeline:
    return pipeline(
        [
            node(
                func=make_corpus,
                inputs="clean_data",
                outputs="corpus",
                name="make_corpus_node",
            ),
            node(
                func=make_data_dtm,
                inputs="corpus",
                outputs="data_dtm",
                name="make_data_dtm_node",
            ),
            node(func=make_dtm, inputs="data_dtm", outputs="dtm", name="make_dtm_node"),
            node(
                func=make_dtm_newspaper,
                inputs=["corpus", "dtm"],
                outputs="dtm_newspaper",
                name="make_dtm_newspaper_node",
            ),
        ]
    )
Python

In the file, we import each of the functions from nodes.py and then build the create_pipeline function. Then, when all the pipelines have been built, we can register them in the pipeline_registry.py file in the project module.

pipeline_registry.py
"""Project pipelines."""
from typing import Dict

from kedro.pipeline import Pipeline

from newspapersAnalysis.pipelines import (
    cleaning_and_preprocessing,
    eda,
    feature_engineering,
    sentiment_emotion_analysis,
    topic_modeling,
)


def register_pipelines() -> Dict[str, Pipeline]:
    """Register the project's pipelines.
    Returns:
        Dict[str, Pipeline]: A mapping from a pipeline name to a Pipeline object.
    """
    cleaning_and_preprocessing_pipeline = cleaning_and_preprocessing.create_pipeline()
    eda_pipeline = eda.create_pipeline()
    feature_engineering_pipeline = feature_engineering.create_pipeline()
    sentiment_emotion_analysis_pipeline = sentiment_emotion_analysis.create_pipeline()
    topic_modeling_pipeline = topic_modeling.create_pipeline()

    return {
        "cleaning_preprocessing": cleaning_and_preprocessing_pipeline,
        "eda": eda_pipeline,
        "feature_engineering": feature_engineering_pipeline,
        "sentiment_emotion": sentiment_emotion_analysis_pipeline,
        "topic_modeling": topic_modeling_pipeline,
        "__default__": cleaning_and_preprocessing_pipeline
        + eda_pipeline
        + feature_engineering_pipeline
        + sentiment_emotion_analysis_pipeline
        + topic_modeling_pipeline,
    }
Python

This way of organizing code helped me understand the separation between the functions that process the data and the functions that tie the processes together. It is one of the main differences from my first attempt, where I had an I/O module for each of the pipelines and the functions could not be run individually.
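To make that separation concrete, here is a rough sketch of what a couple of the feature_engineering node functions could look like. These are not the actual implementations from my project; the column names and the use of scikit-learn's CountVectorizer are assumptions for illustration only.

nodes.py (sketch)
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer


def make_corpus(clean_data: pd.DataFrame) -> pd.DataFrame:
    # Hypothetical: aggregate the cleaned headlines into one document per newspaper.
    return (
        clean_data.groupby("newspaper")["headline"]
        .apply(" ".join)
        .to_frame(name="text")
    )


def make_data_dtm(corpus: pd.DataFrame) -> pd.DataFrame:
    # Hypothetical: build a document-term matrix from the corpus.
    vectorizer = CountVectorizer(stop_words="english")
    counts = vectorizer.fit_transform(corpus["text"])
    return pd.DataFrame(
        counts.toarray(),
        index=corpus.index,
        columns=vectorizer.get_feature_names_out(),
    )
Python

The point is that node functions are pure transformations: all loading and saving is handled by the Data Catalog, not by the functions themselves.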

Executing and Managing Kedro Pipelines

Running your Pipeline

Once I registered the pipelines, I could start running them. There are a few commands that can help us run the project, whether we want to run the whole thing, run each pipeline separately, or even slice them.

kedro run  # Runs the pipelines declared in the __default__ key
kedro run --pipeline={name of pipeline}  # Runs a specified pipeline
ShellScript

With that, I started running the pipelines one node at a time, and I realized that some nodes took a long time and I had no idea what was going on. There were also failures, errors, and bugs, so knowing what was happening was essential. For that, I used logging.

Logging in Kedro

Kedro integrates easily with the standard Python logging module. The configuration for the logger is done in the conf/base/logging.yml file. At the beginning I had some trouble getting it to work, but after reading the documentation and watching some YouTube videos I more or less understood how to set it up. Below is the structure of my logging.yml file, which is essentially the default config from the documentation (the exact defaults vary a bit between Kedro versions, so check the docs for yours).

logging.yml
version: 1

disable_existing_loggers: False

formatters:
  simple:
    format: "%(asctime)s - %(name)s - %(levelname)s - %(message)s"

handlers:
  console:
    class: logging.StreamHandler
    level: INFO
    formatter: simple
    stream: ext://sys.stdout

  info_file_handler:
    class: logging.handlers.RotatingFileHandler
    level: INFO
    formatter: simple
    filename: info.log
    maxBytes: 10485760 # 10MB
    backupCount: 20
    encoding: utf8
    delay: True

  rich:
    class: kedro.logging.RichHandler
    rich_tracebacks: True
    tracebacks_show_locals: False

loggers:
  kedro:
    level: INFO

  nlp-newpapersAnalysis:
    level: INFO

root:
  handlers: [rich, info_file_handler]
YAML

There, I looked mostly at the available loggers and modified the formatter a bit to make it blend in with Kedro's (just a personal preference). I can then call and use the nlp-newpapersAnalysis logger from any of the node files. This is just a simple way of doing logging; if I wanted a separate logger for other purposes, I would just add it there. To call the logger inside a node, I imported the logging module and defined a logger at the top of the file.

example_node.py
import logging
import pandas as pd
import spacy

from typing import Any, Callable, Dict
from itertools import product


logger = logging.getLogger("nlp-newpapersAnalysis")
# ...
Python

To use the logger inside a node, I call the appropriate method and type the message. Note that the logger also supports Rich markup. Below is an example of usage.

logger.info(f"[bold blue]Data DTM ->[/bold blue] {new_filename} starts", extra={"markup": True})
Python

Debugging and the REPL

Once the logger has been set up, you can follow your nodes' progress as they run, which gives a clearer view of bugs and errors. Kedro also has debugging capabilities, with integrations for widely used debuggers. But the way I like to go about fixing things is by reading the error messages and using the integrated REPL. It has the project session loaded, so you can access nodes and data from within it, and it works with IPython. To use it, you just run:

kedro ipython
ShellScript
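Inside that IPython session, Kedro loads the project and exposes variables such as catalog, context, pipelines and session, so you can poke at the data and rerun pieces interactively. A rough example of the kind of thing I do there (newspapers_raw_tweets is the catalog entry shown earlier; adapt the names to your own project):

# Inside `kedro ipython` — catalog, pipelines and session are provided by Kedro.
# Load a dataset from the Data Catalog to inspect it.
raw_tweets = catalog.load("newspapers_raw_tweets")

# List the registered pipelines and the nodes of one of them.
print(list(pipelines))
print(pipelines["feature_engineering"].nodes)

# Run a single pipeline from the interactive session.
session.run(pipeline_name="feature_engineering")
Python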

Conclusion

And that's it. I think that, at this point, migrating to Kedro was a good decision. It helped me understand how to structure projects and how Data Science might look in production. It also helped me move beyond Jupyter Notebook exploration and create a process that can be shown. This is not the end, though: after the main migration, I still need to write and publish the documentation, which can make or break the chance that others interact with your project. But that's for a future post, since I'm still working on it.