1. Introduction
Python's logging library is modular. It consists of the following components.
- Logger: It exposes the interface that the application code uses to emit log messages.
- Handler: It sends the log records to appropriate destinations.
- Filter: It filters which log records to output.
- Formatter: It formats the log output to be written at the destination.
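
As a minimal sketch of how these four components fit together, the snippet below wires one of each by hand. The logger name `demo.components`, the `OnlyIngestion` filter, and the in-memory stream are illustrative assumptions, not part of any particular application.

```python
import io
import logging


class OnlyIngestion(logging.Filter):
    """Hypothetical filter: pass only records whose message mentions 'ingest'."""

    def filter(self, record: logging.LogRecord) -> bool:
        return "ingest" in record.getMessage()


stream = io.StringIO()                    # in-memory destination, for illustration only
handler = logging.StreamHandler(stream)   # Handler: sends records to the destination
handler.setFormatter(                     # Formatter: controls the layout of each line
    logging.Formatter("%(name)s|%(levelname)s: %(message)s")
)
handler.addFilter(OnlyIngestion())        # Filter: decides which records are output

logger = logging.getLogger("demo.components")  # Logger: what application code calls
logger.setLevel(logging.INFO)
logger.propagate = False                  # keep the demo isolated from the root logger
logger.addHandler(handler)

logger.info("ingest started")      # passes the filter
logger.info("unrelated message")   # dropped by the filter

output = stream.getvalue()
print(output)
```

Only the first message reaches the stream, because the filter attached to the handler rejects the second one.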
Log records are organized into levels based on the relevance and severity of the event. The standard levels are listed below, each with the numerical value that defines its place in the hierarchy of importance.
DEBUG (10)
: Detailed information about the process, intended specifically for diagnosing problems.

INFO (20)
: Confirmation that things are working as expected in the process.

WARNING (30)
: An indication that something unexpected happened, or that a problem may occur in the near future (e.g. 'disk space low'). The process is still working as expected.

ERROR (40)
: Due to a more serious problem, the process has not been able to perform some function.

CRITICAL (50)
: A serious error; the program itself may be unable to continue running.
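
To make the hierarchy concrete, here is a small sketch that reads the numeric values from the `logging` module and checks which levels a logger set to WARNING would accept. The logger name `demo.levels` is an arbitrary choice for illustration.

```python
import logging

levels = ["DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"]

# Each level name is a module-level integer constant in logging.
level_values = {name: getattr(logging, name) for name in levels}
print(level_values)

# A logger handles a record only if the record's level is at or above its own.
logger = logging.getLogger("demo.levels")
logger.setLevel(logging.WARNING)
enabled = [name for name in levels if logger.isEnabledFor(level_values[name])]
print(enabled)
```

With the threshold at WARNING (30), DEBUG and INFO records are discarded before they reach any handler.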
We will illustrate logging using the file-loading script below.
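    def get_file(filename: str):
        """Return the contents of filename, or None if it cannot be read."""
        try:
            with open(filename, "r") as f:
                file = f.read()
            return file
        except Exception as e:
            # Report the failure on the console
            print(f"Exception occurred while loading the file: {e}")

    get_file("listings.csv")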
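
One way the same script might look once `print` is swapped for a logger is sketched below. The use of `basicConfig`, the INFO message on success, and the explicit `return None` are assumptions made for illustration.

```python
import logging
from typing import Optional

logger = logging.getLogger(__name__)


def get_file(filename: str) -> Optional[str]:
    """Return the contents of filename, or None if it cannot be read."""
    try:
        with open(filename, "r") as f:
            content = f.read()
        logger.info("Loaded %s (%d characters)", filename, len(content))
        return content
    except Exception:
        # logger.exception logs at ERROR level and appends the traceback
        logger.exception("Exception occurred while loading the file: %s", filename)
        return None


# Simplest possible setup: send INFO and above to the console
logging.basicConfig(level=logging.INFO)
result = get_file("listings.csv")
```

Passing the filename as an argument rather than formatting it into the string lets the logging library skip the formatting work when the record is filtered out.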
We can configure our custom logger using a config file as below.
    [loggers]
    keys=root

    [handlers]
    keys=fileHandler, consoleHandler

    [formatters]
    keys=logFormatter

    [logger_root]
    level=INFO
    handlers=fileHandler, consoleHandler

    [handler_fileHandler]
    class=FileHandler
    level=DEBUG
    formatter=logFormatter
    # log_filename must be supplied through the defaults argument of logging.config.fileConfig
    args=('%(log_filename)s', 'a')

    [handler_consoleHandler]
    class=StreamHandler
    level=INFO
    formatter=logFormatter
    args=(sys.stdout,)

    [formatter_logFormatter]
    format=%(asctime)s|%(name)s|%(levelname)s: %(message)s
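
A config file like this can be loaded with `logging.config.fileConfig`. The sketch below is self-contained, so it writes an equivalent config to a temporary directory first. It assumes the log file name is passed in through the `defaults` argument and referenced as `'%(log_filename)s'`, that the console handler takes only the stream, and that the level field is `%(levelname)s`. The paths and the `ingestion` logger name are hypothetical.

```python
import logging
import logging.config
import os
import tempfile

CONFIG_TEXT = """\
[loggers]
keys=root

[handlers]
keys=fileHandler, consoleHandler

[formatters]
keys=logFormatter

[logger_root]
level=INFO
handlers=fileHandler, consoleHandler

[handler_fileHandler]
class=FileHandler
level=DEBUG
formatter=logFormatter
args=('%(log_filename)s', 'a')

[handler_consoleHandler]
class=StreamHandler
level=INFO
formatter=logFormatter
args=(sys.stdout,)

[formatter_logFormatter]
format=%(asctime)s|%(name)s|%(levelname)s: %(message)s
"""

# Hypothetical locations; a real project would keep the config under version control.
workdir = tempfile.mkdtemp()
config_path = os.path.join(workdir, "logging.conf")
log_path = os.path.join(workdir, "pipeline.log").replace("\\", "/")
with open(config_path, "w") as f:
    f.write(CONFIG_TEXT)

# defaults fills in %(log_filename)s when the handler arguments are read
logging.config.fileConfig(
    config_path,
    defaults={"log_filename": log_path},
    disable_existing_loggers=False,
)

log = logging.getLogger("ingestion")
log.info("config loaded")
log.debug("not written: the root logger level is INFO")

with open(log_path) as f:
    log_text = f.read()
print(log_text)
```

Note that the DEBUG message never reaches the file even though the file handler's level is DEBUG, because the root logger discards anything below INFO before handlers see it.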
As an example, we consider logging for a data ingestion pipeline that ingests data from two sources: a database and an API. The pipeline consists of three stages: the data sources, the ingestion module, and the staging layer. To log the pipeline well, we first need to define the metrics to log at each stage.
- Data sources
  - Size of downloaded data
  - API response status
  - API response time
  - Database health
- Ingestion
  - Schema match
  - Time to retrieve data
  - File size
  - Dataframe dimension
  - Time to transform data
- Staging
  - Size of parquet files
  - Schema match
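
A few of the ingestion metrics above could be logged roughly as follows. This is a sketch using only the standard library: the `ingest_csv` function, the `log_duration` helper, the expected columns, and the sample file are all hypothetical stand-ins for the real pipeline, and a plain list of rows stands in for a dataframe.

```python
import csv
import logging
import os
import tempfile
import time
from contextlib import contextmanager
from typing import Iterator, List

logger = logging.getLogger("pipeline.ingestion")

EXPECTED_COLUMNS = ["id", "price"]  # hypothetical schema


@contextmanager
def log_duration(metric: str) -> Iterator[None]:
    """Log how long the wrapped block took, in seconds."""
    start = time.perf_counter()
    try:
        yield
    finally:
        logger.info("%s=%.3fs", metric, time.perf_counter() - start)


def ingest_csv(path: str) -> List[List[str]]:
    """Read a CSV file and log file size, retrieval time, dimensions, schema match."""
    logger.info("file_size_bytes=%d", os.path.getsize(path))
    with log_duration("time_to_retrieve_data"):
        with open(path, newline="") as f:
            rows = list(csv.reader(f))
    header, data = rows[0], rows[1:]
    logger.info("dataframe_dimension=%dx%d", len(data), len(header))
    if header == EXPECTED_COLUMNS:
        logger.info("schema_match=True")
    else:
        logger.warning("schema_match=False expected=%s got=%s", EXPECTED_COLUMNS, header)
    return data


logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s|%(name)s|%(levelname)s: %(message)s",
)

# Write a tiny sample file so the sketch runs on its own
sample_path = os.path.join(tempfile.mkdtemp(), "listings.csv")
with open(sample_path, "w", newline="") as f:
    csv.writer(f).writerows([["id", "price"], ["1", "100"], ["2", "250"]])

data = ingest_csv(sample_path)
```

Logging each metric as a `key=value` pair keeps the lines easy to search and parse later; a schema mismatch is logged at WARNING here, though a stricter pipeline might treat it as an ERROR.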