Air Quality
In light of the ongoing wildfires, there has been a lot of attention on air quality and the impact of PM2.5 and O₃ pollutants on people's health. As this is an area that I am not familiar with, I will be using this post to explore the data and learn more about the topic.
Air quality worst offenders
According to the World Health Organization (WHO):
Together with the connected issue of climate change, the WHO recognises air pollution as the biggest global health threat in the current century. Every year, exposure to ambient air pollution is estimated to cause around 4.5 million premature deaths globally, and indoor air pollution causes a further 2.3 million. By comparison, the COVID-19 pandemic caused around 5 million deaths globally in 2020, and around 12 million deaths in 2021, according to the WHO. [source]
The air pollutants that pose the most significant risk to our health are PM2.5 (Particulate Matter ≤ 2.5 μm) and O₃ (Ground Level Ozone). For context, here are some details:
PM2.5 (Particulate Matter ≤ 2.5 µm)
- Source: Combustion from cars, industry, and wood smoke. Also forms secondarily from gases like SO₂ and NOx.
- Health Effects:
  - Penetrates deep into lungs and enters bloodstream
  - Triggers heart attacks and strokes
  - Worsens asthma and bronchitis
  - Linked to lung cancer and premature death
O₃ (Ground-Level Ozone)
- Source: Not emitted directly. Forms from NOx + VOCs in sunlight.
- Health Effects:
  - Irritates airways
  - Reduces lung function
  - Triggers asthma attacks and chronic respiratory diseases
  - Worsens existing heart and lung conditions
The two main questions that I am trying to answer are:
- What is the air quality, specifically PM2.5 and O₃ levels, in some of the major cities in North America and Europe?
- Has the air quality improved or worsened significantly over the last 5 years?
The data
After some research, I came across OpenAQ. This is a non-profit organization that provides air quality data from various sources around the world. They offer API access to their data, but even more conveniently, all their data is stored in a publicly accessible S3 bucket (s3://openaq-data-archive/...). More details on their AWS bucket structure can be found here.
To download the data, I decided to parametrize the S3 URLs for each of the cities I am interested in. I used DuckDB's glob CSV reading functionality. I came up with this simple query that loads the data for a specific location_id:
FROM read_csv(
"s3://openaq-data-archive/records/csv.gz/locationid={loc_id}/year=202[{from_yr}-{to_yr}]/month=*/*",
union_by_name=True
);
The parameters loc_id, from_yr, and to_yr are used in my Python script to pass the location ID, start year, and end year for the data I want to download. I got the location IDs from this interactive map, for a set of cities that I selected in a somewhat random and subjective way. What makes this query particularly robust is the union_by_name=True argument, which unifies the schema of files that have different or missing columns. If a file does not have a certain column, that column is filled with NULL values.
Finally, I built a simple Python function which retrieves the data for each city from the S3 bucket and saves it locally in Parquet files. I ended up with approximately 10 MB of data covering 5 years of historical hourly sensor readings for the 22 cities I selected.
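As a rough illustration of that step, here is a minimal sketch of how the query could be parametrized and its result written to Parquet. The helper names build_query and download_city, the output file naming, and the location ID in the usage note are hypothetical and not taken from the repo. I also assume that from_yr and to_yr are the last digit of a 202x year (my reading of the glob pattern above), and that DuckDB can read the public bucket through its httpfs extension.

```python
from pathlib import Path

# Query template from the post; the placeholders are filled in per city.
QUERY_TEMPLATE = """
FROM read_csv(
    's3://openaq-data-archive/records/csv.gz/locationid={loc_id}/year=202[{from_yr}-{to_yr}]/month=*/*',
    union_by_name=True
)
"""


def build_query(loc_id: int, from_yr: int, to_yr: int) -> str:
    """Fill the template; from_yr and to_yr are assumed to be the last digit of a 202x year."""
    if not (0 <= from_yr <= to_yr <= 9):
        raise ValueError("from_yr and to_yr must be single digits with from_yr <= to_yr")
    return QUERY_TEMPLATE.format(loc_id=loc_id, from_yr=from_yr, to_yr=to_yr).strip()


def download_city(loc_id: int, from_yr: int, to_yr: int, out_dir: Path) -> Path:
    """Hypothetical helper: copy one location's readings into a local Parquet file."""
    import duckdb  # imported lazily so build_query works without DuckDB installed

    out_dir.mkdir(parents=True, exist_ok=True)
    out_file = out_dir / f"location_{loc_id}.parquet"
    query = build_query(loc_id, from_yr, to_yr)
    # The template starts with FROM, so prefixing SELECT * gives a regular SELECT
    # that COPY ... TO can write straight to Parquet.
    duckdb.execute(f"COPY (SELECT * {query}) TO '{out_file.as_posix()}' (FORMAT parquet)")
    return out_file
```

Calling download_city(123, 1, 5, Path("data")) would then fetch 2021 through 2025 for the made-up location ID 123. Keeping the query builder separate from the download makes it easy to inspect the generated SQL before touching the network.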
For the data exploration, I used a self-contained Marimo notebook. It is part of the project's GitHub repo and can be started on its own like this:
marimo edit air-quality.py --sandbox
The --sandbox flag ensures that the notebook runs in a self-contained temporary Python environment. All package dependencies are already defined in the special # /// script ... comment at the beginning of the file using TOML. I find this very practical and use it all the time.
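For readers who have not seen this comment block before, it follows the inline script metadata format (PEP 723). A header for a notebook like this one might look roughly as follows. The Python version and the package list are my assumptions for illustration and are not copied from the repo.

```python
# /// script
# requires-python = ">=3.12"
# dependencies = [
#     "marimo",
#     "polars",
#     "altair",
# ]
# ///
```

Because the block is made of ordinary comments, the file stays a valid Python script even for tools that do not understand the metadata.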
The data is then loaded using polars's read_parquet().
To keep the length of this post manageable, the following exploration will focus only on PM2.5 particle pollution. The air_quality.py notebook contains a similar analysis of O₃ (Ground-Level Ozone) levels.
PM2.5
Before digging deeper into the PM2.5 pollution data, I wanted to create a simple heatmap showing the daily average PM2.5 reading by city. For the color coding, I relied on the WHO Air Quality Guidelines. Specifically, I used the part of the guideline stating that "... 24-hour average exposures should not exceed 15 µg/m³ more than 3 - 4 days per year...".
That is why the color scale turns red at about the 15 µg/m³ level. I achieved this by adjusting the domainMid parameter of the color scale in altair:
...
heatmap = base.mark_rect().encode(
    alt.Color(col)
    .title(col_title)
    .scale(scheme="redblue", domainMid=domainMid, reverse=True)
)
...
The cities (on the y-axis) are grouped by continent, with European locations at the top. The first thing that struck me was that Athens, Greece is an outlier. It has average daily PM2.5 readings significantly higher than the rest of the sample. Apart from that, two interesting patterns emerge.
The first pattern shows that in Europe, peak PM2.5 values occur mostly between November and January. The pattern is very pronounced in the 2022-2023 and 2024-2025 seasons. It's slightly less visible in 2023-2024. My takeaway is that the increased PM2.5 levels in those periods are mainly caused by heating emissions. Interestingly, this phenomenon is not observable in North America.
The second pattern shows the impact of North American wildfires, which were particularly severe in June-July 2023, January 2025 (Southern California), and June 2025 (Central Canada). As a result, the observed PM2.5 levels during those periods were higher in almost all of the selected North American cities.
By using average daily and monthly readings, we lose visibility of how high the PM2.5 pollution went and for how long it stayed there. Going back to the WHO guideline, I wanted to understand how many times per year PM2.5 pollution exceeded 15 µg/m³ for 3 or more consecutive days. I also wanted to know the length of the longest of those streaks. The following heatmap answers the first question:
As an example, there were 15 periods of 3 days or more in Vienna, Austria during 2021 when PM2.5 readings were higher than 15 µg/m³. Based on this metric, the European cities seem to be more severely affected by PM2.5 pollution than the North American ones, with the exception of Los Angeles. Almost all of the examined cities on the old continent experienced a high count of 3-day or longer periods with significant PM2.5 pollution. Rome, Athens and Sofia stand out in 2024 with 20, 21 and 17 such periods, respectively.
The final heatmap should be examined in conjunction with the previous one. It shows, for each year, the length of the longest period of 3 days or more during which PM2.5 readings were higher than 15 µg/m³. Looking at the same three European cities in 2024, the longest PM2.5 pollution streak was in Rome. There, PM2.5 remained above the prescribed 15 µg/m³ threshold for 15 consecutive days.
It seems that the pollution situation in Sofia, Bulgaria is getting worse this year. The city has already seen 8 periods of 3 days or more with PM2.5 above the prescribed level. The longest of these lasted 20 days.
In terms of the practical implementation, the most interesting part of the pipeline is the following:
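...
.with_row_index()
.with_columns(
    # 1 when the reading is at or above the threshold, 0 when below (null stays null)
    pl.when(pl.col("value") < thr).then(0).when(pl.col("value") >= thr).then(1).alias("above_thr")
)
.with_columns(
    pl.col("above_thr")
    .diff()  # non-zero exactly where the signal crosses the threshold
    .abs()
    .fill_null(0)  # diff() leaves the first row null; treat it as "no change"
    .mul(pl.col("index"))  # stamp each crossing with its row index
    .cum_max()  # carry the latest crossing index forward as the group ID
    .alias("streak_group")
)
.with_columns(
    pl.col("above_thr")
    .cum_sum()  # running count of above-threshold rows within each group
    .over("city", "year", "streak_group")
    .alias("consec_above_thr")
)
...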
Taking the diff() on the 'above_thr' column gives me the specific points in time when the signal changed from below the threshold to above and vice versa. Then multiplying by the 'index' and taking the cum_max() assigns unique IDs to each consecutive group of 'above or below the threshold'.
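To sanity-check the streak logic on a small example, the same idea can be written in plain Python. This is an independent cross-check using itertools.groupby, not the pipeline from the notebook. The function name streaks_above and the sample values are made up for illustration.

```python
from itertools import groupby


def streaks_above(values, thr, min_len=3):
    """Return the lengths of runs of consecutive values >= thr that last at least min_len."""
    lengths = []
    # groupby splits the sequence into consecutive runs of "above" and "below".
    for above, run in groupby(values, key=lambda v: v >= thr):
        n = sum(1 for _ in run)
        if above and n >= min_len:
            lengths.append(n)
    return lengths


# Made-up daily means in µg/m³: runs at or above 15 have lengths 3, 2 and 4.
daily_means = [12, 16, 18, 20, 9, 17, 16, 8, 15, 15, 15, 15]
runs = streaks_above(daily_means, thr=15)
print(len(runs), max(runs, default=0))  # prints: 2 4
```

The first printed number corresponds to the count shown in the first streak heatmap, and the second to the longest streak shown in the final one. The 2-day run is dropped because it is shorter than min_len.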
Wrap up
This exploration revealed some concerning patterns in PM2.5 pollution across major cities in North America and Europe. European cities consistently show seasonal spikes during winter months, likely due to increased heating emissions. Meanwhile, North American cities face acute pollution events driven by wildfire activity.
Athens stands out as particularly problematic, with consistently high PM2.5 levels throughout the year. Cities like Rome and Sofia also show troubling trends, with extended periods of poor air quality that far exceed WHO recommendations.
All the code and data from this post are available in this repo.