Get Session Campagin from GA4 Exported BigQuery Data

Mu Kong
4 min readJan 6, 2023

--

Google Analytics 4 has a very convienent feature to export the traffic raw data to BigQuery, so that users can exam or proess the data by their own standards or customized logic.

Although the exported data in BigQuery has the schema to cover all dimensions and metrics as the data model in GA4, there are some key fields missing in the data, and the session scoped campaign/source/medium are among them.

There is no “Session campaign” in the BigQuery exported data

Since we have all the raw data in BigQuery, it should not be a problem to calculate the session scoped campaign/source/medium by ourselves. The key here is to use the window function in BigQuery to get the campaign/source/medium of the first event in each session.

To make the calculation more clear and straight forward, we first construct the subquery by extracting the fields we need from the original exported data set events_*.


SELECT
PARSE_DATE("%Y%m%d", _TABLE_SUFFIX) AS event_date,
event_name,
event_timestamp,
user_pseudo_id,

(
SELECT
value.int_value
FROM
UNNEST(event_params)
WHERE
key = 'ga_session_id') AS ga_session_id,

(
SELECT
value.string_value
FROM
UNNEST(event_params)
WHERE
key = 'source') AS event_source,

(
SELECT
value.string_value
FROM
UNNEST(event_params)
WHERE
key = 'medium') AS event_medium,

(
SELECT
value.string_value
FROM
UNNEST(event_params)
WHERE
key = 'campaign') AS event_campaign,

FROM
`XXXXXXX-data-project.analytics_XXXXXXXX.events_*`
WHERE
event_name NOT IN ('first_visit', 'session_start')

The event date here is apparently used for date range selection.

We will need event_timestamp, ga_session_id and user_pseudo_id for window processing later when we calcualte the session scoped campaign.

And of course, we extract the campaign/source/medium from the event_params to calculate the session campaign/source/mediums later in the outer query.

One thing to watch out here is that we need to exclude the first_visit, session_start events here, because those events are always at the beginning of sessions and they don’t have campaign/source/medium in event_param. If we include them in the windowing process, we will consistently get session campaign/source/medium as null.

Now we have all the fields from each individual events, the rest of the work is to mark the campaign/source/medium of the first event in each session and the session campaign/source/medium of all events in each session. We use window function to achieve that.

Below shows how to get the session campaign by ultilizing the window function.

FIRST_VALUE( event_campaign ) 
OVER (
PARTITION BY event_date, user_pseudo_id, ga_session_id
ORDER BY event_timestamp)
AS session_campaign

The clause is getting the first event_campaign of the events in each combination of event_date, user_pseudo_id, ga_session_id.

Although most part of this snipet is easy to understand, the fields we choose to put after PARTITION BY can be confusing.

We put both user_pseudo_id and ga_session_id instead of just ga_session_id here because ga_session_id is not globally unique but only unique per user. That is why we need to partition by both of them to get well bounded sessions.

Also, event_date is here because the window function will make the query unbounded, which will end up with querying all the historical data to find the first event_campaign for each session even if we add an event_date filter. This will of course cause large billed process capacity, and we don’t want that. Adding event_date into the PARTITION BY will make the query bounded again if we filter with an event_date range.

To add things up, we now have the final query:

SELECT
event_date,
event_name,
event_timestamp,
user_pseudo_id AS user_id,
ga_session_id AS session_id,
FIRST_VALUE( event_source) OVER (PARTITION BY event_date, user_pseudo_id, ga_session_id ORDER BY event_timestamp) AS session_source,
FIRST_VALUE( event_medium ) OVER (PARTITION BY event_date, user_pseudo_id, ga_session_id ORDER BY event_timestamp) AS session_medium,
FIRST_VALUE( event_campaign ) OVER (PARTITION BY event_date, user_pseudo_id, ga_session_id ORDER BY event_timestamp) AS session_campaign
FROM (
SELECT
PARSE_DATE("%Y%m%d", _TABLE_SUFFIX) AS event_date,
event_name,
event_timestamp,
user_pseudo_id,
(
SELECT
value.int_value
FROM
UNNEST(event_params)
WHERE
key = 'ga_session_id') AS ga_session_id,
(
SELECT
value.string_value
FROM
UNNEST(event_params)
WHERE
key = 'source') AS event_source,
(
SELECT
value.string_value
FROM
UNNEST(event_params)
WHERE
key = 'medium') AS event_medium,
(
SELECT
value.string_value
FROM
UNNEST(event_params)
WHERE
key = 'campaign') AS event_campaign,
FROM
`XXXXXXX-data-project.analytics_XXXXXXXX.events_*`
WHERE
event_name NOT IN ('first_visit',
'session_start'))
-- You can add event_date filter here too

This will get us the campaign/source/medium of the first event of the session as the session campaign/source/medium. I hope this will help people with their daily data analylsis work.

Please note that this is not exact the same with GA4 Session campaign/source/medium definition as GA4 definition of session attribution is still a blackbox.

For more differences between BigQuery exported data and original GA4 data, please read this thorough analysis post:

At the end, I want to point out that there are still space to improve the query.

One major thing is to precalculate the session_campagin, session_source, sesson_medium into a table. This is because every time we calculate the session scoped campaign/source/medium, we have to repeatly fetch and parse the event_param, which is a relatively large struct. This will bill us a lot of processing capacity and we don’t necessarily need the whole event_param. Precalculate the session_campagin, session_source, sesson_medium will save us a reasonable fortune.

--

--