GA4 sample data set for use in my testing project for AI Analytics tools
Do AI Analytics products on the market today live up to the hype? My kickoff post showed how I plan on taking AI Analytics products for a spin to see if they are worth their salt.
Here now is a look at the uniform data source that will be the testing grounds for delivering automated insights, It’s Google’s sample Google Analytics 4 data set residing on BigQuery.
What is this data set about?
It’s sample, anonymized data from Google’s own Merch store. It’s web analytics data of click-based behaviors and purchase transactions.
Consistent with GA4’s purpose, it is event-level data, not just traditional pageview and session-level web analytics data that we were used to from GA Universal Analytics.
Now that the old Google Analytics UA has been sunset, it’s not worth going back to the former GA UA data set anymore.
How much more do I need to know about this data model?
In an ideal world, a business decision maker would probably just love to leave the rest to AI, i.e. we specify what business decisions we are trying to make, let the AI figure out
- the data model
- data quality
- what questions can be answered
- what questions can’t.
- Let alone, how to query the data to get those answers.
- Make recommendations based on what the data says
If it was a human colleague from the data insights team, that’s what we would expect them to bring to the table.
But to what degree can AI tools do this? We’ve seen examples of Gen AI understanding simple tables and charts and running analysis on them. The problem is, real-life data models are not simple. They are not just a table but multi-level schemas, e.g. in this case
- A user has multiple website / app sessions over time, with multiple pageviews, multiple events during each pageview, potentially with a transaction which includes multiple purchased items.
- Querying this kind of multi-level data typically requires human analysts to be aware of the levels so they can formulate their queries correctly. Will AI understand this automatically?
- Running predictive analytics models on this type of data has typically required extracting data into a simple table where each row has one data unit and the columns represent the features that are the input into the model. Will AI do this transformation on its own?
To understand what the various AI Analytics products can vs. cannot do automatically today, we need to roll up our sleeves and understand the GA4 schema in detail ourselves first.
Sorry, no way around getting our hands dirty for this project!
What is this GA4 schema like?
IMO the older GA UA data set was easier to work with. The old schema was more like a typical, normalized BI data warehouse schema where a lot of the business dimensions that you want to analyze had their own data columns.
- GA UA had 2x as many columns because the data was more normalized, whereas in GA4 a lot of the same info is contained as rows of different event types, e.g. “Start_session” is an event type.
- A Visit ID and related information were present in the old GA UA schema so that you could describe what happens within a single interaction session. The GA4 data set does not really care about sessions and page views but it is focused on users and interaction events. That is a more modern view, more suitable for today’s highly interactive web and app experiences. But it is both more and less granular at the same time. Challenging to work with, or at least a steeper learning curve.
- According to Google’s own description
- In GA UA “Each row within a table corresponds to a session in Analytics 360.”
- In GA4, even a single event can be represented across multiple rows of data
“Data for a single event may be represented in one or multiple rows, […]. A page_view event with multiple event_params, for example, would look similar to the following table. The initial row contains the event name, date, timestamp and other non-repeated data items. The event_params RECORD is repeated for each parameter associated with the event. These repeated RECORDS are populated in subsequent rows directly under the initial event row.”
So, a query to analyze this pageview including the page referrer, and title would need to combine multiple rows of data as input. Good luck, AI!
Data teams can do that with manual effort. They might extract data into a simpler representation that makes it easier to query. They would do this for conventional analysis as well, but this dampens my expectations on the degree to which AI will be able to make insights effortless.
Nested and Repeated columns!
It shows that I haven’t kept up with the evolution of databases, but this data on BigQuery has been my first exposure to nested and repeated table columns.
Note that some parameters are nested within RECORDS, and some RECORDS such as
GA4 Export Schema documentationitems
andevent_params
are repeatable.
In the image below we can see the columns of a single row in the GA4 data set. The columns that have a little triangle can be expanded. If the type is “RECORD” it is essentially an array of multiple columns grouped together, no big deal.
But if, additionally, the mode is “REPEATED” we’ve got multiple of those arrays in a single row, e.g. multiple product items purchased together within the transaction. See below, how for each item purchased the array of data captures the item’s name, category, price, quantity etc.
This is all great data. But instead of having a traditional star schema where items would be a table of its own, this schema just contains everything in one data table with nested and repeated fields.
That is for a very good reason, namely query performance and avoiding costly SQL joins. BigQuery SQL can deal with this data as needed by un-nesting the repeated records and flattening them on the fly, if we need to access the items purchased in a single purchase event.
That’s all great, but it is an additional learning curve for part time data people like me, and for the AI
Here is a great video from Sean Cronin to walk through this data set and how to work with it in SQL
Let the fun begin
Let’s see how far today’s Analytics tools have come with their ability to deal with this type of schema. Let’s see what roadblocks this might put in front of AI’s ability to make analysis effortless for us.
I have got my work cut out for me to produce automated answers for the business questions that will be the test criteria for this AI Analytics tool evaluation.
Leave a Reply