Data Parsing

This page explains how Umbraco Engage parses the collected data and stores it in a normalized and efficient way.

Now that the data is persisted in the database, it is time for the next step.

Getting the data

A background process runs constantly on the web server. It checks whether there are unprocessed pageviews in memory or unprocessed records in the table umbracoEngageAnalyticsRawClientSideData.

Unprocessed records in this table can be identified by a NULL value in the column processingStarted.

If the background process finds unprocessed pageviews in memory or unprocessed records in the table, it fetches the rows and starts processing them. Once processing has finished, it updates each record by setting values in the columns processingFinished and processingMachine.
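The fetch-and-mark cycle described above can be sketched as follows. This is a minimal illustration, not Umbraco Engage's actual implementation: the table name raw_data, the payload column, and the function process_batch are hypothetical, and SQLite stands in for the real database. Only the column names processingStarted, processingFinished, and processingMachine come from the text above.

```python
import sqlite3
from datetime import datetime, timezone


def utc_now():
    """Current UTC time as an ISO 8601 string."""
    return datetime.now(timezone.utc).isoformat()


def process_batch(conn, machine, batch_size, handle):
    """Fetch up to batch_size unprocessed rows, handle them, and mark them done.

    'raw_data' and 'payload' are hypothetical names used for illustration.
    """
    rows = conn.execute(
        "SELECT id, payload FROM raw_data "
        "WHERE processingStarted IS NULL ORDER BY id LIMIT ?",
        (batch_size,),
    ).fetchall()
    for row_id, payload in rows:
        # Mark the row as picked up so it is no longer seen as unprocessed.
        conn.execute(
            "UPDATE raw_data SET processingStarted = ? WHERE id = ?",
            (utc_now(), row_id),
        )
        handle(payload)
        # Record when and where the processing finished.
        conn.execute(
            "UPDATE raw_data SET processingFinished = ?, processingMachine = ? "
            "WHERE id = ?",
            (utc_now(), machine, row_id),
        )
    conn.commit()
    return len(rows)


def demo():
    """Run two batches of size 2 over three raw records."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE raw_data (id INTEGER PRIMARY KEY, payload TEXT, "
        "processingStarted TEXT, processingFinished TEXT, processingMachine TEXT)"
    )
    conn.executemany(
        "INSERT INTO raw_data (payload) VALUES (?)", [("a",), ("b",), ("c",)]
    )
    seen = []
    first = process_batch(conn, "web-01", 2, seen.append)
    second = process_batch(conn, "web-01", 2, seen.append)
    unfinished = conn.execute(
        "SELECT COUNT(*) FROM raw_data WHERE processingFinished IS NULL"
    ).fetchone()[0]
    return first, second, seen, unfinished
```

In this sketch the batch size plays the role of a per-run record limit: the first run handles two records, the second run picks up the remaining one.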

Parsing

Once the data is fetched, Umbraco Engage performs several actions:

Normalize the data

All data is stored in a normalized way in the tables with the prefix: umbracoEngageAnalytics.

For example, each browser is stored only once in the table umbracoEngageAnalyticsBrowser, and each browser version is stored only once in the table umbracoEngageAnalyticsBrowserVersion.

The session then references the primary key ID of the browser version instead of storing the full text string. This makes the data easier to query and more efficient to store (an integer reference instead of a text string).

The same normalization is applied to all such data, including:

  • Browser and browser version

  • Operating system

  • Visitor type
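The normalization idea can be illustrated with a small sketch. It assumes nothing about the real schema beyond what is described above: LookupTable and normalize_session are hypothetical names, and an in-memory dictionary stands in for a database table that stores each distinct value once.

```python
class LookupTable:
    """In-memory stand-in for a table that stores each distinct value once."""

    def __init__(self):
        self._ids = {}

    def get_or_create(self, value):
        """Return the existing ID for value, or assign the next integer ID."""
        if value not in self._ids:
            self._ids[value] = len(self._ids) + 1
        return self._ids[value]

    def __len__(self):
        return len(self._ids)


def normalize_session(browsers, versions, browser, version):
    """Replace the browser strings of a session with one integer reference."""
    browser_id = browsers.get_or_create(browser)
    # A version row belongs to one browser, so the key includes the browser ID.
    version_id = versions.get_or_create((browser_id, version))
    return {"browserVersionId": version_id}
```

Two sessions that report the same browser and version end up with the same integer reference, and the lookup tables grow only when a new value appears.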

Relate data to Umbraco nodes

When the data was stored in the raw database tables, only the URL was stored. In the parsing step, Umbraco Engage tries to identify which Umbraco node and which culture were served on this URL. This step is important because it allows the Umbraco backoffice to report later what happened on which page.

Goals

Within Umbraco Engage, you can set up goals that are reached when a specific page is visited or when an event is triggered. When parsing data, Umbraco Engage checks whether the record reaches one of these goals.
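As a rough illustration of this check, the sketch below matches one parsed record against a list of goals. The Goal class, its fields, and the function reached_goals are hypothetical and only model the two goal types mentioned above (a page that is reached or an event that is triggered).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Goal:
    """Hypothetical goal definition: a page goal, an event goal, or both."""
    name: str
    page_node_id: Optional[int] = None  # reached when this node is viewed
    event_name: Optional[str] = None    # reached when this event is triggered


def reached_goals(goals, node_id=None, event_name=None):
    """Return the names of all goals reached by one parsed record."""
    hits = []
    for goal in goals:
        page_hit = goal.page_node_id is not None and goal.page_node_id == node_id
        event_hit = goal.event_name is not None and goal.event_name == event_name
        if page_hit or event_hit:
            hits.append(goal.name)
    return hits
```

A record for a pageview would pass the resolved node ID, while a record for an event would pass the event name.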

Configuration options

How frequently the data is processed can be set in the configuration file. Two parameters can be set:

  • The IntervalInRecords setting specifies how many unprocessed records should be fetched per parsing process.

  • The IntervalInSeconds setting specifies how often the background process is triggered and how often the parsing happens.

The higher you set these values, the less frequently the parsing takes place.

It is possible to specify which web server should execute the processing step. The processing step is the heaviest in the data flow process. Most likely it will not have a noticeable impact on your site, but for optimization reasons you can specify which server is responsible for processing the raw data. This can be one web server, multiple web servers, or even a dedicated web server that does not serve the website itself. Use the setting IsProcessingServer to configure this.

If you use Umbraco in a load-balanced configuration, ensure the front-end servers have IsProcessingServer set to false. Also make sure that only the backend (Umbraco backoffice) server has this setting enabled.

Cleaning up the data

There is probably little or no reason to store this data forever. That is why there are two settings to clean it up.

  • The first setting is 'AnonymizeDataAfterDays'. After the set number of days, the data is anonymized. The data still counts towards aggregate reports such as pageviews, used browsers, and number of visitors, but it can no longer be related to an individual visitor.

  • The second setting is 'DeleteDataAfterDays'. After the set number of days, the data is deleted, because there is no benefit in keeping it indefinitely.
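How the two settings interact can be sketched as a simple age check. This is an assumption-laden illustration, not the product's cleanup code: the function retention_action and its parameter names are hypothetical, and it assumes the delete threshold takes precedence when both thresholds are exceeded.

```python
from datetime import datetime, timedelta, timezone


def retention_action(recorded_at, now, anonymize_after_days, delete_after_days):
    """Decide what should happen to a record based on its age.

    Assumption: deletion takes precedence over anonymization once both
    thresholds have passed.
    """
    age = now - recorded_at
    if age >= timedelta(days=delete_after_days):
        return "delete"
    if age >= timedelta(days=anonymize_after_days):
        return "anonymize"
    return "keep"
```

With, for example, 30 days for anonymization and 365 days for deletion, a record is kept as-is for the first 30 days, contributes only to aggregate reports after that, and is removed after a year.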
