SaaS Data Aggregation: A Design Pattern for the modern Enterprise

Technical Architects and Solution Architects have recently had to consider a wide variety of systems and integrations to satisfy business needs. The evolution of Software-as-a-Service solutions on the market has driven a considerable increase in a hybrid application landscape that combines internal homegrown systems, third party supplier/partner systems and SaaS-based applications.
When defining your Enterprise Architecture for the modern business, it is no longer enough to only consider your internal business systems. Very few companies can still do this, and as such, the Architect must plan and manage this new hybrid environment in a systematic and controlled manner.
Your Architectural Principles must consider the introduction and adoption of cloud-based SaaS providers. Multi-tenanted SaaS offerings such as Salesforce CRM, Concur expense management, Xero accounting, Stripe Payments, Workday HR management and Atlassian productivity all provide a compelling offer over building such capabilities in-house. As such, it is inevitable that your Enterprise landscape will integrate with some of these systems. Your colleagues will suggest these, and quite often your business will begin using some SaaS systems without involving the technology team given the ease of sign up for free trials or paid plans using a credit card. It is often an after-thought to discuss the use of such systems when it becomes clear that the technology team should, of course, have been involved from the outset. Rather than reacting to these scenarios, it is best to plan for them and to socialise amongst your colleagues a set of guidelines for using SaaS systems and a process for on-boarding one to the organisation.
Interoperability of systems is a crucial consideration. When you choose a SaaS product, it is crucial to fully understand how your internal systems will integrate with the SaaS provider. In some cases, they may be allowed to run entirely in isolation, but it is common that some degree of integration will bring value to your organisation. Most modern SaaS providers provide an API for example; Some also allow for webhooks to be configured. In some cases, you may need to push data into a SaaS system for consistency, and in other cases, you need to pull data from the SaaS system into your data warehouse or Event Hubs such as Azure or Kafka. The business case for this data integration may be for Workflow automation, Business Intelligence, Reporting or Regulatory audits and filings.
The security model used by the SaaS provider also needs to be carefully assessed. Most modern providers use OAuth2 which provides a robust method that enables an administrator to authenticate to the SaaS provider utilizing a user interface flow or API calls. It is imperative to store the multiple login credentials and SaaS tokens securely. The security aspect alone requires deep architectural thinking to define the end to end solution that meets your organisation’s architectural and information security principles and guidelines.
A degree of complexity is inevitable within a complex environment. If your organisation utilises many SaaS applications, you need to have a defined process for risk-assessing, on-boarding, monitoring and maintaining the integration to each. The total cost of ownership needs to be fully assessed. Understanding what data is required across the application landscape should enable you to choose the data flows between SaaS systems and your internal environment. A common need is to bring data from multiple SaaS systems into a central data warehouse for business and regulatory reporting or business intelligence.
We consider the SaaS Data Aggregation design pattern to be the most suitable for this need. This is an extension of the Data Aggregation pattern. This integration pattern involves taking data from multiple SaaS sources and loading it into a single central repository. In the past, file-based data transfer using Extract-Transform-Load (ETL) was the most common process for this. With SaaS systems, APIs are more commonly used. Because SaaS systems are external, they must be categorised as a higher risk than internal systems. Therefore your architectural guidelines when dealing with SaaS systems will differ from standard Data Aggregation patterns. However, some of the key principles of ETL and Data Warehousing should not be discarded.
  • The process should use OAuth2 where possible to authenticate to the SaaS API layer.
  • The process should reside in a secure zone or within an isolated tenant of your cloud infrastructure.
  • The process should only take data that has changed since the last run.
  • The process connection to your database should be secured by firewall rules and ideally a VPN tunnel.
  • The process that is inserting the data to the internal database should be using an authentication credential with least privilege principal. The user should only have access to the staging tables that data will be inserted to.
  • The data should be loaded with minimal treatment to a table in a staging schema.
  • The user running the data import should not have any access to any other part of the system other than the required staging tables.
  • Errors should be logged to a specific log table within the staging schema -and not allowed access to core logging/auditing infrastructure.
  • A separate process running within your secure environment — and with higher privileges — should process the staging data into the operational data store and audit/error logging platform and then archive and truncate the staging tables.
Recovery and re-runs are an essential consideration with SaaS Data Aggregation. For example, a SaaS provider’s API may extend the available fields for an entity over time. When this occurs, you may wish to bring that new field into your data warehouse. You, therefore, need to design the ability in your processes to refresh all data or within a specified period. If a systems issue corrupts data and recovery from a backup is not feasible you may also want to restore the data to be sure of its integrity.
There are many technical solutions to achieve this.
  • Enterprise Integration platforms such as Mulesoft, TIBCO and Boomi are commonly used as they provide a framework for orchestrating data flows between multiple systems. These are developer led platforms that require a significant investment and on-going commitment to implement.
  • Custom built applications by your internal development teams can also be used to process SaaS data so that it is aggregated in your central data warehouse. These can be in the form of API calls pulling data or the SaaS platform pushing data to webhooks.
  • Specialist SaaS connector providers such as can be used to authenticate to your SaaS platform and push the data into a database in a secure, controlled manner with zero code required by your team.
How you decide to implement this design pattern will, of course, be determined by your specific business needs, budget and internal capability & capacity. The important thing is to assume it will happen and build this design pattern into your developmental processes from the outset — or retro-fit it in now.
Very few organisations operate without at least one SaaS application, and as such, it is imperative to work with these applications in a planned, controlled manner.

About the author:

Paul Smullen is a technologist focused on data integration and the founder of which is a cloud based platform that enables data aggregation from SaaS applications to your data warehouse with zero code.