AWS offers many services, including solutions that simplify extracting data from different sources and loading it into a new destination. The event bus architectural pattern, one of many such patterns, can be divided into three major components: the event source, the event listener and the bus channel.
The event source is where the information comes from, and it normally triggers the process that should start. The event listener is tightly integrated with the event source and is in charge of processing whatever information the source produces. Lastly, the bus channel is in charge of transferring information between the parties, and it can be implemented in various ways using different data structures.
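To make the pattern concrete, here is a minimal, framework-free sketch of those three components in Python; the event name and listener are purely illustrative.

```python
# Minimal sketch of the event bus pattern: source publishes, bus routes, listener processes.
from collections import defaultdict

class EventBus:
    """The bus channel: routes events from sources to registered listeners."""
    def __init__(self):
        self._listeners = defaultdict(list)

    def subscribe(self, event_type, listener):
        self._listeners[event_type].append(listener)

    def publish(self, event_type, payload):
        # The event source calls publish(); the bus fans the event out to listeners.
        for listener in self._listeners[event_type]:
            listener(payload)

bus = EventBus()
bus.subscribe("OrderCreated", lambda order: print("processing order", order["id"]))
bus.publish("OrderCreated", {"id": 42})  # the event source triggers the process
```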
An ETL, or “extract, transform and load,” is a procedure for copying data from one or more sources into a destination system that represents the data differently from the source. Extracting information can involve multiple steps such as encoding, sorting, aggregating, changing formats and combining information. The information can also come from different sources, which means we can combine options like microservices, text files, databases and much more. This can be done using tools like SQL Server Integration Services or CData Sync.
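As a rough illustration, here is a small ETL skeleton in Python; the file names, field names and aggregation are placeholders rather than part of any particular tool.

```python
# Hedged ETL sketch: extract from two sources, transform (normalize + aggregate), load.
import csv
import json

def extract(csv_path, json_path):
    # Combine rows coming from two different source formats.
    with open(csv_path, newline="") as f:
        rows = list(csv.DictReader(f))
    with open(json_path) as f:
        rows += json.load(f)
    return rows

def transform(rows):
    # Example transformations: normalize the key format and aggregate amounts per customer.
    totals = {}
    for row in rows:
        customer = row["customer_id"].strip().upper()
        totals[customer] = totals.get(customer, 0.0) + float(row["amount"])
    return [{"customer_id": c, "total": t} for c, t in totals.items()]

def load(records, out_path):
    with open(out_path, "w") as f:
        json.dump(records, f, indent=2)

load(transform(extract("orders.csv", "orders.json")), "totals.json")
```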
As an example, let’s imagine working on a complete rewrite of an existing system with multiple data sources like web services, XML feeds or text files. It’s important that the data used for development is as close to the original as possible, as it will give a better picture of how the application functions in comparison to the original.
As is common in the industry, this project would be released in phases, and an agile methodology like Scrum would be used during development. This makes it easier to pull information from the source as needed, and it means the first couple of sprints can focus on only a couple of modules at a time.
An advantage of creating a process like the one described above is that this structure makes it easy to manipulate data for many different purposes. One of the core concepts of an ETL is transforming information, so if there is a limitation on what data the developers are allowed to see, the transform step can change sensitive values for testing purposes.
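For example, a transform step could mask sensitive fields so developers can test against realistic but anonymized data; the field names below are hypothetical.

```python
# Illustrative only: masking sensitive values during the transform step.
import hashlib

def mask_record(record):
    masked = dict(record)
    # Replace the real email with a stable, anonymized stand-in.
    masked["email"] = hashlib.sha256(record["email"].encode()).hexdigest()[:12] + "@example.com"
    # Keep only the last four digits of the SSN.
    masked["ssn"] = "***-**-" + record["ssn"][-4:]
    return masked

print(mask_record({"email": "jane@corp.com", "ssn": "123-45-6789", "amount": 19.99}))
```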
Now that you know what kind of environment you will work in, it’s time to design the ETL that will pull the information. Start by selecting the trigger that will execute the entire process; this could be an action performed by a user or a cron expression. Next, define the route and the stops it will have. A stop can be a new data source from which to pull information, meaning new passengers board the bus, or a destination data source, where information is dropped off. Then define the “passengers”: which data points are fetched at the starting point, which ones are dropped at a specific stop, and whether new passengers are picked up along the way.
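One way to capture this design is as plain data; the stop names, field lists and schedule below are hypothetical, with the trigger written as an EventBridge-style cron expression.

```python
# Hypothetical route definition: a trigger plus an ordered list of stops.
ROUTE = {
    "trigger": "cron(0 2 * * ? *)",     # e.g. an EventBridge schedule, nightly at 02:00 UTC
    "stops": [
        {
            "name": "orders-service",       # source stop: passengers board here
            "pick_up": ["order_id", "customer_id", "amount"],
        },
        {
            "name": "customers-db",         # enrichment stop: more passengers board
            "pick_up": ["customer_name", "region"],
        },
        {
            "name": "reporting-warehouse",  # destination stop: passengers get off
            "drop_off": ["order_id", "customer_name", "region", "amount"],
        },
    ],
}
```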
The final step is oriented toward the design of the system and the important considerations that should be addressed in the process. These include:
- Data integrity: It’s important to map which data points need to be processed first in order to maintain a sequence of events. The whole process can be understood as a database transaction: with each step/stop, we assume the previous ones executed successfully, but if a failure occurs we get an exception and need to handle it gracefully. There are multiple strategies, ranging from a complex rollback to a simple retry policy (see the retry sketch after this list).
- Detailed logging: Since you are moving to a serverless architecture with microservices, it’s important to have logging that provides enough meaningful detail to help developers debug any problem in the future. This is where CloudWatch steps in, since it is the service that will become one of your allies when solving problems (see the logging sketch after this list).
- Concurrency problems: This is possibly the most difficult part of the whole process, because you need to think ahead about scenarios where the same resource can be accessed by multiple data points or multiple stops within the execution. For instance, if a microservice uploads images to S3 and serves them through a CDN like CloudFront, invalidations may be required, and those come with their own limits.
- Access restrictions: This is a general consideration in every software development process, but it is important to validate that the system has the right permissions to access all of the information that is needed.
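Here is the retry sketch referenced under data integrity: a simple policy that retries a failing stop a few times before surfacing the error. `process_stop` and its payload are placeholders for whatever a stop actually does.

```python
# Simple retry policy around one stop; escalate to rollback or a dead-letter queue if it keeps failing.
import time

def with_retry(process_stop, payload, attempts=3, backoff_seconds=2):
    for attempt in range(1, attempts + 1):
        try:
            return process_stop(payload)
        except Exception:
            if attempt == attempts:
                # Out of retries: re-raise so a rollback or dead-letter queue can take over.
                raise
            # Back off a little longer on each attempt before trying again.
            time.sleep(backoff_seconds * attempt)
```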
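And here is the logging sketch referenced above. Inside Lambda, anything written through the standard logger ends up in CloudWatch Logs; the structured fields below are just one possible layout.

```python
# Structured logging sketch: emit one JSON line per event so CloudWatch Logs Insights can query it.
import json
import logging

logger = logging.getLogger()
logger.setLevel(logging.INFO)

def log_stop_event(stop_name, record_id, status, detail=""):
    logger.info(json.dumps({
        "stop": stop_name,
        "record_id": record_id,
        "status": status,    # e.g. "picked_up", "dropped_off", "failed"
        "detail": detail,
    }))
```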
Now that the ETL process has been defined, here is what AWS provides for us through three popular services:
- The first service that we will highlight is SQS, which is a message queuing service that enables you to decouple and scale microservices, distributed systems and serverless applications.
- The second service is AWS Lambda, which lets you run code without provisioning or managing servers. This means that you pay only for the compute time you consume.
- The third and final service is CloudWatch, which provides you with data and actionable insights to monitor your applications, optimize resources and get a unified view of the overall health of your systems.
Now, combining the concept of a stop described earlier with the services above, here is how a stop is composed.
To recap, you will have the data coming from the event message queue, then the code handler in charge of processing that data, and finally all of the outside resources the stop interacts with. These could be the databases where information is stored or the different sources information is pulled from, including microservices.
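Putting it together, a single stop might look like the following sketch: an SQS-triggered Lambda handler that processes each message and writes the result to an outside resource. The table name and message fields are assumptions for illustration.

```python
# One stop: SQS message in, handler transforms it, result lands in an outside resource.
import json
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("etl-destination")   # assumed destination table

def handler(event, context):
    # SQS delivers a batch of messages in event["Records"].
    for record in event["Records"]:
        passenger = json.loads(record["body"])   # the data point riding the bus
        transformed = {**passenger, "processed": True}
        table.put_item(Item=transformed)          # drop the passenger off at this stop
    return {"processed": len(event["Records"])}
```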
This entire process facilitates the flow of data between all of the microservices: by mapping out the route the information takes, there is a clear understanding of how the services interact with one another.
AWS provides a variety of services that can help satisfy these requirements; in this case, you can leverage their flexibility and scalability to create a system that speeds up extracting information for your projects.