First Factory

Drive the bus: How to use the event bus pattern for your ETLs

March 25, 2021

AWS services offer many benefits, including solutions that simplify the process of extracting data from different sources and loading it into a new destination. The event bus pattern, one of many architectural patterns, can be divided into three major components: the event source, the event listener and the bus channel.

The event source is where the information comes from, and it normally triggers any process that should start. The event listener is paired with the event source and is in charge of processing whatever information comes from it. Lastly, the bus channel is in charge of transferring information between the parties, and it can have various implementations that adopt different data structures.
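The three components can be sketched in a few lines of Python. This is a minimal in-memory illustration, not an AWS implementation: the `EventBus` class, the `order.created` topic and the payload are all hypothetical names chosen for the example.

```python
from collections import defaultdict
from typing import Any, Callable, DefaultDict, List

Listener = Callable[[Any], None]


class EventBus:
    """Bus channel: carries events from sources to listeners."""

    def __init__(self) -> None:
        # Maps a topic name to the listeners subscribed to it.
        self._listeners: DefaultDict[str, List[Listener]] = defaultdict(list)

    def subscribe(self, topic: str, listener: Listener) -> None:
        """Register an event listener for a topic."""
        self._listeners[topic].append(listener)

    def publish(self, topic: str, payload: Any) -> int:
        """Deliver the payload to every listener; return how many ran."""
        listeners = self._listeners.get(topic, [])
        for listener in listeners:
            listener(payload)
        return len(listeners)


received: List[dict] = []
bus = EventBus()

# Event listener: here it only records what it receives.
bus.subscribe("order.created", received.append)

# Event source: publishing is what triggers the processing.
delivered = bus.publish("order.created", {"id": 1})
```

In this sketch the source only knows the topic name, so listeners can be added or replaced without changing the code that publishes.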

An ETL, or “extract, transform and load,” is a procedure for copying data from one or more sources into a destination system that represents the data differently from the source. Transforming the information could involve multiple steps such as encoding, sorting, aggregating, changing formats and combining information. The information could also come from different sources, which means we can combine options like microservices, text files, databases and much more. This can be done using tools like SQL Server Integration Services (SSIS) or CData Sync.
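As a rough illustration of the three phases, the sketch below extracts rows from CSV text, aggregates and reformats them, and loads the result into an in-memory SQLite table. The sample data, the `customer_totals` table and the function names are assumptions made up for this example.

```python
import csv
import io
import sqlite3
from collections import defaultdict

# Hypothetical source data standing in for a text-file feed.
SOURCE_CSV = """customer,amount
ana,10.50
luis,4.00
ana,2.25
"""


def extract(text):
    """Extract: read the raw rows from the source."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: aggregate per customer, change format, sort."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["customer"]] += float(row["amount"])
    return sorted((name.upper(), round(total, 2)) for name, total in totals.items())


def load(records, conn):
    """Load: write the transformed records to the destination."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS customer_totals (name TEXT PRIMARY KEY, total REAL)"
    )
    conn.executemany("INSERT INTO customer_totals VALUES (?, ?)", records)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(transform(extract(SOURCE_CSV)), conn)
result = conn.execute(
    "SELECT name, total FROM customer_totals ORDER BY name"
).fetchall()
```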

As an example, let’s imagine working on a complete rewrite of an existing system with multiple data sources like web services, XML feeds or text files. It’s important that the data used for development is as close to the original data as possible, because it gives a better picture of how the new application behaves compared with the original.

As is the industry standard, this project would be released in phases, and an agile methodology like scrum would be used in the development process. This would allow information to be pulled from the source as needed, so that in the first couple of sprints the team can focus on only a couple of modules at a time.

An advantage of creating a process like the one described above is that its structure lets you manipulate data for many different purposes. One of the core concepts of an ETL is transforming information, so if there is a limitation on what information the developers can see, you can use this process to change some values for testing purposes.
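One way to change values for testing is to mask sensitive fields during the transform step. The sketch below assumes a hypothetical set of sensitive field names and replaces their values with a deterministic placeholder, so the same input always maps to the same masked value and records can still be joined.

```python
import hashlib

# Hypothetical list of fields developers should not see in clear text.
SENSITIVE_FIELDS = {"email", "phone"}


def mask_value(value: str) -> str:
    """Replace a value with a stable, non-readable placeholder."""
    digest = hashlib.sha256(value.encode("utf-8")).hexdigest()
    return f"masked-{digest[:10]}"


def mask_record(record: dict) -> dict:
    """Return a copy of the record with sensitive fields masked."""
    return {
        key: mask_value(str(value)) if key in SENSITIVE_FIELDS else value
        for key, value in record.items()
    }


original = {"id": 7, "email": "ana@example.com", "plan": "pro"}
masked = mask_record(original)
```

A plain hash like this is only meant to hide values from casual view in a test environment. If the data is regulated, a salted or keyed scheme would be a safer starting point.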

Now that you know what kind of environment you will work in, it’s time to design the ETL that pulls the information. Start by selecting the trigger that will execute the entire process, which could be a user action or a cron expression. Next, define the route and the stops it will have. A stop can be a new data source from which to pull information, which means new passengers on the bus, or a destination data source, where the information will be dropped. You will then need to define the “passengers,” meaning which data points will be fetched at the starting point, which ones will be dropped at a specific stop, and whether new passengers will be picked up along the way.
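The route, stops and passengers can be modeled as plain data before any infrastructure is chosen. The sketch below is one possible shape, assuming hypothetical stop names (`legacy_api`, `xml_feed`, `new_database`) and hard-coded pick-up functions in place of real data sources.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

Passengers = Dict[str, object]


@dataclass
class Stop:
    name: str
    # Called with the passengers on board; returns new passengers to pick up.
    pick_up: Optional[Callable[[Passengers], Passengers]] = None
    # Names of the data points dropped off at this stop.
    drop_off: Tuple[str, ...] = ()


def drive(route: List[Stop], trigger_payload: Passengers):
    """Visit each stop in order; return what was dropped where and what is left."""
    on_board = dict(trigger_payload)
    delivered = {}
    for stop in route:
        if stop.pick_up:
            on_board.update(stop.pick_up(on_board))
        delivered[stop.name] = {
            key: on_board.pop(key) for key in stop.drop_off if key in on_board
        }
    return delivered, on_board


route = [
    Stop("legacy_api", pick_up=lambda p: {"customer": {"id": p["customer_id"], "name": "Ana"}}),
    Stop("xml_feed", pick_up=lambda p: {"orders": [101, 102]}),
    Stop("new_database", drop_off=("customer", "orders")),
]

# The trigger (a user action or a cron schedule) supplies the first passenger.
delivered, remaining = drive(route, {"customer_id": 42})
```

Writing the route down this way makes it easy to review which data point travels between which stops before each stop is turned into a queue and a handler.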

The final step is more oriented toward the design of the system and the important considerations that should be addressed in this process. These include:

  • Data integrity: It’s important to map which data points need to be processed first in order to maintain the sequence of events. The whole process can be understood as a database transaction. At each step/stop, we assume that the previous ones have been executed successfully, but if a problem occurs we get an exception and need to be able to handle it gracefully. There are multiple strategies, ranging from a complex rollback to a simple retry policy.
  • Detailed logging: Since you are moving to a serverless architecture with microservices, it’s important to log as many meaningful details as possible to help developers debug any problem in the future. This is where CloudWatch steps in, since that service will become one of your allies in solving any problem.
  • Concurrency problems: This is possibly the most difficult part of the whole process, because you need to think ahead about scenarios in which the same resource can be accessed by multiple data points or multiple stops within the execution. For instance, if we have a microservice that uploads images to S3 and serves them through a CDN like CloudFront, cache invalidations may be required, and those are subject to service limits.
  • Access restrictions: This is a general consideration in every software development process, but it is important to validate that the system has the right permissions to access all of the information that is needed.
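The simple retry policy and the detailed logging mentioned above can be combined in one small helper. This is a sketch under assumptions: the function name, the default of three attempts and the exponential delay are illustrative choices, not values from any particular service.

```python
import logging
import time

logger = logging.getLogger("etl")


def run_stop_with_retry(stop_name, action, max_attempts=3, base_delay=0.5, sleep=time.sleep):
    """Run one stop; retry on failure with exponential backoff.

    `sleep` is injectable so the helper can be exercised without waiting.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return action()
        except Exception:
            # Log the stop, the attempt number and the full traceback.
            logger.exception(
                "stop=%s attempt=%d/%d failed", stop_name, attempt, max_attempts
            )
            if attempt == max_attempts:
                # Out of attempts: let the caller decide on a rollback.
                raise
            sleep(base_delay * 2 ** (attempt - 1))
```

A retry is only safe when the stop is idempotent, meaning that running it twice leaves the destination in the same state as running it once. Otherwise a rollback strategy is the better fit.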

Now that the ETL process has been defined, here is what AWS provides to build it, through three popular services:

  • The first service that we will highlight is Amazon SQS, a message queuing service that enables you to decouple and scale microservices, distributed systems and serverless applications.
  • The second service is AWS Lambda, which lets you run code without provisioning or managing servers. This means that you pay only for the compute time you consume.
  • The third and final service is Amazon CloudWatch, which provides you with data and actionable insights to monitor your applications, optimize resources and get a unified view of the overall health of your systems.

Combining the concept of a stop described earlier with the services above gives the composition of a stop: an SQS queue that delivers the event message, a Lambda function that handles it, and CloudWatch to collect the logs.

To recap, you will have the data coming from the event message queue, then the code handler that is in charge of processing the data and, finally, all of the outside resources the handler interacts with. These could be the databases where we put information or the different sources from which we pull information, including microservices.
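A stop composed this way could look like the Lambda handler sketched below, which reads the messages delivered by SQS and logs each outcome (Lambda sends the function's log output to CloudWatch Logs). The `process_passenger` function and its validation rule are hypothetical placeholders for the real work of a stop. Returning `batchItemFailures` only has an effect if `ReportBatchItemFailures` is enabled on the SQS event source mapping, which is assumed here.

```python
import json
import logging

logger = logging.getLogger()
logger.setLevel(logging.INFO)


def process_passenger(passenger: dict) -> None:
    """Hypothetical stop logic: call a microservice, write to a database, etc."""
    if "id" not in passenger:
        raise ValueError("passenger has no id")


def handler(event, context):
    """Entry point invoked by Lambda with a batch of SQS messages."""
    failures = []
    for record in event.get("Records", []):
        message_id = record["messageId"]
        try:
            process_passenger(json.loads(record["body"]))
            logger.info("processed message %s", message_id)
        except Exception:
            logger.exception("failed message %s", message_id)
            # Only the failed messages are reported back for redelivery.
            failures.append({"itemIdentifier": message_id})
    return {"batchItemFailures": failures}
```

Reporting failures per message keeps one bad passenger from forcing the whole batch back onto the queue.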

This entire process makes it easier for data to move between all of the microservices: with a mapped route of how the information is managed, there is a clear understanding of how the services interact with each other.

AWS provides a variety of services that can help to satisfy various requirements. In this case, their flexibility and scalability are used to create a system that speeds up the process of extracting information for your projects.
