MODERNIZED DATA ARCHITECTURE BASED ON BIGQUERY FOR A REAL ESTATE PLATFORM

Boosting the performance and scalability of a legacy data architecture for a real estate platform

About the project

Client: Real Estate Company

Location: USA

Company Size: 50+ Employees

Industry: Real Estate

Technologies: C#, .NET 8.0, ASP.NET WebAPI, Azure SQL, GCP BigQuery, AppScript

An efficient data management architecture capable of collecting and processing large volumes of data from multiple sources. The solution handles a complex data model involving numerous tables.

The data architecture had not been changed for several years, so the Leobit team had to deal with a complex legacy system, completely restructure the data integration flows, and deliver a new architecture free of the performance bottlenecks, limited capacity, and poor organization that characterized its predecessor.


Our team showcased great speed and efficiency in improving the customer’s data management architecture. In three weeks, we provided the client with a modernized architecture that made data collection more performant and scalable and prepared it for gradual updates that don’t trigger the entire rebuild process.

Serhii H.

SDO at Leobit

Customer

The customer is a U.S.-based real estate company dealing with large volumes of data on properties, prices, taxes, ZIP codes, and more from across the U.S. The customer runs a complex data infrastructure that helps them structure business-critical data and gain insights. However, as data volumes continued to grow, the customer decided to move to a more efficient and scalable data infrastructure.

Business Challenge

Our customer faced issues with the performance of the scripts responsible for collecting and preparing data from multiple sources. The existing solution consisted of several tables with daily updated data hosted on Azure SQL. The same database contained a predefined stored procedure with logic to select, aggregate, and join all the tables. As output, the stored procedure returned a flat, denormalized table.

This table was kept in memory and uploaded to a GCP BigQuery table over the network via WebAPI endpoints. BigQuery was used to analyze the data and create pivots and charts in a separate Google spreadsheet. Over several years, the data in the tables grew continuously, and selecting and joining it into the needed format became a performance bottleneck. Gathering and joining 270,000 records took more than 15 minutes, and this time kept increasing because new data was added to all the tables every hour.
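For illustration, here is a minimal C# sketch of what such a legacy flow looks like; the stored procedure name, connection string source, and schema are assumptions, not the customer’s actual code:

```csharp
// Minimal sketch of the legacy flow (illustrative names, not the customer's code).
using System.Data;
using Microsoft.Data.SqlClient;

var azureSqlConnectionString =
    Environment.GetEnvironmentVariable("AZURE_SQL_CONNECTION")
    ?? throw new InvalidOperationException("Connection string not set");

var rows = new List<Dictionary<string, object>>();

await using (var connection = new SqlConnection(azureSqlConnectionString))
await using (var command = new SqlCommand("dbo.BuildDenormalizedReport", connection))
{
    command.CommandType = CommandType.StoredProcedure;
    command.CommandTimeout = 0; // the join alone ran for 15+ minutes

    await connection.OpenAsync();
    await using var reader = await command.ExecuteReaderAsync();
    while (await reader.ReadAsync())
    {
        // Every row of the flat, denormalized result set is buffered here,
        // so memory use grows linearly with the data (~270,000 rows).
        var row = new Dictionary<string, object>();
        for (var i = 0; i < reader.FieldCount; i++)
        {
            row[reader.GetName(i)] = reader.GetValue(i);
        }
        rows.Add(row);
    }
}

// The buffered table was then serialized and pushed to BigQuery
// through WebAPI endpoints -- the memory and network bottleneck.
```

Because the whole result set is buffered before upload, both memory use and run time grow with the data, which is exactly the bottleneck described above.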

Why Leobit

The customer chose Leobit for our ability to design creative and efficient solutions. We proposed a fast and highly efficient way to optimize their data infrastructure, demonstrating strong expertise in the key technologies needed to manage the project successfully.

Project in detail


While it wasn’t a long-term project, it involved numerous stages and diverse workflows: identifying the main problems with the existing system, planning the required improvements, implementing them, and testing the system.

We started the project with a short investigation to identify the main reasons behind the system’s performance issues. It revealed that the main bottleneck of the existing architecture was the use of in-memory computations on large and complex datasets.

This approach could work in the early stages, but it was not scalable enough to handle continuously growing data. Moreover, it forced the customer to run a complete rebuild even after minor changes to a single table or row.

After investigating the existing system, our team started planning a new architecture that would address all of the customer’s challenges. We defined the proposed solution along with the BigQuery schemas, tables, and data setup changes for the new architecture. Relying on our extensive expertise in database technologies, our specialists also specified all features of the new architecture in detail.

Once the blueprint of the new architecture was ready, we started working on the new solution. First, we introduced changes to the stored procedures and views. Our specialists then set up the new architecture based on GCP services, moving all data operations from the application’s memory to the cloud infrastructure. Our .NET developers also rewrote the application code behind these changes and new implementations.
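As a hedged sketch of the core idea, the heavy join and aggregation can run entirely inside BigQuery, with the .NET service only orchestrating the job; the project, dataset, table, and column names here are illustrative assumptions:

```csharp
// Hedged sketch: the heavy join/aggregation runs inside BigQuery.
// Project, dataset, table, and column names are illustrative assumptions.
using Google.Cloud.BigQuery.V2;

var client = await BigQueryClient.CreateAsync("my-gcp-project");

var job = await client.CreateQueryJobAsync(
    sql: @"SELECT p.property_id, p.price, t.tax_amount, z.zip_code
           FROM real_estate.properties AS p
           JOIN real_estate.taxes     AS t USING (property_id)
           JOIN real_estate.zip_codes AS z USING (property_id)",
    parameters: null,
    options: new QueryOptions
    {
        // Materialize the result server-side instead of streaming it
        // back into application memory.
        DestinationTable = client.GetTableReference("real_estate", "denormalized_report"),
        WriteDisposition = WriteDisposition.WriteTruncate
    });

// Only job metadata travels back to the .NET service; BigQuery does the work.
await job.PollUntilCompletedAsync();
```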

After implementing the new architecture, we ran several types of testing to confirm that the system worked well. In particular, we ran integration testing to ensure that all parts of the solution worked properly together. We also ran performance testing, which revealed significant improvements in system performance and speed. After running the tests, we made several adjustments; in particular, we optimized the Google Sheets integration used as part of the system’s reporting layer.


INNOVATIVE DATA ARCHITECTURE

The Leobit team decided to rework the architecture to leverage GCP infrastructure benefits and computation resources. The new approach does not depend on machine specs or the amount of available memory. Moreover, it scales within the GCP infrastructure when needed and can be split across several nodes.

The new architecture synchronizes data from Azure SQL tables with corresponding tables in BigQuery. We can select specific tables with new data for updates or update all the tables within the infrastructure. A BigQuery view holds all the needed logic and uses the related tables to gather data, which can later be imported into any Google Sheet for further analysis.
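A minimal sketch of what such a per-table synchronization could look like, assuming the same illustrative dataset and a hypothetical ExportTableToCsvAsync helper that streams one Azure SQL table out as CSV:

```csharp
// Hedged sketch of the per-table sync; table and dataset names are
// illustrative, and ExportTableToCsvAsync is a hypothetical helper.
using Google.Cloud.BigQuery.V2;

var client = await BigQueryClient.CreateAsync("my-gcp-project");

// Sync everything, or pass only the tables that actually changed.
string[] tablesToSync = { "properties", "taxes", "zip_codes" };

foreach (var table in tablesToSync)
{
    await using var csv = await ExportTableToCsvAsync(table);

    // Load the CSV into the same-named BigQuery table, replacing its
    // contents; schema: null keeps the existing table's schema.
    var job = await client.UploadCsvAsync(
        "real_estate", table, schema: null, input: csv,
        options: new UploadCsvOptions
        {
            SkipLeadingRows = 1, // header row
            WriteDisposition = WriteDisposition.WriteTruncate
        });
    await job.PollUntilCompletedAsync();
}

// Hypothetical: streams one Azure SQL table out as CSV.
static Task<Stream> ExportTableToCsvAsync(string table) =>
    throw new NotImplementedException();
```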


SIGNIFICANT BOOST IN SYSTEM PERFORMANCE

The redesign of the customer's data architecture significantly boosted its performance. Before and after the implementation, we ran simple performance tests for full data processing, rebuilding the entire dataset from scratch.

These tests confirmed the performance improvements we managed to achieve: even with growing data loads, BigQuery’s parallel job execution speeds up processing, and a full rebuild became more than eight times faster.
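For reference, a before/after measurement of this kind can be as simple as timing the rebuild end to end; RunFullRebuildAsync below is a hypothetical stand-in for the pipeline under test, whether the legacy in-memory flow or the new BigQuery job:

```csharp
// Sketch of the before/after measurement: time a full rebuild end to end.
// RunFullRebuildAsync is a hypothetical stand-in for the pipeline under test.
using System.Diagnostics;

var stopwatch = Stopwatch.StartNew();
await RunFullRebuildAsync();
stopwatch.Stop();

Console.WriteLine($"Full rebuild took {stopwatch.Elapsed.TotalMinutes:F1} minutes");

static Task RunFullRebuildAsync() =>
    throw new NotImplementedException();
```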


FEATURES FOR SCALABILITY AND DYNAMIC DATA UPDATES

Because the system runs in the cloud and is built with scalability in mind, the customer can continuously scale the infrastructure up. We also ensured a highly dynamic approach to data updates: a full data rebuild is an infrequent operation, and usually our client can simply append or update some data in existing datasets and tables.

Compared to a full rebuild, these operations take a few seconds, depending on the amount of updates. Appends are especially simple because they work natively through BigQuery’s insert mechanics. In the previous architecture, by contrast, processing, uploading, and performing almost all operations in memory created a huge performance bottleneck that would have kept growing geometrically over time.
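As a hedged illustration of such an append, assuming the same illustrative dataset and column names as above, BigQuery’s streaming inserts send only the new rows:

```csharp
// Hedged sketch of an incremental append via BigQuery streaming inserts;
// dataset, table, and column names are illustrative assumptions.
using Google.Cloud.BigQuery.V2;

var client = await BigQueryClient.CreateAsync("my-gcp-project");

var newRows = new[]
{
    new BigQueryInsertRow
    {
        { "property_id", 10001 },
        { "price", 425000.00 },
        { "zip_code", "73301" }
    }
};

// Only the new rows travel over the network; no rebuild is triggered,
// and the reporting view picks them up on its next query.
await client.InsertRowsAsync("real_estate", "properties", newRows);
```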

Technology Solutions

  • The GCP-based architecture leverages cloud computing resources combined with efficient data storage and transfer, delivering good performance results even for a complete data rebuild when needed.
  • Workflows efficiently synchronize data from Azure SQL tables with corresponding tables in BigQuery, ensuring data consistency across the entire infrastructure.
  • System performance was boosted more than 8 times through modernized workflows and parallel job execution.

Value Delivered

  • A new and efficient data architecture allows our client to boost the performance of data collection and processing and achieve greater scalability.
  • Convenient updating mechanics that allow the customer to introduce gradual updates to particular data without triggering the complete system workflow.
  • An innovative solution ready for operation after three weeks of development and testing.