Welcome to part 5 of our series on modernizing data platforms. In this installment and the ones that follow, we focus on the essential BigQuery concepts that data warehousing professionals need.


The first four installments covered the concept of modernization, data warehouse modeling and fundamentals, the key attributes of a modernized data platform, and the architectural framework that underpins large-scale data analytics platforms.


BigQuery is a cloud-based data warehouse that changes how users manage their data. It is designed to address the limitations and pain points commonly associated with traditional data warehouses.


BigQuery offers several characteristics that matter for modern data warehousing: fast query processing, cost-effective storage for petabytes of data, a serverless architecture that removes the need for operational maintenance, compatibility with other widely used technologies, and ease of migration. Its built-in ML capabilities add to its appeal as an all-round solution.


BigQuery maps closely onto traditional data warehouse concepts such as data marts, data lakes, tables, views, and grants, but it organizes them more rigorously. What a traditional data warehouse calls a data mart corresponds to a dataset in BigQuery, and the data lake, the raw data storage option, corresponds to Google Cloud Storage. Through external data source integration, content on Google Drive can also be queried directly from BigQuery. In addition, Google's fine-grained identity and access management system allows strict control over access to BigQuery datasets.

BigQuery boasts the following essential features:


1. Data Loading and Exporting


2. Querying and viewing the data

3. Data management


The data may consist of transactional data, analytical data, and logs that can be streamed in for further analysis.


Data can be written directly into BigQuery from external sources, or it can be processed with BigQuery's analytical engine alone, without using BigQuery's storage.


There are three ways for users to connect and interact with BigQuery:


As mentioned earlier, BigQuery is serverless, so we do not need to worry about resource provisioning or allocation. Users simply follow the steps below:


1. Create a project


BigQuery organizes data into datasets and tables, addressed through the consistent hierarchy project.dataset.table.
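
To make the naming hierarchy concrete, here is a minimal Python sketch that splits a fully qualified table reference into its three parts. The helper `parse_table_id` and the names `my-project`, `sales_mart`, and `transactions` are hypothetical, chosen only for illustration; they are not part of any BigQuery API. The sketch also assumes that none of the three parts contains a dot.

```python
from typing import NamedTuple


class TableRef(NamedTuple):
    """The three levels of the project.dataset.table hierarchy."""
    project: str
    dataset: str
    table: str


def parse_table_id(table_id: str) -> TableRef:
    """Split 'project.dataset.table' into its components (hypothetical helper)."""
    parts = table_id.split(".")
    # Reject anything that does not have exactly three non-empty parts.
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected project.dataset.table, got {table_id!r}")
    return TableRef(*parts)


# Hypothetical names, used only for illustration.
ref = parse_table_id("my-project.sales_mart.transactions")
print(ref.project, ref.dataset, ref.table)
```

In data warehouse terms, the `dataset` field plays the role of the data mart, and the `table` field is the table inside it.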


In Part 4, we discussed the core technology behind BigQuery and similar big data analytical processing systems. Two components, Dremel and Borg, play a key role in allocating resources and distributing the processing load within the system.

BigQuery distributes load in units called slots. A slot bundles CPU, memory, and networking resources; each slot is essentially a virtual CPU with some amount of memory. When a user submits a query, BigQuery determines how many slots it needs based on the query's complexity: the more complex the query, the more slots are requested. Here are some key points to keep in mind about slots:


1. BigQuery handles slot shortages dynamically at run time: if a query requires more slots than are currently available, part of the work is queued until slots free up.

2. Slots can be used under an on-demand pricing model, where the cost is based on the number of bytes the query processes.

3. Slots can also be allocated under a flat-rate pricing model, where a fixed number of slots is reserved for the project and billed monthly.
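
The points above can be illustrated with a deliberately simplified Python sketch. It is a toy model, not BigQuery's actual scheduler or billing logic: it assumes each slot handles one unit of work at a time, and the function names, the work-unit counts, and the `price_per_tib` value are all hypothetical placeholders rather than real figures.

```python
import math


def execution_waves(work_units: int, available_slots: int) -> int:
    """Rounds needed when work beyond the available slots has to queue.

    Toy assumption: each slot processes exactly one work unit per round.
    """
    if available_slots <= 0:
        raise ValueError("available_slots must be positive")
    return math.ceil(work_units / available_slots)


def on_demand_cost(bytes_processed: int, price_per_tib: float) -> float:
    """On-demand model: cost scales with the bytes a query processes.

    price_per_tib is a placeholder parameter, not a published price.
    """
    return bytes_processed / 2**40 * price_per_tib


# 250 work units on 100 slots: the last 50 units wait for a third round.
print(execution_waves(250, 100))

# A query that processes 2 TiB at a placeholder rate of 5.0 per TiB.
print(on_demand_cost(2 * 2**40, 5.0))
```

Under the flat-rate model, by contrast, the cost would not depend on `bytes_processed` at all; only the number of reserved slots would matter.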

Run-length encoding (RLE):


In Part 4, we briefly discussed columnar data storage. Building on that, two techniques compress the data further: bitmaps and run-length encoding (RLE). Let's look at an example.




The compression method using bitmaps appears as follows:


Inside the square brackets, each bit indicates whether a given transaction has the corresponding transaction value: 1 means it does, and 0 means it does not.


RLE then compresses the data even further, as follows:


From the RLE representation, we can read off how many rows in the dataset have a transaction value of 100, 110, and so on.


Putting these pieces together, the complete encoded view looks as follows:



All of this happens inside Capacitor, BigQuery's columnar storage format. Once encoded, the data is stored in Colossus, Google's distributed storage system.
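
The bitmap and run-length steps described in this section can be sketched in a few lines of Python. This is only an illustration of the two techniques on a small, made-up column of transaction values; it is not how Capacitor is implemented, and the function names `bitmap_encode` and `run_length_encode` are hypothetical.

```python
from itertools import groupby


def bitmap_encode(column):
    """Map each distinct value to a 0/1 list marking the rows where it occurs."""
    return {
        value: [1 if cell == value else 0 for cell in column]
        for value in sorted(set(column))
    }


def run_length_encode(bits):
    """Collapse consecutive repeats into (bit, run_length) pairs."""
    return [(bit, len(list(run))) for bit, run in groupby(bits)]


# Made-up transaction values, one per row.
column = [100, 100, 110, 110, 110, 100]

bitmaps = bitmap_encode(column)
encoded = {value: run_length_encode(bits) for value, bits in bitmaps.items()}

for value, runs in encoded.items():
    # Rows holding this value = total length of the runs whose bit is 1.
    row_count = sum(length for bit, length in runs if bit == 1)
    print(value, bitmaps[value], runs, row_count)
```

Each bitmap has one bit per row, and the run-length pairs shrink long stretches of identical bits into a single entry, which is why the approach pays off on columns with many repeated values.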

BigQuery offers a crucial feature: the separation of compute and storage. Data is stored separately from the computing resources, which allows for more efficient processing. BigQuery also protects data reliability by replicating datasets redundantly, guarding against data loss. As mentioned in Part 4, storage and compute are connected by Jupiter, Google's high-speed, petabit-bandwidth network.


