Key BigQuery Concepts

Welcome to part 5 of our series on modernizing data platforms. In this installment and the ones that follow, we focus on the essential BigQuery concepts that data warehousing professionals need to know.

The first four installments covered the concept of modernization, data warehouse modeling and fundamentals, the key attributes of a modernized data platform, and the architectural framework that underpins large-scale data analytics platforms.

BigQuery is a cloud-based, modernized data warehouse solution. It addresses many of the limitations and pain points that users commonly face with traditional data warehouses.

BigQuery has several characteristics that matter for modern data warehousing: fast query processing, low-cost storage for petabytes of data, a serverless architecture that removes maintenance and operational overhead, compatibility with other well-known technologies, ease of migration, and built-in machine learning capabilities.

BigQuery is not much different from traditional data warehouse (DWH) concepts such as data marts, data lakes, tables, views, and grants/accesses, but it is more organized. What a traditional data warehouse calls a data mart is called a dataset in BigQuery. The raw data storage option, the data lake, is equivalent to Google Cloud Storage. Through the external data source integration feature, we can also query Google Drive content directly from BigQuery. Google provides fine-grained Identity and Access Management for controlling access to the datasets in BigQuery.

[Figure: Access to public datasets for exploration]

[Figure: Assigning specific roles to users while sharing a table]

BigQuery's main features are:

1. Loading and exporting data

2. Querying and viewing data

3. Managing data

The data can include transactional data, analytical data, and logs, which can be streamed in for further analysis.

Data can be written directly into BigQuery from external sources. Alternatively, BigQuery's analytical engine can be used on its own to process data, without using BigQuery storage.

Users can connect to and interact with BigQuery in three ways:

1. User interface / console

2. REST API

3. Command line

As mentioned earlier, BigQuery is serverless, so we do not need to worry about provisioning or allocating resources. Users only need to follow these steps:

1. Create a project

2. Create a dataset

3. Build the schema according to the design

4. Use GCP's native ETL techniques to load data into the tables

BigQuery organizes data into datasets and tables, so a table is always addressed with the structure project.dataset.table.
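To make the project.dataset.table structure concrete, here is a minimal sketch that splits such an identifier into its three parts. The helper name `parse_table_id`, the `TableRef` type, and the example identifier are hypothetical and not part of any BigQuery library. The sketch also assumes that none of the three parts contains a dot.

```python
from typing import NamedTuple


class TableRef(NamedTuple):
    """The three levels of a fully qualified table name."""
    project: str
    dataset: str
    table: str


def parse_table_id(table_id: str) -> TableRef:
    """Split a 'project.dataset.table' string into its three parts."""
    parts = table_id.split(".")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected project.dataset.table, got {table_id!r}")
    return TableRef(*parts)


# Hypothetical identifier, for illustration only.
ref = parse_table_id("my-project.sales_mart.orders")
print(ref.project, ref.dataset, ref.table)
```

Here the dataset part plays the role that a data mart plays in a traditional warehouse.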

In part 4, we discussed the core technology behind BigQuery and big data analytical processing systems. Two components, Dremel and Borg, play a key role in allocating resources and distributing the processing load within the system.

BigQuery distributes load in units called slots. A slot is a combination of CPU, memory, and networking resources; in essence, each slot is a virtual CPU with X amount of memory. When a user submits a query, BigQuery determines how many slots it needs based on the query's complexity: the more complex the query, the more slots are requested. Keep the following points in mind about slots:

1. BigQuery handles slot demand dynamically in flight: if a query requires more slots than are currently available, a portion of the work is queued.

2. Under the on-demand pricing model, the cost is based on the number of bytes the query processes.

3. Under the flat-rate pricing model, a fixed number of slots is reserved for the project and billed monthly.
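The queueing behavior and the on-demand model can be illustrated with a toy calculation. This is only a sketch under simplifying assumptions: it treats a query as a number of equal work units, lets each slot handle one unit per "wave", and uses a made-up price per TiB. The function names and all numbers are hypothetical and do not reflect BigQuery's actual scheduler or price list.

```python
import math


def waves_needed(work_units: int, available_slots: int) -> int:
    """Toy model: each slot handles one work unit per wave; excess units queue."""
    if available_slots <= 0:
        raise ValueError("available_slots must be positive")
    return math.ceil(work_units / available_slots)


def on_demand_cost(bytes_processed: int, price_per_tib: float) -> float:
    """On-demand model: cost is proportional to the bytes the query processes."""
    return bytes_processed / 2**40 * price_per_tib


# All numbers below are hypothetical, not actual BigQuery figures.
print(waves_needed(work_units=2500, available_slots=2000))  # 2
print(on_demand_cost(bytes_processed=3 * 2**40, price_per_tib=5.0))  # 15.0
```

In this toy model, a query that needs 2,500 units of work with only 2,000 slots available finishes in two waves, because 500 units wait in the queue until slots free up.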

Run-length encoding (RLE):

In part 4, we briefly discussed columnar data storage. Building on that, there are two clever methods for compressing the data further: bitmaps and run-length encoding (RLE). Let's look at an example.

Consider an e-commerce dataset with a table that records the order amount for each customer. Column A holds the customer ID and column B holds the total transaction value. Suppose column B has 100 values, 75 of which are distinct. That is, there are 100 transactions in total, with 75 distinct transaction values among them.

With bitmaps, the compressed form looks like this:

Transaction value: 100.00 [1,0,0,0,0,1,1,1,0,0,0,0,0,0,0,…,0]

Transaction value: 110.00 [0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,…,0]

.

.

.

Transaction value: 170.00 [0,0,0,0,1,0,0,0,0,0,…,1]

Inside the square brackets, each binary value indicates whether the transaction at that row position has the given transaction value: 1 means yes, 0 means no.
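The bitmap representation described above can be sketched in a few lines of Python. This is an illustrative sketch, not BigQuery's actual implementation. The function name `bitmap_encode` and the eight-row sample column are made up for the example.

```python
from collections import defaultdict


def bitmap_encode(column):
    """Build one bitmap per distinct value: bit i is 1 if row i holds that value."""
    bitmaps = defaultdict(lambda: [0] * len(column))
    for row, value in enumerate(column):
        bitmaps[value][row] = 1
    return dict(bitmaps)


# A made-up eight-row slice of column B (transaction values).
column_b = [100.0, 110.0, 110.0, 120.0, 170.0, 100.0, 100.0, 100.0]
bitmaps = bitmap_encode(column_b)
print(bitmaps[100.0])  # [1, 0, 0, 0, 0, 1, 1, 1]
print(bitmaps[110.0])  # [0, 1, 1, 0, 0, 0, 0, 0]
```

Each distinct value gets its own bitmap, and every row position is set to 1 in exactly one of them.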

Taking this a step further, RLE is applied to the bitmaps to compress the data even more:

Transaction value: 100 [1,4,3,92] → [one 1, four 0s, three 1s, and the rest 0s]

Transaction value: 110 [1,2,97] → [one 0, two 1s, and the rest 0s]

From the RLE representation above, we can tell how many rows in the entire dataset have a transaction value of 100, 110, and so on.
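The run lengths in the RLE examples above can be reproduced with a short sketch. One assumption is added here that the text does not spell out: the encoder also records the first bit, because the run lengths alone do not say whether a bitmap starts with a 1 (as for value 100) or a 0 (as for value 110). The function names are hypothetical, and the bitmaps assume that every elided position is 0.

```python
from itertools import groupby


def rle_encode(bitmap):
    """Return (first_bit, run_lengths) for a list of 0/1 values."""
    runs = [len(list(group)) for _, group in groupby(bitmap)]
    first_bit = bitmap[0] if bitmap else 0
    return first_bit, runs


def count_ones(first_bit, runs):
    """Count rows whose bit is 1 without expanding the bitmap."""
    # Runs alternate, so the 1-runs sit at even indices when the
    # bitmap starts with 1, and at odd indices otherwise.
    start = 0 if first_bit == 1 else 1
    return sum(runs[start::2])


bitmap_100 = [1, 0, 0, 0, 0, 1, 1, 1] + [0] * 92
bitmap_110 = [0, 1, 1] + [0] * 97
print(rle_encode(bitmap_100))  # (1, [1, 4, 3, 92])
print(rle_encode(bitmap_110))  # (0, [1, 2, 97])
print(count_ones(*rle_encode(bitmap_100)))  # 4
```

Summing the lengths of the 1-runs gives the number of rows with that transaction value, which is how the row counts can be read off the encoded form.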

Putting all of this together, the complete encoded view looks like this:

[Figure: the complete encoded view]

All of this happens inside BigQuery's Capacitor. Once encoded, the data is stored in Colossus, Google's distributed data storage system.

An important aspect of BigQuery is the separation of compute and storage: data is stored separately from the computing resources that process it. In addition, datasets are replicated redundantly so that data is not lost. As discussed in part 4, storage and compute are connected by Jupiter, Google's petabit-bandwidth network.

Source: https://medium.com/front-end-weekly/key-bigquery-concepts-a80269118115
