You could get better results with RedShift if:
You could see better results with BigQuery if:
RedShift is Amazon’s data warehouse, and it’s part of Amazon’s massive overall cloud architecture, AWS.
Amazon acquired the source code for RedShift from ParAccel, who was developing ParAccel Analytic Database — a PostgreSQL-based database that utilized columnar data organization. So, RedShift is an MPP data warehouse that’s built on a PostgreSQL fork.
While RedShift shares many commonalities with PostgreSQL (such as its relational qualities), it also is unique in that it’s column structure. It doesn’t support indexes, and uses distribution styles and keys for data organization. Amazon also has a unique query execution engine for RedShift that differs from PostgreSQL.
The important thing to note about RedShift being built on top of a PostgreSQL fork is that it maintains some of its transactional qualities — making it a hybrid database of sorts. RedShift can still roll-back on transactions, which is a semi-unique feature in the data warehouse market.
More than 10,000 companies use Redshift, according to HG Insights. Customers include well-known brands such as McDonald’s, Pfizer, and Lyft.
A Redshift database is a cluster of worker database nodes. These nodes are able to split up a query, execute its constituent parts in parallel, then combine the results. As a result, you can horizontally scale performance by adding nodes to your
cluster. If you check out the Redshift pricing page 11 you’ll see that adding a node will give you more processing power, as well as more storage.
This is sort of a problem though: sometimes you need a lot of storage space, but you don’t need the extra processing power. The converse may also be true. This coupling of “storage” and “compute” is one of the main drawbacks of Redshift compared to the other databases listed below
Redshift comes with a bunch of configurations to optimize query performance like sort keys 3 and dist keys. You can also configure compression 4 for your columns, or perform table maintenance tasks like vacuum 4 and analyze 2. It ends up being close to a full-time job to keep Redshift humming along, and honestly, I think that time could be better spent actually doing analytics. The databases below don’t have as many bells and whistles, but you end up not really missing them. Maintaining Redshift can be a real chore.
BigQuery is Google’s data warehouse, and it’s part of Google’s massive overall cloud architecture, Google Cloud.
BigQuery was one of the first major data warehouses on the market following C-Store and Monet DB. To function, BigQuery executes Dremel (a query engine developed by Google for read-only nested data that supports an SQL-like syntax) over a REST interface.
More than 3,000 companies use BigQuery, including The New York Times, HSBC Bank, and Spotify, according to HG Insights.
BigQuery is really, very bad at doing joins. The recommended way to handle one-to-many relationships is through repeated records 53. Repeated records are a radically new data storage pattern that Google just decided to make up. These repeated records are incredibly useful and they work very well, but you’ll need to internalize this feature of BigQuery to make effective use of its power.
Notice UPDATING TABLES must USE STAGING TABLE
Redshift can do upserts, but only via a convoluted process. Redshift doesn’t provide an UPSERT command to update a table. Users must load data into a staging table and then join the staging table with a target table for an UPDATE statement and an INSERT statement.
RedShift’s pricing model is extremely simple.
With RedShift, you can choose from either Dense computing or large Dense Storage. The cheapest node you can spin up will cost you $0.25 per/hour, and it’s 160GB with a dc2.large node. Dense Storage runs at $0.425 per TB per hour. This cost covers both storage and processing. RedShift So, the lowest price you can get on RedShift is $306 per TB per month.
And, you can pay upfront for massive discounts.
This makes RedShift interesting to work with. If you can calculate your run-times and how often you’ll need to spin up each node, you can cut costs dramatically — especially if you pay upfront. Since most businesses aren’t going to be constantly running their RedShift nodes, getting granular is usually in your best interest.
For example, you may only run RedShift during the day when people are interacting with your stack or service. If that’s the case, you can adjust your upfront buying habits to reflect that behavior.
BigQuery’s pricing is much more complicated. On the surface, BigQuery looks cheaper. Storage costs $20 per TB per month, a good $286 cheaper than RedShift. But, BigQuery charges separately for storage and querying. Queries cost $5/TB. So, while storage is cheaper, query costs can add up quickly.
There are some pros and cons to this method. Really, BigQuery is perfect for a certain type of customer. Let’s say your business deals with spiky workloads. You run rapid queries a few times a day. BigQuery would be a far better option since you have to pay by the hour for RedShift. BigQuery may also be the best solution for data scientists running ML or data mining operations — since you’re dealing with extremely large, spikey workloads.
BigQuery costs $20 per TB per month for the storage line and $5 per TB processed on that storage line.
RedShift costs $306 per TB per month for storage AND unlimited processing on that storage.
There isn’t a winner here. RedShift is more economical for everyday data warehouse operations for most businesses. But, BigQuery is better for businesses looking to do data mining or those who deal with extremely variant workloads.
In our experience with clients, RedShift is great at handling everyday business processes. This means spinning a node during work hours for BI tools and interfaces. It’s less expensive, has plenty of power to handle semi-complex schemas, and it’s easy-to-use.
BigQuery is great at handling niche business workloads that query big chunks in a small timeframe and for data scientists and ML/data mining.
In many cases, the difference between the two is going to depend upon your RedShift resources. So, if you’re paying for a single dc2.large node, BigQuery is most likely going to outperform RedShift. But, if you’re spinning up an expensive 8-node DC1.8XL, RedShift is probably going to outperform BigQuery.
RedShift supports standard SQL data types, and BigQuery works with some standard SQL data types and a small range of sub-standard SQL. One of the biggest benefits of BigQuery is that it treats nested data classes as first-class citizens due to its Dremel capabilities. With RedShift, you have to flatten out your data before running a query.
Both of them can handle updates and deletes when something goes wrong in the query. Since BigQuery and RedShift are append-only, many assume they can’t do updates and deletes. They can. On BigQuery, the update and delete process exists, but it’s relatively expensive, and your options are limited. So, it’s not a widely used feature. With RedShift, you can reclaim tables with Postgre Vacuuming (which has its own host of complications) so update and delete support is typically better with RedShift.
Also, RedShift gives users the ability to roll back on transactions, which BigQuery doesn’t.
Out-of-the-box, BigQuery is much simpler to use than RedShift. You don’t have to perform many tweaks, cluster management is a breeze, and the complexities of database config, etc. are handled by BigQuery.
When it comes to security, both systems are comparable. RedShift uses Amazon IAM for identity, and BigQuery uses Google Cloud IAM. Both services work perfectly for almost all business scenarios. Google does have great B2B identity management with OAuth, so you can give identity controls to 3rd parties without introducing them into your entire ecosystem.
Both Google and Amazon (unsurprisingly) have a wealth of integrations available. Redshift is built on a PostgreSQL fork
#bigdata #comparison #bigquery #redshift