Databricks and Snowflake got into a brief, public spat in November when Databricks claimed to be the world’s fastest data warehouse.
Snowflake took offense, saying Databricks was cheating.
Databricks countered, claiming that Snowflake was actually the one cheating.
Does this really matter? Probably not, because benchmarks are often unrepresentative of real-world use. And even if you tried to do an apples-to-apples comparison with real data under real usage scenarios, there are just too many confounding variables in performance benchmarking.
So what’s the bottom line? All the big players in the market, GCP BigQuery, AWS Redshift, Snowflake, and Azure Synapse, work on the same basic principles. They’re competitors, and none of them allows the others to run too far ahead of the pack in price or performance.
So when making a choice, let features, ecosystem, documentation, and ease of use dominate any discussions of which vendor is more suitable for your needs. And check out the next two articles to see what Databricks Photon and Snowflake Snowpark offer.
Photon is a ground-up C++ rewrite of Spark’s SQL engine. Previously, Databricks/Spark was seen as a tool for data engineers and data scientists, and it was not comparable to a traditional data warehouse. Photon is Databricks’ attempt to break into that market.
The big difference between Databricks and Snowflake architecture is how they store their data. Databricks stores its data in a data lake using S3/Azure Storage in open formats like Parquet/Delta Lake. Snowflake keeps its data locked in a proprietary format inside their servers. There are pros and cons to both approaches, but that’s a topic for another day.
Read about Photon, The Next-Generation Query Engine for the Lakehouse
At its core, Snowpark is a way to programmatically create and execute Snowflake SQL. It is Snowflake’s attempt to break into the data engineering and data science market.
Instead of writing this:
SELECT COUNT(*) FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA = 'PUBLIC'
You would write:
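import com.snowflake.snowpark.functions.col
// Describe the query: tables in the PUBLIC schema (nothing runs yet)
val dfTables = session.table("INFORMATION_SCHEMA.TABLES").filter(col("TABLE_SCHEMA") === "PUBLIC")
// count() executes the query in Snowflake and returns the number of rows
val tableCount = dfTables.count()
val currentDb = session.getCurrentDatabase.getOrElse("<no current database>")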
On its own this might seem underwhelming, though it is useful for people who are more comfortable in Python or Scala than in SQL. Snowpark becomes more interesting when you start using user-defined functions (UDFs), which let Snowflake execute custom code. Right now these features are still very new, but they will likely become more robust over the next couple of years.
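To make the "programmatically create SQL" idea concrete, here is a toy sketch in plain Python. It is not the Snowpark API: `ToyFrame`, `filter`, `to_sql`, and `count_sql` are hypothetical names invented for illustration, and the sketch only builds a SQL string rather than talking to Snowflake. The point it tries to show is that each chained call describes the query, and SQL text is only produced when you ask for a result.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyFrame:
    """Hypothetical stand-in for a lazy DataFrame; NOT the Snowpark API."""

    table: str
    predicates: tuple = ()

    def filter(self, column: str, value: str) -> "ToyFrame":
        # Return a new frame with one more equality predicate; nothing runs yet.
        escaped = value.replace("'", "''")  # double single quotes for SQL literals
        return ToyFrame(self.table, self.predicates + (f"{column} = '{escaped}'",))

    def _where(self) -> str:
        # Join all predicates with AND, or emit nothing if there are none.
        return f" WHERE {' AND '.join(self.predicates)}" if self.predicates else ""

    def to_sql(self) -> str:
        return f"SELECT * FROM {self.table}{self._where()}"

    def count_sql(self) -> str:
        return f"SELECT COUNT(*) FROM {self.table}{self._where()}"


tables = ToyFrame("INFORMATION_SCHEMA.TABLES").filter("TABLE_SCHEMA", "PUBLIC")
print(tables.count_sql())
```

Because the frame is immutable, every `filter` call returns a new object, so partial queries can be reused and extended without side effects. A real client library would also send the generated SQL to the warehouse and parse the result, which this sketch leaves out.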
Find out what Snowpark is about
Power BI released Premium Generation 2 (Gen2), a new version of Power BI Premium that is mostly about more flexible compute. Microsoft promises that analytics performance will be up to 16x faster than on Gen1. It has also lifted the limits on how many reports can be refreshed at the same time and added autoscaling features. Find more details in the attached article.
Read “What is Power BI Premium Gen2?”
Perform Batch and Streaming Using the BigQuery Storage Write API
Most data warehouses have struggled with large numbers of small inserts. Generally, they prefer large batch writes. This can be a problem for businesses that need closer-to-real-time reporting.
A couple of years ago, Snowflake introduced Snowpipe, whose loads, while still technically batch loads, are much smaller and faster. BigQuery’s Storage Write API is even more granular, allowing for a true stream of data into tables.
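The trade-off between many small inserts and fewer batch writes can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the Snowpipe or BigQuery Storage Write API: `MicroBatcher`, `sink`, `max_rows`, and `max_age_s` are hypothetical names, and `sink` stands in for whatever bulk-load call your warehouse client provides.

```python
import time
from typing import Callable, List, Optional


class MicroBatcher:
    """Buffers single-row inserts and hands them to `sink` in batches (hypothetical helper)."""

    def __init__(self, sink: Callable[[List[dict]], None],
                 max_rows: int = 500, max_age_s: float = 5.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._sink = sink            # assumed bulk-load function supplied by the caller
        self._max_rows = max_rows    # flush once this many rows are buffered
        self._max_age_s = max_age_s  # or once the oldest buffered row is this old
        self._clock = clock
        self._rows: List[dict] = []
        self._first_at: Optional[float] = None

    def add(self, row: dict) -> None:
        if not self._rows:
            self._first_at = self._clock()  # remember when this batch started
        self._rows.append(row)
        too_big = len(self._rows) >= self._max_rows
        too_old = self._clock() - self._first_at >= self._max_age_s
        if too_big or too_old:
            self.flush()

    def flush(self) -> None:
        # Hand the buffered rows to the sink as one batch and reset the buffer.
        if self._rows:
            batch, self._rows = self._rows, []
            self._first_at = None
            self._sink(batch)


loaded: List[List[dict]] = []
batcher = MicroBatcher(loaded.append, max_rows=3)
for i in range(7):
    batcher.add({"event_id": i})
batcher.flush()  # push out whatever is left
print([len(batch) for batch in loaded])
```

Lowering `max_rows` and `max_age_s` moves the behavior closer to streaming at the cost of more write calls. Note that in this sketch the age check only runs when a new row arrives, so a production version would also need a timer to flush an idle buffer.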
Read more about the BigQuery Storage Write API
Outsiders think developers sit down at 8 a.m., start typing, take their hands off the keyboard at 5 p.m., then go home. We in the industry know that isn’t true. You spend much of your time in meetings, debugging, testing, and foraging for information.
Information foraging is a model of how we find information, based on the way animals forage for food. Here is a concrete example.
You have a job running on a schedule in GCP and see that it failed this morning.
Step 1: Check the job logs. Hmm, it ran successfully. Weird.
Step 2: Check the scheduler logs. You see it had a timeout error.
Step 3: You Google “GCP scheduler timeout” and get a bunch of potential results back. You see five relevant Stack Overflow questions, so you choose the one that sounds closest to your problem.
Step 4: You see that the scheduler by default times out at 1 minute, so you need to load the SDK to change it.
Step 5: You go to GCP documentation to find the command you need.
It took about 30 minutes of work to find one line of code.
In this episode of the Stack Overflow Podcast, the interviewee argues that one of the key skills of a great developer is the ability to pick up the “information scent” and dig their way to the answer. There’s no easy way to learn this; it comes through practice. We liked this chat from last summer, and you might find some interesting
Listen to the podcast on Information Foraging Tactics
Read the transcript: Information Foraging Tactics