The Data Planet - February 2022

February 2, 2022
February 4, 2022

Whether it’s recent news or just new to you, The Data Planet serves up fascinating insights and resources about the data analytics and BI world every month.

Our snack-size summaries skip straight to the point.

This month’s Data Planet includes:

  • The Cloud Data Warehouse Squabble and Our Advice
  • What is Databricks Photon?
  • What Does Snowflake Snowpark Do?
  • It’s Here: Next-Gen Microsoft Power BI Premium Platform
  • Perform Batch and Streaming Using the BigQuery Storage Write API
  • Podcast: Information Foraging – The Tactics Great Developers Use to Find Solutions

The Cloud Data Warehouse Squabble and Our Advice

Databricks and Snowflake got into a brief and public spat in November when Databricks claimed it had the world’s fastest data warehouse.

Snowflake took offense, saying Databricks was cheating.

Databricks countered, claiming that Snowflake was actually the one cheating.

Does this really matter? Probably not, because benchmarks are often unrepresentative of real-world use. Even in an apples-to-apples comparison with real data under real usage scenarios, there are too many confounding variables in performance benchmarking to draw firm conclusions.

So what’s the bottom line? All the big players in the market, GCP BigQuery, AWS Redshift, Snowflake, and Azure Synapse, work on the same basic principles. They’re competitors, and none of them allows the others to run too far ahead of the pack in price or performance.

So when making a choice, let features, ecosystem, documentation, and ease of use dominate any discussions of which vendor is more suitable for your needs. And check out the next two articles to see what Databricks Photon and Snowflake Snowpark offer.

What is Databricks Photon?

Photon is a ground-up C++ rewrite of Spark’s SQL engine. Previously, Databricks/Spark was seen as a tool for data engineers and data scientists, and it was not comparable to a traditional data warehouse. Photon is Databricks’ attempt to break into that market.

The big difference between the Databricks and Snowflake architectures is how they store their data. Databricks stores its data in a data lake on S3 or Azure Storage, in open formats like Parquet and Delta Lake. Snowflake keeps its data locked in a proprietary format inside its own servers. There are pros and cons to both approaches, but that’s a topic for another day.

Read about Photon, The Next-Generation Query Engine for the Lakehouse

What Does Snowflake Snowpark Do?

At its core, Snowpark is a way to programmatically create and execute Snowflake SQL queries. It is Snowflake’s attempt to break into the data engineering and data science market.

Instead of writing this:

SELECT COUNT(*) FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA = 'PUBLIC'

You would write:

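// Assumes an existing Snowpark Session named `session`.
import com.snowflake.snowpark.functions.col

// Build the query; no SQL runs yet.
val dfTables = session.table("INFORMATION_SCHEMA.TABLES").filter(col("TABLE_SCHEMA") === "PUBLIC")

// count() executes the query in Snowflake and returns the row count.
val tableCount = dfTables.count()

val currentDb = session.getCurrentDatabase.getOrElse("<no current database>")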

On its own this might seem underwhelming, though it is useful for people who are versed in Python or Scala. Snowpark becomes more interesting when you start using user-defined functions (UDFs), which enable Snowflake to execute custom code. Right now these features are still very new, but they should become more robust in the next couple of years.
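One way to picture what "programmatically creating SQL" means is a tiny query builder. The sketch below is a toy, not Snowpark's API or implementation: the `Query` case class and its `filterEq` and `countSql` methods are hypothetical names invented for this illustration. It only shows how chained method calls can accumulate into a SQL string that is generated at the end.

```scala
// Toy illustration only: these names are hypothetical and are not part of Snowpark.
final case class Query(table: String, predicates: Vector[String] = Vector.empty) {

  // Record an equality filter; nothing is executed, the predicate is only stored.
  def filterEq(column: String, value: String): Query = {
    val escaped = value.replace("'", "''") // double single quotes for a SQL string literal
    copy(predicates = predicates :+ s"$column = '$escaped'")
  }

  // Render the WHERE clause, or an empty string when no filters were added.
  private def whereClause: String =
    if (predicates.isEmpty) "" else predicates.mkString(" WHERE ", " AND ", "")

  // Generate the SQL text for a row count over the filtered table.
  def countSql: String = s"SELECT COUNT(*) FROM $table$whereClause"
}

object QueryDemo {
  def main(args: Array[String]): Unit = {
    val tables = Query("INFORMATION_SCHEMA.TABLES").filterEq("TABLE_SCHEMA", "PUBLIC")
    println(tables.countSql)
  }
}
```

Running `QueryDemo` prints the same statement as the hand-written SQL above. A real library adds much more (typed columns, joins, execution against a session), but the basic idea of building a query from method calls is the same.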

Find out what Snowpark is about

It’s Here: Next-Gen Microsoft Power BI Premium Platform

Microsoft released Power BI Premium Generation 2 (Gen2), a new version of Power BI Premium that is mostly about more flexible compute. Microsoft promises analytics performance up to 16x faster than Gen1. It also lifted the limitations on how many reports can be refreshed at the same time and added autoscaling features. Find more details in the linked article.

Read “What is Power BI Premium Gen2?”

Perform Batch and Streaming Using the BigQuery Storage Write API

Most data warehouses have struggled with large numbers of small inserts. Generally, they prefer large batch writes. This can be a problem for businesses that need closer-to-real-time reporting.  

A couple of years ago, Snowflake introduced Snowpipe, which still technically loads data in batches, but the batches are much smaller and faster. BigQuery’s Storage Write API is even more granular, allowing for a true stream of data into tables.
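The trade-off between batch size and data freshness can be sketched with a toy model. The code below does not call Snowpipe or the BigQuery Storage Write API; the `Event` type and the `flushPlan` function are hypothetical names, and the arrival times are made-up values. It groups events (assumed to be sorted by arrival time) into batches of a given size and reports how long the oldest row in each batch waits, assuming a batch is written the moment its last event arrives.

```scala
// Toy model only: hypothetical names, no warehouse API is called.
final case class Event(id: Int, arrivalSec: Int)

object BatchVsStream {
  // For each batch of `batchSize` events, return (number of rows, seconds the
  // oldest row waited), assuming the batch is written when its last event arrives.
  def flushPlan(events: Seq[Event], batchSize: Int): List[(Int, Int)] =
    events.grouped(batchSize).map { batch =>
      (batch.size, batch.last.arrivalSec - batch.head.arrivalSec)
    }.toList

  def main(args: Array[String]): Unit = {
    // Six events arriving ten seconds apart (made-up data).
    val events = (0 until 6).map(i => Event(i, i * 10))

    println(flushPlan(events, batchSize = 3)) // two writes, oldest row waits 20 s
    println(flushPlan(events, batchSize = 1)) // six writes, no waiting
  }
}
```

Larger batches mean fewer writes but staler data; a batch size of one behaves like a stream, at the cost of many more write operations.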

Read more about the BigQuery Storage Write API

Podcast: Information Foraging – The Tactics Great Developers Use to Find Solutions

Outsiders think developers sit down at 8 a.m., start typing, take their hands off the keyboard at 5 p.m., then go home. We in the industry know that isn’t true. We spend much of our time in meetings, debugging, testing, and foraging for information.

Information foraging is a model of how we find information, based on the way animals forage for food. Here is a concrete example.

You have a job running on a schedule in GCP and see that it failed this morning.

Step 1: Check the job logs. Hmm, it ran successfully. Weird.

Step 2: Check the scheduler logs.  You see it had a timeout error.

Step 3: You Google “GCP scheduler timeout” and get a bunch of potential results back. You see five relevant Stack Overflow questions, so you choose the one that sounds closest to your problem.

Step 4:  You see that the scheduler by default times out at 1 minute, so you need to load the SDK to change it.

Step 5:  You go to GCP documentation to find the command you need.

It took about 30 minutes of work to find one line of code.  

In this episode of the Stack Overflow Podcast, the interviewee argues that one of the key skills of a great developer is the ability to pick up the “information scent” and dig their way to the answer. There’s no easy way to learn this, and it comes through practice. We liked this chat from last summer, and you might find it interesting too.

Listen to the podcast on Information Foraging Tactics

Read the transcript: Information Foraging Tactics

Knowledge is power, and you want that power at your fingertips.
Stay tuned for the next edition of The Data Planet.

Stay in Touch with Onebridge

Hey there! We hope you've noticed that none of our content is "gated," meaning we don't force you to provide your information in order to read our content. We work hard to provide valuable information to serve our audience and our clients, and we're proud of it.

If you'd like to be notified of new content, events, and resources from Onebridge, sign up for our newsletter here. After signing up, you'll get a profile link where you can tell us what topics you want to hear about. With Onebridge, you control your data.

Please follow us on social media to see upcoming events and other resources, like blogs, eBooks, and more!