andygrove.io
Apache Arrow & DataFusion PMC Member. Original creator of Apache DataFusion.
37 posts 2,571 followers 77 following

One month on, and I have zero regrets about quitting Facebook & Instagram. I have replaced the scrolling time with listening to podcasts. I now stay in touch with family overseas via email and photo sharing, and I use Snapchat for sharing photos with immediate family, privately. Works great.

Chris Riccomini (@chris.blue) shares his thoughts on Open Source foundations: Apache, CNCF, Commonhaus. He also explains why Commonhaus is a better fit for SlateDB: cnr.sh/posts/compar...

Comet 0.6.0 has been released. This is a smaller release than usual now that we have moved to an approximately monthly release cadence to match core DataFusion. datafusion.apache.org/blog/2025/02...

Ballista 43.0.0 has been released, and now provides seamless integration with DataFusion. datafusion.apache.org/blog/2025/02...

Check out this excellent presentation from @robtandy.bsky.social on his work with the DataFusion Ray project from last week's DataFusion community meetup. It is a great overview of how to build a distributed system on top of DataFusion. www.youtube.com/watch?v=ceTo...

This Week in DataFusion Comet (Jan 26): github.com/apache/dataf...

I've finally decided to quit using Facebook. My feed is overwhelmed with nonsense content that I am not interested in and cannot seem to block. It is a real shame, though, because it was a good way to stay connected with family. Is there a viable alternative? What are others using instead?

This week in DataFusion Comet (Jan 18). Inspired by @andrewlamb1111.bsky.social's weekly updates in DataFusion core, I am going to start doing the same in Comet to help keep the community updated on current events. github.com/apache/dataf...

DataFusion Comet 0.5.0 has been released. See blog post for details. datafusion.apache.org/blog/2025/01...

2025 is shaping up to be a breakout year for fast query result transfer with Apache Arrow. But what exactly makes it so fast? David Li, Matt Topol, and I break it down in this new blog post: arrow.apache.org/blog/2025/01...

DataFusion Comet performance has been improving recently and now demonstrates a ~2x speedup compared to Spark for single node TPC-H. There is more to do, but this feels like a significant milestone. github.com/apache/dataf...

Buckle up because we're banging into the new year with my annual retrospective of the last year in databases! Highlights include license change blowback, Databricks vs. Snowflake gangwar, @duckdb.org's shotgun weddings, and buying a quarterback to impress your lover: www.cs.cmu.edu/~pavlo/blog/...

The latest in my set of career advice: How to maintain a healthy work-life balance by setting boundaries. I'm moving these over to blog posts on my website for easier sharing. #CareerPlusPlus #WorkAdvice

Yet another impressive DataFusion Python release. This time with interoperability improvements with PyCapsule and FFI. datafusion.apache.org/blog/2024/12...

1/5 Just made my first contribution to #Datafusion #Comet - a native physical execution engine for #Apache #Spark! 🚀 While Spark, with its row-oriented model and code generation approach, is quite good on average, there is almost always a faster specific solution.

"We want to remove DataFusion from everything" 😭 It's good to hear detailed feedback for a use case where DataFusion didn't work out—some important lessons. youtu.be/Sor3KZpmbHg?...

Apache DataFusion Comet 0.4.0 has been released! See the blog post for details. datafusion.apache.org/blog/2024/11...

Apache DataFusion's Ballista distributed query engine has quietly been getting a makeover over the past several months. I'm excited to see the project being maintained again! Performance is 🔥. See the updated README for more information. github.com/apache/dataf...

What a brilliant use of AI! news.virginmediao2.co.uk/o2-unveils-d...

If you’re wondering what all the fuss is about lately, this is what’s driving the adoption of Iceberg to implement the Open Data Lake. sympathetic.ink/2024/11/07/T... It’s all columnar data in blob storage anyways. You may as well be the one taking advantage of it.

So I have this theory that DataFusion, despite being a SQL engine, will actually enable a new breed of data systems to create non-SQL languages for working with data. Here's the idea...🧵

I recently set up a small k8s cluster in my basement. It consists of two gaming PCs connected with a 2.5g switch. For a reasonable cost (<$4k), it gives me 64 cores and 256GB RAM. It has turned out to be a game changer for iterating on local performance tuning & benchmarking.

This is a really interesting talk on building a domain-specific database (Bioinformatics) with DataFusion. www.youtube.com/watch?v=fltZ...

I see that the NVIDIA RAPIDS team has published their first Rust crate (that I know of). I'd love to see them do the same for the cuDF DataFrame library. crates.io/crates/cuvs

DataFusion was named one of the 10 coolest open-source software tools of 2024. www.crn.com/news/softwar...

If you're interested in learning more about accelerating Apache Spark with Apache DataFusion's Comet subproject, check out this talk I recently gave as part of CMU's Database Building Blocks Seminar Series. We'd love for more people to try out Comet and give us feedback! youtu.be/o59s0d3HE1k?...

I'm excited to see the data community growing on Bluesky! 👋 to all of my new followers over the past 24 hours.

I made an infra engineer starter pack. Folks posting about databases, stream processing, durable execution, orchestrators, service meshes, and more. go.bsky.app/SCZe42X

DataFusion has a new subproject! github.com/apache/dataf... This project was formerly known as Ray SQL, and allows DataFusion queries to be scaled out on Ray clusters.

DataFusion Comet 0.2.0 has been released. See the blog post for details. Thanks to everyone who contributed! datafusion.apache.org/blog/2024/08...

Announcing the DataFusion Comet project for accelerating Apache Spark with DataFusion. arrow.apache.org/blog/2024/03...

The third chapter of my blog post 10+ years of building open source standards is now available on the sympathetic.ink blog. Onwards, OpenLineage: sympathetic.ink/2024/02/20/C...

New post! Picking at some of my stream processing scar tissue. Why Samza failed, how it led to Kafka Streams and Kafka Connect, and why I'm skeptical of Apache Flink.

Boring Data Tool (bdt) has now moved to the datafusion-contrib GitHub org. I think this is a nice example of building CLI data tools with Apache Arrow and DataFusion. github.com/datafusion-c...

So how do I get to post to Bluesky and X without duplicating effort?

TIL about Apache DataFusion Comet. Apple has replaced Spark's guts with DataFusion. And they're donating it. 🤯 github.com/apache/arrow... This is an alternative to Meta's Velox Spark implementation. facebookincubator.github.io/velox/spark_...

Now that Bluesky is open to all, I figured I should find out if all the data folks are over here yet. Who should I be following here?