God this article is 10000% better than the posted one. This is great:
> Names should not describe what you currently think the thing you’re naming is for. Imagine naming your newborn child "Doctor", or "SupportsMeInMyOldAge". Poor kid.
The pitch for this sounds very similar to the pitch for Vortex (i.e. obviating the need to create a new format every time a shift occurs in data processing and computing by providing a data organization structure and a general-purpose API to allow developers to add new encoding schemes easily).
But I'm not totally clear what the relationship between F3 and Vortex is. It says their prototype uses the encoding implementation in Vortex, but does not use the Vortex type system?
The backstory is complicated. The plan was to establish a consortium between CMU, Tsinghua, Meta, CWI, VoltronData, Nvidia, and SpiralDB to unify behind a single file format. But that fell through after CMU's lawyers freaked out over the NDA Meta required for access to a preview of Velox Nimble. IANAL, but Meta's NDA seemed reasonable to me. The effort ended after about a year, and then everyone released their own format:
On the research side, we (CMU + Tsinghua) weren't interested in developing new encoders and instead wanted to focus on the WASM embedding part. The original idea came as a suggestion from Hannes@DuckDB to Wes McKinney (a co-author with us). We just used Vortex's implementations since they were in Rust and with some tweaks we could get most of them to compile to WASM. Vortex is orthogonal to the F3 project and has the engineering energy necessary to support it. F3 is an academic prototype right now.
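To make the "WASM embedding" part concrete, here is a rough sketch of what the reader side looks like: it only needs a WASM runtime, not the original encoder library. This is not the actual F3 API; the module and function names are invented, and I'm using the wasmtime Python bindings purely for illustration.

    from wasmtime import Engine, Store, Module, Instance

    # Sketch only: the real prototype manages linear memory for the encoded and
    # decoded buffers. The point is just that the reader ships a WASM runtime
    # instead of linking against every encoder library ever written.
    engine = Engine()
    store = Store(engine)
    module = Module.from_file(engine, "alp_decoder.wasm")  # e.g. a Rust encoder compiled to wasm32
    instance = Instance(store, module, [])

    decode = instance.exports(store)["decode"]
    n_values = decode(store, 0, 4096)  # (offset, length) into the module's memory, illustratively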
I note that the Germans also released their own file format this year that also uses WASM. But they WASM-ify the entire file rather than individual column groups:
Andrew, it’s always great to read the background from the author on how (and even why!) this all played out. This comment is incredibly helpful for understanding the context of why all these multiple formats were born.
If I could ask you to speculate for a second, how do you think we will go from here to a clear successor to Parquet?
Will one of the new formats absorb the others' features? Will there be a format war a la iceberg vs delta lake vs hudi? Will there be a new consortium now that everyone's formats are out in the wild?
... Are you saying that there are 5 competing "universal" file format projects, each with a different, incompatible approach? Is this a laughing/crying thing, or a "lots of interesting paths to explore" thing?
Also, back on topic - is your file format encryptable via that WASM embedding?
I would love to bring these benefits to the multidimensional array world, via integration with the Zarr/Icechunk formats somehow (which I work on). But this fragmentation of formats makes it very hard to know where to start.
Presumably because everyone in MCF has been waiting for ITER for decades, and JET is being decommissioned after a last gasp. Every other tokamak is considerably smaller (or of similar size, like DIII-D or JT-60SA).
Many of the interesting tokamak engineering ideas were tried on small (and therefore low-power) machines, or remained concepts built around high-temperature superconducting magnets.
The really depressing part is that if you plot the rate of new delays against real time elapsed, the projected finishing date recedes even further.
This is why much of the fusion research community feels disillusioned with ITER and is instead more interested in these smaller (and supposedly more "agile") machines built around high-temperature superconductors.
I wrote the article I wish I could have read back when I first heard of Zarr and cloud-native science back in 2018.
This explains how object storage and conventional filesystems are different, and the key properties that make Zarr work so well in cloud object storage.
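For a flavour of what that looks like in practice (bucket and array names below are made up), a Zarr array on S3 is just a key prefix holding a metadata object plus one independent object per chunk, so every read is a plain whole-object GET:

    import s3fs

    fs = s3fs.S3FileSystem(anon=True)

    # A Zarr array is a prefix of independent objects: metadata plus one object per chunk.
    fs.ls("some-bucket/store.zarr/temperature/")
    # -> ['.../.zarray', '.../0.0', '.../0.1', ...]

    # Reads are whole-object GETs addressed by key: no seeks, renames, or partial
    # updates, which is exactly the access pattern object stores are built for.
    chunk_bytes = fs.cat("some-bucket/store.zarr/temperature/0.0")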
Yes, that assumption is called the Ergodic Hypothesis, and generally justified in undergraduate statistical mechanics courses by proving and appealing to Liouville's theorem.
It's worth noting that there's more than just ergodicity at play, although that's a fundamental requirement. For instance, applying the Pauli Exclusion Principle gives rise to Fermi-Dirac statistics.
Isn't that more about enumerating the microstates? The Pauli exclusion principle just ends up forbidding some of the microstates (forbidding a significant fraction of them if you're in the low-temperature regime).
It is about enumerating the microstates, but in a way that takes into account how the particles interact with each other (aka making assumptions about the dynamics).
If we didn't take into account any interactions, we'd be unable to do anything with statistical mechanics beyond rederiving the ideal gas law.
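For the record, the standard counting with at most one fermion per single-particle state (in the grand canonical ensemble) gives the Fermi-Dirac occupation, which collapses back to the classical Boltzmann factor once occupancies are low:

    \langle n_i \rangle = \frac{1}{e^{(\varepsilon_i - \mu)/k_B T} + 1}
      \;\approx\; e^{-(\varepsilon_i - \mu)/k_B T}
      \quad \text{when } e^{(\varepsilon_i - \mu)/k_B T} \gg 1 .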
The scientific community works primarily with array (or "tensor") data, using tools like numpy, xarray, and zarr. People familiar with modern relational database tools such as DuckDB and Parquet often ask why we can't just use those. This article explains why: it's massively inefficient to use tabular tools on array data, and it demonstrates this with a benchmark showing a 10x difference in query speed.
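A toy version of the mismatch (array and column names are invented for illustration; this is not the article's actual benchmark):

    import numpy as np
    import pandas as pd

    # A small 3-D field: (time, lat, lon)
    temps = np.random.rand(100, 90, 180).astype("float32")

    # Array world: one point's full time series is a strided slice that touches
    # only the bytes it needs.
    series = temps[:, 45, 90]

    # Tabular world: the same cube flattened to "long" form is 100*90*180 rows,
    # and the equivalent query has to filter (or scan) all of them.
    df = pd.DataFrame({
        "time": np.repeat(np.arange(100), 90 * 180),
        "lat":  np.tile(np.repeat(np.arange(90), 180), 100),
        "lon":  np.tile(np.arange(180), 100 * 90),
        "temp": temps.ravel(),
    })
    series_tabular = df.loc[(df.lat == 45) & (df.lon == 90), "temp"].to_numpy()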
This entire stack also now exists for arrays as well as for tabular data. It's still S3 for storage, but Zarr instead of Parquet, Icechunk instead of Iceberg, and Xarray for queries in Python.
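i.e. the array-native version of "point the engine at object storage and query" looks something like this (the bucket, variable name, and options are illustrative, not a real public dataset):

    import xarray as xr

    ds = xr.open_zarr(
        "s3://some-bucket/era5-subset.zarr",
        consolidated=True,
        storage_options={"anon": True},  # passed through to fsspec/s3fs
    )

    # Lazy, chunk-aligned selection: only the chunks covering this slice get fetched.
    monthly = ds["t2m"].sel(time=slice("2020-01", "2020-12")).resample(time="1MS").mean()
    monthly.compute()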
Surely Zarr is already a long-term storage format for multidimensional data? It can even be mapped directly to netCDF, GRIB and GeoTIFF via VirtualiZarr[0].
Also if you like Iceberg and you like arrays you will really like Icechunk[1], which is Version-controlled Zarr!
I know Icechunk and I'm a huge fan of Earthmover. But a common binary format like Parquet seems nice… with interop for e.g. DuckDB and geo queries, you could "just load" ERA5 and do something like: get wind direction/speed along the following path for the last 5 years, grouped by day, etc…
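Something like this is what I mean (everything here is hypothetical: the bucket, the Parquet layout, and the column names — ERA5 isn't actually published this way):

    import duckdb

    con = duckdb.connect()
    con.sql("""
        SELECT date_trunc('day', time)  AS day,
               avg(wind_speed)          AS mean_speed,
               avg(wind_direction)      AS mean_direction
        FROM read_parquet('s3://some-bucket/era5/*.parquet')
        WHERE time >= TIMESTAMP '2019-01-01'
          AND lat BETWEEN 40 AND 45
          AND lon BETWEEN -80 AND -70
        GROUP BY day
        ORDER BY day
    """).show()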
If you know the exact tensor shape of your data ahead of time, Zarr works well (we use it as the data format for our ML experiments). If you have dynamically growing data or irregular shapes, Zarr doesn't work as well.
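Concretely, growth has to be driven by hand along a regularly chunked axis (a sketch assuming zarr-python's resize API; store path and shapes are made up), and truly ragged or irregular shapes don't fit the model at all:

    import numpy as np
    import zarr

    # Start with a zero-length leading axis and grow it explicitly per batch.
    z = zarr.open("experiment.zarr", mode="a",
                  shape=(0, 128), chunks=(1024, 128), dtype="f4")

    batch = np.random.rand(256, 128).astype("f4")
    z.resize((z.shape[0] + batch.shape[0], 128))  # extend the first axis
    z[-batch.shape[0]:, :] = batch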
> The future of Python's main open source data science ecosystem, numfocus, does not seem bright. Despite performance improvements, Python will always be a glue language.
Your first sentence is a scorching hot take, but I don't see how it's justified by your second sentence.
The community always understood that Python is a glue language, which is why the bottleneck interfaces (with IO or between array types) are implemented in lower-level languages or ABIs. The former was originally C but is now often Rust, and Apache Arrow is a great example of the latter.
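The Arrow case in one small sketch (nothing project-specific here, just the pattern): the heavy buffers stay in native memory and Python only shuffles handles around.

    import numpy as np
    import pyarrow as pa

    np_col = np.arange(1_000_000, dtype="float64")

    # For plain numeric dtypes this wraps the existing buffer rather than copying it,
    # so the "glue" stays cheap.
    arrow_col = pa.array(np_col)
    tbl = pa.table({"x": arrow_col})

    # Any Arrow-speaking engine (DuckDB, Polars, DataFusion, ...) can consume `tbl`
    # via the Arrow C data interface without round-tripping through Python objects.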
The strength of using Python is that when you want to do anything beyond pure computation (e.g. networking), the rest of the world has already built a package for it.
So without the two-language problem, I think all of these low-level optimization efforts across dataframes, tensors, and distributed computing would be part of a unified ecosystem built on shared compatibility.
For example, the reason NumFOCUS is so great is that everything was designed to work with NumPy as its underlying data structure.