Parquet data source in Polypheny (Master Project, Ongoing)
Author
Description
Parquet is a file format for data analytics. Unlike traditional database systems, which typically store data row by row, Parquet files use a columnar format. Parquet files have a mandatory schema, but unlike relational schemas, Parquet schemas can be nested (more like a document schema). The file format is designed so that it can be queried efficiently without reading the entire file. It also stores column statistics that can be used to further optimize queries.
One of the strengths of a PolyDBMS is unified access to different data sources. Given that, it is only natural that Polypheny should also be able to use Parquet files.
A Parquet data source poses several challenges. The columnar format with nesting calls for a source that exposes data both as relational tables and as document collections. For relational tables, a suitable relational schema must be derived from the schema of the Parquet file. Furthermore, the query planner should take advantage of the file format to optimize read queries. The integrated workflow engine would also benefit from supporting Parquet files for both reading and writing.
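To make the schema-derivation challenge concrete, the following is a minimal sketch of one possible mapping: nested groups are flattened into relational columns with dotted names. The schema model (plain dictionaries), the type names, and the function `flatten_schema` are assumptions made for illustration. They are not Polypheny or Parquet APIs, and the project may settle on a different mapping. Repeated fields (lists and maps) are deliberately left out.

```python
from typing import Any, Dict, List, Tuple

# Hypothetical, simplified schema model: each field maps either to a type
# name (a leaf) or to a nested dict of fields (a Parquet group).
Schema = Dict[str, Any]


def flatten_schema(schema: Schema, prefix: str = "") -> List[Tuple[str, str]]:
    """Return one (column_name, type_name) pair per leaf field."""
    columns: List[Tuple[str, str]] = []
    for name, field_type in schema.items():
        path = f"{prefix}{name}"
        if isinstance(field_type, dict):
            # Nested group: recurse and extend the dotted path.
            columns.extend(flatten_schema(field_type, prefix=path + "."))
        else:
            columns.append((path, field_type))
    return columns


# Illustrative schema, not taken from a real file.
example_schema = {
    "id": "INT64",
    "name": "STRING",
    "address": {"city": "STRING", "geo": {"lat": "DOUBLE", "lon": "DOUBLE"}},
}

relational_columns = flatten_schema(example_schema)
for column_name, type_name in relational_columns:
    print(column_name, type_name)
```

Flattening keeps every leaf value reachable from a relational query, but it loses the grouping that a document view would preserve, which is one reason to expose both models.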
Objectives
- Add Parquet files as a relational data source to Polypheny, appropriately mapping the schema.
- Push down filter and projection operations to efficiently query large Parquet files.
- Expose Parquet files as a document data source for easier access to nested data.
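The push-down objective can be sketched as follows, under simplifying assumptions. The class `RowGroup` and the function `scan` are hypothetical stand-ins: they model a file as a list of row groups with column-wise data, and they compute min/max statistics on the fly, whereas a real Parquet file stores such statistics in its metadata. The sketch shows the two ideas only: row groups whose statistics rule out a match are skipped, and only the projected columns are materialized.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class RowGroup:
    """Hypothetical stand-in for a Parquet row group holding column-wise data."""
    columns: Dict[str, List[int]]

    def stats(self, column: str) -> Tuple[int, int]:
        # Computed here for brevity; a real file stores min/max in its metadata.
        values = self.columns[column]
        return min(values), max(values)


def scan(row_groups: List[RowGroup], projection: List[str],
         filter_column: str, lower_bound: int):
    """Return projected rows with filter_column >= lower_bound,
    together with the number of row groups that had to be read."""
    rows, groups_read = [], 0
    for group in row_groups:
        _, max_value = group.stats(filter_column)
        if max_value < lower_bound:
            continue  # Filter push-down: no row in this group can match.
        groups_read += 1
        filter_values = group.columns[filter_column]
        matching = [i for i, value in enumerate(filter_values) if value >= lower_bound]
        for i in matching:
            # Projection push-down: only the requested columns are touched.
            rows.append(tuple(group.columns[name][i] for name in projection))
    return rows, groups_read


# Illustrative data, not taken from a real file.
row_groups = [
    RowGroup({"id": [1, 2, 3], "price": [5, 7, 9], "qty": [1, 1, 2]}),
    RowGroup({"id": [4, 5, 6], "price": [20, 35, 50], "qty": [3, 1, 4]}),
]
rows, groups_read = scan(row_groups, ["id", "price"], "price", 30)
print(rows, groups_read)
```

In this example the first row group is skipped entirely because its maximum `price` is below the bound, and the `qty` column is never accessed.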
Optional objectives
- Add Parquet input and output to the workflow engine.
- Read from remote Parquet files and support a unified view over a distributed set of remote Parquet files.
- Integrate and use statistics stored in Parquet files for query optimization.
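As one illustration of how stored statistics could inform query optimization, the sketch below estimates how many rows satisfy a range predicate from a column's minimum and maximum. The function `estimate_selectivity` is hypothetical, and the uniform-distribution assumption is a common simplification rather than something the project prescribes. A planner could use such an estimate to compare alternative plans.

```python
def estimate_selectivity(min_value: float, max_value: float, lower_bound: float) -> float:
    """Estimate the fraction of rows with value >= lower_bound,
    assuming values are uniformly distributed in [min_value, max_value]."""
    if lower_bound <= min_value:
        return 1.0  # Every row satisfies the predicate.
    if lower_bound > max_value:
        return 0.0  # No row can satisfy the predicate.
    # Here min_value < lower_bound <= max_value, so the divisor is non-zero.
    return (max_value - lower_bound) / (max_value - min_value)


# Illustrative numbers: a column ranging from 0 to 100 in a file of 1000 rows.
selectivity = estimate_selectivity(0, 100, 75)
estimated_rows = round(selectivity * 1000)
print(selectivity, estimated_rows)
```

The estimate is rough by design: skewed data or a bound equal to the maximum will be misjudged, so it should be treated as a planning hint only.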
Requirements
- Add a Parquet data source that supports both the relational and the document data model.
- Map the schema of Parquet files to an appropriate relational schema.
- Take advantage of the Parquet file design to optimize queries.
- The implementation has to be evaluated with regard to correctness and performance.
Start / End Dates
2026/03/11 - 2026/07/22