# parquets

Fully asynchronous TypeScript implementation of the Parquet file format

[build status](https://travis-ci.org/kbajalc/parquets) [npm package](https://badge.fury.io/js/parquets) [dependencies](https://david-dm.org/kbajalc/parquets.svg) [MIT license](https://opensource.org/licenses/MIT)

This package is derived from [parquet.js](https://github.com/ironSource/parquetjs) and contains a fully asynchronous TypeScript implementation of the [Parquet](https://parquet.apache.org/) file format. The implementation conforms with the [Parquet specification](https://github.com/apache/parquet-format) and is being tested for compatibility with Apache's [reference implementation](https://github.com/apache/parquet-mr).

**WARNING**: *There are compatibility issues with the reference implementation*:

- only GZIP and SNAPPY compressions are compatible
- [Parquet Tools](https://github.com/apache/parquet-mr/tree/master/parquet-tools) are command line tools that aid in the inspection of Parquet files.
- always verify that your table structure, loaded with a realistic data sample, can be read by Parquet Tools!

**What is Parquet?**: Parquet is a column-oriented file format; it allows you to write a large amount of structured data to a file, compress it and then read parts of it back out efficiently. The Parquet format is based on [Google's Dremel paper](http://www.vldb.org/pvldb/vldb2010/papers/R29.pdf).

Installation
------------

To use parquets with node.js, install it using npm:

```
$ npm install parquets
```

_parquets requires node.js >= 7.6.0_

Usage: Writing files
--------------------

Once you have installed the parquets library, you can import it as a single module:

```ts
import { ParquetSchema, ParquetWriter, ParquetReader } from 'parquets';
```

Parquet files have a strict schema, similar to tables in a SQL database. So, in order to produce a Parquet file we first need to declare a new schema.
Here is a simple example that shows how to instantiate a `ParquetSchema` object:

```ts
// declare a schema for the `fruits` table
let schema = new ParquetSchema({
  name: { type: 'UTF8' },
  quantity: { type: 'INT64' },
  price: { type: 'DOUBLE' },
  date: { type: 'TIMESTAMP_MILLIS' },
  in_stock: { type: 'BOOLEAN' }
});
```

Note that the Parquet schema supports nesting, so you can store complex, arbitrarily nested records into a single row (more on that later) while still maintaining good compression.

Once we have a schema, we can create a `ParquetWriter` object. The writer will take input rows as JSON objects, convert them to the Parquet format and store them on disk.

```ts
// create new ParquetWriter that writes to 'fruits.parquet'
let writer = await ParquetWriter.openFile(schema, 'fruits.parquet');

// append a few rows to the file
await writer.appendRow({name: 'apples', quantity: 10, price: 2.5, date: new Date(), in_stock: true});
await writer.appendRow({name: 'oranges', quantity: 10, price: 2.5, date: new Date(), in_stock: true});
```

Once we are finished adding rows to the file, we have to tell the writer object to flush the metadata to disk and close the file by calling the `close()` method, i.e. `await writer.close()`.

Usage: Reading files
--------------------

A parquet reader allows retrieving the rows from a parquet file in order. The basic usage is to create a reader and then retrieve a cursor/iterator which allows you to consume row after row until all rows have been read.

You may open more than one cursor and use them concurrently. All cursors become invalid once `close()` is called on the reader object.

```ts
// create new ParquetReader that reads from 'fruits.parquet'
let reader = await ParquetReader.openFile('fruits.parquet');

// create a new cursor
let cursor = reader.getCursor();

// read all records from the file and print them
let record = null;
while (record = await cursor.next()) {
  console.log(record);
}
```

When creating a cursor, you can optionally request that only a subset of the columns should be read from disk. For example:

```ts
// create a new cursor that will only return the `name` and `price` columns
let cursor = reader.getCursor(['name', 'price']);
```

It is important that you call `close()` after you are finished reading the file to avoid leaking file descriptors.

```ts
await reader.close();
```

Encodings
---------

Internally, the Parquet format will store values from each field as consecutive arrays which can be compressed/encoded using a number of schemes.

#### Plain Encoding (PLAIN)

The simplest encoding scheme is the PLAIN encoding. It stores the values as they are without any compression. The PLAIN encoding is currently the default for all types except `BOOLEAN`:

```ts
let schema = new ParquetSchema({
  name: { type: 'UTF8', encoding: 'PLAIN' },
});
```

#### Run Length Encoding (RLE)

The Parquet hybrid run length and bitpacking encoding compresses runs of numbers very efficiently. Note that the RLE encoding can only be used with fields whose primitive type is `BOOLEAN`, `INT32` or `INT64` (for example `UINT_32`, which is stored as `INT32`). The RLE encoding requires an additional `bitWidth` parameter that contains the maximum number of bits required to store the largest value of the field.

```ts
let schema = new ParquetSchema({
  age: { type: 'UINT_32', encoding: 'RLE', bitWidth: 7 },
});
```

Optional Fields
---------------

By default, all fields are required to be present in each row.
You can also mark a field as 'optional' which will let you store rows with that field missing:

```ts
let schema = new ParquetSchema({
  name: { type: 'UTF8' },
  quantity: { type: 'INT64', optional: true },
});

let writer = await ParquetWriter.openFile(schema, 'fruits.parquet');
await writer.appendRow({name: 'apples', quantity: 10 });
await writer.appendRow({name: 'banana' }); // not in stock
```

Nested Rows & Arrays
--------------------

Parquet supports nested schemas that allow you to store rows that have a more complex structure than a simple tuple of scalar values. To declare a schema with a nested field, omit the `type` in the column definition and add a `fields` list instead.

Consider this example, which allows us to store a more advanced "fruits" table where each row contains a name, a list of colours and a list of "stock" objects.

```ts
// advanced fruits table
let schema = new ParquetSchema({
  name: { type: 'UTF8' },
  colours: { type: 'UTF8', repeated: true },
  stock: {
    repeated: true,
    fields: {
      price: { type: 'DOUBLE' },
      quantity: { type: 'INT64' },
    }
  }
});

// the above schema allows us to store the following rows:
let writer = await ParquetWriter.openFile(schema, 'fruits.parquet');

await writer.appendRow({
  name: 'banana',
  colours: ['yellow'],
  stock: [
    { price: 2.45, quantity: 16 },
    { price: 2.60, quantity: 420 }
  ]
});

await writer.appendRow({
  name: 'apple',
  colours: ['red', 'green'],
  stock: [
    { price: 1.20, quantity: 42 },
    { price: 1.30, quantity: 230 }
  ]
});

await writer.close();

// reading nested rows with a list of explicit columns
let reader = await ParquetReader.openFile('fruits.parquet');

let cursor = reader.getCursor([['name'], ['stock', 'price']]);
let record = null;
while (record = await cursor.next()) {
  console.log(record);
}

await reader.close();
```

It might not be obvious why one would want to implement or use such a feature when the same can - in principle - be achieved by serializing the record using JSON (or a similar scheme) and then storing it
into a UTF8 field:

Putting aside the philosophical discussion on the merits of strict typing, knowing about the structure and subtypes of all records (globally) means we do not have to duplicate this metadata (i.e. the field names) for every record. On top of that, knowing about the type of a field allows us to compress the remaining data more efficiently.

List of Supported Types & Encodings
-----------------------------------

We aim to be feature-complete and add new features as they are added to the Parquet specification; this is the list of currently implemented data types and encodings:

| Logical Type | Primitive Type | Encodings |
|---|---|---|
| UTF8 | BYTE_ARRAY | PLAIN |
| JSON | BYTE_ARRAY | PLAIN |
| BSON | BYTE_ARRAY | PLAIN |
| BYTE_ARRAY | BYTE_ARRAY | PLAIN |
| TIME_MILLIS | INT32 | PLAIN, RLE |
| TIME_MICROS | INT64 | PLAIN, RLE |
| TIMESTAMP_MILLIS | INT64 | PLAIN, RLE |
| TIMESTAMP_MICROS | INT64 | PLAIN, RLE |
| BOOLEAN | BOOLEAN | PLAIN, RLE |
| FLOAT | FLOAT | PLAIN |
| DOUBLE | DOUBLE | PLAIN |
| INT32 | INT32 | PLAIN, RLE |
| INT64 | INT64 | PLAIN, RLE |
| INT96 | INT96 | PLAIN |
| INT_8 | INT32 | PLAIN, RLE |
| INT_16 | INT32 | PLAIN, RLE |
| INT_32 | INT32 | PLAIN, RLE |
| INT_64 | INT64 | PLAIN, RLE |
| UINT_8 | INT32 | PLAIN, RLE |
| UINT_16 | INT32 | PLAIN, RLE |
| UINT_32 | INT32 | PLAIN, RLE |
| UINT_64 | INT64 | PLAIN, RLE |
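
For the types in this table that support RLE, the encoding needs a `bitWidth`, described earlier as the number of bits required to store the largest value of the field. The following is a minimal sketch of how that number could be derived from a known maximum value. Note that `bitWidthFor` is a hypothetical helper written for illustration and is not part of the parquets API; clamping the result to at least 1 bit is also an assumption of this sketch.

```typescript
// Hypothetical helper (not part of parquets): returns the number of bits
// needed to represent every integer in the range 0..maxValue.
function bitWidthFor(maxValue: number): number {
  if (!Number.isSafeInteger(maxValue) || maxValue < 0) {
    throw new RangeError('maxValue must be a non-negative safe integer');
  }
  let bits = 0;
  let remaining = maxValue;
  // Count how many times the value can be halved before it reaches zero.
  while (remaining > 0) {
    remaining = Math.floor(remaining / 2);
    bits++;
  }
  // Assumption of this sketch: always reserve at least one bit.
  return Math.max(bits, 1);
}

// A plain field definition in the same shape as the RLE example above.
// An `age` column with values up to 127 fits into 7 bits.
const ageField = { type: 'UINT_32', encoding: 'RLE', bitWidth: bitWidthFor(127) };

console.log(ageField.bitWidth); // 7
```

A definition like `ageField` could then be used as the `age` entry of the object passed to `new ParquetSchema({ ... })`. If the largest value is not known up front, choosing the width of the underlying primitive type is the conservative option.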