  • Columnar storage format makes it easier to run analytics on individual columns (features)
  • Works well with Hadoop, Spark, and Hive

For example, say you have an S3 bucket with Parquet files. You could use AWS Glue to catalog those files and then use Amazon Athena to run queries on them, since Athena lets you analyze data directly in S3 using SQL. A rough sketch of the query step is shown below.
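
This is only a sketch with boto3; the database, table, and bucket names are placeholders you would replace with your own, and it assumes the Glue catalog entry already exists:

import boto3
 
# Assumes a Glue database "my_db" with a table "events" already cataloged
# over Parquet files in S3 (all names here are placeholders)
athena = boto3.client("athena", region_name="us-east-1")
 
response = athena.start_query_execution(
    QueryString="SELECT city, COUNT(*) AS n FROM events GROUP BY city",
    QueryExecutionContext={"Database": "my_db"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-query-results/"},
)
print(response["QueryExecutionId"])  # use this id to poll for the results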

Compression and encoding

  • Pages can be compressed with codecs such as Snappy and Gzip (see the sketch after this list)
  • Values are encoded with Run-Length Encoding (RLE) and Dictionary encoding
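
For example, pandas exposes the codec through the compression argument of to_parquet (the file names here are just placeholders):

import pandas as pd
 
df = pd.DataFrame({"city": ["NY", "NY", "LA"], "sales": [1, 2, 3]})
 
# Snappy (the default) favors speed; Gzip trades speed for a smaller file
df.to_parquet("sales_snappy.parquet", compression="snappy")
df.to_parquet("sales_gzip.parquet", compression="gzip")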

RLE

Example: “AAAAAA” becomes “A, 6”
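
A toy sketch of the idea in Python (just to illustrate run-length encoding, not how Parquet implements it internally):

from itertools import groupby
 
def rle(values):
    # Collapse each run of repeated values into a (value, run_length) pair
    return [(value, len(list(group))) for value, group in groupby(values)]
 
print(rle("AAAAAABBC"))  # [('A', 6), ('B', 2), ('C', 1)]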

Dictionary encoding

Example: it builds a dictionary of the distinct (categorical) values in a column, then replaces each value with its integer index into that dictionary
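
PyArrow can show the same idea on a column of values; dictionary_encode splits the column into a dictionary of distinct values plus integer indices into it:

import pyarrow as pa
 
cities = pa.array(["New York", "Los Angeles", "New York", "Chicago"])
encoded = cities.dictionary_encode()
 
print(encoded.dictionary)  # distinct values
print(encoded.indices)     # integer codes pointing into the dictionary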

Schema Evolution?

When you add a new column, only the newly written files contain it; existing files (and their pages) remain unchanged. Then, when reading across files, if a column doesn’t exist in an older file, Parquet engines (e.g. Spark, Hive, BigQuery) return NULL for it instead of breaking the query (cool!). See the sketch below.
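
A rough PySpark sketch of this behavior (assumes a local Spark installation; the paths and column names are placeholders):

from pyspark.sql import SparkSession
 
spark = SparkSession.builder.appName("schema-evolution").getOrCreate()
 
# Older files have only (name, age); newer files add a city column
spark.createDataFrame([("Alice", 25)], ["name", "age"]) \
    .write.mode("overwrite").parquet("people/batch=1")
spark.createDataFrame([("Bob", 30, "Chicago")], ["name", "age", "city"]) \
    .write.mode("overwrite").parquet("people/batch=2")
 
# mergeSchema unions the file schemas; rows from the older files get NULL for city
df = spark.read.option("mergeSchema", "true").parquet("people")
df.show()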

Read and write Parquet files

With Pandas

import pandas as pd
 
# Sample DataFrame
data = {
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 35],
    "City": ["New York", "Los Angeles", "Chicago"]
}
df = pd.DataFrame(data)
 
# Write to Parquet file
df.to_parquet("data.parquet", engine="pyarrow", index=False)
 
# Read a Parquet file
df = pd.read_parquet("data.parquet")
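
Because Parquet is columnar, readers can also fetch just the columns they need (the point of the first bullet above), for example:

# Read only two columns from the same file
df_subset = pd.read_parquet("data.parquet", columns=["Name", "Age"])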

Check the amount of memory allocated by PyArrow

import pyarrow as pa
 
# Default memory pool used by PyArrow allocations
pool = pa.default_memory_pool()
print(pool.bytes_allocated())  # bytes currently allocated
print(pool.max_memory())       # peak allocation observed so far

Check out the DataCamp link in the references for other useful operations

From the DataCamp link:

When to Use Apache Parquet

In summary, here are a few situations where Parquet is the best choice:

  • Analytics-heavy workloads: Parquet’s columnar format allows us to fetch only the data we need, which speeds up queries and saves time. I’ve seen this hands-on when processing datasets with Apache Spark — queries that once took minutes ran in seconds because of Parquet’s efficient structure.
  • Data lake architectures: When you build a data lake, storage costs quickly add up. But Parquet’s compression capabilities reduce the size of stored data to help you save on storage without sacrificing performance.
  • Use cases involving large datasets: Parquet handles large and complex datasets neatly, especially those with nested or hierarchical structures. Its support for rich data types and schema evolution ensures you can adapt your data as requirements change.

References

https://www.datacamp.com/tutorial/apache-parquet
https://parquet.apache.org/docs/file-format/data-pages/
https://arrow.apache.org/docs/python/index.html
https://arrow.apache.org/docs/python/memory.html#memory-pools