- Column storage format → easier to run analytics on columns (features)
- Works well with Hadoop, Spark and Hive
For example, say you have an S3 bucket full of Parquet files: you could use AWS Glue to catalog them and then use Amazon Athena to run queries on them. Athena lets you analyze data directly in S3 using SQL.
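Rough sketch of kicking off that kind of query from Python with boto3 (the Glue database my_db, the table taxi_trips, the column names, and the results bucket are all made-up placeholders):
import boto3

# Assumes AWS credentials are configured and a Glue crawler has already
# cataloged the Parquet files as table "taxi_trips" in database "my_db"
athena = boto3.client("athena", region_name="us-east-1")

response = athena.start_query_execution(
    QueryString="SELECT passenger_count, AVG(fare) FROM taxi_trips GROUP BY passenger_count",
    QueryExecutionContext={"Database": "my_db"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
print(response["QueryExecutionId"])  # poll get_query_execution() with this id for status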
Compression and encoding
- Compresses pages with Snappy or Gzip (example below)
- Encodes values compactly with Run-Length Encoding (RLE) and Dictionary encoding
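Quick pandas sketch of picking a codec when writing (file and column names are made up; pandas defaults to Snappy):
import pandas as pd

df = pd.DataFrame({"city": ["New York", "Los Angeles"] * 1000})
df.to_parquet("cities_snappy.parquet", compression="snappy")  # the default codec
df.to_parquet("cities_gzip.parquet", compression="gzip")      # usually smaller, slower to write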
RLE
Example: “AAAAAA” becomes “A, 6”
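Toy Python version of the idea (not Parquet's actual on-disk encoding, just the concept):
from itertools import groupby

def run_length_encode(values):
    # Collapse each run of identical values into a (value, run_length) pair
    return [(v, len(list(group))) for v, group in groupby(values)]

print(run_length_encode("AAAAAA"))  # [('A', 6)]
print(run_length_encode("AAABBC"))  # [('A', 3), ('B', 2), ('C', 1)]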
Dictionary encoding
Example: It builds a dictionary of the distinct (categorical) values in a column, then stores each value as an integer index into that dictionary
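Toy sketch of the same idea (Parquet does this per column chunk; this is just the concept):
def dictionary_encode(values):
    # Map each distinct value to a small integer, then store only the integers
    dictionary = {}
    codes = []
    for v in values:
        if v not in dictionary:
            dictionary[v] = len(dictionary)
        codes.append(dictionary[v])
    return dictionary, codes

cities = ["NYC", "LA", "NYC", "NYC", "LA", "Chicago"]
print(dictionary_encode(cities))
# ({'NYC': 0, 'LA': 1, 'Chicago': 2}, [0, 1, 0, 0, 1, 2])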
Schema Evolution?
When a new column is added, it only appears in the files written after the change; existing files are left untouched. Then, when reading across old and new files, if a column doesn’t exist in a file, Parquet engines (eg: Spark, Hive, BigQuery) return NULL for it instead of breaking the query (cool!)
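Minimal pyarrow sketch of that behavior (file names and the country column are made up; this assumes pyarrow's dataset API, which fills columns missing from a file with nulls when you read with the newer schema):
import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq

# Old file written before the "country" column existed
pq.write_table(pa.table({"name": ["Alice"], "age": [25]}), "old.parquet")
# New file written after the schema evolved
pq.write_table(pa.table({"name": ["Bob"], "age": [30], "country": ["US"]}), "new.parquet")

# Read both files with the evolved schema; rows from old.parquet get null for "country"
evolved = pa.schema([("name", pa.string()), ("age", pa.int64()), ("country", pa.string())])
table = ds.dataset(["old.parquet", "new.parquet"], schema=evolved).to_table()
print(table.to_pandas())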
Read and write Parquet files
With Pandas
import pandas as pd
# Sample DataFrame
data = {
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 35],
    "City": ["New York", "Los Angeles", "Chicago"],
}
df = pd.DataFrame(data)
# Write to Parquet file
df.to_parquet("data.parquet", engine="pyarrow", index=False)
# Read a Parquet file
df = pd.read_parquet("data.parquet")
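Since the format is columnar, you can also ask for just the columns you need. This continues the example above and uses the columns argument of pd.read_parquet:
# Read only the "Age" column from the file written above; the other columns are skipped
ages = pd.read_parquet("data.parquet", columns=["Age"])
print(ages)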
Check the amount of memory allocated
import pyarrow as pa

# PyArrow tracks its allocations through a memory pool
pool = pa.default_memory_pool()
pool.bytes_allocated()  # bytes currently allocated by PyArrow
pool.max_memory()       # peak allocation so far
Check out the Datacamp link for other useful operations
From the Datacamp link:
When to Use Apache Parquet
In summary, here are a few situations where Parquet is the best choice:
- Analytics-heavy workloads: Parquet’s columnar format allows us to fetch only the data we need, which speeds up queries and saves time. I’ve seen this hands-on when processing datasets with Apache Spark — queries that once took minutes ran in seconds because of Parquet’s efficient structure.
- Data lake architectures: When you build a data lake, storage costs quickly add up. But Parquet’s compression capabilities reduce the size of stored data to help you save on storage without sacrificing performance.
- Use cases involving large datasets: Parquet handles large and complex datasets neatly, especially those with nested or hierarchical structures. Its support for rich data types and schema evolution ensures you can adapt your data as requirements change.
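Rough way to see the storage point on synthetic data (numbers will vary; this just writes the same DataFrame both ways and compares file sizes):
import os
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "city": np.random.choice(["New York", "Los Angeles", "Chicago"], size=100_000),
    "fare": np.random.rand(100_000),
})
df.to_csv("sample.csv", index=False)
df.to_parquet("sample.parquet", index=False)  # Snappy compression by default
print("csv bytes:    ", os.path.getsize("sample.csv"))
print("parquet bytes:", os.path.getsize("sample.parquet"))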
References
https://www.datacamp.com/tutorial/apache-parquet
https://parquet.apache.org/docs/file-format/data-pages/
https://arrow.apache.org/docs/python/index.html
https://arrow.apache.org/docs/python/memory.html#memory-pools