
Parquet Content-Defined Chunking

Parquet Content-Defined Chunking (CDC) is now available in PyArrow and Pandas, enabling deduplication of Parquet files on Hugging Face’s Xet storage layer. CDC reduces data transfer and storage costs by uploading or downloading only changed data chunks. The article shows how to enable CDC with use_content_defined_chunking and outlines several deduplication scenarios with code examples.
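
To make the idea of "only changed data chunks" concrete, here is a minimal, standard-library sketch of content-defined chunking with deduplication. It is an illustration only, not the PyArrow or Xet implementation: the rolling hash, the helper name cdc_chunks, and the mask and size limits are all assumptions chosen for readability.

```python
import hashlib
import random


def cdc_chunks(data: bytes, mask: int = 0x3F,
               min_size: int = 16, max_size: int = 256) -> list[bytes]:
    """Split data where a toy rolling hash hits a boundary pattern.

    Hypothetical helper for illustration. Boundaries depend on the bytes
    themselves, not on absolute offsets, so an insertion only disturbs
    the chunks near the edit.
    """
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        # Low bits of h depend only on the last few bytes seen.
        h = ((h << 1) + byte) & 0xFFFFFFFF
        size = i - start + 1
        if (size >= min_size and (h & mask) == 0) or size >= max_size:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])
    return chunks


def digest(chunk: bytes) -> str:
    return hashlib.sha256(chunk).hexdigest()


# Version 1 of some file, and version 2 with a few bytes inserted mid-file.
rng = random.Random(0)
original = bytes(rng.getrandbits(8) for _ in range(20_000))
edited = original[:10_000] + b"a few inserted bytes" + original[10_000:]

# The "store" already holds every chunk of version 1.
stored = {digest(c) for c in cdc_chunks(original)}

# Only chunks whose digest is not in the store would need to be uploaded.
edited_chunks = cdc_chunks(edited)
new_chunks = [c for c in edited_chunks if digest(c) not in stored]

print(f"{len(new_chunks)} of {len(edited_chunks)} chunks are new")
```

With fixed-size chunks, the same insertion would shift every later boundary and make every later chunk look new. Content-defined boundaries realign shortly after the edit, so most chunks are already in the store.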

Published 25 Jul 2025 · ★★★★
Read the source: huggingface.co/blog/parquet-cdc
Source: HuggingFace Blog
Ingested: 25 Jul 2025 · 19:10
Editorial score: 4.0 / 5