Spark works with many file formats including Parquet, CSV, JSON, ORC, Avro, and text files.

TL;DR: Use Apache Parquet instead of CSV or JSON whenever possible, because it's faster and better.

JSON is the worst file format for distributed systems and should be avoided whenever possible.

## Column oriented formats

CSV, JSON, and Avro are row oriented data formats. ORC and Parquet are column oriented data formats. Row oriented file formats require data from all the rows to be transmitted over the wire for every analysis.

Suppose you have the following DataFrame (`df`) and would like to query the `city` column. If the data is persisted as a CSV file, then `df.select("city")` will transmit all the data across the wire. If the data is persisted as a Parquet file, then `df.select("city")` will only have to transmit one column's worth of data across the wire.