
DataFrame APIs

DataFrame operations:

  • printSchema()
  • select()
  • show()
  • count()
  • groupBy()
  • sum()
  • limit()
  • orderBy()
  • filter()
  • withColumnRenamed()
  • join()
  • withColumn()
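
A hedged sketch chaining several of the operations above; it assumes a SparkSession `spark` and a DataFrame `df` with hypothetical columns "country" and "requests":

```scala
// Sketch only: `df`, "country", and "requests" are assumptions, not from the source.
import org.apache.spark.sql.functions._
import spark.implicits._

df.printSchema()                         // Inspect the inferred schema

df.filter($"requests" > 100)             // Keep rows matching a predicate
  .groupBy("country")                    // Group by a column
  .agg(sum("requests").alias("total"))   // Aggregate within each group
  .orderBy($"total".desc)                // Sort by the aggregate, descending
  .limit(10)                             // Keep the top 10 groups
  .show()                                // Print results to the console
```

Each call returns a new DataFrame, so operations compose into a single lazy pipeline that only executes when an action such as `show()` or `count()` runs.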


// In the Regular Expression below:
// ^  - Matches beginning of line
// .* - Matches any characters, except newline
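One hedged way such a pattern could be applied to a DataFrame column (the column name "title" and the pattern are assumptions for illustration):

```scala
// Sketch: keep rows whose "title" starts with "Apache", using
// ^ (beginning of line) and .* (any characters except newline).
import spark.implicits._

val apacheDF = df.filter($"title".rlike("^Apache.*"))
```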

df.show() // By default, show() displays the first 20 rows

// Import the sql functions package, which includes statistical functions like sum, max, min, avg, etc.
import org.apache.spark.sql.functions._
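
A minimal sketch of these statistical functions in an aggregation (the DataFrame `df` and its numeric "salary" column are assumptions):

```scala
// Sketch only: assumes a DataFrame `df` with a numeric "salary" column.
import org.apache.spark.sql.functions._

df.agg(sum("salary"), avg("salary"), min("salary"), max("salary")).show()
```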



A new column is constructed based on the input columns present in a dataframe:

df("columnName")        // On a specific DataFrame.
col("columnName")       // A generic column not yet associated with a DataFrame.
col("columnName.field") // Extracting a struct field
col("`a.column.with.dots`") // Escape `.` in column names.
$"columnName"           // Scala short hand for a named column.
expr("a + 1")           // A column that is constructed from a parsed SQL Expression.
lit("abc")              // A column that produces a literal (constant) value.

Column objects can be composed to form complex expressions:

$"a" + 1
$"a" === $"b"
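
A hedged sketch of these composed expressions used in transformations (the DataFrame `df` and columns "a" and "b" are assumptions):

```scala
// Sketch only: `df`, "a", and "b" are assumed names.
import org.apache.spark.sql.functions._
import spark.implicits._

df.select(($"a" + 1).alias("a_plus_one"))   // Arithmetic on a column
df.filter($"a" === $"b")                    // Equality test between columns
df.withColumn("scaled", expr("a + 1") * lit(2)) // Mix parsed SQL and literals
```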

File Read

CSV - Create a DataFrame with the anticipated structure

val clickstreamDF = spark.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("delimiter", "\t")
  .option("mode", "PERMISSIVE")
  .option("inferSchema", "true")
  .load("...")

PARQUET - To create Dataset[Row] using SparkSession

val people = spark.read.parquet("...")
val department = spark.read.parquet("...")

people.filter("age > 30")
  .join(department, people("deptId") === department("id"))
  .groupBy(department("name"), "gender")
  .agg(avg(people("salary")), max(people("age")))

Repartitioning / Caching

val clickstreamNoIDs8partDF = clickstreamNoIDsDF.repartition(8)

An ideal partition size in Spark is roughly 50 MB to 200 MB. Cached DataFrames are stored in Project Tungsten's compressed binary columnar format.
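
A hedged sketch of repartitioning followed by caching (the source DataFrame name comes from the example above; the count-to-materialize step is a common idiom, not from the source):

```scala
// Sketch only: repartition into 8 partitions, then cache.
val repartitionedDF = clickstreamNoIDsDF.repartition(8)

repartitionedDF.cache()   // Mark for storage in Tungsten's columnar format
repartitionedDF.count()   // Run an action to actually materialize the cache

println(repartitionedDF.rdd.getNumPartitions) // 8
```

Caching is lazy: `cache()` only marks the DataFrame, and the data is stored the first time an action such as `count()` scans it.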