Azure HDInsight

Learn how to create an Apache Spark cluster in HDInsight and run interactive Spark SQL queries using a Jupyter notebook.

What is Apache Spark on Azure HDInsight?

Spark clusters on HDInsight offer a fully managed Spark service. See the HDInsight documentation for the benefits of creating a Spark cluster on HDInsight.

Kernels for Jupyter notebook on Spark clusters in Azure HDInsight

HDInsight Spark clusters provide kernels that you can use with the Jupyter notebook to test your applications. A kernel is a program that runs and interprets your code. The three kernels are:

  • PySpark - for applications written in Python 2

  • PySpark3 - for applications written in Python 3

  • Spark - for applications written in Scala

Benefits of using the kernels

Here are a few benefits of using the new kernels with the Jupyter notebook on HDInsight Spark clusters.

  • Preset contexts. With the PySpark, PySpark3, or Spark kernels, you do not need to set the Spark or Hive contexts explicitly before you start working with your applications; they are available by default. These contexts are:

    • sc - for Spark context

    • sqlContext - for Hive context

      So, you don't have to run statements like the following to set the contexts:

       from pyspark import SparkContext
       from pyspark.sql import HiveContext

       sc = SparkContext('yarn-client')   # Spark context bound to the cluster's YARN master
       sqlContext = HiveContext(sc)       # Hive-aware SQL context built on it

      Instead, you can directly use the preset contexts in your application.
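
      For example, you can run a Hive query through the preset sqlContext right away; a minimal sketch, assuming the hivesampletable sample table present on HDInsight clusters:

       # sc and sqlContext are preset by the kernel; no setup is required
       rows = sqlContext.sql('SELECT clientid, market FROM hivesampletable LIMIT 5')
       rows.show()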

  • Cell magics. The PySpark kernel provides some predefined “magics”, which are special commands that you can call with %% (for example, %%MAGIC ). The magic must be the first word in a code cell and can be followed by multiple lines of content; adding anything before the magic, even a comment, causes an error. For more information on magics, see the Jupyter documentation.

The kernels also add several predefined magics of their own:

  • %%help - lists the available magics, with examples and descriptions

  • %%info - outputs the session information for the current Livy endpoint

  • %%configure - configures the parameters for creating a session, such as executor memory and cores

  • %%sql - runs a Hive query against the preset sqlContext

  • %%local - runs the code in the rest of the cell locally on the Jupyter server instead of on the cluster

  • %%logs - outputs the logs for the current Livy session

  • %%delete - deletes a specific session of the current Livy endpoint

  • %%cleanup - deletes all the sessions for the current Livy endpoint
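
For example, the %%sql magic turns the rest of a cell into a Hive query; a minimal sketch, again using the hivesampletable sample table:

    %%sql
    SELECT devicemake, COUNT(*) AS total
    FROM hivesampletable
    GROUP BY devicemake
    LIMIT 10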

Info: In addition to the magics added by the PySpark kernel, you can also use the built-in IPython magics, including %%sh. You can use the %%sh magic to run scripts and blocks of code on the cluster head node.
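
For instance, a cell that starts with %%sh runs its body as a shell script on the head node; a minimal sketch (the directory listed is only illustrative):

    %%sh
    # Runs on the cluster head node, outside the Spark session
    hostname
    ls /usr/hdp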

  • Auto visualization. The PySpark kernel automatically visualizes the output of Hive and SQL queries. You can choose among several types of visualizations, including Table, Pie, Line, Area, and Bar.
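
Beyond the built-in charts, the -o flag of the %%sql magic copies the query result into the local notebook session as a Pandas dataframe, where you can process it yourself; a minimal sketch (the variable name device_counts is arbitrary):

    %%sql -q -o device_counts
    SELECT devicemake, COUNT(*) AS total FROM hivesampletable GROUP BY devicemake

Then, in a separate cell, the exported dataframe can be inspected locally:

    %%local
    # device_counts is now a Pandas DataFrame on the Jupyter server
    device_counts.head()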
