Create a new PySpark3 notebook attached to the Spark cluster.
A new notebook is created and opened with the name Untitled.ipynb. Click the notebook name at the top, and enter a friendly name.
Because you created a notebook using the PySpark kernel, you do not need to create any contexts explicitly. The Spark and Hive contexts will be automatically created for you when you run the first code cell. You can start by importing the types that are required for this scenario. Paste the following snippet in an empty cell, and then press SHIFT + ENTER.
from pyspark.sql import Row
from pyspark.sql.types import *
Create an RDD using the sample log data already available on the cluster. You
can access the data in the default storage account associated with the cluster
at /HdiSamples/HdiSamples/WebsiteLogSampleData/SampleLog/909f2b.log.
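For example, a minimal snippet like the following creates the RDD, assuming the cluster's default storage is exposed through the wasb:// scheme (the variable name logs matches the verification step that follows):

# Load the raw IIS log file from the cluster's default storage into an RDD.
logs = sc.textFile('wasb:///HdiSamples/HdiSamples/WebsiteLogSampleData/SampleLog/909f2b.log')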
Retrieve a sample log set to verify that the previous step completed
successfully.
logs.take(5)
You should see an output similar to the following:
# -----------------
# THIS IS AN OUTPUT
# -----------------
[u'#Software: Microsoft Internet Information Services 8.0',
u'#Fields: date time s-sitename cs-method cs-uri-stem cs-uri-query s-port cs-username c-ip cs(User-Agent) cs(Cookie) cs(Referer) cs-host sc-status sc-substatus sc-win32-status sc-bytes cs-bytes time-taken',
u'2014-01-01 02:01:09 SAMPLEWEBSITE GET /blogposts/mvc4/step2.png X-ARR-LOG-ID=2ec4b8ad-3cf0-4442-93ab-837317ece6a1 80 - 1.54.23.196 Mozilla/5.0+(Windows+NT+6.3;+WOW64)+AppleWebKit/537.36+(KHTML,+like+Gecko)+Chrome/31.0.1650.63+Safari/537.36 - http://weblogs.asp.net/sample/archive/2007/12/09/asp-net-mvc-framework-part-4-handling-form-edit-and-post-scenarios.aspx www.sample.com 200 0 0 53175 871 46',
u'2014-01-01 02:01:09 SAMPLEWEBSITE GET /blogposts/mvc4/step3.png X-ARR-LOG-ID=9eace870-2f49-4efd-b204-0d170da46b4a 80 - 1.54.23.196 Mozilla/5.0+(Windows+NT+6.3;+WOW64)+AppleWebKit/537.36+(KHTML,+like+Gecko)+Chrome/31.0.1650.63+Safari/537.36 - http://weblogs.asp.net/sample/archive/2007/12/09/asp-net-mvc-framework-part-4-handling-form-edit-and-post-scenarios.aspx www.sample.com 200 0 0 51237 871 32',
u'2014-01-01 02:01:09 SAMPLEWEBSITE GET /blogposts/mvc4/step4.png X-ARR-LOG-ID=4bea5b3d-8ac9-46c9-9b8c-ec3e9500cbea 80 - 1.54.23.196 Mozilla/5.0+(Windows+NT+6.3;+WOW64)+AppleWebKit/537.36+(KHTML,+like+Gecko)+Chrome/31.0.1650.63+Safari/537.36 - http://weblogs.asp.net/sample/archive/2007/12/09/asp-net-mvc-framework-part-4-handling-form-edit-and-post-scenarios.aspx www.sample.com 200 0 0 72177 871 47']
Analyze log data using a custom Python library
In the output above, the first couple of lines include the header information, and each remaining line matches the schema described in that header. Parsing such logs can be complicated, so we use a custom Python library
(iislogparser.py) that makes parsing them much easier. By default, this library is included with your Spark cluster on HDInsight at /HdiSamples/HdiSamples/WebsiteLogSampleData/iislogparser.py.
However, this library is not in the PYTHONPATH, so we cannot use it with an import statement like import iislogparser. To use this library, we must distribute it to all the worker nodes. Run the following snippet.
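A minimal snippet, assuming the library lives at the default wasb:// path noted above:

# Ship iislogparser.py to every worker node so tasks can import it.
sc.addPyFile('wasb:///HdiSamples/HdiSamples/WebsiteLogSampleData/iislogparser.py')

Once the file is distributed with sc.addPyFile, tasks running on the workers can import it. The following sketch parses the log lines, assuming the library exposes a parse_log_line function (an assumption about its API):

def parse_line(l):
    # Import on the worker, where sc.addPyFile has made the module available.
    import iislogparser
    # Assumed API: parse_log_line turns a raw log line into a parsed entry.
    return iislogparser.parse_log_line(l)

# Skip the header lines (they start with '#') and parse the rest.
logLines = logs.filter(lambda x: not x.startswith('#')).map(parse_line)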
You can also present this information in the form of a plot. As a first step to create a plot, let us create a temporary table AverageTime. The table groups the logs by time to see if there were any unusual latency spikes at any particular time.
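A minimal sketch of building and registering such a table, not the tutorial's exact code: it assumes logLines holds the parsed entries from the previous step, and that each entry exposes datetime and timetaken attributes (hypothetical names; adapt them to the fields iislogparser actually produces):

from pyspark.sql import Row

# Hypothetical attribute names; adjust to the parsed log schema.
rows = logLines.map(lambda entry: Row(Minutes=entry.datetime.minute,
                                      Time=float(entry.timetaken)))

# Average the time taken per minute and register the result as a temporary
# table so it can be queried with the %%sql magic below.
avgTime = rows.toDF().groupBy('Minutes').avg('Time').withColumnRenamed('avg(Time)', 'Time')
avgTime.registerTempTable('AverageTime')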
You can then run the following SQL query to get all the records in the AverageTime table.
%%sql -o averagetime
SELECT * FROM AverageTime
The %%sql magic followed by -o averagetime ensures that the output of the query is persisted locally on the Jupyter server (typically the headnode of the cluster). The output is persisted as a Pandas dataframe with the specified name averagetime.
You can now use Matplotlib, a library used to construct visualizations of data, to create a plot. Because the plot must be created from the locally persisted averagetime dataframe, the code snippet must begin with the %%local magic. This ensures that the code runs locally on the Jupyter server.
%%local
%matplotlib inline
import matplotlib.pyplot as plt
plt.plot(averagetime['Minutes'], averagetime['Time'], marker='o', linestyle='--')
plt.xlabel('Time (min)')
plt.ylabel('Average time taken for request (ms)')
Running the cell renders a plot of the average time taken for requests (in ms) against time (in minutes) inline below the cell.
After you have finished running the application, you should shut down the notebook to release the resources. To do so, from the File menu on the notebook, click Close and Halt. This shuts down and closes the notebook.