Thursday, July 21, 2016

How to integrate the Jupyter notebook and Apache Spark on Windows [Tested on Spark 1.6.1]


  1. Download the Anaconda Python distribution and install it (https://www.continuum.io/downloads)
  2. Open a command prompt and run "ipython notebook" or "jupyter notebook"
  3. Create a new Python notebook and paste in the commands below

import os
import sys

# Point Spark and Java at their install locations
os.environ['SPARK_HOME'] = "C:/Spark1.6.1/spark-1.6.1-bin-hadoop2.6"
os.environ['JAVA_HOME'] = "C:/Program Files/Java/jdk1.8.0_73"

# Make the bundled PySpark sources and py4j visible to Python
sys.path.append("C:/Spark1.6.1/spark-1.6.1-bin-hadoop2.6/python")
sys.path.append("C:/Spark1.6.1/spark-1.6.1-bin-hadoop2.6/python/lib/pyspark.zip")
sys.path.append("C:/Spark1.6.1/spark-1.6.1-bin-hadoop2.6/python/lib/py4j-0.9-src.zip")

from pyspark import SparkContext
from pyspark import SparkConf

sc = SparkContext("local", "test")

Replace the SPARK_HOME path with your own Spark install location, and adjust the remaining paths to match.
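To avoid editing every line when a version changes, the same setup can also be written with the install location defined once and the rest derived from it — a minimal sketch, assuming the same hypothetical install paths as above:

```python
import os
import sys

# Assumed install locations -- change these two lines for your machine
SPARK_HOME = "C:/Spark1.6.1/spark-1.6.1-bin-hadoop2.6"
os.environ["SPARK_HOME"] = SPARK_HOME
os.environ["JAVA_HOME"] = "C:/Program Files/Java/jdk1.8.0_73"

# Build every Python-path entry from SPARK_HOME instead of repeating it
for sub in [
    ("python",),
    ("python", "lib", "pyspark.zip"),
    ("python", "lib", "py4j-0.9-src.zip"),
]:
    path = os.path.join(SPARK_HOME, *sub)
    if path not in sys.path:
        sys.path.insert(0, path)
```

A later upgrade then only needs the SPARK_HOME line changed (plus the py4j zip name if Spark bumps its bundled py4j version).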

Testing

textFile = sc.textFile("README.md")
textFile.count()



Saturday, June 27, 2015

How to upgrade Spark 1.3.1 to Spark 1.4.0

1) Download Spark 1.4.0 from https://spark.apache.org/downloads.html
2) Check the dependencies (Scala 2.11, Maven 3.3.3) using the commands
scala -version and mvn -version
3) Now build Spark using Apache Maven by running
mvn -DskipTests clean package
4) Wait for the build to succeed; this process takes around 45 minutes.
5) Check the Scala and PySpark shells using the commands
    ./bin/spark-shell (Scala)
    ./bin/pyspark       (Python)
6) If you need to use Spark 1.4 with the IPython notebook, follow the steps in the post below on using Spark in the IPython notebook.
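To confirm the upgrade took effect, both shells print a "Welcome to ... version x.y.z" banner on startup. A small helper can check that line — a sketch, where the banner string is just a sample of what the shell prints, not captured output:

```python
import re

def parse_spark_version(banner):
    """Return the (major, minor, patch) tuple from a Spark shell banner line."""
    m = re.search(r"version (\d+)\.(\d+)\.(\d+)", banner)
    if m is None:
        raise ValueError("no version found in banner")
    return tuple(int(x) for x in m.groups())

# Sample banner line of the kind spark-shell / pyspark prints on startup
print(parse_spark_version("Welcome to Spark version 1.4.0"))  # → (1, 4, 0)
```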

Thursday, June 18, 2015

How to use Spark in the IPython notebook


Note: This assumes you have the Anaconda Python distribution installed.
1. Create an IPython profile for the Spark configuration:

                                ipython profile create spark

2. This creates a profile_spark folder in your IPython directory.
3. Now create the file C:\Users\Bellamkonda\.ipython\profile_spark\startup\00-pyspark-setup.py and add the following:

import os
import sys
# Configure the environment
if 'SPARK_HOME' not in os.environ:
    os.environ['SPARK_HOME'] = 'C:/Spark/spark-1.3.1'  # insert your Spark location
# Create a variable for our root path
SPARK_HOME = os.environ['SPARK_HOME']
# Add the PySpark/py4j to the Python Path
sys.path.insert(0, os.path.join(SPARK_HOME, "python", "build"))
sys.path.insert(0, os.path.join(SPARK_HOME, "python"))


4. Now start up an IPython notebook with the profile we just created:
   
    ipython notebook --profile spark
              
5. The above command will launch your IPython notebook.
6. Check that PySpark works by importing its libraries or running commands
        Example: from pyspark import SparkContext
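A slightly friendlier sanity check for that first notebook cell — a hedged sketch, not part of the original post — reports whether the startup script actually put PySpark on the path instead of failing with a raw traceback:

```python
def pyspark_available():
    """Return True when the pyspark package can be imported from sys.path."""
    try:
        import pyspark  # noqa: F401 -- only importable after the setup script ran
        return True
    except ImportError:
        return False

print("PySpark importable:", pyspark_available())
```

If this prints False, re-check the SPARK_HOME value in 00-pyspark-setup.py.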

Source: https://districtdatalabs.silvrback.com/getting-started-with-spark-in-python



Tuesday, May 19, 2015

How to install Apache Spark on a Windows machine

1) Download Apache Spark from http://spark.apache.org/downloads.html
2) Extract the files and open the README.md documentation.
3) Building Spark with Maven requires Maven 3.0.4 or newer, Java 6+ and Scala.
4) Install the JDK and Maven (http://maven.apache.org/download.cgi) and set up their paths in the system variables.
5) Run the command mvn -DskipTests clean package to build the package.
6) Install Scala from http://www.scala-lang.org/
7) Add Scala to the system path in the system variables.
8) After the build completes, invoke the Spark shell using ./bin/spark-shell
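Most build failures in the steps above come down to one of the tools not being on PATH. A quick sketch that reports what is missing (the tool names are just the standard executables, nothing Spark-specific):

```python
import shutil

# The executables the Maven build of Spark expects to find on PATH
for tool in ("java", "mvn", "scala"):
    location = shutil.which(tool)
    print(f"{tool}: {location if location else 'NOT found on PATH'}")
```

Any line reporting NOT found points at the system-variable step that still needs doing.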




Saturday, May 16, 2015

How to use PowerShell in Console2

PowerShell combines the speed of the command line with the power of a scripting language. To use PowerShell in Console2:


In the Console2 settings, set the shell to
%SystemRoot%\syswow64\WindowsPowerShell\v1.0\powershell.exe

Sunday, February 8, 2015

How to install Java 8 on Ubuntu

The Oracle Java 8 stable release came out on March 18, 2014 and is available to download and install from the official download page. The Oracle Java PPA for Ubuntu and Linux Mint is maintained by the Webupd8 Team. Java 8 ships with many new features and security updates; read more about what's new in Oracle Java 8.

This article will help you install Oracle Java 8 (JDK/JRE 8u25) on Ubuntu 14.04 LTS, 12.04 LTS and 10.04 and on Linux Mint systems using the PPA. To install Java 8 on CentOS, Red Hat and Fedora, read this article.

Installing Java 8 on Ubuntu

Add the webupd8team Java PPA repository to your system and install Oracle Java 8 using the following commands.
$ sudo add-apt-repository ppa:webupd8team/java
$ sudo apt-get update
$ sudo apt-get install oracle-java8-installer

Verify Installed Java Version

After successfully installing Oracle Java with the step above, verify the installed version using the following command.
$ java -version

java version "1.8.0_31"
Java(TM) SE Runtime Environment (build 1.8.0_31-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.31-b07, mixed mode)
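The important part of that output is the quoted version string. Since releases before Java 9 report themselves as "1.x", a small sketch for checking you really got Java 8 (the sample line is copied from the output above):

```python
import re

# First line printed by `java -version` for this release
line = 'java version "1.8.0_31"'

major, minor = (int(x) for x in re.search(r'"(\d+)\.(\d+)', line).groups())
# Pre-9 releases report "1.x", so the feature release is the minor number
feature = minor if major == 1 else major
print("Java feature release:", feature)  # → 8
```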

Configuring Java Environment

The Webupd8 Team also provides a package that sets the environment variables. Install it using the following command.
$ sudo apt-get install oracle-java8-set-default
 
 
Source: TechAdmin

Wednesday, December 24, 2014

Hadoop: The Definitive Guide, 3rd Edition

Book Description

With this digital Early Release edition of Hadoop: The Definitive Guide, you get the entire book bundle in its earliest form - the author's raw and unedited content - so you can take advantage of this content long before the book's official release. You'll also receive updates when significant changes are made. Ready to unleash the power of your massive dataset? With the latest edition of this comprehensive resource, you'll learn how to use Apache Hadoop to build and maintain reliable, scalable, distributed systems. It's ideal for programmers looking to analyze datasets of any size, and for administrators who want to set up and run Hadoop clusters.

This third edition covers recent changes to Hadoop, including new material on the new MapReduce API, as well as version 2 of the MapReduce runtime (YARN) and its more flexible execution model. You'll also find illuminating case studies that demonstrate how Hadoop is used to solve specific problems.

source: http://it-ebooks.info/book/635/