
Basic set-up for distributed machine learning
After struggling for a few hours, I finally installed Java 8 and Spark and configured all the environment variables. I went through a lot of Medium articles and StackOverflow answers, but no single answer or post solved all of my problems. So this is just a small effort of mine to put everything together.
My machine runs Ubuntu, and I am using Java 8 along with Anaconda3. If you follow these steps, you should be able to install PySpark without any problem.
1. Make sure that you have Java installed.
If you don’t, run the following command in the terminal:
sudo apt install openjdk-8-jdk
After installation, if you type java -version in the terminal you will get output along these lines (the exact version and build numbers will vary):
openjdk version "1.8.0_xxx"
OpenJDK Runtime Environment (build 1.8.0_xxx-...ubuntu...-b03)
OpenJDK 64-Bit Server VM (build 25.xxx-b03, mixed mode)
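A side note from me, not strictly required: if a newer Java is already installed alongside Java 8, Spark may pick up the wrong one. On Ubuntu you can switch the system default with update-alternatives:
sudo update-alternatives --config java
This lists the installed JDKs and lets you choose which one the java command points to.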
2. Download Spark from https://spark.apache.org/downloads.html
Remember the directory you downloaded it to. Mine went to the default Downloads folder, which is also where I will install Spark.
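If you prefer the terminal, you can also fetch the archive with wget from the Apache archive. The release below (Spark 2.4.0 with Hadoop 2.7) is only an example; substitute whichever version the downloads page offers:
cd ~/Downloads
wget https://archive.apache.org/dist/spark/spark-2.4.0/spark-2.4.0-bin-hadoop2.7.tgz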
3. Set the $JAVA_HOME environment variable.
For this, run the following in the terminal:
sudo vim /etc/environment
It will open the file in vim. Then, on a new line after the PATH variable, add:
JAVA_HOME="/usr/lib/jvm/java-8-openjdk-amd64"
Type :wq and hit Enter. This will save the edit and exit vim. Then, in the terminal, run:
source /etc/environment
Don’t forget to run this line in the terminal, as it loads the environment variable into the currently running shell. Now, if you run:
echo $JAVA_HOME
the output should be:
/usr/lib/jvm/java-8-openjdk-amd64
just like it was added. Some versions of Ubuntu do not load /etc/environment every time a terminal is opened, so it’s better to source it from the .bashrc file, as .bashrc is loaded every time a terminal starts. So run the following command in the terminal:
vim ~/.bashrc
When the file opens, add at the end:
source /etc/environment
We will add the Spark variables below this line later. Exit for now and reload the .bashrc file by running:
source ~/.bashrc
Alternatively, you can close this terminal and open another. Now, if you run echo $JAVA_HOME, you should get the expected output.
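If you are unsure which JDK path to put in JAVA_HOME on your machine (it can differ from the one above), one way to find it, assuming java is already on your PATH, is:
readlink -f "$(which java)"
This prints something like /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java; JAVA_HOME is everything before /jre/bin/java.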
4. Install Spark.
Go to the directory where the Spark archive was downloaded and run the commands to extract it:
cd Downloads
sudo tar -zxvf spark-x.x.x-bin-hadoopx.x.tgz
Note: spark-x.x.x-bin-hadoopx.x.tgz is a placeholder; correct the name to match the exact file of the version you downloaded.
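Optionally (my suggestion, not part of the original steps), create a version-free symlink so the environment variables in the next step survive a Spark upgrade; again, spark-x.x.x-bin-hadoopx.x stands in for your actual folder name:
ln -s ~/Downloads/spark-x.x.x-bin-hadoopx.x ~/Downloads/spark
If you do this, you can point SPARK_HOME at ~/Downloads/spark instead of the versioned folder.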
5. Configure environment variables for Spark.
vim ~/.bashrc
Add the following at the end (correcting the Spark folder name to match your version):
export SPARK_HOME=~/Downloads/spark-x.x.x-bin-hadoopx.x
export PATH=$PATH:$SPARK_HOME/bin
export PATH=$PATH:~/anaconda3/bin
export PYTHONPATH=$SPARK_HOME/python:$PYTHONPATH
export PYSPARK_DRIVER_PYTHON="jupyter"
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
export PYSPARK_PYTHON=python3
export PATH=$PATH:$JAVA_HOME/jre/bin
Save the file and exit. Finally, load the .bashrc file again in the terminal:
source ~/.bashrc
Now run:
pyspark
This should open a Jupyter notebook for you. (To check that these steps are reproducible, I uninstalled Spark and Java and ran all the commands again.) Finally, if you do:
cd $SPARK_HOME
cd bin
spark-shell --version
you will see the Spark banner with the version information.
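As one final sanity check (my own addition, not part of the original walkthrough), you can run the pi-estimation example that ships with the Spark binary distribution. Note that the PYSPARK_DRIVER_PYTHON="jupyter" setting from step 5 may also be picked up by spark-submit, so override it for this one command:
PYSPARK_DRIVER_PYTHON=python3 spark-submit --master "local[2]" "$SPARK_HOME/examples/src/main/python/pi.py" 10
The trailing 10 is the number of partitions to use; the job should print an approximation of pi to the terminal.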
Conclusion
I hope this article helps you install Spark. Spark has become a very important tool for distributed machine learning, and it’s a must-have on the resume for any data science or machine learning job. Setting up Spark is the natural first step, followed by learning Spark DataFrames and then using them in a project.
Contact
If you love data science, let’s connect on LinkedIn or follow me here. If you liked the story, please appreciate it by clapping. Thanks :)