Spark Run on Windows

To download Spark, on the Apache Spark download page:

  • Choose a Spark release
  • Choose a package type
  • Verify the release using the signatures, checksums, and project release KEYS

Note that Spark 2.x is pre-built with Scala 2.11, except version 2.4.2, which is pre-built with Scala 2.12. Spark 3.0+ is pre-built with Scala 2.12.

Latest Preview Release

Preview releases, as the name suggests, are releases for previewing upcoming features. Unlike nightly packages, preview releases have been audited by the project’s management committee to satisfy the legal requirements of Apache Software Foundation’s release policy. Preview releases are not meant to be functional, i.e. they can and highly likely will contain critical bugs or documentation errors. The latest preview release is Spark 3.0.0-preview2, published on Dec 23, 2019.

Spark artifacts are hosted in Maven Central. You can add a Maven dependency with the following coordinates:
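
For example, a minimal sketch of the coordinates for the Spark core artifact (shown for Spark 2.4.5 built against Scala 2.11; swap in the version and Scala suffix that match your release):

groupId: org.apache.spark
artifactId: spark-core_2.11
version: 2.4.5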

Installing with PyPI

PySpark is also available on PyPI. To install it, just run:

pip install pyspark
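
To confirm the installation, you can print the package version (a quick check, assuming pip placed pyspark on your default Python path):

python -c "import pyspark; print(pyspark.__version__)"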

Release Notes for Stable Releases

Archived Releases

As new Spark releases come out for each development stream, previous ones will be archived, but they are still available at Spark release archives.

NOTE: Previous releases of Spark may be affected by security issues. Please consult the Security page for a list of known issues that may affect the version you download before deciding to use it.

Install Spark on Windows (PySpark)

Apr 3, 2017

The video above walks through installing Spark on Windows following the set of instructions below. You can either leave a comment here or leave me a comment on YouTube (please subscribe if you can) if you have any questions!

Prerequisites: Anaconda and GOW. If you already have Anaconda and GOW installed, skip to step 5.

  1. Download and install GNU on Windows (GOW) from the following link. Basically, GOW allows you to use Linux commands on Windows. For this install, we will need curl, gzip, and tar, which GOW provides.

2. Download and install Anaconda. If you need help, please see this tutorial.

3. Close and open a new command line (CMD).

4. Go to the Apache Spark website (link)

a) Choose a Spark release

b) Choose a package type

c) Choose a download type: (Direct Download)

d) Download Spark. Keep in mind if you download a newer version, you will need to modify the remaining commands for the file you downloaded.

5. Move the file to where you want to unzip it.

mv C:\Users\mgalarny\Downloads\spark-2.1.0-bin-hadoop2.7.tgz C:\opt\spark\spark-2.1.0-bin-hadoop2.7.tgz

6. Unzip the file using the commands below.

gzip -d spark-2.1.0-bin-hadoop2.7.tgz

tar xvf spark-2.1.0-bin-hadoop2.7.tar

7. Download winutils.exe into your spark-2.1.0-bin-hadoop2.7\bin

8. Make sure you have Java 7+ installed on your machine.

9. Next, we will edit our environmental variables so we can open a spark notebook in any directory.

setx SPARK_HOME C:\opt\spark\spark-2.1.0-bin-hadoop2.7

setx HADOOP_HOME C:\opt\spark\spark-2.1.0-bin-hadoop2.7

setx PYSPARK_DRIVER_PYTHON ipython

setx PYSPARK_DRIVER_PYTHON_OPTS notebook

Add ;C:\opt\spark\spark-2.1.0-bin-hadoop2.7\bin to your path.

See the video if you want to update your path manually.

10. Close your terminal and open a new one. Type the command below.
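
pyspark --master local[2]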

Notes: The PYSPARK_DRIVER_PYTHON and PYSPARK_DRIVER_PYTHON_OPTS environment variables are used to launch the PySpark shell in Jupyter Notebook. The --master parameter sets the master node address; here we launch Spark locally on 2 cores for local testing.
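
Once the notebook opens, you can run a quick sanity check in a cell (a minimal sketch; the PySpark launcher predefines the SparkContext as sc):

# distribute a small dataset across the local cores and count it; should return 100
sc.parallelize(range(100)).count()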

Done! Please let me know if you have any questions here or through Twitter. You can view the ipython notebook used in the video to test PySpark here!


Spark 2: How to install it on Windows in 5 steps

This is a very easy tutorial that will let you install Spark on your Windows PC without using Docker.

By the end of the tutorial you’ll be able to use Spark with Scala or Python.

Before we begin:

It’s important that you replace all the paths that include the folder “Program Files” or “Program Files (x86)” as explained below to avoid future problems when running Spark.

If you have Java already installed, you still need to fix the JAVA_HOME and PATH variables:

  • Replace “Program Files” with “Progra~1”
  • Replace “Program Files (x86)” with “Progra~2”
  • Example: “C:\Program Files\Java\jdk1.8.0_161” -> “C:\Progra~1\Java\jdk1.8.0_161”

1. Prerequisite — Java 8

Before you start make sure you have Java 8 installed and the environment variables correctly defined:

  1. Download Java JDK 8 from Java’s official website
  2. Set the following environment variables:

  • JAVA_HOME = C:\Progra~1\Java\jdk1.8.0_161
  • PATH += C:\Progra~1\Java\jdk1.8.0_161\bin

  • Optional: _JAVA_OPTIONS = -Xmx512M -Xms512M (to avoid common Java heap memory problems with Spark)

Progra~1 is the shortened path for “Program Files”.

2. Spark: Download and Install

  1. Download Spark from Spark’s official website
  • Choose the newest release (2.3.0 in my case)
  • Choose the newest package type (Pre-built for Hadoop 2.7 or later in my case)
  • Download the .tgz file

2. Extract the .tgz file into D:\Spark

Note: In this guide I’ll be using my D drive but obviously you can use the C drive also

3. Set the environment variables:

  • SPARK_HOME = D:\Spark\spark-2.3.0-bin-hadoop2.7
  • PATH += D:\Spark\spark-2.3.0-bin-hadoop2.7\bin

3. Spark: Some more stuff (winutils)

  1. Download winutils.exe from here: https://github.com/steveloughran/winutils
  • Choose the same Hadoop version as the package type of the Spark .tgz file you chose in section 2 “Spark: Download and Install” (in my case: hadoop-2.7.1)
  • You need to navigate inside the hadoop-X.X.X folder, and inside the bin folder you will find winutils.exe
  • If you chose the same version as me (hadoop-2.7.1) here is the direct link: https://github.com/steveloughran/winutils/blob/master/hadoop-2.7.1/bin/winutils.exe

2. Move the winutils.exe file to the bin folder inside SPARK_HOME:

  • In my case: D:\Spark\spark-2.3.0-bin-hadoop2.7\bin

3. Set the HADOOP_HOME environment variable to be the same as SPARK_HOME, so Spark can find winutils.exe:

  • HADOOP_HOME = D:\Spark\spark-2.3.0-bin-hadoop2.7

4. Optional: Some tweaks to avoid future errors

This step is optional but I highly recommend you do it. It fixed some bugs I had after installing Spark.

Hive Permissions Bug

  1. Create the folder D:\tmp\hive
  2. Execute the following command in a cmd window started with the Run as administrator option:

  • cmd> winutils.exe chmod -R 777 D:\tmp\hive

3. Check the permissions
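
One way to check them (a sketch, assuming winutils.exe is on your PATH):

winutils.exe ls D:\tmp\hive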

5. Optional: Install Scala

If you are planning on using Scala instead of Python for programming in Spark, follow these steps:

  1. Download the Scala binaries for Windows (scala-2.12.4.msi in my case)

2. Install Scala from the .msi file

3. Set the environment variables:

  • SCALA_HOME = C:\Progra~2\scala
  • PATH += C:\Progra~2\scala\bin

Progra~2 is the shortened path for “Program Files (x86)”.

4. Check if Scala is working by running the following command in the cmd:
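
scala -version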

Testing Spark

PySpark (Spark with Python)

To test if Spark was successfully installed, run the following code from the pyspark shell (you can ignore the WARN messages):
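
For example, a minimal sketch (the pyspark shell predefines the SparkContext as sc):

nums = sc.parallelize([1, 2, 3, 4])
nums.map(lambda x: x * x).collect()  # expected: [1, 4, 9, 16]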

Scala-shell

To test if Scala and Spark were successfully installed, run the following code from spark-shell (only if you installed Scala on your computer):

PS: The query will not work if you have more than one spark-shell instance open.

PySpark with PyCharm (Python 3.x)

If you have PyCharm installed, you can also write a “Hello World” program to test PySpark:
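
A minimal sketch of such a program, assuming the pyspark package is importable from your PyCharm interpreter (for example, installed with pip):

from pyspark.sql import SparkSession

# start a local Spark session on 2 cores
spark = SparkSession.builder.master("local[2]").appName("HelloWorld").getOrCreate()

# build a tiny DataFrame and print it
df = spark.createDataFrame([("Hello",), ("World",)], ["word"])
df.show()

spark.stop()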

PySpark with Jupyter Notebook

You will need to use the findspark package to make a Spark context available in your code. This package is not specific to Jupyter Notebook; you can use it in your IDE too.

Install and launch the notebook from the cmd:
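
pip install findspark

jupyter notebook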

Create a new Python notebook and write the following at the beginning of the script:
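
A minimal sketch (findspark.init() reads SPARK_HOME; you can also pass the Spark folder explicitly):

import findspark
findspark.init()  # or findspark.init("D:\\Spark\\spark-2.3.0-bin-hadoop2.7")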

Now you can add your code to the bottom of the script and run the notebook.


Installing Apache Spark on Windows 10

Quick Dirty Self Note on Installing Spark

So I just got hold of some election data and when I tried crunching some numbers, well my computer wasn’t too happy. So finally I decided that I needed to learn Spark because someone needs to look into this election data and make cool maps — obviously me.

I’ve been browsing the web trying to find the easiest way to install Spark on my Windows machine. It looks like most guides require tons of steps and I’m not about to invest a significant amount of time trying to follow them to then fail. Here is the simplest way to do it, assuming you have Anaconda already installed.

Note: In the case you’re starting from scratch, I will advise you to follow this article and install a machine learning environment with Anaconda.

Installing the Java Development Kit

After you have installed Anaconda, we will proceed to install the Java Development Kit (JDK). This is a necessary step because Spark runs on top of the Scala programming language, and Scala runs on top of the JDK. So head over to Google, search for jdk, and click on the first result.

This will take you to Java downloads. Scroll down until you see the section below and click on the Download button.

This will take you to the download page. Scroll down to the section shown below and accept the License Agreement and select the download option for your operating system.

Once you select the JDK for your operating system, you will need to sign in or create an account in order to download the file. I thought this was weird, but whatever; it takes about 30 seconds to make an account.

Launch the exe file you downloaded. In my case the file name is:

This window will pop open. Just click Next.

How to Install Apache Spark on Windows 10


Introduction

Apache Spark is an open-source framework that processes large volumes of stream data from multiple sources. Spark is used in distributed computing with machine learning applications, data analytics, and graph-parallel processing.

This guide will show you how to install Apache Spark on Windows 10 and test the installation.

Prerequisites

  • A system running Windows 10
  • A user account with administrator privileges (required to install software, modify file permissions, and modify system PATH)
  • Command Prompt or PowerShell
  • A tool to extract .tar files, such as 7-Zip

Install Apache Spark on Windows

Installing Apache Spark on Windows 10 may seem complicated to novice users, but this simple tutorial will have you up and running. If you already have Java 8 and Python 3 installed, you can skip the first two steps.

Step 1: Install Java 8

Apache Spark requires Java 8. You can check to see if Java is installed using the command prompt.

Open the command line by clicking Start > type cmd > click Command Prompt.

Type the following command in the command prompt:
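
java -version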

If Java is installed, it will respond with the following output:
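
java version "1.8.0_251"
Java(TM) SE Runtime Environment (build 1.8.0_251-b08)
Java HotSpot(TM) 64-Bit Server VM (build 25.251-b08, mixed mode)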

Your version may be different. The second digit is the Java version – in this case, Java 8.

If you don’t have Java installed:

1. Open a browser window, and navigate to https://java.com/en/download/.

2. Click the Java Download button and save the file to a location of your choice.

3. Once the download finishes double-click the file to install Java.

Note: At the time this article was written, the latest Java version is 1.8.0_251. Installing a later version will still work. This process only needs the Java Runtime Environment (JRE) – the full Development Kit (JDK) is not required. The download link to JDK is https://www.oracle.com/java/technologies/javase-downloads.html.

Step 2: Install Python

1. To install the Python package manager, navigate to https://www.python.org/ in your web browser.

2. Mouse over the Download menu option and click Python 3.8.3 (the latest version at the time of writing this article).

3. Once the download finishes, run the file.

4. Near the bottom of the first setup dialog box, check off Add Python 3.8 to PATH. Leave the other box checked.

5. Next, click Customize installation.

6. You can leave all boxes checked at this step, or you can uncheck the options you do not want.

7. Click Next.

8. Select the box Install for all users and leave other boxes as they are.

9. Under Customize install location, click Browse and navigate to the C drive. Add a new folder and name it Python.

10. Select that folder and click OK.

11. Click Install, and let the installation complete.

12. When the installation completes, click the Disable path length limit option at the bottom and then click Close.

13. If you have a command prompt open, restart it. Verify the installation by checking the version of Python:
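
python --version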

The output should print Python 3.8.3.

Note: For detailed instructions on how to install Python 3 on Windows or how to troubleshoot potential issues, refer to our Install Python 3 on Windows guide.

Step 3: Download Apache Spark

1. Open a browser and navigate to the Spark download page at https://spark.apache.org/downloads.html.

2. Under the Download Apache Spark heading, there are two drop-down menus. Use the current non-preview version.

    • In our case, in the Choose a Spark release drop-down menu, select 2.4.5 (Feb 05 2020).
    • In the second drop-down, Choose a package type, leave the selection Pre-built for Apache Hadoop 2.7.

3. Click the spark-2.4.5-bin-hadoop2.7.tgz link.

4. A page with a list of mirrors loads where you can see different servers to download from. Pick any from the list and save the file to your Downloads folder.

Step 4: Verify Spark Software File

1. Verify the integrity of your download by checking the checksum of the file. This ensures you are working with unaltered, uncorrupted software.

2. Navigate back to the Spark Download page and open the Checksum link, preferably in a new tab.

3. Next, open a command line and enter the following command:
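
certutil -hashfile c:\users\username\Downloads\spark-2.4.5-bin-hadoop2.7.tgz SHA512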

4. Change username to your username. The system displays a long alphanumeric code, along with the message CertUtil: -hashfile command completed successfully.

5. Compare the code to the one you opened in a new browser tab. If they match, your download file is uncorrupted.

Step 5: Install Apache Spark

Installing Apache Spark involves extracting the downloaded file to the desired location.

1. Create a new folder named Spark in the root of your C: drive. From a command line, enter the following:
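
mkdir C:\Spark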

2. In Explorer, locate the Spark file you downloaded.

3. Right-click the file and extract it to C:\Spark using the tool you have on your system (e.g., 7-Zip).

4. Now, your C:\Spark folder has a new folder spark-2.4.5-bin-hadoop2.7 with the necessary files inside.

Step 6: Add winutils.exe File

Download the winutils.exe file for the underlying Hadoop version for the Spark installation you downloaded.

1. Navigate to this URL https://github.com/cdarlint/winutils and inside the bin folder, locate winutils.exe, and click it.

2. Find the Download button on the right side to download the file.

3. Now, create a hadoop folder with a bin subfolder inside it on the C: drive, using Windows Explorer or the Command Prompt (commands shown after the next step).

4. Copy the winutils.exe file from the Downloads folder to C:\hadoop\bin.
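
For example, from the Command Prompt (assuming winutils.exe is in your Downloads folder):

mkdir C:\hadoop\bin
copy %USERPROFILE%\Downloads\winutils.exe C:\hadoop\bin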

Step 7: Configure Environment Variables

Configuring environment variables in Windows adds the Spark and Hadoop locations to your system PATH. It allows you to run the Spark shell directly from a command prompt window.

1. Click Start and type environment.

2. Select the result labeled Edit the system environment variables.

3. A System Properties dialog box appears. In the lower-right corner, click Environment Variables and then click New in the next window.

4. For Variable Name type SPARK_HOME.

5. For Variable Value type C:\Spark\spark-2.4.5-bin-hadoop2.7 and click OK. If you changed the folder path, use that one instead.

6. In the top box, click the Path entry, then click Edit. Be careful with editing the system path. Avoid deleting any entries already on the list.

7. You should see a box with entries on the left. On the right, click New.

8. The system highlights a new line. Enter the path to the Spark folder C:\Spark\spark-2.4.5-bin-hadoop2.7\bin. We recommend using %SPARK_HOME%\bin to avoid possible issues with the path.

9. Repeat this process for Hadoop and Java.

    • For Hadoop, the variable name is HADOOP_HOME and for the value use the path of the folder you created earlier: C:\hadoop. Add C:\hadoop\bin to the Path variable field, but we recommend using %HADOOP_HOME%\bin.
    • For Java, the variable name is JAVA_HOME and for the value use the path to your Java JDK directory (in our case it’s C:\Program Files\Java\jdk1.8.0_251).

10. Click OK to close all open windows.
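
As an alternative to the dialog boxes, you can set the three variables from a Command Prompt with setx, in the same style used earlier in this article (a sketch; adjust the paths to match your installation, and note that the Path entries still need to be added as described above):

setx SPARK_HOME C:\Spark\spark-2.4.5-bin-hadoop2.7
setx HADOOP_HOME C:\hadoop
setx JAVA_HOME "C:\Program Files\Java\jdk1.8.0_251"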

Note: Start by restarting the Command Prompt to apply the changes. If that doesn’t work, you will need to reboot the system.
