Download an entire site on Linux

How do you download a web page from the Linux terminal?

The Linux command line provides great features for web crawling, in addition to its inherent capabilities for handling web servers and web browsing. In this article we will look at a few tools which are either available by default or can be installed in the Linux environment and used for offline web browsing. This is achieved by downloading a webpage, or many webpages, to the local machine.

Wget is probably the most famous of all the download options. It can download from HTTP, HTTPS, and FTP servers, it can download an entire website, and it also supports downloading through a proxy.

Below are the steps to get it installed and start using it.

Check if wget is already available
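A simple check is to look wget up on the PATH and inspect the exit code (wget --version would work just as well):

which wget
echo $?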

Running the above code prints the path to the wget binary if it is installed; otherwise it prints nothing and the exit code is non-zero.

If the exit code ($?) is 1, then we run the below command to install wget.
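On a Debian- or Ubuntu-based system this would be the following (use yum or dnf instead on Red Hat-based distributions):

sudo apt-get install wget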

Now we run the wget command for a specific webpage or a website to be downloaded.
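For example, to fetch a single page, using example.com as a stand-in address:

wget https://www.example.com/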

Running the above command downloads just the single web page rather than the whole website. The downloaded file gets saved in the current directory.

cURL is a client-side application. It supports downloading files over HTTP, HTTPS, FTP, FTPS, Telnet, IMAP, and more. Compared to wget, it supports additional types of downloads.

Below are the steps to get it installed and start using it.

Check if cURL is already available
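As with wget, look the binary up and check the exit code:

which curl
echo $?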

Running the above code gives us the exit code of the check.

The value of 1 indicates cURL is not available in the system. So we will install it using the below command.
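Again assuming a Debian-based system:

sudo apt-get install curl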

Running the above command installs cURL and prints the installation progress.

Next we use cURL to download a webpage.
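A minimal example, again with example.com as a placeholder; the -o option writes the response to the given local file:

curl -o index.html https://www.example.com/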

Running the above command downloads the page. You can locate the downloaded file in the current working directory.


Downloading entire websites — the wget utility

wget is a utility for downloading files over the network (from the internet). I will explain how to use wget to download entire websites and then browse them offline.

With wget you can download whole sites, including images, simply by specifying the site address and a few parameters. wget will automatically follow the links on the site and download page after page. Let's look at several examples of using wget to download sites, going from simple to more complex.

To download an entire site with wget, run the following command:
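Using the parameters described below, with http://site.com standing in for the real site address, the command has roughly this form:

wget -r -k -l 7 -p -E -nc http://site.com/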

After running this command, a local copy of the site http://site.com will be downloaded into the site.com directory. To open the site's main page, open the file index.html.

Let's look at the parameters used:

-r tells wget to follow links on the site recursively in order to download the pages.
-k makes wget convert all links in the downloaded files so that they can be followed on the local computer (offline).
-p tells wget to download all the files required to display the pages (images, CSS, and so on).
-l sets the maximum nesting depth of pages that wget should download (the default is 5; in this example we set it to 7). Most sites have deeply nested pages, and wget can simply "dig itself in", downloading more and more new pages. The -l parameter keeps that from happening.
-E appends the .html extension to downloaded files.
-nc prevents existing files from being overwritten. This is convenient when you need to resume a site download that was interrupted earlier.

We have looked at only one possible use of the wget utility. In fact, its range of application is much wider, and it has a large number of additional parameters. For more details, consult the manual by running man wget on the command line.


How To Download A Website With Wget The Right Way

Overview

To download an entire website from Linux, wget is often recommended; however, it must be run with the right parameters or the downloaded website won't be similar to the original one and will probably have broken relative links. This tutorial explores the right combination to download a website:

  • converting relative links to full paths so they can be browsed offline;
  • preventing the server from being overloaded by requesting too many web pages too fast, which could get you blocked from requesting more;
  • avoiding overwriting or creating duplicates of already downloaded files.

Alternatives

The download can be made using a recursive traversal approach or visiting each URL of the sitemap.

1. Recursive traversal

For this we use the well-known wget command.

GNU Wget is a free utility for non-interactive download of files from the Web

Wget needed parameters

The wget command is very popular in Linux and present in most distributions.

To download an entire website we use the following wget download options:

  • --wait=2 Wait the specified number of seconds between retrievals; in this case, 2 seconds.
  • --limit-rate=20K Limit the download speed to the given number of bytes per second.
  • --recursive Turn on recursive retrieving. The default maximum depth is 5; if the website has more levels than 5, you can specify it with --level=depth.
  • --page-requisites Download all the files that are necessary to properly display a given HTML page. This includes such things as inlined images, sounds, and referenced stylesheets.
  • --user-agent=Mozilla Identify as Mozilla to the HTTP server.
  • --no-parent Do not ever ascend to the parent directory when retrieving recursively.
  • --convert-links After the download is complete, convert the links in the documents to make them suitable for local viewing.
  • --adjust-extension If a file of type application/xhtml+xml or text/html is downloaded and the URL does not end with the regexp \.[Hh][Tt][Mm][Ll]?, this option will cause the suffix .html to be appended to the local filename.
  • --no-clobber When running wget with -r, re-downloading a file will result in the new copy simply overwriting the old. Adding -nc will prevent this behavior, instead causing the original version to be preserved and any newer copies on the server to be ignored.
  • -e robots=off Turn off the robot exclusion (ignore robots.txt).
  • --level Specify the maximum recursion depth. Use inf as the value for infinite.

Summary

Summarizing, these are the needed parameters:
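Putting the options above together against a sample address (https://example.com is used here as a placeholder) gives something like:

wget --recursive --level=inf --page-requisites --convert-links --adjust-extension --no-parent --no-clobber --wait=2 --limit-rate=20K --user-agent=Mozilla -e robots=off https://example.com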

Example

Let’s try to download the https://example.com website (a single page) to see how verbose wget is and how it behaves.

Wget mirror

Wget already comes with a handy --mirror parameter that is the same as using -r -l inf -N (see the example after the list below). That is:

  • recursive download
  • with infinite depth
  • turn on time-stamping.
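In practice you would still combine it with the other options covered above; a rough sketch:

wget --mirror --convert-links --adjust-extension --page-requisites --no-parent https://example.com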

2. Using website’s sitemap

Another approach is to avoid doing a recursive traversal of the website and instead download all the URLs present in the website's sitemap.xml.

Filtering URLs from the sitemap

A sitemap file typically has the form:
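A trimmed-down sketch (real sitemaps usually carry extra tags such as lastmod, and example.com paths here are placeholders):

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
  </url>
  <url>
    <loc>https://example.com/about/</loc>
  </url>
</urlset>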

We need to get all the URLs present in sitemap.xml, using grep: grep "<loc>" sitemap.xml

Removing loc tags

Now to remove the superfluous loc tags: sed -e 's/<[^>]*>//g'

Putting it all together

After the previous two commands we have a list of URLs, and that is the input read by wget -i:

wget -i <(grep "<loc>" sitemap.xml | sed -e 's/<[^>]*>//g')

(Here bash process substitution feeds the generated URL list to -i as if it were a file.)

And wget will start downloading them sequentially.

Conclusion

wget is a fantastic command line tool; it has everything you will ever need without having to use any GUI tool. Just be sure to browse its manual for the right parameters for your case.

The combination of parameters above will give you a website that is browsable locally.

You should check carefully that the .html extensions work for your case: sometimes you want wget to generate them based on the Content-Type, but sometimes you should stop wget from generating them, as is the case when using pretty URLs.


How to Use the wget Linux Command to Download Web Pages and Files

Download directly from the Linux command line

What to Know

  • To download a full site, use the following command with the web address of the site: wget -r [site address]
  • To run wget as a background command use: wget -b [site address]

Features of the wget Command

You can download entire websites using wget and convert the links to point to local sources so that you can view a website offline. The wget utility also retries a download when the connection drops and, where possible, resumes from where it left off when the connection returns.

Other features of wget are as follows:

  • Download files using HTTP, HTTPS, and FTP.
  • Resume downloads.
  • Convert absolute links in downloaded web pages to relative URLs so that websites can be viewed offline.
  • Supports HTTP proxies and cookies.
  • Supports persistent HTTP connections.
  • It can run in the background even when you aren’t logged on.
  • Works on Linux and Windows.

How to Download a Website Using wget

The wget utility downloads web pages, files, and images from the web using the Linux command line. You can use a single wget command to download from a site or set up an input file to download multiple files across multiple sites.

According to the manual page, wget can be used even when the user has logged out of the system. To do this, use the nohup command.
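A minimal sketch of that approach, using the blog address from this guide:

nohup wget -r www.everydaylinuxuser.com &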

For this guide, you will learn how to download this Linux blog: www.everydaylinuxuser.com

Before you begin, create a folder on your machine using the mkdir command, and then move into the folder using the cd command.

mkdir everydaylinuxuser
cd everydaylinuxuser
wget www.everydaylinuxuser.com

The result is a single index.html file that contains the content pulled from the site. The images and stylesheets are still held on the remote server.

To download the full site and all the pages, use the following command:

wget -r www.everydaylinuxuser.com

This downloads the pages recursively up to a maximum of 5 levels deep. Five levels deep might not be enough to get everything from the site. Use the -l switch to set the number of levels you wish to go to, as follows:

wget -r -l10 www.everydaylinuxuser.com

If you want infinite recursion, use the following:

wget -r -l inf www.everydaylinuxuser.com

You can also replace the inf with 0, which means the same thing.

There is one more problem. You might get all the pages locally, but the links in the pages still point to their original locations, so it isn't possible to click locally between the links on the pages.

To get around this problem, use the -k switch to convert the links on the pages to point to the locally downloaded equivalent, as follows:

wget -r -k www.everydaylinuxuser.com

If you want to get a complete mirror of a website, use the following switch, which takes away the necessity of using the -r, -N, and -l switches (add -k if you also want the links converted for local viewing):

wget -m www.everydaylinuxuser.com

If you have a website, you can make a complete backup using this one simple command.

Run wget as a Background Command

You can get wget to run as a background command leaving you able to get on with your work in the terminal window while the files download. Use the following command:

wget -b www.everydaylinuxuser.com

You can combine switches. To run the wget command in the background while mirroring the site, use the following command:

wget -b -m www.everydaylinuxuser.com

You can simplify this further, as follows:

wget -bm www.everydaylinuxuser.com

Logging

If you run the wget command in the background, you don’t see any of the normal messages it sends to the screen. You can send those messages to a log file instead and check on progress at any time with the tail command.

To output information from the wget command to a log file, use the following command:

wget -o /path/to/mylogfile www.everydaylinuxuser.com

The reverse is to require no logging at all and no output to the screen. To omit all output, use the following command:

wget -q www.everydaylinuxuser.com

Download From Multiple Sites

You can set up an input file to download from many different sites. Open a file using your favorite editor or the cat command and list the sites or links to download from on each line of the file. Save the file, and then run the following wget command:

wget -i /path/to/inputfile

Apart from backing up your website or finding something to download to read offline, it is unlikely that you will want to download an entire website. You are more likely to download a single URL with images or download files such as zip files, ISO files, or image files.


With that in mind, you don’t have to type the following into the input file as it is time consuming:

  • http://www.myfileserver.com/file1.zip
  • http://www.myfileserver.com/file2.zip
  • http://www.myfileserver.com/file3.zip

If you know the base URL is the same, specify the following in the input file:
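With a shared base, the input file only needs the paths relative to it, something like:

  • /file1.zip
  • /file2.zip
  • /file3.zip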

You can then provide the base URL as part of the wget command, as follows:

wget -B http://www.myfileserver.com -i /path/to/inputfile

Retry Options

If you set up a queue of files to download in an input file and you leave your computer running, some downloads may fail while you’re away, and wget will retry them. You can specify the number of retries using the following switch:

wget -t 10 -i /path/to/inputfile

Use the above command in conjunction with the -T switch to specify a timeout in seconds, as follows:

wget -t 10 -T 10 -i /path/to/inputfile

The above command retries each link up to 10 times and gives up on a connection attempt after 10 seconds.

It is also inconvenient when you download 75% of a 4-gigabyte file on a slow broadband connection, only for the connection to drop. To use wget to retry from where it stopped downloading, use the following command:

wget -c www.myfileserver.com/file1.zip

If you hammer a server, the host might not like it and might block or kill your requests. You can specify a waiting period to control how long to wait between each retrieval, as follows:

wget -w 60 -i /path/to/inputfile

The above command waits 60 seconds between each download. This is useful if you download many files from a single source.

Some web hosts might spot the frequency and block you. You can make the waiting period random to make it look like you aren’t using a program, as follows:

wget --random-wait -i /path/to/inputfile

Protect Download Limits

Many internet service providers apply download limits for broadband usage, especially for those who live outside of a city. You may want to add a quota so that you don’t go over your download limit. You can do that in the following way:

wget -Q 100m -i /path/to/inputfile

The -Q option won’t work with a single file. If you download a file that is 2 gigabytes in size, using -Q 1000m doesn’t stop the file from downloading.

The quota is only applied when recursively downloading from a site or when using an input file.

Get Through Security

Some sites require you to log in to access the content you wish to download. Use the following switches to specify the username and password.

wget --user=yourusername --password=yourpassword

On a multi-user system, when someone runs the ps command, they can see your username and password.

Other Download Options

By default, the -r switch recursively downloads the content and creates directories as it goes. To get all the files to download to a single folder, use the following switch:
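The switch for this is -nd (short for --no-directories), for example:

wget -r -nd www.everydaylinuxuser.com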

The opposite of this is to force the creation of directories, which can be achieved using the following command:
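That is the -x (--force-directories) switch:

wget -x www.everydaylinuxuser.com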

How to Download Certain File Types

If you want to download recursively from a site, but you only want to download a specific file type such as an MP3 or an image such as a PNG, use the following syntax:
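The -A (--accept) switch takes a comma-separated list of suffixes or patterns, for example:

wget -r -A mp3,png www.everydaylinuxuser.com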

The reverse of this is to ignore certain files. Perhaps you don’t want to download executables. In this case, use the following syntax:
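The -R (--reject) switch works the same way in reverse; for example, to skip anything ending in exe:

wget -r -R exe www.everydaylinuxuser.com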

Cliget

There is a Firefox add-on called cliget. To add this to Firefox:

Visit https://addons.mozilla.org/en-US/firefox/addon/cliget/ and click the add to Firefox button.

Click the install button when it appears, and then restart Firefox.

To use cliget, visit a page or file you wish to download and right-click. A context menu appears called cliget, and there are options to copy to wget and copy to curl.

Click the copy to wget option, open a terminal window, then right-click and choose paste. The appropriate wget command is pasted into the window.

This saves you from having to type the command yourself.

Summary

The wget command has several options and switches. To read the manual page for wget, type the following in a terminal window:
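man wget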
