Linux html to text

html2text

Synopsis

html2text -help
html2text -version
html2text [ -unparse | -check ] [ -debug-scanner ] [ -debug-parser ] [ -rcfile path ] [ -style ( compact | pretty ) ] [ -width width ] [ -o output-file ] [ -nobs ] [ -ascii | -utf8 ] [ -nometa ] [ input-url . ]

description

html2text reads HTML documents from the input-urls, formats each of them into a stream of plain text characters, and writes the result to standard output (or into output-file, if the -o command line option is used).

If no input-urls are specified on the command line, html2text reads from standard input. A dash as the input-url is an alternate way to specify standard input.

html2text understands all HTML 3.2 constructs, but can render only part of them due to the limitations of the text output format. However, the program attempts to provide good substitutes for the elements it cannot render. html2text parses HTML 4 input, too, but not always as successfully as other HTML processors. It also accepts syntactically incorrect input, and attempts to interpret it «reasonably».

The way html2text formats the HTML documents is controlled by formatting properties read from an RC file. html2text attempts to read $HOME/.html2textrc (or the file specified by the -rcfile command line option); if that file cannot be read, html2text attempts to read /etc/html2textrc. If no RC file can be read (or if the RC file does not override all formatting properties), then «reasonable» defaults are assumed. The RC file format is described in the html2textrc(5) manual page.
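The lookup order described above can be sketched in shell. This is only an illustration of the documented behavior, not html2text's own code; pick_rcfile is a hypothetical helper name, and both candidate paths can be overridden by arguments.

```shell
# Sketch of the RC file lookup order from the description above.
# pick_rcfile is a hypothetical helper, not part of html2text.
#   $1: per-user candidate (default: $HOME/.html2textrc; pass the -rcfile path here)
#   $2: system-wide candidate (default: /etc/html2textrc)
pick_rcfile() {
    user_rc=${1:-$HOME/.html2textrc}
    system_rc=${2:-/etc/html2textrc}
    if [ -r "$user_rc" ]; then
        printf '%s\n' "$user_rc"
    elif [ -r "$system_rc" ]; then
        printf '%s\n' "$system_rc"
    else
        # Nothing readable: html2text would fall back to its built-in defaults.
        return 1
    fi
}
```

Calling pick_rcfile with no arguments prints whichever of the two default files is readable, which is a quick way to see which RC file a conversion is likely to pick up.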

The Debian version of html2text can also recode input and output (see /usr/share/doc/html2text/README.Debian for more info). html2text tries to detect the encoding from the HTML document. If the encoding is not specified, you can use the -ascii and -utf8 options. Output is converted to the user's locale charset (LC_CTYPE).

options

-nometa
By default, the Debian version of html2text uses the 'meta http-equiv' tag for input recoding. This option cancels this behavior.

-ascii
By default, when -nometa is supplied, html2text uses UTF-8 for the output. Specifying this option, plain ASCII is used instead. To find out how non-ASCII characters are rendered, refer to the file «ascii.substitutes».

-utf8
By default, when -nometa is supplied, html2text uses ISO 8859-1 for the input. Specifying this option, UTF-8 is used instead (both for input and output). This option implies -nobs.

-debug-parser
Let html2text report on the tokens being shifted, rules being applied, etc., while scanning the HTML document. This option is for diagnostic purposes.

-debug-scanner
Let html2text report on each lexical token scanned, while scanning the HTML document. This option is for diagnostic purposes.

-help
Print command line summary and exit.

-nobs
By default, the original html2text renders underlined letters with sequences like «underscore-backspace-character» and boldface letters like «character-backspace-character». Because of issues with UTF-8, the Debian version of html2text doesn't produce backspaces, so this option really does nothing.

-o output-file
Write the output to output-file instead of standard output. A dash as the output-file is an alternate way to specify the standard output.

-rcfile path
Attempt to read the file specified in path as RC file.

-style ( compact | pretty )
Style pretty changes some of the default values of the formatting parameters documented in html2textrc(5). To find out which and how the formatting parameter defaults are changed, check the file «pretty.style». If this option is omitted, style compact is assumed as default.

-unparse
This option is for diagnostic purposes: Instead of formatting the parsed document, generate HTML code that is guaranteed to be syntactically correct. If html2text has problems parsing a syntactically incorrect HTML document, this option may help you to understand what html2text thinks that the original HTML code means.

-version
Print program version and exit.

-width width
By default, html2text formats the HTML documents for a screen width of 79 characters. If redirecting the output into a file, or if your terminal has a width other than 80 characters, or if you just want to get an idea how html2text deals with large tables and different terminal widths, you may want to specify a different width.
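As a small illustration of how these options combine, the sketch below wraps a typical call in a shell function. The function name html_to_txt, the default width of 100 and the output naming scheme are assumptions made for the example; only -style, -width and -o come from the synopsis.

```shell
# Convert one local HTML file to a .txt file next to it.
# html_to_txt is a hypothetical helper name; the default width of 100
# is an arbitrary choice for this example.
html_to_txt() {
    in=$1
    width=${2:-100}
    out=${in%.html}.txt    # page.html -> page.txt
    html2text -style pretty -width "$width" -o "$out" "$in"
}
```

For instance, html_to_txt manual.html 120 would write manual.txt formatted for 120 columns.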

conforming to

HTML 3.2 (HTML 3.2 Reference Specification — http://www.w3.org/TR/REC-html32).

files

/etc/html2textrc
System wide parser configuration file.

$HOME/.html2textrc
Personal parser configuration file, overrides the system wide values.

restrictions

The Debian version of html2text has no HTTP support. Use html2text through pipes with curl or wget instead. See README.Debian for more information.
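A minimal sketch of the pipe approach with curl follows. The function name url_to_text is hypothetical; the curl flags used are -f (fail on HTTP errors), -s (silent), -S (still print errors) and -L (follow redirects).

```shell
# Fetch a page with curl and pipe it into html2text, which reads
# standard input when no input-url is given.
# url_to_text is a hypothetical helper name for this example.
url_to_text() {
    curl -fsSL "$1" | html2text
}
```

Something like url_to_text https://www.pc-freak.net/ > page.txt would then save the rendered text; because curl follows redirects itself, html2text never has to deal with HTTP.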

html2text was written to convert HTML 3.2 documents. When using it with HTML 4 or even XHTML 1 documents, some constructs present only in these HTML versions might not be rendered.

html2text was written up to version 1.2.2 by Arno Unkrig for GMRS Software GmbH, Unterschleissheim.

Current maintainer and primary download location is:
Martin Bayer
http://www.mbayer.de/html2text/files.shtml


html2text(1) — Linux man page

html2text — an advanced HTML-to-text converter

Synopsis

Description

Documents that are specified by a URL (RFC 1738) that begins with «http:» are retrieved with the Hypertext Transfer Protocol (RFC 1945). URLs that begin with «file:» and URLs that do not contain a colon specify local files. All other URLs are invalid.

If no input-urls are specified on the command line, html2text reads from standard input. A dash as the input-url is an alternate way to specify standard input.

Options

-ascii
By default, html2text uses ISO 8859-1 for the output. Specifying this option, plain ASCII is used instead. To find out how non-ASCII characters are rendered, refer to the file «ascii.substitutes».

-check
This option is for diagnostic purposes: The HTML document is only parsed and not processed otherwise. In this mode of operation, html2text will report on parse errors and scan errors, which it does not in other modes of operation. Note that parse and scan errors are not fatal for html2text, but may cause mis-interpretation of the HTML code and/or portions of the document being swallowed.

-debug-parser
Let html2text report on the tokens being shifted, rules being applied, etc., while scanning the HTML document. This option is for diagnostic purposes.

-debug-scanner
Let html2text report on each lexical token scanned, while scanning the HTML document. This option is for diagnostic purposes.

-help
Print command line summary and exit.

-nobs
By default, html2text renders underlined letters with sequences like «underscore-backspace-character» and boldface letters like «character-backspace-character», which works fine when the output is piped into more(1), less(1), or similar. For other applications, or when redirecting the output into a file, it may be desirable not to render character attributes with such backspace sequences, which can be accomplished with this command line option.

-o output-file
Write the output to output-file instead of standard output. A dash as the output-file is an alternate way to specify the standard output.

-rcfile path
Attempt to read the file specified in path as RC file.

-style ( compact | pretty )
Style pretty changes some of the default values of the formatting parameters documented in html2textrc(5). To find out which and how the formatting parameter defaults are changed, check the file «pretty.style». If this option is omitted, style compact is assumed as default.

-unparse
This option is for diagnostic purposes: Instead of formatting the parsed document, generate HTML code that is guaranteed to be syntactically correct. If html2text has problems parsing a syntactically incorrect HTML document, this option may help you to understand what html2text thinks that the original HTML code means.

-version
Print program version and exit.

-width width
By default, html2text formats the HTML documents for a screen width of 79 characters. If redirecting the output into a file, or if your terminal has a width other than 80 characters, or if you just want to get an idea how html2text deals with large tables and different terminal widths, you may want to specify a different width.
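If text was already produced without -nobs, the backspace sequences described under that option can be removed afterwards. The sketch below is one way to do it with sed; strip_overstrike is a hypothetical helper name, and it simply deletes every "any character followed by a backspace" pair, so only the final character of each overstrike survives.

```shell
# Remove overstrike sequences such as "_<BS>x" (underline) and
# "x<BS>x" (bold), keeping only the last character of each.
# strip_overstrike is a hypothetical helper name for this example.
strip_overstrike() {
    bs=$(printf '\b')    # a literal backspace character
    sed "s/.$bs//g"
}
```

It is meant to be used as a filter, for example html2text page.html | strip_overstrike > page.txt. The col utility with its -b option is a common alternative where it is installed.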

Files

/etc/html2textrc
System wide parser configuration file.

$HOME/.html2textrc
Personal parser configuration file, overrides the system wide values.

Conforming To

HTML 3.2 (HTML 3.2 Reference Specification — http://www.w3.org/TR/REC-html32),
RFC 1945 (Hypertext Transfer Protocol — HTTP).

Restrictions

html2text provides only a basic implementation of the Hypertext Transfer Protocol (HTTP). It requires the complete and exactly matching URL to be given as argument and will not follow redirections (HTTP 301/307).

html2text was written to convert HTML 3.2 documents. When using it with HTML 4 or even XHTML 1 documents, some constructs present only in these HTML versions might not be rendered.

Author

html2text was written up to version 1.2.2 by Arno Unkrig for GMRS Software GmbH, Unterschleissheim.

☩ Walking in Light with Christ – Faith, Computing, Diary

How to convert html pages to text in console / terminal on GNU / Linux and FreeBSD

Thursday, 8th December 2011

I'm realizing that the more I turn into a full-time GUI user, the less coding or other interesting stuff I do…
I remembered the old glorious times when I was a full-time console user, and recalled a nifty trick I used all the time back in the day.
Back then I quite often wrote shell scripts that fetched (HTML) web pages and converted the HTML content into plain text (TXT) files.

To fetch a page back in the day I used lynx (a very simple UNIX text browser, which by the way lacks support for CSS and JavaScript) in combination with html2text (an advanced HTML-to-text converter).

Let's say I wanted to fetch my personal home page, https://www.pc-freak.net/. I did that with the command:

$ lynx -source https://www.pc-freak.net/ | html2text > pcfreak_page.txt

lynx dumps the content of www.pc-freak.net as HTML source and passes it to html2text, whose output is saved in the plain text file pcfreak_page.txt.
The somewhat more advanced elinks (a lynx-like alternative character mode WWW browser) provides better support for HTML and even some CSS and JavaScript, so to properly save the content of many pages as plain text it is better to use it instead of lynx. The way to produce .txt files using elinks is identical, e.g.:

$ elinks -source https://www.pc-freak.net/blog/ | html2text > pcfreak_blog_page.txt

By the way, back in the day I was more used to links than to the superior elinks. Nowadays I have both of these text browsers installed, and when I tested fetching a page as in the example above and piping it to html2text, the output came out garbled.

This is the time to mention that it's not even necessary to have a text browser installed in order to fetch a web page and convert it to plain text. The wget file downloading tool can dump a page's source as well; for all those who haven't tried it yet and want to test it:

$ wget -qO- https://www.pc-freak.net | html2text

Of course, for some pages the text inside the HTML will not be saved properly with either lynx or elinks, because some of it might be generated by CSS or JavaScript that elinks and lynx do not support. In those cases a GUI browser is useful. You can use the File -> Save As (Text Files) functionality built into any browser such as Firefox, Epiphany or Opera.

Besides being handy in conjunction with text browsers, html2text is also handy for converting .html pages that already exist on the computer's hard drive to plain text (.TXT) format.
One might wonder why anyone would ever want to do that. Well, I personally prefer reading plain text documents instead of HTML 😉
Converting an HTML file that already exists on the hard drive with html2text is done with the command:

$ html2text index.html >index.txt

To convert a whole directory full of .html (documentation) or other files to plain text (.TXT), cd into the directory with the HTML files and issue this one-liner bash loop:

$ cd html/
html$ for i in *.html; do html2text "$i" > "${i%.html}.txt"; done
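If the HTML files are spread over subdirectories, the same idea can be extended with find. This is only a sketch under a few assumptions: convert_tree is a hypothetical helper name, each .txt file is written next to its .html source, and file names are assumed not to contain newlines.

```shell
# Convert every .html file below a directory to a .txt file beside it.
# convert_tree is a hypothetical helper name for this example.
convert_tree() {
    dir=${1:-.}
    # Reading names line by line handles spaces, but not newlines in names.
    find "$dir" -type f -name '*.html' | while IFS= read -r f; do
        html2text "$f" > "${f%.html}.txt" < /dev/null
    done
}
```

Running convert_tree html/ would walk the whole html/ tree, while convert_tree with no argument starts from the current directory.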

Now lean back and enjoy reading the docs like in the good old hacker days when .TXT files were fashionable 😉
