Basic Data Extraction and Text Processing in Linux
cut, sort, uniq, wc
Introduction
By themselves, these Linux tools can’t accomplish much. But when used together, the possibilities for what we can accomplish with data processing are endless. Naturally, these tools are used by experts in fields from data science to artificial intelligence. Familiarizing yourself with these tools will unlock a new level of competency for you and give you a leg up on the competition. Other data filtering tools worth mentioning, though not covered here, are sed , grep , and awk .
Prerequisites
- Understand how command piping works in Linux
- Know how to use the cat command
The cut command [man cut]
- Example: cut -d " " -f 1 In this example, the cut command defines a delimiter -d " " and selects the first field -f 1. In other words, the cut operation will find the first instance of a space and cut away everything after that space, for every line. The way the field parameter works is that each delimiter (" " in this case) creates a new field, and you can decide which fields to keep. > out.txt will redirect the output of the previous piped commands into a file called out.txt instead of outputting to the shell.
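A hedged illustration (the file names here are made up) of cutting the first space-separated field from each line of a hypothetical names.txt:

```
$ cat names.txt
alice smith
bob jones
$ cut -d " " -f 1 names.txt > out.txt
$ cat out.txt
alice
bob
```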
The sort command [man sort]
If you use the sort command without any arguments, its default sort rules are as follows:
- numbers sort before letters
- lowercase sorts before uppercase (in typical UTF-8 locales)
Example
If we have a list of numbers, the default sort rules may produce unexpected output: the numbers don’t seem to be in order, because the default sort compares each character individually. To sort numbers by integer order, you can use sort -n , as shown in the sketch below.
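A minimal sketch with a made-up nums.txt:

```
$ cat nums.txt
10
9
200
3
$ sort nums.txt        # character-by-character comparison
10
200
3
9
$ sort -n nums.txt     # numeric comparison
3
9
10
200
```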
The uniq command [man uniq]
Running some text through the default uniq command will output the text file without duplicate lines. Another useful argument, uniq -c will output each line prepended with a count of how many times that line occurred. This is useful if you want to analyze the frequency of certain lines in your data.
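A hedged illustration with a made-up colors.txt, sorted first so duplicates are adjacent:

```
$ cat colors.txt
red
blue
blue
green
red
blue
$ sort colors.txt | uniq -c
      3 blue
      1 green
      2 red
```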
The wc command [man wc]
wc stands for word count, but the wc command can count more than just words. In many cases with text processing and analysis, we want a line count; wc -l will do just that. Check out the man link above for more things wc can do.
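For example (file name hypothetical):

```
wc -l access.log      # number of lines
wc -w access.log      # number of words
wc -c access.log      # number of bytes
```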
Putting it all together
Here’s the structure of a sample access.log file from an IIS webserver (a simplified, hypothetical version is sketched below).
If we wanted to get the frequency of each HTTP status code, we would use a command like cat access.log | cut -d " " -f 5 | sort -n | uniq -c , which would return a count for each status code.
So there are 5 occurrences of 200 status codes, 1 occurrence of a 304, and 1 occurrence of a 404.
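A hedged sketch with a made-up, simplified log (real IIS fields differ; here the status code happens to be the fifth space-separated field):

```
$ cat access.log
2023-03-01 10:00:01 GET /index.html 200 1320
2023-03-01 10:00:05 GET /about.html 200 845
2023-03-01 10:00:09 GET /logo.png 304 0
2023-03-01 10:00:12 GET /index.html 200 1320
2023-03-01 10:00:20 GET /missing.html 404 512
2023-03-01 10:00:31 GET /contact.html 200 978
2023-03-01 10:00:44 GET /index.html 200 1320
$ cat access.log | cut -d " " -f 5 | sort -n | uniq -c
      5 200
      1 304
      1 404
```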
What’s the best tool to do text processing in Linux or Mac? [closed]
I generally need to do a fair amount of text processing for my research, such as removing the last token from all lines, extracting the first two tokens from each line, splitting each line into tokens, etc.
What is the best way to perform this? Should I learn Perl for this? Or should I learn some kind of shell commands? The main concern is speed. If I need to write long code for such stuff, it defeats the purpose.
I started learning sed on @Mimisbrunnr's recommendation and could already do what I needed. But it seems people favor awk more, so I will try that. Thanks for all your replies.
5 Answers
Perl and awk come to mind, although Python will do, if you’d rather not learn a new language.
Perl’s a general purpose language, awk’s more oriented to text processing of the type you’ve described.
For doing simple stream editing, sed is a great utility that comes standard on most *nix boxes, but for anything much more complex than that I would suggest getting into Perl. The learning curve isn’t that bad and it’s great for writing most forms of regular text parsing. A great reference can be found here.
*nix tools such as awk/grep/tail/head/sed etc. are good file processing tools. If you want to search for patterns in files and process them, you can use awk. For big files, you can use a combination of grep + awk: grep for its speed in pattern searching and awk for its ability to manipulate text. With regard to sed, often what sed does, awk can already do, so I find it redundant to use sed for file processing.
In terms of speed of processing files, awk is often on par, or sometimes better than Perl or other languages.
Also, two very good tools for getting the front and back portion of a file fast are head and tail . So to get the last lines, you can use tail .
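For example, with a hypothetical data.txt:

```
head -n 2 data.txt     # first 2 lines
tail -n 5 data.txt     # last 5 lines
tail -n +10 data.txt   # everything from line 10 onwards
```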
The best tool depends on the task to be performed, of course. Besides the usual *nix tools like sed/awk etc. and programming languages (Perl, Python) cited by others, for the text processing I currently need, where the original data format doesn’t follow rigid parsing rules but may vary slightly, I have found that Vim macros and Vimscript functions, which I call inside the Vim editor, work very well.
Something like this (for the Vim uninitiated): you write the processing function(s), e.g. TxtProcessingToBeDone1(), in a file script.vim, source it with :source script.vim, then open the file(s) you want to edit and run :call TxtProcessingToBeDone1() on the whole buffer at once, or as a one-shot operation to be repeated on the spot with the @: and @@ keys. Multiple buffers/files can also be processed at the same time with :bufdo and :argdo.
With a Vimscript function you can repeat all the tasks you would do on a regular editing session (search a pattern, reg-ex, substitution, move to, delete, yank, etc, etc), automate it and also apply some programming control flow (if/then).
Similar considerations apply to other advanced scriptable editors as well.
Text processing using Linux
I need to write a Linux program which reads portions of data from a csv file and writes them into a text file in the following pattern.
NAME : FROM= -100 -346 -249 -125 TO= -346 -249 -125 100 COLOR= COLOR1 COLOR2 COLOR3 COLOR4
NAME will be a fixed row,
FROM and TO information should be retrieved from the csv file, and
COLOR information can be hard coded array of colors from program itself.
From the csv data below, the first value (-100) under MIN will be the first value (-100) under FROM in the text file. The last value (100) from the excel MAX column will be the last value (100) under the text file's TO column. The values under the VALUE column in excel will be rounded and used as TO and FROM per the pattern shown.
2 Answers
awk solution (for your current input file):
I’m assuming there may be more rows (but not columns) of data, both in the first and second section of the file, and that there should be as many COLOR entries on the COLOR row as there are data values on the FROM and TO lines in the output.
Non-numeric lines are skipped by the !/^1/ block.
The data that is repeated in the output is picked up by the third block ( $3 == "" ). That block creates a data and a col string with the appropriate values. Rounding is performed using sprintf() with a format specifying a floating point number with no decimal places.
The minimum and maximum values are picked up from the later section of the input file as the minimum of the second column and the maximum of the third column.
The END block prints out the resulting report.
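The answer's actual awk script isn't reproduced above, so this is only a minimal sketch of the rounding idea it describes, assuming a hypothetical input.csv whose second comma-separated column holds the values:

```
# round each value in the 2nd comma-separated column to the nearest integer
awk -F, '{ printf "%.0f\n", $2 }' input.csv

# or collect the rounded values into a single string, as the answer describes
awk -F, '{ s = s sprintf("%.0f ", $2) } END { print s }' input.csv
```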
Text Processing
The rich set of text processing commands is comprehensive and time saving. Just knowing that they exist is enough to avoid the need to write yet another script (which takes time and effort, plus debugging) – a trap many beginners fall into. An extensive list of text processing commands and examples can be found here
sort
sort lines of text files
As the name implies, this command is used to sort files. How about alphabetic sort and numeric sort? Possible. How about sorting a particular column? Possible. Prioritized multiple sorting orders? Possible. Randomize? Unique? Just about any sorting need is catered to by this powerful command.
Options
- -R random sort
- -r reverse the sort order
- -o redirect sorted result to specified filename, very useful to sort a file inplace
- -n sort numerically
- -V version sort, aware of numbers within text
- -h sort human readable numbers like 4K, 3M, etc
- -k sort via key
- -u sort uniquely
- -b ignore leading white-spaces of a line while sorting
- -t use SEP as the field separator instead of the non-blank to blank transition
Examples
- sort dir_list.txt display sorted file on standard output
- sort -bn numbers.txt -o numbers.txt sort numbers.txt numerically (ignoring leading white-spaces) and overwrite the file with sorted output
- sort -R crypto_keys.txt -o crypto_keys_random.txt sort randomly and write to new file
- shuf crypto_keys.txt -o crypto_keys_random.txt can also be used
- du -sh * | sort -h sort file/directory sizes in current directory in human readable format
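The -t and -k options don't appear in the examples above; a hedged illustration on a hypothetical marks.csv:

```
# numeric sort on the 2nd comma-separated column (and only that column)
sort -t, -k2,2n marks.csv

# sort by 1st column, then numerically by 3rd column as a tie-breaker
sort -t, -k1,1 -k3,3n marks.csv
```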
Further Reading
uniq
report or omit repeated lines
This command is more specific to recognizing duplicates. It usually requires sorted input, as the comparison is made on adjacent lines only
Options
- -d print only duplicate lines
- -c prefix count to occurrences
- -u print only unique lines
Examples
- sort test_list.txt | uniq outputs lines of test_list.txt in sorted order with duplicate lines removed
- uniq <(sort test_list.txt) same command using process substitution
- sort -u test_list.txt equivalent command
- uniq -d sorted_list.txt print only duplicate lines
- uniq -cd sorted_list.txt print only duplicate lines and prefix the line with number of times it is repeated
- uniq -u sorted_list.txt print only unique lines, repeated lines are ignored
- uniq Q&A on unix stackexchange
comm
compare two sorted files line by line
Without any options, it prints output in three columns: lines unique to file1, lines unique to file2, and lines common to both files
Options
- -1 suppress lines unique to file1
- -2 suppress lines unique to file2
- -3 suppress lines common to both files
Examples
- comm -23 sorted_file1.txt sorted_file2.txt print lines unique to sorted_file1.txt
- comm -23 <(sort file1.txt) <(sort file2.txt) same command using process substitution, if sorted input files are not available
- comm -13 sorted_file1.txt sorted_file2.txt print lines unique to sorted_file2.txt
- comm -12 sorted_file1.txt sorted_file2.txt print lines common to both files
- comm Q&A on unix stackexchange
- examples
cmp
compare two files byte by byte
Useful to compare binary files. If the two files are the same, no output is displayed (exit status 0)
If there is a difference, it prints the first difference: line number and byte location (exit status 1)
Option -s allows suppressing the output, useful in scripts
diff
compare files line by line
Useful to compare old and new versions of text files
All the differences are printed, which might not be desirable if the files are long
Options
- -s convey message when two files are same
- -y two column output
- -i ignore case while comparing
- -w ignore white-spaces
- -r recursively compare files between the two directories specified
- -q report if files differ, not the details of difference
Examples
- diff -s test_list_mar2.txt test_list_mar3.txt compare two files
- diff -s report.log bkp/mar10/ no need to specify second filename if names are same
- diff -qr report/ bkp/mar10/report/ recursively compare files between report and bkp/mar10/report directories, filenames not matching are also specified in output
- see this link for detailed analysis and corner cases
- diff report/ bkp/mar10/report/ | grep -w '^diff' useful trick to get only names of mismatching files (provided no mismatches contain the whole word diff at start of line)
Further Reading
- diff Q&A on unix stackexchange
- gvimdiff edit two, three or four versions of a file with Vim and show differences
- GUI diff and merge tools
tr
translate or delete characters
Options
- -d delete the specified characters
- -c complement set of characters to be replaced
Examples
- tr a-z A-Z convert lowercase to uppercase
- tr -d ._ delete the dot and underscore characters
- tr a-z n-za-m < test_list.txt > encrypted_test_list.txt encrypt by replacing every lowercase letter with the 13th letter after it (ROT13)
- Running the same command on the encrypted text will decrypt it
- tr Q&A on unix stackexchange
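A quick hedged demonstration of the examples above (input strings made up):

```
$ echo 'Hello, World_2024.txt' | tr a-z A-Z
HELLO, WORLD_2024.TXT
$ echo 'Hello, World_2024.txt' | tr -d ._
Hello, World2024txt
$ echo 'hello' | tr a-z n-za-m      # ROT13
uryyb
```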
sed
stream editor for filtering and transforming text
Options
- -n suppress automatic printing of pattern space
- -i edit files inplace (makes backup if SUFFIX supplied)
- -r use extended regular expressions
- -e add the script to the commands to be executed
- -f add the contents of script-file to the commands to be executed
- for examples and details, refer to links given below
commands
We’ll be seeing examples only for three commonly used commands
- d Delete the pattern space
- p Print out the pattern space
- s search and replace
- check out ‘Often-Used Commands’ and ‘Less Frequently-Used Commands’ sections in info sed for complete list of commands
range
By default, sed acts on all of input contents. This can be refined to specific line number or a range defined by line numbers, search pattern or mix of the two
- n,m range between nth line and mth line, including n and m
- i~j act on ith line and i+j, i+2j, i+3j, etc
- 1~2 means 1st, 3rd, 5th, 7th, etc lines, i.e. odd-numbered lines
- 5~3 means 5th, 8th, 11th, etc
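A hedged illustration of step addresses (GNU sed), using seq as input:

```
$ seq 6 | sed -n '1~2p'    # odd-numbered lines
1
3
5
$ seq 10 | sed -n '5~3p'   # 5th, 8th, ... lines
5
8
```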
Examples for selective deletion (d)
- sed '/cat/d' story.txt delete every line containing cat
- sed '/cat/!d' story.txt delete every line NOT containing cat
- sed '$d' story.txt delete the last line of the file
- sed '2,5d' story.txt delete lines 2,3,4,5 of the file
- sed '1,/test/d' dir_list.txt delete all lines from the beginning of the file to the first occurrence of a line containing test (the matched line is also deleted)
- sed '/test/,$d' dir_list.txt delete all lines from the line containing test to the end of the file
Examples for selective printing (p)
- sed -n '5p' story.txt print the 5th line; the -n option overrides the default print behavior of sed
- use sed '5q;d' story.txt on large files. Read more
- sed -n '/cat/p' story.txt print every line containing the text cat
- equivalent to sed '/cat/!d' story.txt
- sed -n '4,8!p' story.txt print all lines except lines 4 to 8
- man grep | sed -n '/^\s*exit status/I,/^$/p' extract the exit status information of a command from its manual
- /^\s*exit status/I checks for a line starting with 'exit status' in a case-insensitive way; white-space may be present at the start of the line
- /^$/ empty line
- man ls | sed -n '/^\s*-F/,/^$/p' extract information on a command option from the manual
- /^\s*-F/ line starting with option '-F'; white-space may be present at the start of the line
Examples for search and replace (s)
- sed -i 's/cat/dog/g' story.txt search and replace every occurrence of cat with dog in story.txt
- sed -i.bkp 's/cat/dog/g' story.txt in addition to inplace file editing, create a backup file story.txt.bkp, so that if a mistake happens, the original file can be restored
- sed -i.bkp 's/cat/dog/g' *.txt to perform the operation on all files ending with .txt in the current directory
- sed -i '5,10s/cat/dog/gI' story.txt search and replace every occurrence of cat (case insensitive due to the modifier I) with dog in story.txt, only in line numbers 5 to 10
- sed '/cat/ s/animal/mammal/g' story.txt replace animal with mammal in all lines containing cat
- Since the -i option is not used, output is displayed on standard output and story.txt is not changed
- spacing between range and command is optional; sed '/cat/s/animal/mammal/g' story.txt can also be used
- sed -i -e 's/cat/dog/g' -e 's/lion/tiger/g' story.txt search and replace every occurrence of cat with dog and lion with tiger
- any number of -e options can be used
- sed -i 's/cat/dog/g ; s/lion/tiger/g' story.txt alternative syntax; spacing around ; is optional
- sed -r 's/(.*)/abc: \1 :xyz/' list.txt add prefix 'abc: ' and suffix ' :xyz' to every line of list.txt
- sed -i -r "s/(.*)/$(basename $PWD)\/\1/" dir_list.txt add the current directory name and a forward-slash character at the start of every line
- Note the use of double quotes to perform command substitution
- sed -i -r "s|.*|$HOME/\0|" dir_list.txt add the home directory and a forward-slash at the start of every line
- Since the value of $HOME itself contains forward-slash characters, we cannot use / as the delimiter
- Any character other than backslash or newline can be used as the delimiter, for example | # ^ see this link for more info
- \0 back-reference contains the entire matched string
Example input file
- replace all reg with register
- change start and end address
- Using bash variables
- split inline commented code to comment + code
- range of lines matching pattern
- inplace editing
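The original example input file isn't reproduced above, so the following is only a hedged sketch with an assumed sample.txt, illustrating the first three items in the list:

```
$ cat sample.txt
reg addr
reg data
start address: 0x45
end address: 0xaf

$ sed 's/reg/register/g' sample.txt          # replace all reg with register
register addr
register data
start address: 0x45
end address: 0xaf

$ start=0x10; end=0x3f                       # bash variables in the replacement
$ sed "s/start address: .*/start address: $start/; s/end address: .*/end address: $end/" sample.txt
reg addr
reg data
start address: 0x10
end address: 0x3f
```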
Further Reading
awk
pattern scanning and text processing language
awk derives its name from authors Alfred Aho, Peter Weinberger and Brian Kernighan.
syntax
- awk 'BEGIN {initialize} condition1 {action} condition2 {action} ... END {finish}'
- BEGIN {initialize} used to initialize variables (could be user defined or awk variables or both), executed once - optional block
- condition1 {action} condition2 {action} ... action performed for every line of input; condition is optional, more than one block {} can be used with/without condition
- END {finish} perform action once at end of program - optional block
- commands can be written in a file and passed using the -f option instead of writing it all on command line
- for examples and details, refer to links given below
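A minimal sketch of that BEGIN/condition/END structure (file name and column layout are assumptions):

```
# sum the 2nd column of lines that have at least 2 fields,
# then report the average once at the end
awk 'BEGIN { total = 0; count = 0 }
     NF >= 2 { total += $2; count++ }
     END { if (count > 0) print "average:", total / count }' scores.txt
```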
Example input file
- Just printing something, no input
- search and replace
- when the {action} portion of condition {action} is not specified, by default {print $0} is executed if the condition evaluates to true
- 1 is a generally used awk idiom to print the contents of $0 after performing some processing
- print statement without an argument will print the content of $0
- filtering content
- selecting based on line numbers
- NR is record number
- selecting based on start and end conditions (a sketch follows this list)
- for the following examples
- the numbers 1 to 20 are the input
- regex pattern /4/ is the start condition
- regex pattern /6/ is the end condition
- f is idiomatically used to represent a flag variable
- column manipulations
- by default, one or more consecutive spaces/tabs are considered as field separators
- specifying a different input/output field separator
- can be string alone or regex, multiple separators can be specified using | in regex pattern
- dealing with duplicates, line/field wise
- inplace editing
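Hedged one-liners illustrating the items above (file names are hypothetical; the range example uses seq 20 as described):

```
# select lines 3 to 5 by line number (NR is the record number)
awk 'NR >= 3 && NR <= 5' file.txt

# lines from a match of /4/ up to the next match of /6/ (f is the flag)
# with seq 20 as input this prints 4 5 6 and 14 15 16
seq 20 | awk '/4/{f=1} f; /6/{f=0}'

# use ',' as the input field separator and ' | ' as the output field separator
awk 'BEGIN{FS=","; OFS=" | "} {$1=$1; print}' data.csv

# remove duplicate lines, keeping the first occurrence
awk '!seen[$0]++' file.txt
```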
Further Reading
perl
The Perl 5 language interpreter
Larry Wall wrote Perl as a general purpose scripting language, borrowing features from C, shell scripting, awk, sed, grep, cut, sort etc
Reference tables are given below for frequently used constructs in perl one-liners. Resource links are given at the end for further reading.
Option | Description |
---|---|
-e | execute perl code |
-n | iterate over input files in a loop, lines are NOT printed by default |
-p | iterate over input files in a loop, lines are printed by default |
-l | chomp input line, $\ gets value of $/ if no argument given |
-a | autosplit input lines on space, implicitly sets -n for Perl version 5.20.0 and above |
-F | specifies the pattern to split input lines, implicitly sets -a and -n for Perl version 5.20.0 and above |
-i | edit files inplace, if extension provided make a backup copy |
-0777 | slurp entire file as single string, not advisable for large input files |
Variable | Description |
---|---|
$_ | The default input and pattern-searching space |
$. | Current line number |
$/ | input record separator, newline by default |
$\ | output record separator, empty string by default |
@F | contains the fields of each line read, applicable with -a or -F option |
%ENV | contains current environment variables |
$ARGV | contains the name of the current file |
Function | Description |
---|---|
length | Returns the length in characters of the value of EXPR. If EXPR is omitted, returns the length of $_ |
eof | Returns 1 if the next read on FILEHANDLE will return end of file |
Simple Perl program
Example input file
- Search and replace special characters
The \Q and q() constructs are helpful to nullify regex meta characters
- Print lines based on line number or pattern
- Print range of lines based on line number or pattern
- Dealing with duplicates
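The example file and programs aren't reproduced above, so these are only hedged sketches of the kinds of one-liners the list refers to (file names and patterns are made up):

```
# print the 3rd line, and lines matching a pattern
perl -ne 'print if $. == 3' poem.txt
perl -ne 'print if /rose/' poem.txt

# print a range of lines, by line number or between two patterns
perl -ne 'print if 3..5' poem.txt
perl -ne 'print if /^blue/ .. /^white/' poem.txt

# search and replace a string containing regex metacharacters, treated literally
perl -pe 's/\Q(cost+tax)\E/total/' expr.txt

# remove duplicate lines, keeping the first occurrence
perl -ne 'print if !$seen{$_}++' poem.txt
```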