Basic Data Extraction and Text Processing in Linux
cut, sort, uniq, wc
Introduction
By themselves, these Linux tools can’t accomplish much. But when used together, the possibilities for what we can accomplish with data processing are endless. Naturally, these tools are used by experts in fields from data science to artificial intelligence. Familiarizing yourself with these tools will unlock a new level of competency for you and give you a leg up on the competition. Other data filtering tools worth mentioning, though not covered here, are sed , grep , and awk .
Prerequisites
- Understand how command piping works in Linux
- Know how to use the cat command
The cut command [man cut]
- Example: cut -d " " -f 1 In this example, the cut command defines a delimiter -d " " and selects the first field -f 1. In other words, the cut operation will find the first instance of a space and cut away everything after that space, for every line. The way the field parameter works is that each delimiter (" " in this case) creates a new field, and you can decide which fields to keep. > out.txt will redirect the output of the previous piped commands into a file called out.txt instead of outputting to the shell.
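A hedged illustration (the file names here are made up) of cutting the first space-separated field from each line of a hypothetical names.txt:

```
$ cat names.txt
alice smith
bob jones
$ cut -d " " -f 1 names.txt > out.txt
$ cat out.txt
alice
bob
```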
The sort command [man sort]
If you use the sort command without any arguments, its default sort rules are as follows:
- numbers sort before letters
- lowercase sorts before uppercase (in typical UTF-8 locales)
Example
If we have a list of numbers, the default sort rules may produce unexpected output: the numbers don’t seem to be in order, because the default sort compares each character individually. To sort numbers by integer order, you can use sort -n , as shown in the sketch below.
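A minimal sketch with a made-up nums.txt:

```
$ cat nums.txt
10
9
200
3
$ sort nums.txt        # character-by-character comparison
10
200
3
9
$ sort -n nums.txt     # numeric comparison
3
9
10
200
```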
The uniq command [man uniq]
Running some text through the default uniq command will output the text file without duplicate lines. Another useful argument, uniq -c will output each line prepended with a count of how many times that line occurred. This is useful if you want to analyze the frequency of certain lines in your data.
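A hedged illustration with a made-up colors.txt, sorted first so duplicates are adjacent:

```
$ cat colors.txt
red
blue
blue
green
red
blue
$ sort colors.txt | uniq -c
      3 blue
      1 green
      2 red
```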
The wc command [man wc]
wc stands for word count, but the wc command can count more than just words. In many cases with text processing and analysis, we want a line count; wc -l will do just that. Check out the man link above for more things wc can do.
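For example (file name hypothetical):

```
wc -l access.log      # number of lines
wc -w access.log      # number of words
wc -c access.log      # number of bytes
```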
Putting it all together
Here’s the structure of a sample access.log file from an IIS webserver (a simplified, hypothetical version is sketched below).
If we wanted to get the frequency of each HTTP status code, we would use a command like cat access.log | cut -d " " -f 5 | sort -n | uniq -c , which would return a count for each status code.
So there are 5 occurrences of 200 status codes, 1 occurrence of a 304, and 1 occurrence of a 404.
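A hedged sketch with a made-up, simplified log (real IIS fields differ; here the status code happens to be the fifth space-separated field):

```
$ cat access.log
2023-03-01 10:00:01 GET /index.html 200 1320
2023-03-01 10:00:05 GET /about.html 200 845
2023-03-01 10:00:09 GET /logo.png 304 0
2023-03-01 10:00:12 GET /index.html 200 1320
2023-03-01 10:00:20 GET /missing.html 404 512
2023-03-01 10:00:31 GET /contact.html 200 978
2023-03-01 10:00:44 GET /index.html 200 1320
$ cat access.log | cut -d " " -f 5 | sort -n | uniq -c
      5 200
      1 304
      1 404
```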
What’s the best tool to do text processing in Linux or Mac? [closed]
I generally need to do a fair amount of text processing for my research, such as removing the last token from all lines, extracting the first two tokens from each line, splitting each line into tokens, etc.
What is the best way to perform this? Should I learn Perl for this? Or should I learn some kind of shell commands? The main concern is speed. If I need to write long code for such stuff, it defeats the purpose.
I started learning sed on @Mimisbrunnr's recommendation and could already do what I needed. But it seems people favor awk more, so I will try that. Thanks for all your replies.
5 Answers
Perl and awk come to mind, although Python will do, if you’d rather not learn a new language.
Perl’s a general purpose language, awk’s more oriented to text processing of the type you’ve described.
For doing simple stream editing, sed is a great utility that comes standard on most *nix boxes, but for anything much more complex than that I would suggest getting into Perl. The learning curve isn’t that bad and it’s great for writing most forms of regular text parsing. A great reference can be found here.
*nix tools such as awk/grep/tail/head/sed etc. are good file processing tools. If you want to search for patterns in files and process them, you can use awk. For big files, you can use a combination of grep + awk: grep for its speed in pattern searching and awk for its ability to manipulate text. With regard to sed, often what sed does, awk can already do, so I find it redundant to use sed for file processing.
In terms of speed of processing files, awk is often on par, or sometimes better than Perl or other languages.
Also, two very good tools for getting the front and back portion of a file fast are head and tail . So to get the last lines, you can use tail .
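For example, with a hypothetical data.txt:

```
head -n 2 data.txt     # first 2 lines
tail -n 5 data.txt     # last 5 lines
tail -n +10 data.txt   # everything from line 10 onwards
```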
The best tool depends on the task to be performed, of course. Besides the usual *nix tools like sed/awk etc. and programming languages (Perl, Python) cited by others, for the text processing I currently need, where the original data format doesn’t follow rigid parsing rules but may vary slightly, I have found that Vim macros and Vimscript functions, which I call inside the Vim editor, work very well.
Something like this (for the Vim uninitiated): you write the processing function(s), e.g. TxtProcessingToBeDone1(), in a file script.vim, source it with :source script.vim, then open the file(s) you want to edit and run :call TxtProcessingToBeDone1() on the whole buffer at once, or as a one-shot operation to be repeated on the spot with the @: and @@ keys. Multiple buffers/files can also be processed at the same time with :bufdo and :argdo.
With a Vimscript function you can repeat all the tasks you would do on a regular editing session (search a pattern, reg-ex, substitution, move to, delete, yank, etc, etc), automate it and also apply some programming control flow (if/then).
Similar considerations apply to other advanced scriptable editors as well.
Text processing using Linux
I need to write a Linux program which reads portions of data from a csv file and writes them into a text file in the following pattern.
NAME : FROM= -100 -346 -249 -125 TO= -346 -249 -125 100 COLOR= COLOR1 COLOR2 COLOR3 COLOR4
NAME will be a fixed row,
FROM and TO information should be retrieved from the csv file, and
COLOR information can be hard coded array of colors from program itself.
From the csv data below, the first value (-100) under MIN will be the first value (-100) under FROM in the text file. The last value (100) from the excel MAX column will be the last value (100) under the text file's TO column. The values under the VALUE column in excel will be rounded and used as TO and FROM per the pattern shown.
2 Answers
awk solution (for your current input file):
I’m assuming there may be more rows (but not columns) of data, both in the first and second section of the file, and that there should be as many COLOR entries on the COLOR row as there are data values on the FROM and TO lines in the output.
Non-numeric lines are skipped by the !/^1/ block.
The data that is repeated in the output is picked up by the third block ( $3 == "" ). That block creates a data and a col string with the appropriate values. Rounding is performed using sprintf() with a format specifying a floating point number with no decimal places.
The minimum and maximum values are picked up from the later section of the input file as the minimum of the second column and the maximum of the third column.
The END block prints out the resulting report.
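The answer's actual awk script isn't reproduced above, so this is only a minimal sketch of the rounding idea it describes, assuming a hypothetical input.csv whose second comma-separated column holds the values:

```
# round each value in the 2nd comma-separated column to the nearest integer
awk -F, '{ printf "%.0f\n", $2 }' input.csv

# or collect the rounded values into a single string, as the answer describes
awk -F, '{ s = s sprintf("%.0f ", $2) } END { print s }' input.csv
```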
Text Processing
The rich set of text processing commands is comprehensive and time saving. Just knowing that they exist is enough to avoid the need to write yet another script (which takes time and effort, plus debugging) – a trap many beginners fall into. An extensive list of text processing commands and examples can be found here
sort
sort lines of text files
As the name implies, this command is used to sort files. How about alphabetic sort and numeric sort? Possible. How about sorting a particular column? Possible. Prioritized multiple sorting orders? Possible. Randomize? Unique? Just about any sorting need is catered to by this powerful command.
Options
- -R random sort
- -r reverse the sort order
- -o redirect sorted result to specified filename, very useful to sort a file inplace
- -n sort numerically
- -V version sort, aware of numbers within text
- -h sort human readable numbers like 4K, 3M, etc
- -k sort via key
- -u sort uniquely
- -b ignore leading white-spaces of a line while sorting
- -t use SEP as the field separator instead of the non-blank to blank transition
Examples
- sort dir_list.txt display sorted file on standard output
- sort -bn numbers.txt -o numbers.txt sort numbers.txt numerically (ignoring leading white-spaces) and overwrite the file with sorted output
- sort -R crypto_keys.txt -o crypto_keys_random.txt sort randomly and write to new file
- shuf crypto_keys.txt -o crypto_keys_random.txt can also be used
- du -sh * | sort -h sort file/directory sizes in current directory in human readable format
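The -t and -k options don't appear in the examples above; a hedged illustration on a hypothetical marks.csv:

```
# numeric sort on the 2nd comma-separated column (and only that column)
sort -t, -k2,2n marks.csv

# sort by 1st column, then numerically by 3rd column as a tie-breaker
sort -t, -k1,1 -k3,3n marks.csv
```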
Further Reading
uniq
report or omit repeated lines
This command is more specific to recognizing duplicates. It usually requires sorted input, as the comparison is made on adjacent lines only
Options
- -d print only duplicate lines
- -c prefix count to occurrences
- -u print only unique lines
Examples
- sort test_list.txt | uniq outputs lines of test_list.txt in sorted order with duplicate lines removed
- uniq <(sort test_list.txt) same command using process substitution
- sort -u test_list.txt equivalent command
- uniq -d sorted_list.txt print only duplicate lines
- uniq -cd sorted_list.txt print only duplicate lines and prefix the line with number of times it is repeated
- uniq -u sorted_list.txt print only unique lines, repeated lines are ignored
- uniq Q&A on unix stackexchange
comm
compare two sorted files line by line
Without any options, it prints output in three columns: lines unique to file1, lines unique to file2, and lines common to both files
Options
- -1 suppress lines unique to file1
- -2 suppress lines unique to file2
- -3 suppress lines common to both files
Examples
- comm -23 sorted_file1.txt sorted_file2.txt print lines unique to sorted_file1.txt
- comm -23 <(sort file1.txt) <(sort file2.txt) same command using process substitution, if sorted input files are not available
- comm -13 sorted_file1.txt sorted_file2.txt print lines unique to sorted_file2.txt
- comm -12 sorted_file1.txt sorted_file2.txt print lines common to both files
- comm Q&A on unix stackexchange
- examples
cmp
compare two files byte by byte
Useful to compare binary files. If the two files are the same, no output is displayed (exit status 0)
If there is a difference, it prints the first difference: line number and byte location (exit status 1)
Option -s allows suppressing the output, useful in scripts
diff
compare files line by line
Useful to compare old and new versions of text files
All the differences are printed, which might not be desirable if the files are long
Options
- -s convey message when two files are same
- -y two column output
- -i ignore case while comparing
- -w ignore white-spaces
- -r recursively compare files between the two directories specified
- -q report if files differ, not the details of difference
Examples
- diff -s test_list_mar2.txt test_list_mar3.txt compare two files
- diff -s report.log bkp/mar10/ no need to specify second filename if names are same
- diff -qr report/ bkp/mar10/report/ recursively compare files between report and bkp/mar10/report directories, filenames not matching are also specified in output
- see this link for detailed analysis and corner cases
- diff report/ bkp/mar10/report/ | grep -w '^diff' useful trick to get only names of mismatching files (provided no mismatches contain the whole word diff at start of line)
Further Reading
- diff Q&A on unix stackexchange
- gvimdiff edit two, three or four versions of a file with Vim and show differences
- GUI diff and merge tools
tr
translate or delete characters
Options
- -d delete the specified characters
- -c complement set of characters to be replaced
Examples
- tr a-z A-Z convert lowercase to uppercase
- tr -d ._ delete the dot and underscore characters
- tr a-z n-za-m < test_list.txt > encrypted_test_list.txt encrypt by replacing every lowercase letter with the 13th letter after it (ROT13)
- Running the same command on the encrypted text will decrypt it
- tr Q&A on unix stackexchange
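A quick hedged demonstration of the examples above (input strings made up):

```
$ echo 'Hello, World_2024.txt' | tr a-z A-Z
HELLO, WORLD_2024.TXT
$ echo 'Hello, World_2024.txt' | tr -d ._
Hello, World2024txt
$ echo 'hello' | tr a-z n-za-m      # ROT13
uryyb
```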
sed
stream editor for filtering and transforming text
Options
- -n suppress automatic printing of pattern space
- -i edit files inplace (makes backup if SUFFIX supplied)
- -r use extended regular expressions
- -e add the script to the commands to be executed
- -f add the contents of script-file to the commands to be executed
- for examples and details, refer to links given below
commands
We’ll be seeing examples only for three commonly used commands
- d Delete the pattern space
- p Print out the pattern space
- s search and replace
- check out ‘Often-Used Commands’ and ‘Less Frequently-Used Commands’ sections in info sed for complete list of commands
range
By default, sed acts on all of input contents. This can be refined to specific line number or a range defined by line numbers, search pattern or mix of the two
- n,m range between nth line and mth line, including n and m
- i~j act on ith line and i+j, i+2j, i+3j, etc
- 1~2 means 1st, 3rd, 5th, 7th, etc lines, i.e. odd-numbered lines
- 5~3 means 5th, 8th, 11th, etc
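A hedged illustration of step addresses (GNU sed), using seq as input:

```
$ seq 6 | sed -n '1~2p'    # odd-numbered lines
1
3
5
$ seq 10 | sed -n '5~3p'   # 5th, 8th, ... lines
5
8
```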
Examples for selective deletion (d)
- sed '/cat/d' story.txt delete every line containing cat
- sed '/cat/!d' story.txt delete every line NOT containing cat
- sed '$d' story.txt delete the last line of the file
- sed '2,5d' story.txt delete lines 2,3,4,5 of the file
- sed '1,/test/d' dir_list.txt delete all lines from the beginning of the file to the first occurrence of a line containing test (the matched line is also deleted)
- sed '/test/,$d' dir_list.txt delete all lines from the line containing test to the end of the file
Examples for selective printing (p)
- sed -n '5p' story.txt print the 5th line; the -n option overrides the default print behavior of sed
- use sed '5q;d' story.txt on large files. Read more
- sed -n '/cat/p' story.txt print every line containing the text cat
- equivalent to sed '/cat/!d' story.txt
- sed -n '4,8!p' story.txt print all lines except lines 4 to 8
- man grep | sed -n '/^\s*exit status/I,/^$/p' extract the exit status information of a command from its manual
- /^\s*exit status/I checks for a line starting with 'exit status' in a case-insensitive way; white-space may be present at the start of the line
- /^$/ empty line
- man ls | sed -n '/^\s*-F/,/^$/p' extract information on a command option from the manual
- /^\s*-F/ line starting with option '-F'; white-space may be present at the start of the line
Examples for search and replace (s)
- sed -i 's/cat/dog/g' story.txt search and replace every occurrence of cat with dog in story.txt
- sed -i.bkp 's/cat/dog/g' story.txt in addition to inplace file editing, create a backup file story.txt.bkp, so that if a mistake happens, the original file can be restored
- sed -i.bkp 's/cat/dog/g' *.txt to perform the operation on all files ending with .txt in the current directory
- sed -i '5,10s/cat/dog/gI' story.txt search and replace every occurrence of cat (case insensitive due to the modifier I) with dog in story.txt, only in line numbers 5 to 10
- sed '/cat/ s/animal/mammal/g' story.txt replace animal with mammal in all lines containing cat
- Since the -i option is not used, output is displayed on standard output and story.txt is not changed
- spacing between range and command is optional; sed '/cat/s/animal/mammal/g' story.txt can also be used
- sed -i -e 's/cat/dog/g' -e 's/lion/tiger/g' story.txt search and replace every occurrence of cat with dog and lion with tiger
- any number of -e options can be used
- sed -i 's/cat/dog/g ; s/lion/tiger/g' story.txt alternative syntax; spacing around ; is optional
- sed -r 's/(.*)/abc: \1 :xyz/' list.txt add prefix 'abc: ' and suffix ' :xyz' to every line of list.txt
- sed -i -r "s/(.*)/$(basename $PWD)\/\1/" dir_list.txt add the current directory name and a forward-slash character at the start of every line
- Note the use of double quotes to perform command substitution
- sed -i -r "s|.*|$HOME/\0|" dir_list.txt add the home directory and a forward-slash at the start of every line
- Since the value of $HOME itself contains forward-slash characters, we cannot use / as the delimiter
- Any character other than backslash or newline can be used as the delimiter, for example | # ^ see this link for more info
- \0 back-reference contains the entire matched string
Example input file
- replace all reg with register
- change start and end address
- Using bash variables
- split inline commented code to comment + code
- range of lines matching pattern
- inplace editing
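The original example input file isn't reproduced above, so the following is only a hedged sketch with an assumed sample.txt, illustrating the first three items in the list:

```
$ cat sample.txt
reg addr
reg data
start address: 0x45
end address: 0xaf

$ sed 's/reg/register/g' sample.txt          # replace all reg with register
register addr
register data
start address: 0x45
end address: 0xaf

$ start=0x10; end=0x3f                       # bash variables in the replacement
$ sed "s/start address: .*/start address: $start/; s/end address: .*/end address: $end/" sample.txt
reg addr
reg data
start address: 0x10
end address: 0x3f
```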
Further Reading
awk
pattern scanning and text processing language
awk derives its name from authors Alfred Aho, Peter Weinberger and Brian Kernighan.
syntax
- awk 'BEGIN {initialize} condition1 {action} condition2 {action} ... END {finish}'
- BEGIN {initialize} used to initialize variables (could be user defined or awk variables or both), executed once - optional block
- condition1 {action} condition2 {action} ... action performed for every line of input; condition is optional, more than one block {} can be used with/without condition
- END {finish} perform action once at end of program - optional block
- commands can be written in a file and passed using the -f option instead of writing it all on command line
- for examples and details, refer to links given below
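A minimal sketch of that BEGIN/condition/END structure (file name and column layout are assumptions):

```
# sum the 2nd column of lines that have at least 2 fields,
# then report the average once at the end
awk 'BEGIN { total = 0; count = 0 }
     NF >= 2 { total += $2; count++ }
     END { if (count > 0) print "average:", total / count }' scores.txt
```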
Example input file
- Just printing something, no input
- search and replace
- when the {action} portion of condition {action} is not specified, by default {print $0} is executed if the condition evaluates to true
- 1 is a generally used awk idiom to print the contents of $0 after performing some processing
- print statement without an argument will print the content of $0
- filtering content
- selecting based on line numbers
- NR is record number
- selecting based on start and end conditions (a sketch follows this list)
- for the following examples
- the numbers 1 to 20 are the input
- regex pattern /4/ is the start condition
- regex pattern /6/ is the end condition
- f is idiomatically used to represent a flag variable
- column manipulations
- by default, one or more consecutive spaces/tabs are considered as field separators
- specifying a different input/output field separator
- can be string alone or regex, multiple separators can be specified using | in regex pattern
- dealing with duplicates, line/field wise
- inplace editing
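Hedged one-liners illustrating the items above (file names are hypothetical; the range example uses seq 20 as described):

```
# select lines 3 to 5 by line number (NR is the record number)
awk 'NR >= 3 && NR <= 5' file.txt

# lines from a match of /4/ up to the next match of /6/ (f is the flag)
# with seq 20 as input this prints 4 5 6 and 14 15 16
seq 20 | awk '/4/{f=1} f; /6/{f=0}'

# use ',' as the input field separator and ' | ' as the output field separator
awk 'BEGIN{FS=","; OFS=" | "} {$1=$1; print}' data.csv

# remove duplicate lines, keeping the first occurrence
awk '!seen[$0]++' file.txt
```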
Further Reading
perl
The Perl 5 language interpreter
Larry Wall wrote Perl as a general purpose scripting language, borrowing features from C, shell scripting, awk, sed, grep, cut, sort etc
Reference tables are given below for frequently used constructs in perl one-liners. Resource links are given at the end for further reading.
Option | Description |
---|---|
-e | execute perl code |
-n | iterate over input files in a loop, lines are NOT printed by default |
-p | iterate over input files in a loop, lines are printed by default |
-l | chomp input line, $\ gets value of $/ if no argument given |
-a | autosplit input lines on space, implicitly sets -n for Perl version 5.20.0 and above |
-F | specifies the pattern to split input lines, implicitly sets -a and -n for Perl version 5.20.0 and above |
-i | edit files inplace, if extension provided make a backup copy |
-0777 | slurp entire file as single string, not advisable for large input files |
Variable | Description |
---|---|
$_ | The default input and pattern-searching space |
$. | Current line number |
$/ | input record separator, newline by default |
$\ | output record separator, empty string by default |
@F | contains the fields of each line read, applicable with -a or -F option |
%ENV | contains current environment variables |
$ARGV | contains the name of the current file |
Function | Description |
---|---|
length | Returns the length in characters of the value of EXPR. If EXPR is omitted, returns the length of $_ |
eof | Returns 1 if the next read on FILEHANDLE will return end of file |
Simple Perl program
Example input file
- Search and replace special characters
The \Q and q() constructs are helpful to nullify regex meta characters
- Print lines based on line number or pattern
- Print range of lines based on line number or pattern
- Dealing with duplicates
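The example file and programs aren't reproduced above, so these are only hedged sketches of the kinds of one-liners the list refers to (file names and patterns are made up):

```
# print the 3rd line, and lines matching a pattern
perl -ne 'print if $. == 3' poem.txt
perl -ne 'print if /rose/' poem.txt

# print a range of lines, by line number or between two patterns
perl -ne 'print if 3..5' poem.txt
perl -ne 'print if /^blue/ .. /^white/' poem.txt

# search and replace a string containing regex metacharacters, treated literally
perl -pe 's/\Q(cost+tax)\E/total/' expr.txt

# remove duplicate lines, keeping the first occurrence
perl -ne 'print if !$seen{$_}++' poem.txt
```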