- Linux Shell – How To Remove Duplicate Text Lines
- Removing Duplicate Lines With Sort, Uniq and Shell Pipes
- Remove duplicate lines with uniq
- Removing duplicate lines from a text file on Linux
- Sort file contents on Linux
- How to remove duplicate lines on Linux with uniq command
- How to remove duplicate lines in a .txt file and save result to the new file
- Conclusion
- How to remove duplicate lines from files preserving their order
- How it works
- Why not use the uniq command?
- Other approaches
- Using the sort command
- Using cat, sort and cut
- How it works
- How to remove duplicate lines inside a text file?
- 10 Answers
- How to delete duplicate lines in a file without sorting it in Unix?
- 9 Answers
- The first solution is also from http://sed.sourceforge.net/sed1line.txt
- The second solution is easy to understand (my own):
- Remove duplicate lines while keeping the order of the lines
- 5 Answers
Linux Shell – How To Remove Duplicate Text Lines
Removing Duplicate Lines With Sort, Uniq and Shell Pipes
Use the following syntax:
sort file-name | uniq -u
sort file.log | uniq -u
Remove duplicate lines with uniq
Here is a sample test file called garbage.txt displayed using the cat command:
cat garbage.txt
Sample outputs:
Removing duplicate lines from a text file on Linux
Type the following command to get rid of all duplicate lines:
$ sort garbage.txt | uniq -u
Sample output:
- -u : only print lines that are never repeated; every line that has a duplicate is removed entirely.
Sort file contents on Linux
Let us say you have a file named users.txt:
cat users.txt
Sample outputs:
Let us sort the file; run:
sort users.txt
Next, sort by last name, the third field (note that the old +2 syntax below is equivalent to -k 3 in modern sort), run:
sort +2 users.txt
Want to sort in reverse order? Try:
sort -r users.txt
You can eliminate any duplicate entries while sorting the file; run:
sort +2 -u users.txt
sort -u users.txt
Without any options, the sort compares entire lines in the file and outputs them in ASCII order. You can control output with options.
How to remove duplicate lines on Linux with uniq command
Consider the following file:
cat -n telphone.txt
Sample outputs:
The uniq command removes the duplicated 8th line from the file and places the result in a file called output.txt:
uniq telphone.txt output.txt
Verify it:
cat -n output.txt
How to remove duplicate lines in a .txt file and save result to the new file
Try any one of the following syntax:
sort input_file | uniq > output_file
sort input_file | uniq -u | tee output_file
Conclusion
The sort command orders the lines of a text file, and uniq filters out adjacent duplicate lines. Both commands have many more useful options. I suggest you read the man pages by typing the following man command:
man sort
man uniq
How to remove duplicate lines from files preserving their order
Suppose you have a text file and you need to remove all of its duplicate lines.
To remove the duplicate lines preserving their order in the file use:
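The whole job is done by the single awk one-liner described in detail below (the file names here are placeholders):
awk '!visited[$0]++' your_file > deduplicated_file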
How it works
The script keeps an associative array with indices equal to the unique lines of the file and values equal to their occurrences. For each line of the file, if the line occurrences are zero then it increases them by one and prints the line, otherwise it just increases the occurrences without printing the line.
I was not familiar with awk, and I wanted to understand how this is accomplished with such a short script (awk-ward). I did my research, and here is what is going on:
- the awk “script” !visited[$0]++ is executed for each line of the input file
- visited[] is a variable of type associative array (a.k.a. map). We don’t have to initialize it; awk will do this for us the first time we access it.
- the $0 variable holds the contents of the line currently being processed
- visited[$0] accesses the value stored in the map with a key equal to $0 (the line being processed), a.k.a. the occurrences (which we set below)
- the ! negates the occurrences value:
- In awk, any nonzero numeric value or any nonempty string value is true
- By default, variables are initialized to the empty string, which is zero if converted to a number
- That being said:
- if visited[$0] returns a number greater than zero, this negation resolves to false.
- if visited[$0] returns a number equal to zero or an empty string, this negation resolves to true.
- the ++ operation increases the variable’s value (visited[$0]) by one.
- If the value is empty, awk converts it to 0 (a number) automatically and then increases it.
- Note: the operation is executed after we access the variable’s value.
Summing up, the whole expression evaluates to:
- true if the occurrences are zero/empty string
- false if the occurrences are greater than zero
If the pattern succeeds, the associated action is executed. If we don’t provide an action, awk prints the input by default.
An omitted action is equivalent to { print $0 }.
Our script consists of one awk statement with an expression, omitting the action. So this:
awk '!visited[$0]++'
is equivalent to this:
awk '!visited[$0]++ { print $0 }'
For every line of the file, if the expression succeeds, the line is printed to the output. Otherwise, the action is not executed and nothing is printed.
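As a quick check, here is the one-liner applied to a small sample file (the file name and contents are made up for illustration):
$ printf 'a\nb\na\nc\nb\n' > sample.txt
$ awk '!visited[$0]++' sample.txt
a
b
c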
Why not use the uniq command?
The uniq command removes only adjacent duplicate lines. Demonstration:
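With made-up sample input, the non-adjacent repetition of a survives:
$ printf 'a\na\nb\na\n' | uniq
a
b
a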
Other approaches
Using the sort command
We can also remove the duplicate lines with sort -u, but the line order is not preserved.
Using cat, sort and cut
The previous approach would produce a de-duplicated file whose lines would be sorted based on their contents. By piping a few commands together, we can overcome this issue:
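With the example file test.txt used in the walkthrough below, the pipeline is:
cat -n test.txt | sort -uk2 | sort -nk1 | cut -f2-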
How it works
Suppose we have the following file:
cat -n test.txt prepends the line number to each line.
sort -uk2 sorts the lines based on the second column (-k2 option) and keeps only the first occurrence of lines with the same second-column value (-u option)
sort -nk1 sorts the lines based on their first column (-k1 option), treating the column as a number (-n option)
Finally, cut -f2- prints each line from the second column to its end (-f2- option: note the trailing - which instructs cut to include the rest of the line)
How to remove duplicate lines inside a text file?
A huge (up to 2 GiB) text file of mine contains about 100 exact duplicates of every line in it (useless in my case, as the file is a CSV-like data table).
What I need is to remove all the repetitions while (preferably, but this can be sacrificed for a significant performance boost) maintaining the original line order. In the result, each line is to be unique. If there were 100 equal lines (usually the duplicates are spread across the file and won’t be neighbours), only one of the kind is to be left.
I have written a program in Scala (consider it Java if you don’t know about Scala) to implement this. But maybe there are native, C-written tools able to do this faster?
UPDATE: the awk '!seen[$0]++' filename solution seemed to work just fine for me as long as the files were around 2 GiB or smaller, but now that I need to clean up an 8 GiB file it doesn’t work any more. It seems to take forever on a Mac with 4 GiB RAM, and a 64-bit Windows 7 PC with 4 GiB RAM and 6 GiB swap just runs out of memory. I don’t feel enthusiastic about trying it on Linux with 4 GiB RAM given this experience.
10 Answers
An awk solution seen on #bash (Freenode):
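That solution, the same one the question's UPDATE refers to, is (filename is a placeholder):
awk '!seen[$0]++' filename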
There’s a simple (which is not to say obvious) method using standard utilities which doesn’t require a large memory except to run sort , which in most implementations has specific optimizations for huge files (a good external sort algorithm). An advantage of this method is that it only loops over all the lines inside special-purpose utilities, never inside interpreted languages.
If all lines begin with a non-whitespace character, you can dispense with some of the options:
For a large amount of duplication, a method that only requires storing a single copy of each line in memory will perform better. With some interpretation overhead, there’s a very concise awk script for that (already posted by enzotib):
How to delete duplicate lines in a file without sorting it in Unix?
Is there a way to delete duplicate lines in a file in Unix?
I can do it with the sort -u and uniq commands, but I want to use sed or awk. Is that possible?
9 Answers
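The awk answer explained below is the classic one-liner (filename is a placeholder):
awk '!seen[$0]++' file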
seen is an associative array to which Awk passes every line of the file as a key. If a line isn’t in the array, then seen[$0] will evaluate to false. The ! is the logical NOT operator and inverts the false to true. Awk prints the lines where the expression evaluates to true. The ++ increments seen so that seen[$0] == 1 after the first time a line is found, then seen[$0] == 2, and so on.
Awk evaluates everything but 0 and "" (the empty string) to true. If a duplicate line is already in seen, then !seen[$0] will evaluate to false and the line will not be written to the output.
The ^([ -~]*\n).*\n\1/d; s/\n//; h; P portion of the script means, roughly: "Append the whole hold space to this line, then if you see a duplicated line throw the whole thing out; otherwise copy the whole mess back into the hold space and print the first part (which is the line you just read)."
The [ -~] expression represents a range of ASCII characters from 0x20 (space) to 0x7E (tilde). These are considered the printable ASCII characters (the linked page also has 0x7F/delete, but that doesn’t seem right). That makes the solution broken for anyone not using ASCII or anyone using, say, tab characters. The more portable [^\n] includes a whole lot more characters: all of them except one, in fact.
Perl one-liner similar to @jonas’s awk solution:
This variation removes trailing whitespace before comparing:
This variation edits the file in-place:
This variation edits the file in-place, and makes a backup file.bak
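The answer's exact code is not preserved in this copy; assuming the standard Perl idiom, the four variants would look roughly like this (file names are placeholders):
perl -ne 'print unless $seen{$_}++' input.txt                              # basic one-liner
perl -ne '(my $k = $_) =~ s/\s+$//; print unless $seen{$k}++' input.txt    # strip trailing whitespace before comparing
perl -i -ne 'print unless $seen{$_}++' input.txt                           # edit the file in place
perl -i.bak -ne 'print unless $seen{$_}++' input.txt                       # edit in place, keep input.txt.bak as a backup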
An alternative way using Vim (Vi compatible):
Delete duplicate, consecutive lines from a file:
vim -esu NONE +’g/\v^(.*)\n\1$/d’ +wq
Delete duplicate, nonconsecutive and nonempty lines from a file:
The one-liner that Andre Miller posted above works, except for recent versions of sed when the input file ends with a blank line with no characters. On my Mac, my CPU just spins.
Infinite loop if last line is blank and has no chars:
Doesn’t hang, but you lose the last line
The explanation is at the very end of the sed FAQ:
The GNU sed maintainer felt that despite the portability problems
this would cause, changing the N command to print (rather than
delete) the pattern space was more consistent with one’s intuitions
about how a command to «append the Next line» ought to behave.
Another fact favoring the change was that «» will
delete the last line if the file has an odd number of lines, but
print the last line if the file has an even number of lines.
To convert scripts which used the former behavior of N (deleting
the pattern space upon reaching the EOF) to scripts compatible with
all versions of sed, change a lone «N;» to «$d;N;».
The first solution is also from http://sed.sourceforge.net/sed1line.txt
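That one-liner, which the steps below walk through, is (filename is a placeholder):
sed '$!N; /^\(.*\)\n\1$/!P; D' file.txt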
the core idea is:
- $!N; : if the current line is NOT the last line, use the N command to read the next line into the pattern space.
- /^(.*)\n\1$/!P : if the contents of the current pattern space are two duplicate strings separated by \n, the next line is the same as the current line, so we must NOT print it according to our core idea; otherwise, the current line is the LAST appearance of its run of duplicate consecutive lines, and we can use the P command to print the characters in the current pattern space up to \n (the \n is also printed).
- D : we use the D command to delete the characters in the current pattern space up to \n (the \n is also deleted); the content of the pattern space is then the next line.
- the D command also forces sed to jump back to its FIRST command, $!N, without reading the next line from the file or standard input stream.
The second solution is easy to understand (my own):
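The original one-liner isn't reproduced in this copy; reconstructed from the steps below, it would look roughly like this in GNU sed (filename is a placeholder):
sed -n 'p; :loop; N; s/^\(.*\)\n\1$/\1/; tloop; D' file.txt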
the core idea is:
- read a new line from the input stream or file and print it once.
- use the :loop command to set a label named loop.
- use N to read the next line into the pattern space.
- use s/^(.*)\n\1$/\1/ to delete the current line if the next line is the same as the current line; we use the s command to do the delete action.
- if the s command is executed successfully, the tloop command forces sed to jump to the label named loop, which repeats the loop over the following lines until there are no more duplicate consecutive copies of the line that was last printed; otherwise, the D command deletes the line that is the same as the latest-printed line and forces sed to jump back to the first command, which is the p command; the content of the current pattern space then becomes the next new line.
Remove duplicate lines while keeping the order of the lines
The "server" has: 8 GByte RAM + 16 GByte swap, > 300 GByte free space, amd64, a desktop CPU. Scientific Linux 6.6. Nothing else runs on it to create load. Awk aborts after a few seconds; out.txt is 1.6 GByte. GNU Awk 3.1.7.
Question: How can I remove the duplicate lines while keeping the order of the lines? Case matters too, e.g. "A" and "a" are two different lines and both have to be kept. But "a" and "a" are duplicates; only the first one is needed.
The answer could use anything: if awk is not good for this, then perl/sed. What could the problem be?
Update: I tried this on a RHEL machine; it doesn't abort, but I didn't have time to wait for it to finish. Why does Scientific Linux differ from RHEL?
Update: I'm trying it on an Ubuntu 14 virtual guest; so far it works! It's not an ulimit problem: mawk 1.3.3.
5 Answers
I doubt it will make a difference but, just in case, here’s how to do the same thing in Perl:
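A minimal sketch of such a Perl one-liner, assuming the same logic as the awk version (the answer's exact code isn't preserved in this copy):
perl -ne 'print if !$seen{$_}++' out.txt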
If the problem is keeping the unique lines in memory, that will have the same issue as the awk you tried. So, another approach could be:
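A sketch of that approach; the original answer's exact command isn't preserved in this copy, and the uniq -f1 step plus the final re-sort and cut are assumptions that complete the idea described below:
cat -n out.txt | sort -k2 -k1n | uniq -f1 | sort -nk1 | cut -f2-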
On a GNU system, cat -n will prepend the line number to each line, after some amount of spaces and followed by a tab character. cat pipes this input representation to sort.
sort's -k2 option instructs it to consider only the characters from the second field until the end of the line when sorting, and sort splits fields by default on whitespace (or cat's inserted spaces and tab).
When followed by -k1n, sort considers the 2nd field first and then, in the case of identical -k2 fields, it considers the 1st field, sorted numerically. So repeated lines will be sorted together, in the order they appeared.