How to generate list of unique lines in text file using a Linux shell script?
Suppose I have a file that contains a bunch of lines, some repeating:
What Linux command(s) should I use to generate a list of unique lines:
Does this change if the file is unsorted, i.e. repeating lines may not be in blocks?
4 Answers
If you don't mind the output being sorted, use a command that sorts and removes duplicates in one pass:
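The command itself is missing from this answer; presumably it is sort with the -u flag (the filename is illustrative):
    sort -u file.txt
(This is equivalent to sort file.txt | uniq.)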
cat to output the contents, piped to sort to sort them, piped to uniq to print out the unique values:
cat test1.txt | sort | uniq
you don’t need to do the sort part if the file contents are already sorted.
Create a new sorted file with unique lines:
Create a new file with unique lines, keeping the original (unsorted) order:
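The commands are missing from this answer; a sketch of both, with illustrative file names:
    # sorted, duplicates removed
    sort -u input.txt > unique_sorted.txt
    # duplicates removed but original order preserved (awk keeps the first occurrence of each line)
    awk '!seen[$0]++' input.txt > unique_unsorted.txt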
If we do not care about the order, then the best solution is actually:
If we also want to ignore letter case, we can do it this way (as a result, all letters will be converted to uppercase):
It would seem that an even better idea is to use the command:
and if we also want to ignore letter case (as a result, the first line of each group of duplicates is returned, with its case unchanged):
However, in this case a completely different result may be returned than when we use the sort command, because uniq does not detect repeated lines unless they are adjacent.
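The commands themselves are missing above; a hedged reconstruction of what is being compared (filenames are illustrative):
    # order-insensitive unique lines
    sort -u file.txt
    # ignoring case by uppercasing first (output ends up all uppercase)
    tr '[:lower:]' '[:upper:]' < file.txt | sort -u
    # uniq alone: removes only adjacent duplicates
    uniq file.txt
    # uniq ignoring case; the first line of each adjacent group keeps its case
    uniq -i file.txt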
Best way to simulate "group by" from bash?
Suppose you have a file that contains IP addresses, one address in each line:
You need a shell script that counts for each IP address how many times it appears in the file. For the previous input you need the following output:
One way to do this is:
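The question's snippet is missing here; an approach of the same inefficient kind might look like this (illustrative only, filename assumed):
    # rescans the whole file once per unique address
    for ip in $(sort -u ip_addresses); do
        printf '%s %d\n' "$ip" "$(grep -cxF "$ip" ip_addresses)"
    done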
However it is really far from being efficient.
How would you solve this problem more efficiently using bash?
(One thing to add: I know it can be solved with Perl or awk; I'm interested in a better solution in bash, not in those languages.)
Suppose that the source file is 5GB and the machine running the algorithm has 4GB of RAM. So sort is not an efficient solution, nor is reading the file more than once.
I liked the hashtable-like solution; can anybody provide improvements to that solution?
ADDITIONAL INFO #2:
Some people asked why I would bother doing this in bash when it is way easier in e.g. Perl. The reason is that on the machine where I had to do this, Perl wasn't available to me. It was a custom-built Linux machine without most of the tools I'm used to. And I think it was an interesting problem.
So please, don’t blame the question, just ignore it if you don’t like it. 🙂
14 Answers
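The command this first answer shows is missing from the text; presumably it is the classic sort-then-count pipeline (filename assumed):
    sort ip_addresses | uniq -c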
This will print the count first, but other than that it should be exactly what you want.
The quick and dirty method is as follows:
cat ip_addresses | sort -n | uniq -c
If you need to use the values in bash you can assign the whole command to a bash variable and then loop through the results.
If the sort command is omitted, you will not get the correct results as uniq only looks at successive identical lines.
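A sketch of how the results could be consumed in bash (variable and file names are illustrative):
    # read "count address" pairs produced by uniq -c
    while read -r count ip; do
        printf '%s appears %s times\n' "$ip" "$count"
    done < <(sort ip_addresses | uniq -c)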
For summing up multiple fields, grouped by a set of existing fields, use the example below (replace $1, $2, $3, $4 according to your requirements):
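The awk example itself is missing here; a sketch of the idea (field choices and filename are illustrative):
    # sum the 4th field, grouped by the combination of fields 1-3
    awk '{ sum[$1 "," $2 "," $3] += $4 } END { for (key in sum) print key, sum[key] }' file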
The canonical solution is the one mentioned by another respondent:
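Presumably the pipeline meant is:
    cat ip_addresses | sort | uniq -c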
It is shorter and more concise than what can be written in Perl or awk.
You write that you don’t want to use sort, because the data’s size is larger than the machine’s main memory size. Don’t underestimate the implementation quality of the Unix sort command. Sort was used to handle very large volumes of data (think the original AT&T’s billing data) on machines with 128k (that’s 131,072 bytes) of memory (PDP-11). When sort encounters more data than a preset limit (often tuned close to the size of the machine’s main memory) it sorts the data it has read in main memory and writes it into a temporary file. It then repeats the action with the next chunks of data. Finally, it performs a merge sort on those intermediate files. This allows sort to work on data many times larger than the machine’s main memory.
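For the memory concern specifically, GNU sort exposes knobs for the buffer size and the spill directory; a sketch (sizes and paths are illustrative):
    # cap the in-memory buffer at 1GB and spill temporary runs to /var/tmp
    sort -S 1G -T /var/tmp ip_addresses | uniq -c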
Sort lines of a text file
The sort command is used to sort the lines of a text file in Linux. You can provide several command line options for sorting data in a text file.
Here is an example file:
To sort the file in alphabetical order, we can use the sort command without any options:
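A sketch (the article's example file is not reproduced here, so the contents below are made up):
    printf 'banana\napple\ncherry\n' > file.txt   # hypothetical sample data
    sort file.txt
    # apple
    # banana
    # cherry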
To sort in reverse, we can use the -r option:
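Continuing with the same hypothetical file:
    sort -r file.txt
    # cherry
    # banana
    # apple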
We can also sort by a specific column. For example, we will create a file (sort1.txt) containing rows of space-separated fields:
Blank space is the default field separator. This means that we can sort this text by the second column. To do that, the -k option, along with the field number, is used:
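A sketch with made-up two-column data:
    printf 'alpha zebra\nbravo apple\ncharlie mango\n' > sort1.txt
    sort -k 2 sort1.txt
    # bravo apple
    # charlie mango
    # alpha zebra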
In the example above, we have sorted the file sort1.txt in alphabetical order using the second column.
To check if a file is already sorted, use sort with the -c option. This option also reports the first unsorted line:
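For example:
    sort -c file.txt
    # prints nothing and exits 0 if the file is sorted;
    # otherwise it reports the first out-of-order line and exits non-zero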
How to concatenate multiple lines of output to one line?
If I run the command cat file | grep pattern, I get many lines of output. How do you concatenate all lines into one line, effectively replacing each "\n" with "\ " ("\" followed by a space)?
cat file | grep pattern | xargs sed s/\n/ /g isn’t working for me.
11 Answers
Use tr '\n' ' ' to translate all newline characters to spaces:
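For example (pattern and file are placeholders from the question):
    grep pattern file | tr '\n' ' '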
Note: grep reads files, cat concatenates files. Don't cat file | grep!
Edit:
tr can only handle single-character translations. You could use awk to change the output record separator, like:
This would transform:
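The awk command and the before/after example are missing here; a sketch of what was probably meant (pattern and file are placeholders):
    # print every line, but with a space instead of a newline as the output record separator
    grep pattern file | awk -v ORS=' ' '1'
    # so input lines like
    #   foo
    #   bar
    # come out as
    #   foo bar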
Piping output to xargs will concatenate each line of output to a single line with spaces:
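For example (xargs with no command runs echo, which joins its arguments with spaces):
    grep pattern file | xargs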
Or any command, e.g. ls | xargs. The default limit of xargs output is 4096 characters, but it can be increased with e.g. xargs -s 8192.
In bash, echo without quotes removes carriage returns, tabs and multiple spaces.
This could be what you want:
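Presumably something like (pattern and file are placeholders):
    echo $(grep pattern file)   # deliberately unquoted, so newlines collapse into single spaces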
As to your edit, I'm not sure what it means; perhaps this? (This assumes that the chosen separator does not occur in file.)
This is an example which produces output separated by commas. You can replace the comma with whatever separator you need.
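The example itself is missing; one way to get that behavior is paste (a guess at what was shown; pattern and file are placeholders):
    grep pattern file | paste -s -d ',' -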
Here is a method using the ex editor (part of Vim):
Join all lines and print to the standard output:
Join all lines in-place (in the file):
Note: this will concatenate all lines inside the file itself!
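The actual ex invocations are missing above; a sketch of what they presumably look like (the filename is a placeholder):
    # join all lines and print the result to standard output (the file is untouched)
    ex -s -c '%j' -c '%p' -c 'q!' file
    # join all lines in-place (the file is rewritten)
    ex -s -c '%j' -c 'x' file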
The fastest and easiest ways I know to solve this problem:
When we want to replace the newline character \n with a space:
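Presumably (the filename is a placeholder):
    xargs < file   # xargs joins its input into space-separated arguments for echo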
xargs has its own limits on the number of characters per line and on the number of all characters combined, but we can increase them. Details can be found by running xargs --show-limits, and of course in the manual: man xargs.
When we want to replace one character with exactly one other character:
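Presumably:
    tr '\n' ' ' < file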
When we want to replace one character with many characters:
First, we replace the newline characters \n with tildes ~ (or choose another unique character not present in the text), and then we replace the tilde characters with any other characters (many_characters), and we do it for each tilde (flag g).
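A sketch of the two-step replacement described above:
    tr '\n' '~' < file | sed 's/~/many_characters/g'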
How to join multiple lines of file names into one with custom delimiter?
I would like to join the result of ls -1 into one line and delimit it with whatever I want.
Are there any standard Linux commands I can use to achieve this?
22 Answers
Similar to the very first option, but it omits the trailing delimiter:
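The command is missing here; presumably it is paste, which joins lines without leaving a trailing delimiter:
    ls -1 | paste -s -d ',' -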
EDIT: Simply use ls -m if you want your delimiter to be a comma.
Ah, the power and simplicity!
Change the comma "," to whatever you want. Note that this includes a "trailing comma".
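The command being described is missing; a sketch with tr (which does leave a trailing comma):
    ls -1 | tr '\n' ','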
This replaces the last comma with a newline:
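Presumably something like (GNU sed assumed for \n in the replacement):
    ls -1 | tr '\n' ',' | sed 's/,$/\n/'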
ls -m includes newlines at the screen-width character (80th for example).
Mostly Bash (only ls is external):
Using readarray (aka mapfile) in Bash 4:
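The snippet is missing here; a sketch of the readarray approach:
    readarray -t files < <(ls -1)
    ( IFS=','; echo "${files[*]}" )   # the "[*]" expansion joins elements with the first character of IFS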
Thanks to gniourf_gniourf for the suggestions.
I think this one is awesome:
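The command is missing; presumably:
    ls -1 | awk 'ORS=","'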
ORS is the "output record separator", so now your lines will be joined with a comma.
Parsing ls is in general not advised, so a better alternative is to use find, for example:
Or by using find and paste:
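Both commands are missing above; sketches of each (GNU find's -printf is assumed for the first):
    # find alone, printing each name followed by a comma
    find . -maxdepth 1 -type f -printf '%f,'
    # find plus paste
    find . -maxdepth 1 -type f | paste -s -d ',' -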
For general joining multiple lines (not related to file system), check: Concise and portable “join” on the Unix command-line.
The combination of setting IFS and the use of "$*" can do what you want. I'm using a subshell so I don't interfere with this shell's $IFS.
To capture the output,
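The code is missing above; a sketch of the IFS/"$*" trick and of capturing its output (using the glob * rather than ls output):
    # join the positional parameters with the first character of IFS
    ( set -- *; IFS=','; echo "$*" )
    # capture the joined string in a variable
    joined=$( set -- *; IFS=','; printf '%s' "$*" )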
Don’t reinvent the wheel.
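The command this answer points to is missing from the text; it is presumably ls -m:
    ls -m   # lists names separated by ", "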
It does exactly that.
Adding on top of majkinetor's answer, here is the way of removing the trailing delimiter (since I cannot just comment under his answer yet):
Just remove as many trailing bytes as your delimiter counts for.
I like this approach because I can use multi-character delimiters, plus other benefits of awk:
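The commands are missing above; a sketch matching the description (GNU head's negative byte count is assumed):
    # join with a two-character delimiter, then strip the trailing ", " (2 bytes)
    ls -1 | awk -v ORS=', ' '1' | head -c -2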
EDIT
As Peter has noticed, a negative byte count is not supported in the native macOS version of head. This, however, can be easily fixed.
First, install coreutils. "The GNU Core Utilities are the basic file, shell and text manipulation utilities of the GNU operating system."
Commands that are also provided by macOS are installed with the prefix "g", for example gls.
Once you have done this, you can use ghead, which supports a negative byte count, or better, make an alias:
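A sketch, assuming Homebrew is available:
    brew install coreutils
    ls -1 | awk -v ORS=', ' '1' | ghead -c -2
    # or, as the answer suggests, make head itself point at the GNU version:
    alias head='ghead'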