linux - How to find rows in a file with the same values for specific columns using Unix commands?

- August 15, 2015

I have a file with a large number of rows. Each row contains 5 columns delimited by tabs. I want to find all the rows that have the same value in the first 4 columns but different values in the 5th column.

    name     age    address    phone    city
    eric      5      add1      1234     city1
    jerry     5      add1      1234     city2
    eric      5      add1      1234     city3
    eric      5      add1      1234     city4
    jax       5      add1      1234     city5
    jax       5      add1      1234     city6
    niko      5      add1      1234     city7

The result table should be:

    eric      5      add1      1234     city1
    eric      5      add1      1234     city3
    eric      5      add1      1234     city4
    jax       5      add1      1234     city5
    jax       5      add1      1234     city6

I tried using `uniq -u -f4` after `sort`, but that ignores the first 4 fields, which in this case would return all the rows.

I would be inclined to use `awk` for this.

`script.awk`

    # Count rows per 4-column key and save each row under (key, sequence number).
    { x = count[$1,$2,$3,$4]++; line[$1,$2,$3,$4,x] = $0 }
    END {   for (key in count)
            {
                kc = count[key]
                if (kc > 1)
                {
                    for (i = 0; i < kc; i++)
                    {
                        print line[key,i]
                    }
                }
            }
        }

For each line, increment the count of the number of rows that have the first four field values as the key. Save the current line in the correct sequence. At the end, for each key where the count is more than one, print each of the saved lines for that key.

Sample run:

    $ awk -f script.awk data
    jax       5      add1      1234     city5
    jax       5      add1      1234     city6
    eric      5      add1      1234     city1
    eric      5      add1      1234     city3
    eric      5      add1      1234     city4
    $

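Note that this generates the keys in a different order from the one in which they appear in the file (the first eric, 5, add1, 1234 entry occurs before the first jax, 5, add1, 1234 entry). The order in which `for (key in count)` visits the keys is up to the awk implementation.

It is possible to resolve that if it is necessary to do so.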

`script2.awk`

    {   x = count[$1,$2,$3,$4]++;
        line[$1,$2,$3,$4,x] = $0
        # First time this key is seen: remember its position in the input.
        if (x == 0)
            seq[n++] = $1 SUBSEP $2 SUBSEP $3 SUBSEP $4
    }
    END {   for (s = 0; s < n; s++)
            {
                key = seq[s]
                kc = count[key]
                if (kc > 1)
                {
                    for (i = 0; i < kc; i++)
                    {
                        print line[key,i]
                    }
                }
            }
        }

`SUBSEP` is the character used to separate the components of a multi-item key, so the assignment to `seq[n++]` records the same value that is used as the index in `count[$1,$2,$3,$4]`. The `seq` array records each key (the first four columns) in the order in which the keys appear. Stepping through that array in sequence gives the keys in the order in which the first entry for each key appears.
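The equivalence between the comma form and the explicit `SUBSEP` form can be checked with a small sketch like the one below. It is an illustration only: the single input row is copied from the sample data, and the variable names `same`, `part` and `result` are made up for this example.

```shell
# Feed one tab-delimited row to awk and compare the two ways of spelling the key.
result=$(printf 'eric\t5\tadd1\t1234\tcity1\n' | awk '{
    # Comma form: awk joins the four subscripts with SUBSEP internally.
    count[$1,$2,$3,$4]++
    # Explicit form: build the same string by hand.
    key = $1 SUBSEP $2 SUBSEP $3 SUBSEP $4
    same = (key in count) ? "same element" : "different element"
    print same, count[key]
    # Splitting on SUBSEP recovers the four components of the key.
    n = split(key, part, SUBSEP)
    print n, part[1], part[4]
}')
printf '%s\n' "$result"
```

This prints `same element 1` followed by `4 eric 1234`, which is why `seq[s]` can be used directly as an index into `count` and, with `,i` appended, into `line`.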

sample run

$ awk -f script2.awk data eric      5      add1      1234     city1 eric      5      add1      1234     city3 eric      5      add1      1234     city4 jax       5      add1      1234     city5 jax       5      add1      1234     city6 $

Preprocessing the data to save memory and speed up processing

The code above keeps a lot of data in memory. It has a complete copy of each line in the data files; it has a key with the first four fields; it has another key with the four fields plus an integer. For most practical purposes, that's 3 copies of the data. If the data files are large, that could be a problem. However, given that the sample data has jerry's row appearing in the middle of eric's rows, it is not possible to do much better, unless the data is sorted first. Then you know that all the related rows are together in the file, and you can process it much more simply.

`script3.awk`

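    {
        new_key = $1 SUBSEP $2 SUBSEP $3 SUBSEP $4
        if (new_key == old_key)
        {
            # Second or later row of a group: flush the held first row, then print this one.
            if (old_line != "") { print old_line; old_line = "" }
            print $0
        }
        else
        {
            # New key: hold the row until a second row with the same key turns up.
            old_line = $0
            old_key = new_key
        }
    }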

Sample run:

    $ sort data | awk -f script3.awk
    eric      5      add1      1234     city1
    eric      5      add1      1234     city3
    eric      5      add1      1234     city4
    jax       5      add1      1234     city5
    jax       5      add1      1234     city6
    $

Of course, it is a coincidence that eric precedes jax in alphabetic sequence; by sorting, you lose the original data sequence. But the `script3.awk` script keeps at most two keys and one line in memory, which isn't going to stress anything in terms of memory. Even with the sort time added, this probably still gives a measurable saving over the original processing mechanism.

If the original order is critical, you have to do much more work. I think it involves numbering each line in the original file; sorting with the line number as a fifth key after the first four keys, to group the same keys together; identifying each group of rows with the same four key values by the same group number; sorting again on the group number and the sequence number within the group; and feeding that to a modified version of the processing in the `script3.awk` script. This still might be better than the original if the files are in the gigabyte range. However, the only way to be sure is to do measurements on realistically sized examples.

For example:

    nl data |
    sort -k2,2 -k3,3 -k4,4 -k5,5 -k1,1n |
    awk '{ new_key = $2 SUBSEP $3 SUBSEP $4 SUBSEP $5
           if (old_key != new_key) { grp_seq = $1 }
           print grp_seq, $0
           old_key = new_key
         }' |
    sort -k1,1n -k2,2n

This generates:

    1      1    name     age    address    phone    city
    2      2    eric      5      add1      1234     city1
    2      4    eric      5      add1      1234     city3
    2      5    eric      5      add1      1234     city4
    3      3    jerry     5      add1      1234     city2
    6      6    jax       5      add1      1234     city5
    6      7    jax       5      add1      1234     city6
    8      8    niko      5      add1      1234     city7

You can then apply a modified version of `script3.awk` that ignores `$1` and `$2` to generate the desired output. Or you could run the output shown through a program that strips the two leading columns off and then use `script3.awk` unchanged.
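One way the modified `script3.awk` stage might look is sketched below, run end to end on a copy of the sample data. This is an assumption-laden sketch rather than the answer's own code: the names `data_file`, `row`, `old_row` and `result` are invented here, the temporary file stands in for `data`, and `LC_ALL=C` is added to `sort` only to make the collation predictable. The final `awk` builds its key from `$3` to `$6` and strips the group number and line number with `sub()` before printing.

```shell
# Hypothetical stand-in for the data file: the sample rows, tab-delimited.
data_file=$(mktemp)
printf '%s\t%s\t%s\t%s\t%s\n' \
    name age address phone city \
    eric 5 add1 1234 city1 \
    jerry 5 add1 1234 city2 \
    eric 5 add1 1234 city3 \
    eric 5 add1 1234 city4 \
    jax 5 add1 1234 city5 \
    jax 5 add1 1234 city6 \
    niko 5 add1 1234 city7 > "$data_file"

result=$(
    nl "$data_file" |
    LC_ALL=C sort -k2,2 -k3,3 -k4,4 -k5,5 -k1,1n |
    awk '{ new_key = $2 SUBSEP $3 SUBSEP $4 SUBSEP $5
           if (old_key != new_key) { grp_seq = $1 }
           print grp_seq, $0
           old_key = new_key
         }' |
    LC_ALL=C sort -k1,1n -k2,2n |
    awk '{
        # The key skips the group number ($1) and the line number ($2).
        new_key = $3 SUBSEP $4 SUBSEP $5 SUBSEP $6
        # Work on a copy so that the fields above are not re-split.
        row = $0
        sub(/^[ \t]*[0-9]+[ \t]+[0-9]+[ \t]+/, "", row)
        if (new_key == old_key) {
            if (old_row != "") { print old_row; old_row = "" }
            print row
        } else {
            old_row = row
            old_key = new_key
        }
    }'
)
rm -f "$data_file"
printf '%s\n' "$result"
```

On the sample data this prints the three eric rows followed by the two jax rows, in the order in which each group first appears in the file. Whether it beats the in-memory scripts on large inputs would still need to be measured.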
