linux - How to find rows in a file with same values for specific columns using unix commands?
I have a file with a large number of rows. Each row contains 5 columns delimited by tabs. I want to find all the rows that have the same values in the first 4 columns but different values in the 5th column.
    name   age  address  phone  city
    eric   5    add1     1234   city1
    jerry  5    add1     1234   city2
    eric   5    add1     1234   city3
    eric   5    add1     1234   city4
    jax    5    add1     1234   city5
    jax    5    add1     1234   city6
    niko   5    add1     1234   city7
The result table should be:

    eric   5    add1     1234   city1
    eric   5    add1     1234   city3
    eric   5    add1     1234   city4
    jax    5    add1     1234   city5
    jax    5    add1     1234   city6
I tried using uniq -u -f4 after sort, but that ignores the first 4 fields, which in this case would return all the rows.
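To make the problem with that attempt concrete, here is a small sketch that reproduces it on the sample data. The function name `uniq_attempt` and the temporary file are only for illustration. Because `-f4` skips the first four fields, `uniq` compares only the city, and every city is distinct.

```shell
# Hypothetical helper: the attempt described above, applied to one file.
uniq_attempt() {
    # -f4 skips fields 1-4 when comparing, so only the city is compared;
    # -u prints only lines that are not repeated under that comparison.
    sort "$1" | uniq -u -f4
}

# Sample data from the question, tab-delimited (printf reuses the format
# for every group of five arguments).
data=$(mktemp)
printf '%s\t%s\t%s\t%s\t%s\n' \
    name age address phone city \
    eric 5 add1 1234 city1 \
    jerry 5 add1 1234 city2 \
    eric 5 add1 1234 city3 \
    eric 5 add1 1234 city4 \
    jax 5 add1 1234 city5 \
    jax 5 add1 1234 city6 \
    niko 5 add1 1234 city7 > "$data"

# Every input row comes back, since no two rows share a city.
uniq_attempt "$data"
rm -f "$data"
```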
I would be inclined to use awk for this.
script.awk
    {
        x = count[$1,$2,$3,$4]++     # rows seen so far with this key
        line[$1,$2,$3,$4,x] = $0     # save the row under key plus sequence number
    }
    END {
        for (key in count) {
            kc = count[key]
            if (kc > 1) {
                for (i = 0; i < kc; i++) {
                    print line[key,i]
                }
            }
        }
    }
For each line, increment the count of the number of rows with the first four field values as the key. Save the current line in the correct sequence. At the end, for each key where the count is more than one, print each of the saved lines for that key.
Sample run:

    $ awk -f script.awk data
    jax 5 add1 1234 city5
    jax 5 add1 1234 city6
    eric 5 add1 1234 city1
    eric 5 add1 1234 city3
    eric 5 add1 1234 city4
    $
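Note that this generates the keys in a different order from the one in which they appear in the file (the first eric, 5, add1, 1234 entry occurs before the first jax, 5, add1, 1234 entry), because the order in which for (key in count) visits the keys is unspecified. It is possible to resolve that if it is necessary to do so.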
script2.awk
    {
        x = count[$1,$2,$3,$4]++
        line[$1,$2,$3,$4,x] = $0
        # First time this key is seen: remember its position in the input.
        if (x == 0)
            seq[n++] = $1 SUBSEP $2 SUBSEP $3 SUBSEP $4
    }
    END {
        for (s = 0; s < n; s++) {
            key = seq[s]
            kc = count[key]
            if (kc > 1) {
                for (i = 0; i < kc; i++) {
                    print line[key,i]
                }
            }
        }
    }
SUBSEP is the character used to separate the components of a multi-item key, so the assignment to seq[n++] records the value used as the index in count[$1,$2,$3,$4]. The seq array records each key (the first four columns) in the order in which they appear. Stepping through that array in sequence gives the keys in the order in which the first entry for each key appears.
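As a quick check of that equivalence, the following sketch builds a key both ways and confirms that they index the same array element. The function name `subsep_demo` and the literal values are only for illustration.

```shell
# Hypothetical helper: shows that a multi-item subscript and a string
# joined by hand with SUBSEP name the same array element.
subsep_demo() {
    awk 'BEGIN {
        count["eric", 5, "add1", 1234] = 2                 # multi-item subscript
        key = "eric" SUBSEP 5 SUBSEP "add1" SUBSEP 1234    # same key, built by hand
        if (key in count)
            print "same key, count = " count[key]
        n = split(key, part, SUBSEP)                       # recover the components
        print n, part[1], part[4]
    }'
}

subsep_demo
# prints:
#   same key, count = 2
#   4 eric 1234
```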
Sample run:

    $ awk -f script2.awk data
    eric 5 add1 1234 city1
    eric 5 add1 1234 city3
    eric 5 add1 1234 city4
    jax 5 add1 1234 city5
    jax 5 add1 1234 city6
    $
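A different trade-off, which is not one of the scripts in this answer, is to read the input twice. This is a minimal sketch that assumes the data is in a regular file that can be reopened (not a pipe); the function name `dup_rows_two_pass` is hypothetical. The first pass only counts keys, and the second pass prints rows whose key was counted more than once, so no lines are held in memory. Rows come out in their original line order rather than grouped by key; for the sample data the two orders happen to coincide.

```shell
# Hypothetical helper: two passes over the same file, named twice.
dup_rows_two_pass() {
    awk '
        # Pass 1: NR == FNR holds only while reading the first named file.
        NR == FNR { count[$1, $2, $3, $4]++; next }
        # Pass 2: print each row whose key occurred more than once.
        count[$1, $2, $3, $4] > 1
    ' "$1" "$1"
}

# Sample data from the question, tab-delimited.
data=$(mktemp)
printf '%s\t%s\t%s\t%s\t%s\n' \
    name age address phone city \
    eric 5 add1 1234 city1 \
    jerry 5 add1 1234 city2 \
    eric 5 add1 1234 city3 \
    eric 5 add1 1234 city4 \
    jax 5 add1 1234 city5 \
    jax 5 add1 1234 city6 \
    niko 5 add1 1234 city7 > "$data"

dup_rows_two_pass "$data"
rm -f "$data"
```

Whether reading the file twice beats holding the lines in memory depends on the file size and the storage, so it would need measuring on realistic data.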
Preprocessing the data to save memory and speed up processing
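script.awk and script2.awk keep a lot of data in memory. They hold a complete copy of each line in the data files; a key with the first four fields; and another key with the four fields plus an integer. For most practical purposes, that's 3 copies of the data. If the data files are large, that could be a problem. However, given that the sample data has jerry's row appearing in the middle of eric's rows, it is not possible to do much better in a single pass over the input, unless the data is sorted first. Then you know that all the related rows are together in the file, and you can process it more simply.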
script3.awk
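    {
        new_key = $1 SUBSEP $2 SUBSEP $3 SUBSEP $4
        if (new_key == old_key) {
            # Second or later row of a group: flush the held first row once.
            if (old_line != "") { print old_line; old_line = "" }
            print $0
        } else {
            # New group: hold the row until a second one with the same key appears.
            old_line = $0
            old_key = new_key
        }
    }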
Sample run:

    $ sort data | awk -f script3.awk
    eric 5 add1 1234 city1
    eric 5 add1 1234 city3
    eric 5 add1 1234 city4
    jax 5 add1 1234 city5
    jax 5 add1 1234 city6
    $
Of course, it is a coincidence that eric precedes jax in alphabetic sequence; with the sorting, you lose the original data sequence. But the script3.awk script keeps at most 2 keys and 1 line in memory, which isn't going to stress anything in terms of memory. Even with the sort time added, this should still give a measurable saving over the original processing mechanism.
If the original order is critical, you have to do much more work. I think it involves numbering each line in the original file, sorting with the line number as a fifth key after the first four keys so that rows with the same keys are grouped together, then tagging each group of rows that share the same four key values with the same group number (the line number of its first row), sorting again on the group number and the sequence number within the group, and then feeding that to a modified version of the processing in the script3.awk script. This still might be better than the original if the files are in the gigabyte range. However, the only way to be sure is to make measurements on realistically sized examples.
For example:

    nl data |
    sort -k2,2 -k3,3 -k4,4 -k5,5 -k1,1n |
    awk '{
        new_key = $2 SUBSEP $3 SUBSEP $4 SUBSEP $5
        if (old_key != new_key) { grp_seq = $1 }
        print grp_seq, $0
        old_key = new_key
    }' |
    sort -k1,1n -k2,2n
This generates:

    1 1 name age address phone city
    2 2 eric 5 add1 1234 city1
    2 4 eric 5 add1 1234 city3
    2 5 eric 5 add1 1234 city4
    3 3 jerry 5 add1 1234 city2
    6 6 jax 5 add1 1234 city5
    6 7 jax 5 add1 1234 city6
    8 8 niko 5 add1 1234 city7
You can then apply a modified version of the script3.awk script that ignores $1 and $2 to generate the desired output. Or you could run the output shown through a program that strips the two leading columns off.
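One way that last step might look is sketched below. It assumes each row has exactly five columns with no embedded spaces, so that awk's default field splitting lines up with the tab-delimited columns. The function name `dup_rows_original_order` is hypothetical, and the final awk stage is an adaptation of script3.awk that builds its key from $3 to $6 and rebuilds the row from $3 to $7, which drops the two leading numbers.

```shell
# Hypothetical helper: the numbering pipeline plus a script3.awk variant.
dup_rows_original_order() {
    nl "$1" |
    sort -k2,2 -k3,3 -k4,4 -k5,5 -k1,1n |
    awk '{
        # Tag every row with the line number of the first row in its group.
        new_key = $2 SUBSEP $3 SUBSEP $4 SUBSEP $5
        if (old_key != new_key) { grp_seq = $1 }
        print grp_seq, $0
        old_key = new_key
    }' |
    sort -k1,1n -k2,2n |
    awk '{
        # $1 is the group number and $2 the line number; the data starts at $3.
        new_key = $3 SUBSEP $4 SUBSEP $5 SUBSEP $6
        row = $3 "\t" $4 "\t" $5 "\t" $6 "\t" $7
        if (new_key == old_key) {
            if (old_row != "") { print old_row; old_row = "" }
            print row
        } else {
            old_row = row
            old_key = new_key
        }
    }'
}

# Sample data from the question, tab-delimited.
data=$(mktemp)
printf '%s\t%s\t%s\t%s\t%s\n' \
    name age address phone city \
    eric 5 add1 1234 city1 \
    jerry 5 add1 1234 city2 \
    eric 5 add1 1234 city3 \
    eric 5 add1 1234 city4 \
    jax 5 add1 1234 city5 \
    jax 5 add1 1234 city6 \
    niko 5 add1 1234 city7 > "$data"

dup_rows_original_order "$data"
rm -f "$data"
```

Note that nl numbers only non-empty lines by default, so this sketch also assumes the input has no blank lines.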