Java - Tokenizing and indexing many files
I have read several files and want to index each word in them. The index should have the following format:
requirement ==> word , {d1,tf1,d2,tf2,d4,tf4} , someothervalue
Explanation:
1) word = a word in the files
2) d1, d2, d4, ... = file ids
3) tf1, tf2, tf4, ... = the number of times the word appears in d1, d2, d4 respectively
I created a class "Token" that contains the word ('String token'), the name of the file it belongs to ('String fileId'), and its frequency in that file ('int count').
I can check the various words in one file and update the count; I used an ArrayList for that. When the same word appears in another file, how can I append the file id and count while indexing?
I would create a

class RefCount {
    String fileId;
    int count;

    RefCount(String fileId) {
        this.fileId = fileId;
        count = 1;   // created on the first occurrence in this file
    }

    void increment() {
        count++;
    }
    // more...
}
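and the class Token should be

import java.util.List;

class Token {
    String word;
    List<RefCount> references;
    ...
    public void countWord(String fileId) {
        int last = references.size() - 1;
        if (last >= 0) {
            RefCount rc = references.get(last);
            // Same file as the previous occurrence: just bump its count.
            if (rc.fileId.equals(fileId)) {
                rc.increment();
                return;
            }
        }
        // First occurrence of this word in this file.
        references.add(new RefCount(fileId));
    }
    // more...
}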
This assumes that references are added file by file, so only the last file id needs to be checked to determine whether we are still in the same file.
You should use a Map<String, Token>, keyed by the word, rather than a List to hold the tokens.
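A minimal sketch of how the map-based index might fit together, assuming the RefCount and Token classes described above. The InvertedIndex class, its format method, and the choice of TreeMap (only to keep words sorted for display) are my own illustrative additions, and the sample words and file ids are made up. It still relies on files being indexed one after another.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// One (file id, count) pair for a word.
class RefCount {
    final String fileId;
    int count = 1; // created on the first occurrence in this file

    RefCount(String fileId) { this.fileId = fileId; }

    void increment() { count++; }
}

// A word together with its per-file counts, in the order files were indexed.
class Token {
    final String word;
    final List<RefCount> references = new ArrayList<>();

    Token(String word) { this.word = word; }

    // Assumes files are processed one at a time, so only the last entry is checked.
    void countWord(String fileId) {
        int last = references.size() - 1;
        if (last >= 0 && references.get(last).fileId.equals(fileId)) {
            references.get(last).increment();
            return;
        }
        references.add(new RefCount(fileId));
    }
}

// Hypothetical wrapper around the Map<String, Token>; the name is illustrative.
class InvertedIndex {
    private final Map<String, Token> tokens = new TreeMap<>();

    // Look up (or create) the Token for this word and record one occurrence.
    void countWord(String word, String fileId) {
        tokens.computeIfAbsent(word, Token::new).countWord(fileId);
    }

    // Render one entry as "word , {d1,tf1,d2,tf2}"; unknown words give "word , {}".
    String format(String word) {
        StringBuilder sb = new StringBuilder(word).append(" , {");
        Token token = tokens.get(word);
        if (token != null) {
            for (int i = 0; i < token.references.size(); i++) {
                RefCount rc = token.references.get(i);
                if (i > 0) sb.append(',');
                sb.append(rc.fileId).append(',').append(rc.count);
            }
        }
        return sb.append('}').toString();
    }

    public static void main(String[] args) {
        InvertedIndex index = new InvertedIndex();
        for (String w : "the cat saw the dog".split(" ")) index.countWord(w, "d1");
        for (String w : "the dog".split(" ")) index.countWord(w, "d2");
        System.out.println(index.format("the")); // the , {d1,2,d2,1}
        System.out.println(index.format("cat")); // cat , {d1,1}
    }
}
```

With a map, finding the Token for a word is a single lookup instead of a scan over the whole list, which matters once many files have been indexed.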
Edit: To display the results, you can iterate over the map or list of tokens, and for each token over its list of RefCount objects:
for (Token token : tokenList) {
    System.out.print(token.getWord() + ":");
    for (RefCount refCount : token.getReferences()) {
        System.out.print(" " + refCount.getFileId() + "*" + refCount.getCount());
    }
    System.out.println();
}
You may want to terminate the line after every n-th id/count pair.
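One possible way to break the line after every n-th pair is sketched below. PostingPrinter and its format method are hypothetical names, and passing the ids and counts as two parallel lists is only an assumption to keep the example self-contained; in the real code you would walk the RefCount list instead.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of the original answer.
class PostingPrinter {
    // Builds "word:" followed by " id*count" pairs and starts a new line
    // after every perLine-th pair. Every pair is preceded by a space, so
    // continuation lines begin with one space. The result ends with '\n'.
    static String format(String word, List<String> fileIds,
                         List<Integer> counts, int perLine) {
        if (perLine <= 0 || fileIds.size() != counts.size()) {
            throw new IllegalArgumentException("bad perLine or mismatched lists");
        }
        StringBuilder sb = new StringBuilder(word).append(':');
        for (int i = 0; i < fileIds.size(); i++) {
            sb.append(' ').append(fileIds.get(i)).append('*').append(counts.get(i));
            boolean endOfRow = (i + 1) % perLine == 0;
            boolean morePairs = i + 1 < fileIds.size();
            if (endOfRow && morePairs) {
                sb.append('\n'); // break only if something still follows
            }
        }
        return sb.append('\n').toString();
    }

    public static void main(String[] args) {
        // Made-up sample data: five files, two pairs per output line.
        System.out.print(format("the",
                Arrays.asList("d1", "d2", "d4", "d7", "d9"),
                Arrays.asList(3, 1, 2, 5, 1),
                2));
    }
}
```

Checking that more pairs follow before breaking avoids an empty line when the number of pairs is an exact multiple of n.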