R - Elastic: make a light count query (vs search query)
I am accessing bulk data in Elasticsearch through R, for analytics purposes. I need to query data over a relatively long duration (say a month). The data for a month is approximately 4.5 million rows, and R goes out of memory.
Sample code and result (for 1 day) below:
dt <- as.Date("2015-09-01", "%Y-%m-%d")
frmdt <- strftime(dt, "%Y-%m-%d")
todt <- as.Date(dt + 1)
todt <- strftime(todt, "%Y-%m-%d")
connect(es_base = "http://xx.yy.zzz.kk")
start_date <- as.integer(as.POSIXct(frmdt)) * 1000
end_date <- as.integer(as.POSIXct(todt)) * 1000
query <- sprintf('{"query":{"range":{"time":{"gte":"%s","lte":"%s"}}}}',
                 start_date, end_date)
s_list <- elastic::Search(index = "organised_2015_09", type = "property_search",
                          body = query, fields = c("trackid", "time"),
                          size = 1000000)$hits$hits
> length(s_list)
[1] 144612
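If only the number of matching documents is needed, the hits list itself never has to be transferred: with `size = 0`, Elasticsearch returns the usual response envelope with an empty hits array, and the total match count is in `hits.total`. A minimal sketch, assuming the same connection and query body as above (the response below is a hypothetical stand-in with the standard response shape, since this example cannot reach a live cluster):

```r
# Hypothetical response shape for Search(..., size = 0): the usual envelope,
# but with an empty hits list, so only the total count is transferred.
response <- list(hits = list(total = 144612, hits = list()))

# In a live session this would instead be:
# response <- elastic::Search(index = "organised_2015_09",
#                             type = "property_search",
#                             body = query, size = 0)

total <- response$hits$total
total  # same value as length(s_list) from the full query, without the 222 MB
```

This trims the transferred payload to a few hundred bytes of metadata, at the cost of losing the per-document fields.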
This result for 1 day has 144k records and weighs 222 MB. A sample list item is below:
> s_list[[1]]
$`_index`
[1] "organised_2015_09"

$`_type`
[1] "property_search"

$`_id`
[1] "1441122918941"

$`_version`
[1] 1

$`_score`
[1] 1

$fields
$fields$time
$fields$time[[1]]
[1] 1441122918941

$fields$trackid
$fields$trackid[[1]]
[1] "fd4b4ce88101e58623ba9e6e31971d1f"
Actually, a summary count of the number of items per "trackid" and "time" (summarized per day) would suffice for my analytics purpose. Hence I tried to transform this into a count query with aggregations, and constructed the query below:
query <- '{
  "size": 0,
  "query": {
    "filtered": {
      "query": { "match_all": {} },
      "filter": {
        "range": {
          "time": { "gte": 1441045800000, "lte": 1443551400000 }
        }
      }
    }
  },
  "aggs": {
    "articles_over_time": {
      "date_histogram": {
        "field": "time",
        "interval": "day",
        "time_zone": "+05:30"
      },
      "aggs": {
        "group_by_state": {
          "terms": { "field": "trackid", "size": 0 }
        }
      }
    }
  }
}'
response <- elastic::Search(index = "organised_recent", type = "property_search",
                            body = query, search_type = "count")
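The payload win from this query comes from reading `response$aggregations` instead of `response$hits$hits`: a `date_histogram` bucket carries only `key`, `key_as_string`, `doc_count`, and the nested terms buckets. A minimal sketch of flattening those buckets into per-day counts, using a hypothetical two-day response fragment (bucket values below are invented for illustration, not taken from the actual index):

```r
# Hypothetical fragment shaped like
# response$aggregations$articles_over_time$buckets from the query above.
buckets <- list(
  list(key_as_string = "2015-09-01", key = 1441045800000, doc_count = 144612,
       group_by_state = list(buckets = list(
         list(key = "fd4b4ce88101e58623ba9e6e31971d1f", doc_count = 12),
         list(key = "0a1b2c3d4e5f66778899aabbccddeeff", doc_count = 7)))),
  list(key_as_string = "2015-09-02", key = 1441132200000, doc_count = 98765,
       group_by_state = list(buckets = list(
         list(key = "fd4b4ce88101e58623ba9e6e31971d1f", doc_count = 3))))
)

# Flatten: one row per day, with the total hit count for that day and the
# number of distinct trackids seen in its nested terms buckets.
daily <- do.call(rbind, lapply(buckets, function(b) {
  data.frame(day = b$key_as_string,
             hits = b$doc_count,
             distinct_trackids = length(b$group_by_state$buckets),
             stringsAsFactors = FALSE)
}))
daily
```

Note that `"size": 0` on the terms aggregation asks Elasticsearch to return every distinct trackid bucket, so with high-cardinality trackids the aggregation response can itself become large.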
However, I did not gain anything in speed or response size. I think I am missing something, but I am not sure what.