apache kafka - Best way to stream PDF Files


What is the best way to stream PDF files through a messaging queue?

Would it be a good idea to do this in Kafka?

Here is what I have in mind:

  1. Pick up PDF files from a file drop location.
  2. Stream the files through Kafka.
  3. Parse the files for low-level info retrieval and cleanup. This could be done in a Storm topology or in Spark, or maybe in custom MapReduce code.
  4. Finally, I want to run machine learning algorithms on these documents.

Note that the steps mentioned above are only possibilities. If you have a better implementation, please suggest it.

I'd break this into 3 problems:

  1. ingestion
  2. parsing
  3. analytics

That way you can do ingestion once and iterate on parsing and analytics as your understanding of both the data and the problem evolves.

For ingestion, I'd push the actual file to an accessible location, such as HDFS or an HTTP server, and send a short message via Kafka saying that a file at a given location has been added and is ready for parsing. Once the file has been parsed, I'd store the extracted info in a database, so that you can iterate again on the entire set of ingested files if the parsing algorithm changes.
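A minimal sketch of that ingestion idea in Python, using only the standard library. The topic name `pdf-ingested`, the message fields, the `parsed` table layout, and the `send` callable are all assumptions made for illustration; `send` stands in for whatever Kafka producer you actually use. The point is only that the queue carries a small pointer to the file, not the PDF bytes, and that parse results are stored per parser version so the whole set can be re-parsed later.

```python
import hashlib
import json
import sqlite3
from pathlib import Path
from typing import Callable

TOPIC = "pdf-ingested"  # hypothetical topic name


def build_notification(path: Path, base_uri: str) -> bytes:
    """Build the short message that travels through the queue.

    Only the location and a little metadata are sent; the PDF itself
    stays in shared storage (HDFS, an HTTP server, ...).
    """
    data = path.read_bytes()
    message = {
        "location": f"{base_uri.rstrip('/')}/{path.name}",
        "size_bytes": len(data),
        "sha256": hashlib.sha256(data).hexdigest(),
    }
    return json.dumps(message).encode("utf-8")


def announce(path: Path, base_uri: str, send: Callable[[str, bytes], object]) -> None:
    """Publish the notification through any callable taking (topic, value)."""
    send(TOPIC, build_notification(path, base_uri))


def record_parse(db: sqlite3.Connection, notification: bytes,
                 parser_version: str, text: str) -> None:
    """Store parser output keyed by (location, parser_version).

    Re-running an improved parser adds new rows instead of losing
    track of which files were ingested.
    """
    msg = json.loads(notification)
    db.execute(
        "CREATE TABLE IF NOT EXISTS parsed ("
        "location TEXT, parser_version TEXT, text TEXT, "
        "PRIMARY KEY (location, parser_version))"
    )
    db.execute(
        "INSERT OR REPLACE INTO parsed VALUES (?, ?, ?)",
        (msg["location"], parser_version, text),
    )
    db.commit()
```

With a real client you would pass the producer's send function as `send`; for example, with the kafka-python package something like `lambda topic, value: producer.send(topic, value=value)` should fit, though I haven't wired that part up here. The checksum is optional, but it gives the parser a cheap way to detect a file that was replaced after the message was sent.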

