apache kafka - Best way to stream PDF files
What is the best way to stream PDF files through a messaging queue?
Would it be a good idea to do this in Kafka?
Here is what I have in mind:
- Pick up PDF files from a file drop location.
- Stream the files through Kafka.
- Parse the files for low-level info retrieval and cleanup. This could be done in a Storm topology or in Spark, or maybe in custom MapReduce code.
- Finally, I want to run machine learning algorithms on these documents.
Note that the steps mentioned above are only possibilities. If you have a better implementation, please suggest it.
I'd break this into 3 problems:
- ingestion
- parsing
- analytics
That way you can do ingestion once and iterate on parsing and analytics as your understanding of both the data and the problem evolves.
For ingestion, I'd push the actual file to an accessible location, such as HDFS or an HTTP server, and send a short message via Kafka saying that a file at a given location has been added and is ready for parsing. Once a file has been parsed, store the info in a database so that you can iterate again on the entire set of ingested files if the parsing algorithm changes.
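To make the ingestion idea concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: a local directory stands in for HDFS or an HTTP server, SQLite stands in for the database, the `send` callable stands in for a Kafka producer, and the function, field, and table names (`ingest`, `handle_notification`, `reparse_all`, `files`, `parsed`) are hypothetical.

```python
import hashlib
import json
import sqlite3
from pathlib import Path
from typing import Callable


def build_notification(location: str, data: bytes) -> bytes:
    """Build the short message that goes through the queue instead of the PDF itself."""
    note = {
        "location": location,  # where a consumer can fetch the file
        "size": len(data),
        "sha256": hashlib.sha256(data).hexdigest(),
    }
    return json.dumps(note).encode("utf-8")


def ingest(src: Path, store_dir: Path, send: Callable[[bytes], None]) -> str:
    """Copy the file to shared storage first, then announce it on the queue."""
    data = src.read_bytes()
    dest = store_dir / src.name  # stand-in for an HDFS path or HTTP URL
    dest.write_bytes(data)
    send(build_notification(str(dest), data))
    return str(dest)


def open_db(path: str = ":memory:") -> sqlite3.Connection:
    """Create the two hypothetical tables: known files and parse results."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS files (location TEXT PRIMARY KEY, sha256 TEXT)"
    )
    db.execute(
        "CREATE TABLE IF NOT EXISTS parsed ("
        "location TEXT, parser_version TEXT, result TEXT, "
        "PRIMARY KEY (location, parser_version))"
    )
    return db


def _parse_and_store(db, location: str, parse: Callable[[bytes], dict], version: str):
    # Fetch the file from storage, parse it, and record the result per parser version.
    result = parse(Path(location).read_bytes())
    db.execute(
        "INSERT OR REPLACE INTO parsed VALUES (?, ?, ?)",
        (location, version, json.dumps(result)),
    )


def handle_notification(db, message: bytes, parse: Callable[[bytes], dict], version: str):
    """Consumer side: remember the file, then parse it."""
    note = json.loads(message)
    db.execute(
        "INSERT OR REPLACE INTO files VALUES (?, ?)",
        (note["location"], note["sha256"]),
    )
    _parse_and_store(db, note["location"], parse, version)
    db.commit()


def reparse_all(db, parse: Callable[[bytes], dict], version: str) -> int:
    """Re-run a changed parser over every ingested file, without re-ingesting."""
    rows = db.execute("SELECT location FROM files").fetchall()
    for (location,) in rows:
        _parse_and_store(db, location, parse, version)
    db.commit()
    return len(rows)
```

With a real client, `send` would wrap the producer call; with kafka-python, for example, something like `lambda m: producer.send("pdf-ingested", m)`, where the topic name is made up. The point of the design is that only a few hundred bytes per file travel through Kafka, and because the `files` table keeps every location, a new parser version can be run over the whole set later.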