Extracting text from HTML file using Python

- May 15, 2010

I'd like to extract the text from an HTML file using Python. I want essentially the same output I would get if I copied the text from a browser and pasted it into Notepad.

I'd like something more robust than regular expressions, which may fail on poorly formed HTML. I've seen many people recommend Beautiful Soup, but I've had a few problems using it. For one, it picked up unwanted text, such as JavaScript source. Also, it did not interpret HTML entities. For example, I would expect an apostrophe entity such as &#39; in the HTML source to be converted to an apostrophe in the text, just as if I'd pasted the browser content into Notepad.

Update: html2text looks promising. It handles HTML entities correctly and ignores JavaScript. However, it does not produce plain text; it produces Markdown, which would then have to be turned into plain text. It comes with no examples or documentation, but the code looks clean.
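
To make the requirements concrete (skip JavaScript, decode entities, emit plain text rather than Markdown), here is a minimal sketch that uses only the standard library's html.parser module. It is not how Beautiful Soup or html2text work internally; the names TextExtractor and html_to_text, and the choice of which tags count as line breaks, are assumptions made for illustration. It will not match a browser's copy-and-paste output exactly.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    # Tags whose contents should never appear in the output.
    SKIPPED = {"script", "style"}
    # Tags treated as line breaks (an assumed, incomplete list).
    BLOCK = {"p", "div", "br", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        # convert_charrefs=True decodes entities such as &#39; and &amp;.
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1
        elif tag in self.BLOCK:
            self._chunks.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1
        elif tag in self.BLOCK:
            self._chunks.append("\n")

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self):
        # Collapse runs of whitespace within each line and drop empty lines.
        raw_lines = "".join(self._chunks).splitlines()
        lines = (" ".join(line.split()) for line in raw_lines)
        return "\n".join(line for line in lines if line)


def html_to_text(html):
    """Return a plain-text rendering of an HTML string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return parser.text()
```

For example, html_to_text("<script>var x = 1;</script><p>It&#39;s here</p>") returns the string It's here, with the script source dropped and the entity decoded. html.parser tolerates much malformed markup, but an unclosed script tag would still cause the text after it to be skipped.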
