Parse HTML - Get Clean Text from HTML
Get Clean Text from HTML:
The Python function below extracts plain text from HTML. You can also use the htmllaundry package, but writing the function yourself lets you customize which text is kept to suit your requirements.
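# HTML parser: return the visible text of an HTML document
import re

from bs4 import BeautifulSoup
from bs4.element import Comment


def parse_html(content):
    soup = BeautifulSoup(content, 'html.parser')
    # Collect every text node, including comments and script/style bodies
    text_elements = soup.find_all(string=True)
    # Text whose parent is one of these tags is not visible page content
    tag_blacklist = ['style', 'script', 'head', 'title', 'meta', '[document]', 'img']
    clean_text = []
    for element in text_elements:
        if element.parent.name in tag_blacklist or isinstance(element, Comment):
            continue
        text_ = element.strip()
        if text_:  # skip whitespace-only nodes
            clean_text.append(text_)
    result_text = " ".join(clean_text)
    # Remove line breaks left inside text nodes (str.replace would match the pattern literally)
    result_text = re.sub(r'[\r\n]', '', result_text)
    # Remove any leftover tag-like fragments
    tag_remove_pattern = re.compile(r'<[^>]+>')
    result_text = tag_remove_pattern.sub('', result_text)
    # Remove backslashes
    result_text = re.sub(r'\\', '', result_text)
    return result_text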
The function ignores text inside the tags listed in tag_blacklist = ['style','script','head','title','meta','[document]','img']. In addition to these tags, HTML comments are skipped by checking each text node against the Comment class imported from bs4.element.
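To see what the blacklist and the Comment check actually filter out, here is a small sketch on an inline HTML string. The names visible_text, DEFAULT_SKIP, skip_tags and the sample markup are made up for illustration and are not part of the function above; the sketch only assumes the same BeautifulSoup approach, with the blacklist passed in as a parameter so it can be customized.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Same tags as tag_blacklist in the post (hypothetical constant name)
DEFAULT_SKIP = ('style', 'script', 'head', 'title', 'meta', '[document]', 'img')


def visible_text(html, skip_tags=DEFAULT_SKIP):
    """Illustrative helper: join text nodes that are neither comments
    nor children of a tag listed in skip_tags."""
    soup = BeautifulSoup(html, 'html.parser')
    pieces = []
    for node in soup.find_all(string=True):
        if isinstance(node, Comment) or node.parent.name in skip_tags:
            continue
        text = node.strip()
        if text:  # ignore whitespace-only nodes
            pieces.append(text)
    return ' '.join(pieces)


sample = """<html><head><title>Demo</title><style>p {color: red;}</style></head>
<body><!-- hidden note --><h1>Hello</h1><p>World</p><script>var x = 1;</script>
<footer>Contact us</footer></body></html>"""

print(visible_text(sample))
# Hello World Contact us

# Customizing: also drop footer text by extending the skip list
print(visible_text(sample, skip_tags=DEFAULT_SKIP + ('footer',)))
# Hello World
```

Adding or removing entries in the skip list is the kind of customization the hand-written function allows.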
After calling requests.get on a URL, pass response.content as the argument to the function above.
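A minimal sketch of that fetch step might look like the following. The helper name fetch_clean_text, its parse argument, and the 10 second timeout are assumptions for illustration; the parser is passed in as a callable so the snippet stands on its own.

```python
import requests


def fetch_clean_text(url, parse, timeout=10):
    """Download url and return parse(response.content).

    parse is any callable that accepts the raw response bytes,
    for example the HTML parsing function from this post.
    """
    response = requests.get(url, timeout=timeout)
    # Fail early on 4xx/5xx instead of parsing an error page
    response.raise_for_status()
    return parse(response.content)
```

With the function from this post in scope, a call would be fetch_clean_text("https://example.com", parse_html), where the URL is only a placeholder.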