How does this even work, an AI summarizer to create TLDR news articles

I started a little project called Cloud News Reposted here on Medium. All articles are written by a little Python script, which goes through RSS feeds of my choice and uses Hugging Face Transformers to summarize each blog post.

The resulting blog posts look something like this:

The whole AI bot makes use of a handful of Python libraries:

  • transformers: “State-of-the-art Natural Language Processing for Jax, PyTorch and TensorFlow. Transformers (formerly known as pytorch-transformers and pytorch-pretrained-bert) provides general-purpose architectures (BERT, GPT-2, RoBERTa, XLM, DistilBert, XLNet…) for Natural Language Understanding (NLU) and Natural Language Generation (NLG) with over 32+ pretrained models in 100+ languages and deep interoperability between Jax, PyTorch and TensorFlow.” (Quoted from their website)
  • newspaper3k, basically a parser for websites that can dismantle a page for us, especially useful for getting the article text (LINK)
  • torch: “PyTorch is an open source machine learning framework based on the Torch library, used for applications such as computer vision and natural language processing, primarily developed by Facebook’s AI Research lab. It is free and open-source software released under the Modified BSD license.” (Source: Wikipedia)
  • feedparser, a simple RSS parsing library for Python
  • requests, to post the finished articles to Medium.com
pip3 install transformers
# this one works for Windows with NVIDIA cards
pip3 install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu113
pip3 install newspaper3k
# also needed by the script
pip3 install feedparser requests

Step 1: Get feeds

Let's start with the easy part: using feedparser to get the blog posts.

All we do is fetch all the feeds, take the entries from the last 7 days (86,400 seconds times 7 days), and add every result to a list (all_blogs). The list then holds all entries as JSON-like objects, but for the next step we basically need just two fields (see the sketch after this list):

  • blog['links'][0]['href'], which holds the URL of the original post
  • blog['title'], the title of the blog post
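
Here is a minimal sketch of this step (the feed URL is just an example, not necessarily one I use; the date check follows the 7 × 86,400-second window described above):

import calendar
import time

import feedparser

# example feed; swap in the RSS feeds of your choice
FEED_URLS = ["https://aws.amazon.com/blogs/aws/feed/"]

cutoff = time.time() - 7 * 86400  # only keep entries from the last 7 days

all_blogs = []
for url in FEED_URLS:
    feed = feedparser.parse(url)
    for entry in feed.entries:
        # published_parsed is a UTC time.struct_time; some feeds omit it
        published = entry.get("published_parsed")
        if published and calendar.timegm(published) >= cutoff:
            all_blogs.append(entry)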

Step 2: Get the summary

Now that we have the list of all blog posts, we can walk through them with newspaper and use transformers to summarize them.

Therefore we walk through all URLs that are part of the blog entries. We could work with the RSS content itself, of course, but I found too many feeds that only offer a summary of the post rather than the full body. So we use newspaper to read and parse the website (blog['links'][0]['href']).
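
A minimal sketch of the download-and-parse step with newspaper3k:

from newspaper import Article

for blog in all_blogs:
    url = blog["links"][0]["href"]  # URL of the original post
    article = Article(url)
    article.download()  # fetch the raw HTML
    article.parse()     # strip the boilerplate; the body text ends up in article.text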

We feed the parsed text into our transformer one article at a time. The real magic happens within three lines of code (the model setup is shown here for completeness; the "summarize: " prefix implies a T5-style checkpoint, so the model name below is an assumption):

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-base")  # assumed checkpoint, see above
model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")

inputs = tokenizer("summarize: " + article.text, return_tensors="pt", max_length=512, truncation=True)
outputs = model.generate(inputs["input_ids"], max_length=150, min_length=40, length_penalty=2.0, num_beams=4, early_stopping=True)
summary = tokenizer.decode(outputs[0], skip_special_tokens=True)
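
The last piece is posting the finished article to Medium with requests. A minimal sketch, assuming Medium's v1 API; the token and user ID are placeholders, and the helper name is mine:

import requests

MEDIUM_TOKEN = "YOUR_INTEGRATION_TOKEN"  # placeholder: create one in your Medium settings
MEDIUM_USER_ID = "YOUR_USER_ID"          # placeholder: returned by GET /v1/me

def post_to_medium(title, content_html):
    # create a draft post under the authenticated user
    response = requests.post(
        f"https://api.medium.com/v1/users/{MEDIUM_USER_ID}/posts",
        headers={"Authorization": f"Bearer {MEDIUM_TOKEN}"},
        json={
            "title": title,
            "contentFormat": "html",
            "content": content_html,
            "publishStatus": "draft",  # flip to "public" once you trust the bot
        },
        timeout=30,
    )
    response.raise_for_status()
    return response.json()

Posting as a draft first lets you sanity-check the summaries before they go public.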

Here is the full code:

That's it again. Leave a clap, and like always, be excellent to each other!