How does this even work? An AI summarizer to create TL;DR news articles
I started a little project called Cloud News Reposted here on Medium. All articles are written by a little Python script, which goes through RSS feeds of my choice and uses Hugging Face Transformers to summarize the blog posts.
Get the latest cloud news in one spot. Google, AWS, Azure, you name it, reposted and summarized. All blog posts are…medium.com
The resulting blog posts look something like this.
The whole AI bot makes use of a handful of Python libraries:
- transformers, "State-of-the-art Natural Language Processing for Jax, Pytorch and TensorFlow. Transformers (formerly known as pytorch-transformers and pytorch-pretrained-bert) provides general-purpose architectures (BERT, GPT-2, RoBERTa, XLM, DistilBert, XLNet…) for Natural Language Understanding (NLU) and Natural Language Generation (NLG) with over 32+ pretrained models in 100+ languages and deep interoperability between Jax, PyTorch and TensorFlow." (quoted from their website)
- newspaper3k, basically a website parser that dismantles a page for us; especially useful for extracting the article text (LINK)
- PyTorch is an open source machine learning framework based on the Torch library, used for applications such as computer vision and natural language processing, primarily developed by Facebook’s AI Research lab. It is free and open-source software released under the Modified BSD license. (Source: Wikipedia)
- feedparser, a simple RSS parser library for Python
- requests, to post to Medium.com
pip3 install transformers
# this one works for Windows with Nvidia cards
pip3 install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu113
pip3 install newspaper3k
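The remaining two libraries from the list above are plain pip installs as well:
pip3 install feedparser requests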
Step 1: Get feeds
Let's start with the easy part: using feedparser to get the blog posts.

All we do is fetch every feed, take the entries from the last 7 days (86400 seconds times 7 days), and add the results to a list (all_blogs); a short sketch follows the list below. The list ends up holding dictionary-like feed entries, but for the next step we basically need
- blog['links'][0]['href'], which holds the URL of the original post
- blog['title'], the title of the blog post
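A minimal sketch of that loop, assuming a placeholder list of feed URLs (the real script uses the author's own selection of cloud blogs):
import time
import feedparser

# placeholder feeds; swap in the cloud blogs you want to follow
FEED_URLS = [
    "https://aws.amazon.com/blogs/aws/feed/",
    "https://example.com/another-cloud-blog/feed/",
]

MAX_AGE = 86400 * 7  # 7 days in seconds
now = time.time()

all_blogs = []
for url in FEED_URLS:
    feed = feedparser.parse(url)
    for entry in feed.entries:
        published = entry.get("published_parsed")  # a time.struct_time, may be missing
        if published and now - time.mktime(published) <= MAX_AGE:
            all_blogs.append(entry)
Each entry behaves like a dictionary, which is why blog['title'] and blog['links'][0]['href'] work in the next step.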
Step 2: Get the summary
Now that we have the list of all blog posts, we can walk through them with newspaper and use the transformer to summarize them.

Therefore we walk through all URLs that belong to the blog entries. We could work on the RSS content itself, of course, but I found too many feeds that only offer a summary of the post and not the full body. Therefore we use newspaper to read and parse the website (blog['links'][0]['href']).
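A minimal sketch of that step with newspaper3k, assuming blog is one entry from all_blogs:
from newspaper import Article

url = blog["links"][0]["href"]  # URL of the original post, taken from the RSS entry
article = Article(url)
article.download()  # fetch the raw HTML
article.parse()     # extract title, authors and, most importantly, the body text
text = article.text  # the plain article text we hand to the summarizer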
We feed the parsed text into our transformer one article at a time. The real magic happens within three lines of code:
inputs = tokenizer("summarize: " + article.text, return_tensors="pt", max_length=512, truncation=True)
outputs = model.generate(inputs["input_ids"], max_length=150, min_length=40, length_penalty=2.0, num_beams=4, early_stopping=True)
summary = tokenizer.decode(outputs[0], skip_special_tokens=True)
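For context, here is the same snippet as a self-contained sketch. The exact checkpoint lives in the full script linked below; t5-base is only a stand-in here, chosen because the "summarize: " prefix is the T5 convention:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "t5-base"  # stand-in checkpoint; see the full script for the model actually used
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# article.text is the body extracted by newspaper3k above
inputs = tokenizer("summarize: " + article.text, return_tensors="pt", max_length=512, truncation=True)
outputs = model.generate(inputs["input_ids"], max_length=150, min_length=40, length_penalty=2.0, num_beams=4, early_stopping=True)
summary = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(summary)
max_length and min_length bound the summary in tokens, num_beams=4 turns on beam search, and a length_penalty above 1.0 nudges the beam search toward longer outputs.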
Here is the full code:
Full script on github.com
That's it again. Leave a clap and, as always, be excellent to each other.