Follow this Java screen scraper example to aggregate content

By Cameron McKenzie, TechTarget

With enterprise applications, it's not unusual to aggregate content published on live sites. As such, it's a good idea to develop a level of familiarity with one of the popular Java screen scraper libraries.

In this step-by-step Java screen scraper tutorial, we'll take a look at a library named JSoup. JSoup has been around for a while, and it hasn't changed much in recent years. It remains a popular and reliable server-side option.

In this example, we will use JSoup to pull some information off the popular GitHub interview questions and answers article published on this site. Extracted information will include the author's name, the display title and a list of all of the embedded links within the page. Here's how to accomplish this.

Step 1: Link to the JSoup library

To work with JSoup, first, create a basic Maven project, and add the JSoup dependency:

<dependency>
  <groupId>org.jsoup</groupId>
  <artifactId>jsoup</artifactId>
  <version>1.11.3</version>
</dependency>

Step 2: Parse a URL

With the Maven dependency added, the next step is to have JSoup parse the specified URL to generate a JSoup Document object. For this example, we will name the returned Document object pageToScrape.

The code can be placed in a microservice, a servlet, a RESTful web component or even a JavaServer Page, but for this example, we will keep things simple and use a class with a runnable main method.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class JavaScreenScraper {
  public static void main(String[] args) throws Exception {
    String url = "https://www.theserverside.com/video/Tough-sample-Git-and-GitHub-interview-questions-and-answers";
    // Fetch the page over HTTP and parse it into a Document
    Document pageToScrape = Jsoup.connect(url).get();
  }
}
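Live sites can be slow or unavailable, so it can help to configure the connection before calling get(). The following is a minimal sketch, not part of the original example: the class name, the user agent string, the 10-second timeout and the placeholder URL are all assumptions chosen for illustration.

```java
import java.io.IOException;
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class ConnectionSketch {

  // Builds a configured connection. No network request is made here.
  static Connection configure(String url) {
    return Jsoup.connect(url)
        .userAgent("example-scraper/1.0") // hypothetical user agent string
        .timeout(10_000);                 // assumed limit, in milliseconds
  }

  public static void main(String[] args) {
    // Placeholder URL; pass a real one as the first argument
    String url = args.length > 0 ? args[0] : "https://example.com/";
    try {
      Document page = configure(url).get(); // the HTTP request happens on get()
      System.out.println("Fetched: " + page.location());
    } catch (IOException e) {
      // Covers timeouts, unreachable hosts and HTTP error responses
      System.err.println("Could not fetch " + url + ": " + e.getMessage());
    }
  }
}
```

Catching IOException instead of declaring throws Exception lets the caller decide whether a failed fetch should stop the whole aggregation run.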

Step 3: Scrape the page title

With the webpage now held as a Document object, it's possible to perform any number of inspections on the page. One inspection is to get the title text that the browser displays in its tab or title bar when the page renders. You can achieve this with a call to pageToScrape's title() method.

pageToScrape.title();

Figure: The page title that JSoup will screen scrape.
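To try title() without a network connection, JSoup can also parse HTML held in a String. The sketch below assumes a made-up HTML snippet and a hypothetical class name. It only illustrates the call and is not the article's own code.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class TitleSketch {

  // Returns the text of the <title> element, or an empty string if there is none.
  static String titleOf(String html) {
    Document doc = Jsoup.parse(html); // parses a String; no network access
    return doc.title();
  }

  public static void main(String[] args) {
    // Made-up HTML used only for illustration
    String html = "<html><head><title>Sample page</title></head><body></body></html>";
    System.out.println(titleOf(html)); // prints "Sample page"
  }
}
```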

Step 4: Page scrape with CSS selectors

JSoup provides a few built-in methods, such as title() and body(), that return well-known parts of the page directly. For more specific queries on the page, you'll need CSS selectors.

For example, the CSS selector for the spot where the author's name is displayed is:

#author > div > a

Given this information, you can have JSoup extract this element from the page with the selectFirst method. This returns an object of type Element representing the anchor tag. Invoke the text() method to obtain the text displayed by the anchor tag:

Element authorLink = pageToScrape.selectFirst("#author > div > a");
System.out.printf("The author: %s.\n", authorLink.text());
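Note that selectFirst returns null when nothing on the page matches the selector, which can happen if the site changes its markup. A minimal sketch of a defensive version follows. The HTML snippet, the class name and the "unknown" fallback value are assumptions made for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class AuthorSketch {

  // Returns the author link text, or a fallback when the selector matches nothing.
  static String authorOf(String html) {
    Document doc = Jsoup.parse(html);
    Element authorLink = doc.selectFirst("#author > div > a");
    return authorLink != null ? authorLink.text() : "unknown"; // assumed fallback
  }

  public static void main(String[] args) {
    // Made-up markup that mirrors the structure the selector expects
    String html = "<div id=\"author\"><div><a href=\"/authors/jane\">Jane Doe</a></div></div>";
    System.out.printf("The author: %s.%n", authorOf(html));
    System.out.printf("The author: %s.%n", authorOf("<p>No byline here</p>"));
  }
}
```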

How to find a CSS selector

Every element on an HTML page can be identified through a CSS selector. However, it can be extremely frustrating to work out the selector for a given element by reading the HTML source code alone. Fortunately, the developer tools in Google Chrome can generate an element's CSS selector for you.

To find an element's CSS selector value, simply right-click on the page element of interest, and choose Inspect. Then, left-click the three dots that appear adjacent to the HTML in the Elements tab. Finally, choose Copy > Copy selector, and the CSS selector of the element is added to your copy-and-paste clipboard.

Figure: How to find the CSS selector of an element to screen scrape.

Step 5: Multiple screen scraping results

If you're interested in more than one occurrence of an element on a page, you can use the Document object's more general select() method. Like the selectFirst() method, it takes a CSS selector as an argument. However, instead of returning a single Element, it returns an Elements collection containing every match.

For this Java screen scraper example, we will look at all of the anchor tags on the page and print the URL, or href attribute, of each of them. The code is:

Elements links = pageToScrape.select("a[href]");
for (Element link : links) {
    System.out.print("\nLink: " + link.attr("href"));
}
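An href attribute may hold a relative path such as /docs/page.html. When the aggregated links need to be absolute, JSoup's absUrl method resolves the attribute against the document's base URI. The sketch below assumes a made-up HTML snippet, base URI and class name. A document fetched with Jsoup.connect(url).get() should already carry its base URI, so the second argument to parse is only needed here because the HTML comes from a String.

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class LinkSketch {

  // Collects every anchor's href as an absolute URL.
  static List<String> absoluteLinks(String html, String baseUri) {
    Document doc = Jsoup.parse(html, baseUri); // baseUri is used to resolve relative links
    List<String> urls = new ArrayList<>();
    for (Element link : doc.select("a[href]")) {
      String url = link.absUrl("href"); // empty string if the URL cannot be resolved
      if (!url.isEmpty()) {
        urls.add(url);
      }
    }
    return urls;
  }

  public static void main(String[] args) {
    // Made-up markup: one relative link, one absolute link, one anchor without href
    String html = "<a href=\"/docs/page.html\">Docs</a>"
        + "<a href=\"https://example.org/other\">Other</a>"
        + "<a name=\"top\">Top</a>";
    for (String url : absoluteLinks(html, "https://example.com/articles/")) {
      System.out.println("Link: " + url);
    }
  }
}
```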

When put together, the JavaScreenScraper class looks as follows:

package com.mcnz.design;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

/* Java screen scraper example code */
public class JavaScreenScraper {

  public static void main(String[] args) throws Exception {

    String url = "https://www.theserverside.com/video/Tough-sample-Git-and-GitHub-interview-questions-and-answers";

    // Fetch the page and print its title
    Document pageToScrape = Jsoup.connect(url).get();
    System.out.printf("The title is: %s. \n", pageToScrape.title());

    // Print the text of the author's anchor tag
    Element authorLink = pageToScrape.selectFirst("#author > div > a");
    System.out.printf("The author: %s.\n", authorLink.text());

    // Print the href attribute of every anchor tag on the page
    Elements links = pageToScrape.select("a[href]");
    for (Element link : links) {
      System.out.print("\nLink: " + link.attr("href"));
    }
  }
}

The Java screen scraper example code can then be run as a stand-alone Java class. It generates the following output:

The title is: Tough sample GitHub interview questions and answers for job candidates. 
The author: Cameron McKenzie.
 
Link: https://www.theserverside.com/video/Tips-and-tricks-on-how-to-use-Jenkins-Git-Plugin
Link: https://www.theserverside.com/video/Tackle-these-10-sample-DevOps-interview-questions-and-answers
Link: https://www.theserverside.com/video/A-RESTful-APIs-tutorial-Learn-key-web-service-design-principles

The page contains more than 100 links, so the output shown here is truncated after the first three.

As you can see, it is fairly easy to aggregate data from active webpages with a Java screen scraper. There are a number of JavaScript libraries that perform a similar task, but for processing-intensive applications, a server-side technology is usually the better choice. If a Java screen scraper is what you need, give JSoup a try.

The source code for this example can be found on GitHub.
