Introduction
In this tutorial, we show how to use the jsoup library to convert HTML content into plain text, with all HTML tags removed, in a Java application.
Add jsoup library to your Java project
To use the jsoup library in a Gradle project, add the following dependency to the build.gradle file. (The compile configuration is deprecated and was removed in Gradle 7, so implementation is used here.)
implementation 'org.jsoup:jsoup:1.13.1'
To use the jsoup library in a Maven project, add the following dependency to the pom.xml file.
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.13.1</version>
</dependency>
To download the jsoup-1.13.1.jar file directly, visit the jsoup download page at jsoup.org/download
Convert HTML String into Plain Text
In the Java application below, we use the Jsoup.clean() method with an empty whitelist to remove all HTML tags from an HTML string and return the remaining plain text.
import org.jsoup.Jsoup;
import org.jsoup.safety.Whitelist;

public class ConvertHtmlToText {
    public static void main(String... args) {
        // Sample input; the <h1> and <p> tags are illustrative
        String htmlString = "<h1>Simple Solution</h1><p>Convert HTML to Text</p>";
        // An empty whitelist allows no tags, so only the text survives
        String outputText = Jsoup.clean(htmlString, new Whitelist());
        System.out.println(outputText);
    }
}
The output is:
Simple SolutionConvert HTML to Text
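Note that the two pieces of text run together in the output above, because Jsoup.clean() drops the tags without inserting any separator. If word boundaries between block elements matter, one alternative is to parse the HTML and call Document.text(), which puts a single space between block-level elements. The following is a minimal sketch of that approach; the class name, the helper method and the sample tags (<h1> and <p>) are assumptions made for illustration.

```java
import org.jsoup.Jsoup;

public class HtmlToTextWithSpaces {

    // Hypothetical helper: parses the HTML and returns its combined text.
    // Document.text() separates block-level elements with a single space.
    static String toPlainText(String html) {
        return Jsoup.parse(html).text();
    }

    public static void main(String... args) {
        // Assumed sample input, chosen for illustration
        String htmlString = "<h1>Simple Solution</h1><p>Convert HTML to Text</p>";
        System.out.println(toPlainText(htmlString));
        // prints: Simple Solution Convert HTML to Text
    }
}
```

Which method fits better depends on the input: Jsoup.clean() can also keep selected tags when given a less strict whitelist, while Document.text() always returns text only.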
Convert HTML from Website into Plain Text
In the following example program, we combine Jsoup.clean() with the Jsoup.connect() method provided by the jsoup library to download the HTML content of a URL and then remove its HTML tags.
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.safety.Whitelist;

import java.io.IOException;

public class ConvertHtmlToTextFromUrl {
    public static void main(String... args) {
        try {
            // Jsoup.connect() requires an absolute http or https URL
            String url = "https://simplesolution.dev/";
            // Download and parse the page
            Document document = Jsoup.connect(url).get();
            String htmlString = document.html();
            // Strip every tag, keeping only the text
            String outputText = Jsoup.clean(htmlString, new Whitelist());
            System.out.println(outputText);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
The output is the text content of the downloaded page with all HTML tags removed.
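When fetching real pages it can help to set a user agent and a timeout on the connection, since some servers reject or stall requests that lack them. The sketch below shows one possible way to do this with Connection.userAgent() and Connection.timeout(). The class name, the helper methods, the user agent string and the 10 second timeout are all assumptions chosen for illustration, and the https scheme of the URL is assumed as well.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.io.IOException;

public class FetchPageText {

    // Hypothetical helper: downloads a page and returns the text of its body.
    static String fetchText(String url) throws IOException {
        Document document = Jsoup.connect(url)
                .userAgent("Mozilla/5.0 (compatible; HtmlToTextExample/1.0)") // assumed value
                .timeout(10_000) // assumed limit, in milliseconds
                .get();
        return extractText(document);
    }

    // Returns the text inside <body>, leaving out the <title> in <head>.
    static String extractText(Document document) {
        return document.body().text();
    }

    public static void main(String... args) {
        try {
            System.out.println(fetchText("https://simplesolution.dev/"));
        } catch (IOException e) {
            System.err.println("Could not fetch page: " + e.getMessage());
        }
    }
}
```

Keeping extractText() separate from the download means the text extraction can be tried on a locally parsed document without any network access.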
Convert HTML File into Plain Text
The following examples show how to read HTML content from a file and remove its HTML tags. Suppose we have a sample.html file with the following content.
Simple Solution
Example 1: read the file content using the Java NIO classes.
import org.jsoup.Jsoup;
import org.jsoup.safety.Whitelist;

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class ConvertHtmlToTextFromFile1 {
    public static void main(String... args) {
        try {
            String fileName = "sample.html";
            Path filePath = Paths.get(fileName);
            // Read the whole file and decode it as UTF-8
            byte[] fileBytes = Files.readAllBytes(filePath);
            String htmlString = new String(fileBytes, StandardCharsets.UTF_8);
            // Strip every tag, keeping only the text
            String outputText = Jsoup.clean(htmlString, new Whitelist());
            System.out.println(outputText);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
The output is the text content of sample.html with all HTML tags removed.
Example 2: read the HTML file using the Jsoup.parse() method.
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.safety.Whitelist;

import java.io.File;
import java.io.IOException;

public class ConvertHtmlToTextFromFile2 {
    public static void main(String... args) {
        try {
            String fileName = "sample.html";
            File file = new File(fileName);
            // Let jsoup read and parse the file as UTF-8
            Document document = Jsoup.parse(file, "UTF-8");
            String htmlString = document.html();
            // Strip every tag, keeping only the text
            String outputText = Jsoup.clean(htmlString, new Whitelist());
            System.out.println(outputText);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
The output is again the text content of sample.html with all HTML tags removed.
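One detail applies to every example that uses Jsoup.clean(): the method returns an HTML fragment rather than raw text, so special characters in the result stay escaped (for instance, an ampersand is returned as &amp;). If truly plain text is needed, the entities can be decoded afterwards with Parser.unescapeEntities(). The following is a minimal sketch; the class name, the helper method and the sample input are assumptions made for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.parser.Parser;
import org.jsoup.safety.Whitelist;

public class CleanAndUnescape {

    // Hypothetical helper: strips all tags, then decodes HTML entities.
    static String toPlainText(String html) {
        // Whitelist.none() allows no tags, like the empty whitelist used above
        String cleaned = Jsoup.clean(html, Whitelist.none());
        // false: the string is treated as text, not as an attribute value
        return Parser.unescapeEntities(cleaned, false);
    }

    public static void main(String... args) {
        String html = "<p>Fish &amp; Chips</p>"; // assumed sample input
        System.out.println(Jsoup.clean(html, Whitelist.none())); // prints: Fish &amp; Chips
        System.out.println(toPlainText(html));                   // prints: Fish & Chips
    }
}
```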
Happy Coding 😊
Related Articles
jsoup parse HTML Document from a File and InputStream in Java
jsoup parse HTML Document from an URL in Java
Read Text Files in Java