Showing posts with label J2EE. Show all posts
Showing posts with label J2EE. Show all posts

Tuesday, 5 November 2013

What are the pros and cons of the leading Java HTML parsers?

Almost all known HTML parsers implements the W3C DOM API (part of the JAXP API, Java API for XML processing) and gives you a org.w3c.dom.Document back which is ready for direct use by JAXP API. The major differences are usually to be found in the features of the parser in question. Most parsers are to a certain degree forgiving and lenient with non-wellformed HTML ("tagsoup"), like JTidyNekoHTML,TagSoup and HtmlCleaner. You usually use this kind of HTML parsers to "tidy" the HTML source (e.g. replacing the HTML-valid <br> by a XML-valid <br />), so that you can traverse it "the usual way" using the W3C DOM and JAXP API.
The only ones which jumps out are HtmlUnit and Jsoup.


HtmlUnit provides a completely own API which gives you the possibility to act like a webbrowser programmatically. I.e. enter form values, click elements, invoke JavaScript, etcetera. It's much more than alone a HTML parser. It's a real "GUI-less webbrowser" and HTML unit testing tool.


Jsoup also provides a completely own API. It gives you the possibility to select elements using jQuery-likeCSS selectors and provides a slick API to traverse the HTML DOM tree to get the elements of interest.
Particularly the traversing of the HTML DOM tree is the major strength of Jsoup. Ones who have worked with org.w3c.dom.Document know what a hell of pain it is to traverse the DOM using the verboseNodeList and Node APIs. True, XPath makes the life easier, but still, it's another learning curve and it can end up to be still verbose.
Here's an example which uses a "plain" W3C DOM parser like JTidy in combination with XPath to extract the first paragraph of your question and the names of all answerers (I am using XPath since without it, the code needed to gather the information of interest would otherwise grow up 10 times as big, without writing utility/helper methods).
String url = "";
Document document = new Tidy().parseDOM(new URL(url).openStream(), null);
XPath xpath = XPathFactory.newInstance().newXPath();
Node question = (Node) xpath.compile("//*[@id='question']//*[contains(@class,'post-text')]//p[1]").evaluate(document, XPathConstants.NODE);
System.out.println("Question: " + question.getFirstChild().getNodeValue());
NodeList answerers = (NodeList) xpath.compile("//*[@id='answers']//*[contains(@class,'user-details')]//a[1]").evaluate(document, XPathConstants.NODESET);
for (int i = 0; i < answerers.getLength(); i++) {
System.out.println("Answerer: " + answerers.item(i).getFirstChild().getNodeValue());
And here's an example how to do exactly the same with Jsoup:
String url = "";
Document document = Jsoup.connect(url).get();
Element question ="#question .post-text p").first();
System.out.println("Question: " + question.text());
Elements answerers ="#answers .user-details a");
for (Element answerer : answerers) {
System.out.println("Answerer: " + answerer.text());
Do you see the difference? It's not only less code, but Jsoup is also relatively easy to grasp if you already have moderate experience with CSS selectors (by e.g. developing websites and/or using jQuery).


The pros and cons of each should be clear enough now. If you just want to use the standard JAXP API to traverse it, then go for the first mentioned group of parsers. There are pretty a lot of them. Which one to choose depends on the features it provides (how is HTML cleaning made easy for you? are there some listeners/interceptors and tag-specific cleaners?) and the robustness of the library (how often is it updated/maintained/fixed?). If you like to unit test the HTML, then HtmlUnit is the way to go. If you like to extract specific data from the HTML (which is more than often the real world requirement), then Jsoup is the way to go.

How to know tomcat server: JSP page is modified ..?

Because when Tomcat is asked to execute a JSP, is compares the modification date of the JSP file with the modification time of the compiled class corresponding to this JSP, and if more recent, it recompiles on the fly before executing it.
This is BTW an option that should be turned off in production, because it takes time to perform this check.
See for details.

It is upto the container to decide when to load servlets. A servlet can be loaded at runtime on demand. And coming to JSP, JSP translated to servlet can also be loaded at runtime.
Coming to your question,
Why Tomcat does not require restart?
It is because Tomcat is capable of adding/modifying classpath to Web Application classloader at runtime. Tomcat will be having their custom Classloader implementation which allows them to add the classpaths at runtime.
How does the custom classloader might work?
One way to get this working is when a Servlet/JSP is modified, a new classloader is created for the Servlet/JSP with Application classloader as parent classloader . And the new classloader will load the modified class again.

Monday, 11 March 2013

OCWCD Chapter 2 : Web App architecture

A container runs and controls the servlets.
A full J2EE application server must have both a web container and an EJB container.
When a request is received by the webserver and needs to access a servlet, the webserver hands the
request to the servlet-helper app : the container. Then the container hands the request to the servlet

  •  Match the request URL to a specific servlet;
  • Creates request and response objects used to get informations from the client and send back informations to him;
  • Creates the thread for the servlet;
  • Calls the servlet’s service() method, passing request and response objects as parameters;
  • service() method will call the doXxx() method of the servlet depending on request type;
  •  Get the response object from the finished service() method and converts it to an HTTP response;
  •  Destroys request and response objects;
  •  Sends back response to the webserver

Capabilities of containers

Containers provide :
  • Communications support : handles communication between the servlets and the webserver
  • Lifecycle management : controls life and death of servlets, class loading, instantiation, initialization, invocation the servlets’ methods, and making servlet instances eligible for GC
  • Multithreading support : Automatically creates a new thread for every servlet request received. When the Servlet service() method completes, the thread dies.
  • Declarative security : manages the security inside the XML deployment descriptor file. Security can be configured and changed only by modifying this file.
  • JSP support : Manages the JSPs.
URL to servlet mapping

The servlet container uses the deployment descriptor to map URL to servlets. Two DD elements
are used to achieve this :

<servlet> : maps internal name to fully qualified class name;
<servlet-mapping> : maps internal name to public URL name.

Model-View-Controller (MVC) Design

MVC does not only separates the Business Logic from the presentation, the Business Logic doesn’t
even know there is a presentation.
MVC : separate Business Logic (model) and the presentation, then put something between them toconnect them (controler), so the business can become a real re-usable code, and the presentation layer can be modified/changed at will.


Overview :

  • HTTP stands for HyperText Transport Protocol, and is the network protocol used over the web. It runs over TCP/IP.
  • HTTP uses a request/response model. Client sends an HTTP request, then the server gives back the HTTP response that the browser will display (depending on the content type of the answer) If the response is an HTML file, then the HTML file is added to the HTTP response body
  • An HTTP response includes : status code, the content-type (MIME type), and actuel content of the response
  • A MIME type tells the browser what kind of data the response is holding URL stands for Uniform Resource Locator: starts with a protocol, then a server name, optionnaly followed by a port number, then the path to the resource followed by the resource name. Parameters may appear at the end, separated from the rest by a ?

HTTP methods

GET Gets the data identified by an URI
POST Same, the request’s body is passed to the resource specified by the URI
HEAD Sends back only the header of the response that would be sent by a GET
PUT Sends to the server data to store at specified URI. Server does not process the
data ; it only stores them
DELETE Delete specified resource
TRACE When it receives this request, the server sends it back to the client
OPTIONS Ask server informations about a resource or about its own capabilities
CONNECT Ask for a connction (tunnelling)

Differences between GET and POST

POST has a body.
GET parameters are passed on the URL line, after a ?. The URL size is limited.
The parameters appear in the browser URL field, hence showing any password to the world...
GET is supposed to be used to GET things (only retrieval, no modification on the server).
GET processing must be idempotent.

A click on a hyperlink always sends a GET request.
GET is the default method used by HTML forms. Use the method=”POST” to change it.

In POST requests, the parameters are passed in the body of the request, after the header. There is no
size-limit. Parameters are not visible from the user.
POST is supposed to be used to send data to be processed.
POST may not be idempotent and is the only one.

IDEMPOTENT : Can do the same thing over and over again, with no negative side effect.