Saturday, 30 October 2010

Login to a website using Apache HttpClient and Jericho HTML Parser

The fastest way to create a program is to use code that's already written: a library. Recently I needed to build an HTTP-aware client application in Java. The application had to log in to a website, download files automatically and log out. It was essential that the program could parse the HTML content to find the login form, perform the login correctly and then find the links to the files to download. I decided to use Apache HttpClient to access resources via HTTP and Jericho HTML Parser to parse the HTML content. In this article I will share the most important functions I'm using. They show how HttpClient and Jericho can work together to achieve that goal.

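First of all, we need to import the libraries:
// Java standard library
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

// Apache HttpClient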
import org.apache.http.HttpEntity;
import org.apache.http.HttpResponse;
import org.apache.http.NameValuePair;
import org.apache.http.client.entity.UrlEncodedFormEntity;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.client.ResponseHandler;
import org.apache.http.cookie.Cookie;
import org.apache.http.impl.client.BasicResponseHandler;
import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.http.message.BasicNameValuePair;
import org.apache.http.protocol.HTTP;

// Jericho HTML Parser
import net.htmlparser.jericho.*;

Now we can write a first useful function, which retrieves the login data. First, a short preamble:

A traditional HTML login form looks something like this:

<form method="POST" action="action">
<input type="text" name="username" value="Username" /> 
<input type="password" name="password" value="Password" />
...
</form>

Note that the form is usually submitted as name/value pairs, one for each input field, and these pairs are what we have to deal with.
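
To make the idea of name/value pairs concrete, here is a minimal sketch, using only the Java standard library, of roughly what such pairs look like once they are URL-encoded into the body of a POST request. In the real program this job is left to HttpClient's UrlEncodedFormEntity; the class name FormEncodingSketch and the sample credentials are made up for illustration.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper class, for illustration only.
class FormEncodingSketch {

    // URL-encode a single name or value as UTF-8.
    static String enc(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a Java platform.
            throw new IllegalStateException(e);
        }
    }

    // Join the pairs as name=value, separated by '&',
    // in the order in which they were inserted.
    static String encode(Map<String, String> pairs) {
        StringBuilder body = new StringBuilder();
        for (Map.Entry<String, String> pair : pairs.entrySet()) {
            if (body.length() > 0) body.append('&');
            body.append(enc(pair.getKey()))
                .append('=')
                .append(enc(pair.getValue()));
        }
        return body.toString();
    }

    public static void main(String[] args) {
        Map<String, String> pairs =
                new LinkedHashMap<String, String>();
        pairs.put("username", "alice");      // sample value
        pairs.put("password", "p@ss word");  // sample value
        // prints: username=alice&password=p%40ss+word
        System.out.println(encode(pairs));
    }
}
```

The field names come from the name attributes of the input elements, which is why the form has to be parsed before anything can be posted.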

The buildNameValuePairs function parses the HTML document and returns a List<NameValuePair> holding the name/value pairs needed to log in to a website. The first step in parsing an HTML document is to construct a Source object from the source data, which can be a String, Reader, InputStream, URLConnection or URL. In this case the function does not construct the Source by loading the content directly from the URL, because the page must be read through the connection already opened by HttpClient. Opening a second connection would only waste resources.
public static List<NameValuePair> buildNameValuePairs(
                                       InputStream is,
                                       String strUsername,
                                       String strPassword)
                                       throws Exception  {
        // Initialize the list
        List<NameValuePair> nvps = 
                      new ArrayList<NameValuePair>();

        // String arrays containing the column labels
        // and values. Both start as null so that we can
        // tell later whether a login form was found.
        String[] columnLabels = null, columnValues = null;

        // Register the tag types related to the Mason
        // server platform.
        MasonTagTypes.register();

        // Construct a new Source object by loading the
        // content from the specified InputStream.
        Source source = new Source(is);

        // Set the Logger that handles log messages to null.
        // We are a windowed application and we don't
        // need to handle verbose outputs.
        source.setLogger(null);

        // Parses all of the tags in this source document
        // sequentially from beginning to end.
        source.fullSequentialParse();

        // Return a list of all forms in the 
        // source document.
        List<Element> formElements = 
                                     source.getAllElements(
                                      HTMLElementName.FORM);

        // Start the loop
        for (Element formElement : formElements) {
            String loginTag = 
                       formElement.getStartTag().toString();
            // stop the execution of the current iteration 
            // and go back to the beginning of the loop to 
            // begin a new iteration
            if (loginTag == null) continue;
            // Let's find the login form.
            else if (
              loginTag.toLowerCase().indexOf("login")>-1) {
                // Create a segment of the Source document 
                // containing the login form as CharSequence.
                CharSequence cs = formElement.getContent();
                // Constructs a new Source object from 
                // the specified segment.
                source = new Source(cs);

                // Return a collection of FormField objects.
                // Each form field consists of a group of 
                // form controls having the same name.
                FormFields formFields = 
                                     source.getFormFields();

                // Return a string array containing 
                // the column labels corresponding to the 
                // values from the getColumnValues(Map)
                // method.
                columnLabels = formFields.getColumnLabels();
                // Convert all the form submission values 
                // of the constituent form fields into a 
                // simple string array.
                columnValues = formFields.getColumnValues();

                break;
            }
        }

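        // If no login form was found there is nothing
        // to submit: return the empty list.
        if (columnLabels == null || columnValues == null)
            return nvps;

        // Now we can construct the list of name/value pairs
        // to log in to the website. The fields named
        // "username" and "password" (as in the sample form
        // above; other sites may use different names) get
        // the user-supplied credentials. Every other field
        // keeps the value found in the form.
        for (int i = 0; i < columnValues.length; i++) {
            String label = columnLabels[i];
            String value = columnValues[i];
            if (label.equalsIgnoreCase("username"))
                value = strUsername;
            else if (label.equalsIgnoreCase("password"))
                value = strPassword;
            nvps.add(new BasicNameValuePair(label, value));
        }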

        // return statement
        return nvps;    
}

Now we can build the main function, which establishes a connection and logs in using the buildNameValuePairs function. It takes three arguments: the URL of the login page as a String, the username and the password.

public static boolean loginExecuted(String strDomainUrl, 
                                    // user-supplied
                                    String strUsername, 
                                    // user-supplied
                                    String strPassword) 
                                    // user-supplied
                                    throws Exception {
        boolean bSuccess = false;

        // Create a new HTTP client and a
        // connection manager
        DefaultHttpClient httpclient = 
                                new DefaultHttpClient();

        // The GET method means retrieve whatever
        // information (in the form of an entity) is
        // identified by the Request-URI. If the Request-URI
        // refers to a data-producing process, it is the
        // produced data which shall be returned as the 
        // entity in the response and not the source text 
        // of the process, unless that text happens to be 
        // the output of the process.
        HttpGet httpget = new HttpGet(strDomainUrl);

        HttpResponse response = httpclient.execute(httpget);

        // An entity that can be sent or received with an
        // HTTP message. It is null when the response
        // carries no body.
        HttpEntity entity = response.getEntity();
        if (entity == null)
            return bSuccess;

        // The InputStream from the entity
        InputStream instream = entity.getContent();

        // Now we can call the function we built before.
        // Closing the stream in a finally block releases
        // the connection even if parsing fails.
        List<NameValuePair> loginNvps;
        try {
            loginNvps = buildNameValuePairs(
                                             instream,
                                             strUsername,
                                             strPassword);
        } finally {
            instream.close();
        }

        entity.consumeContent();

        if (loginNvps.size() > 0) {
            // The post method is used to request that 
            // the origin server accept the entity enclosed  
            // in the request as a new subordinate of the 
            // resource identified by the Request-URI in 
            // the Request-Line. Essentially this means 
            // that the POST data will be stored by the
            // server and usually will be processed 
            // by a server side application.
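            // We post back to the same URL we fetched. If the
            // form's action attribute points to a different
            // URL, that URL should be used here instead.
            HttpPost httpost = new HttpPost(strDomainUrl);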

            httpost.setHeader("User-Agent",
                       "Mozilla/5.0 (compatible; MSIE 7.0; "
                       + "Windows 2000)");

            httpost.setEntity(
                     new UrlEncodedFormEntity(loginNvps, 
                                              HTTP.UTF_8));

            response = httpclient.execute(httpost);
            entity = response.getEntity();

            if (entity != null)
                entity.consumeContent();
            
            
            // At this point we can handle the connection
            // as we like. For example we can check whether
            // the procedure was successful (and set
            // bSuccess = true), read the content,
            // download files...
        }

        // return statement
        return bSuccess;
    }
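
The function above leaves the success check open. As one possible approach, here is a rough sketch of a heuristic that decides whether a login probably worked from the status code and body of the POST response. In HttpClient these would come from response.getStatusLine().getStatusCode() and EntityUtils.toString(entity). The helper looksLoggedIn and its rules are assumptions of mine, not something HttpClient or Jericho provide, and real sites may behave differently (a failed login can redirect too), so treat it as a starting point only.

```java
import java.util.Locale;

// Hypothetical helper class, for illustration only.
class LoginCheckSketch {

    // Guess whether the login succeeded, given the status
    // code and the body of the response to the POST.
    static boolean looksLoggedIn(int statusCode, String body) {
        // Assumption: the site redirects to a landing page
        // once the credentials have been accepted.
        if (statusCode == 302 || statusCode == 303)
            return true;
        // Any other non-200 status is treated as a failure.
        if (statusCode != 200)
            return false;
        // Assumption: a page that still shows a password
        // field is the login form being displayed again.
        String lower = (body == null)
                ? "" : body.toLowerCase(Locale.ROOT);
        return !lower.contains("type=\"password\"");
    }

    public static void main(String[] args) {
        String form =
            "<input type=\"password\" name=\"password\" />";
        System.out.println(looksLoggedIn(302, ""));   // true
        System.out.println(looksLoggedIn(200, form)); // false
        System.out.println(looksLoggedIn(500, ""));   // false
    }
}
```

A stricter check would look for something only a logged-in user sees, such as a logout link, but that depends entirely on the target site.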

That's all for now (I will revisit this post as soon as possible to fix any errors).
Best regards.