Wednesday, January 5, 2011

Parsing an HTML string to get links in JavaScript

Recently I needed a JavaScript function to extract the links from an HTML string. Unfortunately I couldn't use powerful third-party tools like jQuery, so I decided to use a regular expression.
Let's assume we have an HTML page like this:
<html>
    <body>
        <a href="google.com" title="Google Site">Google</a>
        <a href="mozilla.com" title="Mozilla Site">Mozilla</a>
        <a href="blogger.com" title="Blogger Site">Blogger</a>
    </body>
</html>
This page contains links to Google, Mozilla and Blogger. How can we get the links from the HTML content?
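<script language="JavaScript" type="text/javascript">
function getLinks() {
    // Build the sample HTML by concatenation so that each anchor keeps
    // exactly one space between its href and title attributes.
    var html = "<html>" +
               "<body>" +
               "<a href=\"google.com\" " +
                  "title=\"Google Site\">Google</a>" +
               "<a href=\"mozilla.com\" " +
                  "title=\"Mozilla Site\">Mozilla</a>" +
               "<a href=\"blogger.com\" " +
                  "title=\"Blogger Site\">Blogger</a>" +
               "</body>" +
               "</html>";

    var links = [];

    // replace() is used only to visit each match; its return value is ignored.
    html.replace(
        /[^<]*(<a href="([^"]+)" title="([^"]+)">([^<]+)<\/a>)/g,
        function() {
            // arguments 1 to 4: whole anchor, href, title, link text
            links.push(Array.prototype.slice.call(arguments, 1, 5));
        });

    alert(links.join("\n"));
    return links;
}
</script>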
The getLinks() function retrieves the links from the HTML content and puts them into an array. For each match, the slice method creates a new array from a selected section of the arguments passed to the callback: the whole anchor, the href value, the title value and the link text. See the documentation of the slice method for more information.
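To make the role of slice clearer, here is a minimal sketch of what the replace callback receives for a single match. The helper name describeMatch and the sample string are assumptions made for illustration; the regex is the same one used in getLinks(), without the g flag.

```javascript
// Hypothetical helper: returns every argument that replace() passes to its
// callback for the first anchor found in `html`, or null if nothing matches.
function describeMatch(html) {
    var captured = null;
    html.replace(
        /[^<]*(<a href="([^"]+)" title="([^"]+)">([^<]+)<\/a>)/,
        function () {
            // arguments = [match, anchor, href, title, text, offset, input]
            captured = Array.prototype.slice.call(arguments);
        });
    return captured;
}

var args = describeMatch(
    '<p><a href="google.com" title="Google Site">Google</a></p>');

// slice(1, 5) keeps only the four captured groups and drops the
// full match, the offset and the input string.
var row = args.slice(1, 5);
```

Here row holds the whole anchor, the href, the title and the link text, in that order, which is the shape of each entry pushed into links.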


So in the end we have an array of arrays. To retrieve a single element, we can access it as links[x][y], where x is the row (the link) and y is the column (the captured field).
For example, let's assume we want to extract some information from the first link:
alert("First link (Google):\n" +
      "Destination anchor: " + links[0][1] + "\n" +
      "\"title\" attribute: " + links[0][2] + "\n" +
      "Source anchor: " + links[0][3]);
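If the numeric column indexes are hard to remember, one option is to convert each row into an object with named fields. This is only a sketch: the helper toLinkObjects and the field names are hypothetical, and the sample row is written by hand in the same order as the rows described above.

```javascript
// Hypothetical helper: turns each [anchor, href, title, text] row
// into an object with named fields.
function toLinkObjects(links) {
    var result = [];
    for (var i = 0; i < links.length; i++) {
        result.push({
            anchor: links[i][0], // whole <a> element
            href: links[i][1],   // destination anchor
            title: links[i][2],  // "title" attribute
            text: links[i][3]    // source anchor (link text)
        });
    }
    return result;
}

// Hand-written sample row, in the same order as the captured groups.
var rows = [
    ['<a href="google.com" title="Google Site">Google</a>',
     "google.com", "Google Site", "Google"]
];
var objects = toLinkObjects(rows);
```

With this conversion, objects[0].href plays the role of links[0][1].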
The function has several limitations: for example, it is case sensitive and depends on the exact form of the A element. In the case above, the href and title attributes are both set, but if we have an A element like this:
<a href="google.com">Google</a>
without the title attribute, the function won't work. In that case, we should modify the regex as follows:
html.replace(
     /[^<]*(<a href="([^"]+)">([^<]+)<\/a>)/g, 
     function() {
        links.push(Array().slice.call(arguments, 1, 4));
    });
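Instead of keeping one regex per case, a single, slightly more tolerant pattern can cover both. The sketch below is not the original function: getLinksTolerant is a hypothetical name, it takes the HTML as a parameter, it makes the title attribute optional and it uses the i flag to ignore letter case. It still assumes that href comes first, that attribute values use double quotes and that the link text contains no nested tags, so it is not a general HTML parser.

```javascript
// Hypothetical variant of getLinks: accepts an optional title attribute
// and ignores letter case in tag and attribute names.
function getLinksTolerant(html) {
    var links = [];
    html.replace(
        /<a\s+href="([^"]+)"(?:\s+title="([^"]*)")?\s*>([^<]+)<\/a>/gi,
        function (match, href, title, text) {
            // title is undefined when the attribute is absent,
            // so fall back to an empty string.
            links.push([match, href, title || "", text]);
        });
    return links;
}

var found = getLinksTolerant(
    '<A HREF="google.com">Google</A> ' +
    '<a href="mozilla.com" title="Mozilla Site">Mozilla</a>');
```

Each row keeps the same four columns as before, so code that reads links[x][y] does not need to change; a missing title simply shows up as an empty string.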
Best regards.