LinkedIn Developer Meetup at Le Web on Dec 9

Want to learn more about how to develop or integrate with the LinkedIn platform? Then plan on attending the LinkedIn Developer Meetup at Kube Hotel, Paris (December 9th at 5:30pm CET / 17:30 hrs) that we’re holding on the sidelines of the Le Web Conference this year. A big thanks to Dave McClure and Loic Le Meur for helping with the developer meetup!

At the meetup, we’ll show you how you can use our Developer Portal to quickly get up and running. You’ll also get to hear from other developers who’ve already built integrations with our APIs. We’ll have ample time for Q&A and some networking after that (of course)!

LinkedIn Developer Meetup at Kube Hotel, Paris (Le Web 09) on Dec 9th, 09

Since we opened the LinkedIn APIs to all developers, we’ve seen a tremendous response from the developer community. Developers have quickly built a community at our Developer Portal, along with tools and techniques that make it easy to integrate with LinkedIn. These include a PHP library, a Ruby gem and .NET integration via DotNetOpenAuth.

We believe this rapid uptake is the result of our use of open standards such as OAuth, which lets developers integrate quickly using off-the-shelf libraries and open-source projects. I look forward to discussing these topics at the meetup mentioned above.

We’re excited to see integrations like Tweetdeck, Hootsuite, Sobees and many more that now allow you to bring LinkedIn to your desktop. Similar integrations announced earlier this month with Microsoft Outlook 2010, Blackberry, etc. will continue to help you extend your LinkedIn network to all areas of your professional life – wherever you work.

LinkedIn at QCon 2009

Code Alert! This is a part of our continuing series on Engineering at LinkedIn. If this isn’t your cup of Java, check back tomorrow for regular LinkedIn programming. In the meanwhile, check out some of our recent feature announcements, tips and tricks, or success stories.

Over fifty million people use LinkedIn to find, connect and collaborate with professionals worldwide, do business and build their careers. It is the mission of LinkedIn’s engineering and operations teams to build and scale LinkedIn’s systems and services, and to develop features and functionality that make the LinkedIn service effective and delightful to use.

Apart from the traditional challenges of running a high-traffic 24/7 service, we also face the unique problems of scaling the professional graph and making all our services interoperable while respecting privacy and visibility rules along that graph. Another challenge is making sure we present to busy professionals the most relevant information possible – whether it’s the network updates stream, search results or other content they choose to share. LinkedIn engineers and architects recently shared some of the technologies we are working on at the QCon San Francisco 2009 conference.

Jay Kreps talked about scaling distributed storage at LinkedIn in his presentation – Project Voldemort: Scaling Simple Storage At LinkedIn.

Sean Dawson and I talked about the LinkedIn Network Updates Service – the information stream you see every day when you hit the LinkedIn homepage. We will continue sharing our experiences and technologies with the engineering community – as we have done in the past.

Please take a moment to visit the LinkedIn Technology Careers site, join a great team of engineers and make a difference in how professionals do business.

LinkedIn at ApacheCon 2009

I have the honor of presenting my talk entitled “Empowering the Social Web with Apache Shindig” at the 10th annual Apache Conference this week. I’ll talk about how Apache Shindig and OpenSocial standards power our very own LinkedIn InApps Platform and hundreds of other social containers on the web. Read on for more, but first a minor digression on LinkedIn, the Apache Software Foundation, and Open Source.

LinkedIn Runs on Apache

The Apache Software Foundation was founded in 1999 to foster development of the open source Apache HTTP Server, which powers millions of web sites worldwide. Based on this success, other projects were added over time. Each project is able to take advantage of the technical, legal and organizational resources provided by Apache. Today there are hundreds of high quality projects under the Apache umbrella.

High-quality, open Apache software is in heavy use at LinkedIn. Many of our servers run Apache Tomcat. We build our software with Apache Ant and Ivy. Diverse and useful libraries such as Apache HttpClient, Commons, and Lucene provide great functionality for LinkedIn with less effort. This allows us to focus on what we do best: providing a great web experience to our members.

Leveraging Apache software is only the start. LinkedIn actively contributes code, design and testing to many Apache projects. These efforts ensure that these projects continue to grow and evolve to meet our future challenges. In addition to our contributions to Apache Shindig, you’ll find LinkedIn active in the Apache Lucene community, where we’ve developed a number of extensions to this powerful search technology. LinkedIn code provides faceted search via bobo-browse, real time indexing with zoie, and extra performance with the kamikaze search extension. We’ve also released our data storage solution, Voldemort, under the Apache License.

Shindig, powering InApps at LinkedIn since 2008

At ApacheCon I’ll be talking about Apache Shindig, a framework that renders InApps like LinkedIn Events, Amazon Reading List, TripIt and 8 other applications. Shindig converts these applications into web content on the home page, profile page and full page views. The Shindig REST API allows our internal and external developers to access data using the OpenSocial and Portable Contacts standards.

During the past months, our involvement with Shindig has reaped benefits for LinkedIn members and the developers we partner with. We incorporated numerous performance enhancements that have sped up page load times for InApps. These recent updates also include support for OpenSocial 0.9, which allows for easier, faster development of applications. New features include OpenSocial Templates, a new lightweight JavaScript API and “Data Pipelining”, which reduces page load time. By applying these new features, applications such as Company Buzz now load much faster.

Over this same time period LinkedIn has contributed back to the Shindig and OpenSocial community. Our diligent QA teams have helped to find and fix cross-browser compatibility issues. Code contributions have flowed steadily back to the project. And we continue to work with the community to build and release the next version of Shindig, version 1.1, and future versions targeting the upcoming OpenSocial 1.0 standard.

Doing More, Learning More

At LinkedIn I’m proud to have witnessed our numerous contributions to the open source and Apache communities. By collaborating with our peers we have achieved much more than we could have by going it alone.

If you’re interested in learning more about Shindig and OpenSocial you can still register for ApacheCon and see my talk. If a more informal setting is to your liking you can attend the free Apache Social and Widgets Meetup this Thursday, November 5th 2009, which is sponsored by LinkedIn.

Java One 2009: The Secret Sauce that helps scale LinkedIn

Java One 2009 has come and gone, and once again the engineering team at LinkedIn had the opportunity to make a few presentations that we’d like to share on the blog. Earlier this week, Brandon and Yegor shared their presentation on this blog. In addition, Dhananjay and I were given the opportunity to deliver a technical session at Java One 2009 on how LinkedIn stores its data. We both had a grand time presenting to some 200+ engineering folks on how we have built our services to manage the data storage platform. The presentation was extremely well received, and we just learned that our session was chosen as a Top session at the conference and will be linked to from the Java One conference homepage.

In addition, we’ve received requests for a copy of the slides from many of you, so we have embedded them in this post as well. Please feel free to share this content with your peers and stay tuned for more around this exciting area on the blog. We look forward to hearing your comments.

Java One 2009: Building Consistent RESTful APIs in a High Performance Environment

At this year’s JavaOne conference, Yegor Borovikov and I had the opportunity to present details of our RESTful API framework. Our Birds of a Feather presentation is titled “Building Consistent RESTful APIs in a High Performance Environment” and it describes our use of a coherent domain model as the foundation for our APIs. Flip through the various slides in the embed below and feel free to leave a comment or two.

Also, stay tuned for another Java One Presentation from my colleagues David Raccah and Dhananjay Ragade, later this week.

Project Voldemort (Part II): How it works

In my last blog entry I described what LinkedIn is doing with our open source key-value storage system, Project Voldemort. In this entry I will talk about how Voldemort works, and what features we will be adding to it.

With Voldemort we hope to scale both the amount of data we can store and the number of requests for that data. Naturally the only way to do this is to spread both the load and the data across many servers. But spreading across servers creates two key problems:

1.   You must find a way to split the data up so that no one server has to store everything
2.   You must find a way to handle server failures without interrupting service

Scaling

The first point is fairly obvious: if you want to handle more requests you need more machines, and if you want to handle more data you need more disks and memory (and servers to hold the disks and memory). But there are still a number of subtleties.

Any system that doesn’t maintain local state can easily be scaled by just making more copies of it and using a hardware load balancer to randomly distribute requests over the machines. Since the whole point of a storage system is to store things, this becomes somewhat more difficult: if we randomly distribute writes then the data will be different on each machine, and if we write to every machine then we will potentially have dozens of machines to update on each write.

In order to use all the machines effectively, the data in Voldemort is split up amongst the servers in such a way that each item is stored on multiple machines (the user specifies how many). This means that you first have to figure out which servers hold a given item. This partitioning is done via a consistent hashing mechanism that lets any server calculate the location of data without doing any expensive lookups.
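As a rough illustration of the idea, here is a minimal consistent hashing sketch in Ruby. It is not Voldemort's actual code: the `HashRing` class, the MD5-based ring positions, the 64 points per server and the server names are all assumptions made for the example. Each server is placed at several points on a ring, and a key is stored on the first distinct servers found walking clockwise from the key's own position.

```ruby
require 'digest/md5'

# Hypothetical illustration of consistent hashing; not Voldemort's implementation.
class HashRing
  # nodes: array of server names
  # points_per_node: how many positions each server occupies on the ring
  def initialize(nodes, points_per_node = 64)
    @ring = {}
    nodes.each do |node|
      points_per_node.times { |i| @ring[point_for("#{node}-#{i}")] = node }
    end
    @points = @ring.keys.sort
  end

  # Returns up to `replicas` distinct servers for `key`, found by walking
  # clockwise around the ring starting at the key's position.
  def nodes_for(key, replicas = 2)
    target = point_for(key)
    start = @points.bsearch_index { |point| point >= target } || 0
    owners = []
    @points.size.times do |offset|
      node = @ring[@points[(start + offset) % @points.size]]
      owners << node unless owners.include?(node)
      break if owners.size == replicas
    end
    owners
  end

  private

  # Map a string to a ring position using the first 8 hex digits of its MD5.
  def point_for(text)
    Digest::MD5.hexdigest(text)[0, 8].to_i(16)
  end
end

ring = HashRing.new(%w[server-a server-b server-c])
owners = ring.nodes_for("member:42", 2)
```

Any client or server that knows the list of servers can build the same ring and arrive at the same `owners` for a key, which is what removes the need for a lookup service.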

This kind of partitioning is commonly done to improve the performance of write requests (since without it, every single server would have to be updated every time you did a write). What is not commonly understood is that this is also required to improve read performance. Memory access is thousands of times faster than disk access, so the ratio of memory to data is the critical factor for assessing the performance of a storage system. By partitioning the data you increase this ratio by shrinking the data on each machine. Another way to think of this is as improving cache locality: if requests are randomly balanced over all machines then “hot” items end up in cache on all servers and the hit ratio is fairly low, whereas partitioning the storage among machines dramatically improves the cache hit ratio.

Detecting Failure

To handle the second problem, server failure, any distributed system must do some kind of failure detection. Typically this is done by some kind of heartbeat mechanism: each server pings some master coordination nodes (or each other) to say “Hi, I am still alive!” In the simple case, if a node fails to ping for some time then it is assumed to be dead.

But this raises a few problems. First, there aren’t any master nodes in the Voldemort cluster; each node is a peer. So what if one server gets a ping and another does not? Then the servers will have differing views of who is and is not alive. In fact, maintaining the state about who is alive is the exact same distributed state management problem we were trying to solve in the first place.

The second problem is a bit more existential: what does it mean to be alive? Indeed, just because a server is alive enough to say “hi!” or “ping!” doesn’t mean it is alive enough to correctly service requests with low latency. One solution is to increase the complexity of the ping message to include a variety of metrics on the server’s performance, and then make a prediction as to whether that server is alive or not. But what we do is much simpler. Since Voldemort only has a few types of request (PUT, GET, DELETE, etc.) and since each server is getting hundreds of these requests per second, why invent a new ping request to detect liveness? Instead, since each of these requests has similar performance, it makes sense to simply set an SLA (service level agreement) for the requests and ban servers that cannot meet it (this could be because they are down, because requests are timing out, or many other reasons). Servers that violate this SLA get banned for a short period of time, after which we attempt to restore them (which may lead to them getting banned again).
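The banning scheme described above can be sketched in a few lines of Ruby. This is only an illustration under assumed names and thresholds (`SlaFailureDetector`, a 100ms SLA, a 10 second ban), not the code Voldemort ships: every ordinary request reports its latency and outcome, and a node that misses the SLA is skipped until its ban expires.

```ruby
# Hypothetical sketch: the class name, thresholds and structure are
# assumptions for illustration, not Voldemort's actual implementation.
class SlaFailureDetector
  def initialize(sla_seconds: 0.1, ban_seconds: 10.0)
    @sla_seconds = sla_seconds
    @ban_seconds = ban_seconds
    @banned_until = {} # node name => time at which its ban expires
  end

  # Record the outcome of an ordinary request (GET, PUT, DELETE).
  # A failed request, or one slower than the SLA, bans the node for a while.
  def record(node, elapsed_seconds, succeeded, now = Time.now.to_f)
    if !succeeded || elapsed_seconds > @sla_seconds
      @banned_until[node] = now + @ban_seconds
    end
  end

  # A node is available if it was never banned or its ban has expired.
  def available?(node, now = Time.now.to_f)
    expiry = @banned_until[node]
    expiry.nil? || now >= expiry
  end
end

detector = SlaFailureDetector.new(sla_seconds: 0.1, ban_seconds: 10.0)
detector.record("node-1", 0.02, true, 100.0) # fast and successful: no ban
detector.record("node-2", 0.50, true, 100.0) # too slow: banned until t = 110.0
```

Once the ban expires the node simply receives traffic again, and the next slow or failed request bans it again, which matches the "attempt to restore" behaviour described above.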

This is a fairly simple mechanism for the user of Voldemort to work with, since they may have their own SLA they need to maintain (e.g. serve 99% of pages in less than 100ms). The simplicity of the query model actually becomes something of an advantage in this kind of performance analysis. The three Voldemort queries have known performance, so it is very easy to predict the load a new feature will generate by just counting the number of requests. This is always a challenge with SQL: poorly designed SQL queries may produce thousands of times more load. Compounding this problem, distinguishing the bad queries from the good requires knowing both the index structure and the data on which the query will run, neither of which is present in your code. It is therefore easy for an inefficiency to slip past even a diligent review if you don’t perform real tests on real data for each modification to see what query plan will be generated.

Dealing With Failure

The redundancy of storage makes the system more resilient to server failure. Since each value is stored N times, you can tolerate as many as N – 1 machine failures without data loss. This causes other problems, though. Since each value is stored in multiple places, it is possible that one of these servers will not get updated (say, because it has crashed when the update occurs). To help solve this problem Voldemort uses a data versioning mechanism called vector clocks, which is common in distributed programming. This is an idea we took from Amazon’s Dynamo system. This data versioning allows the servers to detect stale data when it is read and repair it.
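To make the versioning idea concrete, here is a small vector clock sketch in Ruby. The representation (a Hash from server name to a write counter) and the `VectorClock` module are assumptions for illustration; Voldemort's own classes may look quite different. Comparing two clocks tells you whether one copy is simply stale or whether the two copies were written concurrently and conflict.

```ruby
# Hypothetical sketch of vector clock comparison; the Hash representation
# and method names are assumptions, not Voldemort's API.
module VectorClock
  module_function

  # Returns a new clock with `node`'s counter advanced by one.
  def increment(clock, node)
    clock.merge({ node => clock.fetch(node, 0) + 1 })
  end

  # Returns :before if a is older than b, :after if a is newer,
  # :equal if they match, and :concurrent if neither descends from the other.
  def compare(a, b)
    nodes = a.keys | b.keys
    a_ahead = nodes.any? { |n| a.fetch(n, 0) > b.fetch(n, 0) }
    b_ahead = nodes.any? { |n| b.fetch(n, 0) > a.fetch(n, 0) }
    if a_ahead && b_ahead
      :concurrent
    elsif a_ahead
      :after
    elsif b_ahead
      :before
    else
      :equal
    end
  end
end

v1 = VectorClock.increment({}, "A") # first write, handled by server A
v2 = VectorClock.increment(v1, "B") # later write handled by server B
v3 = VectorClock.increment(v1, "C") # a different write handled by server C

stale_check = VectorClock.compare(v1, v2)    # v1 is an ancestor of v2
conflict_check = VectorClock.compare(v2, v3) # neither descends from the other
```

A copy whose clock compares as `:before` can be overwritten during read repair, while a `:concurrent` result is the case that has to be resolved or handed back to the application.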

The first advantage of this mechanism is that it does not require a consistent view of which servers are working and which are not. If server A cannot get to server C, but server B can get to C, that will not break the versioning system. This kind of partial connectivity is especially common during transient failures.

Another advantage relates to challenges faced with expanding data centers. Requests between data centers that are physically remote have much higher latency than requests within a data center (10 or 100x slower depending on geography). With most storage systems it isn’t possible to take concurrent writes across multiple data centers without risk of losing data, since if two updates occur at once, one in each data center, there is no principled way for the storage system to choose between them. By versioning the data you can allow the results to be resolved or the conflict to be given back to the application for resolution.

Picking A Name

I should probably also mention how it got its name, since that is something I got a lot of questions about. I wanted to come up with a name that was distinctive and a little self-deprecating (projects shouldn’t take themselves too seriously). At the time I was reading the last Harry Potter book, and Voldemort had split himself into many pieces each of which had to be destroyed to kill him. I thought, “that sounds like a distributed system”. I don’t know whether it is nerdier to be reading Harry Potter or to be wondering what kind of consistency protocol Voldemort uses when keeping all his pieces up-to-date, but regardless, the name stuck.

For more information on the project, check out the web site and mailing list. Patches, bug reports, and suggestions are gladly accepted.

Quick Update: Interested in similar projects? Check out the job openings at LinkedIn to work on these challenges full time.

Project Voldemort: Scaling Simple Storage at LinkedIn

About a month ago LinkedIn released the code for an open source distributed storage system called Project Voldemort. I wanted to give a little more information about what it is good for, how it came to be, and what our plans are for the future.

Some Background

Like a lot of websites, LinkedIn started with a single big database and a cluster of front-end servers (unlike a lot of websites it also started out with a big social graph in memory on remote machines, but that is a different story). As we grew, this database got split into a variety of remote services for serving up profiles, performing searches, interacting with groups, maintaining network updates, fetching companies, etc. These databases may have read-only replicas, but we didn’t have a system for scaling writes.

Unfortunately for engineers and DBAs, many of the rich features that people expect from a modern internet site either require massive data sets or high write loads, or both. This became a problem as we looked at how to scale some write-intensive features like Who’s Viewed My Profile that require as many updates as reads. We faced a similar scale problem for offline computed data, such as finding similar profiles—the set of all user profiles is very large, but even a modest subset of the set of all user profile pairs is quite huge.

To handle this problem we looked at the systems other internet companies had built. We really liked Google’s Bigtable, but we didn’t think it made sense to try to build it without access to a low-latency GFS implementation. Our primary goal was to get low-latency, high-availability access to our data. For complex analysis we had Hadoop and databases, for complex queries we had a distributed search system, and the goal wasn’t to try to duplicate any of these. We were inspired by Amazon’s Dynamo paper, which seemed to meet our needs and to be feasible to implement with low-latency queries; much of our design for Project Voldemort comes from that.

Our experience with the system so far has been quite good. We were able to move applications that needed to handle hundreds of millions of reads and writes per day from over 400ms to under 10ms while simultaneously increasing the amount of data we store.

Open Source

LinkedIn is a big open source user, and we have contributed back a number of the improvements to Lucene we have made such as Zoie, Kamikaze, and Bobo. Most of the things we build are pretty LinkedIn-specific, but things like search and storage are pretty much stand-alone and we are happy to get other users (and contributors!). I myself have been a long-time open source lurker—I am the first to check out the source, but rarely have the time to make any improvements. Fortunately, even if most people are as lazy as me, not all are. In the last few months we have got close to 50 contributions from people around the world. Some are small, just doing a little cleanup, and others have been quite substantial introducing new features or major code improvements.

The long-term success of an open source project depends on its not being controlled by a single company, person or group, but forming a real self-sustaining group of interested developers. This is our goal in working on the open source project. LinkedIn is not a storage systems company, and neither are the other web companies facing some of the same problems, so we think we can all benefit by sharing our work in this area.

The Future

So what is next for the project? The most important work on a storage system is always improving performance and reliability. But there are a couple of other things in the works. We are working on making it easier to incrementally add to clusters of servers, improving our support for batch computed data from Hadoop, and implementing some clients in other programming languages.

For more information on the project, check out the main site. We are always looking for contributors to the project, so if you are interested check out the projects page and mailing list. Ideas, bug reports, patches, etc. are all gladly accepted.

Interested in similar projects? Check out the job openings at LinkedIn to work on them full time.

Stay tuned for upcoming blog posts that will reveal more details of the system internals and some of the lessons learned.

OSGi at LinkedIn – Bundle repositories

When you start using OSGi, the very first problem you are going to face is that OSGi requires bundles. A bundle is nothing more than a jar file with extra manifest information. Here is a ‘typical’ example of a manifest for an OSGi bundle (the OSGi-specific headers are Bundle-Activator, Import-Package, Export-Package and the other Bundle-* entries).

Bundle-Activator: com.linkedin.colorado.helloworld.client.HelloWorldClientActivator
Import-Package: com.linkedin.colorado.helloworld.api;version="[1.0.0,1.0.1)",
 com.linkedin.colorado.helloworld.client;version="[1.0.0,1.0.1)",
 org.osgi.framework;version="[1.4.0,1.4.1)"
Export-Package: com.linkedin.colorado.helloworld.client;version="1.0.0"
Bundle-Version: 1.0.0
Bundle-Name: colorado-helloworld-client
Bundle-ManifestVersion: 2
Bundle-SymbolicName: colorado-helloworld-client
Bnd-LastModified: 1224557973403
Generated-On: 20081020
Tool: Bnd-<unknown version>
Implementation-Version: DevBuild

This is how you instruct the OSGi container about your dependencies (Import-Package), what you provide (Export-Package), how you become active (Bundle-Activator), etc…

So why did I start this post by saying it was a problem? The answer is actually two-fold:

  1. you need to generate those headers for your own jar files and it can be quite challenging if you want to do it manually (our biggest bundle currently has over 760 import package entries!)
  2. all external libraries that you require also need to be a bundle (libraries that you do not control like log4j, xerces,…)

In this post I will be concentrating on problem #2 and I will come back to problem #1 in a later post.

Let’s start with some numbers. As of this writing (January 2009), our repository of external libraries contains 200 jar files. Only 8 of them are bundles out of the box (4%). I believe that this small sample reflects the harsh reality out there: over 95% of the available libraries are not bundles.

So what is the solution? The answer is unfortunately not that easy. For starters, you should definitely check the bundle repository that SpringSource offers for free. It contains a good list of libraries that have been converted to bundles (they even have a full-time employee just for this ongoing task!). One of the big issues is that it is hard for any such repository to keep up, as new library versions pop up on a daily basis, including snapshots. The benefits of using snapshots are debatable, but in practice sometimes you just don’t have a choice! In our case, we just cannot afford to rely solely on the availability of bundles. So here is the approach that we took:

What we are trying to achieve is to convert a repository of libraries (96% jar files (blue)) into a repository of bundles (100% bundles (red)). For this we use bnd, ivy and some custom code.

Our repository of external libraries uses ivy for dependency management (note that the process would be very similar with maven). Using ivy resolution, it is relatively easy to build the (acyclic) graph of dependencies between all the libraries. All the leaves represent libraries that do not have dependencies on other libraries (Step 1).

bnd is a tool that analyzes a jar file and can create OSGi manifest headers. In Step 2, we iterate over each leaf and feed it to bnd to generate a bundle. We use some custom code (ant tasks) to have more control over what is provided as input to bnd and over the errors and warnings that we need. In Step 3, we repeat the same process one level up the dependency graph. This time we know we are dealing with libraries that have dependencies, but we also know that those dependencies have been properly converted into bundles, so the classpath (which is one of the inputs to bnd) will contain only proper bundles. With the proper classpath, bnd will be able to generate the proper manifest entries (the version and resolution attributes of the Import-Package entries will be correct). We continue recursively all the way up the chain of dependencies until we have converted the entire repository.
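The order of conversion described in these steps is a level-by-level walk up the dependency graph. The Ruby sketch below shows only that ordering logic. The `conversion_levels` function and the example dependency graph are made up for illustration, and the actual call to bnd for each jar is left out.

```ruby
# Hypothetical sketch: the function name and the example graph are
# assumptions; running bnd on each library is deliberately omitted.
# deps maps each library to the list of libraries it depends on.
def conversion_levels(deps)
  remaining = deps.dup
  converted = []
  levels = []
  until remaining.empty?
    # A library is ready once everything it depends on is already a bundle.
    ready = remaining.select { |_lib, needs| (needs - converted).empty? }.keys
    raise ArgumentError, "cyclic or missing dependency" if ready.empty?
    levels << ready.sort
    converted.concat(ready)
    ready.each { |lib| remaining.delete(lib) }
  end
  levels
end

deps = {
  "log4j" => [],
  "commons-codec" => [],
  "commons-logging" => ["log4j"],
  "httpclient" => ["commons-logging", "commons-codec"]
}
levels = conversion_levels(deps)
# levels[0] holds the leaves (Step 2); each later level is one step up (Step 3).
```

Processing one level at a time guarantees that when a library is handed to bnd, every jar on its classpath has already been turned into a bundle.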

Overall this process works quite well but there are several issues that I want to point out:

  1. The result clearly depends on the quality of the original repository in terms of dependencies. If the dependencies are wrong or missing, the end result will be of lesser quality: imports get the "resolution:=optional" attribute, which can lead to the dreaded NoClassDefFoundError when deploying in the OSGi container. To fix this issue we need a clean repository, and thanks to the error reporting added to this process we can now detect where it is not clean.
  2. The only change this process really makes is adding manifest headers to the jar file; the content of the jar file itself is not modified. If the jar file was signed, then changing the manifest breaks the overall signature even if you do not touch any of the headers containing the signatures of individual classes. In our case it is ok to fix this by simply removing the signature entirely.
  3. This process does not fix libraries that are simply not OSGi compatible. For example, OSGi does not support classes in the default package (which the jdom library exposes), and some libraries have class loading issues (the famous Class.forName() OSGi issue). To fix this problem (which in my experience has been very rare), we have been using the SpringSource versions.

The last point I wanted to raise is my concern that there isn’t a ‘one-size-fits-all’ repository. Even with the amazing work that SpringSource is doing with the free repository, you get their interpretation of dependencies. For example, the jdom bundle (version 1.0) has the following entry: Import-Package: org.jaxen;version="[1.1.1, 2.0.0)";resolution:="optional"

The above entry basically means that jdom depends optionally on the org.jaxen package, from version 1.1.1 up to but not including 2.0.0. This may work for you or not depending on your needs. In our case we like tighter version ranges ("[1.1.1, 1.1.2)"). What if jaxen v.1.2.3 ends up having a show-stopper bug when used in conjunction with jdom, but you still need it for other parts of your code and you end up deploying it in the same container? Stay tuned for a separate post entirely dedicated to version management soon.
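To see the difference between the two ranges, here is a small Ruby sketch that checks a version against an OSGi-style range. The helper names are hypothetical, and it assumes plain numeric versions with the same number of parts (OSGi versions may also carry a qualifier, which is ignored here). A square bracket includes the endpoint and a parenthesis excludes it.

```ruby
# Hypothetical helpers for OSGi-style version ranges such as "[1.1.1, 2.0.0)".
# Assumes purely numeric, same-length versions; qualifiers are not handled.
def parse_version(text)
  text.strip.split(".").map(&:to_i)
end

def in_range?(version, range)
  match = range.strip.match(/\A([\[(])\s*([\d.]+)\s*,\s*([\d.]+)\s*([\])])\z/)
  raise ArgumentError, "not a version range: #{range}" unless match

  v = parse_version(version)
  low = parse_version(match[2])
  high = parse_version(match[3])
  # "[" and "]" include the endpoint, "(" and ")" exclude it.
  above = match[1] == "[" ? (v <=> low) >= 0 : (v <=> low) > 0
  below = match[4] == "]" ? (v <=> high) <= 0 : (v <=> high) < 0
  above && below
end

wide = in_range?("1.2.3", "[1.1.1, 2.0.0)")  # accepted by the SpringSource range
tight = in_range?("1.2.3", "[1.1.1, 1.1.2)") # rejected by the tighter range
```

With the wider range, the hypothetical jaxen 1.2.3 from the example above would be wired to jdom, while the tighter range keeps it out.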

Also, check out our series on OSGi at LinkedIn.

Implementing Dijkstra’s Algorithm in Ruby

Recently in a LinkedIn LED hack day, I got a chance to play around with data to analyze the social graph. In order to compute some results in real-time, I needed an efficient way to find the shortest path between two nodes in a graph and Dijkstra’s algorithm came to mind.

Dijkstra’s algorithm is a graph search algorithm that solves the single-source shortest path problem for a graph with non-negative edge path costs, outputting a shortest path tree. This algorithm is often used in routing. (Wikipedia)

I searched on the web a bit, but I couldn’t find a Ruby implementation so I decided to write my own. It ended up being pretty easy to implement. I decided to post it here in case someone in the future wants to save 30 minutes or so implementing it. Please note that it makes use of a priority queue library written by K. Kodama.

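require 'pqueue'

class Algorithm
  INFINITY = 1 << 32

  # source:  index of the start node
  # edges:   edges[v] lists the neighbours of node v (or is nil)
  # weights: weights[v][w] is the cost of the edge from v to w
  # n:       number of nodes
  # Returns [shortest_distances, previous].
  def self.dijkstra(source, edges, weights, n)
    visited = Array.new(n, false)
    shortest_distances = Array.new(n, INFINITY)
    previous = Array.new(n, nil)
    # Order the queue so the node with the smallest known distance pops first.
    pq = PQueue.new(proc {|x,y| shortest_distances[x] < shortest_distances[y]})

    pq.push(source)
    visited[source] = true
    shortest_distances[source] = 0

    while pq.size != 0
      v = pq.pop
      visited[v] = true
      if edges[v]
        edges[v].each do |w|
          # Relax the edge v -> w if it gives a shorter route to w.
          if !visited[w] and shortest_distances[w] > shortest_distances[v] + weights[v][w]
            shortest_distances[w] = shortest_distances[v] + weights[v][w]
            previous[w] = v
            pq.push(w)
          end
        end
      end
    end
    return [shortest_distances, previous]
  end
end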
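The `previous` array returned by the method records, for each node, the node it was reached from. A small helper can turn that into an actual path. The `shortest_path` function below is a hypothetical addition, not part of the original code, and the example `previous` array is hand-written so the snippet runs without the pqueue library.

```ruby
# Hypothetical helper: walks the `previous` array backwards from the target
# until it reaches the source, building the path front to back.
def shortest_path(previous, source, target)
  path = [target]
  while path.first != source
    step = previous[path.first]
    return nil if step.nil? # target is not reachable from source
    path.unshift(step)
  end
  path
end

# Hand-written example for a 4-node graph searched from node 0:
# node 1 was reached from 0, node 2 from 1, and node 3 was never reached.
previous = [nil, 0, 1, nil]
path_to_two = shortest_path(previous, 0, 2)
path_to_three = shortest_path(previous, 0, 3)
```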

Please let me know if you have any questions.

JDBC Connection Pooling for Rails on Glassfish

In Light Engineering (LED), we’re known to be multilingual – depending on the project, we’ve been known to speak Perl, Python, Java, C++, JavaScript and PHP, to name a few. Our weapon of choice is still Ruby on Rails, the popular MVC framework. Our belief is that Rails makes certain types of tasks easy, and others laughably trivial. That being said, LinkedIn is still primarily a Java shop, and for good reason. Java technologies are mature, proven, and all-around solid. For this reason, LED has a strong vested interest in the development work that is going into JRuby.

A few months ago, around the time JRuby 1.1.2 went live, we started switching some of our Rails applications to run on Glassfish. Using Warbler, we successfully wrapped our Rails applications into WAR files and deployed them on Glassfish (we’ll probably write a more detailed tutorial on this at a future date). A WAR file is a completely self-contained application that can be deployed simply by copying it to an autodeploy directory. No more Apache/Nginx reverse proxy, no more Capistrano, no more installing gems on a production container, no more of any of that madness. This was a huge win, and we broke out the champagne bottles.

But we weren’t done. We weren’t taking advantage of many Java technologies. Most notably, we weren’t using the JDBC connection pooling capabilities of the Glassfish application server for our MySQL database.

We started by reading this tutorial by Arun Gupta of Sun. The article is fantastic, but the one criticism I have is that it was written from the perspective of a master Java engineer who learned Rails, as opposed to that of a Rails engineer approaching JRuby.

From a high level, here are the steps needed to enable JDBC connection pooling for a Rails application running in a Glassfish container:

  1. Define a JDBC connection pool.
  2. Define a JDBC resource with a JNDI name.
  3. Download and install the MySQL connection adapter.
  4. Update database.yml to use JDBC.
  5. Configure ActiveRecord to disconnect after every query.

Believe it or not, there are only five steps. I have to admit, I was initially intimidated. Java allows so much power and flexibility that, to a novice, seeing a hundred configuration choices in the Glassfish admin web UI can be a deterrent. As it turns out, we only need to touch two parts of that UI. Let’s get started:

1. Define a JDBC connection pool.

Log in to the admin console of your Glassfish application server. In the navigation pane, expand Resources -> JDBC -> Connection Pools.

[Screenshot: Connection Pools]

Click new. You’ll be presented with three fields. The name is arbitrary, but you’ll need to know it later. Select javax.sql.DataSource as the resource type, and MySQL as the vendor.

[Screenshot: New JDBC Connection, Step 1]

The next screen will have more options. Change the datasource classname to com.mysql.jdbc.jdbc2.optional.MysqlConnectionPoolDataSource

Under additional properties, there should be fields configuring the database connection. Set these as appropriate.

[Screenshot: Edit Connection Pool]

2. Define a JDBC resource with a JNDI name.

JNDI stands for Java Naming and Directory Interface, and will allow us to create a standardized name for the JDBC connection pool we just created.

In the navigation pane, click through to Resources -> JDBC -> JDBC Resources. Click new.

[Screenshot: New JDBC Resource]

In the drop down box, select the JDBC connection pool created in step 1. For the JNDI name, it’s accepted practice to name it jdbc/connection_name.

3. Download and install the MySQL connection adapter.

tar zxvf mysql-connector-java-VERSION.tar.gz
cd mysql-connector-java-VERSION
ant
cp mysql-connector-java-VERSION-bin.jar $GLASSFISH_HOME/lib

You may have to restart Glassfish for the install to work:

asadmin stop-domain DOMAIN
asadmin start-domain DOMAIN

4. Update database.yml to use JDBC.

In config/database.yml, we need to tell Rails to use the connection pool rather than connect directly to the database. Here’s a snippet of our production configuration:

production:
  adapter: jdbc
  jndi: jdbc/polls
  driver: com.mysql.jdbc.Driver

That’s all you will need. Unlike standard configuration files, you do not need to specify things like the username, password or host because these are configured in the Application Server. I like this method because it means the engineer doing the deployment does not need to build the YAML file each time, check it in to SVN, or copy from a database.yml template with the production settings. It’s one less deployment step, and ultimately, one less item on the security checklist.
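As a quick sanity check, a configuration like this can be parsed with Ruby’s standard YAML library to confirm that it carries only the pool-related keys and no credentials. This is only a sketch: the YAML is inlined as a string rather than read from config/database.yml, and the list of credential keys is my own assumption about what a conventional configuration would contain.

```ruby
require 'yaml'

# Inlined copy of the production section shown above; a real check would
# read config/database.yml instead.
config = YAML.load(<<~YML)
  production:
    adapter: jdbc
    jndi: jdbc/polls
    driver: com.mysql.jdbc.Driver
YML

production = config['production']

# Keys the JNDI-based setup relies on.
missing = %w[adapter jndi driver] - production.keys

# Keys a conventional (non-JNDI) setup would typically carry; none should
# appear here because the application server holds them.
leaked = production.keys & %w[username password host]

puts "missing keys: #{missing.inspect}"
puts "credential keys present: #{leaked.inspect}"
```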

5. Configure ActiveRecord to disconnect after every query.

ActiveRecord maintains a persistent connection to the database. This is no longer necessary, as there is very little overhead in obtaining a connection through JDBC, which manages the connection persistence. We’ll need to disable this behavior by releasing the connection at the end of each request. I’m using code borrowed from Nick Sieger’s awesome presentation at RailsConf 2008:

# config/initializers/close_connections.rb
if defined?($servlet_context)
  require 'action_controller/dispatcher'

  ActionController::Dispatcher.after_dispatch do
    ActiveRecord::Base.clear_active_connections!
  end
end
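To see why handing the connection back after each request matters, here is a toy model in plain Ruby. It is not ActiveRecord or Glassfish code: ToyPool and handle_request are hypothetical names invented for this sketch, and the pool size of 2 is arbitrary. The point is only that a small pool can serve many requests as long as each request returns what it borrowed.

```ruby
# Toy stand-in for a container-managed connection pool (illustration only).
class ToyPool
  def initialize(size)
    @free = Array.new(size) { |i| "conn-#{i}" }
  end

  # Borrow a connection, failing loudly if none are left.
  def checkout
    @free.pop or raise "pool exhausted"
  end

  # Return a borrowed connection to the pool.
  def release(conn)
    @free.push(conn)
  end

  def available
    @free.size
  end
end

# Borrow a connection for one request and always hand it back afterwards,
# which is the role the after_dispatch hook plays in the initializer above.
def handle_request(pool)
  conn = pool.checkout
  yield conn
ensure
  pool.release(conn) if conn
end

pool = ToyPool.new(2)

# Five sequential requests are served by a pool of only two connections.
served = (1..5).map do |i|
  handle_request(pool) { |conn| "request #{i} used #{conn}" }
end

puts served
puts "connections available afterwards: #{pool.available}"
```

If handle_request skipped the release, the third request in this model would fail with "pool exhausted".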

You’re done! Now all you have to do is build the WAR file and drop it in Glassfish’s autodeploy directory.
