Tuesday, June 30, 2009

Neo4j: a graph database

Neo4j is a graph database. It is an embedded, disk-based, fully transactional Java persistence engine that stores data structured in graphs rather than in tables. A graph (mathematical lingo for a network) is a flexible data structure that allows a more agile and rapid style of development.

According to Emil Eifrem,
the Neo4j database outperforms relational backends by more than 1000x for many increasingly important use cases.

Sunday, June 28, 2009

How to Exploit Multiple Cores

How to Exploit Multiple Cores for Better Performance and Scalability (by Todd Hoff)

InfoQ has this excellent talk by Brian Goetz on the new features being added to Java SE 7 that will allow programmers to fully exploit our massively multi-processor future. While the talk is about Java, it's really more general than that, and there's a lot to learn here for everyone.

Brian starts with a short, coherent, and compelling explanation of why programmers can't expect to be saved by ever faster CPUs and why we must learn to exploit the strengths of multiple core computers to make our software go faster.

Some techniques for exploiting multiple cores are given in an equally short, coherent, and compelling explanation: why divide and conquer is the secret to multi-core bliss, fork-join, how the Java approach differs from map-reduce, and lots of other juicy topics.

The multi-core "problem" is only going to get worse. Tilera founder Anant Agarwal estimates by 2017 embedded processors could have 4,096 cores, server CPUs might have 512 cores and desktop chips could use 128 cores. Some disagree saying this is too optimistic, but Agarwal maintains the number of cores will double every 18 months.

An abstract of the talk follows though I would highly recommend watching the whole thing. Brian does a great job.

Why is Parallelism More Important Now?

  •  Coarse-grained concurrency was all the rage for Java 5. The hardware reality has changed. The number of cores is increasing, so applications must now search for fine-grained parallelism (fork-join).
  •  As hardware becomes more parallel, with more and more cores, software has to find more and more parallelism to keep the hardware busy.
  •  Clock rates have been increasing exponentially over the last 30 years or so. That allowed programmers to be lazy because a faster processor would be released that saved your butt. There wasn't a need to tune programs.
  •  That wait-for-a-faster-processor game is up. Around 2003 clock rates stopped increasing. We hit the power wall. Faster processors require more power. Thinner chip conductor lines were required, and the thinner lines can't dissipate the increased power without overheating, which affects the resistance characteristics of the conductors. So you can't keep increasing clock rate.
  •  The fastest Intel CPU 4 or 5 years ago was 3.2 GHz. Today it's about the same or even slower.
  •  It's easier to build 2.6 GHz or 2.8 GHz chips. Moore's law wasn't repealed, so we can still cram more transistors on each wafer. So more processing power can be put on a chip, which leads to putting more and more processing cores on a chip. This is multicore.
  •  Multicore systems are the trend. The number of cores will grow at an exponential rate for the next 10 years: 4 cores at the low end, and at the high end 256 (Sun) and 800 (Azul) core systems.
  •  More cores per chip instead of faster chips. Moore's law has been redirected to multicore.
  • The problem is that it's harder to make a program go faster on a multicore system. A faster chip will run your program faster. If you have 100 cores, your program won't go faster unless you explicitly design it to take advantage of those cores.
  •  No free lunch anymore. You must now be able to partition your program so it can run faster by running on multiple cores. And you must be able to keep doing that as the number of cores keeps growing.
  •  We need a way to specify programs so they can be made parallel as topologies change by adding more cores.
  • As hardware evolves, platforms must evolve to take advantage of the new hardware. We started off with coarse-grained tasks, which was sufficient given the number of cores. This approach won't work as the number of cores increases.
  • Must find finer-grained parallelism. Example: sorting and searching data. The opportunities are around the data. The data for sorting can be chunked, the chunks sorted, and the results brought together with a merge sort. Searching can be done in parallel by searching subregions of the data and merging the results.
  • Parallel solutions use more CPU in aggregate because of the coordination needed and because the data needs to be handled more than once (the merge). But the result comes back faster because the work is done in parallel. This adds business value. Faster is better for humans.
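The chunk-sort-merge idea above can be sketched in a few lines of Java. This is an illustrative sketch, not code from the talk: the class name, thread count, and data are made up, and it uses a plain ExecutorService to sort two halves on separate threads before a final merge pass.

```java
import java.util.Arrays;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Illustrative sketch of finding parallelism around the data:
// chunk the array, sort the chunks on separate cores, then merge.
public class ChunkSort {
    public static int[] parallelSort(int[] data) throws Exception {
        int mid = data.length / 2;
        final int[] left = Arrays.copyOfRange(data, 0, mid);
        final int[] right = Arrays.copyOfRange(data, mid, data.length);

        ExecutorService pool = Executors.newFixedThreadPool(2);
        Future<?> f1 = pool.submit(new Runnable() {
            public void run() { Arrays.sort(left); }   // sort one chunk
        });
        Future<?> f2 = pool.submit(new Runnable() {
            public void run() { Arrays.sort(right); }  // sort the other chunk
        });
        f1.get();  // wait for both sorts to finish
        f2.get();
        pool.shutdown();

        // The merge is the extra pass that parallel solutions pay for.
        int[] merged = new int[data.length];
        int i = 0, j = 0, k = 0;
        while (i < left.length && j < right.length)
            merged[k++] = left[i] <= right[j] ? left[i++] : right[j++];
        while (i < left.length) merged[k++] = left[i++];
        while (j < right.length) merged[k++] = right[j++];
        return merged;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(Arrays.toString(parallelSort(new int[]{5, 3, 8, 1, 9, 2})));
    }
}
```

In aggregate this burns more CPU than a single-threaded sort (two sorts plus a merge), but wall-clock time drops once the chunks are big enough to outweigh the thread coordination.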

    What has Java 7 Added to Support Parallelism?

  •  The example problem is to find the max number in a list.
  •  The coarse-grained threading approach is to use a thread pool, divide up the numbers, and let the task pool compute the subproblems. A shared task pool is slow as the number of tasks increases, which forces the work to be more coarse-grained. There's no way to load balance. The code is ugly and doesn't match the problem well. The runtime is dominated by how long the longest subtask takes to run. And you had to decide up front how many pieces to divide the problem into.
  • Solution: use divide and conquer. Divide the set into pieces recursively until the problem is so small the sequential solution is more efficient. Sort the pieces. Merge the results. O(n log n), but the problem is parallelizable. Scales well and can keep many CPUs busy.
  • Divide and conquer uses fork-join to fork off subtasks, wait for them to complete, and then join the results. A typical thread pool solution is not efficient: it creates too many threads, and creating threads is expensive and uses a lot of memory.
  • This approach is portable because it's abstract. It doesn't know how many processors are available. It's independent of the topology.
  • The fork-join pool is optimized for fine-grained operations, whereas the thread pool is optimized for coarse-grained operations. It's best used for problems without IO: just computations using CPU that tend to fork off subproblems. It allows data to be shared read-only and used across different computations without copying.
  • This approach scales nearly linearly with the number of hardware threads.
  • The goals for fork-join: avoid context switches; have as many threads as hardware threads and keep them all busy; minimize queue lock contention for data structures; avoid a common task queue.
  •  The implementation uses work-stealing. Each thread has a work queue that is a double-ended queue (deque). Each thread pulls work from the head of its queue and processes it. When there's nothing to do, it steals work from the tail of another thread's queue. There is no contention for the head because only one thread accesses it. Contention on the tail is rare because stealing is infrequent, and the stolen work is large, which takes time to process. The process starts with one task. It breaks up the work. Other threads steal work and start the same process. This load balances without central coordination, with few context switches and little coordination.
  • The same approach also works for graph traversal, matrix operations, linear algebra, modeling, and game playing (generate moves and evaluate the results). Latent parallelism can be found in a lot of places once you start looking.
  •  It supports higher-level operations like ParallelArray. You can specify filtering, transformation, and aggregation options. It's not a generalized in-memory database, but it has a very transparent cost model. It's clear how many parallel operations are happening. You can look at the code and quickly know what's a parallel operation, so you will know the cost.
  • It looks like map-reduce, except this is scaling across a multicore system, one single JVM, whereas map-reduce scales across a cluster. The strategy is the same: divide and conquer.
  • The idea is to make specifying parallel operations so easy you wouldn't even think of the serial approach.
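The max-of-a-list example above maps directly onto the fork-join framework that shipped in Java 7 as java.util.concurrent.ForkJoinPool. Here is a hedged sketch; the threshold value and class names are my own, not from the talk:

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Divide-and-conquer max using fork-join: split until the subproblem
// is small enough that the sequential solution is more efficient.
public class MaxTask extends RecursiveTask<Integer> {
    private static final int THRESHOLD = 4; // below this, solve sequentially
    private final int[] data;
    private final int lo, hi;

    public MaxTask(int[] data, int lo, int hi) {
        this.data = data; this.lo = lo; this.hi = hi;
    }

    @Override
    protected Integer compute() {
        if (hi - lo <= THRESHOLD) {            // small enough: just scan
            int max = Integer.MIN_VALUE;
            for (int i = lo; i < hi; i++) max = Math.max(max, data[i]);
            return max;
        }
        int mid = (lo + hi) >>> 1;
        MaxTask left = new MaxTask(data, lo, mid);
        MaxTask right = new MaxTask(data, mid, hi);
        left.fork();                            // queue left half for stealing
        int r = right.compute();                // do right half in this thread
        return Math.max(r, left.join());        // join = wait and combine
    }

    public static int parallelMax(int[] data) {
        return new ForkJoinPool().invoke(new MaxTask(data, 0, data.length));
    }

    public static void main(String[] args) {
        System.out.println(parallelMax(new int[]{7, 42, 3, 19, 88, 5, 61, 2, 99, 14}));
    }
}
```

fork() pushes the left half onto the current worker's deque, the right half is computed directly, and idle workers steal from the tail — the work-stealing scheme described above.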
Tuesday, June 16, 2009

    How to control services in Linux with chkconfig

    Linux / Unix Command: chkconfig

    chkconfig - updates and queries runlevel information for system services:

    chkconfig --list [name]
    chkconfig --add name
    chkconfig --del name
    chkconfig [--level levels] name <on|off|reset>
    chkconfig [--level levels] name

    DESCRIPTION

    chkconfig provides a simple command-line tool for maintaining the /etc/rc[0-6].d directory hierarchy by relieving system administrators of the task of directly manipulating the numerous symbolic links in those directories.

    This implementation of chkconfig was inspired by the chkconfig command present in the IRIX operating system. Rather than maintaining configuration information outside of the /etc/rc[0-6].d hierarchy, however, this version directly manages the symlinks in /etc/rc[0-6].d. This leaves all of the configuration information regarding what services init starts in a single location.

    chkconfig has five distinct functions: adding new services for management, removing services from management, listing the current startup information for services, changing the startup information for services, and checking the startup state of a particular service. When chkconfig is run without any options, it displays usage information. If only a service name is given, it checks to see if the service is configured to be started in the current runlevel. If it is, chkconfig returns true; otherwise it returns false. The --level option may be used to have chkconfig query an alternative runlevel rather than the current one.

    If one of on, off, or reset is specified after the service name, chkconfig changes the startup information for the specified service. The on and off flags cause the service to be started or stopped, respectively, in the runlevels being changed. The reset flag resets the startup information for the service to whatever is specified in the init script in question.

    By default, the on and off options affect only runlevels 2, 3, 4, and 5, while reset affects all of the runlevels. The --level option may be used to specify which runlevels are affected.

    Note that for every service, each runlevel has either a start script or a stop script. When switching runlevels, init will not re-start an already-started service, and will not re-stop a service that is not running.

    OPTIONS

    --level levels

    Specifies the run levels an operation should pertain to. It is given as a string of numbers from 0 to 7. For example, --level 35 specifies runlevels 3 and 5.
    --add name

    This option adds a new service for management by chkconfig. When a new service is added, chkconfig ensures that the service has either a start or a kill entry in every runlevel. If any runlevel is missing such an entry, chkconfig creates the appropriate entry as specified by the default values in the init script. Note that default entries in LSB-delimited 'INIT INFO' sections take precedence over the default runlevels in the init script.

    --del name

    The service is removed from chkconfig management, and any symbolic links in /etc/rc[0-6].d which pertain to it are removed.

    --list name

    This option lists all of the services which chkconfig knows about, and whether they are stopped or started in each runlevel. If name is specified, only information about that service is displayed.

    RUNLEVEL FILES

    Each service which should be manageable by chkconfig needs two or more commented lines added to its init.d script. The first line tells chkconfig what runlevels the service should be started in by default, as well as the start and stop priority levels. If the service should not, by default, be started in any runlevels, a - should be used in place of the runlevels list. The second line contains a description for the service, and may be extended across multiple lines with backslash continuation.

    For example, random.init has these three lines:

    # chkconfig: 2345 20 80
    # description: Saves and restores system entropy pool for \
    # higher quality random number generation.


    This says that the random script should be started in levels 2, 3, 4, and 5, that its start priority should be 20, and that its stop priority should be 80. You should be able to figure out what the description says; the \ causes the line to be continued. The extra space in front of the line is ignored.

    For instance, take a look at a service configuration to run Tomcat:

    #startup script for Jakarta Tomcat
    #
    # chkconfig: 345 84 16
    # description: Jakarta Tomcat Java Servlet/JSP Container

    TOMCAT_HOME=/usr/local/bin/apache-tomcat-6.0.16
    TOMCAT_START=/usr/local/bin/apache-tomcat-6.0.16/bin/startup.sh
    TOMCAT_STOP=/usr/local/bin/apache-tomcat-6.0.16/bin/shutdown.sh
    TOMCAT_RUN=/usr/local/bin/apache-tomcat-6.0.16/bin/catalina.sh
    #Necessary environment variables
    export CATALINA_HOME=/usr/local/bin/apache-tomcat-6.0.16

    # Source function library.
    . /etc/rc.d/init.d/functions

    # Source networking configuration.
    . /etc/sysconfig/network

    # Check that networking is up.
    [ "${NETWORKING}" = "no" ] && exit 0

    #Check for tomcat script
    if [ ! -f $TOMCAT_HOME/bin/catalina.sh ]
    then
        echo "Tomcat not available..."
        exit 1
    fi

    start() {
    echo -n "Starting Tomcat: "
    su - root -c $TOMCAT_START
    echo
    touch /var/lock/subsys/tomcatd
    # We may need to sleep here so it will be up for apache
    # sleep 5
    #Instead should check to see if apache is up by looking for http.pid
    }
    run() {
    echo -n "Starting Tomcat: "
    su - root -c $TOMCAT_START
    echo
    touch /var/lock/subsys/tomcatd
    # We may need to sleep here so it will be up for apache
    # sleep 5
    #Instead should check to see if apache is up by looking for http.pid
    }

    stop() {
    echo -n $"Shutting down Tomcat: "
    su - root -c $TOMCAT_STOP
    rm -f /var/lock/subsys/tomcatd
    echo
    }

    status() {
        ps ax --width=1000 | grep "[o]rg.apache.catalina.startup.Bootstrap start" | \
            awk '{printf $1 " "}' | wc | awk '{print $2}' > /tmp/tomcat_process_count.txt
        read line < /tmp/tomcat_process_count.txt
        if [ $line -gt 0 ]; then
            echo -n "tomcatd ( pid "
            ps ax --width=1000 | grep "[o]rg.apache.catalina.startup.Bootstrap start" | awk '{printf $1 " "}'
            echo -n ") is running..."
        else
            echo -n "Tomcat is stopped"
        fi
    }

    case "$1" in
        start)
            start
            ;;
        run)
            run
            ;;
        stop)
            stop
            ;;
        restart)
            stop
            sleep 3
            start
            ;;
        status)
            status
            ;;
        *)
            echo "Usage: tomcatd {start|stop|restart|status}"
            exit 1
    esac

    Sunday, June 7, 2009

    Java service wrapper for Linux

    http://wrapper.tanukisoftware.org/doc/english/introduction.html

    How to send e-mail using GMail on CentOS from shell script

    sSMTP is a very simple and straightforward alternative to big MTAs like sendmail or Exim. Unfortunately CentOS repositories don’t come with it, so you have to fetch it from Fedora’s EPEL repo.

    rpm -Uvh http://download.fedora.redhat.com/pub/epel/5/i386/epel-release-5-3.noarch.rpm
    yum install ssmtp


    Configuration is rather simple. Just change the following values in your /etc/ssmtp/ssmtp.conf:


    root=noreply@yourdomain.com
    AuthUser=noreply@yourdomain.com
    AuthPass=password
    FromLineOverride=YES
    mailhub=smtp.gmail.com:587
    UseSTARTTLS=YES

    To test, just run the following command ("-v" should be used for debugging):

    echo "test" | ssmtp -v -s "Test" genadyg@exelate.com


    To use sSMTP in a Unix script:

    ssmtp myemailaddress@gmail.com < msg.txt

    msg.txt is a simple text file using the proper formatting for sSMTP:

    To: toaddress@gmail.com
    From: fromaddress@gmail.com
    Subject: alert
    Message text

    Wednesday, June 3, 2009

    Google Wave: implementation notes

    It's an HTML 5 app, built on Google Web Toolkit. It includes a rich text editor and other functions like desktop drag-and-drop (which, for example, lets you drag a set of photos right into a wave).

    Went Walkabout. Brought back Google Wave

    Google Wave Drips With Ambition. A New Communication Platform For A New Web


    Google Wave is here!

    Google Wave is a new communication and collaboration platform based on hosted XML documents (called waves) supporting concurrent modifications and low-latency updates. This platform enables people to communicate and work together in new, convenient and effective ways. We will offer these benefits to users of Google Wave and we also want to share them with everyone else by making waves an open platform that everybody can share. We welcome others to run wave servers and become wave providers, for themselves or as services for their users, and to "federate" waves, that is, to share waves with each other and with Google Wave. In this way users from different wave providers can communicate and collaborate using shared waves. We are introducing the Google Wave Federation Protocol for federating waves between wave providers on the Internet.

    Here are the initial white papers that are available to complement the Google Wave Federation Protocol:

    • Google Wave Federation Architecture
    • Google Wave Data Model and Client-Server Protocol
    • Google Wave Operational Transform
    • General Verifiable Federation

    The Google Wave APIs are documented here.

    Tuesday, June 2, 2009

    Amazon S3 Java client

    SmartGWT

    SmartGWT is a GWT based framework that allows you to not only utilize
    its comprehensive widget library for your application UI, but also tie
    these widgets in with your server-side for data management. SmartGWT is
    based on the powerful and mature SmartClient library.
    http://code.google.com/p/smartgwt/

    Common REST Mistakes

    Common REST Mistakes( from Paul Prescod)

    When designing your first REST system there are a variety of mistakes people often make. I want to summarize them so that you can avoid them. If any are unclear, ask for more information on rest-discuss.
    1. Using HTTP is not enough. You can use HTTP in a Web service without SOAP or XML-RPC and still do the logical equivalent of SOAP or XML-RPC. If you're going to use HTTP wrong you would actually be better off doing it in a standard way! Most of these other points describe ways in which people abuse HTTP.

    2. Do not overuse POST. POST is in some senses the "most flexible" of HTTP's methods. It has a slightly looser definition than the other methods and it supports sending information in and getting information out at the same time. Therefore there is a tendency to want to use POST for everything. In your first REST Web Service, I would say that you should only use POST when you are creating a new URI. Pretend POST means "create new URI as child of the current URI." As you get more sophisticated, you may decide to use POST for other kinds of mutations on a resource. One rule of thumb is to ask yourself whether you are using POST to do something that is really a GET, DELETE or PUT, or could be decomposed into a combination of methods.

    3. Do not depend on URIs' internal structure. Some people think about REST design in terms of setting up a bunch of URIs. "I'll put purchase orders in /purchases and I'll give them all numbers like /purchases/12132 and customer records will be in /customers..." That can be a helpful way to think while you are whiteboarding and chatting, but it should not be your final public interface to the service. According to Web architectural principles, most URIs are opaque to client software most of the time. In other words, your public API should not depend on the structure of your URIs. Instead there would typically be a single XML file that points to the components of your service. Those components would have hyperlinks that point to other components and so forth. Then you can introduce people to your service with a single URI and you can distribute the actual components across computers and domains however you want. My rule of thumb is that clients only construct URIs when they are building queries (and thus using query strings). Those queries return references to objects with opaque URIs.

    4. Do not put actions in URIs. This follows naturally from the previous point. But a particularly pernicious abuse of URIs is to have query strings like "someuri?action=delete". First, you are using GET to do something unsafe. Second, there is no formal relationship between this "action URI" and the "object" URI. After all, your "action=" convention is something specific to your application. REST is about driving as many "application conventions" out of the protocol as possible.
    5. Services are seldom resources. In a REST design, a "stock quote service" is not very interesting. In a REST design you would instead have "stock" resources, and a service would just be an index of stock resources.

    6. Sessions are irrelevant. There should be no need for a client to "login" or "start a connection." HTTP authentication is done automatically on every message. Client applications are consumers of resources, not services. Therefore there is nothing to log in to! Let's say that you are booking a flight on a REST web service. You don't create a new "session" connection to the service. Rather you ask the "itinerary creator object" to create you a new itinerary. You can start filling in the blanks but then get some totally different component elsewhere on the web to fill in some other blanks. There is no session, so there is no problem of migrating session state between clients. There is also no issue of "session affinity" in the server (though there are still load balancing issues to consider).
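The "authentication on every message" point can be made concrete: a stateless client attaches its credentials to each request instead of logging in once. A minimal sketch (the user name and password are placeholders, and java.util.Base64 assumes a Java 8+ JDK):

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Sketch of stateless per-request authentication: instead of a login
// session, every HTTP request carries its own Authorization header.
public class BasicAuth {
    public static String authorizationHeader(String user, String password) {
        String credentials = user + ":" + password;
        String encoded = Base64.getEncoder()
                .encodeToString(credentials.getBytes(StandardCharsets.UTF_8));
        return "Basic " + encoded;
    }

    public static void main(String[] args) {
        // Attach this header to every request; the server keeps no session state.
        System.out.println(authorizationHeader("alice", "secret"));
    }
}
```

Because each message is self-describing, any server replica can handle it, which is why session affinity disappears.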

    7. Do not invent proprietary object identifiers. Use URIs. URIs are important because you can always associate information with them in two ways. The simplest way is to put data on a web server so that the URI can be dereferenced in order to get the data. Note that this technique only works with URIs that can be dereferenced, so these URIs (http URIs!) are strongly preferred to URN or UUID-based URIs. Another way is to use RDF and other techniques that allow you to project metadata onto a URI that may not be under your control. If you use URI syntax with UUIDs or something like that then you get half of the benefit of URIs: you get a standardized syntax but have no standardized dereferencing capability. If you use an HTTP URI then you get the other half of the benefit because you then also have a standardized dereferencing mechanism.

    8. Do not worry about protocol independence. There exists only one protocol which supports the proper resource manipulation semantics. If another one arises in the future, it will be easy to keep your same design and merely support the alternate protocol's interface. On the other hand, what people usually mean by "protocol independence" is to abandon resource modelling and therefore abandon both REST and the Web.

    Overall, the thing to keep in mind is that REST is about exposing resources through URIs, not services through messaging interfaces.
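To make the summary concrete, here is a hedged sketch of the "stock resources" design from point 5, using the JDK's built-in com.sun.net.httpserver (the symbols and quote values are invented, and Map.of assumes a Java 9+ JDK). Each stock is addressed by its own URI rather than through a quote-service RPC:

```java
import com.sun.net.httpserver.HttpExchange;
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.util.Map;

// Resources, not services: each stock is its own resource at
// /stocks/{symbol}, instead of one RPC-style "quote service".
public class StockResources {
    static final Map<String, String> QUOTES =
            Map.of("GOOG", "412.50", "AMZN", "83.66");  // made-up data

    public static HttpServer start() throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(0), 0);
        server.createContext("/stocks/", (HttpExchange ex) -> {
            // The URI itself identifies the resource: /stocks/GOOG
            String symbol = ex.getRequestURI().getPath()
                    .substring("/stocks/".length());
            String quote = QUOTES.get(symbol);
            byte[] body = (quote == null ? "not found" : quote).getBytes();
            ex.sendResponseHeaders(quote == null ? 404 : 200, body.length);
            try (OutputStream os = ex.getResponseBody()) { os.write(body); }
        });
        server.start();
        return server;
    }

    public static void main(String[] args) throws IOException {
        HttpServer server = start();
        System.out.println("Listening on port " + server.getAddress().getPort());
    }
}
```

GET /stocks/GOOG returns the representation of that one resource; an index resource listing stock URIs could be exposed the same way.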

    Monday, June 1, 2009

    PHP/Java REST framework finalists

    A RESTful Web App Development PHP Library

    http://tonic.sourceforge.net/docs.html

    Lightweight REST framework

    The mission of this open source project is to bring the simplicity and
    efficiency of the REST architectural style to Java developers.
    Concretely, it is composed of two parts:

    1) Restlet API
    1.     Supports all REST concepts (resource, representation, data, connector, components, etc.)
    2.     Suitable for both client and server REST applications
    3.     Maplets support the concept of URIs as UI with advanced pattern matching features
    4.     Chainlets filter calls to implement features like logging, authentication or compression
    5.     Complete alternative to Servlet API with no external dependency (JAR < 50kb)
    6.     Supports blocking and non-blocking NIO modes
    2) Noelios Restlet Engine (NRE)
    1.   Reference implementation of the Restlet API provided by Noelios Consulting (core JAR < 60kb)
    2.   Server connector provided: HTTP (via Jetty connectors)
    3.   Client connectors provided: HTTP, JDBC, SMTP (via JavaMail)
    4.   Support for logging (LoggerChainlet) and cool URIs rewriting (RedirectRestlet)
    5.   Static files serving (DirectoryRestlet) with metadata association based on file extensions
    6.    FreeMarker template representations as an alternative to JSP pages
    7.    Automatic server-side content negotiation based on media type and language

    Also, an introduction paper as well as a detailed tutorial are available.

    http://www.restlet.org/