Sunday, May 31, 2009

Fractal Imaging


The other day, a friend of mine made me very happy. He returned my only copy of my favorite book, Fractal Imaging, by Ning Lu. This book has long been my favorite technical book, bar none, but I am beginning to think it might just be my favorite book of any kind.

"Virtue, like a river, flows soundlessly in the deepest place." You don't expect to encounter this sort of statement in an übergeek tome of this magnitude, and yet Lu scatters such proverbs (as well as quotations from Nietzsche, Joseph Conrad, Ansel Adams, Shakespeare, Jim Morrison, and others) throughout the text. This alone, of course, makes it quite an unusual computer-science book.

But the best part may be Lu's ability to blend the often sophisticated concepts of measure theory (and the math behind iterated function systems) with beautiful graphs, line drawings, transformed images (some in color; many downright spectacular), and the occasional page or two of C code. The overall effect is mesmerizing. Potentially intimidating math is made tractable by Lu's clear, often inspiring elucidations of difficult concepts. Ultimately, the patient reader is rewarded with numerous "Aha!" moments, culminating, at last, in an understanding of (and appreciation for) the surpassing beauty of fractal image transformation theory.

What is fractal imaging? Well, it's more than just the algorithmic generation of ferns (like the generated image above) from non-linear equation systems. It's a way of looking at ordinary bitmap images of all kinds. The hypothesis is that any given image is the end result of iterating some particular (unknown) system of non-linear equations, and that if one only knew what those equations were, one could regenerate the image algorithmically, on demand, from the equations alone. The implications are far-reaching. This means:

1. Instead of storing a bitmap of the image, you can just store the equations from which it can be generated. (This is often a 100-to-1 storage reduction.)
2. The image is now scale-free. That is, you can generate it at any scale -- enlarge it as much as you wish -- without losing fidelity. (Imagine being able to blow up an image onscreen without it becoming all blocky and pixelated.)
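
To make the fern example concrete, here is a minimal chaos-game sketch of an iterated function system in TypeScript. The four affine maps are the commonly published Barnsley-fern coefficients; plotting the points the function returns draws the fern. (This is only the easy, "generate from equations" direction, of course -- the hard problem the book tackles is the inverse one of finding equations for an arbitrary image.)

type AffineMap = { a: number; b: number; c: number; d: number; e: number; f: number; p: number };

// The commonly published Barnsley-fern maps; p is the probability of picking each map.
const FERN: AffineMap[] = [
  { a: 0,     b: 0,     c: 0,     d: 0.16, e: 0, f: 0,    p: 0.01 },  // stem
  { a: 0.85,  b: 0.04,  c: -0.04, d: 0.85, e: 0, f: 1.6,  p: 0.85 },  // ever-smaller leaflets
  { a: 0.2,   b: -0.26, c: 0.23,  d: 0.22, e: 0, f: 1.6,  p: 0.07 },  // left leaflet
  { a: -0.15, b: 0.28,  c: 0.26,  d: 0.24, e: 0, f: 0.44, p: 0.07 },  // right leaflet
];

// Chaos game: repeatedly apply a randomly chosen map to the current point
// and record where it lands; the recorded points trace out the attractor.
function chaosGame(maps: AffineMap[], iterations: number): Array<[number, number]> {
  const points: Array<[number, number]> = [];
  let x = 0, y = 0;
  for (let i = 0; i < iterations; i++) {
    let r = Math.random();
    let m = maps[maps.length - 1];
    for (const candidate of maps) {
      if (r < candidate.p) { m = candidate; break; }
      r -= candidate.p;
    }
    [x, y] = [m.a * x + m.b * y + m.e, m.c * x + m.d * y + m.f];
    points.push([x, y]);
  }
  return points; // plot these (x, y) pairs and the fern emerges
}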

Georgia Tech professor Michael Barnsley originated the theory of fractal decomposition of images in the 1980s. He eventually formed a company, Iterated Systems (where Ning Lu was Principal Scientist), to monetize the technology, and for a while it looked very much as if the small Georgia company would become the quadruple-platinum tech startup of the Eighties. Despite considerable excitement over the technology, however, it failed to draw much commercial interest -- in part because of its computationally intensive nature. Decompression was fast, but compressing an image was extremely slow (especially on computers of the era), for reasons that become quite apparent when you read Ning Lu's book.

Iterated Systems eventually became a company called MediaBin, Inc., which was ultimately acquired by Interwoven (which, in turn, was recently acquired by Autonomy). The fractal imaging technology is still used in Autonomy's MediaBin Digital Asset Management product, for example in the "similarity search" feature: you specify a source image (through the DAM client's GUI) and tell the system to find all assets that look like it. Every image in the repository has already been decomposed into its fractal primitives. Once the source image's primitives (which, remember, are scale-free) are known, they can be compared against the stored sets of fractal shards; when the similarity between shard-sets is good enough, an "alike" image is presumed to have been found.
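
Autonomy's actual matching logic is proprietary and I don't know its details; purely to illustrate the general idea of scoring overlap between two images' already-extracted shard sets, here is a hypothetical TypeScript sketch. The Shard encoding, the Jaccard-style score, and the cutoff value are all my own inventions for the example, not MediaBin's method.

type Shard = string; // stand-in for a quantized encoding of one fractal primitive

// Jaccard-style overlap: shared shards divided by total distinct shards.
function shardSimilarity(a: Set<Shard>, b: Set<Shard>): number {
  if (a.size === 0 && b.size === 0) return 1;
  let shared = 0;
  for (const s of a) if (b.has(s)) shared++;
  return shared / (a.size + b.size - shared);
}

const ALIKE_THRESHOLD = 0.6; // hypothetical cutoff

function looksAlike(a: Set<Shard>, b: Set<Shard>): boolean {
  return shardSimilarity(a, b) >= ALIKE_THRESHOLD;
}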

It's a fascinating technology and one of the great computer-imaging developments of the 20th century, IMHO. If you're into digital imaging you might want to track down Ning Lu's book. A 40-page sample of it is online here. I say drop what you're doing and check it out.

Thursday, May 21, 2009

U.S. Patent Office web site runs on Netscape iPlanet?

I suppose I shouldn't be so surprised at this, but I never seriously thought I would encounter the name "iPlanet(TM) Web Server Enterprise Edition" on an in-production web site again any time soon. And yet it turns out, the U.S. Patent and Trademark Office's publicly searchable database of patent applications is powered by a 2001 version of Netscape's app server.

The iPlanet "welcome page" can be found at http://appft1.uspto.gov/, which is one of the hosts for the Patent Application search site.

Brings back memories, doesn't it?

Sunday, May 17, 2009

WolframAlpha fails to impress

I've been fooling around with WolframAlpha (the much-ballyhooed intelligent search engine) yesterday and today, and all I can say is, it feels very alpha.

I tried every query I could think of to get the annual consumption of electricity in the U.S., and got nowhere. On the other hand, if you just enter "electricity," your first hit is "Coulomb's law 2.0 mC, 5.0 mC, 250 cm," your second hit is "12 A, 110 V," and your third hit is "diode 0.6 V." Which seems (how shall I say?) pretty useless.

As it turns out, WolframAlpha is also extraordinarily slow most of the time and hasn't performed well in load tests. Apparently Wolfram's AI doesn't extend to figuring out how to make something scale.

I'll suspend judgment on WA for a while longer, pending further testing. But right now it looks and smells like the answer to a question nobody asked.

Saturday, May 16, 2009

Google patent on floating data centers

This may be old news to others, but I only learned about it just now, and I have to assume there are still people who haven't heard it yet, so:

It seems the U.S. Patent and Trademark Office has awarded Google a patent for a floating data center that uses the ocean to provide both power and cooling. The patent, granted on 28 April 2009, describes floating data centers located 3 to 7 miles from shore, in 50 to 70 meters of water. The technique would obviously use ocean water for cooling, but according to the patent, Google also intends to use the motion of ocean surface waves to create electricity, via “wave farms” that produce up to 40 megawatts of electrical power. As a side benefit, floating data centers (if located far enough out to sea) are not subject to real estate or property taxes.

More details of how the wave energy machines would work can be found here.

The interesting thing will be to see where, in the world, Google would put its first offshore data center. Any guesses?

Thursday, May 14, 2009

How to get 62,099 unique visitors to your blog in one day

I spent the last two days in New York City attending a conference (the Enterprise Search Summit) and didn't have time to check Google Analytics (for my blog traffic) until just now. Imagine my shock to discover that my May 12 post, "One of the toughest job-interview questions ever," drew 62,099 unique visitors from 143 countries. Plus over 125 comments.

So I guess if you want to draw traffic to your blog, the formula is very simple:

1. Get a job interview with a large search company.
2. Talk about one of the interview questions in excruciating depth.
3. (Optionally) Spend an inordinate amount of time discussing algorithms and such.

Conversely, if you want to completely kill your blog's traffic numbers, the formula for driving people away seems to be:

1. Discuss CMIS (the new OASIS content-management interop standard).
2. Fail to mention some kind of programming topic.
3. Avoid controversy and don't piss anyone off.

Hmm, I wonder which recipe I should gravitate toward over the next few weeks?

Aw heck, screw recipes. This is not a cooking show.

Tuesday, May 12, 2009

One of the toughest job-interview questions ever

I mentioned in a previous post that I once interviewed for a job at a well-known search company. One of the five people who interviewed me asked a question that resulted in an hour-long discussion: "Explain how you would develop a frequency-sorted list of the ten thousand most-used words in the English language."

I'm not sure why anyone would ask that kind of question in the course of an interview for a technical writing job (it's more of a software-design kind of question), but it led to a lively discussion, and I still think it's one of the best technical-interview questions I've ever heard. Ask yourself: How would you answer that question?

My initial response was to assail the assumptions underlying the problem. Language is a fluid thing, I argued. It changes in real time. Vocabulary and usage patterns shift day-to-day. To develop a list of words and their frequencies means taking a snapshot of a moving target. Whatever snapshot you take today isn't going to look like the snapshot you take tomorrow -- or even five minutes from now.

So the first question is: Where do we get our sample of words from? Is this about spoken English, or written English? Two different vocabularies with two different frequency patterns. But again, each is mutable, dynamic, fluid, protean, changing minute by minute, day by day.

Suppose we limit the problem to written English. How will we obtain a "representative sampling" of English prose? It should be obvious that there is no such thing. There is no "average corpus." Think about it.

My interviewer wanted to cut the debate short and move on to algorithms and program design, but I resisted, pointing out that problem definition is extremely important; you can't rush into solving a problem before you understand how to pose it.

"Let's assume," my inquisitor said, "that the Web is a good starting place: English web-pages." I tormented my tormentor some more, pointing out that it's dangerous to assume spiders will crawl pages in any desirable (e.g., random) fashion, and anyway, some experts believe "deep Web content" (content that's either uncrawlable or has never been crawled before) constitutes the majority of online content -- so again, we're not likely to obtain any kind of "representative" sample of English words, if there even is such a thing as a representative sample of the English language (which I firmly maintain there is not).

By now, my interviewer was clearly growing impatient with my petulance, so he asked me to talk about designing a program that would obtain a sorted list of the 10,000 most-used words. I dutifully regurgitated the standard crawl/canonicalize/parse/tally sorts of things that you'd typically do in such a program.

"How would you organize the words in memory?" my tormentor demanded to know.

"A big hash table," I said. "Just hash them right into the table and bump a counter at each spot."

"How much memory will you need?"

"What've you got?" I smiled.

"No, seriously, how much?" he said.

I said that, assuming 64-bit hardware and software, maybe something like 64 gigs: enough memory for a four-billion-slot array with 16 bytes of data per slot. Most words will fit in that space, and a short int will suffice for a counter in each slot. (Longer words can be hashed into a separate, smaller array.) Meanwhile you're using only 32 bits of the available 64 bits of address space, which is enough to hash words of seven or fewer characters with no collisions at all, since the typical English word carries about 4.5 bits of entropy per character (7 × 4.5 ≈ 31.5 bits). Longer words entail some risk of hash collision, but with a good hash function that shouldn't be much of a problem.

"What kind of hash function would you use?" the interviewer asked.

"I'd try a very simple linear congruential generator, for speed," I said, "and see how it performs in terms of collisions."

He asked me to draw the hash function on the whiteboard. I scribbled some pseudocode that looked something like:

HASH = INITIAL_VALUE;
FOR EACH ( CHAR IN WORD ) {
    HASH *= MAGIC_NUMBER;
    HASH ^= CHAR;
    HASH %= BOUNDS;
}
RETURN HASH;

I explained that the hash-table array length should be prime, and that BOUNDS should be less than the table length but coprime to it. Good values for MAGIC_NUMBER might be 7, 13, or 31 (or other small primes); you can test various values until you find one that works well.
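
For the curious, here is one way to render that whiteboard sketch as running TypeScript. The table size, BOUNDS, seed, and multiplier below are illustrative values I picked to satisfy the constraints just described (table length prime, BOUNDS smaller than and coprime to it, multiplier a small prime); they weren't part of the interview.

const TABLE_LENGTH = 1_000_003;   // prime (kept small for the example)
const BOUNDS = 999_983;           // also prime, hence coprime to TABLE_LENGTH
const MAGIC_NUMBER = 31;          // small prime multiplier
const INITIAL_VALUE = 5381;       // arbitrary nonzero seed

// Multiply, mix in the character, reduce -- exactly the loop above. With these
// small constants the intermediate value stays well below 2^31, so JavaScript's
// 32-bit XOR operator is safe to use.
function hashWord(word: string): number {
  let hash = INITIAL_VALUE;
  for (const ch of word) {
    hash *= MAGIC_NUMBER;
    hash ^= ch.charCodeAt(0);
    hash %= BOUNDS;
  }
  return hash; // always < BOUNDS < TABLE_LENGTH, so usable as a slot index
}

// Tallying is then just: counts[hashWord(word)]++ over a TABLE_LENGTH-sized array.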

"What will you do in the event of hash collisions?" the professor asked.

"How do you know there will be any?" I said. "Look, the English language only has a million words. We're hashing a million words into a table that can hold four billion. The load factor on the table is negligible. If we're getting collisions it means we need a better hash algorithm. There are plenty to choose from. What we ought to do is just run the experiment and see if we even get any hash collisions. "

"Assume we do get some. How will you handle them?"

"Well," I said, "you can handle collisions via linked lists, or resize and rehash the table -- or just use a cuckoo-hash algorithm and be done with it."

This led to a whole discussion of the cuckoo hashing algorithm (which, amazingly, my inquisitor -- supposedly skilled in the art -- had never heard of).
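
For readers who (like my interviewer) haven't run into it: cuckoo hashing gives every key two candidate slots, one per hash function; an insert that finds both occupied evicts a resident key and re-homes it in its alternate slot, repeating until everything settles or a rebuild is triggered. A minimal TypeScript sketch, not tuned for the word-counting problem:

class CuckooTable {
  private slots: Array<string | undefined>;

  constructor(private size: number = 100_003) {
    this.slots = new Array(size);
  }

  // Two independent hash functions give each key its two candidate slots.
  private h1(key: string): number {
    let h = 17;
    for (const c of key) h = (h * 31 + c.charCodeAt(0)) % this.size;
    return h;
  }

  private h2(key: string): number {
    let h = 7;
    for (const c of key) h = (h * 131 + c.charCodeAt(0)) % this.size;
    return h;
  }

  contains(key: string): boolean {
    return this.slots[this.h1(key)] === key || this.slots[this.h2(key)] === key;
  }

  // Returns false if the eviction chain runs too long; the caller should then
  // resize the table and rehash everything.
  insert(key: string, maxKicks = 64): boolean {
    if (this.contains(key)) return true;
    let candidate = key;
    let slot = this.h1(candidate);
    for (let i = 0; i < maxKicks; i++) {
      if (this.slots[slot] === undefined) {
        this.slots[slot] = candidate;
        return true;
      }
      // Evict the resident key ("cuckoo" it out) and send it to its other slot.
      const evicted = this.slots[slot] as string;
      this.slots[slot] = candidate;
      candidate = evicted;
      slot = slot === this.h1(candidate) ? this.h2(candidate) : this.h1(candidate);
    }
    return false;
  }
}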

This went on and on for quite a while. We eventually discussed how to harvest the frequencies and create the desired sorted list. But in the end, I returned to my main point, which was that sample noise and sampling error are inevitably going to render the results moot. Each time you run the program you're going to get a different result (if you do a fresh Web crawl each time). Word frequencies are imprecise; the lower the frequency, the more "noise." Run the program on September 10, and you might find that the word "terrorist" ranks No. 1000 in frequency on the Web. Run it again on September 11, and you might find it ranks No. 100. That's an extreme example, but vocabulary noise is pervasive, and at the level of words that rank No. 5000+ (say) on the frequency list, the day-to-day variance in rank for any given word is going to be substantial. It's not even meaningful to talk about precision in the face of that much noise.
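
(For completeness, the harvesting step really is the easy part. A minimal sketch, with a TypeScript Map standing in for the big hash table and a full sort instead of a heap-based selection, purely for brevity:)

// Tally a stream of already-canonicalized words, then keep the k most frequent.
function topWords(words: Iterable<string>, k = 10_000): Array<[string, number]> {
  const counts = new Map<string, number>();
  for (const w of words) {
    counts.set(w, (counts.get(w) ?? 0) + 1);
  }
  // Sort descending by count and truncate. A heap would avoid sorting the
  // whole vocabulary, but the full sort keeps the sketch short.
  return [...counts.entries()].sort((a, b) => b[1] - a[1]).slice(0, k);
}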

Anyway, whether you agree with my analysis or not, you can see that a question like this can lead to a great deal of discussion in the course of a job interview, cutting across a potentially large number of subject domains. It's a question that leads naturally to more questions. And that's the best kind of question to ask in an interview.

Sunday, May 10, 2009

Can CMIS handle browser CRUD?

I've mentioned before the need for concrete user narratives (user stories) describing the intended uses of CMIS (Content Management Interoperability Services, soon to be an OASIS-blessed standard API for content management system interoperability). When you don't have user stories tied to your requirements, you tend to find out things later on that you wish you'd found out earlier. That seems to be the case now with browser-based CRUD operations in CMIS.

I don't claim to be an expert on CMIS (what I know about CMIS would fill a very small volume, at this point), but in reading recent discussions on org.oasis-open.lists.cmis, I've come across a very interesting issue, which is that (apparently) it's not at all easy to upload a file, or fetch a file and its dependent files (such as an HTML page with its dependent CSS files), from a CMIS repository using the standard Atom bindings.

The situation is described (in a discussion list thread) by David Nuescheler this way: "The Atom bindings do not lend themselves to be consumed by a lightweight browser client and for example cannot even satisfy the very simple use-case of uploading a file from the browser into a CMIS repository. Even simple read operations require hundreds of lines of JavaScript code."

Part of the problem is that files in the repository aren't natively exposed via a path, so you can't get to a file using an IRI with a normal file and path name like "./main.css" or "./a/b/index.html" or whatever. Instead, files have an ID in the repository (e.g., /12257894234222223) which is assigned by the repository when you create the file. That wouldn't be so bad, except that there doesn't appear to be an easy way (or any way) to look up a URL using an ID (see bug CMIS-169).

Based on the difficulty encountered in doing browser CRUD during the recent CMIS Plugfest, David Nuescheler has proposed looking into adding an additional binding based on JSON GETs for reading and multi-part POSTs for writing -- which would make it possible to do at least some CMIS operations via AJAX. The new binding would probably be called something like the web-, browser-, or mashup-binding. (Notice how the name "REST" is nowhere in sight -- for good reason. CMIS as currently implemented is not how REST is supposed to work.)
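
Nothing like this exists in CMIS today, so the sketch below is purely hypothetical: the URL pattern, query parameter, and form-field names are invented solely to illustrate the kind of call a JSON/multipart binding would make possible from the browser (written in TypeScript with the fetch API for brevity).

// Hypothetical JSON GET: fetch an object's properties as plain JSON --
// no Atom parsing in the browser.
async function getObject(repoUrl: string, objectId: string): Promise<unknown> {
  const res = await fetch(`${repoUrl}/object?id=${encodeURIComponent(objectId)}`);
  return res.json();
}

// Hypothetical multipart POST: upload a file into a folder straight from
// an <input type="file"> element.
async function uploadFile(repoUrl: string, folderId: string, file: File): Promise<unknown> {
  const form = new FormData();
  form.append("folderId", folderId);
  form.append("content", file, file.name);
  const res = await fetch(`${repoUrl}/children`, { method: "POST", body: form });
  return res.json();
}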

Granted, CMIS was not originally designed with browser mashups in mind, but the fact is, that's where most of the traction is going to come from if the larger ecosystem of web developers decides to latch onto CMIS in a big way. SOAP has a dead-horse stench about it; no one I know likes to deal with it; but an Atom binding isn't a very useful alternative if the content you need can't be addressed or can't easily be manipulated in the browser using standard AJAX calls.

So let's hope the CMIS technical committee doesn't overlook the most important use-case of all: CMIS inside the browser. Java and .NET SOAP mashups are important, but let's face it, from this point forward all the really important stuff needs to happen in the browser. If you can't do browser CRUD with a new content-interoperability standard, you're starting life with the umbilical cord wrapped around your neck.

Thursday, May 07, 2009

DOM Storage: a Cure for the Common Cookie

One of the things that's always annoyed me about web app development is how klutzy it is to try to persist data locally (offline, say) from a script running in the browser. The cookie mechanism is just so, so . . . annoying.

But it turns out, help is on the way. Actually, Firefox has had a useful persistence mechanism (something more useful than cookies, at least) since Firefox 2, in the so-called DOM Storage API. Internet Explorer prior to version 8 also had a similar feature, called "userData behavior," that lets you persist data across multiple browser sessions. But it looks like the whole browser world (even Google Chrome) will eventually move to the DOM Storage API, if for no other reason than that it is getting official W3C blessing.

The spec in question is definitely a work in progress, although the key/value-pair implementation has (as I say) been a part of Firefox and SpiderMonkey for quite some time. What's still being worked out is structured storage -- the piece where you can use some flavor of SQL to solve your CRUD-and-query needs in a businesslike manner.
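
The key/value half is already usable. Here's a minimal TypeScript sketch against the Storage interface in the draft (setItem/getItem/removeItem on localStorage), persisting a draft across browser sessions with no cookies and no server round-trip. (The draft-saving scenario and key naming are just an example.)

// localStorage is the spec's long-lived store; sessionStorage has the same
// interface but lives only as long as the browsing session.
function saveDraft(id: string, text: string): void {
  localStorage.setItem(`draft:${id}`, text);
}

function loadDraft(id: string): string | null {
  return localStorage.getItem(`draft:${id}`);
}

function discardDraft(id: string): void {
  localStorage.removeItem(`draft:${id}`);
}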

Why do we need a W3C structured-storage DOM API when there are already such things as Gears LocalServer, Dojo OfflineRest, etc. (not to mention the HTML 5 ApplicationCache mechanism)? A good answer is given here by Nikunj R. Mehta, who heads up Oracle's related BITSY project.

Most of the current debate around structured storage in the DOM Storage API revolves around whether or not to use SQLite (or SQL at all). The effort is presently heading toward SQLite, but Vladimir Vukićević has done an eloquent job of presenting the downsides of that approach. The argument against SQL is that AJAX programmers shouldn't have to learn and use something as heavy and obtuse as SQL just to do persistence, when there are more script-friendly ways to do things (an approach something like that of, say, CouchDB). As much as I sympathize with that argument, I think the right thing to do is stay with industry standards for structured data storage, and SQL is about as universal as standards get. Of course, people will then argue that "SQL" isn't just one thing -- there are many dialects -- and that's true: SQLite itself is a case in point. I happen to agree with Vukićević (and others) that SQLite is just too limiting and quirky a dialect to be standardizing on in this situation.

I bring all this up not to argue for a particular solution, however, so much as to just make you aware of some of what's going on, and the current status of things. Explore the foregoing links if you want to get involved. If having a common data-persistence and offline/online HTTP-cache API that will work uniformly across browsers means anything to you, maybe you should join the discussion (or at least monitor it).

Getting beyond cookies is essential at this point; the future of web-app development is at stake and we can't wait any longer to nail this one. (We also can't do goofy things like rely on hidden Flash applets to do our heavy lifting for us.) It's time to have a real, standardized, browser-native set of persistence and caching APIs. There's not a moment to lose.

Wednesday, May 06, 2009

How to know what Oracle will do with Java

Oracle Corporation, capitalism's quintessential poster child, has done something strange. It has bought into a money-losing proposition.

The paradox is mind-bending. Oracle is one of the most meticulously well-oiled money-minting machines in the history of computing, yet it finds itself spending $7.4 billion to acquire a flabby, financially inept technocripple (Sun Microsystems) that lost $200 million in its most recent quarter. One can't help but wonder: What is Oracle doing?

What they're doing, of course, is purchasing Java (plus Solaris and a server business; plus a few trinkets and totems). The question is why. And what will they do with it now that they have their arms around it?

The "why" part is easy: Oracle's product line is tightly wed to Java, and the future of Java must be kept out of the hands of the competition (IBM in this case; and perhaps also the community). What Oracle gains is not technology, but control over the future, which is worth vastly more.

What's in store for Java, then? Short answer: Whatever Larry Ellison wants. And Larry wants a lot of things, mostly involving increased revenue for Oracle.

There's a huge paradox here, though, for Ellison and Oracle. On the one hand, Java represents a massive amount of code to develop and maintain; it's a huge cash sink. Sun tried to monetize Java over the years through various licensing schemes, but in the end it was (and is) still a money pit. On the other hand, you don't maintain control over something by letting the community have it. Is there a happy medium?

Because of the huge costs involved in maintaining Java, it makes sense to let the community bear as much of the development and maintenance burden as possible. After all, does Oracle really want to be in the business of maintaining and testing things like Java2D? Probably not. That kind of thing will get thrown over the castle wall.

But chances are, JDBC won't be.

The core pieces of Java that are of most interest to Oracle will be clung to tightly by Oracle. That includes its future direction.

Therefore the most telling indicator of what Oracle intends to do with Java (allowing the community to have a say, versus keeping control of the platform under lock and key) will be how Oracle resolves the Apache/TCK crisis. The issue here (in case you haven't been following it) is that Sun has created additional functionality for Java 7 that goes beyond the community-ready, Apache-blessed, Sun-blessed open-source version of Java 6. But Sun is treating its Java 7 deltas as private intellectual property, as evidenced by its steadfast refusal to provide, first, a spec for Java 7, and second, a test kit (TCK) compatible with Apache license requirements. Until this dispute is resolved, there will be no open-source Java 7, and community (Apache) involvement with future versions of Java will ultimately end.

There are some who believe that if Oracle continues Sun's policy of refusing to support open-source Java beyond Java 6, the community will simply fork Java 6 and move forward. This is probably wishful thinking. The forked Java would quickly become a play-toy "hobby" version of Java, missing important pieces of functionality found in the "real" (Oracle) Java, and ultimately lacking acceptance by enterprise. It's tempting to think that with sufficient help from, say, IBM, the community version of Java might eventually become a kind of analog to Linux (the way Linux gathered steam independently of UNIX-proper). But that has to be considered a long shot. The mother ship is still controlled by Oracle. (Even if there's a mutiny, the boat's rudder is owned, controlled, and operated by Larry Ellison and Company.)

It's in Oracle's interest to control the Enterpriseness of Java EE with an iron grip. The non-enterprise stuff means nothing. Oracle will try to find a way to hive off the non-EE cruft and throw it into the moat, where the community can grovel over it if they so wish.

And a good early indication of that will be if Oracle back-burners JavaFX and throws it over the wall, into the moat (which I think it will).

My advice? Step away from the moat . . . unless you want to get muddy.

Tuesday, May 05, 2009

Adobe's Linux Problem

Adobe Systems is at a critical turning point in its long, slow march in the direction of RIA platform domination (which, should Adobe emerge the winner in that sphere, could have profound implications for all of us, as I've blogged about earlier). It is time for the company to decide whether it wants to embrace Linux "with both arms," so to speak. It's put-up-or-shut-up time. Either Linux is of strategic importance to the Adobe agenda, or it is not. Which is it?

"But," you might be saying, "Adobe has made it clear that it is committed to supporting Linux. Look at the recently much-improved Acrobat Reader for Linux, and the effort to bring Flash and Flex to Linux. Adobe is investing heavily in Linux. It's very clear."

Then why has Adobe killed Flex Builder for Linux?

It's funny, if you read some of the blog commentary on this, how many Flex developers are defending Adobe's decision to abandon further development of Flex Builder for Linux, saying (like corporate apologists) there simply isn't enough demand for Flex on Linux to justify the necessary allocation of funds.

I have no doubt whatsoever that a straight bean-counting analysis of the situation will show that the short-term ROI on Flex-for-Linux is indeed poor, and that from a quarterly-earnings point of view it's not the right way to satisfy shareholder interests. Agreed; point conceded.

But that's called being shortsighted. The Linux community may be only a small percentage of the OS market, but in terms of mindshare, the Linux developer community is a constituency of disproportionate influence and importance. Also, as a gesture of seriousness about Open Source, the importance of supporting Flex tools on Linux is hard to overestimate.

But it's not just about Flex tools. Adobe has had a schizophrenic Linux "strategy" for years. It back-burnered proper support for Acrobat Reader (and PDF generally) on Linux for years. Flash: ditto. And even a product like FrameMaker (which began its life as a UNIX product, interestingly, and was available in a Solaris version until just a few months ago) has been neglected as a potential Linux port, even though Adobe did, in fact, at one time have a Linux version of FrameMaker in public beta.

Adobe has a long history of going after the lowest-hanging fruit (and only the lowest-hanging fruit) in the Linux world, and it continues that tradition today. The only problem is, you can't claim to be an ardent supporter of Open Source and ignore the Linux community, nor can you aspire to RIA platform leadership in the Web-app world of the future without including in your plans the fastest-growing platform in computing.

Adobe's shortsightedness in its approach to Linux may be good for earnings per share (short-term), but it is emblematic of the company's inability to articulate a longer-term vision that embraces all of computing. It undermines the company's credibility in the (ever-growing) Open Source world and speaks to a mindset of "quarterly profits über alles" that, frankly, is disappointing in a company that aspires to RIA-platform leadership. IBM and others have found a way to invest in Open Source and alternative platforms without compromising long-term financial goals or putting investor interests at risk. The fact that Adobe can't do this shows a lack of imagination and determination.

How much can it possibly cost to support Flex Builder on Linux, or (more to the point) to have a comprehensive, consistent policy of support for Linux going forward?

Conversely: How much does it cost not to have it?

Monday, May 04, 2009

The most important job interview question to ask an R&D candidate

I've been thinking about what one question I would ask a job candidate (for an R&D job) if I could ask only one question. This assumes I've already asked my favorite high-level question, which I discussed in yesterday's post.

Most good "R&D job" questions, of course, are open-ended and have no single "right" answer. They're intended as a starting point for further discussion, and a gateway to discovering the reasoning process of the candidate.

One of the better such questions I've heard during an interview came when I applied for a job at a well-known search company. One of the five people who interviewed me asked: "Explain how you would develop a frequency-sorted list of the ten thousand most-used words in the English language." This was an outstanding question on many levels and led to a very lively hour-long discussion. But I'll save that for another day.

To me, if I'm interviewing someone who is going to be involved in writing code, and I can only ask one question in the course of an interview, it would be: "Explain what 'bad code' means to you."

If the person starts going down the road of "See what kind of warnings the compiler gives you," "run it through lint," etc., I would steer the person back on track with: "Aside from that, what would you do if I gave you, say, a couple thousand lines of someone else's code to look at? How would you judge it? What sorts of things would make the code 'good' or 'bad' in your eyes? Assume that the code compiles and actually works."

If the talk turns immediately to formatting issues, that's not good.

Presence or absence of comments: Starts to be relevant.

Coding conventions (around the naming of variables and such): Yeah yeah yeah. That's good. What else?

What about the factoring of methods? Is the code overfactored? Underfactored? Factored along the wrong lines? How can you tell? (This leads also to the question of how long is too long, for a class or method?)

What about evidence of design patterns? Does it look like the person who wrote the code doesn't know about things like Observer, Visitor, and Decorator patterns?

Does the code exhibit any antipatterns? Is it just plain hard to follow because of methods trying to "do too much," overuse of custom exceptions, overuse of cryptic parent-class methods, constructors or method declarations with 15 million formal parameters, etc.?

What about performance? Does it look like the code might be slow? (Why?) Could the author have perhaps designated more things "final"?

Is code repeated anywhere?

Is the code likely to create garbage-collection concerns? Memory leakage? Concurrency issues?

This list goes on and on. You get the idea.

Special extra-credit points go to the candidate who eventually asks larger questions, like: Was this code written to any particular API? Is it meant to be reusable? (Is it part of a library versus plain old application code? How will people be using this code?) Is it meant to have a long lifetime, or will this code be revisited a lot (or possibly extended a lot)?

I'm sure you probably have favorite R&D questions of your own (perhaps ones you've been asked in interviews). If so, please leave a comment; I'd like to see what you've got.

Sunday, May 03, 2009

If I could ask only one job interview question

Someone asked me the other day (in response to my earlier blog about job interviews) what question I would ask during a job interview if I could ask only one question. Which is, itself, a very interesting question.

I initially responded by asking for more context, specifically: What kind of position am I trying to fill? Is the candidate in question applying for an R&D job (e.g., web-application developer)? Or is she applying for a software-industry position that requires no programming knowledge per se?

In general, as a hiring manager, the thing that interests me the most is the person's ability to get work done, and in technology, that, to me, means I want someone who is an incredibly fast learner. There's no way you can come into a job already possessing 100% of the domain knowledge you're expected to have; some on-the-job learning is bound to be necessary. Beyond that, even when you've adapted to the job, constant learning is a requirement. I've never seen a tech job where that wasn't true.

So one of my favorite interview questions is: "If you have a hard assignment that involves subject domains you know little or nothing about, what would be your approach to attacking the problem?"

If the person's first words are "Consult Google," that's fine -- that's a given, actually. But I want to know more. Consult Google, sure (consult Wikipedia, maybe), but then what?

What I don't want to hear as the very first thing is "Go ask someone in R&D" or "come and ask you." If your very first tactic is to disturb busy coworkers (without first doing any homework yourself), it means you're too lazy to at least try to find answers on your own. It also means you're inconsiderate of the value of other people's time. Newbies who ask questions that could easily have been answered with a little prior research tend to get rude treatment on forums, for precisely this reason. You don't bother experts (with non-expert questions, especially) unless you've already done all the work you can possibly do to solve the problem yourself. Only then should you bother other people.

Some good answers to the question of "How would I attack a difficult problem" might include:
  • Go straight to the authoritative source documentation. For example, if the problem involves E4X syntax, go read ECMA-357. If the problem involves XPath, go to the W3C XPath Language Recommendation. And so on. Go to the source! Then work your way out from there.
  • See what's already been done internally, inside the organization, on this problem. That means consulting company or departmental wikis, reading internal documents (meeting minutes, business intelligence reports, etc.), reading source code and/or code comments, and so on.
  • Find out who the acknowledged experts (in the industry) are on the subject in question and go look at their articles and blogs. Consult forums, too, if applicable. Post questions on forums, if you can do so without revealing private company information.
  • If you have a friend who is knowledgeable on the subject, reach out to the person and pick his or her brain (again providing you're able to do that without revealing proprietary information about your current project). I don't care if you bother someone outside the organization with endless questions.
  • Finally, if you need to, find out who inside the organization is the domain expert on the subject, and ask that person if you could have a little of his or her time.

In summary, I need someone who is smart and a fast learner, but also resourceful and self-motivated.

This post is getting to be longer than I thought, so I'll stop. Tomorrow I want to speak to the issue of what question I would ask an R&D candidate, if I could ask any question during a job interview. That'll be fun, I promise.

Saturday, May 02, 2009

There's a DAM Elephant in the Room


Typically, in my day job as an analyst, I'm on the receiving side of briefings, but the other day I actually gave one to a customer wanting to know more about the Digital Asset Management (DAM) marketplace. I took questions on a wide range of issues. But then, at one point, the customer put forward a really thought-provoking question, something I myself have been wondering for some time: Where is Adobe Systems in the DAM world? What's it doing in DAM?

The reason this is such a good question is that Adobe already has most of the necessary pieces to put together a compelling enterprise DAM story (even if it hasn't yet assembled them into a coherent whole). Some of the more noteworthy pieces include:
  • Some very interesting workflow and rights-management bits in the LiveCycle suite.
  • Adobe Version Cue, which provides a versioning and collaboration server for workgroup scenarios. Version Cue uses an embedded instance of MySQL and has SOAP interfaces.
  • Adobe Bridge, a lightbox file-preview and file-management application with some metadata editing and other tools built-in. This piece is bundled into the Adobe Creative Suite products. (Interestingly enough, Bridge is a SOAP client that can talk to Adobe Version Cue servers.)
And of course, the CS products themselves are used extensively by the same creative professionals whose needs are addressed by conventional DAM products of the Artesia, MediaBin, or North Plains variety. Most of the big DAM offerings try hard (with varying degrees of success) to integrate smoothly with Adobe's creative tools, InDesign in particular.

The one piece that's missing from all this is a standards-based enterprise repository. What Adobe could use right about now is a robust ECM repository (CMIS-compliant, of course) built on industry standards, something written in Java that will play well with JRun and offer pluggable JAAS/JACC security, with LDAP directory friendliness, etc. That's a lot of code to write on your own, so obviously it would behoove Adobe to either partner with an ECM player or leverage an open-source project. Or maybe both.

You may or may not remember that back in March 2008, Adobe launched its Adobe Share service, built atop open-source ECM product Alfresco.

Then in June 2008, Adobe and Alfresco announced a partnership to embed Alfresco's content management software into Adobe's LiveCycle Enterprise Suite.

Later, in September 2008, Adobe partnered with Alfresco in a deal that had Alfresco powering the popular Acrobat.com site. (That site is currently on the verge of surpassing LiveMeeting.com and OfficeLive.com for traffic.)

Could Alfresco be the linchpin of a future Adobe DAM strategy? Hard to tell, but the handwriting on the wall, it seems to me, is starting to become legible.

As far as the DAM industry as a whole is concerned, Adobe is clearly the elephant in the room at this point. When and if this giant rich-media pachyderm decides to step forward and enter the DAM world proper, it could cause the ground to shake. It might register on seismographs as far away as China.

My opinion? Now is not too early to take shelter under a nearby doorway or other structurally reinforced part of the building.