mostscriptorium

The details usually matter… Sean Upton’s bucket-o-bits.
May 20th, 2008

Tweaking virtualenv activation PYTHONPATH

A technique I’ve been finding useful is to create a directory, within the root of a virtualenv application environment, for source packages I’m developing. For this to work, each namespace package I add to that source directory needs to be added to the PYTHONPATH automatically; doing this was as simple as adding a few lines of bash code at the end of the activation script:


SRCBASE=/home/sean/code/pyocm/src
# Append $SRCBASE and each directory directly beneath it to PYTHONPATH
# (assumes directory names contain no whitespace)
for CODEDIR in $(find "$SRCBASE" -maxdepth 1 -type d); do
    # ${PYTHONPATH:+...} avoids a leading ':' when PYTHONPATH starts out empty
    PYTHONPATH="${PYTHONPATH:+$PYTHONPATH:}$CODEDIR"
done
export PYTHONPATH

This saves me from adding these to the local site-packages, keeping development packages separate from dependencies I’ve installed from elsewhere (e.g. eggs from Cheeseshop).

March 31st, 2008

Refine your search!

In Zope/Plone applications, one often asks the catalog for an intersection of results from several indexes. For example, suppose I want all items where all of the following are true:

  1. Subject field is “sports”

  2. SearchableText matches the keyword query “stadium disab*”

  3. Path is equal to ‘/foo/bar’

There are three typical ways to achieve this in a user-interface: (1) an advanced search form/page, (2) collections (saved searches), and (3) drill-down filters within faceted navigation.

This writeup focuses on the third case – faceted search: why it is useful, why it is difficult, and how something effective might be implemented in Plone.

What facets are:

  1. Aspects of a set of search results, usually a sub-set of results.

  2. Often, clickable links to refine your search within a larger set of results.

With faceted search navigation, filtering existing search results is the name of the game. If I search for a term (often a full text search), it may be helpful to ask me whether I want to view only known subsets of the search results I am presented. For example, in a search for “Lincoln,” it may be helpful to see clickable filters beside the results that let me choose to search only the Automotive category, or the History category. It may also be helpful to limit my search to items published within the last year.

Faceted search is often used in a variety of web applications, but is especially useful for large collections of news and information, product reviews, and local search (entertainment guides, business directories). The more structured your metadata, the more likely faceted navigation is helpful to you. With free services like Calais popping up, getting better metadata on even unstructured content is easier to automate.

Unlike navigating using taxonomies within controlled vocabularies, there is no distinct linear hierarchy to this. Applied in Plone, most applications choosing to use this pattern will treat each index as a facet and rely on the portal_catalog machinery to return the results. It is a useful oversimplification to say such a navigation strategy is the iterative set-intersection of multiple results by end-users. The same is true for faceted navigation of search results, or simple collection navigation.
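To make the set-intersection idea concrete, here is a minimal sketch in plain Python. It does not use portal_catalog at all; the three dictionaries are hypothetical stand-ins for indexes, each mapping a value to the set of matching item ids, and `intersect` is a name made up for illustration.

```python
# Hypothetical index data: each "index" maps a value to a set of item ids.
subject_index = {"sports": {1, 2, 3, 5}, "news": {4, 6}}
text_index = {"stadium": {2, 3, 6}}
path_index = {"/foo/bar": {3, 5, 6}}


def intersect(*result_sets):
    """Return the ids present in every result set."""
    if not result_sets:
        return set()
    # Start from the smallest set so each later intersection stays cheap.
    ordered = sorted(result_sets, key=len)
    result = set(ordered[0])
    for other in ordered[1:]:
        result &= other
    return result


matches = intersect(
    subject_index["sports"],
    text_index["stadium"],
    path_index["/foo/bar"],
)
```

Each extra facet a user clicks adds one more set to the intersection, which is why the order of clicks does not matter for the final result.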

What might this look like? Figure 1 shows possible click behavior and permutations of three different search criteria.

Figure 1 - faceted navigation results in a need for result set intersections. We might want to cache these…

Faceted navigation is more like browsing than searching, most of the time. Clickable filters are easy. But with the ability to hyperlink comes the ability to abuse it. And with good metadata often comes large vocabularies of possible choices. Two good UI guidelines apply here:

  1. Do not show links to facets/filters that have no results, even if the filter in question is a popular term in your vocabulary of possible choices. If it is not germane to the results already presented to a user, omit the useless link from your navigation.

  2. Related: show the number of results in each facet link in the navigation. If you are searching automobiles, and have a facet for color, if the results page the user is already on lists seven red cars, thirty-two blue cars, and three black cars, show those numbers within the respective links. If there are zero purple cars in the result sets, just omit the link labeled “purple” per #1.
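Both guidelines fall out naturally if the link list is built by counting the values actually present in the current results. A minimal sketch, assuming results are plain dictionaries (the `facet_links` helper and the sample data are hypothetical, mirroring the car colors above):

```python
from collections import Counter


def facet_links(results, facet_field):
    """Return (value, count) pairs for values present in results, largest first."""
    counts = Counter(
        item[facet_field] for item in results if facet_field in item
    )
    return counts.most_common()


# Hypothetical result set: 7 red, 32 blue, and 3 black cars, no purple ones.
results = (
    [{"color": "red"}] * 7
    + [{"color": "blue"}] * 32
    + [{"color": "black"}] * 3
)
links = facet_links(results, "color")
```

Because only values found in the results are counted, a term like “purple” with zero matches never gets a link in the first place.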

Figure 2 - faceted navigation. This particular example is an application some folks I work with built using Django and PostgreSQL, with stored procedures used to offload some of the work of creating link set counts from the application.

Both of these common-sense user interface constraints pose challenges. Running a single search result that intersects several indexes is one thing (not a problem), but it turns out getting those counts for all the various permutations of facet choices is tricky, and expensive. Getting result counts means querying the catalog, and if you have dozens of clickable facet links on your pages, and you are filtering through, say 100,000 results, you have a problem. Actually you have dozens of problems, each as big as the result set they contain. There’s one good reason one might need a memory watchdog.

To add insult to injury, what if all this expense tied up resources on your application server, sharing such burden with the threads rendering your pages? Such collapsing of burdens is hard to escape given the way Zope is designed (index operations happen in the same thread and on the same hardware as the rest of the application’s execution) – a possible solution is an external catalog, much like Nuxeo used with CPS via NXLucene, and possibly like what will soon be possible with Plone and Solr (collective.solr and collective.indexing).

This isn’t meant to sound bleak. I built a system like this that scaled okay with 30,000 items (throwing hardware at the problem). Trying to make the same system work for nearly 200,000 items (local yellow pages listings) never made it into production. I write this because I’ve seen what does (to some extent) and does not work, and have a few ideas on how the situation can be improved. And if those do not work, we can steal^W borrow ideas from Solr [more].

This problem does not have easy answers, but there are ways to improve this, including caching the metadata for each facet permutation (read: set intersections) using an out-of-band cache-warmer (possibly an external thread talking to a cache that is not thread-local?). Asynchronous invalidation notifications to such a cache could happen within writes to Zope-based content (read: event subscribers and a message queue). Such solutions help partition the work away from an in-process bottleneck inside a Plone-based system.
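A rough sketch of the caching half of that idea, in plain Python. Everything here is hypothetical: `FacetCountCache` is a made-up name, `count_fn` stands in for whatever actually queries the catalog, and the cache lives in-process, so the out-of-band warmer and the message queue are left out entirely.

```python
class FacetCountCache:
    """Cache result counts keyed by the combination of filters applied."""

    def __init__(self, count_fn):
        # count_fn is a hypothetical callable: dict of filters -> int count.
        self._count_fn = count_fn
        self._counts = {}

    @staticmethod
    def _key(filters):
        # A frozenset makes the key independent of the order filters were chosen.
        return frozenset(filters.items())

    def get(self, filters):
        key = self._key(filters)
        if key not in self._counts:
            self._counts[key] = self._count_fn(filters)
        return self._counts[key]

    def warm(self, permutations):
        """Pre-compute counts for an iterable of filter dicts."""
        for filters in permutations:
            self.get(filters)

    def invalidate(self, field, value):
        """Drop every cached permutation that includes field=value."""
        stale = [key for key in self._counts if (field, value) in key]
        for key in stale:
            del self._counts[key]
```

An event subscriber fired on a content write could call `invalidate` for each indexed value that changed, and a warmer could later call `warm` to repopulate the dropped entries. Filter values need to be hashable for this keying scheme to work.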

In a follow-up post, I will detail some more specific ideas I have about addressing this area. There is more potential than downside; we just need to be careful of the “gotchas.”

May 23rd, 2007

Simple memory watchdog for Zope instances

I needed something very simple to watch for and HUP Zope instances that used too much memory. This bash script works on Ubuntu Edgy and should work without adjustment on other Linux systems using the procps package (which, I think, is most distributions). Here’s the script:


#!/bin/bash

MAX_INSTANCE_RESIDENT_SIZE=2048 #MB
USER_HOME=`(cd ~; pwd)`

MBVAL="0"

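function gb_to_mb {
    # top's 'g' suffix is binary gigabytes, so multiply by 1024;
    # print(...) with parentheses runs under both Python 2 and Python 3
    MBVAL=`/usr/bin/python -c "print(int(float('$1')*1024))"`
    return 0
}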

function watchpid {
    #PID is $1

    ### get resident size (RES, the sixth column) from top, or an empty string
    ### if it lacks a 'g' (gigabytes) or 'm' (megabytes) suffix; awk compares
    ### the PID column exactly so look-alike numbers elsewhere do not match.
    RESIDENT=`top -b -n1 -p $1 | awk -v pid="$1" '$1 == pid { print $6 }' | \
    grep "[gm]" | tr -d '\n'`
    ### exit if no process w/ PID found, or if size has no 'g' or 'm' suffix;
    ### -- in either of these cases, $RESIDENT will be empty
    if [ -z "$RESIDENT" ]; then return 0; fi;

    if [ -n "`echo -n $RESIDENT | grep g`" ]; then
        ##gigabytes, not megabytes
        RESIDENT=`echo -n $RESIDENT | sed -e 's/g//'`
        gb_to_mb $RESIDENT
        RESIDENT=$MBVAL
    fi

    RESIDENT=`echo -n $RESIDENT | sed -e 's/m//'`
    if [ $RESIDENT -gt $MAX_INSTANCE_RESIDENT_SIZE ]; then
        echo "Process (PID $1) is too big: $RESIDENT MB"
        echo "Maximum allowed is $MAX_INSTANCE_RESIDENT_SIZE; sending SIGHUP"
        kill -HUP $1
    fi;

}

function main {
    ## find pid files for zope instances, run watchpid function on them
    ## this will send sighup to these processes if they are too big
    for i in $( find "$USER_HOME" -name Z2.pid );
        do watchpid `cat $i`;
    done
}

main

This is the kind of thing you could run as a cron job every so often. I’ve just finished testing this on a development environment, but each of the two in-production servers I plan to run it on has four zope instances (one for each CPU core), so the impact of restarting just one would be minimal (load-balancer will take instance out of the pool - LVS will make it quiescent during the HUP/restart).

April 16th, 2007

Observation: ZEO Persistent client cache corruption

I’ve had good luck lately using larger-than-default ZEO client cache, with persistent client cache enabled in my zope.conf (ZODB with 4-million objects). This comes at a minor price…

An observation that I can’t find easily documented elsewhere: if you use a persistent client cache ($INSTANCE_HOME/var/*.zec) for ZEO on your zope instance, you ought not to run ‘zopectl debug’ nor any out-of-process scripts that use Zope2.app - this may lead to cache corruption. If you need to use persistent client cache, use another instance_home to run zopectl debug and automated scripts with (and if you have overlapping processes on cron, avoid using persistent client-cache on this instance).

What I get when I try to run an automated content import using Zope2.app from the same instance home as a running zope: first try, commits okay, subsequent attempts get this traceback:

Traceback (most recent call last):
File "import_acxiom2.py", line 221, in ?
app = Zope2.app()
File "/home/upton/eg/swhome/lib/python/Zope2/__init__.py", line 51, in app
startup()
File "/home/upton/eg/swhome/lib/python/Zope2/__init__.py", line 47, in startup
_startup()
File "/home/upton/eg/swhome/lib/python/Zope2/App/startup.py", line 60, in startup
DB = dbtab.getDatabase('/', is_root=1)
File "/home/upton/eg/swhome/lib/python/Zope2/Startup/datatypes.py", line 280, in getDatabase
db = factory.open(name, self.databases)
File "/home/upton/eg/swhome/lib/python/Zope2/Startup/datatypes.py", line 178, in open
DB = self.createDB(database_name, databases)
File "/home/upton/eg/swhome/lib/python/Zope2/Startup/datatypes.py", line 175, in createDB
return ZODBDatabase.open(self, databases)
File "/home/upton/eg/swhome/lib/python/ZODB/config.py", line 97, in open
storage = section.storage.open()
File "/home/upton/eg/swhome/lib/python/ZODB/config.py", line 155, in open
read_only_fallback=self.config.read_only_fallback)
File "/home/upton/eg/swhome/lib/python/ZEO/ClientStorage.py", line 314, in __init__
self._cache.open()
File "/home/upton/eg/swhome/lib/python/ZEO/cache.py", line 112, in open
self.fc.scan(self.install)
File "/home/upton/eg/swhome/lib/python/ZEO/cache.py", line 835, in scan
install(self.f, ent)
File "/home/upton/eg/swhome/lib/python/ZEO/cache.py", line 121, in install
o = Object.fromFile(f, ent.key, skip_data=True)
File "/home/upton/eg/swhome/lib/python/ZEO/cache.py", line 630, in fromFile
raise ValueError("corrupted record, oid")
ValueError: corrupted record, oid

Please feel free to comment if this seems wrong.

April 2nd, 2007

Business journalists: ignorance, mis-phrasing, ethics

I read this brief on marketwatch.com about Vonage delaying its earnings because of the Verizon suit. The way something is worded matters, and this is worded poorly:

… injunction against Vonage, barring the Internet telephony company from using patented technology owned by Verizon.

Jared A. Favole and other writers in the mainstream press misconstrue how patents work. More accurate wording could have been:

… injunction against Vonage, barring the Internet telephony company from using technical methods documented in patents granted to Verizon.

The original form of the quote is neither fair nor accurate. It is certainly not objective. It is misleading, woefully ignorant, and its misinformation implicitly takes a side here. This kind of mis-phrasing is an example of where professional journalists need to be careful, especially business journalists covering a wide beat. It perpetuates misconceptions about what the patent system is and isn’t, and costs individual investors money. This omission of care is at worst an ethical lapse, and at best it is costing people money. Patents are owned, not technology.

Worse, should jurors on a Federal patent trial get these same ideas in their head from a misinformed press, all they will do is allow large corporations to shamefully file parasitic lawsuits. Journalists conflating this type of “intellectual property” with the physical property metaphor are both the “sucker born every minute” and the con artist (there are more informed amateurs writing on this subject, so the bar needs to be raised for professionals writing for business/investor products).

When Verizon sues Vonage, it is really attacking my wallet by keeping my rates artificially high, not subject to real competition. At some point we as a society need to see robbing a few dollars a month from 50+ million people as a harm equivalent or worse to robbing a handful of people at gunpoint (in terms of net social utility). We really, really need patent reform and a ban on software and business method patents. This is getting out of hand.

Disclaimer: I’m a Vonage customer, and have no investor interest in this matter. I’m still being robbed by Verizon.

March 12th, 2007

Adobe Lightroom workflow ideas

I downloaded a 30-day trial of Lightroom on my G4 at home the other day. I was surprised and impressed by the performance on my late-2002 model Mac. Equally impressive is that Adobe built 40%+ of the application in a scripting language (Lua) using an open source database (sqlite). The other promising thing about Lightroom is that metadata (IPTC, EXIF, tag/keyword editing, etc) is a first-class citizen in the UI, not relegated to dialog windows.

About a week ago, I was wondering how this would work with workgroups and asset management systems, so I started tinkering with querying the database (all I needed was to rebuild sqlite as the version Apple distributes in Tiger is old).

I realize workgroup and workflow integration is not a 1.0 ambition for Adobe. Eventually there will be scripting; after that a full SDK. In the meantime, any development you may want to do against Lightroom’s data is going to be out-of-band. The good news is, Lightroom is translucent here: you just need sqlite3 bindings and the ability to extract XMP, and some means of dealing with locking issues when Lightroom is using the database (which means external programs need to be read-only). An external script could, in theory, query Lightroom’s database and do some kind of sync of both assets and metadata to a photo archive or central workflow system. I’m guessing that any official workflow hooks Adobe creates for 2.0 will push this job to integrators or ambitious, tech-savvy end-users via scripting and/or SDK.
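As a starting point for that kind of out-of-band script, here is a minimal sketch using Python’s standard sqlite3 module to open an SQLite file strictly read-only and list its tables. It assumes nothing about Lightroom’s actual schema or catalog file location; `list_tables` is a hypothetical helper, and the path you pass in is whatever catalog file Lightroom uses on your system.

```python
import sqlite3
from pathlib import Path


def list_tables(db_path):
    """Open an SQLite file read-only and return its table names."""
    # as_uri() percent-encodes the path; mode=ro refuses any write attempt.
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
        return [name for (name,) in rows]
    finally:
        conn.close()
```

Opening read-only keeps the script from modifying the database, but it does not by itself settle locking while Lightroom is running; working from a copy of the file may be the more cautious route.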

February 27th, 2007

Newspapers, OpenID, VRM

Simon Willison has a very clear screencast on OpenID here.

Several things seem clear with the possibility of OpenID emerging as an Internet-savvy (and simple) alternative to walled-garden authentication and sign-on:

  1. There will be multiple identity providers (IdPs) for log-on services. Identity systems expert Phil Windley thinks niche IdPs will spring up based on attribute exchange. My interpretation of the idea here is that you might be willing to share personal attributes with organizations in clusters:
    • I’m more likely to want a single identity just for aggregating my subscriptions to email newsletters, for example (not wanting to have to set up a login in a million places, and only wanting to share my email address and, let’s say, zip code).
    • I may have a separate kind of trust relationship for online newspapers…
    • …and another for my online banking.
  2. If there isn’t one central “lock” of a provider (think failure of MS Passport), then there will need to be convenience functions within browsers acting like a wallet. Microsoft CardSpace or other similar types of services come to mind. So we all end up with a digital “wallet” full of OpenID “cards” - and if we don’t like the provider of the wallet, we get another one. If we don’t like an IdP, we cut up the card and find another provider.
  3. In the end, the wallet likely ends up being tied to your web browser, because “easy” usually wins. I may end up with a Firefox plugin that works with OpenID IdPs. The average IE user will likely get a “wallet” from Microsoft. My hunch is that Microsoft and Apple are two companies that can help push open identity systems over the chasm to mass-market adoption, but I do not think they will be in the driver’s seat (open standards organizations and those with a large financial stake in identity openness will be; think financial institutions).
  4. One niche I can see OpenID taking over is federating the tracking of my media “consumption” habits. I know this is a very 1990s way of looking at the problem, but it is a real problem area in trying to find a win-win for media companies/publishers and readers. There are attributes like my behavioral preferences in mass media consumption (I read a lot of tech stuff, am interested in college basketball, that sort of thing). Or the fact that I’m willing to share my age, gender, and zip code with most newspapers, but not with an online store vendor I’m purchasing something from. My point is, I want control of the relationship, and in return, I can/will give so much information to media companies, and I might hope that this buys me a single sign-on for all major U.S. daily newspapers. One could hope that at least online newspapers could become relying parties (RPs) for OpenID-based identities so I didn’t have to register or log in every time.
  5. Think locally, read globally - the local newspaper as a niche IdP possibility:

    Local newspaper companies may have an even greater opportunity in leveraging trust, strong brand, and local credibility to create local logins within their own registration systems (for subscribers, online readers, paid or free), and find ways to federate/exchange attributes with other RPs on newspaper web sites. I want to register in San Diego, and access, say, the Washington Post without having to fill out yet another form - rather, just the one “card” in my digital “wallet.” I know this is not a new notion, but it is nice to think there may be a more agreeable alternative to a centralized provider in a local company I trust.

  6. This is VRM for media consumption participation.

Disclaimer / point-of-view: I work for a local newspaper, but these ideas are really desires from the demand-side of the media relationship, not the supply side.

February 2nd, 2007

Solr - looks interesting

I started trying to set up NXLucene this week for evaluation. The idea of an out-of-process indexing server makes a lot of sense, and I wanted to access using Python. After frustration with trying to get PyLucene 1.9.1 built using recent GCC - to make NXLucene happy, my initial thought was to update NXLucene code for PyLucene 2.0.0 compatibility. Instead, I stumbled on Solr.

Lucene is pretty much the de-facto gold standard for open source search/indexing, written in Java; though people have been using it in Python apps for years, usually it has been through some sort of bridge like PyLucene (built using GCJ/GCC not a JVM - a pain to compile and maintain upgrades).

Solr seems to be a simpler option (with Python support), and an official subproject of Lucene - a web service running a REST interface to index and search fielded/text info. Solr supports typed data fields (corresponding to schema), output in XML, Python, JSON, and a few other formats, and does transactions and replication of indexes.

Solr seems positioned as a loosely coupled, cross-language/platform way to use Lucene.
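To give a feel for how loosely coupled this is, a query is just an HTTP GET with a handful of parameters. The sketch below only builds the URL and sends nothing; `q`, `fq`, and `wt` are standard Solr request parameters, while the `solr_select_url` helper, the base URL, and the field name in the example are assumptions for illustration.

```python
from urllib.parse import urlencode


def solr_select_url(base_url, query, filters=(), fmt="json"):
    """Build a Solr select URL: q is the query, fq are filters, wt is the format."""
    params = [("q", query), ("wt", fmt)]
    params.extend(("fq", filter_query) for filter_query in filters)
    return "%s/select?%s" % (base_url.rstrip("/"), urlencode(params))


# Assumed base URL and field name; adjust to your own Solr install and schema.
url = solr_select_url(
    "http://localhost:8983/solr", "stadium", filters=["subject:sports"]
)
```

Any language with an HTTP client and a JSON or XML parser can consume the response, which is much of the appeal compared with compiling a bridge like PyLucene.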

By way of example, one could replace the entire Catalog in Zope with components/indexes using multiple replicated Solr servers (one master for write/reindexing, and several nodes for queries). NXLucene as a catalog back-end supposedly worked well for Nuxeo with reasonable performance; Solr may be a simpler way to do the same thing in Python apps, with a lower barrier to entry (installation).

June 5th, 2006

Apple annoyances, cocoa cults

For someone who uses a Mac 50% of the time, you would think I might figure out what the deal is with HOME and END keys in Cocoa apps. No, no one (at Apple, at least) thinks it makes sense to make these work - I’ve figured out Ctrl+A, but the NeXT-heads have won this round, because I can’t figure out how to get to the end of the line (I’m too lazy to Google for every subtle NIH-syndrome hotkey recreation). If I filed a bug report on ADC, it would get ignored, but hey, I wouldn’t even know that because Apple won’t tell me.

Open source, on the other hand, is a transparent meritocracy. Darwin doubles down his bets on free *N[IU]X, not OS X.

Apple’s NIH problem is all ego fueled by a cult of personality. It’s very circular - the avant garde of so-called “usability” setting a standard that’s only usable by its own logic.

There are many great usability gems in OS X. There’s also a lot of smoke and mirrors. And two freakin’ useless keys on the overpriced membrane keyswitch Apple-branded keyboard I’m using. MS Word, a Carbon app, manages to make HOME and END work, at least.

The Mac experience for me often ends up the equivalent of a bathroom air-freshener: I can smell the scent of apple-blossoms, but the shit still stinks.

With that in mind, I’m not surprised to see that Mark Pilgrim has switched away from Apple to Ubuntu. Others will follow.

May 25th, 2006

Complex event calendars in Plone - brain dump

SignOnSanDiego is working on an editorial CMS for our Entertainment guide, which includes an event calendar that will contain thousands of events (such as concerts, special events, museum exhibits, etc). This is going to be powered by Plone products we are creating. I know there has been discussion and PLIPs on event recurrence, but I’ve been thinking out some other scenarios useful for event calendaring. This is a brain dump of the oddball cases.

In the best case, we know an event happens at a specific date + time and we either know its duration (e.g. 2 hours) or its approximate stop time.

This allows us to plot specific calendar results. However, there are three cases requiring a bit more exploration:

  1. Basic recurrence.
  2. Scattered cases (one content item wrapping-up multiple times that recurrence can’t handle on one profile). Either by rule or by manifest of specific date-time entries.
  3. Missing stop times.

The information architecture of a good event calendar system requires smart choices about how we relate a profile (content object) describing an event to multiple occurrences.

Complex event cases:

  1. Occurs every Thursday through Sunday between 1/1/06 and 6/30/06
    • About 26 x 4 events ~= 104 event instances.
    • Note: no time specified - do we require time?
      • Or do we create the notion of an all-day event?
      • Or in addition a time unspecified?
      • unspec. / all-day treated same in search, but different label
  2. Occurs every Thursday through Sunday between 1/1/06 and 6/30/06 at 7 p.m.
    Still ~104 event instances, but time is specified.
  3. Occurs Sunday, April 2 - all day.
  4. Occurs Sunday, April 2 - time not specified.
  5. Occurs Sunday, April 2 at 6:30 p.m. - we do not know stop time or duration.
  6. Occurs every Thursday through Sunday between 1/1/06 and 6/30/06 except for the following holidays:
    1. 4/16/2006
    2. 5/14/2006
  7. case #6 with a specified time.
  8. “Free day” at a Balboa park museum:
    • “Occurs every third Tuesday of every month all day.”
    • Basic recurrence is not enough for this.
  9. Scattered days: occurs Mon. April 3, and Wed. Apr. 12. All day events.
  10. Scattered datetimes - occurs:
    • Mon. April 3, at 7 p.m. for 2 hours
    • Wed. April 12 at 6 p.m. for 2 hours
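Several of these cases can be expanded into concrete dates with nothing more than the standard library. The sketch below is only an illustration of the rule-based cases (1, 6, and 8), not a proposed Plone implementation; `weekly_occurrences` and `nth_weekday` are hypothetical helper names.

```python
from datetime import date, timedelta

# date.weekday() numbers Monday as 0, so Thursday..Sunday is 3..6.
THU_TO_SUN = {3, 4, 5, 6}
TUESDAY = 1


def weekly_occurrences(start, end, weekdays, exceptions=()):
    """Yield each date from start to end (inclusive) falling on one of
    the given weekdays, skipping any date listed in exceptions."""
    skip = set(exceptions)
    day = start
    while day <= end:
        if day.weekday() in weekdays and day not in skip:
            yield day
        day += timedelta(days=1)


def nth_weekday(year, month, weekday, n):
    """Return the n-th (1-based) given weekday of a month, e.g. third Tuesday."""
    first = date(year, month, 1)
    offset = (weekday - first.weekday()) % 7
    return first + timedelta(days=offset + 7 * (n - 1))


# Case 1: every Thursday through Sunday between 1/1/06 and 6/30/06.
case_1 = list(weekly_occurrences(date(2006, 1, 1), date(2006, 6, 30), THU_TO_SUN))

# Case 6: the same rule, minus the two listed holidays.
holidays = [date(2006, 4, 16), date(2006, 5, 14)]
case_6 = list(
    weekly_occurrences(date(2006, 1, 1), date(2006, 6, 30), THU_TO_SUN, holidays)
)

# Case 8: "free day" on the third Tuesday of each month, January to June 2006.
case_8 = [nth_weekday(2006, month, TUESDAY, 3) for month in range(1, 7)]
```

The scattered cases (9 and 10) do not reduce to a rule like this; they are simply an explicit manifest of dates or datetimes stored alongside the event profile.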

Stop times or duration:

Also, events often do not have stop times, or they are implied. You know to budget a few hours for a movie (though a movie may have a run length, which is helpful - our automation can likely approximate stop-time from a movies feed). We can editorially approximate the amount of time it would take for an event such as an opera. What if we leave the stop times of some or many events unspecified - what does this mean for:

  • Search?
  • Event clipping and iCal/vCal export?

Events imported from other systems (e.g. a Ticketmaster events feed) might not have stop times or duration, so we can’t plan to editorially approximate anything that is automated (too much work).

Missing stop time may not be a big deal, but it is something to think over anyway.