Xena

From InfoAnarchy

See also: Bot Blogging | Xena

  1. A TV series. See Wikipedia entry
  2. An inactive, developmental IRC bot that logs URLs being discussed in an IRC channel. It is currently inoperative, used only in channels unrelated to infoAnarchy, and in ways dissimilar to those described below. Nevertheless, the ideas below may prove useful to someone, so this page is preserved historically intact.

This page served as a development log and overview for developers visiting the IRC channel #infoanarchy.


Source(s): Xena



This page details the new features being added to xena (a bot in #infoanarchy on irc.oftc.net) to make her more useful.

We discussed in channel that she should add urls to an archive the same way chump does, but with some additions detailed below.

Because of time commitments etc., others are now looking at using chump as the basis for these changes, rather than the clean rewrite tav was originally looking at. Without talking to tav it is difficult to judge why he thought a clean rewrite based on xena was a better option than hacking the changes onto chump.

I have split this page into parts to facilitate planning for either approach (modifying xena, or modifying chump), and for making a separate zope-based archiving site to preserve the text content of the referenced pages.

Noted difficulty levels are guesses only, and if you know better, please correct any mis-guesses. Items shown in each section are in approximate descending order of urgency.




xena

  • Perhaps where she browses a web page and gives back the first 1024 chars, she could dump the whole page to the person who requested it via a /msg (or notice). (Only worth doing if coding would be dead easy, e.g. three lines or less.)




bot-blogging interface

  • (easy) When the url is given, take the remainder of the line as the initial title in the index.
  • (easy) If a url has already been posted today, return the index reference already used for the first posting.
  • (easy) Allow requiring use of a prefix character, as in *http://blogthis.com/blah.html, to indicate a bloggable URL and prevent inappropriate blogging. This should at the same time allow URLs to be embedded in text instead of having to be the first thing on the line.
This feature may remove the need for the next item (or at least reduce its urgency).
  • (medium) Allow deleting urls from today's archive in case of accidental additions (because she parses any url dumped into the channel with no text preceding it, which leads to the occasional accidental chumpification at present).
Notification should be given in channel to deter DoS-type deletions. Enough people are usually alive in channel to notice these notifications.
  • (easy) <to channel> NOTIFY the channel when a url is added, and <to nick> NOTIFY on other actions, not <to nick> PRIVMSG as current chump does (Sep 03). People in channel will still see what commands the other people are giving, just not see the bot confirmation.
  • (hard) Unless a title is provided with the url (see above), automatically grab the page title from the html of the web page, but still allow manually overriding it with a title given in channel (giving titles manually is only necessary until the bot itself can grab them).
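As a rough sketch of the title-grabbing item, here is one way it could work, assuming Python and its standard-library html.parser (the names `TitleGrabber` and `page_title` are illustrative, not part of xena or chump):

```python
from html.parser import HTMLParser

class TitleGrabber(HTMLParser):
    """Collects the text of the first <title> element in a page."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = None

    def handle_starttag(self, tag, attrs):
        if tag == "title" and self.title is None:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title = (self.title or "") + data

def page_title(html, override=None):
    """Return the manual title given in channel if any, otherwise
    the page's own <title> text (or None if it has none)."""
    if override:
        return override
    p = TitleGrabber()
    p.feed(html)
    return p.title.strip() if p.title else None
```

The manual override takes precedence, matching the "still allow manually overriding" requirement.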




bot blog changes

  • (easy) Display associated handles (eg H:) alongside URLs given on the current day's log, so it is easy to figure out which handle to use to comment on a particular URL.
  • (easy) If a comment about a URL contains another bare URL, the bare URL should automatically be made clickable (identified by regex). Chump currently doesn't do this; bare URLs need [ ] around them to tell chump to make them clickable.
  • (hard) Automatically grab page title from html of web page, but still allow manually overriding this title with a title given in channel.
  • (medium) The end-of-day rollover on chump gives a clean slate. This is not great: some URLs added near the end of a day are only really seen for a few minutes, and users logging in soon after rollover see an empty chump (or have to browse the archives). This could be solved with 12-hour log pages, showing the most recent two 12-hour sessions on the 'current' log (or, easier, showing the last two days, which needs no modification of session time).
  • (hard) Create a lookup-hash of urls already archived on previous days and the corresponding pages on which they appear, checked when a new url is indexed.
If the new url already exists, show as a note the most recent archive page containing it .. and if no comments or title modifications are given today, probably don't add the url to the day's archive when rolling over (fewer items to index means a more efficient indexing system).
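The lookup-hash item might be sketched like this, assuming Python; the class name and the one-archive-page-per-day naming scheme are assumptions for illustration:

```python
import hashlib

class UrlArchiveIndex:
    """Maps a normalized URL to the most recent archive page it
    appeared on, so a re-posted URL can be reported as a note
    instead of being archived again."""
    def __init__(self):
        self._index = {}   # url hash -> archive page path

    @staticmethod
    def _key(url):
        # Hash rather than store raw URLs so the index stays compact.
        return hashlib.sha1(url.strip().lower().encode()).hexdigest()

    def record(self, url, archive_page):
        # Keep only the most recent page containing the URL.
        self._index[self._key(url)] = archive_page

    def lookup(self, url):
        """Return the most recent archive page holding url, or None."""
        return self._index.get(self._key(url))
```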




archive system

  • Archiving the page contents of referenced pages, preferably in a compressed form, preferably readable via a web interface at the archive site. We wondered if the Internet Archive could be used to do this but couldn't see how. Instead we are looking at using zope, which already supports storing/serving compressed objects.
  • This system could be made by periodically grabbing from the 'vanilla' bot blog page, and making additions, such as links to the archived copy url, which could be conditional on 404 status of actual url.
  • Or it could be done on an as-needed basis each time an entry is blogged via the xena bot.
  • Generate an ID for each referenced URL by taking a SHA-512 (or SHA-384) hash of the data prior to compression.
  • Compress pages according to the following algorithm:
1. look at the file extension; skip known already-compressed formats like jpg or bzip2
2. check whether compression is worthwhile (take random chunks of data and see how well they compress)
3. if the sampled compression beats DESIRED_COMPRESSION_LEVEL, compress
  • Store compressed/uncompressed data as an object in zope, with the associated ID for retrieval (the hash) and other metadata such as date, time, URL, who blogged the site, when it was blogged, in what channel, using what version of bot/storage system, and of course, shoe size (size 9).
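A minimal sketch of the ID-generation and compress-if-worthwhile steps, assuming Python with zlib as the compressor; the DESIRED_COMPRESSION_RATIO threshold, sample sizes, and extension list are made-up values the draft above does not pin down:

```python
import hashlib
import random
import zlib

DESIRED_COMPRESSION_RATIO = 0.8          # assumed threshold; tune to taste
SKIP_EXTENSIONS = {".jpg", ".jpeg", ".gif", ".png", ".bz2", ".gz", ".zip"}

def page_id(data: bytes) -> str:
    # The ID is a hash of the *uncompressed* data, as described above.
    return hashlib.sha512(data).hexdigest()

def worth_compressing(data: bytes, samples=4, chunk=1024) -> bool:
    """Compress a few random chunks and see whether they shrink enough."""
    if len(data) <= chunk:
        return len(zlib.compress(data)) < len(data) * DESIRED_COMPRESSION_RATIO
    total_in = total_out = 0
    for _ in range(samples):
        start = random.randrange(len(data) - chunk)
        piece = data[start:start + chunk]
        total_in += len(piece)
        total_out += len(zlib.compress(piece))
    return total_out < total_in * DESIRED_COMPRESSION_RATIO

def store_form(filename: str, data: bytes) -> bytes:
    """Return the data as it should be stored: compressed only when
    the extension is not a known compressed format AND sampling says
    compression is worthwhile."""
    ext = "." + filename.rsplit(".", 1)[-1].lower() if "." in filename else ""
    if ext in SKIP_EXTENSIONS or not worth_compressing(data):
        return data
    return zlib.compress(data)
```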

retrieving archived data

  • from the generated blog pages
These could be dynamically generated from zope. Zope could retrieve headers first to check for a 404; if 404, provide a link to the cached copy, otherwise link to the live site. Always provide a link to the archived copy as well, but this could be in smaller text.
  • by a search interface
  • One way: user specifies a URL formatted like
<archive-domain>/<archive-base-path>/<desired-site-domain>/<desired-site-path>/<desired-filename.ext>
or for a more precise date-match, like
<archive-domain>/<archive-base-path>/<desired-site-domain>/<desired-site-path>/<desired-filename.ext>/<datetimestamp>
which will cause the nearest match on-or-before the given date/time, or after that date/time if none match before it.
This means users can easily locate various versions of a particular cached URL (if we have cached more than one version) without needing to know in advance what range/selection of dates have actually been cached.
The datetimestamp should be some format similar to 20030427-16:14 .. not a tree format, because then going 'up' from a particular page gives a directory which in most cases, in my experience, is EMPTY or contains very few files. On the other hand, it would be useful to have an intuitive way to list all versions held .. needs a little more thought I guess.
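The nearest-match-on-or-before rule could be sketched like this, assuming Python and the fixed-width 20030427-16:14 stamp format suggested above (fixed-width fields mean plain string comparison orders stamps chronologically):

```python
import bisect

def nearest_version(timestamps, requested):
    """Pick the cached version nearest on-or-before `requested`;
    if none exists on-or-before it, fall back to the earliest
    version after it, as described above.

    `timestamps` is a sorted list of stamps like '20030427-16:14'.
    """
    if not timestamps:
        return None
    i = bisect.bisect_right(timestamps, requested)
    if i > 0:
        return timestamps[i - 1]      # newest stamp <= requested
    return timestamps[0]              # nothing before: first one after
```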




metadata tagging for irc conversations

The concept as discussed in #esp:

<simmo> add attribute plex:goodnes to that time period

<lilo> you almost have to be working from a common log though
<simmo> it wouldn't matter if the timestamps given were(n't) second accurate
<simmo> lilo - yeah you'd need to use a common timezone
<lilo> yeah
<lilo> you're looking for the strings
<simmo> and then you can comment in realtime
<simmo> or maybe add GMT to start of times
<lilo> the time gets you to the ballpark

* lilo nods
<lilo> well, as long as the interface knows what timezone you're in, you're fine

<lilo> comes down to using GMT under the covers though
<simmo> yeah, could allow single time, rather than delta, and just interpret that as a short delta, eg 2s
<tav> the bot could always /ctcp time you...
<simmo> right
<lilo> I'm a big fan of ntpd
<simmo> you could say bot, now
<simmo> you could say bot, now +plex:goodness
<simmo> or maybe just +plex:goodness
<lilo> -3s +plex:goodness
<simmo> ahah
<simmo> someone logging this? ;)
<lilo> or more to the point, -10.5m +plex:goodness
<lilo> yes

A given line to be parsed from xena looks like

1073118736.21 2004/01/03 08:32:16 <simmo/#esp> i'll figure out the search regex and some stuff

Here is what to do with each line:

  • Parse one line into variable $irclogline

(split line into fields, timestamp, person, channel, content)

These could be inserted into a database to allow quick searching for example by some particular metadata field .. but the database could be updated only periodically .. it doesn't matter.
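A sketch of the field-splitting step, assuming Python; the regex mirrors the sample log line above, and the field names are illustrative:

```python
import re

# fields: epoch seconds, date, time, <nick/#channel>, line content
LOG_RE = re.compile(
    r'^(?P<epoch>\d+\.\d+) '
    r'(?P<date>\d{4}/\d{2}/\d{2}) '
    r'(?P<time>\d{2}:\d{2}:\d{2}) '
    r'<(?P<nick>[^/>]+)/(?P<channel>#[^>]+)> '
    r'(?P<content>.*)$'
)

def parse_log_line(irclogline):
    """Split one xena log line into its fields; None if malformed."""
    m = LOG_RE.match(irclogline)
    return m.groupdict() if m else None
```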

  • if a search of ${content} succeeds using the perl regex
m/\+[^+][^\s]+/

(catches +sometextwithoutanyspaces but not ++this; the match must be at least three characters long, the '+' plus two more .. could be made more robust)

  • then it looks like a command .. examine closely
  • if a search of ${content} succeeds using the perl regex
m/^\+([^+][^\s]+)/

(looks for the same type of string we just looked for, but this time anchored at the start of the line with nothing before it)

  • then the user wants to flag their own line ..
$period=1
$timeunits="s"
$flags=$1
$periodstart=$now - $period * $multiplier{$timeunits}
$periodend=$now
  • else if a search of ${content} succeeds using the perl regex
m/^-(\d+)([smhdwyDCM]) \+([^+][^\s]+)/

(looks for something like -3s +plex:goodness at the start of a line)

  • then the user wants to retroactively mark the given period before now

(the period is given in seconds, minutes, hours, days, weeks, years, Decades, Centuries, or Millennia respectively)

$period=$1
$timeunits=$2
$flags=$3
$periodstart=$now - $period * $multiplier{$timeunits}
$periodend=$now
  • else if a search of ${content} succeeds using the perl regex
m/^(\d{1,2}(\:\d{1,2}){1,2}) ?- ?(\d{1,2}(\:\d{1,2}){1,2}) \+([^+][^\s]+)/

(looks for something like 12:42:53 - 17:35:17 +plex:goodness at the start of a line)

  • then the user wants to mark a given period in the format

HH:MM:SS - HH:MM:SS

$periodstart=$1
$periodend=$3
$flags=$5
  • if the period is negative (the end time is earlier than the start time), subtract 24h from the first number and try again
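Putting the three branches together, a Python translation of the cascade might look like this; the regexes are slightly tightened versions of the perl ones above, and the unit multipliers use rough calendar-free values, which is an assumption:

```python
import re
import time

# seconds per unit, rough values (no leap handling): s m h d w y,
# plus D(ecades), C(enturies), M(illennia) as in the regex above
MULTIPLIER = {'s': 1, 'm': 60, 'h': 3600, 'd': 86400, 'w': 604800,
              'y': 31536000, 'D': 315360000, 'C': 3153600000,
              'M': 31536000000}

FLAG     = re.compile(r'\+[^+\s]\S+')                          # any flag-like token
BARE     = re.compile(r'^\+([^+\s]\S+)')                       # "+plex:goodness"
DELTA    = re.compile(r'^-(\d+)([smhdwyDCM]) \+([^+\s]\S+)')   # "-3s +flag"
ABSRANGE = re.compile(r'^(\d{1,2}(?::\d{1,2}){1,2}) ?- ?'
                      r'(\d{1,2}(?::\d{1,2}){1,2}) \+([^+\s]\S+)')

def parse_flag_command(content, now=None):
    """Return (periodstart, periodend, flags) or None, following the
    cascade above.  Absolute HH:MM[:SS] ranges come back as strings,
    since mapping them onto the log's date needs more context."""
    now = time.time() if now is None else now
    if not FLAG.search(content):
        return None                         # no flag-like token at all
    m = BARE.match(content)
    if m:                                   # flag this very line: 1s period
        return (now - 1 * MULTIPLIER['s'], now, m.group(1))
    m = DELTA.match(content)
    if m:                                   # retroactive: "-3s +plex:goodness"
        period, unit, flags = int(m.group(1)), m.group(2), m.group(3)
        return (now - period * MULTIPLIER[unit], now, flags)
    m = ABSRANGE.match(content)
    if m:                                   # absolute: "12:42:53 - 17:35:17 +f"
        return (m.group(1), m.group(2), m.group(3))
    return None
```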

--

OK, given all that, we now have the desired flags in $flags, and they need to be parsed

in perl this is very easy, something like this

foreach my $eachflag (split /:/, $flags) { ... }

What happens next depends .. two possibilities

1. bot spools line into specific channel based on $eachflag (optionally with specific formatting ..)

2. we are updating a database, in which case we find each record with $periodstart <= $timestamp <= $periodend
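The flag splitting and the case-2 record selection are tiny; here is a sketch in Python, with the record layout assumed to be a dict carrying a `timestamp` field:

```python
def split_flags(flags):
    """Break 'plex:goodness' into its colon-separated parts."""
    return flags.split(":")

def records_in_period(records, periodstart, periodend):
    """Case 2: pick every record whose timestamp falls in the period."""
    return [r for r in records
            if periodstart <= r["timestamp"] <= periodend]
```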




logging to a database

This is part 1 of the implementation of case 2 just above: getting the data into the database, i.e. splitting the data and feeding it in.
  • A single log record should look like
*system_time of type time
year of type int
month of type int
day of type int
hour of type int
minute of type int
second of type int
user of type string
channel1 of type string
channel2 of type string
...
content of type string

(system time is indicated as the key by the asterisk character).

  • The content field is parsed and, iff attributes are found, they are indexed into a new table
*system_time
{user1
attribute1
attribute2
...}
{user2
attribute1
attribute2
...}
  • The content is also indexed for a text-search dictionary
*keyword
system_time1
system_time2
...
  • Another (optional) keyword index is used on lookup, with
*keyword
fuzzy_match1
fuzzy_match2
...

Ermm .. I haven't done any databases for a while and I'm not sure that all makes particular sense. This is a draft; I'm particularly unsure about the indexing tables above.

progress, status

Several zope or potentially-zope-capable servers seem to have emerged to do the backend work of archiving sites referenced in the bot-blog (some GB of storage). tav is hoping to work on xena and to oversee the project, but he has been kinda busy of late. Mutiny said something about hacking chump for some of these things. ^matthew seems to be running the chump at present (Sep 03).

Why not use del.icio.us as a backend? (gresco 26/09/2006). See the delicious API.

Apparently the project, to use the term loosely, died a death, swiftly after conception, if not coincidentally thereupon.




comments, requests, volunteers

Any volunteers please speak to these guys in #infoanarchy.

Please post any suggestions/discussions here. It might be an idea to get an IRC client and hang in #infoanarchy on irc.oftc.net if you want to have some idea how this works at present.

