Diana the Valkyrie

Diana the Valkyrie's Newsletter - September 2004

A hard man is good to beat

August, 2004

Downpours, floods and high winds - you know that Summer's here. This has been the wettest August for 50 years.

New and updated Galleries

Galleries added this month.

The Library

Stories added this month.

Listen with Diana the Valkyrie

Nothing new

The Movie Theatre

Movies added this month.

Newsthumbs

The new Newsthumbs system

The advantages to you

The new Newsthumbs system is now up and running. The main advantages over the old system are:

  • Instead of 12 different "older servers" that you have to browse or search individually, there's just one "older newsthumbs". Behind the scenes that's still 12 servers, and I expect to add more at the rate of one every six weeks - by 2007 there will be about 50 - but to you it will still look like one huge server.
  • The digestion of incoming news is now being done by two servers instead of one, so the thumbnails are ready sooner. I also improved the way thumbnails are made, to make it faster. So the newsthumbs now tend to be ready at about 1pm UK time, which is about eight in the morning, EST. The scheme is expandable - I can use three, four or even more servers.

    The technical details

    In order to have files on 12 servers (more in future) be accessible in a way that looks like one server, I had to do a few tricky (but fun) things.

    When you access a file, such as http://volds.thevalkyrie.com/news/newsfast/alt.amazon-women.admirers/062/watercarrier.jpg, the server realises that this is a job for the database I made. So, it strips off the end newsfast/alt.amazon-women.admirers/062/watercarrier.jpg, and feeds that to the program that will find the picture. This program I imaginatively called get.cgi.

    Get.cgi takes the filename alt.amazon-women.admirers/062/watercarrier.jpg and calculates an md5 checksum of the name.

    Then it looks in another database for that checksum, which tells it on which server, and under which name, that file is to be found. And it goes and gets that file, and displays it to you.
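
    For the curious, the first step looks something like this in perl. A sketch only, not the actual get.cgi: the server name and the lookup() routine are stand-ins for the database search described below.

    use strict;
    use Digest::MD5 qw(md5_hex);

    # the Apache rewrite hands get.cgi the part of the URL after /news/newsfast/
    my $name = 'alt.amazon-women.admirers/062/watercarrier.jpg';

    # the md5 of that name is the key into the index
    my $namedigest = md5_hex($name);

    # lookup() stands in for the whole database search described below;
    # it answers two questions: which server, and under what name?
    my ($server, $realname) = lookup($namedigest);

    sub lookup { ('volds03', 'pics/00123456.jpg') }    # stub values for the sketch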

    Underneath that, there's some horribly complicated stuff. I needed to make all this work so fast that you wouldn't realise that there was anything happening, and I needed to be able to create all these databases in a reasonable length of time.

    My first attempt used the Berkeley DB database, because everyone says that's fast. Maybe it is, but creating a Berkeley database of 100 million records (that's how many pictures there are on these 12 servers) took a long time. A very long time. I stopped it running after a week, and it had only done a fraction of the job.

    So, I thought, let's abandon these noddy toy databases, and use a real industrial-strength SQL database. I used MySQL, because people say that's fast and powerful. Plus, it's free. Yes, well. Same problem. After a week of computation spent populating the database, and only a fraction through the job, I gave up.

    I suppose I could have tried Oracle or one of the real heavyweights, but to do that I'd need to spend a humongous amount of money, and I have no strong belief that it would do significantly better than the others at populating the initial database.

    At that point, I realised that I was going to have to reinvent the wheel, except my wheel wasn't going to take a month to turn. I almost completely abandoned using available databases, and wrote my own. I call it the Valkyrie Homebrew Picture Database.

    First, I created a simple file called consolidated.dat. Each record looks like this:

    Servernumber|newsgroupnumber|partialfilename|checksum

    Servernumber is which server the file is on. Newsgroupnumber is a simple look-up: instead of referring to alt.amazon-women.admirers I can refer to a much shorter number, which makes consolidated.dat much smaller. Partialfilename is 062/watercarrier.jpg, and checksum is the md5 checksum of the file itself. Two files with the same checksum are pretty much guaranteed to be the exact same file (on usenet, the same files get posted again and again).

    So you can see, when I need to add a new server, all I have to do is create a file like that and append it to the end of the existing one, which takes maybe 15 minutes.
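
    In perl, appending one record might look like this - a sketch only, with the newsgroup-number table and the file names made up for illustration:

    use strict;
    use warnings;
    use Digest::MD5;

    my %groupnumber = ('alt.amazon-women.admirers' => 17);   # made-up look-up table

    my $server  = 13;                          # the new server
    my $group   = 'alt.amazon-women.admirers';
    my $partial = '062/watercarrier.jpg';

    # checksum the picture itself, so duplicates can be spotted later
    open my $pic, '<', "$group/$partial" or die "can't read picture: $!";
    binmode $pic;
    my $checksum = Digest::MD5->new->addfile($pic)->hexdigest;

    open my $out, '>>', 'consolidated.dat' or die "can't append: $!";
    print $out join('|', $server, $groupnumber{$group}, $partial, $checksum), "\n";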

    Interlude ... at that point, I ran into a bit of a problem. The entire consolidated.dat is about 6 GB, and there are a lot of things that can't handle files larger than 2 GB. So I can't ftp the file from one server to another using the usual ftp software; I have to use a more advanced ftp program called lftp (which I had to find and install), or else use scp (secure copy), which is faster and easier, and which I already have installed. Also, I can't use perl 5.6 (the version I've been using for everything); I have to use perl 5.8 or above. And that means I can't use some of the other stuff I've been using, and I had to go and find upgrades for those too. Sigh. End of interlude.

    When I have consolidated.dat, I know (in principle) where every file is. But finding something in a file that size would take a long time. What do you do, look at each record, one at a time, until you find what you're looking for? Dullsville. So, I index it. I read that monster 6 GB file, and create records like this:

    namedigest|place

    filedigest|place

    namedigest is the md5 of newsgroupname/partialfilename, filedigest is the file's own md5 checksum, and place is the byte offset into consolidated.dat where that record starts. Creating these indexes takes several hours (six, the last time I did it). That sounds like a long time, until you remember that Berkeley DB or MySQL was taking several *weeks* to build the database. Or possibly several months; I didn't let it run to completion.

    But I don't just write those index records straight into two files. I look at the first byte of the digest, and write the record into a different file depending on that byte. So, if the first byte is hex 61, the record goes to nameindex61.dat or digindex61.dat.

    So now I have two sets of 256 files of 21-byte records - about 100 million records per set (one set for namedigest, one for filedigest). Each set comes to about 2 GB, but divided into 256 pieces, so the files are only a few megabytes each. I sort each of those files, then concatenate all 256 of them. This gives me two sorted files, nameindex.dat and digindex.dat, with fixed-length records. The reason I write into 256 different files is that it's a lot faster to sort 256 small files than one big one. And the reason I want fixed-length records is that they're easier and faster to handle.
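
    Here's a sketch of that scatter-sort-concatenate step, assuming the 21-byte records (16-byte digest plus 5-byte place) arrive on standard input; the file names are illustrative.

    use strict;
    use warnings;

    # scatter: each record goes into one of 256 buckets,
    # chosen by the first byte of its digest
    binmode STDIN;
    my @bucket;
    for my $i (0 .. 255) {
        open $bucket[$i], '>', sprintf('nameindex%02x.dat', $i) or die $!;
        binmode $bucket[$i];
    }
    while (read(STDIN, my $rec, 21) == 21) {
        print { $bucket[ord substr($rec, 0, 1)] } $rec;
    }
    close $_ for @bucket;

    # sort each small bucket in memory, then concatenate in byte order
    open my $out, '>', 'nameindex.dat' or die $!;
    binmode $out;
    for my $i (0 .. 255) {
        open my $in, '<', sprintf('nameindex%02x.dat', $i) or die $!;
        binmode $in;
        my $data = do { local $/; <$in> };        # slurp the whole bucket
        next unless defined $data && length $data;
        print $out join '', sort unpack '(a21)*', $data;
    }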

    One file is namedigest|place and is called nameindex.dat; the other is filedigest|place and is called digindex.dat. Place is a five-byte integer - don't ask how I handle such a weird thing (it's very ugly, but a lot faster than the library function for handling big integers, which is no doubt totally general but horribly slow). I need five bytes because place is a byte offset into a 6 GB file, and five bytes lets me handle files up to 1024 GB, so it should last several decades before I have to redesign. You see, if I used the more normal four-byte integers that you get for free with every computer, I could only count up to 4 GB. Not enough.
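
    Since you asked: one way to do the five-byte trick (my sketch of the idea, not necessarily the actual ugly code) is a high byte plus an ordinary four-byte integer. Perl's floating point arithmetic is exact well past 1024 GB, so the sums are safe.

    use strict;
    use warnings;

    sub pack5 {                                    # number -> 5 bytes, big-endian
        my $n    = shift;
        my $high = int($n / 4294967296);           # the top byte
        my $low  = $n - $high * 4294967296;        # the bottom 32 bits
        return chr($high) . pack('N', $low);
    }

    sub unpack5 {                                  # 5 bytes -> number
        my $s = shift;
        return ord(substr($s, 0, 1)) * 4294967296 + unpack('N', substr($s, 1, 4));
    }

    my $place = 5_999_999_999;                     # an offset past the 4 GB limit
    die "round trip failed" unless unpack5(pack5($place)) == $place;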

    But nameindex.dat is still 2 GB, and searching through that would take a long time. It's much too big to keep in memory, so each search would mean lots of disk reads - horribly slow. The answer to a slow search? Index it.

    So, I index the index, looking at just the first 5 hex digits of each digest. The a1823'th record of nameindexindex is the record offset into nameindex of the first entry whose namedigest starts with a1823. And remember, they're sorted, so the last such entry is the one just before wherever the a1824'th record of nameindexindex points.

    This gives me two files, nameindexindex and digindexindex, which are a mere 4 MB each. They hold 4-byte records, and there's a million of them. Always. And there's an average of 100 records in nameindex per record in nameindexindex. If I get twice as many files (200 million pics) then that average will double, but it'll still be pretty fast - the run speed degrades gracefully, until eventually I either change from a brute-force (one-by-one) search of 100 21-byte records to a binary chop (they're sorted), or change from a 5-hex scheme to a 6-hex scheme (I'll worry about how to handle that a couple of decades from now). So, now I have two 4 MB files.
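
    Putting the layers together, one lookup goes something like this sketch (file names assumed, error handling and the very-last-slot edge case left out):

    use strict;
    use warnings;

    sub find_place {
        my ($namedigest) = @_;                    # 16 raw md5 bytes
        my $slot = hex(substr(unpack('H32', $namedigest), 0, 5));

        # nameindexindex: a million 4-byte records; the N'th one holds the
        # record number in nameindex of the first digest starting with N
        open my $ii, '<', 'nameindexindex.dat' or die $!;
        binmode $ii;
        seek $ii, $slot * 4, 0;
        read $ii, my $pair, 8;                    # this slot and the next one
        my ($first, $next) = unpack 'NN', $pair;

        # brute-force search of the ~100 records in that range
        open my $idx, '<', 'nameindex.dat' or die $!;
        binmode $idx;
        seek $idx, $first * 21, 0;
        for ($first .. $next - 1) {
            read $idx, my $rec, 21;
            return substr($rec, 16, 5)            # the 5-byte place
                if substr($rec, 0, 16) eq $namedigest;
        }
        return undef;                             # not in the index
    }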

    My original idea was to use mod_perl, which makes my program persistent: once it had read those 4 MB files, they'd stay in memory permanently. But mod_perl either didn't work, or else it screwed up the Apache rewrite (the thing that takes the URL you ask for and feeds it into my get.cgi program). So I tried it without mod_perl, and it worked fine - except that each picture access had the overhead of reading those two 4 MB files before the database could get started, which took a few seconds, and a few seconds before finding each picture is noticeable.

    So instead of reading those files, I put that data into a Berkeley DB. The hash in the DB maps onto a hash (associative array) in perl very nicely, so I can just pretend it's all in memory while the DB does the caching and hashing and indexing, and startup became near-immediate.

    That DB means I have an index to the index of the index. And since consolidated.dat is actually an index to where the files themselves are, it means I have an index to the index of the index to the index. Four layers of index. I can tell you, as I was building this contraption, I had to stop a few times to uncross my eyes.
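
    For the record, tying a Berkeley DB to a perl hash looks like this (a sketch with an assumed file name, not the actual code):

    use strict;
    use warnings;
    use DB_File;
    use Fcntl;

    # the DB file is opened, not read in: startup is near-immediate,
    # and Berkeley DB does the caching and hashing behind the scenes
    tie my %nameindexindex, 'DB_File', 'nameindexindex.db',
        O_RDONLY, 0644, $DB_HASH or die "can't tie: $!";

    my $offset = $nameindexindex{'a1823'};        # reads like an in-memory hash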

    I feel sure that all the nifty programmers are laughing themselves silly at this scheme, and wondering why I didn't use ... well, I don't know. If I had known a neater way to do this, I'd have used it.

    Those of you who aren't programmers are probably wondering why I went to all this trouble. Well, there were 12 older servers, I was adding them at a rate of one per six weeks, and that rate is accelerating. I guess that within three years, I'd have about 50 "older servers", and anyone wanting to do a search would spend hours searching them one by one.

    I also changed the search engine. Before, I was using grep (brute-force searching) on a file holding all the filenames and subject lines for each post. Indeed, I'm still using that for the current newsthumbs - it's simple and fast, provided you aren't trying to search a really big file. But for the older newsthumbs, I needed something that would scale to 100 million pictures and beyond. I use Swish-e, the same thing I use for searching the library. It's fairly fast, considering how much it's doing, and it searches all the older servers at once. Then I de-duplicate what it finds, so you aren't downloading fourteen copies of the exact same file, zip the files, and you can download them.
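
    The de-duplication is easy once everything has an md5 checksum: keep the first file you see with each digest, drop the rest. A sketch (in the real system the digests are already sitting in consolidated.dat, so recomputing them here is just for illustration):

    use strict;
    use warnings;
    use Digest::MD5;

    my @hits = @ARGV;                             # files the search turned up
    my (%seen, @unique);
    for my $file (@hits) {
        open my $fh, '<', $file or next;
        binmode $fh;
        my $digest = Digest::MD5->new->addfile($fh)->hexdigest;
        push @unique, $file unless $seen{$digest}++;   # first copy wins
    }
    # @unique is what goes into the zip for downloading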

    I changed the "calendars", the pages that give you a menu of dates for the newsgroup you're interested in. Instead of telling you how many pictures there are for each date, which I'm guessing isn't that important, it just checks to see whether there's some or none, and if none, there's no link for that date. The first time I tested that, it took half a minute to check all the dates to see if there are pictures for any one newsgroup. But then I realised, I can do all that just once, and store the calendars as files. So now, when you load a calendar, it comes up immediately.

    This new system has been up and running for some days now, and apart from a few minor things that I had to smooth out, it seems to be working fine.

    Shopping Mall

    A new video from Kasie Cavanaugh - "Sexy and muscles", a domination video.

    The Server Farm

    August 10, I trundled the Valkmobile down to Watford with 11 new servers. Installation went a lot more smoothly than it might have. One computer needed a reboot to start up properly, another one had a cable come loose in transit, and I needed to open the case and push everything tight. Now I have 31 servers in place there, mostly Pentium-4 or Celeron-4. My previous generation of Pentium-3 servers is now all downgraded to humbler tasks.

    But I also did some thinking about why the processing takes so long, and whether it could be speeded up with better algorithms. The first thing I looked at was my algorithm for decoding yenc-encoded files. Yenc is the most recent encoder, giving files somewhat smaller than mime or uuencode. The algorithm I was using to decode them was very clunky, looking at one byte at a time. But some googling gave me the following algorithm, which is so elegant, I think you'd like to see it.

    # un-escape: "=x" means the original byte was stored as x+64 (mod 256)
    s/=(.)/chr(ord($1)+256-64 & 255)/egosx;
    # subtract 42 (mod 256) from every byte at once: \326 octal is 214, which is 256-42
    tr[\000-\377][\326-\377\000-\325];
    

    When I tried this on a sample of files, it cut processing time from 60 seconds down to 7.5, so it's about eight times faster. But yenc is only used for maybe 10% of files (mostly in the comics newsgroups, I think), because most files aren't big enough for the reduced download time of yenc encoding to make a difference. Still, it's a saving worth having!

    So I took back all the obsolete (Celeron 333, for example) and non-working computers, and I started to check them out. With Carol, the problem was a CPU fan not working. Now listen to my rant.

    Start of rant.

    We have this electronic marvel of invention, with four 250 gb drives and one 20 gb system drive, with a gigabyte of memory and a motherboard stiff with chips. And we have a cpu fan. The cpu fan is a low-tech mechanical electric-motor type thing, but if it fails, the rest of the computer is just iron and silicon. It's a single point of failure too, you can't put in two in case one fails. And fail they do, fail in quantity, fail and fail. It's the weakest link, and it's an especially weak weak link. It's an appalling piece of design.

    End of rant.

    On Alice, the problem was a partial hard disk failure in one of the 250 gb drives, and a total failure in the boot drive. I replaced them.

    On Fluff, I'm still a bit baffled. If I leave Fluff running for a week or two, I get a multicoloured psychedelic display. This means a near-total crash. But in what, the memory, the CPU or the motherboard? I'll have to investigate more to narrow it down. I'll swap out things until the crashing stops.

    Within a couple of weeks of on-site testing, two of the new servers went pear-shaped, so Holly and Didie will have to come back for a bit of re-education. And just to help things along, Venus (my backup NNTP server) went down. So, three to swap out.

    Understanding the internet - change of address

    People change email addresses quite often. Sometimes it's a change of ISP; sometimes an old address gets totally spam-bound. Change-of-address is a frequent thing. So, how should you handle it?

    Sometimes, you'll be in the lucky situation of still having the old address for a while. If you do, then you can arrange for your email to be forwarded to your new address (if you're changing because of the spam flood, you won't be doing that, of course). It won't happen automatically; you'll have to ask for it, and some ISPs won't do it.

    The usual situation is that it's up to you to tell people about your change of address. Of course, you don't have to. But if you don't, none of your friends will be able to contact you.

    Who are your friends? You might have a list in your mailer. If you do, then it's simple, you can just send an email to all of them. But it's at this point that a lot of people make a crucial mistake. They send an email that says "Hi, my new email address is john@newserver.com"

    People like me are left wondering: who just emailed me? How can I change my record of your email address if I don't know who you are? The answer is simple. Your email should say "Hi, this is John Harrison, my old email address was john@oldserver.com (I also used to use john254@hotmail.com and john291@yahoo.com). My new email address is john@newserver.com"

    Then, folks like me who know more than one John, can make the necessary change to our records.

    You might think that this is obvious. But at least 80% of the change-of-address emails I get are like the first one.

    Cameras

    A camcorder and a camera for Emma (Storm) James; we're working on a web site for her.

    Also, I've improved the deal that I offer to FBBs for making, hosting, billing and administering web sites for them. Before, they didn't have to pay for anything, and we divided the revenue. Now, I've included a digital camera in the deal, suitable for taking pictures at 2048 by 1536 resolution, which is about twice the resolution needed for a web site. I can do this because the cost of these cameras has fallen so much. It's another way I can help to sponsor the athletes.

    Spams of the Month

    I don't make these up, although the comments on the spams are mine, of course. These are actual spams sent to me, which just strike me as funny. I don't include their contact details - go find your own spammers!

    By the way, if you're using StoneColdMail (which is free to web site members) then you won't see most of these spams, they'll be delivered into your "Spam" folder.

    Earn up to $00,000
    

    I think I'll pass, thanks.

    GET LAID WITH COUGARS
    

    You don't think they'd be a bit ... catty?

    Become a Lord or Lady today
    

    Professor Higgins, cry your eyes out.

    Post: We owe you $941390
    

    Could you just PayPal it to me? Thanks.

    Our Univsersity can offer you a Pre-Qualified degree
    

    I'd feel better about that if you could spell "University".


    Sponsorships

    We've sponsored lots of the women: Nicole Bass, Andrulla Blanchette, Sheila Burgess, Christine Envall, Marilyn Perret, Peggy Schoolcraft, Larisa Hakobyan, Steph Parks.

    We're also sponsoring individual events such as the Femsport Valkyrie Festival, and the New York Muscle Club, and funding athletes to go to events with grant dollars.

    We're also doing free hosting and free bandwidth for many of our sponsored women. Bandwidth can mount up to a large bill when you're running a popular web site.

    And we've sponsored Heather Foster, Kara Bohigian, Priscilla Ribic, KerryAnn Allen, Linda Cusmano and Jodi Miller. Also Anita Ramsey and Rhonda Dethlefs.

    DtV Family web sites

    A new site, Femuscle by Pellegrini. Only three galleries there so far, but Pellegrini is working to expand it.

    The Clubhouse

    In the Chatroom

    Chatter of the month

    Member          Posts

    tre1313          7258
    boomerflex       6693
    tkokidd0         6317
    madman3579       6067
    TomNine          5889
    pangel004        5597
    buffy18976       5269
    jcc115           3843
    zig563           3817
    Jerroll          2400
    ens1961          2129
    Jabber           1988
    ginny2442        1918
    thegoat77        1880
    hiram2000        1813
    mit19237         1641
    lpdorman69       1500
    panda1016        1346
    rainer0000       1241
    albogrease       1197

    It's Tre, then Boomer. Hey, why aren't you out getting more great material for Femflex and Boomerflex?

    On the Message Boards

    I've put up a new style of board; it includes threading, the possibility of moderation, and lots more. You can see what Tom Nine's Tussling Tenement looks like, we gave it a new coat of paint, and it looks great. Mixed wrestling sessions is a popular topic!

    This month we had 2799 posts to the boards.

    Most posted Board of the month

    Board                                          Posts

    Politics and economics                           649
    Boomer's sports chat                             297
    Female bodybuilders                              194
    Sergeant Wick and PFC Kandor's Crush Camp        154
    Diana the Valkyrie's message board               100
    Sexuality                                         99
    Rugman's Real Encounters                          89
    Scooby's Femme Fatale Forum, for mixed action     81
    TwoPossums TV and Pictures                        72
    Wrestling                                         64

    Poster of the month

    Member                   Posts

    steve333                   157
    Homoancient                150
    boomerflex                 118
    tre1313                    102
    billwick714                 91
    femgrow007                  80
    davex                       78
    Jabber                      77
    Diana the Valkyrie          76
    bill235                     57

    It's the US elections. So the Politics board has exploded, but in a nice way. Steve has just squeaked past Homoancient, but I read that the Ancient One has been away for a while.

    Board access

    Mavis is counting the number of times the message list is checked for each board. This gives a very different picture from the one above.

    Most listed Board of the month

    Board                                          Listings

    Female bodybuilders                               12361
    Fistman's Finest photos                           10757
    TwoPossums TV and Pictures                        10161
    Rugman's Real Encounters                           8251
    Feats of strength                                  8110
    Scooby's Femme Fatale Forum, for mixed action      7207
    Female muscle                                      7016
    Wrestling                                          5869
    Boomer's celebrity flexing                         5408
    Biceps                                             5195

    Most read Board of the month

    It's all about FBBs and pictures. The Grinch got the stats.

    Back Page

    Happy Birthday, internet, now you're 35 years old.

    I checked the site statistics that Sandra counts up each night.

    At the end of August 2004, there were about 722,000 pictures (46 gigabytes), 153 gigabytes of video, 8600 text files (mostly stories), and a total of about 199 gigabytes. The current Newsthumbs has 1.2 million pictures; there are about 100 million pictures altogether in Newsthumbs. How many web sites do you know that have 100 million pictures?

    To the Magic Carpet