BitPoet's Jumbled Musings

Wednesday, March 19, 2014

Fuzzy Word Comparision in MySQL

As the topic has recently come up in a Usenet post, I've rummaged through my MySQL playground database and dug out an implementation of the Levenshtein distance algorithm as a MySQL stored function.

Levenshtein Distance

The Levenshtein distance is a measure for how much two words differ as a sum of changes to change one into the other. This means a distance of zero means that the two words are identical, and every character that needs to be added, removed or changed increases the difference by one. The first of the stored functions below, named LSD (which should by no means be taken as an endorsement to take drugs) returns the distance as an integer. The code is an adaption of the optimized iterative approach that can be found in the Wikipedia article on the Levenshtein algorithm.

Usage example

SELECT LSD('kitchen', 'kitten');
>> 2

To get from 'kitchen' to 'kitten', one needs two steps:

change the 'c' to 't'
remove the 'h'

Levenshtein Ratio

As a distance of 2 may have completely different meanings for long words of fifteen characters opposed two comparing three-character words, the distance on its own doesn't tell enough about the similarity of two words. That's what the LSD_RATIO function below is for, it calculates the percentual Levenshtein distance compared to the maximum word length.

SELECT LSD_RATIO('bad', 'bed');
>> 67

SELECT LSD_RATIO('beast', 'feast');
>> 80

MySQL DDL

DROP FUNCTION IF EXISTS LSD;

DELIMITER $$

CREATE FUNCTION LSD(inParamA VARCHAR(254), inParamB VARCHAR(254))
RETURNS INT
DETERMINISTIC
NO SQL
BEGIN
    DECLARE inpA VARCHAR(254);
    DECLARE inpB VARCHAR(254);
    DECLARE i INT DEFAULT 1;
    DECLARE j INT DEFAULT 1;
    DECLARE curMin INT DEFAULT 0;
    DECLARE cost INT DEFAULT 0;
    DECLARE v0 CHAR(64) DEFAULT REPEAT(CHAR(0), 255);
    DECLARE v1 CHAR(64) DEFAULT REPEAT(CHAR(0), 255);
    DECLARE aLen INT DEFAULT 0;
    DECLARE bLen INT DEFAULT 0;

    IF inParamA IS NULL OR inParamB IS NULL THEN
        RETURN NULL;
    END IF;

    IF inParamA = inParamB THEN
        RETURN 0;
    END IF;

    IF LENGTH(inParamA) = 0 THEN
        RETURN LENGTH(inParamB);
    END IF;

    IF LENGTH(inParamB) = 0 THEN
        RETURN LENGTH(inParamA);
    END IF;

    IF LENGTH(inParamA) < LENGTH(inParamB) THEN
        SET inpA = inParamA;
        SET inpB = inParamB;
    ELSE
        SET inpA = inParamB;
        SET inpB = inParamA;
    END IF;

    SET aLen = LENGTH(inpA);
    SET bLen = LENGTH(inpB);

    WHILE i <= bLen DO
        SET v0 = INSERT(v0, i, 1, CHAR(i-1));
        SET i = i + 1;
    END WHILE;

    SET i = 1;
    WHILE i <= aLen DO
        SET v1 = INSERT(v1, 1, CHAR(i), 1);

        SET j = 1;

        WHILE j <= bLen DO
            IF SUBSTRING(inpA, i, 1) = SUBSTRING(inpB, j, 1) THEN
                SET cost = 0;
            ELSE
                SET cost = 1;
            END IF;
            SET curMin = ORD(SUBSTRING(v1,j,1)) + 1;
            IF ORD(SUBSTRING(v0,j + 1,1)) + 1 < curMin THEN
                SET curMin = ORD(SUBSTRING(v0,j,1)) + 1;
            END IF;
            IF ORD(SUBSTRING(v0,j,1)) + cost < curMin THEN
                SET curMin = ORD(SUBSTRING(v0,j,1)) + cost;
            END IF;
            SET v1 = INSERT(v1, j + 1, 1, CHAR(curMin));
            SET j = j + 1;
        END WHILE;

        SET v0 = v1;

        SET i = i + 1;
    END WHILE;

    RETURN ORD(SUBSTRING(v1,bLen+1,1));
END
$$

DROP FUNCTION IF EXISTS LSD_RATIO
$$

CREATE FUNCTION LSD_RATIO(inParamA VARCHAR(254), inParamB VARCHAR(254))
RETURNS INT
DETERMINISTIC
NO SQL
BEGIN
    DECLARE maxLen INT DEFAULT 0;
    DECLARE lvDist INT;
    IF inParamA IS NULL OR inParamB IS NULL THEN
        return 0;
    END IF;
    SET maxLen = (CASE WHEN LENGTH(inParamA) > LENGTH(inParamB) THEN LENGTH(inParamA) ELSE LENGTH(inParamB) END);
    IF maxLen = 0 THEN
return 0;
    SET lvDist = LSD(inParamA, inParamB);
    RETURN (maxLen - lvDist) / maxLen * 100;
END
$$

Friday, April 2, 2010

The hidden Dangers of PDF files

Didier Stevens brought a weak spot in the PDF format specs to the public's eyes: PDF allows external applications to be launched automatically when opening the page, and there were readers out there which didn't even prompt the user for a confirmation.

Nevertheless, even the confirmation dialog Adobe Reader brings up is cluttered and the real command can be hidden and custom text displayed instead. Users that are used to clicking away the MS Word macro dialog will no doubt be inclined to simply accept the request. Even more, once a user has ticked the checkbox to not show the dialog again, commands will be run without further prompting.

Dangerous ground indeed, especially when it's quite simple to craft an exploit which creates and runs a binary executable file from within the PDF, no loading from the web neccessary - it's really that easy, I've made my own proof-of-concept PDF and compared notes with Didier.

So for the time being, don't use older Foxit readers to read foreign PDF files - they already have a patched version out which at least prompts you to confirm the application launch, check back at their website at http://www.foxitsoftware.com/. When using Adobe Reader, disallow the opening of external non-PDF content in the settings.

Read more on the PDF Launch attack at Didier Steven's Blog.

Tuesday, October 13, 2009

Thinking about Web 3.0

I'm in the mood for reflecting, and the topic my mind is currently hooked on is the future of the browser. We've seen lots of changes in basic technologies like HTML, XHTML, CSS and experienced how our beaten ugly child named Javascript has grown to be a proud, upright citizen in the country of recognized programming languages. It even got a child called XmlHttpRequest that people considered so beautiful that they changed the name of the internet to honor it and called it Web 2.0.

But the question that arises is 'where to now'? It's clear that the dividing line between the desktop and the browser is dissolving, you can count the number of GUI toolkits and offline APIs that have sprung from the ground over the last years if you're unsure. Almost every page you visit includes one or the other, starting from simple graphics effect libraries based on prototype.js or JQuery to full fledged widget collections with database abstraction layers like ExtJS.

The next logical step would be to integrate more and more of that into the browser itself, while laying out an open and investment secure standard. Just imagine writing fifty lines of code that show the user his calendar as neat as Outlook & Friends do, without him having to wait until the 1000kBytes of widgets have finished updating over the slow GSM mobile connection, and without you having to provide enough server power and network bandwidth to deliver those bytes to all your 1000+ users.

Making a local storage engine available and providing a complete offline runtime like Google Gears or Adobe Air do are in my opinion the first steps in the direction of bringing the graphics power of a desktop to the web. I'm curious whether my reasoning will meet up with reality, and if it does, then how long it will take.

Lightbox2 + Images + Automatic Scaling = Yay!

Recently I had the quick need for a floating image javascript and stumbled over Lightbox2. It's really easy to include in a web site and makes nice floating images with 'back' and 'next' buttons and smooth resizing between images.

However, the one feature that came up (and I found I'm not the first one with that need) is limiting the image size to something reasonable. So I've tweaked the code for the imgPreloader a bit to allow me to set a maxWidth and maxHeight property in the LightboxOptions.


        imgPreloader.onload = (function(){
            this.lightboxImage.src = this.imageArray[this.activeImage][0];

            var imgwidth = imgPreloader.width;
            var imgheight = imgPreloader.height;

            if( LightboxOptions.maxWidth ) {
             if( imgwidth > LightboxOptions.maxWidth ) {
              var factor = imgwidth / LightboxOptions.maxWidth;
              imgwidth = LightboxOptions.maxWidth;
              imgheight = imgheight / factor;
              this.lightboxImage.setStyle({
               width: '' + imgwidth + 'px',
               height: '' + imgheight + 'px'
              });
             }
            }

            if( LightboxOptions.maxHeight ) {
             if( imgheight > LightboxOptions.maxHeight ) {
              var factor = imgheight / LightboxOptions.maxHeight;
              imgheight = LightboxOptions.maxHeight;
              imgwidth = imgwidth / factor;
              this.lightboxImage.setStyle({
               width: '' + imgwidth + 'px',
               height: '' + imgheight + 'px'
              });
             }
            }

            this.lightboxImage.width = imgwidth;
            this.lightboxImage.height = imgheight;  
            this.resizeImageContainer(imgwidth, imgheight);
        }).bind(this);

Sunday, February 10, 2008

Apache2 + E_TOOMANYMODULES

Today I'm in the process of putting some updates on one of my root servers. I'm already used to doing most things by hand, as I want real version upgrades and not just security fixes. Doing a complete distribution upgrade is too much work when doing remotely and doesn't stand to reason.

So I started happily, downloaded everything I'm going to need and started compiling. The small bits and pieces went well (BerkeleyDB and other libraries), but with Apache2 I noticed for the first time consciously how many module names have changed, and how many things have been broken into separate pieces. Yet the configure script is still the same old one with "--enable-this" and "--disable-that". So to get everything in shape, I'll have to work my way through every little thing I need or don't. That's annyoing, dear Apache-folks!

I do hope that there's going to be something better soon. The number of apache-packages (be it .deb, .rpm or win32 installers) with missing or incomplete features shows that I'm not the only one who has a hard time keeping every little module needed in mind.

Saturday, October 6, 2007

Auto-Convert HTML to PDF in OpenOffice 2.3 on Windows

Figuring this out took me quite a while, but I finally managed to make OpenOffice 2.3 convert HTML pages to PDF from the command line. The big problem was that all the examples I found had slightly different requirements, so the code didn't work for me (especially when starting soffice with "-invisible"). Here's how to do it:

We need three macros, one for the document conversion, one as a helper to save further typing by easing the creation of OO PropertyValues and one to quit OpenOffice after that. So you need to open an OpenOffice app and select "Extras -> Macros -> Manage Macros -> OpenOffice.org Basic..." from the main menu. The window that pops up should already point to "My Macros / Standard / Module1", if not, navigate there. Click on the "Edit" button and insert the following in the end of the editor window:

Sub EndOO
StarDesktop.Terminate()
End Sub

Function MakePropertyValue( Optional cName As String, Optional uValue ) As com.sun.star.beans.PropertyValue
Dim oPropertyValue As New com.sun.star.beans.PropertyValue
If Not IsMissing( cName ) Then
oPropertyValue.Name = cName
EndIf
If Not IsMissing( uValue ) Then
oPropertyValue.Value = uValue
EndIf
MakePropertyValue() = oPropertyValue
End Function

sub PDFExporting( iFile As String, targetFile As String )
rem ----------------------------------------------------------------------
rem define variables
dim document as object
dim oDoc as object
dim dispatcher as object
dim oDesk as object
rem ----------------------------------------------------------------------
rem load the document

inURL = ConvertToURL( iFile )

oDesk = createUNOService("com.sun.star.frame.Desktop")

oDoc = oDesk.loadComponentFromURL( inURL, "_blank", 0, Array(_
MakePropertyValue( "Hidden", True ) ) )

rem get access to the document
document = oDoc.CurrentController.Frame
dispatcher = createUnoService("com.sun.star.frame.DispatchHelper")

rem ----------------------------------------------------------------------
dim args1(3) as new com.sun.star.beans.PropertyValue
args1(0).Name = "URL"
args1(0).Value = ConvertToURL(targetFile)
args1(1).Name = "FilterName"
args1(1).Value = "writer_pdf_Export"
args1(2).Name = "FilterData"
args1(2).Value = Array(_
Array("UseLosslessCompression",0,true,com.sun.star.beans.PropertyState.DIRECT_VALUE),_
Array("Quality",0,90,com.sun.star.beans.PropertyState.DIRECT_VALUE),_
Array("ReduceImageResolution",0,true,com.sun.star.beans.PropertyState.DIRECT_VALUE),_
Array("MaxImageResolution",0,600,com.sun.star.beans.PropertyState.DIRECT_VALUE),_
Array("UseTaggedPDF",0,true,com.sun.star.beans.PropertyState.DIRECT_VALUE),_
Array("ExportNotes",0,false,com.sun.star.beans.PropertyState.DIRECT_VALUE),_
Array("UseTransitionEffects",0,true,com.sun.star.beans.PropertyState.DIRECT_VALUE),_
Array("FormsType",0,1,com.sun.star.beans.PropertyState.DIRECT_VALUE),_
Array("",0,,com.sun.star.beans.PropertyState.DIRECT_VALUE))
args1(3).Name = "SelectionOnly"
args1(3).Value = true

rem the actual conversion
dispatcher.executeDispatch(document, ".uno:ExportToPDF", "", 0, args1())

end sub

Now just save the changes and close everything. The OpenOffice part is finished, now let's just put a small batch file in our path that does the calling and knows the path to OpenOffice. Open your text editor and insert the following without any newlines (make sure that the path to soffice.exe is correct):

"C:\Program Files\OpenOffice.org 2.3\program\soffice.exe" -invisible macro:///Standard.Module1.PDFExporting("%1","%2") macro:///Standard.Module1.EndOO()

Save the text file in your windows directory (usually C:\WINDOWS) with the name oo2pdf.cmd.

Now you can call it from the command line with the complete paths to the input html and output pdf as parameters:

oo2pdf "C:\Documents and Settings\MyUserName\MyDocuments\input.html" "C:\Documents and Settings\MyUserName\Desktop\output.pdf"

Added: With thanks to John (see the comment), here's the bash command line for *nix-ish operating systems:
soffice -invisible macro:///Standard.Module1.PDFExporting$infile.htm,outfile.pdf$ macro:///Standard.Module1.EndOO

Wednesday, May 9, 2007

New Chapters for my FanFics

If you have been following my Harry Potter 7th-year fanfic, you might be happy to hear that both the German and English version have a new Chapter. If you don't know my fics please have a look and let me know what you think.

English: Harry Potter and the Fifth Soul

German: Harry Potter und die Fünfte Seele

Tuesday, April 17, 2007

Yes, I'm still alive...

...though I'm suffering under a heavy wheather's-nice-but-have-to-work syndrome. On Sunday I was trecking again, together with two friends. We went to Gmunden (some of you may have heared of the famous Gmundener Porcelain) and from there up to the Traunstein. It's a really beautiful mountain with a breathtaking view over the Traun Lake from the top. The weather was unbelievable, it was 26°C in the shadow while we had to wade through a half meter of snow for the last 200 height meters. The treck starts 400 meters above zero and goes up to almost 1700 meters, with a lot of ladders and steel ropes. Nothing for trainer tourists :) I hope to get some photos this week from Peter (he shot like crazy, I'm still astounded that his memory stick didn't burst) to publish on picasa, then I'll put a link here.

If you happen to be in the area and aren't opposed to a bit of casual climbing, you should definitely try the Naturfreunde-Steig there.

Thursday, March 15, 2007

Mountain Trecking

Yes, I know mid March is a bit early, but the weather was just soo good and, seeing I didn't have anything better to do for my holydays, I went on a three day tour in the Bavarian Alps, close to Berchtesgaden. I really had a good time there, and as there weren't many tourists -- too little snow to do serious wintersports, too much for the unexperienced hiker -- it was really peaceful and relaxing. I made some photos that are worth taking a look at, here's the link to the Picasa galleries:

Alpine Trecking - Day1 - Wimbachklamm

Alpine Trecking - Day2 - Hirchbichl

Alpine Trecking - Day3 - Kueroint-Alm and Gruenstein

If I find the time, I'll write a bit more about the tour itself. In any case, if you happen to be in the area in spring and have sunny weather, each of those three tours can be easily made in a day and has its own beauty.

Saturday, March 3, 2007

From ff.net to ebook for the PDA

Hi there!

Ever had your back hurting from sitting in front of the computer for too long, reading stories on fanfiction.net or fictionpress.com? I for one had, so I decided to use my Pocket PC as an ebook reader. After a lot of tries and errors I found the best way to move the stories to my PocketPC is to download everything, combine the different chapters in a single file and export them in tagged PDF format.

For everyone not familiar with PDF, it's an open, print orientated document format. The "tagged" part means that paragraphs and pages have special markers, so the reader application knows which parts should stay together, and where it is allowed to enter line breaks. Normal PDFs don't have that kind of markup, which will have you scrolling left and right like a berserk, which seriously dulls the whole reading experience. Though the Adobe Reader for PocketPC/Windows Mobile can tag documents when you copy them to the device via Microsoft ActiveSync, this doesn't work very well (in fact, it often chokes on bigger docs) so I don't recommend this.

I've also found that just copying+pasting into a document often leads to strange font interpretations, which in turn lead to error messages or crashes on the PDA.

So what I do is the following:

Download all the chapters lokally;
Extract only the story parts, without all the fancy JavaScript parts
Convert everything to UTF-8 charset
Combine all chapters into a single HTML document
Open the HTML file with OpenOffice.org writer (Version 2)
Export to PDF with the "Create tagged PDF" option enabled

Of course, being a computer programmer, the first four steps were a lot too tedious for my taste, so I wrote a small Perl application that does it for me and only needs the story id and the output filename for the HTML. It can't be found online, but if you need it, let me know.