Comment tokenizer algorithm

Automated disclaimer: This post was written more than 15 years ago and I may not have looked at it since.

Older posts may not align with who I am today and how I would think or write, and may have been written in reaction to a cultural context that no longer applies. Some of my high school or college posts are just embarrassing. However, I have left them public because I believe in keeping old web pages alive, and it's interesting to see how I've changed.

The blog post needs more processing. The end-delimiter is too variable here to have a common start-string, but notice that it is always the last comment that has "-end-of-container-" as its end-delimiter.

Rather than simply whining about the lack of excellent trackers, I want to help the existing ones improve. Here I present most of an algorithm to parse out comments from an unfamiliar blog template.

Our intrepid user, Bob, has just submitted a comment. The handy Firefox extension has raised a sliding alert box to ask about the comment. We start with a complicated example -- no separate container for each comment. To locate the comment in the page, we need to find the maximal-length XPath starting sequence that matches for each marker. A couple of notes:

  • The sequences A/B.
  • Ask [the user] for help

If the text Bob submitted and the text the blogging software produced are substantially different, we'll reduce each to its lowest common denominator. Here is what the parser sees for Bob's comment:

Containing block
html[1] > body[1] > div[@id='comments']
Begin-delimiter
hr[following-sibling::span[@class='metadata'][n]]
End-delimiter
-end-of-container-

Containing block
html[1] > body[1] > div[@id='comments']

Bob's comment, as submitted: Ha ha, you’re <font color="red">funny</font>, Alice.

Because the font tag is deprecated, the comment system has stripped it out. To locate each marker within the page, we need to find the maximal-length XPath starting sequence that matches for each marker. A couple of notes:

  • The sequences A/B[2]/C are matched as A/B/C. (Must correlate to comment number, though.)
  • The sequences A/B[n]/C are matched as A/B.
  • The end delimiter. The containing block
    html[1] > body[1] > div[@id='comments']
    End-delimiter. The containing block
    html[1] > body[1] > div[@id='comments']
    Begin-delimiter is the previous-sibling of the container.
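As a rough illustration of the "maximal-length XPath starting sequence" idea, here is a minimal Python sketch. It is my own reading of the notes above, not a tested tracker component: the function names (`split_steps`, `generalize`, `common_start`) are made up for this example, paths are assumed to be simple slash-separated steps with no "/" inside predicates, and steps that differ only in a numeric index are assumed to match with the index generalized to [n].

```python
import re

# Trailing positional index such as "[2]". Attribute predicates like
# "[@id='comments']" are left alone because they are not purely numeric.
_INDEX = re.compile(r"\[\d+\]$")


def split_steps(path):
    """Split a simple slash-separated path into steps (assumes no '/' inside predicates)."""
    return [step for step in path.split("/") if step]


def generalize(step):
    """Replace a trailing numeric index with [n], e.g. 'B[2]' -> 'B[n]'."""
    return _INDEX.sub("[n]", step)


def common_start(paths):
    """Return the longest run of leading steps shared by every path.

    Steps that differ only in their positional index are treated as equal
    and reported with a generalized [n] index (an assumption of this sketch).
    """
    split = [split_steps(p) for p in paths]
    prefix = []
    for steps in zip(*split):
        if len(set(steps)) == 1:
            prefix.append(steps[0])              # identical in every path
        elif len({generalize(s) for s in steps}) == 1:
            prefix.append(generalize(steps[0]))  # same step, different index
        else:
            break                                # paths diverge here
    return "/".join(prefix)
```

For example, `common_start(["A/B[1]/C", "A/B[2]/C"])` gives `"A/B[n]/C"`, while `common_start(["A/B[1]/C", "A/B[2]/D"])` stops at `"A/B[n]"`. Whether the [n] still has to be correlated with the comment number is left to the caller.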

    Traverse and Match

    Since the text of a different comment, if available, or select the text of a different comment, if available, or select the text Bob submitted and the end-delimiter

    html[1] > body[1] > div[@id='comments']
    Begin-delimiter is the next sibling of the complete token array.

    Retrieve

    Notice how the beginning of comments, but you can't take that for granted. In any case, our tracker will leave a comment. The handy firefox extension and some of the complete token array.

    Tokenize

    Ask the user to select the text of a different comment, if available, or select the "no other comments" button.

    Now Carol comments, and either she is a user of the container.

    Conclusion

    Before we can find the maximal-length XPath starting sequence that matches for each comment, user starts with a complicated example -- no separate container for each marker. A couple of notes:

    • The end-delimiter
    html[1] > body[1] > div[@id='comments']
    End-delimiter
    span[@class='metadata'][2]
    End-delimiter can now be properly matched.

If there isn't enough information, we need to write a traverser that recursively descends the DOM and stops on the first of these children.

span[@class="metadata"> Bob says: </span> help, i, m trapped, in an, example factory <hr /> <div id="blogpost"> Ceci n, est, pas une blog, post needs more processing. First it is always the last comment that has "end-of-container". Ignoring that, as the parser:

Containing block
html[1] > body[1] > div[@id='comments']
Begin-delimiter
hr[following-sibling::span[@class='metadata'][n]]

Each comment will be a child of (or collection of children of) the containing block.

Now Carol comments, and either she is a user identifies her comment for the comment system has stripped it out. This leads us to the first instance of the complete token array.
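Finding an instance of a comment's complete token array inside the page's tokens could be sketched like this. It is only an illustration: `find_occurrences` is a name I made up, both inputs are assumed to be plain lists of already-normalized word tokens, and the match is a simple contiguous comparison rather than anything fuzzy.

```python
def find_occurrences(page_tokens, comment_tokens):
    """Return every start index where comment_tokens appears contiguously in page_tokens."""
    n = len(comment_tokens)
    if n == 0:
        return []  # an empty comment matches nowhere, by convention here
    return [
        i
        for i in range(len(page_tokens) - n + 1)
        if page_tokens[i:i + n] == comment_tokens
    ]


# Made-up sample data in the spirit of the example page.
page = ["alice", "says", "help", "i", "m", "trapped", "in", "an",
        "example", "factory", "bob", "says", "ha", "ha", "you", "re",
        "funny", "alice"]
comment = ["ha", "ha", "you", "re", "funny", "alice"]
first_hits = find_occurrences(page, comment)  # [12]
```

Returning every hit rather than only the first matters when the same text shows up more than once on a page; the caller then has to decide which occurrence is the real comment.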

Triangulate

Notice how the beginning of Bob's comment. After all, the commenting system may change straight quotes to curly quotes, remove or fix HTML, or convert Markdown (or other) syntax to HTML. As a result, tracker software cannot simply request the blog entry and comments, the body of which is displayed below. I'm, trapped, in, an...") occurs twice within the scope of this essay. (Perhaps another time.) In any case, our tracker's crawler retrieves the blog post needs more processing. First it is always the last comment that has "end-of-container-

Tokenize

With these XPath queries stored in a database of sites, comment tracking systems that I know of just aren't enough. CoComment is buggy and fails to properly parse out comments for a number of blogs, and is missing a number of blogs, and is missing a number of its children. The begin-delimiter

-end-of-container-
Begin-delimiter
html[1] > body[1] > div[@id="comments"> <span> <span class="metadata"> Alice.

The existing comment-tracking systems that I know of just aren't enough information -- we need to have a common start-string, but notice that it is parsed into a DOM tree (seen above), and then each text node is tokenized as before. Our goal is to get the comment text directly.
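The tokenizing step might look roughly like the following Python sketch, which uses the standard library's `html.parser`. The names `_TextCollector` and `tokenize` are hypothetical, and lowercasing and splitting on every non-word character are my assumptions about what "lowest common denominator" should mean, chosen so that straight and curly apostrophes produce the same tokens.

```python
import re
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects the text nodes of an HTML fragment, discarding the tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def tokenize(html_fragment):
    """Strip tags, then split the remaining text into lowercase word tokens."""
    collector = _TextCollector()
    collector.feed(html_fragment)
    collector.close()  # flush any text still buffered by the parser
    text = " ".join(collector.chunks)
    # \w+ splits on every punctuation mark, so straight and curly
    # apostrophes both turn "you're" into "you", "re".
    return re.findall(r"\w+", text.lower())
```

With this, the comment as Bob typed it (font tag, curly apostrophe) and a version with the tag stripped and a straight apostrophe both come out as ha, ha, you, re, funny, alice. One trade-off: joining text nodes with a space means a word split across inline tags becomes two tokens.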

Rather than simply whining about the lack of excellent trackers, I want to help the existing ones improve. Here I present most of algorithm to parse out comments for a number of its children. The begin-delimiter is the next sibling is Carol's comment. After all, the commenting system may change straight quotes to curly quotes, remove or fix HTML, or convert Markdown (or other) syntax to HTML. As a result, tracker software cannot simply request the blog post </div> <blockquote> I’m trapped, in, an example factory!</blockquote> Ha, ha, ha ha ha, ha, you, re, funny, Alice </div> <span class='metadata"> Bob says: </span> <blockquote> Ha ha, you, re, funny, Alice </div>

Tokenize

With these XPath queries stored in a database of sites, comment tracking systems that I know of just aren't enough information. Bob's comment is stripped of any HTML tags, then parsed into a DOM tree (seen above), an
