Comment tokenizer algorithm
Automated disclaimer: This post was written more than 15 years ago and I may not have looked at it since.
Older posts may not align with who I am today and how I would think or write, and may have been written in reaction to a cultural context that no longer applies. Some of my high school or college posts are just embarrassing. However, I have left them public because I believe in keeping old web pages alive, and it's interesting to see how I've changed.
The existing comment-tracking systems that I know of just aren't enough. CoComment is buggy and fails to properly parse out comments for a number of blogs, and others lack the advanced features of CoComment.
Rather than simply whining about the lack of excellent trackers, I want to help the existing ones improve. Here I present most of an algorithm to parse out comments from an unfamiliar blog template.
Our intrepid user, Bob, has just submitted a comment. The handy Firefox extension has raised a sliding alert box to ask about the comment. To find the comment in the page, we need the maximal-length XPath starting sequence that matches for each marker. We start with a complicated example: no separate container for each comment. A couple of notes:
- Ask the user for help if there isn't enough information.
If the tracker isn't familiar with this site, or if what Bob submitted and what the blogging software produced are substantially different, we'll reduce each to its lowest common denominator. For Bob's comment, here is what the parser sees:
- Containing block
html[1] > body[1] > div[@id='comments']
- Begin-delimiter and end-delimiter
-end-of-container-
Ignoring that, here is what the parser sees:
- Containing block
hr[following-sibling::span[@class='metadata'][n]]
, and will end with either
hr[following-sibling::span[@class='metadata'][n]]
or "-end-of-container-" for the last comment. Ignoring that, as the parser sees it:
- Containing block
html[1] > body[1] > div[@id='comments']
<blockquote>Ha ha, you’re <font color="red">funny</font>, Alice.</blockquote>
Because the font tag is deprecated, the comment system has stripped it out. We need to find the maximal-length XPath starting sequence that matches for each marker. A couple of notes:
- The sequences A/B[2]/C are matched as A/B/C. (Must correlate to comment number, though.)
- The sequences A/B[n]/C are matched as A/B.
- The end-delimiter's containing block is
html[1] > body[1] > div[@id='comments']
- The begin-delimiter is the previous sibling of the container.
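The notes on matching sequences can be sketched roughly as follows. This is a minimal illustration, not the tracker's actual code: it assumes each marker location arrives as a `/`-separated path string, and the helper names `strip_positions` and `common_prefix` are hypothetical. Dropping every numeric index is a simplification; the correlation with comment number mentioned above is not handled.

```python
import re

# Matches a purely numeric positional predicate such as "[2]".
# Attribute predicates like [@id='comments'] are left alone.
_POSITION = re.compile(r"\[\d+\]")


def strip_positions(path: str) -> list[str]:
    """Split a path into steps and drop numeric indexes: 'A/B[2]/C' -> ['A', 'B', 'C']."""
    return [_POSITION.sub("", step) for step in path.split("/") if step]


def common_prefix(paths: list[str]) -> str:
    """Return the longest starting sequence of steps shared by every path."""
    if not paths:
        return ""
    step_lists = [strip_positions(p) for p in paths]
    prefix = []
    for steps in zip(*step_lists):
        if len(set(steps)) != 1:
            break  # the paths diverge at this step
        prefix.append(steps[0])
    return "/".join(prefix)


print(common_prefix(["A/B[1]/C", "A/B[2]/C"]))
print(common_prefix(["A/B[1]/C", "A/B[2]/D"]))
```

Paths that differ only by index collapse to one shared sequence, while paths that diverge are cut back to the last step they have in common.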
Traverse and Match
Ask the user to select the text Bob submitted, or the text of a different comment, if available. The containing block is
html[1] > body[1] > div[@id='comments']
- Begin-delimiter is the next sibling of the complete token array.
Retrieve
Notice how the blog post sits before the beginning of the comments, but you can't take that for granted. In any case, our tracker's crawler retrieves the blog entry and comments.
Tokenize
The user can select the text of a different comment, if available, or select the "no other comments" button.
Now Carol comments, and either she is a user or another user identifies her comment.
Conclusion
Before we can match comments, we need to find the maximal-length XPath starting sequence that matches for each marker. We started with a complicated example: no separate container for each comment. The result:
- Containing block
html[1] > body[1] > div[@id='comments']
- End-delimiter
span[@class='metadata'][2]
- The end-delimiter can now be properly matched.
If there isn't enough information, we need to write a traverser that recursively descends the DOM and stops on the first of these children. Tokenized, the page contains:
<span class="metadata"> Bob says: </span> help, i, m, trapped, in, an, example, factory <hr />
<div id="blogpost"> Ceci, n, est, pas, une, blog, post </div>
The blog post needs more processing first. It is always the last comment that has "-end-of-container-". Ignoring that, as the parser sees it:
- Containing block
html[1] > body[1] > div[@id='comments']
- Begin-delimiter
hr[following-sibling::span[@class='metadata'][n]]
The comment will be a child of (or collection of children of) the containing block.
Now Carol comments, and either she is a user or another user identifies her comment for the tracker. This leads us to the first instance of the complete token array.
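A traverser that recursively descends the DOM and stops at the first instance of the complete token array might look something like this. It is only a sketch under several assumptions: the page is well-formed enough for `xml.etree.ElementTree`, tokens are lowercased words, and the names `tokens`, `contains`, and `find_container` are made up for illustration.

```python
import re
import xml.etree.ElementTree as ET


def tokens(text: str) -> list[str]:
    """Lowercased word tokens; punctuation and whitespace act as separators."""
    return re.findall(r"[a-z0-9]+", text.lower())


def contains(haystack: list[str], needle: list[str]) -> bool:
    """True if needle appears in haystack as a contiguous run of tokens."""
    n = len(needle)
    return any(haystack[i:i + n] == needle for i in range(len(haystack) - n + 1))


def find_container(node, needle):
    """Descend to the deepest element whose text still holds the whole needle."""
    if not contains(tokens("".join(node.itertext())), needle):
        return None
    for child in node:
        found = find_container(child, needle)
        if found is not None:
            return found  # stop on the first child that holds the whole comment
    return node  # no single child holds it, so this node is the container


page = ET.fromstring(
    "<div id='comments'>"
    "<span class='metadata'>Alice says:</span>"
    "<blockquote>I'm trapped in an example factory!</blockquote>"
    "<hr />"
    "<span class='metadata'>Bob says:</span>"
    "<blockquote>Ha ha, you're funny, Alice.</blockquote>"
    "</div>"
)
match = find_container(page, tokens("Ha ha, you're funny, Alice."))
print(match.tag, match.text)
```

When a template has no separate container for each comment, the comment text sits directly under the shared parent, so the function returns that parent and the delimiters have to do the rest of the work.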
Triangulate
Notice how the beginning of Bob's comment ("I'm, trapped, in, an...") occurs twice within the page. The commenting system may change straight quotes to curly quotes, remove or fix HTML, or convert Markdown (or other) syntax to HTML. As a result, tracker software cannot simply request the page and search for the submitted text. Some cases are beyond the scope of this essay. (Perhaps another time.) In any case, our tracker's crawler retrieves the blog entry and comments, the body of which is displayed below. The blog post needs more processing first. Notice that it is always the last comment that has "-end-of-container-".
Tokenize
With these XPath queries stored in a database of sites, comments can now be properly matched. For the last comment:
- End-delimiter
-end-of-container-
- Containing block
html[1] > body[1] > div[@id='comments']
The page is parsed into a DOM tree (seen above), and then each text node is tokenized as before. Our goal is to get the comment text directly.
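The tokenizing step could be sketched like this. The exact token rules are an assumption: the samples in this post ("you, re, funny") are consistent with splitting on apostrophes and punctuation, and lowercasing is added here so that capitalization changes do not matter. The names `strip_tags` and `tokenize` are hypothetical.

```python
import re
from html.parser import HTMLParser


class _TextOnly(HTMLParser):
    """Collects character data and ignores the tags around it."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(markup: str) -> str:
    """Return only the text content of an HTML fragment."""
    parser = _TextOnly()
    parser.feed(markup)
    parser.close()
    return "".join(parser.parts)


def tokenize(markup: str) -> list[str]:
    """Strip tags, lowercase, and split on anything that is not a letter or digit."""
    return re.findall(r"[^\W_]+", strip_tags(markup).lower())


submitted = 'Ha ha, you\'re <font color="red">funny</font>, Alice.'
published = "Ha ha, you\u2019re funny, Alice."
print(tokenize(submitted) == tokenize(published))
```

Because both straight and curly apostrophes act as separators, the submitted text and the text the blog software produced reduce to the same token array even after the font tag is stripped and the quotes are changed.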
The begin-delimiter's next sibling is Carol's comment. The page as retrieved:
<div id="blogpost"> Ceci n’est pas une blog post </div>
<blockquote> I’m trapped in an example factory!</blockquote>
<span class="metadata"> Bob says: </span> <blockquote> Ha ha, you’re funny, Alice </blockquote>
Tokenize
Bob's comment is stripped of any HTML tags, then tokenized.