Re: Html parsing and inline elements
From | Tom Lane |
---|---|
Subject | Re: Html parsing and inline elements |
Date | |
Msg-id | 21258.1460556589@sss.pgh.pa.us |
In response to | Html parsing and inline elements (Marcelo Zabani <mzabani@gmail.com>) |
Responses | Re: Html parsing and inline elements |
List | pgsql-hackers |
Marcelo Zabani <mzabani@gmail.com> writes:
> I was here wondering whether HTML parsing should separate tokens that are
> not separated by spaces in the original text, but are separated by an
> inline element. Let me show you an example:

> *SELECT to_tsvector('english', 'Hello<p>neighbor</p>, you are
> <strong>n</strong>i<em>ce</em>')*
> *Results:** "'ce':7 'hello':1 'n':5 'neighbor':2"*

> "Hello" and "neighbor" should really be separated, because *<p>* is a block
> element, but "nice" should be a single word there, since there is no visual
> separation when rendered (*<em>* and *<strong>* are inline elements).

I can't imagine that we want to_tsvector to know that much about HTML. It
doesn't, really, even have license to assume that its input *is* HTML. So
even if you see things that look like <foo> and </foo> in the string, it
could easily be XML or SGML or some other SGML-like markup format with
different semantics for the markup keywords.

Perhaps it'd be sane to do something like this as long as the HTML-specific
behavior was broken out into a separate function. (Or maybe it could be
done within to_tsvector as a separate parser or separate dictionary?) But
I don't think it should be part of the default behavior.

regards, tom lane
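
As a rough illustration of the "separate function" idea, here is a minimal sketch that assumes the HTML-specific step is done on the client side, in Python, before the text ever reaches to_tsvector. The names `flatten_html`, `INLINE_TAGS` and `TAG_RE` are hypothetical and not part of PostgreSQL, and the list of inline elements is only a small illustrative subset. Inline tags are removed without leaving a gap, while any other tag is replaced by a space:

```python
import re

# Assumption: a small, illustrative subset of HTML inline elements.
# A real implementation would need a complete, HTML-version-specific list.
INLINE_TAGS = {"a", "abbr", "b", "code", "em", "i", "span",
               "strong", "sub", "sup", "u"}

# Matches simple opening and closing tags such as <p>, </p>, <em class="x">.
# It is not a full HTML parser (no comments, CDATA, or '>' inside attributes).
TAG_RE = re.compile(r"</?([A-Za-z][A-Za-z0-9]*)[^>]*>")


def flatten_html(text: str) -> str:
    """Hypothetical helper: strip tags so inline markup does not split words."""

    def replace_tag(match: re.Match) -> str:
        name = match.group(1).lower()
        # Inline elements vanish; everything else becomes a word boundary.
        return "" if name in INLINE_TAGS else " "

    flattened = TAG_RE.sub(replace_tag, text)
    # Collapse the runs of whitespace left behind by removed block tags.
    return " ".join(flattened.split())


if __name__ == "__main__":
    sample = "Hello<p>neighbor</p>, you are <strong>n</strong>i<em>ce</em>"
    print(flatten_html(sample))
```

The flattened string could then be passed to to_tsvector as ordinary text, which keeps the default parser free of any HTML-specific assumptions.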