Re: Html parsing and inline elements

From	Tom Lane
Subject	Re: Html parsing and inline elements
Date	13 April 2016, 14:09:56
Msg-id	21258.1460556589@sss.pgh.pa.us
In reply to	Html parsing and inline elements (Marcelo Zabani <mzabani@gmail.com>)
Replies	Re: Html parsing and inline elements
List	pgsql-hackers

Marcelo Zabani <mzabani@gmail.com> writes:
> I was here wondering whether HTML parsing should separate tokens that are
> not separated by spaces in the original text, but are separated by an
> inline element. Let me show you an example:

> SELECT to_tsvector('english', 'Hello<p>neighbor</p>, you are
> <strong>n</strong>i<em>ce</em>')
> Results: "'ce':7 'hello':1 'n':5 'neighbor':2"

> "Hello" and "neighbor" should really be separated, because *<p>* is a block
> element, but "nice" should be a single word there, since there is no visual
> separation when rendered (*<em>* and *<strong>* are inline elements).

I can't imagine that we want to_tsvector to know that much about HTML.
It doesn't, really, even have license to assume that its input *is*
HTML.  So even if you see things that look like <foo> and </foo> in the
string, it could easily be XML or SGML or some other SGML-like markup
format with different semantics for the markup keywords.

Perhaps it'd be sane to do something like this as long as the
HTML-specific behavior was broken out into a separate function.
(Or maybe it could be done within to_tsvector as a separate parser
or separate dictionary?)  But I don't think it should be part of
the default behavior.
        regards, tom lane
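
As a rough illustration of the "separate function" idea, the HTML-specific
step could run before the text ever reaches to_tsvector. The sketch below is
only one possible reading of that suggestion, not anything PostgreSQL ships:
the function name html_to_fts_text and the INLINE_TAGS set are hypothetical,
the tag list is deliberately incomplete, and the regex is not a real HTML
parser (it ignores comments, CDATA, and entities). Inline tags are dropped
with no separator; every other tag is treated as a word break.

```python
import re

# Assumption: a small, illustrative set of inline elements, not a complete list.
INLINE_TAGS = {"a", "b", "em", "i", "span", "strong", "u"}

# Matches an opening or closing tag and captures the tag name.
TAG_RE = re.compile(r"</?([A-Za-z][A-Za-z0-9]*)[^>]*>")


def html_to_fts_text(html: str) -> str:
    """Remove inline tags without a separator; replace all other tags with a space."""

    def replace_tag(match):
        name = match.group(1).lower()
        return "" if name in INLINE_TAGS else " "

    text = TAG_RE.sub(replace_tag, html)
    # Collapse whitespace runs left behind by removed block-level tags.
    return re.sub(r"\s+", " ", text).strip()


if __name__ == "__main__":
    sample = "Hello<p>neighbor</p>, you are <strong>n</strong>i<em>ce</em>"
    # The result would then be passed to to_tsvector as plain text.
    print(html_to_fts_text(sample))
```

With the example from the quoted message, "nice" comes out as one word and
"Hello" and "neighbor" stay separate, while to_tsvector itself needs no
knowledge of HTML.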
