Logical replication supports replicating between tables with different
column order. But this failed for the initial table sync because of a
logic error in how the column list for the internal COPY command was
composed. Fix that and also add a test.
Also fix a minor omission in the column name mapping cache. When
creating the mapping list, it would not skip locally dropped columns.
So if a remote column had the same name as a locally dropped
column (...pg.dropped...), then the expected error would not occur.
Reduce some redundant messages to DEBUG1. Be clearer about the
distinction between apply workers and table synchronization workers.
Add subscription and table name where possible.
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Using flex's -i switch to achieve case-insensitivity is not a very safe
practice, because the scanner's behavior may then depend on the locale
that flex was invoked in. In the particular example at hand, that's
not academic: the possible matches for "FIRST" will be different in a
Turkish locale than elsewhere. Do it the hard way instead, as our
other scanners do.
Also, drop use of -b -CF -p, because this scanner is only used when
parsing the contents of a GUC variable. That's not done often, and
the amount of text to be parsed can be expected to be trivial, so
prioritizing scanner speed over code size seems like quite the wrong
tradeoff. Using flex's default optimization options reduces the
size of syncrep_gram.o by more than 50%.
The case-insensitivity problem is new in HEAD (cf commit 3901fd70c).
The poor choice of optimization flags exists also in 9.6, but it doesn't
seem important enough to back-patch.
Discussion: https://postgr.es/m/24403.1495225931@sss.pgh.pa.us
When creating a subscription with slot_name = NONE, we failed to check
that also create_slot = false and enabled = false were set. This
created an invalid subscription and could later lead to a crash if a
NULL slot name was accessed. Add more checks around that for
robustness.
Reported-by: tushar <tushar.ahuja@enterprisedb.com>
We used to only check for a supported relkind on the subscriber during
replication, which is needed to ensure that the setup is valid and we
don't crash. But it's also useful to tell the user immediately when
CREATE or ALTER SUBSCRIPTION is executed that the relation being added
to the subscription is not of a supported relkind.
Author: Petr Jelinek <petr.jelinek@2ndquadrant.com>
Reported-by: tushar <tushar.ahuja@enterprisedb.com>
Reformat various places in which pgindent will make a mess, and
fix a few small violations of coding style that I happened to notice
while perusing the diffs from a pgindent dry run.
There is one actual bug fix here: the need-to-enlarge-the-buffer code
path in icu_convert_case was obviously broken. Perhaps it's unreachable
in our usage? Or maybe this is just sadly undertested.
Before 955a684e0401 logical decoding snapshot maintenance needed to
cope with transactions it might not have seen in their entirety. For
such transactions we'd to assume they modified the catalog (could have
happened before we were watching), and thus a new snapshot had to be
built, and distributed to concurrently running transactions.
That's problematic because building a new snapshot isn't that cheap ,
especially as the the array of committed transactions needs to be
sorted. When creating a slot on a server with a lot of transactions,
this could make logical slot creation infeasibly expensive.
After 955a684e0401 there's no need to deal with transaction that
aren't guaranteed to be fully observable. That allows to avoid
building snapshots for transactions that haven't modified catalog,
even before reaching consistency.
While this isn't necessarily a bugfix, slot creation being impossible
in some production workloads, is severe enough to warrant
backpatching.
Author: Andres Freund, based on a quite different patch from Petr Jelinek
Analyzed-By: Petr Jelinek
Reviewed-By: Petr Jelinek
Discussion: https://postgr.es/m/f37e975c-908f-858e-707f-058d3b1eb214@2ndquadrant.com
Backpatch: 9.4-, where logical decoding has been introduced
The snapshot assembly during the creation of logical slots relied
waiting for transactions in xl_running_xacts to end, by checking for
their commit/abort records. Unfortunately, despite locking, it is
possible to see an xl_running_xact record listing transactions as
ready, that have already WAL-logged an commit/abort record, as the
locking just prevents the ProcArray to be adjusted, and the commit
record has to be logged first.
That lead to either delayed or hanging snapshot creation, because
snapbuild.c would wait "forever" to see commit/abort records for some
transactions. That hang resolved only if a xl_running_xacts record
without any running transactions happened to be logged, far from
certain on a busy server.
It's impractical to prevent that via more heavyweight locking, the
likelihood of deadlocks and significantly increased contention would
be too big.
Instead change the initial snapshot creation to be solely based on
tracking the oldest running transaction via
xl_running_xacts->oldestRunningXid - that actually ends up
significantly simplifying the code. That has two disadvantages:
1) Because we cannot fully "trust" the contents of xl_running_xacts,
we cannot use it to build the initial snapshot. Instead we have to
wait twice for all running transactions to finish.
2) Previously a slot, unless the race occurred, could be created when
the all transaction perceived as running based on commit/abort
records, now we have to wait for the next xl_running_xacts record.
To address that, trigger logging new xl_running_xacts record from
within snapbuild.c exactly when necessary.
Unfortunately snabuild.c's SnapBuild is stored on disk, one of the
stupider ideas of a certain Mr Freund, so we can't change it in a
minor release. As this is going to be backpatched, we have to hack
around a bit to keep on-disk compatibility. A later commit will
rejigger that on master.
Author: Andres Freund, based on a quite different patch from Petr Jelinek
Analyzed-By: Petr Jelinek
Reviewed-By: Petr Jelinek
Discussion: https://postgr.es/m/f37e975c-908f-858e-707f-058d3b1eb214@2ndquadrant.com
Backpatch: 9.4-, where logical decoding has been introduced
Lag tracking is called for each commit, but we introduce
a pacing delay to ensure we don't swamp the lag tracker.
Author: Petr Jelinek, with minor pacing delay code from me
Previously, the memory used by the logical replication apply worker for
processing messages would never be freed, so that could end up using a
lot of memory. To improve that, change the existing ApplyContext memory
context to ApplyMessageContext and reset that after every
message (similar to MessageContext used elsewhere). For consistency of
naming, rename the ApplyCacheContext to ApplyContext.
Author: Stas Kelvich <s.kelvich@postgrespro.ru>
It turned out this approach had problems, because a DROP command should
not have any options other than CASCADE and RESTRICT. Instead, always
attempt to drop the slot if there is one configured, but also add an
ALTER SUBSCRIPTION action to set the slot to NONE.
Author: Petr Jelinek <petr.jelinek@2ndquadrant.com>
Reported-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://www.postgresql.org/message-id/29431.1493730652@sss.pgh.pa.us
This new arrangement ensures that statistics are reported right after
commit of transactions. The previous arrangement didn't get this quite
right and could lead to assertion failures.
Author: Petr Jelinek <petr.jelinek@2ndquadrant.com>
Reported-by: Erik Rijkers <er@xs4all.nl>
When the checkpointer writes the shutdown checkpoint, it checks
afterwards whether any WAL has been written since it started and throws
a PANIC if so. At that point, only walsenders are still active, so one
might think this could not happen, but walsenders can also generate WAL,
for instance in BASE_BACKUP and certain variants of
CREATE_REPLICATION_SLOT. So they can trigger this panic if such a
command is run while the shutdown checkpoint is being written.
To fix this, divide the walsender shutdown into two phases. First, the
postmaster sends a SIGUSR2 signal to all walsenders. The walsenders
then put themselves into the "stopping" state. In this state, they
reject any new commands. (For simplicity, we reject all new commands,
so that in the future we do not have to track meticulously which
commands might generate WAL.) The checkpointer waits for all walsenders
to reach this state before proceeding with the shutdown checkpoint.
After the shutdown checkpoint is done, the postmaster sends
SIGINT (previously unused) to the walsenders. This triggers the
existing shutdown behavior of sending out the shutdown checkpoint record
and then terminating.
Author: Michael Paquier <michael.paquier@gmail.com>
Reported-by: Fujii Masao <masao.fujii@gmail.com>
After the logical replication launcher was told to wake up at
commit (for example, by a CREATE SUBSCRIPTION command), the flag to wake
up was not reset, so it would be woken up at every following commit as
well. So fix that by resetting the flag.
Also, we don't need to wake up anything if the transaction was rolled
back. Just reset the flag in that case.
Author: Masahiko Sawada <sawada.mshk@gmail.com>
Reported-by: Fujii Masao <masao.fujii@gmail.com>
Thinko in commit de4389712: this warning message references the wrong
"LogicalRepWorker *" variable. This would often result in a core dump,
but if it didn't, the message would show the wrong subscription OID.
In passing, adjust the message text to format a subscription OID
similarly to how that's done elsewhere in the function; and fix
grammatical issues in some nearby messages.
Per Coverity testing.
Before restarting a tablesync worker for the same relation, wait
wal_retrieve_retry_interval (currently 5s by default). This avoids
restarting failing workers in a tight loop.
We keep the last start times in a hash table last_start_times that is
separate from the table_states list, because that list is cleared out on
syscache invalidation, which happens whenever a table finishes syncing.
The hash table is kept until all tables have finished syncing.
A future project might be to unify these two and keep everything in one
data structure, but for now this is a less invasive change to accomplish
the original purpose.
For the test suite, set wal_retrieve_retry_interval to its minimum
value, to not increase the test suite run time.
Reviewed-by: Petr Jelinek <petr.jelinek@2ndquadrant.com>
Reported-by: Masahiko Sawada <sawada.mshk@gmail.com>
Earlier commits (56e19d938dd14 and 2bef06d5164) make it cheaper to
create a logical slot if not exporting the initial snapshot. If
NOEXPORT_SNAPSHOT is specified, we can skip the overhead, not just
when creating a slot via sql (which can't export snapshots). As
NOEXPORT_SNAPSHOT has only recently been introduced, this shouldn't be
backpatched.
Logical decoding stores historical snapshots on disk, so that logical
decoding can restart without having to reconstruct a snapshot from
scratch (for which the resources are not guaranteed to be present
anymore). These serialized snapshots were also used when creating a
new slot via the walsender interface, which can export a "full"
snapshot (i.e. one that can read all tables, not just catalog ones).
The problem is that the serialized snapshots are only useful for
catalogs and not for normal user tables. Thus the use of such a
serialized snapshot could result in an inconsistent snapshot being
exported, which could lead to queries returning wrong data. This
would only happen if logical slots are created while another logical
slot already exists.
Author: Petr Jelinek
Reviewed-By: Andres Freund
Discussion: https://postgr.es/m/f37e975c-908f-858e-707f-058d3b1eb214@2ndquadrant.com
Backport: 9.4, where logical decoding was introduced.
Previously the logical replication launcher stored the last timestamp
when it started the worker, in the local variable "last_start_time",
in order to check whether wal_retrive_retry_interval elapsed since
the last startup of worker. If it has elapsed, the launcher sees
pg_subscription and starts new worker if necessary. This is for
limitting the startup of worker to once a wal_retrieve_retry_interval.
The bug was that the variable "last_start_time" was defined and
always initialized with 0 at the beginning of the launcher's main loop.
So even if it's set to the last timestamp in later phase of the loop,
it's always reset to 0. Therefore the launcher could not check
correctly whether wal_retrieve_retry_interval elapsed since
the last startup.
This patch moves the variable "last_start_time" outside the main loop
so that it will not be reset.
Reviewed-by: Petr Jelinek
Discussion: http://postgr.es/m/CAHGQGwGJrPO++XM4mFENAwpy1eGXKsGdguYv43GUgLgU-x8nTQ@mail.gmail.com
The logical decoding machinery already preserved all the required
catalog tuples, which is sufficient in the course of normal logical
decoding, but did not guarantee that non-catalog tuples were preserved
during computation of the initial snapshot when creating a slot over
the replication protocol.
This could cause a corrupted initial snapshot being exported. The
time window for issues is usually not terribly large, but on a busy
server it's perfectly possible to it hit it. Ongoing decoding is not
affected by this bug.
To avoid increased overhead for the SQL API, only retain additional
tuples when a logical slot is being created over the replication
protocol. To do so this commit changes the signature of
CreateInitDecodingContext(), but it seems unlikely that it's being
used in an extension, so that's probably ok.
In a drive-by fix, fix handling of
ReplicationSlotsComputeRequiredXmin's already_locked argument, which
should only apply to ProcArrayLock, not ReplicationSlotControlLock.
Reported-By: Erik Rijkers
Analyzed-By: Petr Jelinek
Author: Petr Jelinek, heavily editorialized by Andres Freund
Reviewed-By: Andres Freund
Discussion: https://postgr.es/m/9a897b86-46e1-9915-ee4c-da02e4ff6a95@2ndquadrant.com
Backport: 9.4, where logical decoding was introduced.
Publisher relation can be incorrectly chosen, if there are more than
one relation in different schemas with the same name.
Author: Euler Taveira <euler@timbira.com.br>
The code was originally written with assumption that launcher is the
only process starting the worker. However that hasn't been true since
commit 7c4f52409 which failed to modify the worker management code
adequately.
This patch adds an in_use field to the LogicalRepWorker struct to
indicate whether the worker slot is being used and uses proper locking
everywhere this flag is set or read.
However if the parent process dies while the new worker is starting and
the new worker fails to attach to shared memory, this flag would never
get cleared. We solve this rare corner case by adding a sort of garbage
collector for in_use slots. This uses another field in the
LogicalRepWorker struct named launch_time that contains the time when
the worker was started. If any request to start a new worker does not
find free slot, we'll check for workers that were supposed to start but
took too long to actually do so, and reuse their slot.
In passing also fix possible race conditions when stopping a worker that
hasn't finished starting yet.
Author: Petr Jelinek <petr.jelinek@2ndquadrant.com>
Reported-by: Fujii Masao <masao.fujii@gmail.com>
In quorum-based synchronous replication, all the standbys listed in
synchronous_standby_names equally have chances to be chosen
as synchronous standbys. So they should have the same priority.
However, previously, quorum standbys whose names appear earlier
in the list were given higher priority values though the difference of
those priority values didn't affect the selection of synchronous standbys.
Users could see those "meaningless" priority values in pg_stat_replication
and this was confusing.
This commit gives all the quorum synchronous standbys the same
highest priority, i.e., 1, in order to remove such confusion.
Author: Fujii Masao
Reviewed-by: Masahiko Sawada, Kyotaro Horiguchi
Discussion: http://postgr.es/m/CAHGQGwEKOw=SmPLxJzkBsH6wwDBgOnVz46QjHbtsiZ-d-2RGUg@mail.gmail.com
This seems to be largely cosmetic, avoiding valgrind bleats and the
like. The uninitialized padding influences the CRC of the on-disk
entry, but because it's also used when verifying the CRC, that doesn't
cause spurious failures. Backpatch nonetheless.
It's a bit unfortunate that contrib/test_decoding/sql/replorigin.sql
doesn't exercise the checkpoint path, but checkpoints are fairly
expensive on weaker machines, and we'd have to stop/start for that to
be meaningful.
Author: Andres Freund
Discussion: https://postgr.es/m/20170422183123.w2jgiuxtts7qrqaq@alap3.anarazel.de
Backpatch: 9.5, where replication origins were introduced
As reported by buildfarm animal skink / valgrind, some of the
variables weren't always initialized. To avoid further mishaps use
memset to ensure the entire entry is initialized.
Author: Petr Jelinek
Reported-By: Andres Freund
Discussion: https://postgr.es/m/20170422183123.w2jgiuxtts7qrqaq@alap3.anarazel.de
Backpatch: none, code new in master
Bug was masked by error in running 004_timeline_switch.pl that was
fixed recently in 7d68f2281a.
Detective work by Alvaro Herrera and Tom Lane
Author: Thomas Munro
Commit 7c4f524 allowed walsender to execute normal SQL commands
to support table sync feature in logical replication. Previously
while log_statement caused such SQL commands to be logged,
log_replication_commands caused them to be logged, too.
That is, such SQL commands were logged twice unexpectedly
when those settings were both enabled.
This commit forces log_replication_commands to log only replication
commands, to prevent normal SQL commands from being logged twice.
Author: Masahiko Sawada
Reviewed-by: Kyotaro Horiguchi
Reported-by: Fujii Masao
Discussion: http://postgr.es/m/CAHGQGwFDWh_Qr-q_GEMpD+qH=vYPMdVqw=ZOSY3kX_Pna9R9SA@mail.gmail.com
It's not safe to raise an error while holding spinlock. But previously
logical replication worker for table sync called the function which
reads the system catalog and may raise an error while it's holding
spinlock. Which could lead to the trouble where spinlock will never
be released and the server gets stuck infinitely.
Author: Petr Jelinek
Reviewed-by: Kyotaro Horiguchi and Fujii Masao
Reported-by: Fujii Masao
Discussion: http://postgr.es/m/CAHGQGwFDWh_Qr-q_GEMpD+qH=vYPMdVqw=ZOSY3kX_Pna9R9SA@mail.gmail.com
* Be sure to reset the launcher's pid (LogicalRepCtx->launcher_pid) to 0
even when the launcher emits an error.
* Declare ApplyLauncherWakeup() as a static function because it's called
only in launcher.c.
* Previously IsBackendPId() was used to check whether the launcher's pid
was valid. IsBackendPid() was necessary because there was the bug where
the launcher's pid was not reset to 0. But now it's fixed, so IsBackendPid()
is not necessary and this patch removes it.
Author: Masahiko Sawada
Reviewed-by: Kyotaro Horiguchi
Reported-by: Fujii Masao
Discussion: http://postgr.es/m/CAHGQGwFDWh_Qr-q_GEMpD+qH=vYPMdVqw=ZOSY3kX_Pna9R9SA@mail.gmail.com
CopyFrom() needs a range table for formatting certain errors for
constraint violations.
This changes the mechanism of how the range table is passed to the
CopyFrom() executor state. We used to generate the range table and one
entry for the relation manually inside DoCopy(). Now we use
addRangeTableEntryForRelation() to setup the range table and relation
entry for the ParseState, which is then passed down by BeginCopyFrom().
Author: Petr Jelinek <petr.jelinek@2ndquadrant.com>
Reported-by: Euler Taveira <euler@timbira.com.br>
Coverity complained because bgw.bgw_extra wasn't being filled in by
ApplyLauncherRegister(). The most future-proof fix is to memset the
whole BackgroundWorker struct to zeroes. While at it, let's apply the
same coding rule to other places that set up BackgroundWorker structs;
four out of five had the same or related issues.
When sending a tuple attribute, the previous coding erroneously sent the
length byte before encoding conversion, which would lead to protocol
failures on the receiving side if the length did not match the following
string.
To fix that, use pq_sendcountedtext() for sending tuple attributes,
which takes care of all of that internally. To match the API of
pq_sendcountedtext(), send even text values without a trailing zero byte
and have the receiving end put it in place instead. This matches how
the standard FE/BE protocol behaves.
Reported-by: Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp>
All error messages use the American English spelling of recognize,
apply to the single one not doing so to be consistent.
Author: Daniel Gustafsson <daniel@yesql.se>