diff mbox series

[v4,3/8] diagnostics: Refactor class file_cache_slot

Message ID 20230809221414.2849878-4-lhyatt@gmail.com
State New
Headers show
Series diagnostics: libcpp: Overhaul locations for _Pragma tokens | expand

Commit Message

Lewis Hyatt Aug. 9, 2023, 10:14 p.m. UTC
Class file_cache_slot in input.cc is used to query specific lines of source
code from a file when needed by diagnostics infrastructure. This will be
extended in a subsequent patch to support obtaining the source code from
in-memory generated buffers rather than from a file. The present patch
refactors class file_cache_slot, putting most of the logic into a new base
class cache_data_source, in preparation for reusing that code in the next
patch. There is no change in functionality yet.

gcc/ChangeLog:

	* input.cc (class file_cache_slot): Refactor functionality into a
	new base class...
	(class cache_data_source): ...here.
	(file_cache::forcibly_evict_file): Adapt for refactoring.
	(file_cache_slot::evict): Renamed to...
	(file_cache_slot::reset): ...this, and partially refactored into
	base class...
	(cache_data_source::reset): ...here.
	(file_cache_slot::get_full_file_content): Moved into base class...
	(cache_data_source::get_full_file_content): ...here.
	(file_cache_slot::create): Adapt for refactoring.
	(file_cache_slot::file_cache_slot): Refactor partially into...
	(cache_data_source::cache_data_source): ...here.
	(file_cache_slot::~file_cache_slot): Refactor partially into...
	(cache_data_source::~cache_data_source): ...here.
	(file_cache_slot::needs_read_p): Remove.
	(file_cache_slot::needs_grow_p): Remove.
	(file_cache_slot::maybe_grow): Adapt for refactoring.
	(file_cache_slot::read_data): Refactored, along with...
	(file_cache_slot::maybe_read_data): this, into...
	(file_cache_slot::get_more_data): ...here.
	(find_end_of_line): Change interface to take a pair of pointers,
	rather than a pointer + length.
	(file_cache_slot::get_next_line): Refactored into...
	(cache_data_source::get_next_line): ...here.
	(file_cache_slot::goto_next_line): Refactored into...
	(cache_data_source::goto_next_line): ...here.
	(file_cache_slot::read_line_num): Refactored into...
	(cache_data_source::read_line_num): ...here.
	(location_get_source_line): Fix const-correctness as necessitated by
	new interface.
---
 gcc/input.cc | 513 +++++++++++++++++++++++----------------------------
 1 file changed, 235 insertions(+), 278 deletions(-)

Comments

David Malcolm Aug. 15, 2023, 3:43 p.m. UTC | #1
On Wed, 2023-08-09 at 18:14 -0400, Lewis Hyatt wrote:
> Class file_cache_slot in input.cc is used to query specific lines of source
> code from a file when needed by diagnostics infrastructure. This will be
> extended in a subsequent patch to support obtaining the source code from
> in-memory generated buffers rather than from a file. The present patch
> refactors class file_cache_slot, putting most of the logic into a new base
> class cache_data_source, in preparation for reusing that code in the next
> patch. There is no change in functionality yet.
> 
> gcc/ChangeLog:
> 
>         * input.cc (class file_cache_slot): Refactor functionality into a
>         new base class...
>         (class cache_data_source): ...here.
>         (file_cache::forcibly_evict_file): Adapt for refactoring.
>         (file_cache_slot::evict): Renamed to...
>         (file_cache_slot::reset): ...this, and partially refactored into
>         base class...
>         (cache_data_source::reset): ...here.
>         (file_cache_slot::get_full_file_content): Moved into base class...
>         (cache_data_source::get_full_file_content): ...here.
>         (file_cache_slot::create): Adapt for refactoring.
>         (file_cache_slot::file_cache_slot): Refactor partially into...
>         (cache_data_source::cache_data_source): ...here.
>         (file_cache_slot::~file_cache_slot): Refactor partially into...
>         (cache_data_source::~cache_data_source): ...here.
>         (file_cache_slot::needs_read_p): Remove.
>         (file_cache_slot::needs_grow_p): Remove.
>         (file_cache_slot::maybe_grow): Adapt for refactoring.
>         (file_cache_slot::read_data): Refactored, along with...
>         (file_cache_slot::maybe_read_data): this, into...
>         (file_cache_slot::get_more_data): ...here.
>         (find_end_of_line): Change interface to take a pair of pointers,
>         rather than a pointer + length.
>         (file_cache_slot::get_next_line): Refactored into...
>         (cache_data_source::get_next_line): ...here.
>         (file_cache_slot::goto_next_line): Refactored into...
>         (cache_data_source::goto_next_line): ...here.
>         (file_cache_slot::read_line_num): Refactored into...
>         (cache_data_source::read_line_num): ...here.
>         (location_get_source_line): Fix const-correctness as necessitated by
>         new interface.
> ---
>  gcc/input.cc | 513 +++++++++++++++++++++++----------------------------
>  1 file changed, 235 insertions(+), 278 deletions(-)
> 

I confess I had to reread both this and patch 4/8 to make sense of
this; this is probably one of those cases where it's harder to read in
patch form than as source, but I think I now understand the new
implementation.

Did you try testing this with valgrind (e.g. "make selftest-valgrind")?

I don't think we have any selftest coverage for "\r" in the line-break
handling; that would be good to add.

This patch is OK for trunk once the rest of the kit is approved.

Thanks
Dave
Lewis Hyatt Aug. 15, 2023, 5:58 p.m. UTC | #2
On Tue, Aug 15, 2023 at 11:43:05AM -0400, David Malcolm wrote:
> On Wed, 2023-08-09 at 18:14 -0400, Lewis Hyatt wrote:
> > Class file_cache_slot in input.cc is used to query specific lines of source
> > code from a file when needed by diagnostics infrastructure. This will be
> > extended in a subsequent patch to support obtaining the source code from
> > in-memory generated buffers rather than from a file. The present patch
> > refactors class file_cache_slot, putting most of the logic into a new base
> > class cache_data_source, in preparation for reusing that code in the next
> > patch. There is no change in functionality yet.
> > 
> > gcc/ChangeLog:
> > 
> >         * input.cc (class file_cache_slot): Refactor functionality into a
> >         new base class...
> >         (class cache_data_source): ...here.
> >         (file_cache::forcibly_evict_file): Adapt for refactoring.
> >         (file_cache_slot::evict): Renamed to...
> >         (file_cache_slot::reset): ...this, and partially refactored into
> >         base class...
> >         (cache_data_source::reset): ...here.
> >         (file_cache_slot::get_full_file_content): Moved into base class...
> >         (cache_data_source::get_full_file_content): ...here.
> >         (file_cache_slot::create): Adapt for refactoring.
> >         (file_cache_slot::file_cache_slot): Refactor partially into...
> >         (cache_data_source::cache_data_source): ...here.
> >         (file_cache_slot::~file_cache_slot): Refactor partially into...
> >         (cache_data_source::~cache_data_source): ...here.
> >         (file_cache_slot::needs_read_p): Remove.
> >         (file_cache_slot::needs_grow_p): Remove.
> >         (file_cache_slot::maybe_grow): Adapt for refactoring.
> >         (file_cache_slot::read_data): Refactored, along with...
> >         (file_cache_slot::maybe_read_data): this, into...
> >         (file_cache_slot::get_more_data): ...here.
> >         (find_end_of_line): Change interface to take a pair of pointers,
> >         rather than a pointer + length.
> >         (file_cache_slot::get_next_line): Refactored into...
> >         (cache_data_source::get_next_line): ...here.
> >         (file_cache_slot::goto_next_line): Refactored into...
> >         (cache_data_source::goto_next_line): ...here.
> >         (file_cache_slot::read_line_num): Refactored into...
> >         (cache_data_source::read_line_num): ...here.
> >         (location_get_source_line): Fix const-correctness as necessitated by
> >         new interface.
> > ---
> >  gcc/input.cc | 513 +++++++++++++++++++++++----------------------------
> >  1 file changed, 235 insertions(+), 278 deletions(-)
> > 
> 
> I confess I had to reread both this and patch 4/8 to make sense of
> this; this is probably one of those cases where it's harder to read in
> patch form than as source, but I think I now understand the new
> implementation.

Yes, sorry about that. I hope at least splitting into two patches here made it
a little easier.

> 
> Did you try testing this with valgrind (e.g. "make selftest-valgrind")?
>

Oh interesting, was not aware of this. I think it shows that new leaks were
not introduced with the patch series.

BEFORE patch series:
==1572278==
-fself-test: 7634593 pass(es) in 22.799240 seconds
==1572278==
==1572278== HEAP SUMMARY:
==1572278==     in use at exit: 1,083,255 bytes in 2,394 blocks
==1572278==   total heap usage: 2,704,869 allocs, 2,702,475 frees, 1,257,334,536 bytes allocated
==1572278==
==1572278== 8,032 bytes in 1 blocks are possibly lost in loss record 639 of 657
==1572278==    at 0x4848899: malloc (in /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==1572278==    by 0x21FE1CB: xmalloc (xmalloc.c:149)
==1572278==    by 0x21B02E0: new_buff (lex.cc:4767)
==1572278==    by 0x21B02E0: _cpp_get_buff (lex.cc:4800)
==1572278==    by 0x21ACC80: cpp_create_reader(c_lang, ht*, line_maps*) (init.cc:289)
==1572278==    by 0xA64282: c_common_init_options(unsigned int, cl_decoded_option*) (c-opts.cc:237)
==1572278==    by 0x95E479: toplev::main(int, char**) (toplev.cc:2241)
==1572278==    by 0x960B2D: main (main.cc:39)
==1572278==
==1572278== LEAK SUMMARY:
==1572278==    definitely lost: 0 bytes in 0 blocks
==1572278==    indirectly lost: 0 bytes in 0 blocks
==1572278==      possibly lost: 8,032 bytes in 1 blocks
==1572278==    still reachable: 1,075,223 bytes in 2,393 blocks
==1572278==         suppressed: 0 bytes in 0 blocks
==1572278== Reachable blocks (those to which a pointer was found) are not shown.
==1572278== To see them, rerun with: --leak-check=full --show-leak-kinds=all
==1572278==
==1572278== For lists of detected and suppressed errors, rerun with: -s
==1572278== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 0 from 0)

AFTER patch series:
==1594840==
-fself-test: 7638403 pass(es) in 23.671784 seconds
==1594840==
==1594840== HEAP SUMMARY:
==1594840==     in use at exit: 1,081,759 bytes in 2,367 blocks
==1594840==   total heap usage: 2,728,561 allocs, 2,726,194 frees, 1,272,214,526 bytes allocated
==1594840==
==1594840== 8,032 bytes in 1 blocks are possibly lost in loss record 609 of 628
==1594840==    at 0x4848899: malloc (in /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==1594840==    by 0x2200CCB: xmalloc (xmalloc.c:149)
==1594840==    by 0x21B2440: new_buff (lex.cc:4767)
==1594840==    by 0x21B2440: _cpp_get_buff (lex.cc:4800)
==1594840==    by 0x21AEDA0: cpp_create_reader(c_lang, ht*, line_maps*) (init.cc:289)
==1594840==    by 0xA64592: c_common_init_options(unsigned int, cl_decoded_option*) (c-opts.cc:237)
==1594840==    by 0x95E529: toplev::main(int, char**) (toplev.cc:2241)
==1594840==    by 0x960BDD: main (main.cc:39)
==1594840==
==1594840== LEAK SUMMARY:
==1594840==    definitely lost: 0 bytes in 0 blocks
==1594840==    indirectly lost: 0 bytes in 0 blocks
==1594840==      possibly lost: 8,032 bytes in 1 blocks
==1594840==    still reachable: 1,073,727 bytes in 2,366 blocks
==1594840==         suppressed: 0 bytes in 0 blocks

> I don't think we have any selftest coverage for "\r" in the line-break
> handling; that would be good to add.
> 
> This patch is OK for trunk once the rest of the kit is approved.

Thank you. To be clear, were you suggesting to add selftest coverage for \r
endings now, or in a follow up?

-Lewis
David Malcolm Aug. 15, 2023, 7:39 p.m. UTC | #3
On Tue, 2023-08-15 at 13:58 -0400, Lewis Hyatt wrote:
> On Tue, Aug 15, 2023 at 11:43:05AM -0400, David Malcolm wrote:
> > On Wed, 2023-08-09 at 18:14 -0400, Lewis Hyatt wrote:
> > > Class file_cache_slot in input.cc is used to query specific lines
> > > of source
> > > code from a file when needed by diagnostics infrastructure. This
> > > will be
> > > extended in a subsequent patch to support obtaining the source
> > > code from
> > > in-memory generated buffers rather than from a file. The present
> > > patch
> > > refactors class file_cache_slot, putting most of the logic into a
> > > new base
> > > class cache_data_source, in preparation for reusing that code in
> > > the next
> > > patch. There is no change in functionality yet.
> > > 

[...snip...]

> > 
> > I confess I had to reread both this and patch 4/8 to make sense of
> > this; this is probably one of those cases where it's harder to read
> > in
> > patch form than as source, but I think I now understand the new
> > implementation.
> 
> Yes, sorry about that. I hope at least splitting into two patches
> here made it
> a little easier.
> 
> > 
> > Did you try testing this with valgrind (e.g. "make selftest-
> > valgrind")?
> > 
> 
> Oh interesting, was not aware of this. I think it shows that new
> leaks were
> not introduced with the patch series.
> 

[...snip...]

> 
> 
> > I don't think we have any selftest coverage for "\r" in the line-
> > break
> > handling; that would be good to add.
> > 
> > This patch is OK for trunk once the rest of the kit is approved.
> 
> Thank you. To be clear, were you suggesting to add selftest coverage
> for \r
> endings now, or in a follow up?

The former, please, so that we can sure that the patch doesn't
introduce any buffer overreads etc.

Thanks
Dave
Lewis Hyatt Aug. 23, 2023, 9:22 p.m. UTC | #4
On Tue, Aug 15, 2023 at 03:39:40PM -0400, David Malcolm wrote:
> On Tue, 2023-08-15 at 13:58 -0400, Lewis Hyatt wrote:
> > On Tue, Aug 15, 2023 at 11:43:05AM -0400, David Malcolm wrote:
> > > On Wed, 2023-08-09 at 18:14 -0400, Lewis Hyatt wrote:
> > > > Class file_cache_slot in input.cc is used to query specific lines
> > > > of source
> > > > code from a file when needed by diagnostics infrastructure. This
> > > > will be
> > > > extended in a subsequent patch to support obtaining the source
> > > > code from
> > > > in-memory generated buffers rather than from a file. The present
> > > > patch
> > > > refactors class file_cache_slot, putting most of the logic into a
> > > > new base
> > > > class cache_data_source, in preparation for reusing that code in
> > > > the next
> > > > patch. There is no change in functionality yet.
> > > > 
> 
> [...snip...]
> 
> > > 
> > > I confess I had to reread both this and patch 4/8 to make sense of
> > > this; this is probably one of those cases where it's harder to read
> > > in
> > > patch form than as source, but I think I now understand the new
> > > implementation.
> > 
> > Yes, sorry about that. I hope at least splitting into two patches
> > here made it
> > a little easier.
> > 
> > > 
> > > Did you try testing this with valgrind (e.g. "make selftest-
> > > valgrind")?
> > > 
> > 
> > Oh interesting, was not aware of this. I think it shows that new
> > leaks were
> > not introduced with the patch series.
> > 
> 
> [...snip...]
> 
> > 
> > 
> > > I don't think we have any selftest coverage for "\r" in the line-
> > > break
> > > handling; that would be good to add.
> > > 
> > > This patch is OK for trunk once the rest of the kit is approved.
> > 
> > Thank you. To be clear, were you suggesting to add selftest coverage
> > for \r
> > endings now, or in a follow up?
> 
> The former, please, so that we can sure that the patch doesn't
> introduce any buffer overreads etc.
> 
> Thanks
> Dave
>

The following (incremental to patch 5/8 or after) adds selftest coverage for
alternate line endings. I hope things aren't too unclear this way; I can
resend updated versions of some or all of the patches from scratch, if useful.

AFAIK this is the current status of things:

Patch 1/8: Reviewed, updated version incorporating feedback has not been acked
yet, at: https://gcc.gnu.org/pipermail/gcc-patches/2023-August/627250.html

Patch 2/8: OKed, pending tweak to reject fixit hints in generated data, which
was sent incrementally here:
https://gcc.gnu.org/pipermail/gcc-patches/2023-August/627405.html

Patch 3/8: OKed, pending new selftest attached to this email.

Patch 4/8: OKed, pending tweak to assert on non-NULL buffers which was sent
incrementally here:
https://gcc.gnu.org/pipermail/gcc-patches/2023-August/628283.html

Patch 5/8: OKed

Patch 6/8: OKed

Patch 7/8: Not reviewed yet

Patch 8/8: Waiting additional feedback from you, perhaps SARIF need not worry
about this for now and should just ignore generated data locations.

Thanks again for taking the time to go through this, I hope it will prove
worth it.

-Lewis

-- >8 --

gcc/ChangeLog:

	* input.cc (test_reading_source_line): Test additional cases,
	including generated data and alternate line endings.
	(input_cc_tests): Adapt to test_reading_source_line() changes.

diff --git a/gcc/input.cc b/gcc/input.cc
index 4c99df7a205..72274732c6c 100644
--- a/gcc/input.cc
+++ b/gcc/input.cc
@@ -2392,30 +2392,51 @@ test_make_location_nonpure_range_endpoints (const line_table_case &case_)
 /* Verify reading of input files (e.g. for caret-based diagnostics).  */
 
 static void
-test_reading_source_line ()
+test_reading_source_line (bool generated, const char *e1, const char *e2)
 {
   /* Create a tempfile and write some text to it.  */
+  const char *line1 = "01234567890123456789";
+  const char *line2 = "This is the test text";
+  const char *line3 = "This is the 3rd line";
+  char content[72];
+  const int content_len = snprintf (content, sizeof (content),
+				    "%s%s%s%s%s",
+				    line1, e1, line2, e2, line3);
+  ASSERT_LT (content_len, (int)sizeof (content));
   temp_source_file tmp (SELFTEST_LOCATION, ".txt",
-			"01234567890123456789\n"
-			"This is the test text\n"
-			"This is the 3rd line");
+			content, content_len, generated);
 
-  /* Read back a specific line from the tempfile.  */
-  char_span source_line = location_get_source_line (tmp.get_filename (), 3);
+  /* Read back some specific lines from the tempfile, not all in order.  */
+  const source_id src = generated
+    ? source_id (tmp.content_buf, tmp.content_len)
+    : source_id (tmp.get_filename ());
+
+  char_span source_line = location_get_source_line (src, 1);
+  ASSERT_TRUE (source_line);
+  ASSERT_TRUE (source_line.get_buffer () != NULL);
+  /* N.B. If the line terminator is \r\n, the returned char_span will include
+     the \r as part of the line.  */
+  const size_t off1 = strlen (e1) - 1;
+  ASSERT_EQ (20 + off1, source_line.length ());
+  ASSERT_TRUE (!strncmp (line1, source_line.get_buffer (),
+			 source_line.length () - off1));
+
+  source_line = location_get_source_line (src, 3);
   ASSERT_TRUE (source_line);
   ASSERT_TRUE (source_line.get_buffer () != NULL);
   ASSERT_EQ (20, source_line.length ());
-  ASSERT_TRUE (!strncmp ("This is the 3rd line",
-			 source_line.get_buffer (), source_line.length ()));
+  ASSERT_TRUE (!strncmp (line3, source_line.get_buffer (),
+			 source_line.length ()));
 
-  source_line = location_get_source_line (tmp.get_filename (), 2);
+  source_line = location_get_source_line (src, 2);
   ASSERT_TRUE (source_line);
   ASSERT_TRUE (source_line.get_buffer () != NULL);
-  ASSERT_EQ (21, source_line.length ());
-  ASSERT_TRUE (!strncmp ("This is the test text",
-			 source_line.get_buffer (), source_line.length ()));
+  const size_t off2 = strlen (e2) - 1;
+  ASSERT_EQ (21 + off2, source_line.length ());
+  ASSERT_TRUE (!strncmp (line2, source_line.get_buffer (),
+			 source_line.length () - off2));
 
-  source_line = location_get_source_line (tmp.get_filename (), 4);
+  source_line = location_get_source_line (src, 4);
   ASSERT_FALSE (source_line);
   ASSERT_TRUE (source_line.get_buffer () == NULL);
 }
@@ -4311,7 +4332,11 @@ input_cc_tests ()
   for_each_line_table_case (test_lexer_string_locations_raw_string_unterminated);
   for_each_line_table_case (test_lexer_char_constants);
 
-  test_reading_source_line ();
+  const char *const line_endings[] = {"\n", "\r", "\r\n"};
+  for (bool generated : {false, true})
+    for (const char *e1 : line_endings)
+      for (const char *e2: line_endings)
+	test_reading_source_line (generated, e1, e2);
 
   test_line_offset_overflow ();
diff mbox series

Patch

diff --git a/gcc/input.cc b/gcc/input.cc
index c2559614a99..9377020b460 100644
--- a/gcc/input.cc
+++ b/gcc/input.cc
@@ -55,34 +55,88 @@  file_cache::initialize_input_context (diagnostic_input_charset_callback ccb,
   in_context.should_skip_bom = should_skip_bom;
 }
 
-/* This is a cache used by get_next_line to store the content of a
-   file to be searched for file lines.  */
-class file_cache_slot
+/* This is an abstract interface for a class that provides data which we want to
+   look up by line number.  Concrete implementations will follow, which handle
+   the cases of reading the data from the input source files, or of reading it
+   from in-memory generated data buffers.  The design is driven with reading
+   from files in mind, in particular it is desirable to read only as much of a
+   file from disk as necessary.  It works like a simplified std::istream, i.e.
+   virtual function calls are only needed when we need to retrieve more data
+   from the underlying source.  */
+
+class cache_data_source
 {
-public:
-  file_cache_slot ();
-  ~file_cache_slot ();
 
-  bool read_line_num (size_t line_num,
-		      char ** line, ssize_t *line_len);
-
-  /* Accessors.  */
-  const char *get_file_path () const { return m_file_path; }
+public:
+  bool read_line_num (size_t line_num, const char **line, ssize_t *line_len);
   unsigned get_use_count () const { return m_use_count; }
+  void inc_use_count () { m_use_count++; }
+  bool get_next_line (const char **line, ssize_t *line_len);
+  bool goto_next_line ();
   bool missing_trailing_newline_p () const
   {
     return m_missing_trailing_newline;
   }
   char_span get_full_file_content ();
+  bool unused () const { return !m_data_begin; }
+  virtual void reset ();
+
+protected:
+  cache_data_source ();
+  virtual ~cache_data_source ();
+
+  /* These pointers delimit the data that we are processing.  They are
+     maintained by the derived classes, we only ask for more by calling
+     get_more_data().  That function should return TRUE if more data was
+     obtained.  Calling get_more_data () may invalidate these pointers
+     (i.e. reallocating them to a larger buffer).  */
+  const char *m_data_begin;
+  const char *m_data_end;
+  virtual bool get_more_data () = 0;
+
+  /* This is to be called by the derived classes when this object is
+     being activated.  */
+  void on_create (unsigned int use_count, size_t total_lines)
+  {
+    m_use_count = use_count;
+    m_total_lines = total_lines;
+  }
 
-  void inc_use_count () { m_use_count++; }
+private:
+  /* Non-copyable.  */
+  cache_data_source (const cache_data_source &) = delete;
+  cache_data_source& operator= (const cache_data_source &) = delete;
 
-  bool create (const file_cache::input_context &in_context,
-	       const char *file_path, FILE *fp, unsigned highest_use_count);
-  void evict ();
+  /* The number of times this data has been accessed.  This is used to designate
+     which entry to evict from the cache array when needed.  */
+  unsigned m_use_count;
 
- private:
-  /* These are information used to store a line boundary.  */
+  /* Could this file be missing a trailing newline on its final line?
+     Initially true (to cope with empty files), set to true/false
+     as each line is read.  */
+  bool m_missing_trailing_newline;
+
+  /* This is the total number of lines in the current data.  At the
+     moment, we try to get this information from the line map
+     subsystem.  Note that this is just a hint.  When using the C++
+     front-end, this hint is correct because the input file is then
+     completely tokenized before parsing starts; so the line map knows
+     the number of lines before compilation really starts.  For e.g,
+     the C front-end, it can happen that we start emitting diagnostics
+     before the line map has seen the end of the file.  */
+  size_t m_total_lines;
+
+  /* The number of the previous lines read.  This starts at 1.  Zero
+     means we've read no line so far.  */
+  size_t m_line_num;
+
+  /* The index of the beginning of the current line.  */
+  size_t m_line_start_idx;
+
+  /* These are information used to store a line boundary.  Here and below, we
+     store always byte offsets, not pointers, since the underlying buffer may be
+     reallocated by the derived implementation unbeknownst to us after calling
+     get_more_data().  */
   class line_info
   {
   public:
@@ -90,13 +144,12 @@  public:
     size_t line_num;
 
     /* The position (byte count) of the beginning of the line,
-       relative to the file data pointer.  This starts at zero.  */
+       relative to M_DATA_BEGIN.  This starts at zero.  */
     size_t start_pos;
 
-    /* The position (byte count) of the last byte of the line.  This
-       normally points to the '\n' character, or to one byte after the
-       last byte of the file, if the file doesn't contain a '\n'
-       character.  */
+    /* The position (byte count) of the last byte of the line.  This normally
+       points to the '\n' character, or to M_DATA_END, if the data doesn't end
+       with a '\n' character.  */
     size_t end_pos;
 
     line_info (size_t l, size_t s, size_t e)
@@ -104,91 +157,54 @@  public:
     {}
 
     line_info ()
-      :line_num (0), start_pos (0), end_pos (0)
+      : line_num (0), start_pos (0), end_pos (0)
     {}
   };
 
-  bool needs_read_p () const;
-  bool needs_grow_p () const;
-  void maybe_grow ();
-  bool read_data ();
-  bool maybe_read_data ();
-  bool get_next_line (char **line, ssize_t *line_len);
-  bool read_next_line (char ** line, ssize_t *line_len);
-  bool goto_next_line ();
-
-  static const size_t buffer_size = 4 * 1024;
-  static const size_t line_record_size = 100;
-
-  /* The number of time this file has been accessed.  This is used
-     to designate which file cache to evict from the cache
-     array.  */
-  unsigned m_use_count;
-
-  /* The file_path is the key for identifying a particular file in
-     the cache.
-     For libcpp-using code, the underlying buffer for this field is
-     owned by the corresponding _cpp_file within the cpp_reader.  */
-  const char *m_file_path;
-
-  FILE *m_fp;
-
-  /* This points to the content of the file that we've read so
-     far.  */
-  char *m_data;
-
-  /* The allocated buffer to be freed may start a little earlier than DATA,
-     e.g. if a UTF8 BOM was skipped at the beginning.  */
-  int m_alloc_offset;
-
-  /*  The size of the DATA array above.*/
-  size_t m_size;
-
-  /* The number of bytes read from the underlying file so far.  This
-     must be less (or equal) than SIZE above.  */
-  size_t m_nb_read;
-
-  /* The index of the beginning of the current line.  */
-  size_t m_line_start_idx;
-
-  /* The number of the previous line read.  This starts at 1.  Zero
-     means we've read no line so far.  */
-  size_t m_line_num;
-
-  /* This is the total number of lines of the current file.  At the
-     moment, we try to get this information from the line map
-     subsystem.  Note that this is just a hint.  When using the C++
-     front-end, this hint is correct because the input file is then
-     completely tokenized before parsing starts; so the line map knows
-     the number of lines before compilation really starts.  For e.g,
-     the C front-end, it can happen that we start emitting diagnostics
-     before the line map has seen the end of the file.  */
-  size_t m_total_lines;
-
-  /* Could this file be missing a trailing newline on its final line?
-     Initially true (to cope with empty files), set to true/false
-     as each line is read.  */
-  bool m_missing_trailing_newline;
-
   /* This is a record of the beginning and end of the lines we've seen
      while reading the file.  This is useful to avoid walking the data
      from the beginning when we are asked to read a line that is
-     before LINE_START_IDX above.  Note that the maximum size of this
+     before M_LINE_START_IDX.  Note that the maximum size of this
      record is line_record_size, so that the memory consumption
      doesn't explode.  We thus scale total_lines down to
      line_record_size.  */
   vec<line_info, va_heap> m_line_record;
+  static const size_t line_record_size = 100;
+};
 
-  void offset_buffer (int offset)
-  {
-    gcc_assert (offset < 0 ? m_alloc_offset + offset >= 0
-		: (size_t) offset <= m_size);
-    gcc_assert (m_data);
-    m_alloc_offset += offset;
-    m_data += offset;
-    m_size -= offset;
-  }
+/* This is the implementation of cache_data_source for ordinary
+   source files.  */
+class file_cache_slot final : public cache_data_source
+{
+
+public:
+  file_cache_slot ();
+  ~file_cache_slot ();
+
+  const char *get_file_path () const { return m_file_path; }
+  bool create (const file_cache::input_context &in_context,
+	       const char *file_path, FILE *fp, unsigned highest_use_count);
+  void reset () override;
+
+protected:
+  bool get_more_data () override;
 
+private:
+  /* The file_path is the key for identifying a particular file in the cache.
+     For libcpp-using code, the underlying buffer for this field is owned by the
+     corresponding _cpp_file within the cpp_reader.  */
+  const char *m_file_path;
+
+  FILE *m_fp;
+
+  /* The base class M_DATA_BEGIN and M_DATA_END delimit the bytes that are ready
+     to process.  These two pointers here track a growable memory buffer, owned
+     by this object, where we store data as we read it from the file; we arrange
+     for the base class pointers to point to the right place within this
+     buffer.  */
+  char *m_buf_begin;
+  char *m_buf_end;
+  void maybe_grow ();
 };
 
 /* Current position in real source file.  */
@@ -391,26 +407,10 @@  file_cache::forcibly_evict_file (const char *file_path)
     /* Not found.  */
     return;
 
-  r->evict ();
+  r->reset ();
 }
 
-void
-file_cache_slot::evict ()
-{
-  m_file_path = NULL;
-  if (m_fp)
-    fclose (m_fp);
-  m_fp = NULL;
-  m_nb_read = 0;
-  m_line_start_idx = 0;
-  m_line_num = 0;
-  m_line_record.truncate (0);
-  m_use_count = 0;
-  m_total_lines = 0;
-  m_missing_trailing_newline = true;
-}
-
-/* Return the file cache that has been less used, recently, or the
+/* Return the cache that has been less used, recently, or the
    first empty one.  If HIGHEST_USE_COUNT is non-null,
    *HIGHEST_USE_COUNT is set to the highest use count of the entries
    in the cache table.  */
@@ -473,14 +473,14 @@  file_cache::add_file (const char *file_path)
    as decoded according to the input charset, encoded as UTF-8.  */
 
 char_span
-file_cache_slot::get_full_file_content ()
+cache_data_source::get_full_file_content ()
 {
-  char *line;
+  const char *line;
   ssize_t line_len;
   while (get_next_line (&line, &line_len))
     {
     }
-  return char_span (m_data, m_nb_read);
+  return char_span (m_data_begin, m_data_end - m_data_begin);
 }
 
 /* Populate this slot for use on FILE_PATH and FP, dropping any
@@ -491,22 +491,12 @@  file_cache_slot::create (const file_cache::input_context &in_context,
 			 const char *file_path, FILE *fp,
 			 unsigned highest_use_count)
 {
+  reset ();
+  on_create (highest_use_count + 1, total_lines_num (source_id {file_path}));
+  m_data_begin = m_buf_begin;
+  m_data_end = m_buf_begin;
   m_file_path = file_path;
-  if (m_fp)
-    fclose (m_fp);
   m_fp = fp;
-  if (m_alloc_offset)
-    offset_buffer (-m_alloc_offset);
-  m_nb_read = 0;
-  m_line_start_idx = 0;
-  m_line_num = 0;
-  m_line_record.truncate (0);
-  /* Ensure that this cache entry doesn't get evicted next time
-     add_file_to_cache_tab is called.  */
-  m_use_count = ++highest_use_count;
-  m_total_lines = total_lines_num (file_path);
-  m_missing_trailing_newline = true;
-
 
   /* Check the input configuration to determine if we need to do any
      transformations, such as charset conversion or BOM skipping.  */
@@ -519,20 +509,17 @@  file_cache_slot::create (const file_cache::input_context &in_context,
 	= cpp_get_converted_source (file_path, input_charset);
       if (!cs.data)
 	return false;
-      if (m_data)
-	XDELETEVEC (m_data);
-      m_data = cs.data;
-      m_nb_read = m_size = cs.len;
-      m_alloc_offset = cs.data - cs.to_free;
+      XDELETEVEC (m_buf_begin);
+      m_buf_begin = cs.to_free;
+      m_buf_end = cs.data + cs.len;
+      m_data_begin = cs.data;
+      m_data_end = m_buf_end;
     }
-  else if (in_context.should_skip_bom)
+  else if (in_context.should_skip_bom && get_more_data ())
     {
-      if (read_data ())
-	{
-	  const int offset = cpp_check_utf8_bom (m_data, m_nb_read);
-	  offset_buffer (offset);
-	  m_nb_read -= offset;
-	}
+      const int offset = cpp_check_utf8_bom (m_data_begin,
+					     m_data_end - m_data_begin);
+      m_data_begin += offset;
     }
 
   return true;
@@ -567,55 +554,60 @@  file_cache::lookup_or_add_file (const char *file_path)
   return r;
 }
 
-/* Default constructor for a cache of file used by caret
-   diagnostic.  */
-
-file_cache_slot::file_cache_slot ()
-: m_use_count (0), m_file_path (NULL), m_fp (NULL), m_data (0),
-  m_alloc_offset (0), m_size (0), m_nb_read (0), m_line_start_idx (0),
-  m_line_num (0), m_total_lines (0), m_missing_trailing_newline (true)
+cache_data_source::cache_data_source ()
+: m_data_begin (nullptr), m_data_end (nullptr),
+  m_use_count (0),
+  m_missing_trailing_newline (true),
+  m_total_lines (0),
+  m_line_num (0),
+  m_line_start_idx (0)
 {
   m_line_record.create (0);
 }
 
-/* Destructor for a cache of file used by caret diagnostic.  */
-
-file_cache_slot::~file_cache_slot ()
+cache_data_source::~cache_data_source ()
 {
-  if (m_fp)
-    {
-      fclose (m_fp);
-      m_fp = NULL;
-    }
-  if (m_data)
-    {
-      offset_buffer (-m_alloc_offset);
-      XDELETEVEC (m_data);
-      m_data = 0;
-    }
   m_line_record.release ();
 }
 
-/* Returns TRUE iff the cache would need to be filled with data coming
-   from the file.  That is, either the cache is empty or full or the
-   current line is empty.  Note that if the cache is full, it would
-   need to be extended and filled again.  */
-
-bool
-file_cache_slot::needs_read_p () const
+void
+cache_data_source::reset ()
 {
-  return m_fp && (m_nb_read == 0
-	  || m_nb_read == m_size
-	  || (m_line_start_idx >= m_nb_read - 1));
+  m_data_begin = nullptr;
+  m_data_end = nullptr;
+  m_use_count = 0;
+  m_missing_trailing_newline = true;
+  m_total_lines = 0;
+  m_line_num = 0;
+  m_line_start_idx = 0;
+  m_line_record.truncate (0);
 }
 
-/*  Return TRUE iff the cache is full and thus needs to be
-    extended.  */
+file_cache_slot::file_cache_slot ()
+: m_file_path (nullptr), m_fp (nullptr),
+  m_buf_begin (nullptr), m_buf_end (nullptr)
+{}
+
+file_cache_slot::~file_cache_slot ()
+{
+  if (m_fp)
+    fclose (m_fp);
+  XDELETEVEC (m_buf_begin);
+}
 
-bool
-file_cache_slot::needs_grow_p () const
+void
+file_cache_slot::reset ()
 {
-  return m_nb_read == m_size;
+  cache_data_source::reset ();
+  m_file_path = NULL;
+  if (m_fp)
+    {
+      fclose (m_fp);
+      m_fp = NULL;
+    }
+
+  /* Do not free the buffer here, we intend to reuse it the next time this
+     slot is activated.  */
 }
 
 /* Grow the cache if it needs to be extended.  */
@@ -623,22 +615,23 @@  file_cache_slot::needs_grow_p () const
 void
 file_cache_slot::maybe_grow ()
 {
-  if (!needs_grow_p ())
-    return;
-
-  if (!m_data)
+  if (!m_buf_begin)
     {
-      gcc_assert (m_size == 0 && m_alloc_offset == 0);
-      m_size = buffer_size;
-      m_data = XNEWVEC (char, m_size);
+      const size_t buffer_size = 4 * 1024;
+      m_buf_begin = XNEWVEC (char, buffer_size);
+      m_buf_end = m_buf_begin + buffer_size;
+      m_data_begin = m_buf_begin;
+      m_data_end = m_data_begin;
     }
-  else
+  else if (m_data_end == m_buf_end)
     {
-      const int offset = m_alloc_offset;
-      offset_buffer (-offset);
-      m_size *= 2;
-      m_data = XRESIZEVEC (char, m_data, m_size);
-      offset_buffer (offset);
+      const auto new_size = 2 * (m_buf_end - m_buf_begin);
+      const auto data_offset = m_data_begin - m_buf_begin;
+      const auto data_size = m_data_end - m_data_begin;
+      m_buf_begin = XRESIZEVEC (char, m_buf_begin, new_size);
+      m_buf_end = m_buf_begin + new_size;
+      m_data_begin = m_buf_begin + data_offset;
+      m_data_end = m_data_begin + data_size;
     }
 }
 
@@ -646,45 +639,28 @@  file_cache_slot::maybe_grow ()
     Returns TRUE iff new data could be read.  */
 
 bool
-file_cache_slot::read_data ()
+file_cache_slot::get_more_data ()
 {
-  if (feof (m_fp) || ferror (m_fp))
+  if (!m_fp || feof (m_fp) || ferror (m_fp))
     return false;
-
   maybe_grow ();
-
-  char * from = m_data + m_nb_read;
-  size_t to_read = m_size - m_nb_read;
-  size_t nb_read = fread (from, 1, to_read, m_fp);
-
-  if (ferror (m_fp))
-    return false;
-
-  m_nb_read += nb_read;
-  return !!nb_read;
-}
-
-/* Read new data iff the cache needs to be filled with more data
-   coming from the file FP.  Return TRUE iff the cache was filled with
-   mode data.  */
-
-bool
-file_cache_slot::maybe_read_data ()
-{
-  if (!needs_read_p ())
+  char *const dest = m_buf_begin + (m_data_end - m_buf_begin);
+  const auto nb_read = fread (dest, 1, m_buf_end - dest, m_fp);
+  if (ferror (m_fp) || !nb_read)
     return false;
-  return read_data ();
+  m_data_end += nb_read;
+  return true;
 }
 
-/* Helper function for file_cache_slot::get_next_line (), to find the end of
+/* Helper function for cache_data_source::get_next_line (), to find the end of
    the next line.  Returns with the memchr convention, i.e. nullptr if a line
    terminator was not found.  We need to determine line endings in the same
    manner that libcpp does: any of \n, \r\n, or \r is a line ending.  */
 
-static char *
-find_end_of_line (char *s, size_t len)
+static const char *
+find_end_of_line (const char *s, const char *end)
 {
-  for (const auto end = s + len; s != end; ++s)
+  for (; s != end; ++s)
     {
       if (*s == '\n')
 	return s;
@@ -707,41 +683,38 @@  find_end_of_line (char *s, size_t len)
   return nullptr;
 }
 
-/* Read a new line from file FP, using C as a cache for the data
-   coming from the file.  Upon successful completion, *LINE is set to
-   the beginning of the line found.  *LINE points directly in the
-   line cache and is only valid until the next call of get_next_line.
-   *LINE_LEN is set to the length of the line.  Note that the line
-   does not contain any terminal delimiter.  This function returns
-   true if some data was read or process from the cache, false
-   otherwise.  Note that subsequent calls to get_next_line might
-   make the content of *LINE invalid.  */
+/* Read a new line from the data source.  Upon successful completion, *LINE is
+   set to the beginning of the line found.  *LINE points directly in the line
+   cache and is only valid until the next call of get_next_line.  *LINE_LEN is
+   set to the length of the line.  Note that the line does not contain any
+   terminal delimiter.  This function returns true if some data was read or
+   processed from the cache, false otherwise.  Note that subsequent calls to
+   get_next_line might make the content of *LINE invalid.  */
 
 bool
-file_cache_slot::get_next_line (char **line, ssize_t *line_len)
+cache_data_source::get_next_line (const char **line, ssize_t *line_len)
 {
-  /* Fill the cache with data to process.  */
-  maybe_read_data ();
+  const char *line_start = m_data_begin + m_line_start_idx;
 
-  size_t remaining_size = m_nb_read - m_line_start_idx;
-  if (remaining_size == 0)
-    /* There is no more data to process.  */
-    return false;
-
-  char *line_start = m_data + m_line_start_idx;
+  /* Check if we are all done reading the file.  */
+  if (line_start == m_data_end)
+    {
+      if (!get_more_data ())
+	return false;
+      line_start = m_data_begin + m_line_start_idx;
+    }
 
-  char *next_line_start = NULL;
-  size_t len = 0;
-  char *line_end = find_end_of_line (line_start, remaining_size);
+  /* Find the end of the current line.  */
+  const char *next_line_start = NULL;
+  const char *line_end = find_end_of_line (line_start, m_data_end);
   if (line_end == NULL)
     {
       /* We haven't found an end-of-line delimiter in the cache.
 	 Fill the cache with more data from the file and look again.  */
-      while (maybe_read_data ())
+      while (get_more_data ())
 	{
-	  line_start = m_data + m_line_start_idx;
-	  remaining_size = m_nb_read - m_line_start_idx;
-	  line_end = find_end_of_line (line_start, remaining_size);
+	  line_start = m_data_begin + m_line_start_idx;
+	  line_end = find_end_of_line (line_start, m_data_end);
 	  if (line_end != NULL)
 	    {
 	      next_line_start = line_end + 1;
@@ -758,8 +731,8 @@  file_cache_slot::get_next_line (char **line, ssize_t *line_len)
 
 	     If the file ends in a \r, we didn't identify it as a line
 	     terminator above, so do that now instead.  */
-	  line_end = m_data + m_nb_read;
-	  if (m_nb_read && line_end[-1] == '\r')
+	  line_end = m_data_end;
+	  if (line_end != m_data_begin && line_end[-1] == '\r')
 	    {
 	      --line_end;
 	      m_missing_trailing_newline = false;
@@ -776,18 +749,11 @@  file_cache_slot::get_next_line (char **line, ssize_t *line_len)
       m_missing_trailing_newline = false;
     }
 
-  if (m_fp && ferror (m_fp))
-    return false;
-
   /* At this point, we've found the end of the of line.  It either points to
      the line terminator or to one byte after the last byte of the file.  */
-  gcc_assert (line_end != NULL);
-
-  len = line_end - line_start;
-
-  if (m_line_start_idx < m_nb_read)
-    *line = line_start;
-
+  const auto len = line_end - line_start;
+  *line = line_start;
+  *line_len = len;
   ++m_line_num;
 
   /* Before we update our line record, make sure the hint about the
@@ -809,7 +775,7 @@  file_cache_slot::get_next_line (char **line, ssize_t *line_len)
 	m_line_record.safe_push
 	  (file_cache_slot::line_info (m_line_num,
 				       m_line_start_idx,
-				       line_end - m_data));
+				       line_end - m_data_begin));
       else if (m_total_lines > line_record_size)
 	{
 	  /* ... otherwise, we just scale total_lines down to
@@ -820,23 +786,14 @@  file_cache_slot::get_next_line (char **line, ssize_t *line_len)
 	    m_line_record.safe_push
 	      (file_cache_slot::line_info (m_line_num,
 					   m_line_start_idx,
-					   line_end - m_data));
+					   line_end - m_data_begin));
 	}
     }
 
   /* Update m_line_start_idx so that it points to the next line to be
      read.  */
-  if (next_line_start)
-    m_line_start_idx = next_line_start - m_data;
-  else
-    /* We didn't find any terminal '\n'.  Let's consider that the end
-       of line is the end of the data in the cache.  The next
-       invocation of get_next_line will either read more data from the
-       underlying file or return false early because we've reached the
-       end of the file.  */
-    m_line_start_idx = m_nb_read;
-
-  *line_len = len;
+  m_line_start_idx
+    = (next_line_start ? next_line_start : m_data_end) - m_data_begin;
 
   return true;
 }
@@ -848,15 +805,15 @@  file_cache_slot::get_next_line (char **line, ssize_t *line_len)
    completion.  */
 
 bool
-file_cache_slot::goto_next_line ()
+cache_data_source::goto_next_line ()
 {
-  char *l;
+  const char *l;
   ssize_t len;
 
   return get_next_line (&l, &len);
 }
 
-/* Read an arbitrary line number LINE_NUM from the file cached in C.
+/* Read an arbitrary line number LINE_NUM from the data cache.
    If the line was read successfully, *LINE points to the beginning
    of the line in the file cache and *LINE_LEN is the length of the
    line.  *LINE is not nul-terminated, but may contain zero bytes.
@@ -864,8 +821,8 @@  file_cache_slot::goto_next_line ()
    This function returns bool if a line was read.  */
 
 bool
-file_cache_slot::read_line_num (size_t line_num,
-		       char ** line, ssize_t *line_len)
+cache_data_source::read_line_num (size_t line_num,
+				  const char ** line, ssize_t *line_len)
 {
   gcc_assert (line_num > 0);
 
@@ -873,7 +830,7 @@  file_cache_slot::read_line_num (size_t line_num,
     {
       /* We've been asked to read lines that are before m_line_num.
 	 So lets use our line record (if it's not empty) to try to
-	 avoid re-reading the file from the beginning again.  */
+	 avoid re-scanning the data from the beginning again.  */
 
       if (m_line_record.is_empty ())
 	{
@@ -882,7 +839,7 @@  file_cache_slot::read_line_num (size_t line_num,
 	}
       else
 	{
-	  file_cache_slot::line_info *i = NULL;
+	  line_info *i = NULL;
 	  if (m_total_lines <= line_record_size)
 	    {
 	      /* In languages where the input file is not totally
@@ -918,7 +875,7 @@  file_cache_slot::read_line_num (size_t line_num,
 	  if (i && i->line_num == line_num)
 	    {
 	      /* We have the start/end of the line.  */
-	      *line = m_data + i->start_pos;
+	      *line = m_data_begin + i->start_pos;
 	      *line_len = i->end_pos - i->start_pos;
 	      return true;
 	    }
@@ -957,7 +914,7 @@  file_cache_slot::read_line_num (size_t line_num,
 char_span
 location_get_source_line (const char *file_path, int line)
 {
-  char *buffer = NULL;
+  const char *buffer = NULL;
   ssize_t len;
 
   if (line == 0)