Automatic template extraction
		 History /
		 Edit /
		 PDF /
		 EPUB /
		 BIB /
		
		
Created: July 14, 2016 / Updated: August 30, 2025 / Status: in progress / 3 min read (~508 words)
					
			
	Created: July 14, 2016 / Updated: August 30, 2025 / Status: in progress / 3 min read (~508 words)
- To extract patterns, group them by starting character, then test how many have the same following character
 - Grammar induction
 - Compression
- Compression can be a tool for automatic template extraction, however we would most likely want to priority semantics of the extracted template over better compression
 
 - Diff/match/patch
 - Fragment extraction, then wildcard pattern generation
 - Lexer-like that will replace a whole sequence if it is already in the grammar instead of doing character by character replacement like sequitur
 
- Extract textual templates from any language (basically tries to find repetitions/patterns)
 - Min/max length (characters)
 - Discovery of syntax
 - Hierarchical/meta extraction
 
<...> is a placeholder (can be replaced/is variable)
public function getClassification()
{
    return $this->classification;
}
public function setClassification($classification)
{
    $this->classification = $classification;
    return $this;
}
public function get()
{
    return $this->;
}
public function set()
{
    $this-> = ;
    return $this;
}
      
class SomeClass extends Model {
    protected $table = 'some_table';
    protected $fillable = ['field_1', 'field_2'];
}
class  extends Model {
    protected $table = '';
    protected $fillable = [];
}
   
if ($someCondition) {
    // Do something
}
if ()
    
  
- Create a dictionary of all seen characters
 - Create a dictionary of characters -> index
 - Define some sort of relative threshold for which to ignore patterns
 - You have a single string, you want to extract patterns out of it
 - You have two strings, you want to extract patterns out of them
 
- How to extract simple constructs such as if/elseif/else/while/do/for/foreach?
 - 
How to compress aaaabbbb into an expanding aCb -> aaCbb -> aaaCbbb -> aaaabbbb vs AB -> aaaaB -> aaaabbbb
- 
aaaabbbb -> aaaCbbb -> aaDbb -> aEb -> F
- C := ab
 - D := aCb
 - E := aDb
 - 
F := aEb
 - C := aCb
-> This is a context-free grammar 
 
 - 
 - Do we want to prioritize short rules such as S -> Sa such that they can be repeated many times, or rules that contains a lot of symbols such as S -> aSa
- Probably want to minimize the number of rules/productions
 - Probably want to minimize the rule length
 
 - From [^1]
- p1: no pair of adjacent symbols appears more than once in the grammar;
 - p2: every rule is used more than once.
 
 - How can we prefer 
public function <>(<>) {<>}over} public function <>(<>) {?- If we refer to an explicit grammar, we can give more weight to the first one because it is likely a construct/production in the grammar, while the second one is the concatenation of two productions
 
 
- http://www.sequitur.info/
 - Identifying Hierarchical Structure in Sequences: A linear-time algorithm
 - https://en.wikipedia.org/wiki/Three-address_code
 - https://en.wikipedia.org/wiki/Optimizing_compiler
 - https://en.wikipedia.org/wiki/Intermediate_representation
 - https://en.wikipedia.org/wiki/Abstract_syntax_tree