Summary of data storage issues for Java interview points

Data storage

1) Precautions for MySQL Index Usage

(1). Index does not contain NULL columns

As long as a column contains a NULL value, that value will not be included in the index; likewise, as long as any column of a composite index contains a NULL value, the composite index is ineffective for that match.

(2). Use a short index

When indexing a string column, specify a prefix length if you can. For example, if you have a char(255) column whose values are mostly unique within the first 10 or 20 characters, index only that prefix rather than the entire column. Short indexes not only increase query speed but also save disk space and I/O operations.

(3). Index column sort

A MySQL query uses only one index per table, so if an index is already used in the where clause, the columns in order by will not use an index. Therefore, avoid explicit sorting when the database's default ordering already meets the requirement, and try not to sort on multiple columns; if multi-column sorting is necessary, it is better to build a composite index on those columns.


(4).like statement operation

In general, the like operator is discouraged; if you must use it, use it correctly: like '%aaa%' cannot use an index, while like 'aaa%' can.

(5). Do not perform operations on columns

(6). Do not use NOT IN, <> or != operations; <, <=, =, >, >=, BETWEEN, and IN can all use the index.

(7). The index is built on fields that are often subjected to select operations.

This is because if these columns are rarely queried, the presence or absence of an index makes little difference to query speed; on the contrary, the extra index slows down the system's write and maintenance operations and increases space requirements.

(8). The index is based on a field whose value is unique.

(9). Columns defined as text, image, or bit data types should not be indexed, because these columns either hold very large values or have very few distinct values.

(10). The columns that appear in where and join need to be indexed.

(11).where there is an inequality in the query condition (where column != ...), mysql will not be able to use the index.

(12). If the query in the where clause uses a function (eg: where DAY(column)=...), mysql will not be able to use the index.

(13). In join operations (when data needs to be extracted from multiple tables), MySQL can use an index only when the joined primary-key and foreign-key columns have the same data type; otherwise the index cannot be used.

2) Talk about anti-pattern design

In simple terms, an anti-pattern is a commonly used but inefficient or defective design pattern or method for solving a recurring problem; it can even be a mistaken development idea. A simple example: in object-oriented design and programming there is a very important rule, the Single Responsibility Principle. Its central idea is that a module or class should be responsible for exactly one function of the system or software, and that responsibility should be entirely encapsulated by the class; when a developer needs to modify that function, this module or class is the main place to change.

The corresponding anti-pattern is the God Class. Generally speaking, such a class controls many other classes and also depends on many other classes; it is responsible not only for its own main function but also for many other functions, including auxiliary ones. Many developers who maintain old programs have encountered such classes: several thousand lines of code, many functions, and unclear responsibilities. Unit tests become complicated, and the time spent maintaining or modifying this class far exceeds that of other classes. In many cases this situation is not the developer's intention; more often, as the system ages, requirements change, project resources tighten, team members come and go, and the architecture shifts, an originally small class that conformed to the single-responsibility principle slowly becomes bloated. By the time the class has become a maintenance nightmare (especially after the formerly familiar developers have left the company), refactoring it is anything but easy.

3) Talk about sub-database and sub-table (sharding) design

Vertical table splitting is common in daily development and design; the popular phrase is "splitting a big table apart", and the split is based on the "columns" (fields) of the relational table. Typically, when a table has many fields, you can create a new "extension table" and move the rarely used or long fields into it. When there are many fields, splitting genuinely makes development and maintenance easier (I have seen a legacy system with a large table of more than 100 columns). In a sense it can also avoid "cross-page" problems (MySQL and MSSQL both store rows in underlying "data pages", and crossing pages may incur extra performance overhead). It is recommended to perform the field split in the database from the start; if you split later during development, you need to rewrite the earlier queries, which brings extra cost and risk.

Vertical sub-databases have become very popular with the rise of "microservices". The basic idea is to put the tables of different business modules into different databases, instead of putting all tables into one database as before. Just as "service-oriented" splitting at the system level resolves coupling and performance bottlenecks between business systems and helps the system scale and stay maintainable, the same logic applies at the database level: similar to service "governance" and "degradation" mechanisms, we can also manage, maintain, monitor, and scale the data of different business types separately.

As we all know, the database is often the first bottleneck of an application system, and the database itself is "stateful", so it is much harder to scale horizontally than web or application servers. Database connections are a precious resource and single-machine processing capacity is limited; in high-concurrency scenarios, vertical sub-databases can, to a certain degree, break through the bottlenecks of I/O, connection count, and single-machine hardware resources. This is an important way to optimize the database architecture of a large distributed system.

However, many people never fundamentally figure out why they should split, nor do they master the principles and techniques of splitting; they simply imitate the practices of big companies, which causes many problems after splitting (for example: cross-database joins, distributed transactions, etc.).

Horizontal table splitting is easy to understand: different rows of a table are distributed to several tables according to certain rules (with all of these tables stored in the same database), so as to reduce the amount of data in any single table and optimize query performance. The most common approach is to hash a field such as the primary key or a timestamp and split by modulo. Horizontal table splitting reduces the per-table data volume and can, to a certain extent, relieve query performance bottlenecks; but in essence these tables still live in the same database, so the I/O bottleneck remains at the database level. For that reason this practice on its own is generally not recommended.

Horizontal sharding across databases works the same way as the horizontal table splitting discussed above; the only difference is that the split tables are stored in different databases. This is the approach chosen by many large Internet companies. In a sense, the "hot/cold data separation" used by some systems (migrating rarely queried historical data to other databases and, in the business functions, only offering hot-data queries by default) is a similar practice. Under high concurrency and massive data volumes, sharding by database and table can effectively relieve the performance bottlenecks and pressure of a single machine and single database, breaking through the bottlenecks of I/O, connection count, and hardware resources. Of course, the hardware cost will also be higher, and it brings some complicated technical problems and challenges (for example: complex cross-shard queries, cross-shard transactions, etc.).
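
The hash-and-modulo routing rule mentioned above can be sketched in a few lines of Python (the shard count and table naming are made up for illustration):

```python
# Sketch: routing rows to horizontally sharded tables by taking the
# primary key modulo the shard count. Names are hypothetical.
SHARD_COUNT = 4

def shard_for(user_id: int) -> str:
    """Return the physical table a row lives in, e.g. 'user_2'."""
    return f"user_{user_id % SHARD_COUNT}"

print(shard_for(7))   # user_3
print(shard_for(8))   # user_0
```

A consistent-hashing scheme is often preferred over plain modulo when shards must be added later, since modulo remaps almost every key on expansion.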

4) Distributed dilemmas and countermeasures brought by sub-databases and sub-tables

Data Migration and Capacity Expansion

The preceding introduction to horizontal splitting strategies covers two cases: random (hashed) splitting and continuous (range) splitting. Continuous splitting may produce data hot spots: some tables may be queried frequently and come under great pressure, and these hot tables become the bottleneck of the entire database, while other tables hold historical data that is rarely queried. On the other hand, continuous splitting makes expansion relatively easy: there is no need to migrate old data, and simply adding tables extends capacity automatically. The data of randomly split tables is distributed relatively evenly, so hot spots and concurrent-access bottlenecks are unlikely, but expanding the number of tables requires migrating old data.

Designing the horizontal split well is crucial. It is necessary to evaluate the business growth rate over the short-to-medium term, plan capacity against the current data volume, and estimate how many shards will be needed. For the data-migration problem, the usual practice is to read the old data with a program first, and then write it to each sub-table according to the chosen sharding strategy.

Join-related issues

In the single-database, single-table case, join queries are very easy. However, as the system evolves toward multiple databases and tables, join queries run into cross-database and cross-table relations. At design time, join queries should be avoided as much as possible: results can be assembled in the application, or the need avoided through denormalized (anti-normal-form) design.

Pagination and sorting issues

In general, paging requires sorting by specified fields. Paging and sorting are very easy with a single database and table, but as the system evolves toward multiple databases and tables, cross-database and cross-table sorting become problems. For the final result to be accurate, the data must be sorted and returned within each sub-table, and then the result sets returned by the different sub-tables must be merged and sorted again before being returned to the user.
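
The merge-and-re-sort step can be sketched in Python; each shard returns its own rows already sorted, and the application performs a k-way merge (the shard result sets below are made-up data):

```python
import heapq

# Sketch: cross-shard ORDER BY ... LIMIT. Each shard returns a sorted
# result set; the application merges them and re-applies the limit.
shard_a = [1, 4, 9]            # sorted rows from shard A
shard_b = [2, 3, 10]           # sorted rows from shard B

def merged_page(limit, *shard_results):
    merged = heapq.merge(*shard_results)     # lazy k-way merge
    return [row for _, row in zip(range(limit), merged)]

print(merged_page(4, shard_a, shard_b))  # [1, 2, 3, 4]
```

Note the hidden cost: to serve page N correctly, every shard must return its first N×page_size rows, which is one reason deep pagination across shards is expensive.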

Distributed transaction problems

With the evolution toward multiple databases and tables, we also encounter distributed transaction problems, and guaranteeing data consistency becomes unavoidable. At present there is no simple solution for distributed transactions, and strong consistency is difficult to achieve. Under normal circumstances, the stored data is kept as consistent as possible for the user, and the system is designed to recover and correct itself within a short time so that the data eventually becomes consistent.

Distributed globally unique ID

In the single-database, single-table case, it is relatively simple to generate primary key IDs using the database's auto-increment feature. In a sharded environment, the data is distributed across different sub-tables and can no longer rely on database auto-increment; a globally unique ID is needed, such as a UUID or GUID. How to choose a suitable globally unique ID scheme is introduced in the following sections.
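
A minimal Python sketch of application-side UUID generation, the simplest of the schemes mentioned:

```python
import uuid

# Sketch: generating globally unique IDs in the application layer with
# UUIDs, so no shard needs a shared auto-increment counter.
a = uuid.uuid4()
b = uuid.uuid4()
assert a != b            # collisions are astronomically unlikely
print(len(str(a)))       # 36-character hex string with dashes
```

The trade-off: a 36-character random string makes a poor InnoDB primary key, because random insertion order fragments the clustered index; this is one reason ordered numeric schemes like snowflake are often preferred.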

Excerpt from: http://blog.csdn.net/jiangpingjiangping/article/details/78069480

5) Talk about SQL optimization

(a), some common SQL practices

(1) Negative criteria query cannot use index

Select * from order where status != 0 and status != 1

Using not in / not exists is also not a good habit.

Can be optimized for in queries:

Select * from order where status in (2,3)

(2) The leading fuzzy query cannot use the index

Select * from order where desc like '%XX'

To avoid a leading fuzzy query, you can write:

Select * from order where desc like 'XX%'

(3) Fields with low discrimination (low cardinality) should not use indexes

Select * from user where sex=1

Reason: there are only two sexes, so each query filters out very little data; an index is not appropriate.

Empirically, an index is worthwhile when it can filter out at least 80% of the data. For an order status column: if there are only a few status values, an index is not appropriate; if there are many status values that can filter out a large amount of data, an index should be built.
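
The "80% rule" is really a question of selectivity (distinct values divided by total rows); a small Python sketch with made-up data:

```python
# Sketch: estimating column selectivity = distinct values / total rows.
# A value near zero (like sex) suggests an index won't help; values
# near 1 (like a unique id) make good index candidates.
def selectivity(values):
    return len(set(values)) / len(values)

sexes = [0, 1, 0, 1, 1, 0, 1, 0]                 # two distinct values
ids = [101, 102, 103, 104, 105, 106, 107, 108]   # all distinct

print(selectivity(sexes))  # 0.25
print(selectivity(ids))    # 1.0
```

In MySQL the analogous measurement is the index cardinality reported by SHOW INDEX, compared against the table's row count.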

(4) Calculations on a column cannot hit the index

Select * from order where YEAR(date) = '2017'

Even if an index exists on date, this causes a full table scan. It can be optimized by computing the value instead:

Select * from order where date = CURDATE()

or:

Select * from order where date = '2017-01-01'

(B), lesser-known SQL practices

(5) If the business mostly does single-row lookups by key, a Hash index performs better, e.g. in a user center

Select * from user where uid=?
Select * from user where login_name=?

Reason: The time complexity of the B-Tree index is O(log(n)); the time complexity of the Hash index is O(1)

(6) Columns that allow NULL are a potential query pitfall

A single-column index does not store NULL values; a composite index does not store rows whose indexed columns are all NULL. If a column is allowed to be NULL, you may get a result set that does not meet expectations.

Select * from user where name != 'shenjian'

If name is allowed to be NULL, the index does not store the NULL values, and the result set will not contain those rows.

So use NOT NULL constraints together with default values.

(7) The leftmost-prefix rule of a composite index does not require the where clause to list columns in the same order as the index.

The user center has created a composite index (login_name, passwd):

Select * from user where login_name=? and passwd=?

Select * from user where passwd=? and login_name=?

Both can hit the index.

Select * from user where login_name=?

Can also hit the index, satisfying the leftmost prefix of the composite index

Select * from user where passwd=?

Cannot hit the index, because it does not satisfy the leftmost prefix of the composite index.

(8) Use ENUM instead of string

An ENUM is stored internally as a TINYINT, which saves space. Don't store values such as "China," "Beijing," or "Technology" as strings: strings are bulky and inefficient.

(C), niche but useful SQL practice

(9) Limit 1 can improve efficiency if it is known that only one result is returned

Select * from user where login_name=?

Can be optimized to:

Select * from user where login_name=? limit 1

Reason: you know there is only one result, but the database does not; telling it explicitly lets it stop moving the cursor early.

(10) Put calculations in the business layer rather than the database layer. Besides saving the database's CPU, this has the unexpected benefit of making the query cache effective.

Select * from order where date = CURDATE()

This is not a good SQL practice and should be optimized to:

$curDate = date('Y-m-d');

$res = mysql_query(

"select * from order where date = '$curDate'");

Reasons:

It frees up the database's CPU.

When called repeatedly, the incoming SQL string is identical, so the query cache can be used.

(11) Forced type conversion causes a full table scan

Select * from user where phone=13800001234

Do you think it will hit the phone index? No: if phone is a varchar column, comparing it with a number forces an implicit type conversion and a full table scan. Write it as where phone='13800001234' instead.

Finally, one more point: do not use select * (subtext: any SQL using it fails review), and return only the columns you need. This greatly reduces the amount of data transferred and the database's memory usage.

6) Deadlock problems encountered by MySQL

The four necessary conditions for producing a deadlock:

(1) Mutual exclusion condition: a resource can only be used by one process at a time.

(2) Hold-and-wait condition: when a process blocks waiting for a resource, it keeps holding the resources it has already acquired.

(3) No-preemption condition: resources a process has acquired cannot be forcibly taken away before it finishes using them.

(4) Circular wait condition: a circular chain of processes forms in which each process waits for a resource held by the next.

These four conditions are all necessary for deadlock: whenever the system deadlocks, all of them hold, and as long as any one of them is not satisfied, deadlock cannot occur.

The following methods can help minimize deadlocks:

(1) Access objects in the same order.

(2) Avoid user interactions in the transaction.

(3) Keep transactions short and in a single batch.

(4) Use a low isolation level.

(5) Use bound connections.
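
Rule (1), accessing objects in the same order, can be sketched in Python with two locks standing in for two database rows (names are illustrative, not a MySQL API):

```python
import threading

# Sketch: avoid deadlock by always acquiring locks in one global order,
# no matter which order the caller names them.
lock_a, lock_b = threading.Lock(), threading.Lock()

def transfer(first, second):
    # Sort by identity so every thread locks in the same order.
    lo, hi = sorted((first, second), key=id)
    with lo, hi:
        return "done"

# Two "transactions" touching the same pair in opposite orders cannot
# deadlock, because both actually lock in the same sorted order.
t1 = threading.Thread(target=transfer, args=(lock_a, lock_b))
t2 = threading.Thread(target=transfer, args=(lock_b, lock_a))
t1.start(); t2.start(); t1.join(); t2.join()
print("no deadlock")
```

Without the sort, thread 1 holding A while waiting for B and thread 2 holding B while waiting for A would satisfy the circular-wait condition above.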

7) InnoDB storage engine and MyISAM

(1). InnoDB did not support FULLTEXT indexes (before MySQL 5.6).

(2). InnoDB does not store the table's row count, so executing select count(*) from table makes InnoDB scan the whole table to count the rows, whereas MyISAM simply reads out its saved row count. Note that when the count(*) statement contains a where condition, the two engines behave the same.

(3). For an AUTO_INCREMENT column, InnoDB requires an index containing only that column, whereas in a MyISAM table it may be part of a joint index with other columns.

(4). On DELETE FROM table, InnoDB does not re-create the table but deletes the rows one by one.

(5). LOAD TABLE FROM MASTER does not work for InnoDB. The workaround is to change the InnoDB table to MyISAM, import the data, and change it back to InnoDB; however, this does not apply to tables that use InnoDB-specific features (such as foreign keys).

In addition, InnoDB's row locks are not absolute: if MySQL cannot determine the range to scan when executing a SQL statement, InnoDB will also lock the entire table, for example: update table set num=1 where name like '%aaa%'

8) The principle of database indexing

A database index is a data structure in the database management system that helps quickly query and update the data in database tables. Indexes are usually implemented with a B-tree or its variant, the B+ tree.

9) Why use B-tree

In general, the index itself is also large and cannot be stored entirely in memory, so indexes are often stored on disk as index files. Searching the index then incurs disk I/O, and a disk access costs several orders of magnitude more than a memory access. Therefore, the most important criterion for evaluating a data structure as an index is the asymptotic number of disk I/O operations during a lookup; in other words, the index structure should be organized to minimize the number of disk accesses during a search.
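
A back-of-envelope Python sketch of why high-fanout trees minimize disk reads. The fanout of ~1200 keys per 16KB page is an assumed, plausible InnoDB-style figure, not a measured one:

```python
import math

# Tree height ~= number of page reads per lookup. A B+ tree's height
# grows as log_fanout(N) instead of a binary tree's log_2(N).
def height(n_rows, fanout):
    return math.ceil(math.log(n_rows, fanout))

print(height(10**8, 1200))  # 3 page reads for 100 million rows
print(height(10**8, 2))     # ~27 node visits for a binary tree
```

This is why B+ trees, whose node size matches the disk page size, dominate on-disk indexing: three or four page reads suffice even for very large tables.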

10) The difference between a clustered index and a non-clustered index

(1). A table can have only one clustered index, but may have multiple non-clustered indexes.

(2). Records under a clustered index are stored physically contiguously; under a non-clustered index they are logically contiguous but not physically contiguous.

(3). Clustered index: the physical storage is sorted according to the index. A clustered index is a form of table organization in which the logical order of the index keys determines the physical storage order of the table's data rows.

Non-clustered index: the physical storage is not sorted by the index. A non-clustered index is an ordinary index, simply built on a data column, and does not affect the physical storage order of the table.

(4). Indexes are described with a tree data structure. A clustered index can be understood like this: the leaf nodes of the index are the data nodes themselves. In a non-clustered index, the leaf nodes are still index nodes, each holding a pointer to the corresponding data block.

11) How to solve slow queries with limit 20000

MySQL performance is low because for limit N, M the database must scan N + M records and then discard the first N, which is very expensive.

Solve the strategy:

(1) Add a cache on the front end, or use other methods to reduce the queries that reach the database; for example, some systems replicate data into a search engine, so the search can be served by ES, etc.

(2) Use a deferred join: first run the limit query to obtain only the index field (e.g. the primary key) of the required rows, then join back to the original table on that field to fetch the full rows.

Select a.* from a,(select id from table_1 where is_deleted='N' limit 100000,20) b where a.id = b.id

(3) From the business side, don't allow such deep paging; for example, only allow the first 100 pages to be browsed.

(4) Instead of limit N, M, use a single-argument limit M, converting the offset into a where condition.
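
Strategy (4), often called keyset pagination, can be sketched in Python against an in-memory stand-in for the table (all names are hypothetical):

```python
# Sketch: replace "LIMIT offset, size" with "WHERE id > last_seen_id
# ORDER BY id LIMIT size", so the database never scans and discards
# the skipped rows.
rows = [{"id": i, "name": f"u{i}"} for i in range(1, 101)]  # fake table

def page_by_keyset(rows, last_id, size):
    # Stands in for: SELECT ... WHERE id > ? ORDER BY id LIMIT ?
    return [r for r in rows if r["id"] > last_id][:size]

first = page_by_keyset(rows, 0, 20)
second = page_by_keyset(rows, first[-1]["id"], 20)
print(second[0]["id"])  # 21
```

The limitation is that pages can only be fetched sequentially (the client must carry the last seen id), which is why strategy (3), capping page depth, often accompanies it.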

12) Choosing a suitable distributed primary key scheme

Database auto-increment sequence or field

UUID

Using UUID to Int64

Redis generate ID

Twitter's snowflake algorithm

Use zookeeper to generate a unique ID

MongoDB's ObjectId
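
A hedged Python sketch of the snowflake layout: a 41-bit millisecond timestamp, 10-bit worker id, and 12-bit per-millisecond sequence. The epoch and bit widths below are the commonly cited ones, assumed rather than normative:

```python
import time
import threading

class Snowflake:
    EPOCH = 1288834974657  # Twitter's custom epoch, in milliseconds

    def __init__(self, worker_id):
        assert 0 <= worker_id < 1024      # 10 bits
        self.worker_id = worker_id
        self.seq = 0
        self.last_ms = -1
        self.lock = threading.Lock()

    def next_id(self):
        with self.lock:
            ms = int(time.time() * 1000)
            if ms == self.last_ms:
                self.seq = (self.seq + 1) & 0xFFF   # 4096 ids per ms
                if self.seq == 0:                   # exhausted: spin
                    while ms <= self.last_ms:       # to the next ms
                        ms = int(time.time() * 1000)
            else:
                self.seq = 0
            self.last_ms = ms
            return ((ms - self.EPOCH) << 22) | (self.worker_id << 12) | self.seq

gen = Snowflake(worker_id=1)
ids = [gen.next_id() for _ in range(1000)]
print(len(set(ids)) == len(ids))  # True: unique and monotonically increasing
```

Because the high bits are a timestamp, the ids are roughly time-ordered, which keeps InnoDB's clustered index appends sequential; clock rollback handling is omitted here for brevity.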

13) Choosing the right data storage scheme

Relational database MySQL

MySQL is one of the most popular relational databases and is widely used in Internet products. Under normal circumstances, MySQL is the first choice: 80%~90% of scenarios can be built on it, because they require relational data management, involve many transactional operations that need strong consistency, and may include some complex SQL queries. It is worth noting that joins should be reduced as much as possible early on, so that the database can later be split into sub-databases and sub-tables when the data volume grows.

In-memory database Redis

As the data volume grows, MySQL alone can no longer meet the needs of large-scale Internet applications. Redis stores data in memory, which can greatly improve query performance and complement the product architecture. For example, to increase the response speed of server interfaces, frequently read hotspot data is kept in Redis as much as possible. This is a very typical space-for-time strategy: trading more memory consumption for CPU resources and a faster program.

In some scenarios, you can make full use of the features of Redis to greatly improve efficiency. These scenarios include caching, session caching, timeliness, access frequency, counters, social lists, record user decision information, intersections, unions and differences, hot lists and rankings, and updates.

When using Redis as a cache, you need to consider cache usage issues such as data inconsistencies and dirty reads, cache update mechanisms, cache availability, cache service degradation, cache penetration, and cache warming.
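
The cache-update mechanism mentioned above is often the cache-aside pattern; a minimal Python sketch, with plain dicts standing in for Redis and MySQL:

```python
# Sketch of the cache-aside pattern: read through the cache, fall back
# to the database on a miss, and invalidate (not update) the cache on
# writes to shrink the inconsistency window.
cache = {}                                   # stands in for Redis
db = {1: {"id": 1, "name": "alice"}}         # stands in for MySQL

def get_user(uid):
    if uid in cache:                         # cache hit
        return cache[uid]
    row = db.get(uid)                        # cache miss: read the DB
    if row is not None:
        cache[uid] = row                     # populate for next time
    return row

def update_user(uid, row):
    db[uid] = row                            # write the DB first...
    cache.pop(uid, None)                     # ...then delete the cache

get_user(1)
assert 1 in cache
update_user(1, {"id": 1, "name": "bob"})
assert 1 not in cache                        # stale entry evicted
print(get_user(1)["name"])  # bob
```

Deleting rather than updating the cache on writes avoids one class of race condition, but dirty reads remain possible in the window between the DB write and the delete, which is exactly the consistency issue the paragraph above warns about.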

Document Database MongoDB

MongoDB is a supplement to traditional relational databases and is well suited to highly scalable scenarios thanks to its flexible, extensible document structure. MySQL tables whose structure is expected to keep expanding can instead be stored in MongoDB, which guarantees the extensibility of the table structure.

In addition, log systems produce particularly large volumes of data. Storing this data in MongoDB, using sharded clusters to support the massive data and its aggregation and MapReduce capabilities for analysis, is a good choice.

MongoDB is also suitable for storing large files: GridFS is a distributed file storage solution built on MongoDB.

Column family database HBase

HBase is suitable for mass data storage and high-performance real-time query. It runs on the HDFS file system and serves as the target database for MapReduce distributed processing to support offline analysis applications. It has played an increasing role in data warehousing, data marts, business intelligence and other fields, supporting the application of a large number of big data analytics scenarios in thousands of companies.

Full-text search engine ElasticSearch

In general, fuzzy queries in a relational database are all like queries. Among them, like 'value%' can use the index, but like '%value%' performs a full table scan; on a table with little data this is no problem, but on massive data a full table scan is a very scary thing. ElasticSearch is a real-time distributed search and analysis engine built on the full-text search library Apache Lucene, and it suits real-time search scenarios. In addition, ElasticSearch supports advanced features such as multi-term queries, relevance and weighting, auto-completion, and spelling correction. Therefore, ElasticSearch can complement a relational database's full-text search: the data that needs full-text search is copied into ElasticSearch, which handles the complex queries and improves query speed.

ElasticSearch is not only suitable for search scenarios, it is also ideal for log processing and analysis scenarios. The well-known ELK log processing solution consists of three components: ElasticSearch, Logstash, and Kibana, including log collection, aggregation, multi-dimensional query, and visual display.

14) ObjectId rule

[0,1,2,3] [4,5,6] [7,8] [9,10,11]

Timestamp | Machine | PID | Counter

The first four bytes are a timestamp, providing uniqueness at second granularity.

The next three bytes identify the host the ObjectId was generated on, usually a hash of the machine's hostname.

The next two bytes are the PID of the process that generated the ObjectId, ensuring that ObjectIds generated concurrently on the same machine are unique.

The first nine bytes together guarantee that ObjectIds generated by different processes on different machines within the same second are unique.

The last three bytes are an incrementing counter, which ensures that ObjectIds generated by the same process within the same second are unique.
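
The byte layout above can be checked by slicing a 24-hex-character ObjectId in Python (the sample id below is made up for illustration):

```python
# Sketch: splitting a 12-byte MongoDB ObjectId into the four fields
# described above: 4-byte timestamp, 3-byte machine, 2-byte PID,
# 3-byte counter.
def parse_object_id(hex_id):
    raw = bytes.fromhex(hex_id)
    assert len(raw) == 12
    return {
        "timestamp": int.from_bytes(raw[0:4], "big"),   # seconds
        "machine": raw[4:7].hex(),
        "pid": int.from_bytes(raw[7:9], "big"),
        "counter": int.from_bytes(raw[9:12], "big"),
    }

oid = parse_object_id("507f1f77bcf86cd799439011")
print(oid["timestamp"])  # 1350508407 -> a date in October 2012
```

Because the timestamp occupies the most significant bytes, sorting ObjectIds roughly sorts documents by creation time.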

15) Talk about MongoDB usage scenarios

Highly scalable scene

MongoDB is ideal for highly scalable scenarios thanks to its flexible, extensible document structure. MySQL tables whose structure is expected to keep expanding can instead be stored in MongoDB, which guarantees the extensibility of the table structure.

Log system scenario

Log systems produce particularly large volumes of data. Storing this data in MongoDB, using sharded clusters to support the massive data and its aggregation and MapReduce capabilities for analysis, is a good choice.

Distributed file storage

MongoDB is also suitable for storing large files: the previously described GridFS storage solution is a distributed file storage system built on MongoDB.

16) Inverted Index

An inverted index, also called a postings file or inverted file, is an indexing method used in full-text search to store the mapping from a word (or group of words) to its storage locations in documents. It is the most commonly used data structure in document retrieval systems.

There are two different inverted index forms:

A record-level inverted index (or inverted file index) contains, for each word, a list of the documents that reference it.

A word-level inverted index (or fully inverted index) additionally contains the positions of each word within a document.
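
A record-level inverted index can be built in a few lines of Python (the documents are made up):

```python
from collections import defaultdict

# Sketch: a tiny record-level inverted index -- for each term, the set
# of document ids that contain it.
docs = {
    1: "java stores data",
    2: "mysql stores rows",
    3: "java queries mysql",
}

index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():          # trivial whitespace tokenizer
        index[term].add(doc_id)

print(sorted(index["java"]))                   # [1, 3]
print(sorted(index["mysql"]))                  # [2, 3]
print(sorted(index["java"] & index["mysql"]))  # AND query: [3]
```

A word-level index would store (doc_id, position) pairs instead of bare ids, enabling phrase queries at the cost of a larger index.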

17) Talk about ElasticSearch usage scenarios

Full-text search: this is its most common use. With a word-segmentation plug-in and a Pinyin plug-in, it makes a powerful full-text search engine.

As a database: a surprising usage, because for the same data ES consumes more storage space. But it works well, thanks to its powerful statistics, aggregation, and analysis capabilities plus its distributed peer-to-peer scaling; and since hardware is now so cheap, some people do use it as a database.

Online statistical analysis engine and log systems: paired with Logstash, it needs no explanation; analyzing data dynamically in real time is very powerful.
