2 posts tagged with "data"

Announcing Apache Pinot 0.10

· 5 min read
Apache Pinot Engineering Team

We are excited to announce the release this week of Apache Pinot 0.10. Apache Pinot is a real-time distributed datastore designed to answer OLAP queries with high throughput and low latency.

This release is cut from commit fd9c58a11ed16d27109baefcee138eea30132ad3. You can find a full list of everything included in the release notes.

Let’s have a look at some of the changes, with the help of the batch QuickStart configuration.

Query Plans#

Amrish Lal implemented the EXPLAIN PLAN clause, which returns the execution plan that will be chosen by the Pinot Query Engine. This lets us see what the query is likely to do without actually having to run it.

EXPLAIN PLAN FOR
SELECT *
FROM baseballStats
WHERE league = 'NL'

If we run this query, we'll see the following results:

Operator | Operator_Id | Parent_Id
--- | --- | ---
BROKER_REDUCE(limit:10) | 0 | -1
COMBINE_SELECT | 1 | 0
SELECT(selectList:AtBatting, G_old, baseOnBalls, caughtStealing, doules, groundedIntoDoublePlays, hits, hitsByPitch, homeRuns, intentionalWalks, league, numberOfGames, numberOfGamesAsBatter, playerID, playerName, playerStint, runs, runsBattedIn, sacrificeFlies, sacrificeHits, stolenBases, strikeouts, teamID, tripples, yearID) | 2 | 1
TRANSFORM_PASSTHROUGH(AtBatting, G_old, baseOnBalls, caughtStealing, doules, groundedIntoDoublePlays, hits, hitsByPitch, homeRuns, intentionalWalks, league, numberOfGames, numberOfGamesAsBatter, playerID, playerName, playerStint, runs, runsBattedIn, sacrificeFlies, sacrificeHits, stolenBases, strikeouts, teamID, tripples, yearID) | 3 | 2
PROJECT(homeRuns, playerStint, groundedIntoDoublePlays, numberOfGames, AtBatting, stolenBases, tripples, hitsByPitch, teamID, numberOfGamesAsBatter, strikeouts, sacrificeFlies, caughtStealing, baseOnBalls, playerName, doules, league, yearID, hits, runsBattedIn, G_old, sacrificeHits, intentionalWalks, runs, playerID) | 4 | 3
FILTER_FULL_SCAN(operator:EQ,predicate:league = 'NL') | 5 | 4
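The Operator_Id and Parent_Id columns describe a tree, so it can help to print the plan with indentation. Here is a minimal Python sketch of that idea. It is not part of Pinot: the rows are copied from the output above with the long column lists abbreviated to "...", and the function name render_plan is our own.

```python
from collections import defaultdict

# (Operator, Operator_Id, Parent_Id) rows from the EXPLAIN PLAN output,
# with the long column lists abbreviated to "..." for readability.
PLAN_ROWS = [
    ("BROKER_REDUCE(limit:10)", 0, -1),
    ("COMBINE_SELECT", 1, 0),
    ("SELECT(selectList:...)", 2, 1),
    ("TRANSFORM_PASSTHROUGH(...)", 3, 2),
    ("PROJECT(...)", 4, 3),
    ("FILTER_FULL_SCAN(operator:EQ,predicate:league = 'NL')", 5, 4),
]


def render_plan(rows, root_parent_id=-1):
    """Return the plan as indented lines, one operator per line.

    Assumption: the root operator has Parent_Id -1, as in the output above.
    """
    children = defaultdict(list)
    for operator, operator_id, parent_id in rows:
        children[parent_id].append((operator_id, operator))

    lines = []

    def walk(parent_id, depth):
        # Visit children in Operator_Id order so the output is stable.
        for operator_id, operator in sorted(children[parent_id]):
            lines.append("  " * depth + operator)
            walk(operator_id, depth + 1)

    walk(root_parent_id, 0)
    return lines


if __name__ == "__main__":
    print("\n".join(render_plan(PLAN_ROWS)))
```

Reading the indented output from the bottom up follows the data flow: the filter scans the segment, its result is projected and transformed, and the broker reduces the combined server results.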

FILTER Clauses for Aggregates#

Atri Sharma added the filter clause for aggregates. This feature makes it possible to write queries like this:

SELECT SUM(homeRuns) FILTER(WHERE league = 'NL') AS nlHomeRuns,
       SUM(homeRuns) FILTER(WHERE league = 'AL') AS alHomeRuns
FROM baseballStats

If we run this query, we'll see the following output:

nlHomeRuns | alHomeRuns
--- | ---
135486 | 135990
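To make the semantics concrete, here is a small Python sketch of what a filtered aggregate computes: each aggregate only sees the rows that pass its own predicate, and all of them are evaluated over the same scan. The rows are made up for illustration and filtered_sums is a hypothetical helper, not Pinot code.

```python
# Hypothetical rows standing in for baseballStats; not the real dataset.
ROWS = [
    {"league": "NL", "homeRuns": 3},
    {"league": "AL", "homeRuns": 5},
    {"league": "NL", "homeRuns": 2},
    {"league": "AL", "homeRuns": 1},
]


def filtered_sums(rows, column, filters):
    """Sum `column` once per named filter in a single pass over `rows`."""
    totals = {name: 0 for name in filters}
    for row in rows:
        for name, predicate in filters.items():
            if predicate(row):
                totals[name] += row[column]
    return totals


totals = filtered_sums(
    ROWS,
    "homeRuns",
    {
        "nlHomeRuns": lambda row: row["league"] == "NL",
        "alHomeRuns": lambda row: row["league"] == "AL",
    },
)
print(totals)  # {'nlHomeRuns': 5, 'alHomeRuns': 6}
```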

greatest and least#

Richard Startin added the greatest and least functions:

SELECT playerID,
       least(5.0, max(homeRuns)) AS homeRuns,
       greatest(5.0, max(hits)) AS hits
FROM baseballStats
WHERE league = 'NL' AND teamID = 'SFN'
GROUP BY playerID
LIMIT 5

If we run this query, we'll see the following output:

playerID | homeRuns | hits
--- | --- | ---
ramirju01 | 0 | 5
milneed01 | 4 | 54
testani01 | 0 | 5
shawbo01 | 0 | 8
vogelry01 | 0 | 12
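In this query, least acts as a cap (homeRuns never goes above 5) and greatest acts as a floor (hits never goes below 5). A tiny Python sketch of that behaviour, using made-up per-player maxima rather than real data:

```python
def least(*values):
    """Smallest of the arguments, like SQL least()."""
    return min(values)


def greatest(*values):
    """Largest of the arguments, like SQL greatest()."""
    return max(values)


# Hypothetical (max(homeRuns), max(hits)) pairs per player; not real data.
player_maxima = {"playerA": (0, 3), "playerB": (4, 54), "playerC": (30, 150)}

for player, (max_home_runs, max_hits) in player_maxima.items():
    # least(5.0, ...) caps the value at 5, greatest(5.0, ...) floors it at 5.
    print(player, least(5.0, max_home_runs), greatest(5.0, max_hits))
```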

DistinctCountSmartHLL#

Xiaotian (Jackie) Jiang added the DistinctCountSmartHLL aggregation function. It starts by collecting values in a Set and automatically converts that Set to a HyperLogLog once it grows past a threshold, which protects the servers from running out of memory:

SELECT DISTINCTCOUNTSMARTHLL(homeRuns, 'hllLog2m=8;hllConversionThreshold=10')
FROM baseballStats

If we run this query, we'll see the following output:

distinctcountsmarthll(homeRuns)
66
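The idea of switching from an exact set to a sketch can be illustrated in a few lines of Python. This is only a rough sketch of the technique under our own assumptions, not Pinot's implementation: the class name, the SHA-1 based hash and the textbook HyperLogLog estimator are all choices made for this example, and the two constructor arguments merely mirror the hllLog2m and hllConversionThreshold options in the query above.

```python
import hashlib
import math


class SmartDistinctCounter:
    """Exact set below a threshold, HyperLogLog registers above it."""

    def __init__(self, log2m=8, conversion_threshold=10):
        # The alpha constant used in count() assumes log2m >= 7.
        self.log2m = log2m
        self.m = 1 << log2m
        self.conversion_threshold = conversion_threshold
        self.exact = set()
        self.registers = None  # allocated only after conversion

    @staticmethod
    def _hash64(value):
        digest = hashlib.sha1(repr(value).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def _add_to_registers(self, value):
        hashed = self._hash64(value)
        remaining_bits = 64 - self.log2m
        index = hashed >> remaining_bits              # top log2m bits
        rest = hashed & ((1 << remaining_bits) - 1)   # remaining low bits
        rank = remaining_bits - rest.bit_length() + 1  # leading zeros + 1
        self.registers[index] = max(self.registers[index], rank)

    def add(self, value):
        if self.registers is not None:
            self._add_to_registers(value)
            return
        self.exact.add(value)
        if len(self.exact) > self.conversion_threshold:
            # Too many distinct values: convert the set to HLL registers.
            self.registers = [0] * self.m
            for existing in self.exact:
                self._add_to_registers(existing)
            self.exact = None

    def count(self):
        if self.registers is None:
            return len(self.exact)  # still exact
        alpha = 0.7213 / (1 + 1.079 / self.m)
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction (linear counting).
            return round(self.m * math.log(self.m / zeros))
        return round(raw)


if __name__ == "__main__":
    counter = SmartDistinctCounter(log2m=8, conversion_threshold=10)
    for value in range(1000):
        counter.add(value)
    print(counter.count())  # an estimate close to 1000, not an exact count
```

Below the threshold the count is exact. Above it, memory use is bounded by the number of registers (2^log2m) and the result becomes an estimate.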

UI updates#

There were also a bunch of updates to the Pinot Data Explorer, by Sanket Shah and Johan Adami.

Reported size and estimated size are now displayed in a human-readable format:

Human readable sizes

Fixes for the following issues:

  • Error messages weren't showing in the UI when an invalid operation was attempted:

A backwards incompatible attempted schema change

  • The query console went blank on a syntax error.
  • The query console couldn't show the query result when multiple columns had the same name.
  • Adding extra fields after SELECT * would throw a NullPointerException.
  • Some queries were returning -- instead of 0.
  • The Pinot Dashboard tenant view was showing the incorrect number of servers and brokers.

RealTimeToOffline Task#

Xiaotian (Jackie) Jiang made some fixes to the RealTimeToOffline job to handle time gaps and proceed to the next time window when no segment matches the current one.
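The behaviour described here, skipping over empty time windows until one contains data, can be sketched roughly as follows. This is a simplified illustration in Python and not the actual task code: the function name, the (name, start, end) segment tuples and the millisecond arguments are all hypothetical, and details such as buffer times and in-progress segments are ignored.

```python
def next_window_with_data(watermark_ms, bucket_ms, segments, latest_ms):
    """Find the first window at or after the watermark that overlaps a segment.

    `segments` is a list of (name, start_ms, end_ms) tuples with inclusive
    end times. Windows are [window_start, window_start + bucket_ms).
    Returns (window_start, window_end, matching_segment_names), or None if
    no complete window before `latest_ms` contains data.
    """
    window_start = watermark_ms
    while window_start + bucket_ms <= latest_ms:
        window_end = window_start + bucket_ms
        matching = [
            name
            for name, start_ms, end_ms in segments
            if start_ms < window_end and end_ms >= window_start
        ]
        if matching:
            return window_start, window_end, matching
        # Time gap: nothing in this window, so move on to the next one.
        window_start = window_end
    return None


if __name__ == "__main__":
    segments = [("seg_0", 0, 999), ("seg_5", 5000, 5999)]
    # Windows starting at 1000..4000 are empty, so the result is the
    # window starting at 5000.
    print(next_window_with_data(1000, 1000, segments, 10000))
```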

Empty QuickStart#

Kenny Bastani added an empty QuickStart command, which lets you quickly spin up an empty Pinot cluster:

docker run \
  -p 8000:8000 \
  -p 9000:9000 \
  apachepinot/pinot:0.10.0 QuickStart \
  -type empty

You can then ingest your own dataset without needing to worry about spinning up each of the Pinot components individually.

Data Ingestion#

  • Richard Startin fixed some issues with real-time ingestion where consumption of messages would stop if a bad batch of messages was consumed from Kafka.

  • Mohemmad Zaid Khan added the BoundedColumnValue partition function, which partitions segments based on column values.

  • Xiaobing Li added the fixed name segment generator, which can be used when you want to replace a specific existing segment.
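As a rough picture of what partitioning by a bounded set of column values could look like, here is a short Python sketch. It rests on assumptions of ours rather than on anything stated above: that the function is configured with a delimited list of known values, that each known value maps to its own partition, and that every other value falls into a default partition 0. The class name, method names and the subject values are hypothetical.

```python
class BoundedColumnValuePartitioner:
    """Maps a fixed list of column values to partitions 1..N, others to 0."""

    def __init__(self, column_values, delimiter="|"):
        # Assumption: the known values arrive as one delimited string.
        self.values = column_values.split(delimiter)

    @property
    def num_partitions(self):
        # One partition per known value, plus the default partition 0.
        return len(self.values) + 1

    def partition(self, value):
        try:
            return self.values.index(value) + 1
        except ValueError:
            return 0  # value is not in the configured list


if __name__ == "__main__":
    partitioner = BoundedColumnValuePartitioner("Maths|English|Chemistry")
    for subject in ["Maths", "Chemistry", "History"]:
        print(subject, partitioner.partition(subject))
```

With a scheme like this, a query that filters on one of the known values only needs to look at the segments in the matching partition.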

Other changes#

  • Richard Startin set LZ4 compression as the default for all metrics fields.
  • Mark Needham added the ST_Within geospatial function.
  • Rong Rong fixed a bug where query stats wouldn't show if there was an error processing the query (e.g. if the query timed out).
  • Prashant Pandey fixed the query engine to handle extra columns added to a SELECT * statement.
  • Richard Startin added support for forward indexes on JSON columns.
  • Rong Rong added the GRPC broker request handler so that data can be streamed back from the server to the broker when processing queries.
  • deemoliu made it possible to add a default strategy when using the partial upsert feature.
  • Jeff Moszuti added support for the TIMESTAMP data type in the configuration recommendation engine.

Dependency updates#

The following dependencies were updated:

  • async-http-client because the library moved to a different organization.
  • RoaringBitmap to 0.9.25
  • JsonPath to 2.7.0
  • Kafka to 2.8.1
  • Prometheus to 0.16.1

Resources#

If you want to try out Apache Pinot, the following resources will help you get started:

Text analytics on LinkedIn Talent Insights using Apache Pinot

· One min read
LinkedIn
LinkedIn Engineering Team

LinkedIn Talent Insights (LTI) is a platform that helps organizations understand the external labor market and their internal workforce, and enables the long term success of their employees. Users of LTI have the flexibility to construct searches using the various facets of the LinkedIn Economic Graph (skills, titles, location, company, etc.).

Read More at https://engineering.linkedin.com/blog/2021/text-analytics-on-linkedin-talent-insights-using-apache-pinot